Q-PSA

Quantized Perturbation Sensitivity Analysis — layer importance scoring for quantized LLMs.

Existing layer importance methods (gradient-based, activation-based) were designed for continuous-weight models. They degrade on quantized (INT4/INT8) models — gradients require STE approximation, and activation similarity only measures redundancy, not criticality. Q-PSA directly perturbs weights within the quantization grid and measures the resulting perplexity change, producing an importance score native to discrete weight spaces.

Methodological roots: This project draws inspiration from the "observation in discrete space" step of Wave Function Collapse (WFC). However, Q-PSA does not follow WFC's collapse or propagation logic — it is independently defined as a discrete-space sensitivity analysis. WFC constructs a solution; Q-PSA measures the stability of an already-constructed one.

Status

Phase 1 — Failed. Q-PSA was not WFC-based space exploration as intended, but classical perturbation sensitivity analysis. Its scores do not correlate with layer importance for pruning (3.65x PPL vs Layer Ablation's 1.05x at K=1), and scoring is ~1,300x slower. Kill criteria: 2 of the 3 conclusive criteria failed (looping validation was inconclusive). See full results.

How It Works

For each Transformer layer:
  1. Sample weights (stratified across Q, K, V, O, FFN sub-matrices)
  2. Perturb each sampled weight by ±1 quantization level
  3. Measure ΔPPL for each perturbation
  4. Aggregate → Q-PSA Score per layer

High sensitivity = layer is critical  (small change → large PPL shift)
Low sensitivity  = layer is redundant (changes don't matter)
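
A minimal sketch of this loop in Python; the helpers sample_offsets, perturb_in_place, and measure_ppl are hypothetical placeholders for the steps above, not the actual q-psa internals.

```python
import math
import random

# The seven sub-matrices sampled per layer (stratified), as listed under Technical Details.
SUB_MATRICES = ["attn_q", "attn_k", "attn_v", "attn_output",
                "ffn_gate", "ffn_up", "ffn_down"]

def qpsa_layer_score(model, layer_idx, baseline_ppl, samples_per_matrix=100, seed=42):
    rng = random.Random(seed)
    deltas, n_skipped = [], 0
    for name in SUB_MATRICES:
        for offset in sample_offsets(model, layer_idx, name, samples_per_matrix, rng):
            for direction in (+1, -1):                     # bidirectional: one step up, one step down
                perturb_in_place(model, layer_idx, name, offset, direction)
                ppl = measure_ppl(model)                   # WikiText-2, 500 tokens
                perturb_in_place(model, layer_idx, name, offset, -direction)  # restore original byte
                if math.isfinite(ppl):
                    deltas.append(abs(ppl - baseline_ppl))
                else:
                    n_skipped += 1                         # e.g. a block-scale byte was hit
    return sum(deltas) / len(deltas)                       # aggregate into the per-layer Q-PSA score
```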

Phase 1 Results Summary

Experiment Setup

| Item | Detail |
| --- | --- |
| Model | Qwen2.5-0.5B-Instruct (Q4_K_M, GGUF, 24 layers) |
| GPU | NVIDIA RTX 3060 (12GB VRAM) |
| Evaluation | WikiText-2, 500 tokens |
| Baseline PPL | 11.8881 |
| Seed | 42 |

Kill Criteria

| Criterion | Result | Verdict |
| --- | --- | --- |
| Differentiation (ρ < 0.9) | ρ = 0.233 | ✅ Pass |
| Pruning PPL retention | Layer Ablation wins (1.05x vs 3.65x) | ❌ Fail |
| Looping validation | Invalid — GGUF cannot simulate looping | ⚠️ Inconclusive |
| Speed (< 100x slower) | ~1,300x slower (155 min vs 7 s) | ❌ Fail |

Pruning Validation (PPL ratio, lower = better)

| Method | K=1 | K=2 | K=4 | K=8 | Runtime |
| --- | --- | --- | --- | --- | --- |
| Layer Ablation | 1.05x | 1.14x | 1.60x | 2.09x | 7 s |
| Weight Norm | 1.23x | 1.50x | 4.59x | 1,100x | 12 s |
| Q-PSA | 3.65x | 6.24x | 338x | 17,612x | 155 min |
| Random | 302x | 1,648x | 38,039x | 9.4e8x | |
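
Reading the table: for each method, the K layers it ranks least important are removed and perplexity is re-measured against the unpruned baseline. A hedged sketch of that check (evaluate_ppl_without_layers is a hypothetical helper, not part of the q-psa CLI):

```python
def ppl_retention(ranking_least_to_most_important, k, model, baseline_ppl=11.8881):
    """PPL ratio after dropping the K least-important layers; 1.00x means no degradation."""
    dropped = set(ranking_least_to_most_important[:k])
    pruned_ppl = evaluate_ppl_without_layers(model, dropped)  # hypothetical helper
    return pruned_ppl / baseline_ppl
```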

Sample-Size Convergence

| N | Time | Spearman's ρ vs N=400 |
| --- | --- | --- |
| 50 | 78 min | 0.919 |
| 100 | 155 min | |
| 200 | 324 min | 0.822 |
| 400 | 596 min | 1.000 (reference) |
  • Top-4 (L0, L1, L2, L3) and Bottom-5 (L18–L21, L23): stable across all N.
  • N=100 is sufficient — scores converge from N=100 onward.
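
The ρ values above are plain rank correlations between per-layer scores at a given N and the N=400 reference run; a minimal reproduction (assuming the per-layer score lists are already loaded) could use scipy.stats.spearmanr:

```python
from scipy.stats import spearmanr

def convergence_rho(scores_at_n, scores_at_400):
    """Spearman's rho between per-layer Q-PSA scores at sample size N and the N=400 reference."""
    rho, _pvalue = spearmanr(scores_at_n, scores_at_400)
    return rho
```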

Why It Failed

1. Not WFC — classical perturbation analysis

Despite the framing, Q-PSA implemented no collapse, no propagation, no constraint reasoning. Each byte was perturbed independently — this is classical perturbation sensitivity analysis, not WFC.

2. Wrong hypothesis — sensitivity ≠ importance

Q-PSA measures local curvature (how much does a small change affect output?). Pruning requires functional importance (can the layer be removed?). These are different concepts:

| Layer | Q-PSA rank | Ablation rank | Pattern |
| --- | --- | --- | --- |
| L23 | 24th (least sensitive) | 4th (very important) | Robust but essential |
| L10 | 17th | 24th (least important) | Sensitive but expendable |

Analogy: wiggling a table leg slightly doesn't move the table (low sensitivity), but removing the leg collapses it (high importance).

3. Wrong platform — GGUF structural limitations

Choosing GGUF/llama.cpp created insurmountable constraints:

  • No gradient access — Cannot compute Hessian or Taylor importance. Presented as a feature ("gradient-free"), but was actually the root constraint.
  • Fixed computation graph — No mechanism for layer insertion, repetition, or re-routing. Layer looping validation was impossible from the start.
  • Independent block quantization — GGUF Q4_K blocks are self-contained. No inter-block dependencies for WFC-style constraint propagation.

The irony: Q-PSA's raison d'être ("analyze without gradients") was precisely what made it fail ("without gradients, meaningful analysis is impossible"). To do proper analysis, you need PyTorch-based quantization (GPTQ, AWQ) with gradient access — but then Q-PSA becomes unnecessary.

4. Layer looping experiment was doubly invalid

The looping validation had two independent failures:

  1. Design error — Copied top-K weights into bottom-K positions (= replacement/pruning), not looping (= repeating forward passes through important layers).
  2. Platform limitation — GGUF/llama.cpp cannot insert, repeat, or re-route layers regardless of implementation approach.

Result: nearly all configurations produced PPL = inf.
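
For contrast, a conceptual sketch of what looping would have meant, assuming a PyTorch-style list of decoder layers with a simplified call signature; this is exactly the graph rewrite GGUF/llama.cpp cannot express, and it is not what the Phase 1 code implemented:

```python
def forward_with_looping(hidden, layers, important_ids, extra_passes=1):
    """True looping: repeat the forward pass through the important layers."""
    for i, layer in enumerate(layers):
        hidden = layer(hidden)
        if i in important_ids:
            for _ in range(extra_passes):   # re-run the important layer on its own output
                hidden = layer(hidden)
    return hidden

# What Phase 1 actually did: copy the weights of the top-K layers over the
# bottom-K layers' weights, i.e. replacement/pruning rather than repeated
# computation, which is the design error described above.
```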

Positive Outcomes

Despite failure, this project produced:

  1. GGUF in-memory weight perturbation pipeline — ctypes-based tensor read/write via llama.cpp C++ internals. Reusable for other GGUF analysis.
  2. Layer Ablation as a fast importance metric — 7 seconds to produce pruning-optimal rankings. Simple and effective.
  3. Proof that WFC does not transfer to neural network weight analysis — Saves future effort on this direction.
  4. Kill criteria methodology — Pre-defined criteria, rigorously applied. Failure documented transparently.

Quick Start

# Setup
python -m venv .venv && source .venv/bin/activate
pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu124
pip install -e .

# Download model
q-psa download Qwen/Qwen2.5-0.5B-Instruct-GGUF qwen2.5-0.5b-instruct-q4_k_m.gguf

# Q-PSA scoring (N=100, ~2.5 hours on RTX 3060)
q-psa score models/qwen2.5-0.5b-instruct-q4_k_m.gguf \
  --samples-per-matrix 100 --max-tokens 500 --seed 42

# Baselines (~20 seconds)
q-psa baseline models/qwen2.5-0.5b-instruct-q4_k_m.gguf --max-tokens 500

# Pruning validation
q-psa prune models/qwen2.5-0.5b-instruct-q4_k_m.gguf \
  --rankings experiments/pruning_rankings.json --k 1 2 4 8

CLI Commands

q-psa download <repo_id> <filename>          # Download GGUF model from HuggingFace
q-psa inspect <model_path> [--layer N]       # Inspect GGUF tensor structure
q-psa ppl <model_path> [--max-tokens N]      # Measure baseline perplexity (WikiText-2)
q-psa score <model_path> [--samples-per-matrix N] [--layers 0-5] [--seed 42]  # Q-PSA scoring
q-psa baseline <model_path> [--methods weight_norm layer_ablation random]      # Baseline methods
q-psa prune <model_path> --rankings <file> [--k 1 2 4 8]                       # Pruning validation
q-psa loop <model_path> --rankings <file> [--k 1 2 4]                          # Layer looping (invalid)

Technical Details

| Item | Detail |
| --- | --- |
| Tensor access | C++ mangled _ZNK11llama_model10get_tensorEPKc via ctypes |
| Model loading | use_mmap=False required for writable tensor access |
| Sub-matrices | attn_q/k/v/output, ffn_gate/up/down (7 per layer) |
| Perturbation | ±1 byte at random offsets, bidirectional average |
| nan/inf handling | Scale byte perturbation → inf PPL → sample skipped, tracked in n_skipped |
| NVIDIA libs | cli.py auto-sets LD_LIBRARY_PATH via re-exec |

Environment: Python 3.12, llama-cpp-python 0.3.16, NVIDIA RTX 3060, Linux (WSL2).
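
For reference, a rough sketch of the tensor-access path from the table above. It assumes llama-cpp-python 0.3.x internals: the library handle (llama_cpp.llama_cpp._lib) and the model-pointer attribute (llm._model.model) are version-dependent guesses, and the mangled symbol is a private llama.cpp internal rather than a stable API.

```python
import ctypes
from llama_cpp import Llama
import llama_cpp.llama_cpp as llama_lowlevel

# use_mmap=False keeps tensor data in writable process memory (see table above)
llm = Llama(model_path="models/qwen2.5-0.5b-instruct-q4_k_m.gguf", use_mmap=False)

lib = llama_lowlevel._lib                     # assumption: CDLL already holding the llama.cpp symbols
get_tensor = lib["_ZNK11llama_model10get_tensorEPKc"]  # llama_model::get_tensor(const char*) const
get_tensor.restype = ctypes.c_void_p          # opaque ggml_tensor pointer
get_tensor.argtypes = [ctypes.c_void_p, ctypes.c_char_p]

# assumption: llm._model.model is the underlying llama_model*, passed as the C++ `this`
tensor_ptr = get_tensor(llm._model.model, b"blk.0.attn_q.weight")
# From tensor_ptr the quantized bytes can be read, flipped by ±1 at a sampled
# offset, and restored after each perplexity measurement.
```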

Documentation

Prior Work

  • T-WFC — WFC-based gradient-free training of toy MLPs. Concluded that discrete-space search cannot replace SGD for nonlinear problems. The "observation" concept is repurposed here for layer importance analysis in quantized models.
