Quantized Perturbation Sensitivity Analysis — layer importance scoring for quantized LLMs.
Existing layer importance methods (gradient-based, activation-based) were designed for continuous-weight models. They degrade on quantized (INT4/INT8) models — gradients require STE approximation, and activation similarity only measures redundancy, not criticality. Q-PSA directly perturbs weights within the quantization grid and measures the resulting perplexity change, producing an importance score native to discrete weight spaces.
Methodological roots: This project draws inspiration from the "observation in discrete space" step of Wave Function Collapse (WFC). However, Q-PSA does not follow WFC's collapse or propagation logic — it is independently defined as a discrete-space sensitivity analysis. WFC constructs a solution; Q-PSA measures the stability of an already-constructed one.
❌ Phase 1 — Failed. Q-PSA was not WFC-based space exploration as intended, but classical perturbation sensitivity analysis. It does not correlate with layer importance for pruning (3.65x PPL vs Layer Ablation's 1.05x at K=1), and is ~1,300x slower. Kill criteria: 2/3 failed. See full results.
For each Transformer layer:
1. Sample weights (stratified across Q, K, V, O, FFN sub-matrices)
2. Perturb each sampled weight by ±1 quantization level
3. Measure ΔPPL for each perturbation
4. Aggregate → Q-PSA Score per layer
High sensitivity = layer is critical (small change → large PPL shift)
Low sensitivity = layer is redundant (changes don't matter)
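In code, the per-layer loop looks roughly like the sketch below. `get_layer_tensors`, `perturb_quantized_weight`, and `eval_perplexity` are hypothetical placeholders for the project's GGUF tensor access and WikiText-2 evaluation; the exact aggregation used by `q-psa score` may differ.

```python
import math
import random

def qpsa_score_layer(model, layer_idx, samples_per_matrix=100, seed=42):
    """Sketch of a Q-PSA layer score: mean |dPPL| under +/-1-level weight perturbations.

    Assumed helpers (not shown):
      get_layer_tensors(model, i)          -> the 7 sub-matrices (Q, K, V, O, FFN)
      perturb_quantized_weight(t, off, d)  -> shift one weight by d quantization levels,
                                              returning a callable that undoes the change
      eval_perplexity(model)               -> WikiText-2 perplexity on a fixed token budget
    """
    rng = random.Random(seed)
    base_ppl = eval_perplexity(model)
    deltas = []
    for tensor in get_layer_tensors(model, layer_idx):        # stratified over sub-matrices
        for _ in range(samples_per_matrix):
            offset = rng.randrange(tensor.n_weights)
            shifts = []
            for direction in (+1, -1):                        # one quantization level each way
                undo = perturb_quantized_weight(tensor, offset, direction)
                ppl = eval_perplexity(model)
                undo()                                        # restore the original weight
                if math.isfinite(ppl):                        # skip corrupted (inf PPL) samples
                    shifts.append(abs(ppl - base_ppl))
            if shifts:
                deltas.append(sum(shifts) / len(shifts))      # bidirectional average
    return sum(deltas) / len(deltas)                          # high score = sensitive layer
```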
| Item | Detail |
|---|---|
| Model | Qwen2.5-0.5B-Instruct (Q4_K_M, GGUF, 24 layers) |
| GPU | NVIDIA RTX 3060 (12GB VRAM) |
| Evaluation | WikiText-2, 500 tokens |
| Baseline PPL | 11.8881 |
| Seed | 42 |
| Criterion | Result | Verdict |
|---|---|---|
| Differentiation (ρ < 0.9) | ρ = 0.233 | ✅ Pass |
| Pruning PPL retention | Layer Ablation wins (1.05x vs 3.65x) | ❌ Fail |
| Looping validation | Invalid — GGUF cannot simulate looping | |
| Speed (< 100x slower) | ~1,300x slower (155 min vs 7 s) | ❌ Fail |
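The differentiation criterion in the first row is a rank correlation between the Q-PSA ordering and a baseline ordering; anything above |ρ| = 0.9 would mean Q-PSA merely reproduces a cheaper metric. A minimal check with scipy, using placeholder scores rather than the experiment's actual values:

```python
from scipy.stats import spearmanr

# Placeholder per-layer scores (index = layer); the real values come from the
# rankings produced by `q-psa score` and `q-psa baseline`.
qpsa_scores     = [0.42, 0.40, 0.37, 0.35] + [0.10] * 20   # 24 layers
baseline_scores = [0.05, 0.06, 0.07, 0.31] + [0.02] * 20

rho, _ = spearmanr(qpsa_scores, baseline_scores)
print(f"Spearman rho = {rho:.3f} -> {'pass (differentiated)' if abs(rho) < 0.9 else 'fail'}")
```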
| Method | K=1 | K=2 | K=4 | K=8 | Runtime |
|---|---|---|---|---|---|
| Layer Ablation | 1.05x | 1.14x | 1.60x | 2.09x | 7 s |
| Weight Norm | 1.23x | 1.50x | 4.59x | 1,100x | 12 s |
| Q-PSA | 3.65x | 6.24x | 338x | 17,612x | 155 min |
| Random | 302x | 1,648x | 38,039x | 9.4e8x | — |
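The pruning rows compare rankings by removing the K layers each method marks as least important, re-measuring perplexity, and reporting the ratio to the unpruned baseline (11.8881). A sketch under that reading; `load_model_without_layers` and `eval_perplexity` are hypothetical helpers, not the `q-psa prune` implementation:

```python
def pruning_ppl_ratio(model_path, importance, k, baseline_ppl=11.8881):
    """importance: per-layer scores, higher = more important (24 entries here)."""
    ranked = sorted(range(len(importance)), key=lambda i: importance[i])
    pruned = ranked[:k]                                          # K least-important layers
    model = load_model_without_layers(model_path, skip=pruned)   # hypothetical helper
    return eval_perplexity(model) / baseline_ppl                 # e.g. 1.05x for Layer Ablation at K=1
```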
| N | Time | Spearman's ρ vs N=400 |
|---|---|---|
| 50 | 78 min | 0.919 |
| 100 | 155 min | — |
| 200 | 324 min | 0.822 |
| 400 | 596 min | 1.000 (reference) |
- Top-4 (L0, L1, L2, L3) and Bottom-5 (L18–L21, L23): stable across all N.
- N=100 is sufficient — scores converge from N=100 onward.
Despite the framing, Q-PSA implemented no collapse, no propagation, no constraint reasoning. Each byte was perturbed independently — this is classical perturbation sensitivity analysis, not WFC.
Q-PSA measures local curvature (how much does a small change affect output?). Pruning requires functional importance (can the layer be removed?). These are different concepts:
| Layer | Q-PSA rank | Ablation rank | Pattern |
|---|---|---|---|
| L23 | 24th (least sensitive) | 4th (very important) | Robust but essential |
| L10 | 17th | 24th (least important) | Sensitive but expendable |
Analogy: Wiggling a table leg slightly doesn't move it (low sensitivity). Removing it collapses the table (high importance).
Choosing GGUF/llama.cpp created insurmountable constraints:
- No gradient access — Cannot compute Hessian or Taylor importance. Presented as a feature ("gradient-free"), but was actually the root constraint.
- Fixed computation graph — No mechanism for layer insertion, repetition, or re-routing. Layer looping validation was impossible from the start.
- Independent block quantization — GGUF Q4_K blocks are self-contained. No inter-block dependencies for WFC-style constraint propagation.
The irony: Q-PSA's raison d'être ("analyze without gradients") was precisely what made it fail ("without gradients, meaningful analysis is impossible"). To do proper analysis, you need PyTorch-based quantization (GPTQ, AWQ) with gradient access — but then Q-PSA becomes unnecessary.
The looping validation had two independent failures:
- Design error — Copied top-K weights into bottom-K positions (= replacement/pruning), not looping (= repeating forward passes through important layers).
- Platform limitation — GGUF/llama.cpp cannot insert, repeat, or re-route layers regardless of implementation approach.
Result: nearly all configurations produced PPL = inf.
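In pseudocode, the gap between what was implemented and what looping should have meant looks like this. Both functions are illustrative only: the first is what the validation effectively did, the second is what looping would require, and the second is exactly what llama.cpp's fixed graph cannot express.

```python
def implemented_replacement(layers, top_k, bottom_k):
    # What the validation actually did: overwrite the weights of the least-important
    # layers with copies of the most-important ones. This is replacement/pruning.
    for src, dst in zip(top_k, bottom_k):
        layers[dst].weights = layers[src].weights
    return layers

def intended_looping(hidden, layers, important):
    # What layer looping means: important layers are applied more than once per
    # forward pass, i.e. the computation graph itself changes.
    for i, layer in enumerate(layers):
        hidden = layer(hidden)
        if i in important:
            hidden = layer(hidden)   # extra pass through the important layer
    return hidden
```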
Despite failure, this project produced:
- GGUF in-memory weight perturbation pipeline — ctypes-based tensor read/write via llama.cpp C++ internals. Reusable for other GGUF analysis.
- Layer Ablation as a fast importance metric — 7 seconds to produce pruning-optimal rankings. Simple and effective (see the sketch after this list).
- Proof that WFC does not transfer to neural network weight analysis — Saves future effort on this direction.
- Kill criteria methodology — Pre-defined criteria, rigorously applied. Failure documented transparently.
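A sketch of the Layer Ablation metric referenced above, assuming the same hypothetical helpers as the pruning sketch (`load_model`, `load_model_without_layers`, `eval_perplexity`): score each layer by how much perplexity rises when only that layer is skipped.

```python
def layer_ablation_scores(model_path, n_layers=24):
    """Per-layer importance = PPL increase when that single layer is removed."""
    base_ppl = eval_perplexity(load_model(model_path))        # hypothetical helpers
    scores = []
    for i in range(n_layers):
        pruned = load_model_without_layers(model_path, skip=[i])
        scores.append(eval_perplexity(pruned) - base_ppl)     # larger jump = more important
    return scores
```

With only 24 single-layer ablations over a 500-token evaluation, the full ranking finishes in seconds on this model.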
```bash
# Setup
python -m venv .venv && source .venv/bin/activate
pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu124
pip install -e .

# Download model
q-psa download Qwen/Qwen2.5-0.5B-Instruct-GGUF qwen2.5-0.5b-instruct-q4_k_m.gguf

# Q-PSA scoring (N=100, ~2.5 hours on RTX 3060)
q-psa score models/qwen2.5-0.5b-instruct-q4_k_m.gguf \
    --samples-per-matrix 100 --max-tokens 500 --seed 42

# Baselines (~20 seconds)
q-psa baseline models/qwen2.5-0.5b-instruct-q4_k_m.gguf --max-tokens 500

# Pruning validation
q-psa prune models/qwen2.5-0.5b-instruct-q4_k_m.gguf \
    --rankings experiments/pruning_rankings.json --k 1 2 4 8
```

```text
q-psa download <repo_id> <filename>                                           # Download GGUF model from HuggingFace
q-psa inspect <model_path> [--layer N]                                        # Inspect GGUF tensor structure
q-psa ppl <model_path> [--max-tokens N]                                       # Measure baseline perplexity (WikiText-2)
q-psa score <model_path> [--samples-per-matrix N] [--layers 0-5] [--seed 42]  # Q-PSA scoring
q-psa baseline <model_path> [--methods weight_norm layer_ablation random]     # Baseline methods
q-psa prune <model_path> --rankings <file> [--k 1 2 4 8]                      # Pruning validation
q-psa loop <model_path> --rankings <file> [--k 1 2 4]                         # Layer looping (invalid)
```

| Item | Detail |
|---|---|
| Tensor access | C++ mangled _ZNK11llama_model10get_tensorEPKc via ctypes |
| Model loading | use_mmap=False required for writable tensor access |
| Sub-matrices | attn_q/k/v/output, ffn_gate/up/down (7 per layer) |
| Perturbation | ±1 byte at random offsets, bidirectional average |
| nan/inf handling | Scale byte perturbation → inf PPL → sample skipped, tracked in n_skipped |
| NVIDIA libs | cli.py auto-sets LD_LIBRARY_PATH via re-exec |
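A sketch of the ctypes access pattern from the table above. The mangled symbol is the one the project resolves; the library path, return type, and how the raw `llama_model*` pointer is obtained are assumptions that depend on the llama-cpp-python build.

```python
import ctypes

# Shared library bundled with llama-cpp-python; the exact path/name varies by install.
lib = ctypes.CDLL("path/to/libllama.so")

# Mangled C++ symbol for `llama_model::get_tensor(const char*) const`.
get_tensor = getattr(lib, "_ZNK11llama_model10get_tensorEPKc")
get_tensor.restype = ctypes.c_void_p                 # assumed: raw ggml_tensor pointer
get_tensor.argtypes = [ctypes.c_void_p,              # implicit `this` (the llama_model*)
                       ctypes.c_char_p]              # tensor name, e.g. b"blk.0.attn_q.weight"

def tensor_ptr(model_ptr: int, name: str) -> int:
    """Resolve a tensor by name; with use_mmap=False its data is writable in place."""
    return get_tensor(model_ptr, name.encode())
```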
Environment: Python 3.12, llama-cpp-python 0.3.16, NVIDIA RTX 3060, Linux (WSL2).
- Results (EN): docs/results.md
- Results (KO): docs/results.ko.md
- Concept (EN): docs/concept.en.md
- Concept (KO): docs/concept.md
- T-WFC — WFC-based gradient-free training of toy MLPs. Concluded that discrete-space search cannot replace SGD for nonlinear problems. The "observation" concept is repurposed here for layer importance analysis in quantized models.