Q-PSA

Quantized Perturbation Sensitivity Analysis — layer importance scoring for quantized LLMs.

Existing layer importance methods (gradient-based, activation-based) were designed for continuous-weight models. They degrade on quantized (INT4/INT8) models — gradients require STE approximation, and activation similarity only measures redundancy, not criticality. Q-PSA directly perturbs weights within the quantization grid and measures the resulting perplexity change, producing an importance score native to discrete weight spaces.

Methodological roots: This project draws inspiration from the "observation in discrete space" step of Wave Function Collapse (WFC). However, Q-PSA does not follow WFC's collapse or propagation logic — it is independently defined as a discrete-space sensitivity analysis. WFC constructs a solution; Q-PSA measures the stability of an already-constructed one.

Status

Phase 1 — Failed. Q-PSA was not WFC-based space exploration as intended, but classical perturbation sensitivity analysis. Its scores do not correlate with layer importance for pruning (3.65x PPL vs Layer Ablation's 1.05x at K=1), and scoring is ~1,300x slower. Kill criteria: 2 of the 3 conclusive criteria failed (looping validation was inconclusive). See full results.

How It Works

For each Transformer layer:
  1. Sample weights (stratified across Q, K, V, O, FFN sub-matrices)
  2. Perturb each sampled weight by ±1 quantization level
  3. Measure ΔPPL for each perturbation
  4. Aggregate → Q-PSA Score per layer

High sensitivity = layer is critical  (small change → large PPL shift)
Low sensitivity  = layer is redundant (changes don't matter)
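
A minimal sketch of this loop in Python; the helpers sample_offsets, perturb_in_place, and measure_ppl are hypothetical placeholders for the steps above, not the actual q-psa internals.

```python
import math
import random

# The seven sub-matrices sampled per layer (stratified), as listed under Technical Details.
SUB_MATRICES = ["attn_q", "attn_k", "attn_v", "attn_output",
                "ffn_gate", "ffn_up", "ffn_down"]

def qpsa_layer_score(model, layer_idx, baseline_ppl, samples_per_matrix=100, seed=42):
    rng = random.Random(seed)
    deltas, n_skipped = [], 0
    for name in SUB_MATRICES:
        for offset in sample_offsets(model, layer_idx, name, samples_per_matrix, rng):
            for direction in (+1, -1):                     # bidirectional: one step up, one step down
                perturb_in_place(model, layer_idx, name, offset, direction)
                ppl = measure_ppl(model)                   # WikiText-2, 500 tokens
                perturb_in_place(model, layer_idx, name, offset, -direction)  # restore original byte
                if math.isfinite(ppl):
                    deltas.append(abs(ppl - baseline_ppl))
                else:
                    n_skipped += 1                         # e.g. a block-scale byte was hit
    return sum(deltas) / len(deltas)                       # aggregate into the per-layer Q-PSA score
```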

Phase 1 Results Summary

Experiment Setup

| Item | Detail |
| --- | --- |
| Model | Qwen2.5-0.5B-Instruct (Q4_K_M, GGUF, 24 layers) |
| GPU | NVIDIA RTX 3060 (12GB VRAM) |
| Evaluation | WikiText-2, 500 tokens |
| Baseline PPL | 11.8881 |
| Seed | 42 |

Kill Criteria

| Criterion | Result | Verdict |
| --- | --- | --- |
| Differentiation (ρ < 0.9) | ρ = 0.233 | ✅ Pass |
| Pruning PPL retention | Layer Ablation wins (1.05x vs 3.65x) | ❌ Fail |
| Looping validation | Invalid — GGUF cannot simulate looping | ⚠️ Inconclusive |
| Speed (< 100x slower) | ~1,300x slower (155 min vs 7 s) | ❌ Fail |

Pruning Validation (PPL ratio, lower = better)

| Method | K=1 | K=2 | K=4 | K=8 | Runtime |
| --- | --- | --- | --- | --- | --- |
| Layer Ablation | 1.05x | 1.14x | 1.60x | 2.09x | 7 s |
| Weight Norm | 1.23x | 1.50x | 4.59x | 1,100x | 12 s |
| Q-PSA | 3.65x | 6.24x | 338x | 17,612x | 155 min |
| Random | 302x | 1,648x | 38,039x | 9.4e8x | |
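
Reading the table: for each method, the K layers it ranks least important are removed and perplexity is re-measured against the unpruned baseline. A hedged sketch of that check (evaluate_ppl_without_layers is a hypothetical helper, not part of the q-psa CLI):

```python
def ppl_retention(ranking_least_to_most_important, k, model, baseline_ppl=11.8881):
    """PPL ratio after dropping the K least-important layers; 1.00x means no degradation."""
    dropped = set(ranking_least_to_most_important[:k])
    pruned_ppl = evaluate_ppl_without_layers(model, dropped)  # hypothetical helper
    return pruned_ppl / baseline_ppl
```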

Sample-Size Convergence

| N | Time | Spearman's ρ vs N=400 |
| --- | --- | --- |
| 50 | 78 min | 0.919 |
| 100 | 155 min | |
| 200 | 324 min | 0.822 |
| 400 | 596 min | 1.000 (reference) |
  • Top-4 (L0, L1, L2, L3) and Bottom-5 (L18–L21, L23): stable across all N.
  • N=100 is sufficient — scores converge from N=100 onward.
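
The ρ values above are plain rank correlations between per-layer scores at a given N and the N=400 reference run; a minimal reproduction (assuming the per-layer score lists are already loaded) could use scipy.stats.spearmanr:

```python
from scipy.stats import spearmanr

def convergence_rho(scores_at_n, scores_at_400):
    """Spearman's rho between per-layer Q-PSA scores at sample size N and the N=400 reference."""
    rho, _pvalue = spearmanr(scores_at_n, scores_at_400)
    return rho
```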

Why It Failed

1. Not WFC — classical perturbation analysis

Despite the framing, Q-PSA implemented no collapse, no propagation, no constraint reasoning. Each byte was perturbed independently — this is classical perturbation sensitivity analysis, not WFC.

2. Wrong hypothesis — sensitivity ≠ importance

Q-PSA measures local curvature (how much does a small change affect output?). Pruning requires functional importance (can the layer be removed?). These are different concepts:

| Layer | Q-PSA rank | Ablation rank | Pattern |
| --- | --- | --- | --- |
| L23 | 24th (least sensitive) | 4th (very important) | Robust but essential |
| L10 | 17th | 24th (least important) | Sensitive but expendable |

Analogy: wiggling a table leg slightly doesn't move the table (low sensitivity), but removing the leg collapses it (high importance).

3. Wrong platform — GGUF structural limitations

Choosing GGUF/llama.cpp created insurmountable constraints:

  • No gradient access — Cannot compute Hessian or Taylor importance. Presented as a feature ("gradient-free"), but was actually the root constraint.
  • Fixed computation graph — No mechanism for layer insertion, repetition, or re-routing. Layer looping validation was impossible from the start.
  • Independent block quantization — GGUF Q4_K blocks are self-contained. No inter-block dependencies for WFC-style constraint propagation.

The irony: Q-PSA's raison d'être ("analyze without gradients") was precisely what made it fail ("without gradients, meaningful analysis is impossible"). To do proper analysis, you need PyTorch-based quantization (GPTQ, AWQ) with gradient access — but then Q-PSA becomes unnecessary.

4. Layer looping experiment was doubly invalid

The looping validation had two independent failures:

  1. Design error — Copied top-K weights into bottom-K positions (= replacement/pruning), not looping (= repeating forward passes through important layers).
  2. Platform limitation — GGUF/llama.cpp cannot insert, repeat, or re-route layers regardless of implementation approach.

Result: nearly all configurations produced PPL = inf.
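
For contrast, a conceptual sketch of what looping would have meant, assuming a PyTorch-style list of decoder layers with a simplified call signature; this is exactly the graph rewrite GGUF/llama.cpp cannot express, and it is not what the Phase 1 code implemented:

```python
def forward_with_looping(hidden, layers, important_ids, extra_passes=1):
    """True looping: repeat the forward pass through the important layers."""
    for i, layer in enumerate(layers):
        hidden = layer(hidden)
        if i in important_ids:
            for _ in range(extra_passes):   # re-run the important layer on its own output
                hidden = layer(hidden)
    return hidden

# What Phase 1 actually did: copy the weights of the top-K layers over the
# bottom-K layers' weights, i.e. replacement/pruning rather than repeated
# computation, which is the design error described above.
```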

Positive Outcomes

Despite failure, this project produced:

  1. GGUF in-memory weight perturbation pipeline — ctypes-based tensor read/write via llama.cpp C++ internals. Reusable for other GGUF analysis.
  2. Layer Ablation as a fast importance metric — 7 seconds to produce pruning-optimal rankings. Simple and effective.
  3. Proof that WFC does not transfer to neural network weight analysis — Saves future effort on this direction.
  4. Kill criteria methodology — Pre-defined criteria, rigorously applied. Failure documented transparently.

Quick Start

# Setup
python -m venv .venv && source .venv/bin/activate
pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu124
pip install -e .

# Download model
q-psa download Qwen/Qwen2.5-0.5B-Instruct-GGUF qwen2.5-0.5b-instruct-q4_k_m.gguf

# Q-PSA scoring (N=100, ~2.5 hours on RTX 3060)
q-psa score models/qwen2.5-0.5b-instruct-q4_k_m.gguf \
  --samples-per-matrix 100 --max-tokens 500 --seed 42

# Baselines (~20 seconds)
q-psa baseline models/qwen2.5-0.5b-instruct-q4_k_m.gguf --max-tokens 500

# Pruning validation
q-psa prune models/qwen2.5-0.5b-instruct-q4_k_m.gguf \
  --rankings experiments/pruning_rankings.json --k 1 2 4 8

CLI Commands

q-psa download <repo_id> <filename>          # Download GGUF model from HuggingFace
q-psa inspect <model_path> [--layer N]       # Inspect GGUF tensor structure
q-psa ppl <model_path> [--max-tokens N]      # Measure baseline perplexity (WikiText-2)
q-psa score <model_path> [--samples-per-matrix N] [--layers 0-5] [--seed 42]  # Q-PSA scoring
q-psa baseline <model_path> [--methods weight_norm layer_ablation random]      # Baseline methods
q-psa prune <model_path> --rankings <file> [--k 1 2 4 8]                       # Pruning validation
q-psa loop <model_path> --rankings <file> [--k 1 2 4]                          # Layer looping (invalid)

Technical Details

| Item | Detail |
| --- | --- |
| Tensor access | C++ mangled _ZNK11llama_model10get_tensorEPKc via ctypes |
| Model loading | use_mmap=False required for writable tensor access |
| Sub-matrices | attn_q/k/v/output, ffn_gate/up/down (7 per layer) |
| Perturbation | ±1 byte at random offsets, bidirectional average |
| nan/inf handling | Scale byte perturbation → inf PPL → sample skipped, tracked in n_skipped |
| NVIDIA libs | cli.py auto-sets LD_LIBRARY_PATH via re-exec |

Environment: Python 3.12, llama-cpp-python 0.3.16, NVIDIA RTX 3060, Linux (WSL2).
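
For reference, a rough sketch of the tensor-access path from the table above. It assumes llama-cpp-python 0.3.x internals: the library handle (llama_cpp.llama_cpp._lib) and the model-pointer attribute (llm._model.model) are version-dependent guesses, and the mangled symbol is a private llama.cpp internal rather than a stable API.

```python
import ctypes
from llama_cpp import Llama
import llama_cpp.llama_cpp as llama_lowlevel

# use_mmap=False keeps tensor data in writable process memory (see table above)
llm = Llama(model_path="models/qwen2.5-0.5b-instruct-q4_k_m.gguf", use_mmap=False)

lib = llama_lowlevel._lib                     # assumption: CDLL already holding the llama.cpp symbols
get_tensor = lib["_ZNK11llama_model10get_tensorEPKc"]  # llama_model::get_tensor(const char*) const
get_tensor.restype = ctypes.c_void_p          # opaque ggml_tensor pointer
get_tensor.argtypes = [ctypes.c_void_p, ctypes.c_char_p]

# assumption: llm._model.model is the underlying llama_model*, passed as the C++ `this`
tensor_ptr = get_tensor(llm._model.model, b"blk.0.attn_q.weight")
# From tensor_ptr the quantized bytes can be read, flipped by ±1 at a sampled
# offset, and restored after each perplexity measurement.
```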

Documentation

Prior Work

  • T-WFC — WFC-based gradient-free training of toy MLPs. Concluded that discrete-space search cannot replace SGD for nonlinear problems. The "observation" concept is repurposed here for layer importance analysis in quantized models.
