A lightweight experimental implementation of several KV-cache compression strategies for causal language models, with reproducible evaluation scripts for perplexity and generation latency.
This project evaluates how different KV-cache compression methods affect:
- PPL: suffix perplexity after prefix prefill
- Prefill time: prompt forward latency
- TTFT: time to first generated token
- TPOT: average time per generated token after the first token
- Throughput: generated tokens per second
The current experiments use EleutherAI/pythia-70m on wikitext-2-raw-v1 and PG19.
KVPress/
├── press/
│ ├── base_press.py
│ ├── score_based.py
│ ├── knorm_press.py
│ ├── snapkv_press.py
│ ├── streaming_llm_press.py
│ ├── lagkv_press.py
│ └── utils.py
├── script/
│ ├── evaluation_metrics.py
│ ├── run_grid_metrics.py
│ └── visualization.py
├── results/
│ ├── EleutherAI_pythia-70m_wikitext_kv_metrics.csv
│ ├── EleutherAI_pythia-70m_pg19_kv_metrics.csv
│ └── plots/
├── requirements.txt
└── README.md
Create a Python environment and install dependencies:
pip install -r requirements.txt

If you use Hugging Face models or datasets frequently, logging in is recommended:

huggingface-cli login

Public models and datasets can still be downloaded without login, but unauthenticated requests may be slower or rate-limited.
Baseline: no KV-cache compression is applied. This row is the reference for PPL delta and speedup metrics.
KNormPress: scores KV pairs by key-vector norm. In this implementation the score is the negative L2 norm, so keys with lower norms get higher retention priority.
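A minimal sketch of the scoring idea, assuming keys shaped `(batch, heads, seq_len, head_dim)`; the function names and the `compression_ratio` handling are illustrative, not the exact `press/knorm_press.py` API:

```python
import torch

def knorm_scores(keys: torch.Tensor) -> torch.Tensor:
    # keys: (batch, heads, seq_len, head_dim).
    # Negative L2 norm: lower-norm keys get higher (less negative)
    # scores, i.e. higher retention priority.
    return -keys.norm(p=2, dim=-1)

def keep_indices(scores: torch.Tensor, compression_ratio: float) -> torch.Tensor:
    # Keep the top (1 - compression_ratio) fraction of KV pairs per head.
    seq_len = scores.shape[-1]
    n_keep = max(1, int(seq_len * (1.0 - compression_ratio)))
    return scores.topk(n_keep, dim=-1).indices
```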
SnapKVPress: uses attention statistics from a recent local window to select important KV pairs.
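A rough sketch of the attention-statistics idea under the same shape assumptions; the window size, the use of average pooling for smoothing, and the omission of causal masking inside the window are simplifications rather than the exact `press/snapkv_press.py` behavior:

```python
import math
import torch
import torch.nn.functional as F

def snapkv_scores(queries, keys, window_size=32, kernel_size=5):
    # queries, keys: (batch, heads, seq_len, head_dim).
    # Score each prefix position by the attention mass it receives
    # from the queries in the recent observation window.
    head_dim = queries.shape[-1]
    q_win = queries[..., -window_size:, :]
    attn = (q_win @ keys.transpose(-2, -1)) / math.sqrt(head_dim)
    attn = attn.softmax(dim=-1)
    # Total attention each prefix key receives from the window queries.
    prefix_scores = attn[..., :-window_size].sum(dim=-2)
    # Smooth with 1D pooling so neighbouring tokens survive together.
    b, h, n = prefix_scores.shape
    smoothed = F.avg_pool1d(
        prefix_scores.reshape(b * h, 1, n),
        kernel_size=kernel_size, stride=1, padding=kernel_size // 2,
    ).reshape(b, h, n)
    return smoothed  # higher = more important to keep
```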
StreamingLLMPress: keeps a fixed number of initial sink tokens plus a recent-window cache budget.
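A minimal sketch of the index selection, assuming a default of 4 sink tokens (the actual default in `press/streaming_llm_press.py` may differ):

```python
import torch

def streaming_llm_keep_indices(seq_len: int, compression_ratio: float,
                               n_sink: int = 4) -> torch.Tensor:
    # n_sink = 4 is an assumed default, not necessarily the script's.
    # Total budget is (1 - compression_ratio) * seq_len; whatever is
    # left after the sink tokens goes to the recent window.
    budget = max(n_sink + 1, int(seq_len * (1.0 - compression_ratio)))
    n_recent = budget - n_sink
    sink = torch.arange(0, min(n_sink, seq_len))
    recent = torch.arange(max(n_sink, seq_len - n_recent), seq_len)
    return torch.cat([sink, recent])  # cache positions to keep
```

For example, with `seq_len=1024` and `compression_ratio=0.5`, this keeps positions 0 to 3 plus the last 508 positions, for a total budget of 512.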
LagKVPress: keeps sink tokens and recent tokens while applying a lag-aware retention strategy.
The evaluation has two separate parts.
PPL is computed using a two-stage protocol:
input sequence length = max_length
prefix length = max_length * prefix_ratio
suffix length = max_length - prefix length
For example, with:
max_length = 1024
prefix_ratio = 0.5
we evaluate:
prefix = first 512 tokens
suffix = last 512 tokens
The model first performs prefill on the prefix. For compressed methods, KV compression is applied only during this prefix prefill stage. Then the suffix is evaluated using the resulting KV cache.
Reported PPL is therefore suffix PPL under a cached-prefix setting and is not directly comparable to standard full-sequence benchmark PPL. This protocol is designed to isolate the quality impact of KV-cache compression.
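A minimal sketch of this two-stage protocol with a Hugging Face causal LM; the exact loss bookkeeping in `script/evaluation_metrics.py` may differ:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def suffix_ppl(model, input_ids, prefix_ratio=0.5):
    # input_ids: (1, max_length). Stage 1 prefills the prefix (this is
    # where a press would compress the KV cache); stage 2 scores the
    # suffix against the resulting cache.
    prefix_len = int(input_ids.shape[1] * prefix_ratio)
    prefix, suffix = input_ids[:, :prefix_len], input_ids[:, prefix_len:]

    prefill = model(prefix, use_cache=True)
    out = model(suffix, past_key_values=prefill.past_key_values,
                use_cache=True)

    # The last prefix position predicts suffix[0]; suffix position t
    # predicts suffix[t+1], so the final suffix logit is dropped.
    logits = torch.cat([prefill.logits[:, -1:], out.logits[:, :-1]], dim=1)
    loss = F.cross_entropy(logits.reshape(-1, logits.shape[-1]),
                           suffix.reshape(-1))
    return loss.exp().item()
```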
Generation metrics use a prompt-generation setup:
generation_prompt_tokens = max_length
generation_output_tokens = gen_tokens
generation_total_tokens = max_length + gen_tokens
For example:
max_length = 1024
gen_tokens = 32
means latency is measured by feeding a 1024-token prompt and generating 32 new tokens.
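A minimal sketch of how such a timing loop can be structured with greedy decoding; the exact measurement boundaries in the script may differ slightly:

```python
import time
import torch

@torch.no_grad()
def measure_generation(model, prompt_ids, gen_tokens=32):
    def sync():
        if prompt_ids.device.type == "cuda":
            torch.cuda.synchronize()

    # Prefill: one forward pass over the prompt with use_cache=True.
    sync()
    t0 = time.perf_counter()
    out = model(prompt_ids, use_cache=True)
    sync()
    prefill_time = time.perf_counter() - t0

    past = out.past_key_values
    next_id = out.logits[:, -1:].argmax(dim=-1)  # first generated token
    sync()
    ttft = time.perf_counter() - t0

    # Decode loop: one token at a time against the cached context.
    t1 = time.perf_counter()
    for _ in range(gen_tokens - 1):
        out = model(next_id, past_key_values=past, use_cache=True)
        past = out.past_key_values
        next_id = out.logits[:, -1:].argmax(dim=-1)
    sync()
    decode_time = time.perf_counter() - t1

    return {
        "prefill_time": prefill_time,
        "ttft": ttft,
        "tpot": decode_time / max(gen_tokens - 1, 1),
        "throughput": gen_tokens / (ttft + decode_time),
        "decode_time": decode_time,
        "total_generation_time": ttft + decode_time,
    }
```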
The reported latency metrics are:
| Metric | Meaning |
|---|---|
| `prefill_time` | Time for the prompt forward pass with `use_cache=True` |
| `ttft` | Time to obtain the first generated token |
| `tpot` | Average time per output token after the first generated token |
| `throughput` | Generated tokens per second |
| `decode_time` | Total time spent decoding generated tokens after the first token |
| `total_generation_time` | TTFT + decode time |
The CSV also includes relative metrics:
| Metric | Meaning |
|---|---|
| `ppl_delta` | Current method PPL minus baseline PPL |
| `prefill_speedup` | Baseline prefill time divided by method prefill time |
| `ttft_speedup` | Baseline TTFT divided by method TTFT |
| `tpot_speedup` | Baseline TPOT divided by method TPOT |
| `throughput_ratio` | Method throughput divided by baseline throughput |
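For reference, the relative columns are plain ratios against the baseline row. A pandas sketch of the definitions (the `method` column name and `baseline` label are assumptions about the CSV layout):

```python
import pandas as pd

df = pd.read_csv("results/EleutherAI_pythia-70m_wikitext_kv_metrics.csv")
base = df[df["method"] == "baseline"].iloc[0]  # reference row

df["ppl_delta"] = df["ppl"] - base["ppl"]
df["prefill_speedup"] = base["prefill_time"] / df["prefill_time"]
df["ttft_speedup"] = base["ttft"] / df["ttft"]
df["tpot_speedup"] = base["tpot"] / df["tpot"]
df["throughput_ratio"] = df["throughput"] / base["throughput"]
```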
Example: evaluate KNorm with compression ratio 0.5 on WikiText.
python script/evaluation_metrics.py \
--method knorm \
--compression_ratio 0.5 \
--dataset wikitext \
--model EleutherAI/pythia-70m \
--max_length 1024 \
--max_samples 100 \
--n_timing 20 \
--n_ppl 20 \
--gen_tokens 32 \
--warmup_times 4 \
  --output_dir ./results

The grid script first evaluates the baseline, then evaluates all compression methods under multiple compression ratios.
Default compression ratios:
0.1, 0.3, 0.5, 0.7
Run both WikiText and PG19:
python script/run_grid_metrics.py \
--model EleutherAI/pythia-70m \
--max_length 1024 \
--max_samples 100 \
--n_timing 20 \
--n_ppl 20 \
--gen_tokens 32 \
--warmup_times 4 \
  --output_dir ./results

Run only one dataset:

python script/run_grid_metrics.py --dataset wikitext

or:

python script/run_grid_metrics.py --dataset pg19

Output files are named by model and dataset:
results/EleutherAI_pythia-70m_wikitext_kv_metrics.csv
results/EleutherAI_pythia-70m_wikitext_kv_metrics.json
results/EleutherAI_pythia-70m_pg19_kv_metrics.csv
results/EleutherAI_pythia-70m_pg19_kv_metrics.json
If multiple datasets are evaluated together, an additional combined CSV/JSON is saved.
The visualization script reads a metrics CSV and produces two figures per dataset:
- PPL vs compression ratio
- A representative time metric vs compression ratio
By default, the time metric is tpot, because TPOT directly reflects the decode-time benefit of KV-cache compression.
python script/visualization.py \
  --csv results/EleutherAI_pythia-70m_wikitext_kv_metrics.csv

python script/visualization.py \
  --csv results/EleutherAI_pythia-70m_pg19_kv_metrics.csv

python script/visualization.py \
  --csv results/EleutherAI_pythia-70m_wikitext_kv_metrics.csv \
  --time_metric ttft

Supported time metrics:
prefill_time
ttft
tpot
throughput
total_generation_time
decode_time
By default, figures are saved to:
<csv_parent>/plots/<dataset>/
Example output:
results/plots/wikitext/EleutherAI_pythia-70m_wikitext_ppl_vs_compression.png
results/plots/wikitext/EleutherAI_pythia-70m_wikitext_tpot_vs_compression.png
results/plots/pg19/EleutherAI_pythia-70m_pg19_ppl_vs_compression.png
results/plots/pg19/EleutherAI_pythia-70m_pg19_tpot_vs_compression.png
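For a quick custom plot outside `script/visualization.py`, a matplotlib sketch along the same lines (column names such as `method`, `compression_ratio`, and `ppl` are assumptions about the CSV layout):

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("results/EleutherAI_pythia-70m_wikitext_kv_metrics.csv")
base_ppl = df.loc[df["method"] == "baseline", "ppl"].iloc[0]

# One curve per compression method, baseline as a horizontal reference.
for method, grp in df[df["method"] != "baseline"].groupby("method"):
    grp = grp.sort_values("compression_ratio")
    plt.plot(grp["compression_ratio"], grp["ppl"], marker="o", label=method)

plt.axhline(base_ppl, ls="--", color="gray", label="baseline")
plt.xlabel("compression ratio")
plt.ylabel("suffix PPL")
plt.legend()
plt.savefig("ppl_vs_compression.png", dpi=150)
```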
wikitext-2-raw-v1 consists of Wikipedia-style text with many short lines, headings, blank lines, and markup-like fragments. For language-model evaluation, the script tokenizes valid lines, concatenates them with EOS separators, and chunks the resulting token stream by token length.
Do not chunk WikiText by characters when computing PPL. Character-based chunks can produce much shorter and unstable token contexts, which often leads to much higher PPL.
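A token-level chunking sketch of the kind described above (the exact line filtering in the script may differ):

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m")
ds = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")

# Tokenize non-empty lines and join them with EOS separators.
ids = []
for line in ds["text"]:
    line = line.strip()
    if line:
        ids.extend(tok(line).input_ids + [tok.eos_token_id])

# Chunk the resulting token stream by token length, not by characters.
max_length = 1024
chunks = [ids[i:i + max_length]
          for i in range(0, len(ids) - max_length + 1, max_length)]
```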
PG19 is much larger. To avoid downloading the full dataset during quick experiments, the script uses streaming mode and only materializes a limited amount of text controlled by pg19_max_chars.
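A streaming sketch of this pattern; the dataset id, split, and the `pg19_max_chars` value here are illustrative:

```python
from datasets import load_dataset

# Streaming avoids downloading the full PG19 corpus; only a bounded
# amount of text is materialized.
pg19_max_chars = 2_000_000
stream = load_dataset("pg19", split="test", streaming=True)

texts, n = [], 0
for example in stream:
    texts.append(example["text"])
    n += len(example["text"])
    if n >= pg19_max_chars:
        break
corpus = "\n\n".join(texts)[:pg19_max_chars]
```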
The included CSV files report results for:
model = EleutherAI/pythia-70m
max_length = 1024
prefix_ratio = 0.5
gen_tokens = 32
warmup_times = 4
compression ratios = 0.1, 0.3, 0.5, 0.7
The baseline suffix PPL on WikiText is about 77.25. This value is higher than the PG19 baseline because WikiText-2 raw contains many short lines, headings, and Wikipedia formatting artifacts. Since this project uses suffix PPL rather than standard full-sequence benchmark PPL, the most important number is the relative degradation, ppl_delta.
Observed trends:
- KNormPress shows a monotonic PPL increase as compression ratio grows. Its PPL delta rises from about +1.83 at compression ratio 0.1 to about +14.07 at compression ratio 0.7. This indicates that aggressive KNorm compression removes increasingly important context on WikiText.
- SnapKVPress is relatively stable at low compression ratios but degrades at higher compression ratios. In the included results, its PPL delta is small at compression ratio 0.1, but becomes much larger when the ratio reaches 0.5 and 0.7.
- StreamingLLMPress and LagKVPress are generally more stable in PPL across compression ratios. Their PPL degradation stays much smaller than KNorm at high compression rates, suggesting that sink-token and recent-token retention are strong heuristics for this setting.
For latency, TPOT is the most representative metric. KV-cache compression mainly helps during decoding because the model attends to a shorter cached context. In the included WikiText results, most compressed methods improve or roughly preserve TPOT compared with the baseline, while prefill and TTFT are noisier because they include compression overhead and Python/hook overhead.
The baseline suffix PPL on PG19 is about 29.21, which is much lower than WikiText in the included run. PG19 provides longer, more coherent book-style text, which is easier to evaluate under long-context chunking.
Observed trends:
- KNormPress degrades gradually as compression ratio increases. Its PPL delta grows from about +0.40 at compression ratio 0.1 to about +3.80 at compression ratio 0.7.
- SnapKVPress performs well at low compression ratios. Its PPL delta is very small around compression ratios 0.1 and 0.3, but becomes much larger at 0.5 and 0.7.
- StreamingLLMPress remains very stable across the tested compression ratios. Its PPL delta stays below about +0.5, even at high compression ratios in the included run.
- LagKVPress also remains stable, with small PPL degradation across all tested compression ratios.
For speed, the compressed methods usually reduce TPOT and improve throughput because the decode stage uses a smaller KV cache. However, prefill speedups are less consistent. On a small model such as Pythia-70M, compression overhead such as scoring, top-k selection, gathering, and hook management can be comparable to or larger than the saved compute. Therefore, TPOT and throughput are usually more meaningful than prefill time when evaluating KV-cache compression benefits.
The results suggest three main conclusions:
- Compression-quality trade-off is method-dependent. KNorm and SnapKV can degrade noticeably at high compression ratios, while StreamingLLM and LagKV are more stable in these runs.
- Decode-time metrics are more informative than prefill-time metrics. KV-cache compression is mainly designed to reduce the cost of attending over past tokens during decoding, so TPOT and throughput better reflect the expected benefit.
- Dataset structure matters. WikiText-2 raw gives higher and noisier PPL than PG19 under this suffix-PPL protocol, largely because of its line-based Wikipedia structure and formatting artifacts.
These observations are based on a small model and a limited number of evaluation samples. For stronger conclusions, increase n_timing, n_ppl, and max_samples, and test larger models.

To reproduce the included results:
- Install dependencies from `requirements.txt`.
- Run the grid script with the same model and parameters.
- Keep `max_length=1024`, `prefix_ratio=0.5`, and `gen_tokens=32`.
- Use the same compression ratios: `0.1, 0.3, 0.5, 0.7`.
- Run `script/visualization.py` on the generated CSV files.
Example:
python script/run_grid_metrics.py \
--model EleutherAI/pythia-70m \
--max_length 1024 \
--max_samples 100 \
--n_timing 20 \
--n_ppl 20 \
--gen_tokens 32 \
--warmup_times 4 \
--output_dir ./results
python script/visualization.py \
--csv results/EleutherAI_pythia-70m_wikitext_kv_metrics.csv
python script/visualization.py \
  --csv results/EleutherAI_pythia-70m_pg19_kv_metrics.csv

- The absolute PPL values are suffix-PPL values under a cached-prefix protocol. Use `ppl_delta` for method comparison.
- Timing numbers on small models can be noisy. Increase `n_timing` and generation samples for more stable latency estimates.
- Prefill speedup can be misleading when compression overhead dominates. TPOT and throughput are usually more representative.
- WikiText should be chunked by tokens, not characters.
- CUDA warmup is important. The scripts run warmup before evaluation to reduce cold-start bias.
