KVPress

A lightweight experimental implementation of several KV-cache compression strategies for causal language models, with reproducible evaluation scripts for perplexity and generation latency.

This project evaluates how different KV-cache compression methods affect:

  • PPL: suffix perplexity after prefix prefill
  • Prefill time: prompt forward latency
  • TTFT: time to first generated token
  • TPOT: average time per generated token after the first token
  • Throughput: generated tokens per second

The current experiments use EleutherAI/pythia-70m on wikitext-2-raw-v1 and PG19.


Project structure

KVPress/
├── press/
│   ├── base_press.py
│   ├── score_based.py
│   ├── knorm_press.py
│   ├── snapkv_press.py
│   ├── streaming_llm_press.py
│   ├── lagkv_press.py
│   └── utils.py
├── script/
│   ├── evaluation_metrics.py
│   ├── run_grid_metrics.py
│   └── visualization.py
├── results/
│   ├── EleutherAI_pythia-70m_wikitext_kv_metrics.csv
│   ├── EleutherAI_pythia-70m_pg19_kv_metrics.csv
│   └── plots/
├── requirements.txt
└── README.md

Installation

Create a Python environment and install dependencies:

pip install -r requirements.txt

If you use Hugging Face models or datasets frequently, logging in is recommended:

huggingface-cli login

Public models and datasets can still be downloaded without login, but unauthenticated requests may be slower or rate-limited.


Implemented compression methods

Baseline

No KV-cache compression is applied. This row is used as the reference for PPL delta and speedup metrics.

KNormPress

Scores KV pairs by key-vector L2 norm. In this implementation the score is the negative key norm, so pairs with lower key norm are retained with higher priority.
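The retention rule can be sketched in a few lines. The helper below is illustrative (not the repo's actual knorm_press.py), assuming keys shaped (batch, heads, seq_len, head_dim):

```python
import torch

def knorm_keep_indices(keys: torch.Tensor, compression_ratio: float) -> torch.Tensor:
    """Indices of KV positions to keep: lowest key norm = highest priority."""
    scores = -keys.norm(dim=-1)                 # (batch, heads, seq_len); low norm -> high score
    n_keep = int(keys.shape[-2] * (1 - compression_ratio))
    kept = scores.topk(n_keep, dim=-1).indices  # positions with the smallest key norms
    return kept.sort(dim=-1).values             # restore positional order
```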

SnapKVPress

Uses attention statistics from a recent local window to select important KV pairs.
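A minimal sketch of the window-based scoring idea (illustrative, not the repo's snapkv_press.py; full SnapKV additionally pools scores and always protects the observation window, and a causal mask is omitted here for brevity):

```python
import torch

def window_attention_scores(queries: torch.Tensor, keys: torch.Tensor, window: int = 8) -> torch.Tensor:
    """Average attention that the last `window` queries pay to each key position."""
    q = queries[..., -window:, :]                                    # (b, h, window, d)
    logits = q @ keys.transpose(-1, -2) / keys.shape[-1] ** 0.5      # scaled dot-product
    attn = torch.softmax(logits, dim=-1)                             # (b, h, window, seq)
    return attn.mean(dim=-2)                                         # (b, h, seq) importance per key
```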

StreamingLLMPress

Keeps a fixed number of initial sink tokens and a recent-window style cache budget.
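The sink-plus-recent policy reduces to pure index arithmetic. A sketch with hypothetical parameter names (n_sink = number of initial sink tokens, budget = total KV entries kept):

```python
def streaming_keep_indices(seq_len: int, n_sink: int, budget: int) -> list:
    """Keep the first n_sink tokens plus the most recent (budget - n_sink) tokens."""
    if seq_len <= budget:
        return list(range(seq_len))          # nothing to evict yet
    recent_start = seq_len - (budget - n_sink)
    return list(range(n_sink)) + list(range(recent_start, seq_len))
```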

LagKVPress

Keeps sink tokens and recent tokens while applying a lag-aware retention strategy.


Evaluation protocol

The evaluation has two separate parts.

1. PPL evaluation

PPL is computed using a two-stage protocol:

input_sequence_length = max_length
prefix_length = max_length * prefix_ratio
suffix_length = max_length - prefix_length

For example, with:

max_length = 1024
prefix_ratio = 0.5

we evaluate:

prefix = first 512 tokens
suffix = last 512 tokens

The model first performs prefill on the prefix. For compressed methods, KV compression is applied only during this prefix prefill stage. Then the suffix is evaluated using the resulting KV cache.

Therefore, reported PPL is suffix PPL under a cached-prefix setting, not the exact same value as standard full-sequence benchmark PPL. This protocol is designed to isolate the quality impact of KV-cache compression.
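The split arithmetic and the final PPL reduction are simple enough to pin down exactly. The helpers below are illustrative (not the repo's evaluation_metrics.py), assuming nlls holds per-token negative log-likelihoods of the suffix computed against the (possibly compressed) prefix KV cache:

```python
import math

def split_lengths(max_length: int, prefix_ratio: float):
    """Return (prefix_length, suffix_length) for the two-stage protocol."""
    prefix_len = int(max_length * prefix_ratio)
    return prefix_len, max_length - prefix_len

def suffix_ppl(nlls):
    """Perplexity = exp(mean negative log-likelihood) over the suffix tokens."""
    return math.exp(sum(nlls) / len(nlls))
```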

2. Generation latency evaluation

Generation metrics use a prompt-generation setup:

generation_prompt_tokens = max_length
generation_output_tokens = gen_tokens
generation_total_tokens = max_length + gen_tokens

For example:

max_length = 1024
gen_tokens = 32

means latency is measured by feeding a 1024-token prompt and generating 32 new tokens.

The reported latency metrics are:

Metric                 Meaning
prefill_time           Prompt forward time with use_cache=True
ttft                   Time to obtain the first generated token
tpot                   Average time per output token after the first token
throughput             Generated tokens per second
decode_time            Total time spent decoding generated tokens after the first token
total_generation_time  TTFT + decode time
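Given per-token wall-clock times, these metrics relate as sketched below. This is an assumption-laden illustration, taking TTFT to include the prefill forward (consistent with total_generation_time = TTFT + decode time above):

```python
def latency_metrics(prefill_time: float, token_times: list) -> dict:
    """token_times[i] = wall time to produce generated token i (i = 0 is the first)."""
    ttft = prefill_time + token_times[0]        # assumed to include prefill
    decode_time = sum(token_times[1:])
    tpot = decode_time / (len(token_times) - 1)
    total = ttft + decode_time
    return {
        "ttft": ttft,
        "tpot": tpot,
        "decode_time": decode_time,
        "total_generation_time": total,
        "throughput": len(token_times) / total,
    }
```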

The CSV also includes relative metrics:

Metric            Meaning
ppl_delta         Current method PPL minus baseline PPL
prefill_speedup   Baseline prefill time divided by method prefill time
ttft_speedup      Baseline TTFT divided by method TTFT
tpot_speedup      Baseline TPOT divided by method TPOT
throughput_ratio  Method throughput divided by baseline throughput
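The relative columns are plain differences and ratios against the baseline row; an illustrative helper:

```python
def relative_metrics(method: dict, baseline: dict) -> dict:
    """Compute the relative CSV columns for one method row against the baseline row."""
    return {
        "ppl_delta": method["ppl"] - baseline["ppl"],
        "prefill_speedup": baseline["prefill_time"] / method["prefill_time"],
        "ttft_speedup": baseline["ttft"] / method["ttft"],
        "tpot_speedup": baseline["tpot"] / method["tpot"],
        "throughput_ratio": method["throughput"] / baseline["throughput"],
    }
```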

Run a single method

Example: evaluate KNorm with compression ratio 0.5 on WikiText.

python script/evaluation_metrics.py \
  --method knorm \
  --compression_ratio 0.5 \
  --dataset wikitext \
  --model EleutherAI/pythia-70m \
  --max_length 1024 \
  --max_samples 100 \
  --n_timing 20 \
  --n_ppl 20 \
  --gen_tokens 32 \
  --warmup_times 4 \
  --output_dir ./results

Run the full grid

The grid script first evaluates the baseline, then evaluates all compression methods under multiple compression ratios.

Default compression ratios:

0.1, 0.3, 0.5, 0.7

Run both WikiText and PG19:

python script/run_grid_metrics.py \
  --model EleutherAI/pythia-70m \
  --max_length 1024 \
  --max_samples 100 \
  --n_timing 20 \
  --n_ppl 20 \
  --gen_tokens 32 \
  --warmup_times 4 \
  --output_dir ./results

Run only one dataset:

python script/run_grid_metrics.py --dataset wikitext

or:

python script/run_grid_metrics.py --dataset pg19

Output files are named by model and dataset:

results/EleutherAI_pythia-70m_wikitext_kv_metrics.csv
results/EleutherAI_pythia-70m_wikitext_kv_metrics.json
results/EleutherAI_pythia-70m_pg19_kv_metrics.csv
results/EleutherAI_pythia-70m_pg19_kv_metrics.json

If multiple datasets are evaluated together, an additional combined CSV/JSON is saved.


Visualization

The visualization script reads a metrics CSV and produces two figures per dataset:

  1. PPL vs compression ratio
  2. A representative time metric vs compression ratio

By default, the time metric is tpot, because TPOT directly reflects the decode-time benefit of KV-cache compression.

Plot WikiText results

python script/visualization.py \
  --csv results/EleutherAI_pythia-70m_wikitext_kv_metrics.csv

Plot PG19 results

python script/visualization.py \
  --csv results/EleutherAI_pythia-70m_pg19_kv_metrics.csv

Use a different time metric

python script/visualization.py \
  --csv results/EleutherAI_pythia-70m_wikitext_kv_metrics.csv \
  --time_metric ttft

Supported time metrics:

prefill_time
ttft
tpot
throughput
total_generation_time
decode_time

By default, figures are saved to:

<csv_parent>/plots/<dataset>/

Example output:

results/plots/wikitext/EleutherAI_pythia-70m_wikitext_ppl_vs_compression.png
results/plots/wikitext/EleutherAI_pythia-70m_wikitext_tpot_vs_compression.png
results/plots/pg19/EleutherAI_pythia-70m_pg19_ppl_vs_compression.png
results/plots/pg19/EleutherAI_pythia-70m_pg19_tpot_vs_compression.png

Notes on datasets

WikiText-2 raw

wikitext-2-raw-v1 consists of Wikipedia-style text with many short lines, headings, blank lines, and markup-like fragments. For language-model evaluation, the script tokenizes non-empty lines, concatenates them with EOS separators, and cuts the resulting token stream into fixed-length token chunks.

Do not chunk WikiText by characters when computing PPL. Character-based chunks can produce short, unstable token contexts, which often inflates PPL.
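The tokenize / EOS-join / token-chunk procedure might look like the sketch below (illustrative, not the repo's actual helper); tokenizer is assumed to be any object exposing encode and eos_token_id:

```python
def chunk_by_tokens(lines, tokenizer, max_length: int):
    """Tokenize non-empty lines, join with EOS, cut into full-length token chunks."""
    ids = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        ids.extend(tokenizer.encode(line))
        ids.append(tokenizer.eos_token_id)
    # drop the trailing partial chunk so every context has exactly max_length tokens
    return [ids[i:i + max_length] for i in range(0, len(ids) - max_length + 1, max_length)]
```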

PG19

PG19 is much larger. To avoid downloading the full dataset during quick experiments, the script uses streaming mode and only materializes a limited amount of text controlled by pg19_max_chars.
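The streaming truncation amounts to consuming examples until a character budget is met. A sketch with a generic consumer; in practice the stream would come from something like load_dataset("pg19", split=..., streaming=True) (dataset id and field name assumed here):

```python
def take_chars(stream, max_chars: int, text_key: str = "text") -> str:
    """Materialize at most max_chars characters from a streaming dataset."""
    parts, n = [], 0
    for example in stream:
        text = example[text_key]
        parts.append(text)
        n += len(text)
        if n >= max_chars:
            break
    return "".join(parts)[:max_chars]
```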


Current experimental results

The included CSV files report results for:

model = EleutherAI/pythia-70m
max_length = 1024
prefix_ratio = 0.5
gen_tokens = 32
warmup_times = 4
compression ratios = 0.1, 0.3, 0.5, 0.7

WikiText results

The baseline suffix PPL on WikiText is about 77.25. This value is higher than the PG19 baseline because WikiText-2 raw contains many short lines, headings, and Wikipedia formatting artifacts. Since this project uses suffix PPL rather than standard full-sequence benchmark PPL, the most important number is the relative degradation, ppl_delta.

Observed trends:

  • KNormPress shows a monotonic PPL increase as compression ratio grows. Its PPL delta rises from about +1.83 at compression ratio 0.1 to about +14.07 at compression ratio 0.7. This indicates that aggressive KNorm compression removes increasingly important context on WikiText.
  • SnapKVPress is relatively stable at low compression ratios but degrades at higher compression ratios. In the included results, its PPL delta is small at compression ratio 0.1, but becomes much larger when the ratio reaches 0.5 and 0.7.
  • StreamingLLMPress and LagKVPress are generally more stable in PPL across compression ratios. Their PPL degradation stays much smaller than KNorm at high compression rates, suggesting that sink-token and recent-token retention are strong heuristics for this setting.

For latency, TPOT is the most representative metric. KV-cache compression mainly helps during decoding because the model attends to a shorter cached context. In the included WikiText results, most compressed methods improve or roughly preserve TPOT compared with the baseline, while prefill and TTFT are noisier because they include compression overhead and Python/hook overhead.

PG19 results

The baseline suffix PPL on PG19 is about 29.21, which is much lower than WikiText in the included run. PG19 provides longer, more coherent book-style text, which is easier to evaluate under long-context chunking.

Observed trends:

  • KNormPress degrades gradually as compression ratio increases. Its PPL delta grows from about +0.40 at compression ratio 0.1 to about +3.80 at compression ratio 0.7.
  • SnapKVPress performs well at low compression ratios. Its PPL delta is very small around compression ratios 0.1 and 0.3, but becomes much larger at 0.5 and 0.7.
  • StreamingLLMPress remains very stable across the tested compression ratios. Its PPL delta stays below about +0.5, even at high compression ratios in the included run.
  • LagKVPress also remains stable, with small PPL degradation across all tested compression ratios.

For speed, the compressed methods usually reduce TPOT and improve throughput because the decode stage uses a smaller KV cache. However, prefill speedups are less consistent. On a small model such as Pythia-70M, compression overhead such as scoring, top-k selection, gathering, and hook management can be comparable to or larger than the saved compute. Therefore, TPOT and throughput are usually more meaningful than prefill time when evaluating KV-cache compression benefits.

Overall interpretation

The results suggest three main conclusions:

  1. Compression-quality trade-off is method-dependent. KNorm and SnapKV can degrade noticeably at high compression ratios, while StreamingLLM and LagKV are more stable in these runs.
  2. Decode-time metrics are more informative than prefill-time metrics. KV-cache compression is mainly designed to reduce the cost of attending over past tokens during decoding, so TPOT and throughput better reflect the expected benefit.
  3. Dataset structure matters. WikiText-2 raw gives higher and noisier PPL than PG19 under this suffix-PPL protocol, largely because of its line-based Wikipedia structure and formatting artifacts.

These observations are based on a small model and a limited number of evaluation samples. For stronger conclusions, increase n_timing, n_ppl, and max_samples, and test larger models.

(Figures: PPL vs compression ratio and TPOT vs compression ratio, for PG19 and WikiText.)


Reproducibility checklist

To reproduce the included results:

  1. Install dependencies from requirements.txt.
  2. Run the grid script with the same model and parameters.
  3. Keep max_length=1024, prefix_ratio=0.5, and gen_tokens=32.
  4. Use the same compression ratios: 0.1, 0.3, 0.5, 0.7.
  5. Run script/visualization.py on the generated CSV files.

Example:

python script/run_grid_metrics.py \
  --model EleutherAI/pythia-70m \
  --max_length 1024 \
  --max_samples 100 \
  --n_timing 20 \
  --n_ppl 20 \
  --gen_tokens 32 \
  --warmup_times 4 \
  --output_dir ./results

python script/visualization.py \
  --csv results/EleutherAI_pythia-70m_wikitext_kv_metrics.csv

python script/visualization.py \
  --csv results/EleutherAI_pythia-70m_pg19_kv_metrics.csv

Practical cautions

  • The absolute PPL values are suffix-PPL values under a cached-prefix protocol. Use ppl_delta for method comparison.
  • Timing numbers on small models can be noisy. Increase n_timing and generation samples for more stable latency estimates.
  • Prefill speedup can be misleading when compression overhead dominates. TPOT and throughput are usually more representative.
  • WikiText should be chunked by tokens, not characters.
  • CUDA warmup is important. The scripts run warmup before evaluation to reduce cold-start bias.
