A lightweight, high-performance Key-Value cache implementation for Large Language Models with PagedAttention support. Built for educational purposes and portfolio demonstration.
litecache/
├── README.md
├── pyproject.toml
├── .gitignore
├── LICENSE
│
├── litecache/
│   ├── __init__.py
│   ├── config.py                # Configuration classes
│   ├── block_manager.py         # Block allocation and management
│   ├── cache/
│   │   ├── __init__.py
│   │   ├── base.py              # Abstract cache interface
│   │   ├── paged_attention.py   # PagedAttention implementation
│   │   └── utils.py             # Cache utilities
│   ├── kernels/
│   │   ├── __init__.py
│   │   ├── triton_kernels.py    # Triton GPU kernels
│   │   └── torch_fallback.py    # PyTorch CPU/fallback implementations
│   ├── models/
│   │   ├── __init__.py
│   │   ├── adapter.py           # Model integration adapter
│   │   └── hooks.py             # HuggingFace integration hooks
│   └── memory/
│       ├── __init__.py
│       ├── allocator.py         # Physical block allocator
│       └── sequence.py          # Logical sequence management
│
├── tests/
│   ├── __init__.py
│   ├── conftest.py              # Pytest fixtures
│   ├── test_block_manager.py
│   ├── test_cache.py
│   ├── test_kernels.py
│   ├── test_integration.py
│   └── test_models.py
│
├── benchmarks/
│   ├── __init__.py
│   ├── run_benchmarks.py
│   ├── throughput.py
│   └── memory_profile.py
│
└── examples/
    ├── basic_usage.py
    ├── huggingface_integration.py
    └── benchmark_comparison.py

LiteCache implements an efficient KV cache system for LLM inference, featuring:
- PagedAttention: Memory-efficient attention mechanism with block-based memory management
- Pluggable Architecture: Easy to extend with different caching mechanisms (RadixAttention, StreamingLLM, etc.)
- GPU Acceleration: Triton kernels for optimized GPU operations
- Model Agnostic: Clean adapter interface for integration with existing models
- Quantization Support: FP16/BF16 precision modes
- ✅ PagedAttention cache with block management
- ✅ Copy-on-Write (CoW) for shared prefixes
- ✅ Triton GPU kernels with PyTorch fallback
- ✅ HuggingFace Transformers integration
- ✅ Support for lightweight decoder-only models (GPT-2, TinyLlama, Phi, Qwen)
- ✅ FP16/BF16 quantization support
- ✅ Comprehensive test suite
- 🔄 Dynamic batching and continuous batching
- 🔄 RadixAttention for prefix caching
- 🔄 Multi-GPU support
- 🔄 FP8 quantization
- 🔄 Speculative decoding integration
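The Copy-on-Write entry above can be illustrated with reference-counted blocks. The following is a minimal sketch, not LiteCache's actual code: `CowBlockPool`, `fork`, and `write` are hypothetical names, and block contents are modeled as plain Python lists instead of GPU tensors.

```python
class CowBlockPool:
    """Toy block pool with reference counts (hypothetical, for illustration)."""

    def __init__(self, num_blocks: int, block_size: int) -> None:
        self.block_size = block_size
        self.data = [[None] * block_size for _ in range(num_blocks)]
        self.refcount = [0] * num_blocks
        self.free = list(range(num_blocks - 1, -1, -1))  # pop() yields block 0 first

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("no free blocks")
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def fork(self, block_table: list[int]) -> list[int]:
        """Share every block of a sequence with a new sequence (no copying)."""
        for block in block_table:
            self.refcount[block] += 1
        return list(block_table)

    def write(self, block_table: list[int], logical_idx: int, slot: int, value) -> None:
        """Write one slot, copying the block first if another sequence shares it."""
        block = block_table[logical_idx]
        if self.refcount[block] > 1:
            new_block = self.allocate()
            self.data[new_block] = list(self.data[block])  # private copy
            self.refcount[block] -= 1
            block_table[logical_idx] = new_block
            block = new_block
        self.data[block][slot] = value


pool = CowBlockPool(num_blocks=4, block_size=2)
parent = [pool.allocate()]
pool.write(parent, 0, 0, "prefix")
child = pool.fork(parent)               # child shares block 0 with parent
pool.write(child, 0, 1, "child-only")   # triggers a copy; parent is untouched
print(parent, child)                    # [0] [1]
```

Forking only bumps reference counts, so a shared prompt prefix costs no extra memory until one of the sequences writes into a shared block.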
- Python 3.10+
- PyTorch 2.0+ (CPU or GPU version)
- Triton 2.0+ (for GPU support only)
- CUDA 11.8+ (for GPU support only)
- transformers (HuggingFace)
For CPU (development on a laptop):
git clone https://github.com/yourusername/litecache.git
cd litecache
uv venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
# Install PyTorch CPU version
uv pip install torch --index-url https://download.pytorch.org/whl/cpu
# Install litecache with dev dependencies
uv pip install -e ".[dev]"

For GPU (production/benchmarking):
git clone https://github.com/yourusername/litecache.git
cd litecache
uv venv
source .venv/bin/activate
# Install PyTorch GPU version with CUDA 11.8
uv pip install torch triton --index-url https://download.pytorch.org/whl/cu118
# Install litecache with dev dependencies
uv pip install -e ".[dev]"

Using the setup script:
chmod +x setup.sh
./setup.sh cpu # or ./setup.sh gpu

from litecache import PagedAttentionCache, CacheConfig
import torch
# Configure cache
config = CacheConfig(
    block_size=16,      # tokens per block
    num_blocks=1024,    # total blocks
    num_heads=32,
    head_dim=128,
    num_layers=32,
    dtype=torch.float16,
    device="cuda"
)
# Initialize cache
cache = PagedAttentionCache(config)
# Allocate sequence
seq_id = cache.allocate_sequence(seq_len=512)
# Use in attention computation
attention_output = cache.paged_attention(
    query=q,           # [batch, num_heads, seq_len, head_dim]
    block_tables=...,  # [batch, max_blocks]
    context_lens=...   # [batch]
)
# Free when done
cache.free_sequence(seq_id)

from litecache.models import KVCacheAdapter
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load model
model = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
# Wrap with cache adapter
cached_model = KVCacheAdapter(model, cache_config=config)
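# Tokenize a prompt (any text works; this one is only an example)
input_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids
# Generate with efficient caching
outputs = cached_model.generate(
    input_ids=input_ids,
    max_length=100,
    temperature=0.7
)

# Run all tests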
pytest tests/
# Run specific test file
pytest tests/test_cache.py -v
# Run with coverage
pytest tests/ --cov=litecache --cov-report=html

# Run throughput benchmarks
python benchmarks/run_benchmarks.py --model gpt2 --batch-size 1
# Compare with baseline
python examples/benchmark_comparison.py

Expected improvements over the standard HuggingFace KV cache:
- Memory Efficiency: ~40-50% reduction in peak memory usage
- Throughput: ~1.5-2x tokens/second for long sequences
- Batch Scaling: Better memory scaling with increasing batch sizes
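To make the memory figures concrete, the cache footprint can be estimated from the values in the `CacheConfig` example above (block_size=16, num_blocks=1024, num_heads=32, head_dim=128, num_layers=32, FP16). The sketch below is back-of-the-envelope arithmetic, not a measurement; the helper names and the 2048-token reservation used for comparison are assumptions made for illustration.

```python
import math


def kv_pool_bytes(num_blocks, block_size, num_layers, num_heads, head_dim, bytes_per_elem):
    """Bytes for the whole block pool; the factor 2 covers keys and values."""
    return 2 * num_layers * num_blocks * block_size * num_heads * head_dim * bytes_per_elem


def blocks_needed(seq_len, block_size):
    """Blocks a sequence occupies; only the last block can be partly empty."""
    return math.ceil(seq_len / block_size)


pool = kv_pool_bytes(num_blocks=1024, block_size=16, num_layers=32,
                     num_heads=32, head_dim=128, bytes_per_elem=2)
print(pool / 2**30)            # 8.0 (GiB)

print(blocks_needed(512, 16))  # 32 blocks for the 512-token sequence

# Paged allocation rounds up to the next block, while a statically preallocated
# contiguous cache reserves its maximum length up front (2048 is an assumed value).
seq_len, max_len = 500, 2048
paged_tokens = blocks_needed(seq_len, 16) * 16   # 512 token slots
reserved_tokens = max_len                        # 2048 token slots
print(paged_tokens, reserved_tokens)             # 512 2048
```

With paging, the waste per sequence is bounded by one partly filled block, which is where the better scaling with batch size is expected to come from.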
- Block Manager (block_manager.py)
  - Physical memory allocation
  - Free block tracking
  - Block recycling
- Cache Backend (cache/paged_attention.py)
  - KV tensor storage
  - Logical-to-physical block mapping
  - Attention computation orchestration
- Triton Kernels (kernels/triton_kernels.py)
  - Paged attention kernel
  - Block copy operations
  - Optimized memory access patterns
- Model Adapter (models/adapter.py)
  - Framework-agnostic integration
  - Transparent cache management
  - Generation loop handling

- Extensibility: Abstract base classes for cache backends
- Performance: Triton kernels with PyTorch fallback
- Correctness: Comprehensive test coverage
- Usability: Simple API with sane defaults
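For the correctness principle, a slow reference is handy when testing kernels. The sketch below computes single-query attention over a paged key/value store in plain Python (no PyTorch or Triton), so the gather-through-block-table step is explicit. It is an illustration under stated assumptions: the function name, the list-based layout, and the single-head, single-query scope are choices made here, not LiteCache's kernel interface.

```python
import math


def paged_attention_reference(query, key_blocks, value_blocks, block_table, context_len):
    """Single-head, single-query attention over paged K/V stored as nested lists.

    key_blocks[b][s] and value_blocks[b][s] are vectors for slot s of physical block b.
    block_table lists the physical blocks of one sequence in logical order.
    """
    block_size = len(key_blocks[0])
    scale = 1.0 / math.sqrt(len(query))

    # Gather: walk logical token positions and look up their physical slots.
    keys, values = [], []
    for pos in range(context_len):
        block = block_table[pos // block_size]
        slot = pos % block_size
        keys.append(key_blocks[block][slot])
        values.append(value_blocks[block][slot])

    # Scaled dot-product scores, then a numerically stable softmax.
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    max_score = max(scores)
    weights = [math.exp(s - max_score) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]

    # Weighted sum of the gathered values.
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


# Two physical blocks of size 2; the sequence uses block 1 first, then block 0.
key_blocks = [[[0.0, 0.0], [9.0, 9.0]], [[0.0, 0.0], [0.0, 0.0]]]
value_blocks = [[[3.0, 3.0], [9.0, 9.0]], [[1.0, 1.0], [2.0, 2.0]]]
out = paged_attention_reference([1.0, 1.0], key_blocks, value_blocks,
                                block_table=[1, 0], context_len=3)
print(out)  # the three in-context keys are all zero, so this is the mean of their values
```

The slot holding the 9.0 vectors lies beyond `context_len` and is never read, which is the behavior a paged kernel has to reproduce for partly filled last blocks.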
Detailed documentation is planned for the /docs folder (coming soon) and will cover:
- Architecture deep-dive
- API reference
- Performance tuning guide
- Kernel implementation details
This is primarily an educational project, but suggestions and improvements are welcome!
- Fork the repository
- Create a feature branch
- Add tests for new functionality
- Submit a pull request
This project draws inspiration from:
- vLLM - PagedAttention implementation
- SGLang - RadixAttention concepts
- FlexFlow - Research foundations
Note: This is a portfolio/educational project demonstrating systems programming and ML optimization skills. For production use cases, consider battle-tested solutions like vLLM or SGLang.