A ground-up LLM inference engine for iOS and Android, written in Rust. It brings server-grade serving concepts to phones running in under 512 MB of RAM: paged KV cache, continuous decode scheduling, and multi-session concurrency.
Not a wrapper around llama.cpp. Not a port of vLLM. A new runtime designed for mobile constraints from scratch.
> [!NOTE]
> This is just a research project; don't get mad at me lol!
| Resource | Path |
|---|---|
| Getting Started | Quick Start below |
| Architecture & Design | docs/project_architecture.md |
| Paged KV Cache Deep Dive | docs/paged-kv-cache-foundation.md |
| Benchmarks | docs/benchmarks/README.md |
| Model Conversion | docs/convert-quantized-models.md |
| VLM (Vision) Guide | docs/vlm-smolvlm-onnx.md |
| iOS Demo App | bindings/ios/CellmDemo |
| Android Bindings | bindings/kotlin |
Prerequisites:

- Rust 1.75+ (modern stable toolchain)
- macOS / iOS for Metal acceleration (Linux/Android builds use CPU path)
- Git LFS (for bundled sample models)
Build, then run a quick chat against the bundled FP16 SmolLM2 checkpoint:

```bash
cargo build --release
```

```bash
cargo run --release --bin infer -- \
  --model models/smollm2-135m.cellm \
  --tokenizer models/hf/smollm2-135m/tokenizer.json \
  --prompt "Hello, how are you?" \
  --chat \
  --gen 32
```

Run the int8 checkpoint on the Metal backend:

```bash
cargo run --release --bin infer -- \
  --model models/smollm2-135m-int8.cellm \
  --tokenizer models/hf/smollm2-135m/tokenizer.json \
  --prompt "Hello" \
  --chat \
  --gen 16 \
  --backend metal
```

> Tip: Use `--chat` for ChatML-style formatting. Without it, many base models behave like text-completion engines and may not answer directly.
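For context, ChatML wraps each turn in `<|im_start|>role ... <|im_end|>` markers. Below is a minimal sketch of that template's shape; whether cellm's prompt formatter emits exactly this string is an assumption, so treat it as illustrative:

```rust
/// Minimal sketch of ChatML-style prompt formatting, as used by many
/// instruction-tuned models. The exact template cellm emits with --chat
/// is an assumption here; check the prompt formatter for the real one.
fn format_chatml(system: &str, user: &str) -> String {
    format!(
        "<|im_start|>system\n{system}<|im_end|>\n\
         <|im_start|>user\n{user}<|im_end|>\n\
         <|im_start|>assistant\n"
    )
}

fn main() {
    let prompt = format_chatml("You are a helpful assistant.", "Hello, how are you?");
    println!("{prompt}");
}
```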
Run the Metal smoke test:

```bash
cargo run --release --bin metal-smoke
```

The end-to-end request flow:

```mermaid
flowchart LR
U["User Prompt"] --> API["App/UI Request Layer"]
API --> ORCH["CPU Orchestrator"]
ORCH --> TOK[Tokenizer]
TOK --> FMT["Prompt Formatter"]
FMT --> SCH["Decode Scheduler / Batcher"]
SCH -->|"prefill/decode jobs"| ENG["Engine Dispatcher"]
ENG -->|backend=CPU| CPUPATH["CPU Kernels"]
ENG -->|backend=Metal| METAL["Metal Kernels"]
METAL --> MATMUL["QKV / MLP MatMul"]
METAL --> ATTN["Attention + GroupKV Cache"]
METAL --> NORM["RMSNorm / RoPE / Logits"]
MATMUL --> SAMPLER
ATTN --> SAMPLER
NORM --> SAMPLER
CPUPATH --> SAMPLER["Sampler + Stop Rules"]
SAMPLER --> DETOK[Detokenizer]
DETOK --> STREAM["Streaming Output"]
STREAM --> API
API --> U
subgraph ModelAssets["Local Model Assets"]
W[".cellm / .cellmd mmap Weights"]
T["tokenizer.json + config"]
end
W --> ENG
T --> TOK
subgraph SessionState["Per-Session State"]
KV["KV Cache (GroupKV layout)"]
PT["Page Table / Sequence Cursor"]
TH["Thermal + QoS Policy"]
end
SCH --> KV
SCH --> PT
ORCH --> TH
TH --> SCH
```
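The Decode Scheduler / Batcher node above is where multi-session concurrency lives: ready sequences take turns decoding one token per scheduling step. A minimal, self-contained sketch of that round-robin interleaving, with illustrative types (not cellm's actual API):

```rust
use std::collections::VecDeque;

/// Illustrative per-session decode state; field names are assumptions,
/// not cellm's actual types.
struct Session {
    id: u32,
    tokens_left: u32, // generation budget remaining
}

/// Round-robin interleaved decoding: each step pops the next ready
/// session, decodes exactly one token for it, and re-queues it if it
/// still has budget, so no session can starve the others.
fn run_round_robin(mut queue: VecDeque<Session>) {
    while let Some(mut s) = queue.pop_front() {
        decode_one_token(&mut s); // one decode step for this session
        s.tokens_left -= 1;
        if s.tokens_left > 0 {
            queue.push_back(s); // rotate to the back of the line
        }
    }
}

fn decode_one_token(s: &mut Session) {
    // Placeholder for a real forward pass + sampling step.
    println!("session {} decoded a token", s.id);
}

fn main() {
    let sessions = VecDeque::from(vec![
        Session { id: 0, tokens_left: 2 },
        Session { id: 1, tokens_left: 3 },
    ]);
    run_round_robin(sessions);
}
```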
How cellm compares to other on-device runtimes:

| Feature | llama.cpp | MLX | ExecuTorch | cellm |
|---|---|---|---|---|
| Language | C++ | C++/Python | C++ | Rust |
| KV Cache | Contiguous | Contiguous | Contiguous | Paged (Block-based) |
| Focus | Portability | Apple Native | Model Export | Mobile Multi-session |
| Scheduling | Static Batch | Mostly Single | N/A | Round-Robin Interleaved |
| Memory | Manual/Static | Managed Buffer | Static Graph | Dynamic Block Allocator |
Repository layout:

```text
cellm/
├── crates/
│ ├── cellm-core/ # Memory arena, tensor layout, op dispatch
│ ├── cellm-model/ # Model format, configuration, weight management
│ ├── cellm-cache/ # Paged KV cache: BlockAllocator, PageTable, physical storage
│ ├── cellm-kernels/ # CPU & Metal compute kernels (SIMD, Accelerate, Metal shaders)
│ ├── cellm-scheduler/ # Decode scheduler & batching logic
│ └── cellm-sdk/ # Public C FFI + high-level API for mobile consumers
├── bindings/
│ ├── ios/CellmDemo/ # SwiftUI demo app (LLM + VLM stub)
│ ├── kotlin/ # Android Kotlin/JNI bindings
│ └── swift/ # Swift Package + XCFramework build scripts
├── tools/
│ ├── infer/ # CLI inference runner (debug & validation)
│ ├── vlm-onnx-infer/ # VLM runner for SmolVLM ONNX exports
│ ├── vlm-smoke/ # SDK FFI VLM smoke test
│ ├── convert/ # HF Safetensors/GGUF/PyTorch -> .cellm converter
│ ├── bench/ # Latency & throughput benchmark harness
│ └── metal-smoke/ # Minimal Metal kernel compile + dispatch test
├── docs/ # Architecture deep-dives, benchmarks, model guides
└── models/                # Sample .cellm checkpoints (Git LFS)
```
Convert HuggingFace Safetensors or GGUF to .cellm:
```bash
cargo run --bin convert -- \
  --input ./models/hf/smollm2-135m \
  --output ./models/smollm2-135m.cellm \
  --dtype f16
```

Quantize during conversion:

```bash
cargo run --bin convert -- \
  --input ./models/hf/smollm2-135m \
  --output ./models/smollm2-135m-int8.cellm \
  --dtype f16 \
  --quantize-int8-symmetric
```

See docs/convert-quantized-models.md for GGUF, PyTorch, and 4-bit affine workflows.
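For intuition, symmetric int8 quantization maps each weight block to signed bytes with a single scale and no zero-point: `scale = max|w| / 127`, `q = round(w / scale)`. A minimal sketch of the math; cellm's actual block layout and converter internals may differ:

```rust
/// Symmetric int8 quantization of one block of f32 weights:
/// one scale, no zero-point, so dequantization is just q * scale.
fn quantize_int8_symmetric(weights: &[f32]) -> (Vec<i8>, f32) {
    let max_abs = weights.iter().fold(0.0f32, |m, w| m.max(w.abs()));
    let scale = if max_abs > 0.0 { max_abs / 127.0 } else { 1.0 };
    let q = weights
        .iter()
        .map(|&w| (w / scale).round().clamp(-127.0, 127.0) as i8)
        .collect();
    (q, scale)
}

fn dequantize(q: &[i8], scale: f32) -> Vec<f32> {
    q.iter().map(|&v| v as f32 * scale).collect()
}

fn main() {
    let w = [0.9f32, -0.45, 0.01, -0.9];
    let (q, scale) = quantize_int8_symmetric(&w);
    println!("q = {q:?}, scale = {scale}");
    println!("dequant = {:?}", dequantize(&q, scale));
}
```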
```bash
# Quick smoke benchmark
cargo run --release --bin bench -- --model tiny

# Full LLM backend matrix (CPU vs Metal)
tools/bench/run_llm_backend_matrix.sh
```

Detailed benchmark reports live in docs/benchmarks/.
```bash
# ONNX vision + ONNX decoder (recommended)
cargo build --release -p cellm-vlm-onnx-infer
./target/release/vlm-infer \
  --model-dir models/hf/smolvlm-256m-instruct \
  --onnx-variant fp16 \
  --image models/test_images/rococo.jpg \
  --prompt "Describe this image." \
  --split-image \
  --max-new-tokens 96
```

Native .cellm vision + decoder is experimental:
```bash
./target/release/vlm-infer \
  --model-dir models/hf/smolvlm-256m-instruct \
  --cellm-model models/smolvlm-256m.cellm \
  --vision-backend cellm \
  --decoder-backend cellm \
  --image models/test_images/rococo.jpg \
  --prompt "Describe this image." \
  --max-new-tokens 12
```

See docs/vlm-smolvlm-onnx.md for full VLM docs.
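Conceptually, the VLM path encodes image patches with a vision tower (ViT/SigLIP, per the feature list below) and linearly projects those embeddings into the decoder's hidden space before splicing them into the token stream. A toy sketch of the projector step, with made-up dimensions and no learned weights:

```rust
/// Toy linear projector: maps a vision embedding (dim = len of
/// vision_emb) into the decoder's hidden space (dim = rows of w).
/// Real models use learned weights; here w is a placeholder matrix.
fn project(vision_emb: &[f32], w: &[Vec<f32>]) -> Vec<f32> {
    w.iter()
        .map(|row| row.iter().zip(vision_emb).map(|(a, b)| a * b).sum::<f32>())
        .collect()
}

fn main() {
    let emb = vec![0.5f32; 4];        // vision embedding, dim 4
    let w = vec![vec![0.1f32; 4]; 3]; // projection matrix, 3 x 4
    let projected = project(&emb, &w); // decoder-space embedding, dim 3
    println!("{projected:?}");
}
```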
Build the XCFramework:
```bash
./scripts/build_xcframework.sh
```

Then open bindings/ios/CellmDemo in Xcode.
| Model | Size | Best For | Notes |
|---|---|---|---|
| SmolLM2 | 135M-360M | Fast smoke tests, small devices | Best LLM starter model |
| LFM2.5 | 350M | Long-context, efficient inference | Linear attention, up to 256K context |
| Qwen2.5 / Qwen3.0 / Qwen3.5 | 0.5B-0.8B | Multilingual, reasoning | DeltaNet layers supported (CPU ref) |
| Gemma-3 | 1B | Quality vs size tradeoff | Metal path active, CPU-safe fallback |
| Bonsai | 1.7B | High-quality local chat | 1-bit quantized; see docs/bonsai_1bit_analysis.md |
| Gemma-4 | 2B-4B | Larger mobile workloads | Experimental; see docs/gemma4_* |
| SmolVLM | 256M | Vision-language (ONNX) | Native .cellm VLM path in progress |
| FunctionGemma | 270M | Mobile actions / tool use | Experimental quality |
Recommended first download: SmolLM2-135M
Sample checkpoints bundled in this repo (via Git LFS):
- `models/smollm2-135m-int8.cellm`
- `models/smolvlm-256m-int8.cellm`
- `models/qwen3.5-0.8b-int4-textonly.cellm`
- Paged KV Cache - Fixed-size block allocation with `BlockAllocator` & `PageTable` (see the sketch after this list)
- Multi-session Scheduler - Round-robin interleaved decoding
- 4-bit Affine Dequantization - Native MLX/HF packed weight support
- Multimodal Vision - Native ViT/SigLIP encoder + linear projector
- Accelerated Math - Metal compute kernels + SIMD CPU fallbacks
- High-Performance CLI - Conversion, benchmarking, debug inference
- Vulkan Support - Cross-platform compute kernels (research)
- Android Integration - Kotlin/JNI bindings & tuning (coming soon)
- Qwen iOS Porting - Optimize Qwen inference for native iOS
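As noted in the feature list above, the paged KV cache replaces one contiguous buffer per sequence with fixed-size blocks from a shared pool, tracked per sequence by a page table. A minimal sketch of the idea; the real `BlockAllocator` and `PageTable` in cellm-cache carry more state (physical storage, block reuse, eviction):

```rust
use std::collections::HashMap;

const BLOCK_TOKENS: usize = 16; // tokens per KV block (illustrative size)

/// Hands out fixed-size physical block IDs from a free list.
struct BlockAllocator {
    free: Vec<u32>,
}

impl BlockAllocator {
    fn new(total_blocks: u32) -> Self {
        Self { free: (0..total_blocks).rev().collect() }
    }
    fn alloc(&mut self) -> Option<u32> {
        self.free.pop()
    }
    fn release(&mut self, block: u32) {
        self.free.push(block);
    }
}

/// Maps each sequence's logical token positions to physical blocks.
#[derive(Default)]
struct PageTable {
    blocks_by_seq: HashMap<u64, Vec<u32>>,
}

impl PageTable {
    /// Ensure the sequence has room for one more token; a fresh block is
    /// allocated only when the length crosses a BLOCK_TOKENS boundary.
    fn append_token(&mut self, seq: u64, len: usize, alloc: &mut BlockAllocator) -> Option<u32> {
        let blocks = self.blocks_by_seq.entry(seq).or_default();
        if len % BLOCK_TOKENS == 0 {
            blocks.push(alloc.alloc()?); // pool exhausted -> caller must evict
        }
        blocks.last().copied()
    }
    /// Free every block a finished sequence held.
    fn drop_seq(&mut self, seq: u64, alloc: &mut BlockAllocator) {
        for b in self.blocks_by_seq.remove(&seq).unwrap_or_default() {
            alloc.release(b);
        }
    }
}

fn main() {
    let mut alloc = BlockAllocator::new(8);
    let mut table = PageTable::default();
    for pos in 0..20 {
        let block = table.append_token(42, pos, &mut alloc).expect("OOM");
        println!("token {pos} -> block {block}");
    }
    table.drop_seq(42, &mut alloc);
}
```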
| Topic | Doc |
|---|---|
| Architecture & crate design | docs/project_architecture.md |
| Paged KV cache internals | docs/paged-kv-cache-foundation.md |
| Scheduler & continuous batching | docs/phase4-continuous-batching.md |
| Model conversion & quantization | docs/convert-quantized-models.md |
| TurboQuant KV compression | docs/turboquant_dataflow.md |
| VLM / SmolVLM ONNX guide | docs/vlm-smolvlm-onnx.md |
| VLM sequence tracking | docs/cellm-vlm-sequence.md |
| Qwen3.5 / DeltaNet | docs/qwen3_5-deltanet.md |
| Metal acceleration notes | docs/LFM_Metal_Acceleration.md |
| Benchmark history & raw runs | docs/benchmarks/ |
| Data flow diagrams | docs/data_flow.md |
| Format specification | docs/format.md |
| Inference graph | docs/inference_graph.md |
```bash
# 1. Verify Metal device access
cargo run --release --bin metal-smoke

# 2. Verify infer picks Metal
./target/release/infer \
  --model models/smollm2-135m-int8.cellm \
  --tokenizer models/hf/smollm2-135m/tokenizer.json \
  --prompt "hello" --gen 8 --backend metal
```

In restricted/sandboxed shells, Metal device discovery can fail. `infer --backend metal` now errors instead of silently falling back to CPU.
Force the non-interleaved RoPE layout (needed for SmolLM2 360M; see the flag table below):

```bash
CELLM_LLAMA_ROPE_INTERLEAVED=0 ./target/release/infer ...
```

The default keeps norm/RoPE/logits on the CPU-safe path for quality parity. Opt-in Metal paths:

```bash
CELLM_GEMMA_USE_METAL_NORM=1    # enable Metal RMSNorm
CELLM_GEMMA_USE_METAL_ROPE=1    # enable Metal RoPE
CELLM_GEMMA_USE_METAL_LOGITS=1  # enable Metal final logits matvec
```

Enable the graph execution path:

```bash
CELLM_LLAMA_ENABLE_GRAPH=1 ./target/release/infer ...
```

| Model | Flag | Purpose |
|---|---|---|
| SmolLM2 360M | `CELLM_LLAMA_ROPE_INTERLEAVED=0` | Correct RoPE layout |
| Llama | `CELLM_LLAMA_USE_METAL_NORM=1` | Force Metal norm |
| Llama | `CELLM_LLAMA_USE_METAL_ROPE=1` | Force Metal RoPE |
| Qwen VLM | `CELLM_VLM_TOKENIZER=...` | Set tokenizer path for vlm-smoke |
For more debug flags and backend-specific notes, see the per-model docs in docs/.
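The pattern behind these flags is plain env-gated dispatch: default to the CPU-safe kernel and route to Metal only when the variable opts in. A sketch of what such gating can look like; the function bodies are illustrative stand-ins, not cellm's internals:

```rust
use std::env;

/// True when the given env flag is set to "1" (the opt-in convention
/// used by the CELLM_*_USE_METAL_* flags above).
fn flag_enabled(name: &str) -> bool {
    env::var(name).map(|v| v == "1").unwrap_or(false)
}

/// Illustrative dispatch: RMSNorm stays on CPU unless explicitly
/// opted into the Metal kernel.
fn rmsnorm(x: &mut [f32]) {
    if flag_enabled("CELLM_GEMMA_USE_METAL_NORM") {
        rmsnorm_metal(x);
    } else {
        rmsnorm_cpu(x);
    }
}

fn rmsnorm_cpu(x: &mut [f32]) {
    // Standard RMSNorm: scale by 1 / sqrt(mean(x^2) + eps).
    let mean_sq = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
    let inv = 1.0 / (mean_sq + 1e-6).sqrt();
    x.iter_mut().for_each(|v| *v *= inv);
}

fn rmsnorm_metal(_x: &mut [f32]) {
    unimplemented!("stand-in for the Metal kernel dispatch");
}

fn main() {
    let mut x = [1.0f32, 2.0, 3.0];
    rmsnorm(&mut x);
    println!("{x:?}");
}
```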
Licensed under either of:
- MIT license (LICENSE-MIT)
- Apache License, Version 2.0 (LICENSE-APACHE)
at your option.
