
cellm — Mobile-Native LLM Serving Engine

A ground-up LLM inference engine for iOS and Android, written in Rust. Brings server-grade serving concepts — paged KV cache, continuous decode scheduling, multi-session concurrency — to phones running in under 512 MB of RAM.

Not a wrapper around llama.cpp. Not a port of vLLM. A new runtime designed for mobile constraints from scratch.

Note

This is just a research project—don't get mad at me lol!

Inference Demo

Quick Links

| Resource | Path |
| --- | --- |
| Getting Started | Quick Start below |
| Architecture & Design | docs/project_architecture.md |
| Paged KV Cache Deep Dive | docs/paged-kv-cache-foundation.md |
| Benchmarks | docs/benchmarks/README.md |
| Model Conversion | docs/convert-quantized-models.md |
| VLM (Vision) Guide | docs/vlm-smolvlm-onnx.md |
| iOS Demo App | bindings/ios/CellmDemo |
| Android Bindings | bindings/kotlin |

Quick Start

Prerequisites

  • Rust 1.75+ (modern stable toolchain)
  • macOS / iOS for Metal acceleration (Linux/Android builds use the CPU path)
  • Git LFS (for bundled sample models)

1. Build

cargo build --release

2. Run a smoke test (CPU)

cargo run --release --bin infer -- \
  --model models/smollm2-135m.cellm \
  --tokenizer models/hf/smollm2-135m/tokenizer.json \
  --prompt "Hello, how are you?" \
  --chat \
  --gen 32

3. Run with Metal (macOS/iOS)

cargo run --release --bin infer -- \
  --model models/smollm2-135m-int8.cellm \
  --tokenizer models/hf/smollm2-135m/tokenizer.json \
  --prompt "Hello" \
  --chat \
  --gen 16 \
  --backend metal

Tip: Use --chat for ChatML-style formatting. Without it, many base models behave like text-completion engines and may not answer directly.
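
For reference, ChatML-style formatting wraps each turn in <|im_start|>/<|im_end|> markers before tokenization. A minimal sketch of that wrapping in Rust, assuming the standard ChatML template (the exact template --chat applies may differ):

```rust
/// Wrap a user prompt in ChatML-style turn markers.
/// (Standard ChatML tokens; the actual template and system prompt used by
/// --chat may differ.)
fn format_chatml(user_prompt: &str) -> String {
    let mut s = String::new();
    s.push_str("<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n");
    s.push_str(&format!("<|im_start|>user\n{user_prompt}<|im_end|>\n"));
    s.push_str("<|im_start|>assistant\n");
    s
}

fn main() {
    // The formatted string is what actually gets tokenized and prefilled.
    println!("{}", format_chatml("Hello, how are you?"));
}
```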

4. Metal verification

cargo run --release --bin metal-smoke

Architecture Overview

flowchart LR
    U["User Prompt"] --> API["App/UI Request Layer"]
    API --> ORCH["CPU Orchestrator"]
    ORCH --> TOK[Tokenizer]
    TOK --> FMT["Prompt Formatter"]
    FMT --> SCH["Decode Scheduler / Batcher"]

    SCH -->|"prefill/decode jobs"| ENG["Engine Dispatcher"]
    ENG -->|backend=CPU| CPUPATH["CPU Kernels"]
    ENG -->|backend=Metal| METAL["Metal Kernels"]

    METAL --> MATMUL["QKV / MLP MatMul"]
    METAL --> ATTN["Attention + GroupKV Cache"]
    METAL --> NORM["RMSNorm / RoPE / Logits"]
    MATMUL --> SAMPLER
    ATTN --> SAMPLER
    NORM --> SAMPLER
    CPUPATH --> SAMPLER["Sampler + Stop Rules"]

    SAMPLER --> DETOK[Detokenizer]
    DETOK --> STREAM["Streaming Output"]
    STREAM --> API
    API --> U

    subgraph ModelAssets["Local Model Assets"]
      W[".cellm / .cellmd mmap Weights"]
      T["tokenizer.json + config"]
    end

    W --> ENG
    T --> TOK

    subgraph SessionState["Per-Session State"]
      KV["KV Cache (GroupKV layout)"]
      PT["Page Table / Sequence Cursor"]
      TH["Thermal + QoS Policy"]
    end

    SCH --> KV
    SCH --> PT
    ORCH --> TH
    TH --> SCH
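
The diagram maps onto a conventional prefill-then-decode loop on the orchestrator side. A minimal sketch of that loop, using hypothetical trait names as stand-ins for illustration (not the actual cellm API):

```rust
// Hypothetical traits mirroring the boxes in the flowchart; illustration only.
trait Tokenizer {
    fn encode(&self, text: &str) -> Vec<u32>;
    fn decode_token(&self, id: u32) -> String;
}

trait Engine {
    /// Run the prompt through the model, filling the KV cache; return logits.
    fn prefill(&mut self, ids: &[u32]) -> Vec<f32>;
    /// Run one decode step for a single new token; return next-step logits.
    fn decode_step(&mut self, id: u32) -> Vec<f32>;
}

/// One request, end to end: tokenize -> prefill -> decode loop -> stream.
fn run_request<T: Tokenizer, E: Engine>(
    tok: &T,
    eng: &mut E,
    prompt: &str,
    eos_id: u32,
    max_new_tokens: usize,
    mut on_token: impl FnMut(&str),
) {
    let mut logits = eng.prefill(&tok.encode(prompt));
    for _ in 0..max_new_tokens {
        // Greedy pick for brevity; the real sampler applies temperature,
        // top-k/top-p, and stop rules.
        let next = logits
            .iter()
            .enumerate()
            .max_by(|a, b| a.1.total_cmp(b.1))
            .map(|(i, _)| i as u32)
            .unwrap_or(eos_id);
        if next == eos_id {
            break;
        }
        on_token(&tok.decode_token(next)); // stream tokens incrementally
        logits = eng.decode_step(next);
    }
}
```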

What Makes cellm Different?

| Feature | llama.cpp | MLX | ExecuTorch | cellm |
| --- | --- | --- | --- | --- |
| Language | C++ | C++/Python | C++ | Rust |
| KV Cache | Contiguous | Contiguous | Contiguous | Paged (Block-based) |
| Focus | Portability | Apple Native | Model Export | Mobile Multi-session |
| Scheduling | Static Batch | Mostly Single | N/A | Round-Robin Interleaved |
| Memory | Manual/Static | Managed Buffer | Static Graph | Dynamic Block Allocator |
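
The paged KV cache is the main structural difference: rather than reserving one contiguous KV buffer per sequence sized for the maximum context, storage is split into fixed-size blocks handed out as a sequence grows. A minimal sketch of the idea, with simplified, hypothetical versions of the BlockAllocator and PageTable types named in cellm-cache (the real implementations differ in detail):

```rust
/// Illustration of block-based KV allocation; the actual cellm-cache
/// BlockAllocator / PageTable are more involved.
const BLOCK_TOKENS: usize = 16; // tokens stored per KV block

struct BlockAllocator {
    free: Vec<usize>, // indices of free physical blocks
}

impl BlockAllocator {
    fn new(num_blocks: usize) -> Self {
        Self { free: (0..num_blocks).rev().collect() }
    }
    fn alloc(&mut self) -> Option<usize> {
        self.free.pop()
    }
    fn release(&mut self, block: usize) {
        self.free.push(block);
    }
}

/// Per-sequence mapping from logical block index to physical block index.
#[derive(Default)]
struct PageTable {
    blocks: Vec<usize>,
    len_tokens: usize,
}

impl PageTable {
    /// Reserve room for one more token, grabbing a new block when the
    /// current one fills up. Returns false if the pool is exhausted.
    fn append_token(&mut self, alloc: &mut BlockAllocator) -> bool {
        if self.len_tokens == self.blocks.len() * BLOCK_TOKENS {
            match alloc.alloc() {
                Some(b) => self.blocks.push(b),
                None => return false, // caller must evict or preempt a session
            }
        }
        self.len_tokens += 1;
        true
    }
}
```

Because blocks are allocated on demand, memory scales with the tokens actually generated rather than the context limit, which is what makes several concurrent sessions practical inside a tight RAM budget.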

Project Structure

cellm/
├── crates/
│   ├── cellm-core/          # Memory arena, tensor layout, op dispatch
│   ├── cellm-model/         # Model format, configuration, weight management
│   ├── cellm-cache/         # Paged KV cache: BlockAllocator, PageTable, physical storage
│   ├── cellm-kernels/       # CPU & Metal compute kernels (SIMD, Accelerate, Metal shaders)
│   ├── cellm-scheduler/     # Decode scheduler & batching logic
│   └── cellm-sdk/           # Public C FFI + high-level API for mobile consumers
├── bindings/
│   ├── ios/CellmDemo/       # SwiftUI demo app (LLM + VLM stub)
│   ├── kotlin/              # Android Kotlin/JNI bindings
│   └── swift/               # Swift Package + XCFramework build scripts
├── tools/
│   ├── infer/               # CLI inference runner (debug & validation)
│   ├── vlm-onnx-infer/      # VLM runner for SmolVLM ONNX exports
│   ├── vlm-smoke/           # SDK FFI VLM smoke test
│   ├── convert/             # HF Safetensors/GGUF/PyTorch -> .cellm converter
│   ├── bench/               # Latency & throughput benchmark harness
│   └── metal-smoke/         # Minimal Metal kernel compile + dispatch test
├── docs/                    # Architecture deep-dives, benchmarks, model guides
└── models/                  # Sample .cellm checkpoints (Git LFS)

Development Commands

Convert a Model

Convert HuggingFace Safetensors or GGUF to .cellm:

cargo run --bin convert -- \
  --input  ./models/hf/smollm2-135m \
  --output ./models/smollm2-135m.cellm \
  --dtype  f16

Quantize during conversion:

cargo run --bin convert -- \
  --input  ./models/hf/smollm2-135m \
  --output ./models/smollm2-135m-int8.cellm \
  --dtype  f16 \
  --quantize-int8-symmetric

See docs/convert-quantized-models.md for GGUF, PyTorch, and 4-bit affine workflows.
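
As background, symmetric int8 quantization stores each weight tensor (or group) as int8 values plus a single scale and no zero point: scale = max|w| / 127, q = round(w / scale), and dequantization is w ≈ q * scale. A minimal illustrative sketch of that math (the converter's actual grouping, per-channel scales, and rounding may differ):

```rust
/// Illustrative per-tensor symmetric int8 quantization.
/// The real converter may quantize per-channel or per-group.
fn quantize_int8_symmetric(weights: &[f32]) -> (Vec<i8>, f32) {
    let max_abs = weights.iter().fold(0f32, |m, &w| m.max(w.abs()));
    let scale = if max_abs == 0.0 { 1.0 } else { max_abs / 127.0 };
    let q = weights
        .iter()
        .map(|&w| (w / scale).round().clamp(-127.0, 127.0) as i8)
        .collect();
    (q, scale)
}

/// Dequantize back to f32: w ≈ q * scale.
fn dequantize(q: &[i8], scale: f32) -> Vec<f32> {
    q.iter().map(|&v| v as f32 * scale).collect()
}
```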

Run Benchmarks

# Quick smoke benchmark
cargo run --release --bin bench -- --model tiny

# Full LLM backend matrix (CPU vs Metal)
tools/bench/run_llm_backend_matrix.sh

Detailed benchmark reports live in docs/benchmarks/.

Run VLM (Vision-Language)

# ONNX vision + ONNX decoder (recommended)
cargo build --release -p cellm-vlm-onnx-infer

./target/release/vlm-infer \
  --model-dir models/hf/smolvlm-256m-instruct \
  --onnx-variant fp16 \
  --image models/test_images/rococo.jpg \
  --prompt "Describe this image." \
  --split-image \
  --max-new-tokens 96

Native .cellm vision + decoder is experimental:

./target/release/vlm-infer \
  --model-dir models/hf/smolvlm-256m-instruct \
  --cellm-model models/smolvlm-256m.cellm \
  --vision-backend cellm \
  --decoder-backend cellm \
  --image models/test_images/rococo.jpg \
  --prompt "Describe this image." \
  --max-new-tokens 12

See docs/vlm-smolvlm-onnx.md for full VLM docs.

iOS SwiftUI Demo

Build the XCFramework:

./scripts/build_xcframework.sh

Then open bindings/ios/CellmDemo in Xcode.


Supported Models

| Model | Size | Best For | Notes |
| --- | --- | --- | --- |
| SmolLM2 | 135M-360M | Fast smoke tests, small devices | Best LLM starter model |
| LFM2.5 | 350M | Long-context, efficient inference | Linear attention, up to 256K context |
| Qwen2.5 / Qwen3.0 / Qwen3.5 | 0.5B-0.8B | Multilingual, reasoning | DeltaNet layers supported (CPU ref) |
| Gemma-3 | 1B | Quality vs size tradeoff | Metal path active, CPU-safe fallback |
| Bonsai | 1.7B | High-quality local chat | 1-bit quantized; see docs/bonsai_1bit_analysis.md |
| Gemma-4 | 2B-4B | Larger mobile workloads | Experimental; see docs/gemma4_* |
| SmolVLM | 256M | Vision-language (ONNX) | Native .cellm VLM path in progress |
| FunctionGemma | 270M | Mobile actions / tool use | Experimental quality |

Recommended first download: SmolLM2-135M

Sample checkpoints bundled in this repo (via Git LFS):

  • models/smollm2-135m-int8.cellm
  • models/smolvlm-256m-int8.cellm
  • models/qwen3.5-0.8b-int4-textonly.cellm

Feature Status

  • Paged KV Cache - Fixed-size block allocation with BlockAllocator & PageTable
  • Multi-session Scheduler - Round-robin interleaved decoding (see the sketch after this list)
  • 4-bit Affine Dequantization - Native MLX/HF packed weight support
  • Multimodal Vision - Native ViT/SigLIP encoder + linear projector
  • Accelerated Math - Metal compute kernels + SIMD CPU fallbacks
  • High-Performance CLI - Conversion, benchmarking, debug inference
  • Vulkan Support - Cross-platform compute kernels (research)
  • Android Integration - Kotlin/JNI bindings & tuning (coming soon)
  • Qwen iOS Porting - Optimize Qwen inference for native iOS
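
The round-robin interleaving noted above is conceptually simple: active sessions take turns decoding one token each, so one long chat cannot starve the others. A minimal sketch with hypothetical types (the real cellm-scheduler also handles prefill batching, preemption, and thermal/QoS policy):

```rust
use std::collections::VecDeque;

/// Hypothetical per-session decode state; stands in for the real
/// cellm-scheduler session handle.
struct Session {
    id: u64,
    remaining_tokens: usize,
}

/// Round-robin interleaved decoding: each active session gets one decode
/// step per scheduler turn, then goes to the back of the queue.
fn run_round_robin(mut queue: VecDeque<Session>, mut decode_one: impl FnMut(u64)) {
    while let Some(mut session) = queue.pop_front() {
        decode_one(session.id); // one token for this session
        session.remaining_tokens -= 1;
        if session.remaining_tokens > 0 {
            queue.push_back(session); // not finished: take another turn later
        } // finished sessions drop out (and would free their KV blocks)
    }
}

fn main() {
    let sessions = VecDeque::from(vec![
        Session { id: 1, remaining_tokens: 3 },
        Session { id: 2, remaining_tokens: 2 },
    ]);
    // Decodes tokens interleaved across sessions: 1, 2, 1, 2, 1.
    run_round_robin(sessions, |id| println!("decode one token for session {id}"));
}
```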

Documentation Index

| Topic | Doc |
| --- | --- |
| Architecture & crate design | docs/project_architecture.md |
| Paged KV cache internals | docs/paged-kv-cache-foundation.md |
| Scheduler & continuous batching | docs/phase4-continuous-batching.md |
| Model conversion & quantization | docs/convert-quantized-models.md |
| TurboQuant KV compression | docs/turboquant_dataflow.md |
| VLM / SmolVLM ONNX guide | docs/vlm-smolvlm-onnx.md |
| VLM sequence tracking | docs/cellm-vlm-sequence.md |
| Qwen3.5 / DeltaNet | docs/qwen3_5-deltanet.md |
| Metal acceleration notes | docs/LFM_Metal_Acceleration.md |
| Benchmark history & raw runs | docs/benchmarks/ |
| Data flow diagrams | docs/data_flow.md |
| Format specification | docs/format.md |
| Inference graph | docs/inference_graph.md |

Troubleshooting

Metal is not being used

# 1. Verify Metal device access
cargo run --release --bin metal-smoke

# 2. Verify infer picks Metal
./target/release/infer \
  --model models/smollm2-135m-int8.cellm \
  --tokenizer models/hf/smollm2-135m/tokenizer.json \
  --prompt "hello" --gen 8 --backend metal

In restricted/sandboxed shells, Metal device discovery can fail. infer --backend metal now errors instead of silently falling back to CPU.

SmolLM2 360M needs non-interleaved RoPE

CELLM_LLAMA_ROPE_INTERLEAVED=0 ./target/release/infer ...

Gemma-3 Metal quality knobs

By default, norm/RoPE/logits stay on the CPU-safe path for quality parity. Opt in to the Metal paths with:

CELLM_GEMMA_USE_METAL_NORM=1   # enable Metal RMSNorm
CELLM_GEMMA_USE_METAL_ROPE=1   # enable Metal RoPE
CELLM_GEMMA_USE_METAL_LOGITS=1 # enable Metal final logits matvec

Llama graph path (experimental speed)

CELLM_LLAMA_ENABLE_GRAPH=1 ./target/release/infer ...

Model-specific env flags

| Model | Flag | Purpose |
| --- | --- | --- |
| SmolLM2 360M | CELLM_LLAMA_ROPE_INTERLEAVED=0 | Correct RoPE layout |
| Llama | CELLM_LLAMA_USE_METAL_NORM=1 | Force Metal norm |
| Llama | CELLM_LLAMA_USE_METAL_ROPE=1 | Force Metal RoPE |
| Qwen VLM | CELLM_VLM_TOKENIZER=... | Set tokenizer path for vlm-smoke |

For more debug flags and backend-specific notes, see the per-model docs in docs/.


License

Licensed under either of:

  • MIT license (LICENSE-MIT)
  • Apache License, Version 2.0 (LICENSE-APACHE)

at your option.
