LLM RAM calculator. Type a model name — anything from meta-llama/Llama-3.1-70B
to mlx-community/Qwen3.6-27B-AEON-Ultimate-Uncensored-BF16-mlx-fp16 — and get
a realistic estimate of the VRAM or unified memory needed to run it for
inference.
Live at skuld.zoleb.com.
The headline output — "fits on 1× A100 80 GB", "needs M3 Ultra 192 GB" — is the screenshot people share, so the math has to be right for the cases people care about most: long-context Llama 3.1, MoE Mixtral, and especially MLA DeepSeek (where naive attention math overstates KV cache by ~50×).
Most "LLM memory calculators" use a single formula:
total ≈ params × bytes_per_param × 1.2
That's wrong for any modern model:
| Model class | Why naive math fails |
|---|---|
| GQA (Llama 3+, Qwen, Mistral) | KV cache scales with kv_heads / num_heads, not 1 |
| MLA (DeepSeek V2/V3/V3.1/V3.2/R1) | KV is a compressed latent (~576 elements / token / layer), not 2 × hidden_size. At 32k context, the difference is ~2 GiB vs ~107 GiB. |
| MoE (Mixtral, DeepSeek, Qwen3-MoE, Qwen 3.6, Llama 4, GPT-OSS, Gemma 4) | Total weights vs active-per-token are different numbers; both matter |
| Hybrid linear/full attention (Qwen 3.5 / 3.6) | Only ~1 in 4 layers is full attention. Linear-attention layers have constant SSM state, not O(ctx) KV cache. Naive math overstates KV by 4×. |
| Sliding-window attention (Gemma 2 / 3 / 4) | Most layers cap KV at the window size (e.g. 1024 tokens), so KV stops scaling with context past the window. |
| Long context | KV cache dominates over weights once context exceeds the model's natural break point — for an 8B model this happens around 256k tokens |
skuld branches on attention type and MoE topology in the math module instead of papering over them.
Two pieces, both designed to run with no build step:
- Static front-end at index.html: vanilla HTML + ES modules, vendored Inter font and zoleb.com style tokens. No framework, no bundler. Hosted on Cloudflare Pages.
- Cloudflare Worker at worker/hf-lookup.js, with two endpoints:
  - GET /api/hf-lookup?id=... is the fallback when a model isn't in the hardcoded table. It fetches the HF config.json, normalizes architecture fields, and caches the result in Workers KV (24h TTL). It uses an HF_TOKEN secret to bypass the unauthenticated rate limit.
  - POST /api/log-search is a fire-and-forget search beacon. The body is { query, hit }, where hit is "table" | "hf-fallback" | "miss". It writes one event to Workers Analytics Engine (dataset: skuld_searches).
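The "normalizes architecture fields" step could look roughly like the sketch below. The input keys (num_hidden_layers, num_attention_heads, num_key_value_heads, max_position_embeddings, kv_lora_rank, qk_rope_head_dim) are standard Hugging Face config.json fields. The function name normalizeHfConfig and the output shape are assumptions modeled on the models.js entry format, not the Worker's actual code.

```javascript
// Hypothetical helper: maps a Hugging Face config.json object onto the
// field names used by the hardcoded model table.
function normalizeHfConfig(cfg) {
  const numHeads = cfg.num_attention_heads;
  // Configs that omit num_key_value_heads are treated as plain multi-head attention.
  const kvHeads = cfg.num_key_value_heads ?? numHeads;
  const isMla = typeof cfg.kv_lora_rank === "number";

  let attention = "GQA";
  if (isMla) attention = "MLA";
  else if (kvHeads === numHeads) attention = "MHA";
  else if (kvHeads === 1) attention = "MQA";

  return {
    num_layers: cfg.num_hidden_layers,
    hidden_size: cfg.hidden_size,
    num_heads: numHeads,
    kv_heads: kvHeads,
    attention,
    mla: isMla
      ? { kv_lora_rank: cfg.kv_lora_rank, qk_rope_head_dim: cfg.qk_rope_head_dim }
      : null,
    max_context: cfg.max_position_embeddings,
  };
}
```

Keeping this as a pure function means the KV caching and the HF fetch can wrap it without affecting how it is tested.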
┌────────────────────────────────────────────────────────────────────────┐
│ skuld.zoleb.com                                                        │
│  ┌────────────┐     /api/hf-lookup      ┌───────────────────────┐      │
│  │ index.html │  ────────────────────►  │ worker/hf-lookup.js   │      │
│  │ models.js  │                         │           ↓           │      │
│  │ ram.js     │     /api/log-search     │ Workers KV (24h)      │      │
│  │ gpus.js    │  ────────────────────►  │           ↓           │      │
│  └────────────┘     (every search,      │ huggingface.co API    │      │
│        │             debounced 2s)      │           ↓           │      │
│        │                                │ Analytics Engine      │      │
│        │                                │ (skuld_searches)      │      │
│        │                                └───────────────────────┘      │
│        └─── all math is client-side for models in the hardcoded table  │
└────────────────────────────────────────────────────────────────────────┘
Public — a per-host Cloudflare Access "Bypass everyone" app overrides the wildcard *.zoleb.com gate.
Two layers, both Cloudflare-native and zero-cost at this volume:
Page traffic — Cloudflare Web Analytics. Auto-installed at the edge for
the entire zoleb.com zone (lite mode, no script tag in the HTML). View at
the Cloudflare Web Analytics dashboard. Anonymous, no cookies, no GDPR banner.
Search queries — Workers Analytics Engine. Each search the user types
fires POST /api/log-search 2 s after they stop typing (debounced, deduped
per query). The Worker writes one row to the skuld_searches AE dataset:
| Column | Meaning |
|---|---|
| blob1 | normalized query (lowercased, trimmed, max 200 chars) |
| blob2 | hit kind: table / hf-fallback / miss |
| blob3 | country (from cf.country) |
| index1 | hit kind (sampled index for fast filtering) |
| _sample_interval | reciprocal sampling rate (multiply for true counts) |
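As a rough illustration of how one row in this layout might be assembled, the sketch below builds the object that Analytics Engine's writeDataPoint call accepts (blobs and indexes arrays). The helper name buildSearchDataPoint and the choice to coerce unknown hit kinds to "miss" are assumptions, not taken from the Worker source.

```javascript
const HIT_KINDS = ["table", "hf-fallback", "miss"];

// Hypothetical helper: builds the Analytics Engine data point for one search.
function buildSearchDataPoint(query, hit, country) {
  // Normalization order from the table: trim, lowercase, cap at 200 chars.
  const normalized = String(query).trim().toLowerCase().slice(0, 200);
  // Assumption: anything outside the three known kinds is logged as "miss".
  const kind = HIT_KINDS.includes(hit) ? hit : "miss";
  return {
    blobs: [normalized, kind, country ?? ""], // blob1, blob2, blob3
    indexes: [kind],                          // index1
  };
}

// Inside the Worker, the result would be handed to the dataset binding,
// guarded so that requests are dropped quietly while the binding is absent:
//   if (env.SEARCHES) env.SEARCHES.writeDataPoint(buildSearchDataPoint(q, hit, country));
```

_sample_interval is not written by the Worker; Analytics Engine fills it in when sampling applies.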
Query via the Workers Analytics SQL API (account-scoped Analytics token):
curl -s "https://api.cloudflare.com/client/v4/accounts/$CF_ACCOUNT_ID/analytics_engine/sql" \
-H "Authorization: Bearer $CF_ANALYTICS_TOKEN" \
--data "SELECT blob1 AS query, blob2 AS hit, SUM(_sample_interval) AS n
FROM skuld_searches
WHERE timestamp > NOW() - INTERVAL '7' DAY
GROUP BY query, hit
ORDER BY n DESC
LIMIT 50"

One-time setup: AE must be enabled once via the dashboard at
https://dash.cloudflare.com/<account-id>/workers/analytics-engine. Until
then the Worker accepts log-search requests and silently drops them (the
endpoint guards on if (env.SEARCHES)). After enabling, uncomment the
[[analytics_engine_datasets]] block in worker/wrangler.toml and redeploy.
.
├── index.html             single-page UI
├── models.js              hardcoded metadata for ~70 models + parsers
├── ram.js                 pure math: weights + KV + overhead + LoRA
├── gpus.js                hardware fit table (Nvidia + Apple Silicon)
├── ram.test.js            sentinel tests including the MLA correctness guard
├── gpus.test.js           fit-list tests: at least one model per attention type
├── models.test.js         resolution + parseModelName + glossarize tests
├── static/
│   ├── tokens.css         zoleb.com design tokens (themed light/dark)
│   ├── theme-toggle.js    light-mode/dark-mode button handler
│   └── InterVariable.ttf  bundled font (no Google Fonts)
└── worker/
    ├── hf-lookup.js       Cloudflare Worker source
    ├── wrangler.toml      Worker config (route, KV binding, secrets)
    └── package.json       wrangler v3 pinned (Node 20 compat)
weightsBytes = params_total × bytes_per_param[quant]
weightsActiveBytes = params_active × bytes_per_param[quant] (MoE only)
bytes_per_param table (ram.js):
| Quant | Bytes/param | Source |
|---|---|---|
| FP32 | 4 | native |
| BF16 / FP16 | 2 | native |
| FP8 (E4M3, E5M2) | 1 | native, H100+ / recent MLX |
| INT8 | 1 | weight-only |
| Q4_K_M | 0.55 | llama.cpp avg ≈ 4.4 bits/weight |
| INT4 | 0.5 | weight-only |
| Q3_K_M | 0.45 | llama.cpp avg ≈ 3.6 bits/weight |
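The weights formula and the table above translate into a few lines of code. This is a minimal sketch: the lookup keys and the function name weightsBytes are illustrative and may not match what ram.js actually exports.

```javascript
// Bytes per parameter, copied from the table above (keys are illustrative).
const BYTES_PER_PARAM = {
  fp32: 4,
  bf16: 2,
  fp16: 2,
  fp8: 1,
  int8: 1,
  q4_k_m: 0.55,
  int4: 0.5,
  q3_k_m: 0.45,
};

// weightsBytes = params × bytes_per_param[quant]
function weightsBytes(params, quant) {
  const bpp = BYTES_PER_PARAM[quant.toLowerCase()];
  if (bpp === undefined) throw new Error(`unknown quant: ${quant}`);
  return params * bpp;
}

// A 70B dense model in BF16 needs 140e9 bytes for weights alone.
console.log(weightsBytes(70e9, "bf16") / 1e9); // 140
```

For an MoE model the same function would be called twice, once with params_total and once with params_active.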
Branches on attention type:
GQA / MHA / MQA:
kvBytes = 2 × kv_layers × (hidden_size × kv_heads/num_heads)
× ctxLen × kvDtypeBytes × batch
+ 2 × (num_layers − kv_layers) × (hidden_size × kv_heads/num_heads)
× min(ctxLen, sliding_window) × kvDtypeBytes × batch [if sliding]
# kv_layers defaults to num_layers. Hybrid models (Qwen 3.5/3.6, Granite
# 4.0-H) override it to count only the full-attention layers; linear
# layers have constant SSM state and contribute ~0 KV.
# sliding_window applies to the non-full layers (Gemma 2/3/4): their
# effective KV is bounded by the window, not by ctxLen.
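A direct transcription of the formula above, as a sketch. The field names follow the models.js entry format; the function name kvBytesGqa and the default arguments are assumptions. The Llama 3.1 8B numbers in the test values (32 layers, hidden size 4096, 32 heads, 8 KV heads) come from that model's public config.

```javascript
// KV cache bytes for MHA / GQA / MQA models.
// m.kv_layers and m.sliding_window are optional, as in the comments above.
function kvBytesGqa(m, ctxLen, kvDtypeBytes = 2, batch = 1) {
  // K and V elements per token per layer: 2 × hidden_size × kv_heads / num_heads
  const perTokenPerLayer = 2 * m.hidden_size * (m.kv_heads / m.num_heads);
  const fullLayers = m.kv_layers ?? m.num_layers;

  // Full-attention layers scale with the whole context.
  let bytes = fullLayers * perTokenPerLayer * ctxLen * kvDtypeBytes * batch;

  // Sliding-window layers are capped at the window size.
  if (m.sliding_window != null) {
    const windowLayers = m.num_layers - fullLayers;
    const effectiveCtx = Math.min(ctxLen, m.sliding_window);
    bytes += windowLayers * perTokenPerLayer * effectiveCtx * kvDtypeBytes * batch;
  }
  return bytes;
}

const llama8b = { num_layers: 32, hidden_size: 4096, num_heads: 32, kv_heads: 8 };
console.log(kvBytesGqa(llama8b, 131072) / 1024 ** 3); // 16 (GiB at 128k context, BF16 KV)
```

Without sliding_window, layers beyond kv_layers contribute nothing, which matches the "~0 KV" treatment of linear-attention layers.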
MLA (DeepSeek V2/V3/V3.1/V3.2/R1):
kvBytes = num_layers × (kv_lora_rank + qk_rope_head_dim)
× ctxLen × kvDtypeBytes × batch
# No factor of 2 — K and V share the compressed latent.
For DeepSeek-V3 at 32k BF16 this evaluates to ~2 GiB instead of the ~107 GiB
that naive attention math returns. There's a sentinel test (ram.test.js)
that locks this in.
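The ~2 GiB figure can be reproduced with a short sketch of the MLA branch. The function name kvBytesMla is illustrative. The DeepSeek-V3 values (61 layers, kv_lora_rank 512, qk_rope_head_dim 64) are taken from that model's public config and are assumed to match the table entry.

```javascript
// KV cache bytes for MLA models: one compressed latent per token per layer,
// shared by K and V, so there is no factor of 2.
function kvBytesMla(m, ctxLen, kvDtypeBytes = 2, batch = 1) {
  const latentPerToken = m.mla.kv_lora_rank + m.mla.qk_rope_head_dim;
  return m.num_layers * latentPerToken * ctxLen * kvDtypeBytes * batch;
}

// Assumed DeepSeek-V3 architecture values (from its public config.json).
const deepseekV3 = {
  num_layers: 61,
  mla: { kv_lora_rank: 512, qk_rope_head_dim: 64 },
};

const gib = kvBytesMla(deepseekV3, 32768) / 1024 ** 3;
console.log(gib.toFixed(2)); // 2.14
```

The arithmetic is 61 × 576 × 32768 × 2 bytes = 2,302,672,896 bytes, or about 2.14 GiB.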
overheadBytes = weightsBytes × 0.12 # activations + framework workspace
lora (rank 16):
adapter_params = params_total × 0.001
loraBytes = adapter_params × (2 + 8) # bf16 grads + Adam fp32 m,v
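The overhead and LoRA terms are simple multipliers. The sketch below shows them together with one plausible way of summing the total (weights + KV + overhead + LoRA, as the ram.js description suggests). The function names and the options object are assumptions for illustration.

```javascript
const OVERHEAD_FRACTION = 0.12;         // activations + framework workspace
const LORA_PARAM_FRACTION = 0.001;      // rank-16 adapter size relative to params_total
const LORA_BYTES_PER_ADAPTER_PARAM = 2 + 8; // bf16 grads + Adam fp32 m and v

function overheadBytes(weights) {
  return weights * OVERHEAD_FRACTION;
}

function loraBytes(paramsTotal) {
  return paramsTotal * LORA_PARAM_FRACTION * LORA_BYTES_PER_ADAPTER_PARAM;
}

// Assumed composition of the headline number.
function estimateTotalBytes({ weights, kv, paramsTotal, lora = false }) {
  const loraPart = lora ? loraBytes(paramsTotal) : 0;
  return weights + kv + overheadBytes(weights) + loraPart;
}
```

For an 8B model the LoRA term is about 80 MB, which is small next to the 16 GB of BF16 weights.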
A model "fits" a GPU when:
required ≤ vram × usableFactor × (1 - 0.12)
usableFactor = 0.75 for Apple Silicon (the system reserves ~25% by
default; sudo sysctl iogpu.wired_limit_mb can raise it). 1.0 for Nvidia.
For MoE models, required always uses total weights, not active —
every expert must be resident in VRAM even though only a few activate per
token.
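The fit rule reduces to a single comparison. In this sketch the gpu object shape ({ vendor, vramBytes }) and the function names are hypothetical; only the 0.75 factor, the 1.0 factor, and the 12% headroom come from the text above.

```javascript
const HEADROOM = 0.12;

// Bytes actually available to the model on a given device.
function usableBytes(gpu) {
  // Apple Silicon reserves ~25% of unified memory by default.
  const usableFactor = gpu.vendor === "apple" ? 0.75 : 1.0;
  return gpu.vramBytes * usableFactor * (1 - HEADROOM);
}

// requiredBytes must already use total weights for MoE models.
function fits(requiredBytes, gpu) {
  return requiredBytes <= usableBytes(gpu);
}

const a100 = { vendor: "nvidia", vramBytes: 80e9 };
const mac96 = { vendor: "apple", vramBytes: 96e9 };
console.log(fits(70e9, a100));  // true  (usable is about 70.4e9)
console.log(fits(70e9, mac96)); // false (usable is about 63.4e9)
```

The same 70 GB requirement fits an 80 GB Nvidia card but not a 96 GB Mac, which is the effect the 75% factor is meant to surface.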
A query like mlx-community/Qwen3.6-27B-AEON-Ultimate-Uncensored-BF16-mlx-fp16
flows through five stages:
1. findExact(name): direct lookup against the hardcoded ID/alias index.
2. parseModelName(name): strips a recognized quantizer-org prefix (mlx-community/, unsloth/, bartowski/, TheBloke/, …) and a chain of tensor-type / format / role suffixes (-bf16, -mlx, -Q4_K_M, -GGUF, -GPTQ, -instruct, -DPO, …). Returns { base, quant, format }. The auto-detected quant gets applied to the dropdown.
3. findExact(parsed.base): the exact lookup retried on the stripped base.
4. fuzzyResolve(base): token-set scoring with size-token boost. Size tokens require exact match (so 3b doesn't collide with 30b); other tokens use substring match. Returns approximate confidence.
5. guessFromName(base): extracts the size token (e.g. 27b), picks the closest real model in the table by family hint (qwen, llama, mistral, …) and size, and copies that model's architecture as the basis. Returns guess confidence with _basedOn set to the source id.
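To make the prefix and suffix stripping concrete, here is a simplified sketch of the parsing stage. It is not the real parseModelName: the lists are abbreviated, the name parseModelNameSketch is made up, and keeping the right-most quant when several appear is an arbitrary choice in this sketch.

```javascript
// Abbreviated, hypothetical lists; the real tables are longer.
const ORG_PREFIXES = ["mlx-community/", "unsloth/", "bartowski/", "thebloke/"];
const QUANT_SUFFIXES = ["bf16", "fp16", "fp8", "int8", "int4", "q4_k_m", "q3_k_m"];
const FORMAT_SUFFIXES = ["gguf", "gptq", "mlx"];
const ROLE_SUFFIXES = ["instruct", "dpo"];

function parseModelNameSketch(name) {
  let base = name;
  const prefix = ORG_PREFIXES.find((p) => name.toLowerCase().startsWith(p));
  if (prefix) base = base.slice(prefix.length);

  let quant = null;
  let format = null;
  // Peel recognized suffixes off the end until an unknown token is reached.
  for (;;) {
    const cut = base.lastIndexOf("-");
    if (cut <= 0) break;
    const token = base.slice(cut + 1).toLowerCase();
    if (QUANT_SUFFIXES.includes(token)) {
      quant = quant ?? token;
    } else if (FORMAT_SUFFIXES.includes(token)) {
      format = format ?? token;
    } else if (!ROLE_SUFFIXES.includes(token)) {
      break; // not a known suffix: the rest is the base name
    }
    base = base.slice(0, cut);
  }
  return { base, quant, format };
}

console.log(parseModelNameSketch("bartowski/Llama-3.1-8B-Instruct-Q4_K_M-GGUF"));
// { base: 'Llama-3.1-8B', quant: 'q4_k_m', format: 'gguf' }
```

On the long example name, this sketch stops at Uncensored because it has no notion of community markers; removing those is the job of the separate marker detection and base-name cleanup.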
In parallel, detectCustomMarkers(name) flags community fine-tune
indicators (uncensored, abliterated, dolphin, hermes, wizardlm,
aeon, lumimaid, …), and cleanedBaseName(name) produces a
display-friendly version (the example above renders as Qwen3.6-27B).
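The size-token rule in the fuzzyResolve stage is easy to get wrong, so a small sketch may help. The tokenizer, the weights (3 for a size match, 1 for a substring match), and the name scoreCandidate are all assumptions; only the rule itself (exact match for sizes, substring match for everything else) comes from the description above.

```javascript
// A size token looks like 3b, 30b, 0.5b.
const SIZE_TOKEN = /^\d+(\.\d+)?b$/;

function tokenize(s) {
  return s.toLowerCase().split(/[^a-z0-9.]+/).filter(Boolean);
}

// Hypothetical scoring function for one candidate id.
function scoreCandidate(query, candidateId) {
  const candTokens = tokenize(candidateId);
  let score = 0;
  for (const q of tokenize(query)) {
    if (SIZE_TOKEN.test(q)) {
      // Size tokens must match a whole candidate token, and count for more.
      if (candTokens.includes(q)) score += 3;
    } else if (candTokens.some((c) => c.includes(q))) {
      // Other tokens may match as substrings (qwen matches qwen3).
      score += 1;
    }
  }
  return score;
}

console.log(scoreCandidate("llama 3b", "meta-llama/Llama-3.2-3B")); // 4
console.log(scoreCandidate("qwen 3b", "Qwen/Qwen3-30B-A3B"));       // 1
```

In the second call 3b earns nothing: the candidate's tokens are 30b and a3b, and neither equals 3b exactly.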
glossarize(name) decodes the user's input into a list of explanations
that appear in the right-hand pane below the model card. It runs three
passes:
- Generation patterns: specific known versions (Llama 3.1, Qwen 2.5, DeepSeek V3, Granite 4.1, etc.) with detailed descriptions. If no specific match hits, falls through to family-version capture so Qwen3.6 gets labeled Qwen 3.6 (not "Qwen 3.x (unknown)").
- Glossary lookup: regex search on the raw name for known terms: tensor types (BF16, FP16, FP8, INT8, INT4), GGUF quants (Q4_K_M, Q3_K_M, Q5_K_M, Q8_0), formats (GGUF, GGML, EXL2, MLX, GPTQ, AWQ), training stages (Instruct, Chat, Base, DPO, SFT, RLHF, ORPO, KTO), and community fine-tune lineages (Dolphin, Hermes, WizardLM, Vicuna, Lumimaid, Magnum, Stheno, MythoMax, Miqu, AEON, etc.). MoE entries explicitly spell out the active_experts / total_experts notation (e.g. MoE 6/256 reads as "6 active of 256 total").
- Tokenized lookup: for size shorthand (27B → "27 billion parameters") and MoE notation (8x7B → "8 experts of 7B each").
The UI also appends model-derived entries: the resolved attention type (MHA/GQA/MQA/MLA), MoE topology (with active-vs-total breakdown), and the architectural basis when the resolution is a guess.
The generation entry stays first; everything else is sorted alphabetically.
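The tokenized lookup and the final ordering rule can be sketched as follows. The regular expressions, the entry shape ({ kind, term, text }), and both function names are illustrative assumptions rather than the glossarize implementation.

```javascript
// Decodes size shorthand (27B) and MoE notation (8x7B) from a model name.
function decodeSizeTokens(name) {
  const entries = [];
  // MoE shorthand is checked first so that 7B in 8x7B is not reported twice.
  const moe = name.match(/(\d+)x(\d+(?:\.\d+)?)B/i);
  if (moe) {
    entries.push({ kind: "moe", term: moe[0], text: `${moe[1]} experts of ${moe[2]}B each` });
  } else {
    const size = name.match(/(?:^|[-_\/\s])(\d+(?:\.\d+)?)B(?=$|[-_\s.])/i);
    if (size) {
      entries.push({ kind: "size", term: `${size[1]}B`, text: `${size[1]} billion parameters` });
    }
  }
  return entries;
}

// Generation entry first, everything else alphabetical by term.
function sortGlossary(entries) {
  const generation = entries.filter((e) => e.kind === "generation");
  const rest = entries
    .filter((e) => e.kind !== "generation")
    .sort((a, b) => a.term.localeCompare(b.term));
  return [...generation, ...rest];
}

console.log(decodeSizeTokens("Mixtral-8x7B-Instruct")[0].text); // 8 experts of 7B each
```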
Append to models.js. Required fields:
{
  id: "ibm-granite/granite-4.1-30b",
  aliases: ["granite-4.1-30b", "granite-30b"],
  params_total: 28.87e9,
  params_active: null,    // null for dense, e.g. 37e9 for DeepSeek-V3
  num_layers: 64,
  hidden_size: 4096,
  num_heads: 32,
  kv_heads: 8,            // ignored for MLA
  attention: "GQA",       // "MHA" | "GQA" | "MQA" | "MLA"
  mla: null,              // { kv_lora_rank, qk_rope_head_dim } when MLA
  moe: null,              // { total_experts, active_experts } when MoE
  default_dtype: "bf16",
  max_context: 131072,    // model.config.max_position_embeddings
  // Optional, for hybrid models:
  // kv_layers: 16,        // count of full-attention layers; rest contribute ~0 KV
  // sliding_window: 1024, // window size for non-full layers (Gemma)
}

Numbers come from the model's HF config.json. For param counts, prefer the
metadata.total_size field of model.safetensors.index.json divided by the
dtype byte width.
After adding, run node --test *.test.js — the sweep tests verify that all
MoE entries have both total and active param counts and that all MLA entries
declare kv_lora_rank and qk_rope_head_dim.
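The two invariants those sweep tests check can be expressed as a small validator. This is a sketch of the idea only; the function name validateEntry and the message strings are made up, and the real tests may be structured differently.

```javascript
// Returns a list of problems for one models.js entry (empty when valid).
function validateEntry(entry) {
  const problems = [];

  // MoE entries need both parameter counts.
  if (entry.moe && !(entry.params_total > 0 && entry.params_active > 0)) {
    problems.push("MoE entry needs both params_total and params_active");
  }

  // MLA entries need both latent dimensions.
  if (entry.attention === "MLA") {
    const mla = entry.mla ?? {};
    if (!(mla.kv_lora_rank > 0)) problems.push("MLA entry needs mla.kv_lora_rank");
    if (!(mla.qk_rope_head_dim > 0)) problems.push("MLA entry needs mla.qk_rope_head_dim");
  }
  return problems;
}

// Usage against the whole table (MODELS is a hypothetical array name):
//   const bad = MODELS.filter((m) => validateEntry(m).length > 0);
```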
The page is two columns:
Left column stacks three cards:
- Model: input box with fuzzy autocomplete; below it, a single based on: <name> line showing what the calculator resolved your input to.
- Glossary (auto-shown when the input contains decodable tokens): one entry per row, term in mono on top, description in muted below.
- Form controls for quantization, context length, batch, and LoRA toggle.

Right column is the output card:
- Total Usage ≈ N GB headline with a weights X + KV Y sub-line.
- Stacked breakdown bar showing weights vs KV cache. The KV segment turns orange when it exceeds weights (the "long context dominates" cliff).
- GPU fit list grouped into Nvidia enterprise, Nvidia consumer, and Apple (with chip names like M5 Max · 64 GB, M3 Ultra · 512 GB). Each row shows required-vs-usable plus headroom or shortfall.
A ?m=<name>&q=<quant>&c=<ctx>&b=<batch>&lora=1 query string round-trips
the form state, so a configuration is a shareable URL.
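A round trip like this can be built on the standard URLSearchParams API. In the sketch below the parameter names (m, q, c, b, lora) come from the text above; the state object shape, the function names, and the default values used when a parameter is missing are assumptions.

```javascript
// Serializes form state into the query string.
function encodeState(state) {
  const p = new URLSearchParams();
  p.set("m", state.model);
  p.set("q", state.quant);
  p.set("c", String(state.ctx));
  p.set("b", String(state.batch));
  if (state.lora) p.set("lora", "1"); // omitted entirely when LoRA is off
  return "?" + p.toString();
}

// Parses a query string back into form state (defaults are assumed).
function decodeState(search) {
  const p = new URLSearchParams(search);
  return {
    model: p.get("m") ?? "",
    quant: p.get("q") ?? "bf16",
    ctx: Number(p.get("c") ?? 8192),
    batch: Number(p.get("b") ?? 1),
    lora: p.get("lora") === "1",
  };
}

const url = encodeState({ model: "meta-llama/Llama-3.1-70B", quant: "q4_k_m", ctx: 32768, batch: 1, lora: true });
console.log(url); // ?m=meta-llama%2FLlama-3.1-70B&q=q4_k_m&c=32768&b=1&lora=1
```

URLSearchParams handles the percent-encoding of the slash in the model id, so the id survives the round trip unchanged.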
The hardcoded table covers, as of the last refresh:
- Llama 2 (7B/13B/70B), 3 (8B/70B), 3.1 (8B/70B/405B), 3.2 (1B/3B), 3.3 (70B), 4 (Scout 17B/16E, Maverick 17B/128E)
- Qwen 2.5 (0.5B–72B), 3 (8B/14B/32B + 30B-A3B/235B-A22B MoE), 3.5 (4B/9B), 3.6 (27B + 35B-A3B MoE) — Qwen 3.5/3.6 use hybrid linear/full attention
- Mistral lineage — Mistral 7B, Mixtral 8×7B / 8×22B, Mistral Large 2 / 3 (MLA-MoE 675B), Mistral Small 4 (MLA-MoE 119B), Ministral 3 (3B/8B/14B), Magistral Small (reasoning)
- DeepSeek V2 / V2-Lite / V3 / V3.1 / V3.2 / R1 (MLA-MoE), V4-Flash (MQA + Q-LoRA, 1M ctx)
- MiniMax M2, M2.7 (240B / ~10B-active MoE, 256 experts top-8, 200k ctx)
- Gemma 2 (2B/9B/27B), 3 (4B/12B/27B), 4 (31B + 26B-A4B MoE) — Gemma uses sliding-window
- Phi 3 / 3.5-MoE / 4
- IBM Granite 4.1 (3B/8B/30B), Granite Vision 4.1, Granite Guardian 4.1
- GPT-OSS 20B / 120B
- Yi 34B
- Cohere Command R / R+
70 entries total. Family-level fallback patterns also recognize unknown
versions in any of the families above plus a generic MiniMax M*.
node --test ram.test.js gpus.test.js models.test.js

55 tests; the math is the product. Every assertion in ram.test.js cites
where the expected number comes from. The most important one is the MLA
correctness guard:
DeepSeek-V3 MLA @ 32k BF16: KV ~2 GiB, NOT ~100 GiB
If this ever starts failing, the fit table will be telling people they need ~100 GiB of GPU just for KV cache when in reality 2 GiB suffices. That's the single bug worth blocking a release for.
# One-time setup (already done for the live site):
# - Pages project: skuld (originally created as llm-zoleb, renamed via API;
# the .pages.dev system subdomain is sticky and
# is still llm-zoleb.pages.dev — see caveat below)
# - Worker: skuld-hf
# - KV namespace: LLM_ZOLEB_HF_CACHE (binding name: CACHE)
# - Worker secret: HF_TOKEN
# - DNS: CNAME skuld → llm-zoleb.pages.dev (proxied)
# - Access: per-host "skuld.zoleb.com (public bypass)" app overrides the
# *.zoleb.com wildcard policy so the site is publicly reachable
# Worker:
cd worker
export CLOUDFLARE_ACCOUNT_ID=...
export CLOUDFLARE_API_TOKEN=... # needs Workers/KV/Routes scopes
npx wrangler deploy
# Pages (rsync to a tmp dir to exclude tests + node_modules + worker/):
mkdir -p /tmp/skuld-deploy
rsync -a --delete \
--exclude='*.test.js' --exclude='worker' --exclude='node_modules' \
--exclude='.git' --exclude='package*.json' --exclude='README.md' \
./ /tmp/skuld-deploy/
npx --prefix worker wrangler pages deploy /tmp/skuld-deploy \
--project-name=skuld --branch=main --commit-dirty=true
rm -rf /tmp/skuld-deploy

Wrangler is pinned to v3 in worker/package.json because the host VPS runs
Node 20. v4 requires Node 22+. Bump together.
If you rename the Cloudflare Pages project (PATCH …/pages/projects/<name>
with {"name": "<new>"}):
- The display name updates and the project shows up under the new name in the dashboard and CLI.
- The system-assigned .pages.dev subdomain stays the same; it does NOT migrate to <new>.pages.dev. CNAMEs pointing at the new subdomain hit Error 1014 (CNAME Cross-User Banned) because the new hostname is registered to a different account / doesn't exist yet.
- Custom domains attached before the rename flip to status: "deactivated" and stop serving (you'll see Error 522 from Cloudflare). Detach + reattach the domain to re-verify it under the renamed project:

  curl -X DELETE -H "Authorization: Bearer $TOKEN" \
    "https://api.cloudflare.com/client/v4/accounts/$ACCT/pages/projects/<new>/domains/<domain>"
  curl -X POST -H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" \
    "https://api.cloudflare.com/client/v4/accounts/$ACCT/pages/projects/<new>/domains" \
    -d '{"name":"<domain>"}'

  Wait for status: active. The CNAME should still target the original <old>.pages.dev subdomain.

- Mac default reserve. Apple Silicon "usable" is vram × 0.75. Models that fit in 512 GB on paper may need the limit raised with sudo sysctl iogpu.wired_limit_mb before they actually fit. The fit list silently applies the 75% factor.
- M5 Ultra hasn't shipped. The 192/256/512 GB tiers are labeled M3 Ultra because no Ultra-class M4 or M5 exists yet. Update gpus.js when one does.
- Hybrid attention IS handled for Qwen 3.5 / 3.6 (linear + full) and Gemma 2 / 3 / 4 (sliding window) via the kv_layers and sliding_window fields. Pure-Mamba and Mamba-Transformer hybrids (Granite 4.0-H, etc.) are not yet handled: adding kv_layers: 0 would zero out their KV but doesn't account for the SSM state that does exist.
- Extended context (256k / 512k / 1M) hides behind a toggle. At those ranges KV cache dwarfs weights for almost any model.
- Native vs extended context. Each model entry has max_context reflecting max_position_embeddings. Sliding past that requires RoPE/YaRN extension, which usually degrades quality; the slider hint flags it.
- Auto-quant. When a tensor-type suffix is detected in the input (-bf16, -fp8, -Q4_K_M, -GGUF, -GPTQ, …), the dropdown is set automatically. Manual changes stick; auto-set only fires when the input changes.

MIT