michaeldtimpe/skuld


skuld

LLM RAM calculator. Type a model name — anything from meta-llama/Llama-3.1-70B to mlx-community/Qwen3.6-27B-AEON-Ultimate-Uncensored-BF16-mlx-fp16 — and get a realistic estimate of the VRAM or unified memory needed to run it for inference.

Live at skuld.zoleb.com.

The headline output — "fits on 1× A100 80 GB", "needs M3 Ultra 192 GB" — is the screenshot people share, so the math has to be right for the cases people care about most: long-context Llama 3.1, MoE Mixtral, and especially MLA DeepSeek (where naive attention math overstates KV cache by ~50×).

Why this exists

Most "LLM memory calculators" use a single formula:

total ≈ params × bytes_per_param × 1.2

That's wrong for any modern model:

| Model class | Why naive math fails |
| --- | --- |
| GQA (Llama 3+, Qwen, Mistral) | KV cache scales with kv_heads / num_heads, not 1 |
| MLA (DeepSeek V2/V3/V3.1/V3.2/R1) | KV is a compressed latent (~576 elements / token / layer), not 2 × hidden_size. At 32k context, the difference is ~2 GiB vs ~107 GiB. |
| MoE (Mixtral, DeepSeek, Qwen3-MoE, Qwen 3.6, Llama 4, GPT-OSS, Gemma 4) | Total weights vs active-per-token are different numbers; both matter |
| Hybrid linear/full attention (Qwen 3.5 / 3.6) | Only ~1 in 4 layers is full attention. Linear-attention layers have constant SSM state, not O(ctx) KV cache. Naive math overstates KV by 4×. |
| Sliding-window attention (Gemma 2 / 3 / 4) | Most layers cap KV at the window size (e.g. 1024 tokens), so KV stops scaling with context past the window. |
| Long context | KV cache dominates over weights once context exceeds the model's natural break point; for an 8B model this happens around 256k tokens |

skuld branches on attention type and MoE topology in the math module instead of papering over them.

Architecture

Two pieces, both designed to run with no build step:

  • Static front-end at index.html: vanilla HTML + ES modules, vendored Inter font and zoleb.com style tokens. No framework, no bundler. Hosted on Cloudflare Pages.
  • Cloudflare Worker at worker/hf-lookup.js — two endpoints:
    • GET /api/hf-lookup?id=... — fallback when a model isn't in the hardcoded table. Fetches HF config.json, normalizes architecture fields, caches in Workers KV (24h TTL). Uses an HF_TOKEN secret to bypass the unauthenticated rate limit.
    • POST /api/log-search — fire-and-forget search beacon. Body is { query, hit } where hit is "table" | "hf-fallback" | "miss". Writes one event to Workers Analytics Engine (dataset: skuld_searches).
┌────────────────────────────────────────────────────────────────────────┐
│                          skuld.zoleb.com                                │
│  ┌────────────┐    /api/hf-lookup     ┌───────────────────────┐         │
│  │ index.html │ ────────────────────► │ worker/hf-lookup.js   │         │
│  │  models.js │                       │   ↓                   │         │
│  │  ram.js    │    /api/log-search    │   Workers KV (24h)    │         │
│  │  gpus.js   │ ────────────────────► │   ↓                   │         │
│  └────────────┘    (every search,     │   huggingface.co API  │         │
│       │             debounced 2s)     │   ↓                   │         │
│       │                               │   Analytics Engine    │         │
│       │                               │   (skuld_searches)    │         │
│       │                               └───────────────────────┘         │
│       └─── all math is client-side for models in the hardcoded table   │
└────────────────────────────────────────────────────────────────────────┘
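The cache-then-fetch behavior of the lookup endpoint can be sketched roughly as follows. This is a minimal illustration, not the actual worker/hf-lookup.js: the function names lookupModel and normalizeConfig, the cache key format, and the set of normalized fields are assumptions. The KV calls (get with "json", put with expirationTtl) and the config.json field names follow the public Workers KV and Hugging Face conventions.

```javascript
// Hypothetical sketch of a cache-then-fetch lookup; not the real Worker source.
const TTL_SECONDS = 24 * 60 * 60; // 24h TTL, as described above

// Keep only the fields the calculator needs (names follow HF config.json).
function normalizeConfig(cfg) {
  return {
    num_layers: cfg.num_hidden_layers,
    hidden_size: cfg.hidden_size,
    num_heads: cfg.num_attention_heads,
    kv_heads: cfg.num_key_value_heads ?? cfg.num_attention_heads,
    max_context: cfg.max_position_embeddings,
  };
}

// env.CACHE is the Workers KV binding; fetchImpl is injectable for testing.
async function lookupModel(id, env, fetchImpl = fetch) {
  const key = `hf:${id}`; // key format is an assumption
  const cached = await env.CACHE.get(key, "json");
  if (cached) return { source: "cache", config: cached };

  const headers = env.HF_TOKEN ? { Authorization: `Bearer ${env.HF_TOKEN}` } : {};
  const url = `https://huggingface.co/${id}/resolve/main/config.json`;
  const res = await fetchImpl(url, { headers });
  if (!res.ok) return { source: "miss", config: null };

  const config = normalizeConfig(await res.json());
  await env.CACHE.put(key, JSON.stringify(config), { expirationTtl: TTL_SECONDS });
  return { source: "hf", config };
}
```

Injecting fetchImpl keeps the function testable outside the Workers runtime; inside a Worker the default global fetch would be used.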

The site is public: a per-host Cloudflare Access "Bypass everyone" app overrides the wildcard *.zoleb.com gate.

Analytics

Two layers, both Cloudflare-native and zero-cost at this volume:

Page traffic — Cloudflare Web Analytics. Auto-installed at the edge for the entire zoleb.com zone (lite mode, no script tag in the HTML). View at the Cloudflare Web Analytics dashboard. Anonymous, no cookies, no GDPR banner.

Search queries — Workers Analytics Engine. Each search the user types fires POST /api/log-search 2 s after they stop typing (debounced, deduped per query). The Worker writes one row to the skuld_searches AE dataset:

| Column | Meaning |
| --- | --- |
| blob1 | normalized query (lowercased, trimmed, max 200 chars) |
| blob2 | hit kind: table / hf-fallback / miss |
| blob3 | country (from cf.country) |
| index1 | hit kind (sampled index for fast filtering) |
| _sample_interval | reciprocal sampling rate (multiply for true counts) |
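As a rough illustration of how one beacon body could map onto those columns, here is a minimal sketch. The helper name buildSearchEvent and the decision to drop malformed beacons are assumptions; the column layout and the 200-character cap come from the table above, and writeDataPoint is the Analytics Engine binding method.

```javascript
// Hypothetical sketch of building one Analytics Engine data point.
const HIT_KINDS = ["table", "hf-fallback", "miss"];

function buildSearchEvent(body, country) {
  // Normalize: trim, lowercase, cap at 200 characters.
  const query = String(body.query ?? "").trim().toLowerCase().slice(0, 200);
  if (!query || !HIT_KINDS.includes(body.hit)) return null; // drop malformed beacons
  return {
    blobs: [query, body.hit, country ?? ""], // blob1, blob2, blob3
    indexes: [body.hit],                      // index1
  };
}

// Inside the Worker, assuming the AE binding is named SEARCHES:
//   const event = buildSearchEvent(await request.json(), request.cf?.country);
//   if (env.SEARCHES && event) env.SEARCHES.writeDataPoint(event);

console.log(JSON.stringify(buildSearchEvent({ query: "  Llama-3.1-70B ", hit: "table" }, "US")));
// {"blobs":["llama-3.1-70b","table","US"],"indexes":["table"]}
```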

Query via the Workers Analytics SQL API (account-scoped Analytics token):

curl -s "https://api.cloudflare.com/client/v4/accounts/$CF_ACCOUNT_ID/analytics_engine/sql" \
  -H "Authorization: Bearer $CF_ANALYTICS_TOKEN" \
  --data "SELECT blob1 AS query, blob2 AS hit, SUM(_sample_interval) AS n
          FROM skuld_searches
          WHERE timestamp > NOW() - INTERVAL '7' DAY
          GROUP BY query, hit
          ORDER BY n DESC
          LIMIT 50"

One-time setup: AE must be enabled once via the dashboard at https://dash.cloudflare.com/<account-id>/workers/analytics-engine. Until then the Worker accepts log-search requests and silently drops them (the endpoint guards on if (env.SEARCHES)). After enabling, uncomment the [[analytics_engine_datasets]] block in worker/wrangler.toml and redeploy.

Files

.
├── index.html              single-page UI
├── models.js               hardcoded metadata for ~70 models + parsers
├── ram.js                  pure math: weights + KV + overhead + LoRA
├── gpus.js                 hardware fit table (Nvidia + Apple Silicon)
├── ram.test.js             sentinel tests including the MLA correctness guard
├── gpus.test.js            fit-list tests: at least one model per attention type
├── models.test.js          resolution + parseModelName + glossarize tests
├── static/
│   ├── tokens.css          zoleb.com design tokens (themed light/dark)
│   ├── theme-toggle.js     light-mode/dark-mode button handler
│   └── InterVariable.ttf   bundled font (no Google Fonts)
└── worker/
    ├── hf-lookup.js        Cloudflare Worker source
    ├── wrangler.toml       Worker config (route, KV binding, secrets)
    └── package.json        wrangler v3 pinned (Node 20 compat)

Math reference

Weights

weightsBytes        = params_total  × bytes_per_param[quant]
weightsActiveBytes  = params_active × bytes_per_param[quant]   (MoE only)

bytes_per_param table (ram.js):

| Quant | Bytes/param | Source |
| --- | --- | --- |
| FP32 | 4 | native |
| BF16 / FP16 | 2 | native |
| FP8 (E4M3, E5M2) | 1 | native, H100+ / recent MLX |
| INT8 | 1 | weight-only |
| Q4_K_M | 0.55 | llama.cpp avg ≈ 4.4 bits/weight |
| INT4 | 0.5 | weight-only |
| Q3_K_M | 0.45 | llama.cpp avg ≈ 3.6 bits/weight |
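The weight formulas and the table translate almost directly into code. The sketch below is illustrative only: the function name weightBytes and the constant name are assumptions, and the 70B example uses a round parameter count rather than any specific model's exact figure.

```javascript
// Bytes per parameter, copied from the table above.
const BYTES_PER_PARAM = {
  FP32: 4, BF16: 2, FP16: 2, FP8: 1, INT8: 1,
  Q4_K_M: 0.55, INT4: 0.5, Q3_K_M: 0.45,
};

// Returns total and (for MoE) active weight bytes for a model entry.
function weightBytes(model, quant) {
  const bpp = BYTES_PER_PARAM[quant];
  if (bpp === undefined) throw new Error(`unknown quant: ${quant}`);
  return {
    weightsBytes: model.params_total * bpp,
    weightsActiveBytes: model.params_active ? model.params_active * bpp : null,
  };
}

// Illustrative dense model with a round 70e9 parameters at Q4_K_M.
const dense = weightBytes({ params_total: 70e9, params_active: null }, "Q4_K_M");
console.log((dense.weightsBytes / 1e9).toFixed(1) + " GB"); // 38.5 GB
```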

KV cache

Branches on attention type:

GQA / MHA / MQA:
  kvBytes = 2 × kv_layers × (hidden_size × kv_heads/num_heads)
            × ctxLen × kvDtypeBytes × batch
        +  2 × (num_layers − kv_layers) × (hidden_size × kv_heads/num_heads)
            × min(ctxLen, sliding_window) × kvDtypeBytes × batch    [if sliding]

  # kv_layers defaults to num_layers. Hybrid models (Qwen 3.5/3.6, Granite
  # 4.0-H) override it to count only the full-attention layers; linear
  # layers have constant SSM state and contribute ~0 KV.
  # sliding_window applies to the non-full layers (Gemma 2/3/4): their
  # effective KV is bounded by the window, not by ctxLen.

MLA (DeepSeek V2/V3/V3.1/V3.2/R1):
  kvBytes = num_layers × (kv_lora_rank + qk_rope_head_dim)
            × ctxLen × kvDtypeBytes × batch
  # No factor of 2 — K and V share the compressed latent.

For DeepSeek-V3 at 32k BF16 this evaluates to ~2 GiB instead of the ~107 GiB that naive attention math returns. There's a sentinel test (ram.test.js) that locks this in.
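A minimal sketch of the two branches, assuming the models.js field names shown elsewhere in this README; the function name kvBytes is hypothetical, and the real ram.js may be organized differently. The DeepSeek-V3 values (61 layers, kv_lora_rank 512, qk_rope_head_dim 64) are taken from its HF config.json, and the GQA example uses Llama-3.1-8B-style dimensions.

```javascript
// Sketch of the KV-cache branches above; not the actual ram.js source.
function kvBytes(model, ctxLen, kvDtypeBytes = 2, batch = 1) {
  if (model.attention === "MLA") {
    // One shared compressed latent per token per layer: no factor of 2.
    const perToken = model.mla.kv_lora_rank + model.mla.qk_rope_head_dim;
    return model.num_layers * perToken * ctxLen * kvDtypeBytes * batch;
  }
  // GQA / MHA / MQA: K and V each hold kv_heads × head_dim elements.
  const kvWidth = model.hidden_size * (model.kv_heads / model.num_heads);
  const kvLayers = model.kv_layers ?? model.num_layers;
  let bytes = 2 * kvLayers * kvWidth * ctxLen * kvDtypeBytes * batch;
  if (model.sliding_window) {
    // Remaining layers are capped at the window size.
    const windowed = Math.min(ctxLen, model.sliding_window);
    bytes += 2 * (model.num_layers - kvLayers) * kvWidth * windowed * kvDtypeBytes * batch;
  }
  return bytes;
}

const GiB = 2 ** 30;

const deepseekV3 = {
  attention: "MLA",
  num_layers: 61,
  mla: { kv_lora_rank: 512, qk_rope_head_dim: 64 },
};
console.log((kvBytes(deepseekV3, 32768) / GiB).toFixed(2)); // 2.14

const llama8b = { attention: "GQA", num_layers: 32, hidden_size: 4096, num_heads: 32, kv_heads: 8 };
console.log(kvBytes(llama8b, 8192) / GiB); // 1
```

The MLA result is 61 × 576 × 32768 × 2 bytes, which is where the ~2 GiB figure comes from.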

Overhead + LoRA

overheadBytes = weightsBytes × 0.12          # activations + framework workspace

lora (rank 16):
  adapter_params = params_total × 0.001
  loraBytes = adapter_params × (2 + 8)       # bf16 grads + Adam fp32 m,v
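Written out as code, the two formulas look roughly like this. The constant and function names are hypothetical; the factors (0.12, 0.001, 2 + 8 bytes) are the ones stated above.

```javascript
const OVERHEAD_FACTOR = 0.12;       // activations + framework workspace
const LORA_PARAM_FRACTION = 0.001;  // rank-16 adapter size relative to the base model
const LORA_BYTES_PER_PARAM = 2 + 8; // bf16 grads + Adam fp32 m and v

function overheadBytes(weightsBytes) {
  return weightsBytes * OVERHEAD_FACTOR;
}

function loraBytes(paramsTotal) {
  return paramsTotal * LORA_PARAM_FRACTION * LORA_BYTES_PER_PARAM;
}

// An 8e9-parameter model in BF16 has 16 GB of weights.
console.log((overheadBytes(16e9) / 1e9).toFixed(2) + " GB"); // 1.92 GB
console.log((loraBytes(8e9) / 1e6).toFixed(0) + " MB");      // 80 MB
```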

GPU fit

A model "fits" a GPU when:

required ≤ vram × usableFactor × (1 - 0.12)

usableFactor is 0.75 for Apple Silicon (the system reserves ~25% of unified memory by default; sudo sysctl iogpu.wired_limit_mb can raise the limit) and 1.0 for Nvidia.

For MoE models, required always uses total weights, not active — every expert must be resident in VRAM even though only a few activate per token.
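The fit rule can be sketched as below. The GPU object shape (vendor, vramBytes) and the use of decimal gigabytes are assumptions for illustration; gpus.js may represent hardware differently.

```javascript
const SAFETY_MARGIN = 0.12;

// usableFactor: 0.75 for Apple Silicon, 1.0 for Nvidia (per the text above).
function usableBytes(gpu) {
  const usableFactor = gpu.vendor === "apple" ? 0.75 : 1.0;
  return gpu.vramBytes * usableFactor * (1 - SAFETY_MARGIN);
}

function fits(requiredBytes, gpu) {
  return requiredBytes <= usableBytes(gpu);
}

const GB = 1e9; // decimal GB is an assumption
const a100 = { name: "A100 80 GB", vendor: "nvidia", vramBytes: 80 * GB };
const mac48 = { name: "Apple 48 GB", vendor: "apple", vramBytes: 48 * GB };

console.log(fits(40 * GB, a100));  // true  (usable is about 70.4 GB)
console.log(fits(40 * GB, mac48)); // false (usable is about 31.7 GB)
```

For a MoE model, requiredBytes would be computed from total weights, since every expert has to be resident.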

Resolution flow

A query like mlx-community/Qwen3.6-27B-AEON-Ultimate-Uncensored-BF16-mlx-fp16 flows through five stages:

  1. findExact(name) — direct lookup against the hardcoded ID/alias index.
  2. parseModelName(name) — strips a recognized quantizer-org prefix (mlx-community/, unsloth/, bartowski/, TheBloke/, …) and a chain of tensor-type / format / role suffixes (-bf16, -mlx, -Q4_K_M, -GGUF, -GPTQ, -instruct, -DPO, …). Returns { base, quant, format }. The auto-detected quant gets applied to the dropdown.
  3. findExact(parsed.base) retried on the stripped base.
  4. fuzzyResolve(base) — token-set scoring with size-token boost. Size tokens require exact match (so 3b doesn't collide with 30b); other tokens use substring match. Returns approximate confidence.
  5. guessFromName(base) — extracts the size token (e.g. 27b), picks the closest real model in the table by family hint (qwen, llama, mistral, …) and size, and copies that model's architecture as the basis. Returns guess confidence with _basedOn set to the source id.

In parallel, detectCustomMarkers(name) flags community fine-tune indicators (uncensored, abliterated, dolphin, hermes, wizardlm, aeon, lumimaid, …), and cleanedBaseName(name) produces a display-friendly version (the example above renders as Qwen3.6-27B).
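To make stage 2 concrete, here is a simplified sketch of prefix and suffix stripping. It is not the real parseModelName: the prefix and suffix lists are shortened, and the choice to keep the rightmost quant when several appear is an assumption.

```javascript
// Hypothetical, shortened lists; the real ones in models.js are longer.
const ORG_PREFIXES = ["mlx-community/", "unsloth/", "bartowski/", "TheBloke/"];
const QUANT_SUFFIXES = ["bf16", "fp16", "fp8", "q4_k_m", "q3_k_m"];
const FORMAT_SUFFIXES = ["mlx", "gguf", "gptq"];
const ROLE_SUFFIXES = ["instruct", "dpo"];

function parseModelName(name) {
  let base = name;
  for (const prefix of ORG_PREFIXES) {
    if (base.toLowerCase().startsWith(prefix.toLowerCase())) {
      base = base.slice(prefix.length);
      break;
    }
  }
  let quant = null;
  let format = null;
  // Peel recognized suffixes off the right-hand end until none match.
  for (;;) {
    const cut = base.lastIndexOf("-");
    if (cut <= 0) break;
    const token = base.slice(cut + 1).toLowerCase();
    if (QUANT_SUFFIXES.includes(token)) quant = quant ?? token.toUpperCase();
    else if (FORMAT_SUFFIXES.includes(token)) format = format ?? token.toUpperCase();
    else if (!ROLE_SUFFIXES.includes(token)) break;
    base = base.slice(0, cut);
  }
  return { base, quant, format };
}

console.log(JSON.stringify(parseModelName("bartowski/Llama-3.1-8B-Instruct-Q4_K_M-GGUF")));
// {"base":"Llama-3.1-8B","quant":"Q4_K_M","format":"GGUF"}
```

The stripped base would then be handed to findExact, fuzzyResolve, and guessFromName in turn.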

The glossary

glossarize(name) decodes the user's input into a list of explanations that appear in the right-hand pane below the model card. It runs three passes:

  1. Generation patterns — specific known versions (Llama 3.1, Qwen 2.5, DeepSeek V3, Granite 4.1, etc.) with detailed descriptions. If no specific match hits, falls through to family-version capture so Qwen3.6 gets labeled Qwen 3.6 (not "Qwen 3.x (unknown)").
  2. Glossary lookup — regex search on the raw name for known terms: tensor types (BF16, FP16, FP8, INT8, INT4), GGUF quants (Q4_K_M, Q3_K_M, Q5_K_M, Q8_0), formats (GGUF, GGML, EXL2, MLX, GPTQ, AWQ), training stages (Instruct, Chat, Base, DPO, SFT, RLHF, ORPO, KTO), and community fine-tune lineages (Dolphin, Hermes, WizardLM, Vicuna, Lumimaid, Magnum, Stheno, MythoMax, Miqu, AEON, etc.). MoE entries explicitly spell out the active_experts / total_experts notation (e.g. MoE 6/256 reads as "6 active of 256 total").
  3. Tokenized lookup — for size shorthand (27B → "27 billion parameters") and MoE notation (8x7B → "8 experts of 7B each").

The UI also appends model-derived entries: the resolved attention type (MHA/GQA/MQA/MLA), MoE topology (with active-vs-total breakdown), and the architectural basis when the resolution is a guess.

The generation entry stays first; everything else is sorted alphabetically.
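The third pass is the easiest to illustrate. The sketch below is a guess at its shape, with hypothetical function names and an assumed set of separator characters; only the two output phrasings come from the description above.

```javascript
// Hypothetical helper mirroring the "tokenized lookup" pass.
function describeSizeToken(token) {
  // MoE shorthand such as 8x7B or 8×22B.
  const moe = /^(\d+)[x×](\d+(?:\.\d+)?)B$/i.exec(token);
  if (moe) return `${moe[1]} experts of ${moe[2]}B each`;
  // Plain size shorthand such as 27B or 0.5B.
  const size = /^(\d+(?:\.\d+)?)B$/i.exec(token);
  if (size) return `${size[1]} billion parameters`;
  return null;
}

function tokenizedLookup(name) {
  const entries = [];
  // Splitting on -, _, / and whitespace is an assumption.
  for (const token of name.split(/[-_/\s]+/)) {
    const description = describeSizeToken(token);
    if (description) entries.push({ term: token, description });
  }
  return entries;
}

console.log(JSON.stringify(tokenizedLookup("Mixtral-8x7B-Instruct")));
// [{"term":"8x7B","description":"8 experts of 7B each"}]
```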

Adding a model

Append to models.js. Required fields:

{
  id: "ibm-granite/granite-4.1-30b",
  aliases: ["granite-4.1-30b", "granite-30b"],
  params_total: 28.87e9,
  params_active: null,                // null for dense, e.g. 37e9 for DeepSeek-V3
  num_layers: 64,
  hidden_size: 4096,
  num_heads: 32,
  kv_heads: 8,                        // ignored for MLA
  attention: "GQA",                   // "MHA" | "GQA" | "MQA" | "MLA"
  mla: null,                          // { kv_lora_rank, qk_rope_head_dim } when MLA
  moe: null,                          // { total_experts, active_experts } when MoE
  default_dtype: "bf16",
  max_context: 131072,                // model.config.max_position_embeddings
  // Optional, for hybrid models:
  // kv_layers: 16,                   // count of full-attention layers; rest contribute ~0 KV
  // sliding_window: 1024,            // window size for non-full layers (Gemma)
}

Numbers come from the model's HF config.json. For param counts, prefer the metadata.total_size field of model.safetensors.index.json divided by the dtype byte width.

After adding, run node --test *.test.js — the sweep tests verify that all MoE entries have both total and active param counts and that all MLA entries declare kv_lora_rank and qk_rope_head_dim.
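The two checks named above could be expressed along these lines. This is a sketch of the idea, not the actual sweep test; validateEntry and the example entry (including its id) are made up for illustration.

```javascript
// Hypothetical validator mirroring what the sweep tests are described as checking.
function validateEntry(model) {
  const problems = [];
  if (model.moe && !(model.params_total > 0 && model.params_active > 0)) {
    problems.push(`${model.id}: MoE entry needs params_total and params_active`);
  }
  if (model.attention === "MLA") {
    const mla = model.mla ?? {};
    for (const field of ["kv_lora_rank", "qk_rope_head_dim"]) {
      if (!(mla[field] > 0)) problems.push(`${model.id}: MLA entry needs mla.${field}`);
    }
  }
  return problems;
}

const incomplete = {
  id: "example/moe-mla-model", // made-up id for illustration
  params_total: 100e9,
  params_active: null,         // missing for a MoE entry
  attention: "MLA",
  mla: { kv_lora_rank: 512 },  // qk_rope_head_dim missing
  moe: { total_experts: 64, active_experts: 6 },
};
console.log(validateEntry(incomplete).length); // 2
```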

UI

The page is two columns:

Left column stacks three cards:

  • Model — input box with fuzzy autocomplete; below it a single based on: <name> line showing what the calculator resolved your input to.
  • Glossary (auto-shown when the input contains decodable tokens) — one entry per row, term in mono on top, description in muted below.
  • Form controls for quantization, context length, batch, and LoRA toggle.

Right column is the output card:

  • Total Usage ≈ N GB headline with a weights X + KV Y sub-line.
  • Stacked breakdown bar showing weights vs KV cache. KV segment turns orange when it exceeds weights (the "long context dominates" cliff).
  • GPU fit list grouped into Nvidia enterprise, Nvidia consumer, and Apple (with chip names like M5 Max · 64 GB, M3 Ultra · 512 GB). Each row shows required-vs-usable plus headroom or shortfall.

A ?m=<name>&q=<quant>&c=<ctx>&b=<batch>&lora=1 query string round-trips the form state, so a configuration is a shareable URL.
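A round trip through that query string might look like the sketch below, using the standard URLSearchParams API. The helper names, the state object shape, and the fallback defaults are assumptions; only the parameter names (m, q, c, b, lora) come from the text above.

```javascript
// Hypothetical helpers; parameter names follow the query string above.
function encodeState(state) {
  const params = new URLSearchParams({
    m: state.model,
    q: state.quant,
    c: String(state.ctx),
    b: String(state.batch),
  });
  if (state.lora) params.set("lora", "1");
  return "?" + params.toString();
}

function decodeState(search) {
  const params = new URLSearchParams(search);
  return {
    model: params.get("m") ?? "",
    quant: params.get("q") ?? "BF16",     // default is an assumption
    ctx: Number(params.get("c") ?? 8192), // default is an assumption
    batch: Number(params.get("b") ?? 1),
    lora: params.get("lora") === "1",
  };
}

const shared = encodeState({
  model: "meta-llama/Llama-3.1-70B", quant: "Q4_K_M", ctx: 32768, batch: 1, lora: false,
});
console.log(decodeState(shared).model); // meta-llama/Llama-3.1-70B
```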

Coverage

The hardcoded table covers, as of the last refresh:

  • Llama 2 (7B/13B/70B), 3 (8B/70B), 3.1 (8B/70B/405B), 3.2 (1B/3B), 3.3 (70B), 4 (Scout 17B/16E, Maverick 17B/128E)
  • Qwen 2.5 (0.5B–72B), 3 (8B/14B/32B + 30B-A3B/235B-A22B MoE), 3.5 (4B/9B), 3.6 (27B + 35B-A3B MoE) — Qwen 3.5/3.6 use hybrid linear/full attention
  • Mistral lineage — Mistral 7B, Mixtral 8×7B / 8×22B, Mistral Large 2 / 3 (MLA-MoE 675B), Mistral Small 4 (MLA-MoE 119B), Ministral 3 (3B/8B/14B), Magistral Small (reasoning)
  • DeepSeek V2 / V2-Lite / V3 / V3.1 / V3.2 / R1 (MLA-MoE), V4-Flash (MQA + Q-LoRA, 1M ctx)
  • MiniMax M2, M2.7 (240B / ~10B-active MoE, 256 experts top-8, 200k ctx)
  • Gemma 2 (2B/9B/27B), 3 (4B/12B/27B), 4 (31B + 26B-A4B MoE) — Gemma uses sliding-window
  • Phi 3 / 3.5-MoE / 4
  • IBM Granite 4.1 (3B/8B/30B), Granite Vision 4.1, Granite Guardian 4.1
  • GPT-OSS 20B / 120B
  • Yi 34B
  • Cohere Command R / R+

70 entries total. Family-level fallback patterns also recognize unknown versions in any of the families above plus a generic MiniMax M*.

Tests

node --test ram.test.js gpus.test.js models.test.js

55 tests; the math is the product. Every assertion in ram.test.js cites where the expected number comes from. The most important one is the MLA correctness guard:

DeepSeek-V3 MLA @ 32k BF16: KV ~2 GiB, NOT ~100 GiB

If this ever starts failing, the fit table will be telling people they need ~100 GiB of GPU just for KV cache when in reality 2 GiB suffices. That's the single bug worth blocking a release for.

Deploy

# One-time setup (already done for the live site):
#   - Pages project: skuld   (originally created as llm-zoleb, renamed via API;
#                             the .pages.dev system subdomain is sticky and
#                             is still llm-zoleb.pages.dev — see caveat below)
#   - Worker: skuld-hf
#   - KV namespace: LLM_ZOLEB_HF_CACHE (binding name: CACHE)
#   - Worker secret: HF_TOKEN
#   - DNS: CNAME skuld → llm-zoleb.pages.dev (proxied)
#   - Access: per-host "skuld.zoleb.com (public bypass)" app overrides the
#             *.zoleb.com wildcard policy so the site is publicly reachable

# Worker:
cd worker
export CLOUDFLARE_ACCOUNT_ID=...
export CLOUDFLARE_API_TOKEN=...   # needs Workers/KV/Routes scopes
npx wrangler deploy

# Pages (rsync to a tmp dir to exclude tests + node_modules + worker/):
mkdir -p /tmp/skuld-deploy
rsync -a --delete \
  --exclude='*.test.js' --exclude='worker' --exclude='node_modules' \
  --exclude='.git' --exclude='package*.json' --exclude='README.md' \
  ./ /tmp/skuld-deploy/
npx --prefix worker wrangler pages deploy /tmp/skuld-deploy \
  --project-name=skuld --branch=main --commit-dirty=true
rm -rf /tmp/skuld-deploy

Wrangler is pinned to v3 in worker/package.json because the host VPS runs Node 20 and v4 requires Node 22+. Upgrade Node and Wrangler together.

Pages rename caveat

If you rename the Cloudflare Pages project (PATCH …/pages/projects/<name> with {"name": "<new>"}):

  1. The display name updates and the project shows up under the new name in the dashboard and CLI.

  2. The system-assigned .pages.dev subdomain stays the same — it does NOT migrate to <new>.pages.dev. CNAMEs pointing at the new subdomain hit Error 1014 (CNAME Cross-User Banned) because the new hostname is registered to a different account / doesn't exist yet.

  3. Custom domains attached before the rename flip to status: "deactivated" and stop serving (you'll see Error 522 from Cloudflare). Detach + reattach the domain to re-verify it under the renamed project:

    curl -X DELETE -H "Authorization: Bearer $TOKEN" \
      "https://api.cloudflare.com/client/v4/accounts/$ACCT/pages/projects/<new>/domains/<domain>"
    curl -X POST -H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" \
      "https://api.cloudflare.com/client/v4/accounts/$ACCT/pages/projects/<new>/domains" \
      -d '{"name":"<domain>"}'

    Wait for status: active. The CNAME should still target the original <old>.pages.dev subdomain.

Things to know

  • Mac default reserve. Apple Silicon "usable" is vram × 0.75. Models that fit 512 GB on paper may not fit out of the box; raising the limit with sudo sysctl iogpu.wired_limit_mb may be needed first. The fit list silently applies the 75% factor.
  • M5 Ultra hasn't shipped. The 192/256/512 GB tiers are labeled M3 Ultra because no Ultra-class M4 or M5 exists yet. Update gpus.js when one does.
  • Hybrid attention IS handled for Qwen 3.5 / 3.6 (linear + full) and Gemma 2 / 3 / 4 (sliding window) via the kv_layers and sliding_window fields. Pure-Mamba and Mamba-Transformer hybrids (Granite 4.0-H, etc.) are not yet handled — adding kv_layers: 0 would zero out their KV but doesn't account for the SSM state that does exist.
  • Extended context (256k / 512k / 1M) hides behind a toggle. At those ranges KV cache dwarfs weights for almost any model.
  • Native vs extended context. Each model entry has max_context reflecting max_position_embeddings. Moving the context slider past that value requires RoPE/YaRN extension, which usually degrades quality; the slider hint flags it.
  • Auto-quant. When a tensor-type suffix is detected in the input (-bf16, -fp8, -Q4_K_M, -GGUF, -GPTQ, …), the dropdown is set automatically. Manual changes stick — auto-set only fires when the input changes.

License

MIT
