What happens to next-token probabilities when you change one knob?
π§ Visualize token-by-token sampling with chat templates, nucleus filtering, constrained decoding, and attribution heatmaps in one local app β‘
Quick Start Β· How It Works Β· API
Sampling is usually a black box. You tweak temperature or top-p and hope for better text, without seeing what changed at the token level.
The Solution: This demo exposes every generation step: prompt/completion tokens, pre/post-sampling probabilities, nucleus pruning, forced tokens, and token attribution to input context.
The Result: You can explain each generated token with concrete evidence, in a browser, with a backend you can read in one file.
- Visible special tokens across prompt, completion stack, and distribution bars.
- Chat-template-aware tokenization (when the model tokenizer provides a template).
- Side-by-side probability display:
p_base: raw model probability.p_final: probability after temperature, penalties, constraints, and nucleus filtering.
- Pre-nucleus top-k rendering where excluded tokens remain visible (
kept = false,p_final = 0). - Click-to-force token from distribution, or force arbitrary text append.
- Editable completion path:
- Click last completion token to delete.
- Pick alternate candidate on historical token overlays to rewind and branch.
- Branching workflow (fork prompt/completion state and compare outcomes).
- Attribution heatmaps on hover for generated tokens and candidate next tokens.
- Attention attribution CPU fallback when attention scores are unavailable on active device.
- LLM on/off toggle that greys out model effects and clears distribution output.
- Constrained decoding using Outlines + llguidance:
- Multiple choice
- Regex
- CFG
- JSON schema
- uv installed.
uv run server.pyThen open:
http://127.0.0.1:8000
- Wait for the default model to finish loading (
Qwen/Qwen2.5-0.5B-Instructunless overridden). - Type a prompt in
Input 1. - Click
Stepto append one token at a time. - Hover generated tokens or distribution candidates to inspect attribution.
- Change temperature/top-p and watch
p_finalshift in real time.
- Build model input IDs from prompt + completion IDs.
- Optionally wrap messages with tokenizer chat template (
system+user) andadd_generation_prompt=True. - Compute next-token logits.
- Apply repetition, presence, and frequency penalties.
- Apply temperature.
- Optionally apply constrained decoding mask.
- Compute distribution:
- Display top-k from pre-nucleus probabilities.
- Compute nucleus keep-set from post-constraint distribution.
- Set
p_final = 0for tokens outside nucleus (or invalid under constraints).
- Select token via greedy, sampling, forced token ID, or forced text append.
- Methods:
attention,saliency,input_x_gradient,integrated_gradients. - Attribution targets one completion token at a time.
- If attention attribution is unsupported/unavailable on GPU/MPS, the server retries on CPU automatically.
Default decoding params:
{
"greedy": false,
"temperature": 0.7,
"top_p": 0.8,
"repetition_penalty": 1.05,
"presence_penalty": 0.0,
"frequency_penalty": 0.0,
"seed": null,
"stop_sequences": []
}Presets in UI: greedy, balanced, reliable, creative, custom.
| Method | Path | Purpose |
|---|---|---|
GET |
/api/model/status |
Read current model/device/template status |
POST |
/api/model/load |
Load a Hugging Face model by repo ID |
POST |
/api/preview |
Compute prompt/completion tokens + next-token distribution |
POST |
/api/step |
Append one token (sampled/greedy/forced) or force text |
POST |
/api/attribution |
Compute token-level attribution scores |
POST /api/model/load
{ "model_id": "Qwen/Qwen2.5-0.5B-Instruct" }POST /api/preview / POST /api/step
{
"branch_id": "default",
"prompt": "The capital of France is",
"system_prompt": "You are a helpful assistant.",
"use_chat_template": true,
"completion_ids": [],
"top_k": 10,
"params": {
"greedy": false,
"temperature": 0.7,
"top_p": 0.8,
"repetition_penalty": 1.05,
"presence_penalty": 0.0,
"frequency_penalty": 0.0,
"seed": null,
"stop_sequences": []
},
"constraint": {
"enabled": false,
"type": "multiple_choice",
"schema": "Yes\nNo\nMaybe"
}
}Additional POST /api/step fields:
{
"reset_rng": false,
"force_token_id": null,
"force_text": null
}POST /api/attribution
{
"branch_id": "default",
"prompt": "The capital of France is",
"system_prompt": "You are a helpful assistant.",
"use_chat_template": true,
"completion_ids": [151645, 271],
"target_index": 1,
"target_token_id": 271,
"method": "attention"
}prompt_tokens,completion_tokens: token pills used in UI (withspecial: true|false).dist.tokens[].p_base: raw probability.dist.tokens[].p_final: post-sampling probability (0 when excluded).dist.tokens[].kept: nucleus inclusion flag.dist.tokens[].valid: constraint validity flag.appended_meta: per-appended-token snapshot (distribution + selection flags at append time).
.
βββ index.html # Entire frontend (HTML/CSS/JS), no build step
βββ server.py # FastAPI server + model loading + sampling/attribution APIs
| Variable | Default | Description |
|---|---|---|
LLM_DEMO_DEFAULT_MODEL |
Qwen/Qwen2.5-0.5B-Instruct |
Model loaded on startup |
LLM_DEMO_LOG_LEVEL |
info |
Server log level |
Persisted UI state:
llmDemo.paramsllmDemo.modelIdllmDemo.llmEnabledllmDemo.decodingCollapsedllmDemo.stepSpeedCollapsedllmDemo.attrCollapsedllmDemo.attrMethodllmDemo.constraintEnabledllmDemo.constraintTypellmDemo.constraintSchemallmDemo.constraintCollapsedllmDemo.topKllmDemo.stepSpeedMs
