Local-first context compression for AI coding tools. Reduces redundant tokens by compressing conversation history across every LLM call.
Every time an AI coding tool sends a request to an LLM, it re-sends the same context: who you are, what you're working on, your preferences, your project structure. This redundancy costs real money and eats into context windows.
Engram reduces it. It runs locally as a lightweight daemon and compresses your conversation context (message history, tool results, responses) across every LLM call. The result: smaller prompts, lower costs, and more room in the context window for what actually matters.
Note: Identity compression (CLAUDE.md / system prompt codebook compression) is currently disabled for Anthropic/Claude due to compatibility issues. System prompts pass through verbatim. Context compression and redundancy control are fully active.
Engram applies two active compression stages:
- Context compression: Older conversation history is collapsed into a [CONTEXT_SUMMARY] block at the local HTTP proxy while the most recent turns remain verbatim.
- Redundancy control: Large tool outputs are checked for repeated content so Claude can respond with concise delta-only summaries instead of restating everything.
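As a rough illustration of the redundancy-control idea, the sketch below drops lines of a tool output that were already seen earlier in the session and keeps only the new ones. The function name `delta_only` and the line-hashing approach are assumptions made for illustration; the source does not describe how Engram's actual check works.

```python
import hashlib
from typing import List, Set, Tuple


def delta_only(output: str, seen: Set[str]) -> Tuple[str, int]:
    """Return the lines of `output` not seen before, plus a count of skipped lines.

    Hypothetical helper: `seen` holds hashes of lines from earlier tool outputs
    and is updated in place.
    """
    fresh: List[str] = []
    skipped = 0
    for line in output.splitlines():
        stripped = line.strip()
        key = hashlib.sha256(stripped.encode("utf-8")).hexdigest()
        if stripped and key in seen:
            skipped += 1  # repeated, non-blank line: leave it out of the delta
            continue
        seen.add(key)
        fresh.append(line)
    return "\n".join(fresh), skipped


seen_hashes: Set[str] = set()
first, _ = delta_only("PASS test_a\nPASS test_b", seen_hashes)
second, skipped = delta_only("PASS test_a\nPASS test_b\nFAIL test_c", seen_hashes)
print(second, skipped)
```

On the second call only the `FAIL test_c` line survives, because the two `PASS` lines were already recorded.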
flowchart LR
A["Claude Code request<br/>15 prior messages"] --> B["Engram Proxy"]
B --> C["Keep recent window<br/>last 8 messages"]
B --> D["Compress older head<br/>first 7 messages"]
D --> E["[CONTEXT_SUMMARY]<br/>user: ...<br/>assistant: ..."]
C --> F["Tail messages unchanged"]
E --> G["Rewritten request"]
F --> G
G --> H["Anthropic / Claude API"]
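The flowchart above can be sketched in a few lines of Python. This is a minimal sketch, not Engram's implementation: `compress_history` is a hypothetical name, the per-message truncation is a stand-in for whatever summarization the proxy really performs, and placing the summary in a `user` message is an assumption.

```python
from typing import Dict, List

Message = Dict[str, str]


def compress_history(messages: List[Message], window_size: int = 8,
                     max_chars: int = 80) -> List[Message]:
    """Collapse everything before the last `window_size` messages into one block."""
    if window_size < 1:
        raise ValueError("window_size must be at least 1")
    if len(messages) <= window_size:
        return list(messages)  # nothing is old enough to compress
    split = len(messages) - window_size
    head, tail = messages[:split], messages[split:]
    # Stand-in summarizer (assumption): one truncated line per older message.
    lines = [f"{m['role']}: {m['content'][:max_chars]}" for m in head]
    summary = {"role": "user", "content": "[CONTEXT_SUMMARY]\n" + "\n".join(lines)}
    return [summary] + tail  # tail messages are passed through unchanged


history = [
    {"role": "user" if i % 2 == 0 else "assistant", "content": f"message {i}"}
    for i in range(15)
]
rewritten = compress_history(history)
print(len(history), "->", len(rewritten))  # 15 -> 9
```

With 15 prior messages and a window of 8, the first 7 collapse into a single summary message and the last 8 are kept verbatim, matching the numbers in the diagram.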
| Metric | Value |
|---|---|
| Context compression | ~40-60% token reduction |
| Startup overhead | <50ms |
| Memory footprint | ~30MB resident |
# Install via Homebrew (macOS/Linux)
brew install pythondatascrape/tap/engram
# Or download a release binary
curl -fsSL https://github.com/pythondatascrape/engram/releases/latest/download/engram_$(uname -s | tr A-Z a-z)_$(uname -m | sed 's/x86_64/amd64/').tar.gz | tar xz
sudo mv engram /usr/local/bin/
# Or install from source
go install github.com/pythondatascrape/engram/cmd/engram@latest
# Install for Claude Code — one command, no manual steps
engram install --claude-code
# Restart Claude Code and start using Engram immediately

Starting in v0.3.0, a single install command handles everything:
- Copies the Engram plugin into ~/.claude/plugins/
- Registers the statusline in ~/.claude/settings.json
- Creates ~/.engram/engram.yaml with working defaults (if not already present)
- Installs the daemon as a background service (launchd on macOS, systemd on Linux)
- Starts the service immediately
- Verifies the daemon socket and proxy are reachable
- Exits non-zero if any step fails — no silent partial installs
After install, restart Claude Code. No engram serve step is required for normal use.
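If you want to double-check the files the installer is described as creating, a small script can look for them. This is only a sketch under assumptions: `check_install` is a hypothetical helper, the paths are the defaults mentioned in this README, and file existence says nothing about whether the daemon is actually reachable (`engram status` is the supported way to confirm that).

```python
from pathlib import Path
from typing import Dict


def check_install(home: Path) -> Dict[str, bool]:
    """Report which of the default install artifacts exist under `home`."""
    expected = {
        "plugin directory": home / ".claude" / "plugins",
        "Claude settings": home / ".claude" / "settings.json",
        "Engram config": home / ".engram" / "engram.yaml",
        "daemon socket": home / ".engram" / "engram.sock",
    }
    return {label: path.exists() for label, path in expected.items()}


for label, present in check_install(Path.home()).items():
    print(f"{'ok' if present else 'missing':8} {label}")
```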
| Command | Description |
|---|---|
| engram install --claude-code | Zero-touch install for Claude Code: daemon, config, and plugin in one step |
| engram install --openclaw | Install for OpenClaw |
| engram install | Auto-detect and install for all supported clients |
| engram analyze | Analyze your project and show compression opportunities |
| engram advisor | Show optimization recommendations based on session data |
| engram serve | Start the compression daemon manually (advanced) |
| engram status | Show daemon status, active sessions, and savings |
Every command supports --help for detailed usage.
| Flag | Default | Description |
|---|---|---|
| --config | ~/.engram/engram.yaml | Path to configuration file |
| --socket | ~/.engram/engram.sock | Unix socket path |
engram install --claude-code
# Restart Claude Code; compression is active immediately

Engram registers as a Claude Code plugin, starts the local background service, and routes Claude requests through the local proxy. In a fresh session:
- the SessionStart hook registers the real Claude session ID with the proxy
- CLAUDE.md is collected for identity compression (currently disabled for Anthropic/Claude, so the system prompt passes through verbatim)
- POST /v1/messages requests are rewritten before being forwarded upstream, so older history becomes a [CONTEXT_SUMMARY] block
- large tool outputs go through Engram's redundancy check so Claude can avoid re-sending repeated output
For end users, the flow is: install, restart Claude Code, and start working.
Manual install (if you prefer explicit steps):
engram install --claude-code # installs plugin + daemon
engram status                  # confirm daemon is running

Claude Code Skills
| Skill | Invocation | Description |
|---|---|---|
| Codebook Manager | /engram:codebook | Show, diff, init, or validate your active codebook |
| Codebook Wizard | /engram:engram-codebook | Interactive wizard that builds an identity + prompt codebook from scratch, scoped to session, project, or global |
| Token Report | /engram:report | Generate a session token savings report |
| Settings | /engram:config | Edit plugin configuration interactively |
The Codebook Wizard (/engram:engram-codebook) is the fastest way to wire up compression for a new project. It reads your CLAUDE.md, derives identity dimensions, builds a prompt vocabulary from your project's domain language, and writes .engram-codebook.yaml and engram.yaml in about two minutes. Behavioral dimensions that derive_codebook misses in prose-style CLAUDE.md files are pre-populated, and you can review them before anything is written.
engram install --openclaw

OpenClaw support is currently plugin installation only. The zero-touch daemon, readiness verification, and automatic proxy wiring shipped for Claude Code first.
For custom integrations, Engram provides thin client SDKs in three languages. All connect to the local daemon over a Unix socket.
Python:
import asyncio
from engram import Engram

async def main():
    async with await Engram.connect() as client:
        result = await client.compress({"identity": "...", "history": [], "query": "..."})

asyncio.run(main())

Go:
client, _ := engram.Connect(ctx, "")
defer client.Close()
result, _ := client.Compress(ctx, map[string]any{...})

Node.js:
import { Engram } from "engram";
const client = await Engram.connect();
const result = await client.compress({identity: "...", history: [], query: "..."});

See the Integration Guide for details.
Engram's config file lives at ~/.engram/engram.yaml and is created automatically during install. The most common settings to adjust:
# ~/.engram/engram.yaml
server:
  port: 4433          # daemon JSON-RPC port
proxy:
  port: 4242          # HTTP proxy port (Claude Code routes through this)
  openai_port: 4243   # OpenAI-compatible proxy port (optional, 0 = disabled)
  window_size: 8      # default: keep the last 8 messages verbatim, summarize older history

Run engram serve --help for the full list of options.
When openai_port is set (default: 4243), Engram exposes a second listener that accepts standard OpenAI API calls and applies the same context-compression pipeline before forwarding to OpenAI.
Supported endpoints:
- POST /v1/chat/completions: Chat Completions API (streaming and non-streaming)
- POST /v1/responses: Responses API (streaming and non-streaming)
Quick test:
curl http://localhost:4243/v1/chat/completions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"gpt-4o-mini","messages":[{"role":"user","content":"ping"}]}'

Point an OpenAI SDK at the proxy:
from openai import OpenAI

client = OpenAI(
    api_key="...",
    base_url="http://localhost:4243/v1",
)
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello"}],
)

import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
  baseURL: "http://localhost:4243/v1",
});

Engram compresses messages (or input for the Responses API) using the same window-based summarisation as the Anthropic proxy, then forwards the rest of the request (headers, model, tools, tool_choice, stream, etc.) unmodified to api.openai.com.
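To make the "compress one field, forward the rest" behavior concrete, here is a minimal sketch. The helper `rewrite_body` and the `compress` callback are hypothetical names; the only details taken from this README are the two endpoint paths and the fact that `messages` (or `input`) is the field that gets compressed.

```python
from typing import Any, Callable, Dict, List

Compress = Callable[[List[Any]], List[Any]]


def rewrite_body(path: str, body: Dict[str, Any], compress: Compress) -> Dict[str, Any]:
    """Return a copy of `body` with only the conversation field compressed."""
    # The Responses API carries the conversation in `input`; Chat Completions uses `messages`.
    field = "input" if path.endswith("/responses") else "messages"
    rewritten = dict(body)  # shallow copy: model, tools, stream, etc. are untouched
    if isinstance(body.get(field), list):
        rewritten[field] = compress(body[field])
    return rewritten  # a plain-string `input` is passed through as-is


def keep_last_two(items: List[Any]) -> List[Any]:
    """Toy stand-in for the real window-based summarization."""
    return items[-2:]


request = {
    "model": "gpt-4o-mini",
    "stream": True,
    "messages": [{"role": "user", "content": str(i)} for i in range(5)],
}
forwarded = rewrite_body("/v1/chat/completions", request, keep_last_two)
print(len(request["messages"]), "->", len(forwarded["messages"]))  # 5 -> 2
```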
See the Travelbound demo project for a working example that shows Engram compressing a real project's context from ~4,200 tokens to ~380 tokens.
- Getting Started — Install and configure Engram for your first project
- CLI Reference — Full command documentation with examples
- Integration Guide — Configure Claude Code, OpenClaw, and SDK setup
- Changelog — Release history
Note: Identity compression (codebook-based system prompt compression) is currently disabled for Anthropic/Claude, because Anthropic's API does not currently support the codebook protocol. The codebook workflow described below applies to OpenAI and custom SDK integrations. For Claude Code users, context compression is active automatically and system prompts pass through verbatim.

Engram's context compression is automatic once installed. For SDK integrations, you can also wire in identity compression via the codebook workflow.
Every LLM call re-sends your identity: who you are, what you prefer, how you work. Written as prose, that identity is verbose by design — natural language is redundant. Engram replaces prose with a derived codebook so the same information travels in a fraction of the tokens.
| Format | Example | ~Tokens |
|---|---|---|
| Plain prose | "I prefer concise responses without summaries. I'm a senior Go engineer who has been writing Go for ten years. I'm working on a CLI tool targeting macOS and Linux." | ~45 |
| Codebook (key=value) | role=engineer lang=go exp=10y platform=macos,linux style=concise no_summaries=true | ~14 |
| Savings | | ~69% |
At scale — a full CLAUDE.md, system prompt, and project instructions — this compression reaches 96–98%. That headroom goes directly toward more productive context: longer conversation history, bigger diffs, more tool results.
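The savings figure in the table is plain arithmetic on the approximate token counts. A small sketch (the function name `savings_pct` is just for illustration):

```python
def savings_pct(before_tokens: int, after_tokens: int) -> float:
    """Percentage of tokens removed when going from `before_tokens` to `after_tokens`."""
    if before_tokens <= 0:
        raise ValueError("before_tokens must be positive")
    return 100.0 * (before_tokens - after_tokens) / before_tokens


# Table example: ~45 prose tokens versus ~14 codebook tokens.
print(f"{savings_pct(45, 14):.0f}%")  # 69%
```

The same formula applied to a larger identity, for example 400 tokens reduced to 15, gives roughly 96%, which is where the higher at-scale range comes from.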
Engram's derive_codebook MCP tool scans your CLAUDE.md and project instructions, extracts structured dimensions, and produces a compact codebook. On the first turn, the codebook definition is sent once so the model understands the format. Every subsequent turn references keys only.
# First turn — definition sent once
codebook: role(enum:engineer,researcher) lang(text) style(enum:concise,verbose) no_summaries(bool)
# All subsequent turns — keys only
role=engineer lang=go style=concise no_summaries=true
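The two wire formats above are simple enough to sketch by hand. The helpers below (`definition_line`, `encode`, `decode`) are hypothetical names and not part of Engram's SDK. They reproduce the example strings shown above and assume that values contain no spaces.

```python
from typing import Dict, List, Tuple, Union

Value = Union[str, bool]


def definition_line(dimensions: List[Tuple[str, str]]) -> str:
    """Build the one-time definition sent on the first turn."""
    return "codebook: " + " ".join(f"{name}({kind})" for name, kind in dimensions)


def encode(values: Dict[str, Value]) -> str:
    """Serialize values as space-separated key=value pairs (booleans become true/false)."""
    parts = []
    for key, value in values.items():
        text = str(value).lower() if isinstance(value, bool) else value
        parts.append(f"{key}={text}")
    return " ".join(parts)


def decode(line: str) -> Dict[str, str]:
    """Parse a key=value line back into a dict of strings."""
    return dict(pair.split("=", 1) for pair in line.split())


dims = [
    ("role", "enum:engineer,researcher"),
    ("lang", "text"),
    ("style", "enum:concise,verbose"),
    ("no_summaries", "bool"),
]
identity = {"role": "engineer", "lang": "go", "style": "concise", "no_summaries": True}

print(definition_line(dims))
print(encode(identity))  # role=engineer lang=go style=concise no_summaries=true
```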
You can also define codebooks manually in YAML for domain-specific applications:
# ~/.engram-codebook.yaml
name: my-project
version: 1
dimensions:
  - name: role
    type: enum
    values: [engineer, researcher, designer]
    required: true
  - name: lang
    type: text
    required: true
  - name: style
    type: enum
    values: [concise, verbose]
    required: false
  - name: no_summaries
    type: boolean
    required: false

Pass the override path to derive_codebook and it merges your explicit dimensions with anything auto-detected:
result = await client.compress({
    "identity": open("CLAUDE.md").read(),
    "yamlOverridePath": ".engram-codebook.yaml",
    ...
})

1. Make dimensions specific and bounded. Enum and boolean fields compress best because they map to a single token. Free-text fields still compress, but less aggressively.
# Good: bounded enum
- name: experience
  type: enum
  values: [junior, mid, senior, staff]
# Less efficient: open text
- name: experience
  type: text

2. Prefer required: true for your most-repeated attributes. Required dimensions are always included in the serialized output and always stripped from subsequent prose. Optional ones are only emitted when set.
3. Use check_redundancy before sending responses. The MCP tool flags prose that re-states what the codebook already encodes, so you can strip it before it enters context.
# In a PostToolUse hook or response filter
check = await mcp.check_redundancy(content=llm_response)
if check["redundant"]:
    llm_response = check["stripped"]  # codebook entries removed

4. Run engram analyze after your first session. It reads session data and suggests dimensions you haven't captured yet: common repeated phrases that are good codebook candidates.
engram analyze
# → Suggested dimensions: deadline_pressure (bool), test_style (enum: tdd,bdd,none)

For transparent proxy use in Claude Code, no code changes are needed after engram install --claude-code; requests route through Engram automatically.
For SDK use, replace the direct Anthropic/OpenAI base URL with the local proxy:
Before:
import anthropic

client = anthropic.Anthropic(api_key="...")
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,  # required by the Messages API
    system="You are a concise senior Go engineer. I prefer no trailing summaries. "
           "I'm working on a macOS/Linux CLI. My style preference is direct and brief...",
    messages=[{"role": "user", "content": "..."}]
)

After (proxy):
import anthropic

# Point the SDK's base URL at Engram's local proxy; older history is summarized before it goes upstream
client = anthropic.Anthropic(
    api_key="...",
    base_url="http://localhost:4242",
)

# Same call; Engram compresses older context before forwarding
# (for Claude, the system prompt currently passes through verbatim)
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,  # required by the Messages API
    system="You are a concise senior Go engineer...",
    messages=[{"role": "user", "content": "..."}]
)

After (MCP tools, explicit):
import asyncio

import anthropic
from engram import Engram

anthropic_client = anthropic.Anthropic(api_key="...")

async def main():
    async with await Engram.connect() as client:
        # Derive codebook from your identity text once per session
        codebook = await client.derive_codebook(content=open("CLAUDE.md").read())
        # Compress identity; use the returned block as your system prompt prefix
        compressed = await client.compress_identity()
        # compressed["block"] → "role=engineer lang=go style=concise no_summaries=true"
        response = anthropic_client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=1024,  # required by the Messages API
            system=compressed["block"],  # ~14 tokens instead of ~450
            messages=[{"role": "user", "content": "..."}]
        )

asyncio.run(main())

| Attribute | Plain-Text CLAUDE.md | Engram Codebook |
|---|---|---|
| Format | Natural language prose | key=value pairs |
| Tokens (typical) | 300–800 | 10–30 |
| Sent every turn | Yes | Keys only (definition sent once) |
| LLM interpretability | High (verbose) | High (structured, unambiguous) |
| Redundancy detection | None | check_redundancy strips re-stated values |
| Manual tuning | Edit CLAUDE.md | Edit .engram-codebook.yaml |
| Auto-derived | — | derive_codebook extracts from CLAUDE.md |
| Identity compression | — | ~96–98% (OpenAI/SDK only; disabled for Claude) |
The key insight: plain-text identity is high-fidelity but pays a large per-turn token tax. A codebook pays a one-time definition cost and then transmits identity in a handful of tokens for every turn that follows. On a 50-turn session, a 400-token identity compressed to 15 tokens saves ~19,000 tokens — roughly equivalent to adding 15 pages of usable context back to your window.
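The session-level estimate above is a one-line calculation. In this sketch, `session_savings` is an illustrative name and the optional `definition_tokens` parameter is an assumption that lets you subtract the one-time definition cost if you know it.

```python
def session_savings(turns: int, prose_tokens: int, codebook_tokens: int,
                    definition_tokens: int = 0) -> int:
    """Tokens saved over a session when identity is sent as a codebook on every turn."""
    per_turn = prose_tokens - codebook_tokens
    return turns * per_turn - definition_tokens  # definition is paid once


# 50 turns, 400-token identity compressed to 15 tokens.
print(session_savings(50, 400, 15))  # 19250
```

Subtracting a one-time definition of, say, 40 tokens changes the total only slightly, which is why the figure is quoted as roughly 19,000.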
Engram Source Available License — see LICENSE. Commercial use and Compression Technology require written consent from the author.