RFC: Object Schema Headers (Nest-Collapse) for Keyed Object Collections #290
Hey y'all!
I've been working on a project with Claude Code & I recently found out about TOON, so I decided to see if I could use it to save tokens around corners during hooks & processes. I'm working on a plugin that has a lot of nested JSON structure & found that the token savings from TOON weren't really worth converting like 300-400 files, lol. The savings were maybe 10-13%, or something.
So! I figured, why not ask Claude if there was any practical way the table-collapse y'all use in TOON could be scaled to work as a sort of 'nest-collapse'. After a while, the results came back pretty good, it seems. I'm not a coder per se (I'm using Claude Code to work on a research project), so I'm not the best at seeing if this is 100% up to speed, but I figured I'd send it your way. The document is all generated by Claude, for total transparency.
I considered using it, but I thought that it might be better to just send it your way & follow the project, rather than implement this fork & watch your beta from the sidelines, so I decided not to implement it. The time it would take to convert all the files anyways, then, essentially, not be able to update without reforking any updates y'all made, didn't seem to make a lot of sense.
So, this is for you! If you can do something with it, great! If not, no harm, no foul.
Cheers, friends, & good luck with everything.
metafish.
---
# RFC: Object Schema Headers (Nest-Collapse)
Target: TOON spec v4.0
Status: Proposal with working prototype and test results
Author: metafish
Date: 2026-03-24
Problem
TOON v3.0's table-collapse ([N]{fields}:) eliminates redundant key repetition in arrays of uniform objects. This is the format's strongest compression feature. But the JSON data model uses keyed objects as containers at least as often as arrays. Configuration registries, catalogs, keyed record stores, and nested state files all use the pattern:
{
  "entries": {
    "entry-a": { "status": "active", "count": 5, "label": "Alpha" },
    "entry-b": { "status": "paused", "count": 0, "label": "Beta" },
    "entry-c": { "status": "active", "count": 12, "label": "Gamma" }
  }
}
TOON v3.0 encodes this as:
entries:
  entry-a:
    status: active
    count: 5
    label: Alpha
  entry-b:
    status: paused
    count: 0
    label: Beta
  entry-c:
    status: active
    count: 12
    label: Gamma
Every key name (status, count, label) repeats N times. With 150 entries and 13 primitive fields each, that is ~1,950 redundant key tokens. Table-collapse cannot help because the container is an object, not an array.
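The repetition count can be checked with a few lines of Python. This is a minimal sketch, not part of the prototype: `redundant_key_occurrences` is a hypothetical helper, and it counts key occurrences rather than tokenizer tokens, so it only approximates the token figures quoted here.

```python
from collections import Counter

def redundant_key_occurrences(container: dict) -> int:
    """Count every child-key occurrence beyond the first for each distinct key."""
    counts = Counter(key for child in container.values() for key in child)
    return sum(n - 1 for n in counts.values())

entries = {
    "entry-a": {"status": "active", "count": 5, "label": "Alpha"},
    "entry-b": {"status": "paused", "count": 0, "label": "Beta"},
    "entry-c": {"status": "active", "count": 12, "label": "Gamma"},
}
print(redundant_key_occurrences(entries))  # 6
```

For a synthetic collection of 150 children with 13 fields each, the same function returns 1,937 (149 x 13), which is in line with the ~1,950 estimate.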
Scale of the Problem
We tested against a real-world corpus of 28 JSON files used in an LLM plugin system (session state, manifests, registries, schemas, hook configs, test fixtures). The three largest files -- a 154-entry source manifest (130K tokens), a 318-entry cross-reference registry (32K tokens), and a session state file (9K tokens) -- account for 91% of per-session token load.
TOON v3.0 saves 12.8-15.2% on the two largest files and 9.8% on the session state file. Token decomposition shows 12.2% of the manifest's JSON tokens are redundant key names that TOON currently cannot eliminate.
Proposed Extension
Syntax
Add one production rule:
; Existing
header = [key] bracket-seg [fields-seg] ":"
bracket-seg = "[" 1*DIGIT [delimsym] "]"
fields-seg = "{" fieldname *( delim fieldname ) "}"
; New
header = [key] ( bracket-seg [fields-seg] / obj-schema ) ":"
obj-schema = fields-seg "*"
The * suffix on a fields segment without a bracket segment distinguishes object schema headers from array table headers. No ambiguity: [N]{fields}: is an array, {fields}*: is an object schema.
Encoding
key{f1,f2,...,fN}*: declares that children of key are objects whose primitive fields map positionally to the header.
entries{status,count,label}*:
  entry-a: active,5,Alpha
  entry-b: paused,0,Beta
  entry-c: active,12,Gamma
Three entries, three fields declared once, zero key repetition.
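A minimal encoder sketch for this simplest case, in Python. It is not the prototype's code: `encode_object_schema` is a hypothetical name, and it assumes every child has exactly the same keys, all values are strings or integers, no value needs quoting, and indentation is two spaces.

```python
def encode_object_schema(key: str, container: dict) -> str:
    """Encode a keyed collection of uniform, all-primitive objects.

    Assumptions (not from the spec): identical key sets in every child,
    str/int values only, and no value that would require quoting.
    """
    children = list(container.items())
    fields = list(children[0][1].keys())
    for name, child in children:
        if list(child.keys()) != fields:
            raise ValueError(f"child {name!r} does not match the schema")
    # Header declares the field names once, followed by the * marker.
    lines = [key + "{" + ",".join(fields) + "}*:"]
    for name, child in children:
        row = ",".join(str(child[f]) for f in fields)
        lines.append(f"  {name}: {row}")
    return "\n".join(lines)

entries = {
    "entry-a": {"status": "active", "count": 5, "label": "Alpha"},
    "entry-b": {"status": "paused", "count": 0, "label": "Beta"},
    "entry-c": {"status": "active", "count": 12, "label": "Gamma"},
}
print(encode_object_schema("entries", entries))
```

Run on the three-entry example, this prints the four lines shown above.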
Mixed primitive and non-primitive fields: Only primitive-valued fields go in the header. Non-primitive fields (nested objects, arrays) expand below each child's row at increased indentation:
sources{name,type,version}*:
  alpha: Alpha,library,2.1
    dependencies[2]: beta,gamma
    config:
      debug: false
  beta: Beta,service,1.0
    dependencies[0]:
    config:
      debug: true
Dot-path key folding (composing with S13.4): Nested sub-objects that are themselves uniform and all-primitive can be inlined via dot notation:
sources{name,type,metrics.accuracy,metrics.latency}*:
  alpha: Alpha,library,0.95,12
  beta: Beta,service,0.88,45
Decodes (with expandPaths="safe") to:
{
  "sources": {
    "alpha": { "name": "Alpha", "type": "library", "metrics": { "accuracy": 0.95, "latency": 12 } },
    "beta": { "name": "Beta", "type": "service", "metrics": { "accuracy": 0.88, "latency": 45 } }
  }
}
Absent Sentinel (absentSentinel option)
Real-world keyed collections rarely have perfectly uniform schemas. Optional fields, nullable sub-objects, and evolving record shapes all produce children where some fields exist in most entries but not all. Without a way to represent "field not present" in a positional row, encoders must either:
- (a) Restrict the schema header to only fields present in every child (conservative, loses savings on partial-coverage fields), or
- (b) Use null as a stand-in for absence (lossy -- conflates "field is null" with "field doesn't exist")
We propose an optional absent sentinel ~, governed by a new encoder/decoder option absentSentinel (analogous to keyFolding and expandPaths in S13):
notes{source,status,date,origin}*:
  note-1: design_doc,active,2026-03-24,internal
  note-2: external,merged,2026-03-20,~
~ means the field does not exist in the decoded object for that child. The four-way distinction in a schema row:
| Row content | Decoded as |
|---|---|
| value | The value (type-inferred per S4) |
| (empty between delimiters) | "" (empty string) |
| ~ | Field absent from this child's object |
| null | Null |
| "~" | Literal string "~" (quoted to escape) |
Option semantics (absentSentinel):
| Mode | Encoder behavior | Decoder behavior |
|---|---|---|
| "off" (default) | MUST NOT emit ~. Schema headers include only fields present in all children. | ~ decoded as literal string "~". |
| "on" | MAY emit ~ for absent fields. Schema headers may include fields present in >= 50% of children. | ~ omits the field from the decoded object. |
When absentSentinel is "off", the extension still works -- encoders simply restrict schema headers to universally-present fields and expand optional fields below each row. This is the conservative path and handles most real-world heterogeneity. The sentinel adds incremental value for dot-path-inlined sub-objects that are null or empty in some children, where the alternative is falling back to per-child key repetition for an entire sub-object.
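The row-content table and the two option modes can be sketched as a per-cell decoder. This is an illustrative Python sketch, not the prototype: `ABSENT`, `decode_cell`, and `decode_row` are hypothetical names, the type inference is a simplified stand-in for the spec's rules (booleans and escape sequences are omitted), and cells are assumed to be already split on the delimiter.

```python
import re

ABSENT = object()  # hypothetical marker meaning "leave this key out"

def decode_cell(cell: str, absent_sentinel: str = "off"):
    """Decode one schema-row cell according to the row-content table."""
    if cell == "~":
        # Bare tilde: absent when the option is on, literal string otherwise.
        return ABSENT if absent_sentinel == "on" else "~"
    if cell == "":
        return ""  # empty between delimiters is the empty string
    if len(cell) >= 2 and cell[0] == cell[-1] == '"':
        return cell[1:-1]  # quoted: always a string, so "~" stays a tilde
    if cell == "null":
        return None
    if re.fullmatch(r"-?\d+", cell):
        return int(cell)
    if re.fullmatch(r"-?\d+\.\d+", cell):
        return float(cell)
    return cell

def decode_row(fields, cells, absent_sentinel: str = "off") -> dict:
    """Map cells positionally to header fields, dropping absent ones."""
    row = {}
    for field, cell in zip(fields, cells):
        value = decode_cell(cell, absent_sentinel)
        if value is not ABSENT:
            row[field] = value
    return row

fields = ["source", "status", "date", "origin"]
cells = ["external", "merged", "2026-03-20", "~"]
print(decode_row(fields, cells, "on"))   # no "origin" key
print(decode_row(fields, cells, "off"))  # "origin" is the string "~"
```

With the option on, note-2 decodes without an origin key; with it off, the same row yields origin set to the one-character string "~".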
Our test data: A 154-entry manifest where 16 entries have metrics: null and 43 have metrics: {}. Without the sentinel, the encoder cannot inline metrics.* fields via dot-path (they don't exist in all children), losing ~6 percentage points of savings. With the sentinel, the encoder puts ~,~,~,~,~,~ in those 59 rows and expands metrics: null or metrics: below to preserve the original value. Net effect: 21.5% savings (with sentinel, tested) vs ~15-17% (without, estimated).
Decoding Rules
- Line matches key{fields}*: at depth D -> enter object-schema mode for depth D+1 children
- Each child line name: v1,v2,... at depth D+1:
  - Split values by active delimiter (respecting quoting per S7)
  - Map positionally to header fields
  - ~ values: when absentSentinel is "on", skip field (do not include key in decoded object). When "off", decode as literal string "~".
  - Dot-path fields: unfold via path expansion (S13.4)
- Lines at depth D+2 below a child row: parse as additional key-value pairs, merge into the child object
- Object-schema mode ends when a line at depth <= D is encountered
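The rules above can be sketched as a small Python decoder for a single object-schema block. This is not the prototype: all names are hypothetical, type inference is simplified, keys are assumed to contain no colons, path expansion is always applied, and expanded lines are limited to primitive values or a bare key (nested objects and arrays below a row are left out).

```python
import re

ABSENT = object()  # hypothetical marker: "leave this key out"
HEADER = re.compile(r"^(?P<key>[^\s{]+)\{(?P<fields>[^}]*)\}\*:$")

def split_row(text, delim=","):
    """Split on the delimiter, ignoring delimiters inside double quotes."""
    cells, buf, quoted = [], [], False
    for ch in text:
        if ch == '"':
            quoted = not quoted
        if ch == delim and not quoted:
            cells.append("".join(buf))
            buf = []
        else:
            buf.append(ch)
    cells.append("".join(buf))
    return cells

def infer(cell, absent_sentinel):
    """Simplified stand-in for the spec's type inference."""
    if cell == "~":
        return ABSENT if absent_sentinel == "on" else "~"
    if len(cell) >= 2 and cell[0] == cell[-1] == '"':
        return cell[1:-1]
    if cell in ("null", "true", "false"):
        return {"null": None, "true": True, "false": False}[cell]
    if re.fullmatch(r"-?\d+", cell):
        return int(cell)
    if re.fullmatch(r"-?\d+\.\d+", cell):
        return float(cell)
    return cell

def set_path(obj, dotted, value):
    """Unfold a dot-path field such as 'metrics.accuracy' into nested dicts."""
    *parents, leaf = dotted.split(".")
    for part in parents:
        obj = obj.setdefault(part, {})
    obj[leaf] = value

def decode_schema_block(text, absent_sentinel="off", indent=2):
    lines = [ln for ln in text.splitlines() if ln.strip()]
    depth_of = lambda ln: (len(ln) - len(ln.lstrip(" "))) // indent
    base = depth_of(lines[0])
    m = HEADER.match(lines[0].strip())
    if m is None:
        raise ValueError("not an object schema header")
    fields = m["fields"].split(",")
    result, child = {}, None
    for line in lines[1:]:
        depth = depth_of(line)
        if depth <= base:
            break  # object-schema mode ends
        name, _, rest = line.strip().partition(":")
        rest = rest.strip()
        if depth == base + 1:  # child row: map cells positionally
            cells = split_row(rest)
            if len(cells) != len(fields):
                raise ValueError(f"row {name!r} has {len(cells)} cells")
            child = result[name] = {}
            for field, cell in zip(fields, cells):
                value = infer(cell, absent_sentinel)
                if value is not ABSENT:
                    set_path(child, field, value)
        elif depth == base + 2 and child is not None:
            # Expanded field: bare "key:" is an empty object; the sentinel
            # is never honoured here, so a stray ~ stays a string.
            child[name] = {} if rest == "" else infer(rest, "off")
        else:
            raise NotImplementedError("nested expansions are out of scope")
    return {m["key"]: result}
```

Passing "off" to the inference step for expanded lines keeps the sentinel confined to schema rows, which is the leak the code review described under Reference Implementation had to fix.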
Disambiguation: Empty String vs Empty Object
This is already well-defined in TOON v3.0 but bears emphasizing in the context of expanded (non-schema) fields below a child row:
- key: "" -> empty string value (key present, value is "")
- key: (bare, no value) -> empty object (key present, value is {})
Encoders MUST use key: "" for empty string values in expanded fields. This prevents the decoder from misinterpreting an empty string as an empty object.
In the schema row itself, empty between delimiters is always empty string (per existing S9 inline array semantics). Empty object cannot appear in a schema row because schema fields are primitive-only.
Conformance
Encoder Requirements
- MUST apply object schema headers only when all children of the target object are themselves objects
- MUST include only fields that are primitive-valued in all children where they exist (when absentSentinel is "off") or in >= 50% of children (when absentSentinel is "on")
- MUST expand non-primitive fields below the child's row at increased indentation
- MUST use key: "" (not key:) for empty string values in expanded fields
- When absentSentinel is "on": MUST use ~ for fields absent from a specific child. When a dot-path-inlined sub-object is null or empty, MUST use ~ for all its dot-path fields AND expand the original value (e.g., metrics: null) below the row to preserve it.
- When absentSentinel is "off" (default): MUST NOT emit ~. MUST restrict schema headers to fields present in all children.
- MAY use dot-path notation to inline uniform all-primitive nested sub-objects
- SHOULD apply a cost-benefit gate: only use object schema when (n_children - 1) * n_fields > n_fields + overhead. Token savings must exceed header cost.
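The cost-benefit gate in the last requirement is a single inequality. A minimal Python sketch follows; `schema_header_pays_off` is a hypothetical name, and the default `overhead` of 2 is an assumed placeholder, since the proposal does not fix a value.

```python
def schema_header_pays_off(n_children: int, n_fields: int, overhead: int = 2) -> bool:
    """Return True when key repetitions saved exceed the cost of the header.

    Left side: key occurrences removed (every child after the first).
    Right side: field names written once in the header, plus a fixed
    overhead for the braces and * marker (assumed value, not from the spec).
    """
    return (n_children - 1) * n_fields > n_fields + overhead

print(schema_header_pays_off(3, 3))    # True: 6 > 5
print(schema_header_pays_off(2, 3))    # False: 3 > 5 does not hold
print(schema_header_pays_off(154, 13)) # True
```

Under this assumed overhead, a two-child collection with three fields stays in plain v3.0 form, while the three-entry example already passes the gate.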
Decoder Requirements
- MUST recognize {fields}*: as an object schema header (distinct from [N]{fields}: array header)
- MUST map child row values positionally to header fields
- When absentSentinel is "on": MUST treat ~ as field-absent (do not include key in decoded object)
- When absentSentinel is "off" (default): MUST decode ~ as the literal string "~"
- MUST unfold dot-path fields via path expansion when expandPaths is enabled
- MUST merge expanded fields (lines below child row) into the child object
- MUST apply type inference (S4) to row values and expanded values identically
New Options (extending S13)
| Option | Applies to | Values | Default |
|---|---|---|---|
| absentSentinel | Encoder + Decoder | "off", "on" | "off" |
Test Results
We built a general-purpose encoder and decoder implementing this proposal and tested against 28 real-world JSON files.
Primary Targets
| File | Entries | JSON tokens | TOON v3.0 | TOON + Object Schema | Round-trip |
|---|---|---|---|---|---|
| Source manifest (154 entries, 1270 sub-entries) | 154 | 129,992 | 15.2% savings | 21.5% savings | PASS |
| Cross-reference registry (318 entries, 6 key-set shapes) | 318 | 32,341 | 12.8% savings | 25.9% savings | PASS |
| Session state (heterogeneous) | N/A | 9,244 | 9.8% savings | 9.8% savings | PASS |
The cross-reference registry has 6 distinct key-set shapes across 318 entries (optional fields: origin, cross_references, merged_into, merged_date). The encoder discovers the common primitive fields (source, description, status, date) and puts them in the header. Optional primitive fields expand below each child's row. The absent sentinel was not needed here because optional fields are handled by expansion.
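Field discovery of this kind can be sketched in Python as follows. The function name and the sample records are hypothetical (the field names are borrowed from the registry description, the values are invented for illustration), and the 50% threshold for the "on" mode is taken from the conformance rules. Fields are kept in first-seen order so the header is deterministic.

```python
PRIMITIVES = (str, int, float, bool, type(None))

def discover_header_fields(container: dict, absent_sentinel: str = "off") -> list:
    """Pick header fields: primitive wherever present, with enough coverage."""
    children = list(container.values())
    threshold = 0.5 if absent_sentinel == "on" else 1.0
    seen = {}        # field -> number of children containing it (insertion-ordered)
    blocked = set()  # fields that are non-primitive in at least one child
    for child in children:
        for key, value in child.items():
            seen[key] = seen.get(key, 0) + 1
            if not isinstance(value, PRIMITIVES):
                blocked.add(key)
    return [
        key for key, count in seen.items()
        if key not in blocked and count / len(children) >= threshold
    ]

# Illustrative records only; not taken from the real registry.
base = {"source": "s", "description": "d", "status": "active", "date": "2026-03-24"}
registry = {
    "ref-1": dict(base),
    "ref-2": dict(base, origin="internal"),
    "ref-3": dict(base, merged_into="ref-1", merged_date="2026-03-20"),
    "ref-4": dict(base, origin="external", cross_references=["ref-2"]),
}
print(discover_header_fields(registry))        # the four common fields
print(discover_header_fields(registry, "on"))  # adds "origin" (2 of 4 children)
```

Iterating a dict rather than a set avoids the non-deterministic field ordering mentioned under Reference Implementation.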
Generalization
| Category | Files | Result |
|---|---|---|
| Benefit (>1pp uplift) | 5 | Registries, catalogs, uniform-schema collections |
| Neutral (-0.5 to +1pp) | 15 | Flat configs, primitive arrays, small files |
| Slight degradation | 3 | Small fixtures where header overhead exceeds savings |
| Round-trip failure | 2 | Unrelated list-item decoder bug (not caused by this extension) |
Object schema headers help when the file contains keyed collections of uniform-ish objects. They are correctly neutral on flat structures, arrays, and small files when the encoder applies a cost-benefit gate.
Token Decomposition
For the 130K-token source manifest:
| Token category | Tokens | % |
|---|---|---|
| Actual values | 55,486 | 42.7% |
| Structural overhead (eliminated by TOON v3.0) | 49,368 | 38.0% |
| Redundant key repetition (target of this proposal) | 15,820 | 12.2% |
| Unique identifiers | 9,293 | 7.2% |
TOON v3.0 handles the 38%. This proposal handles the 12.2%. The remaining 49.9% is irreducible content.
What We Tested and Rejected
Schema definitions (@schema S{fields}* declared once, referenced as @S*:): Saves 0.9% on top of object schema headers. Not worth the added spec complexity. Field headers are short and repetition cost is small relative to the key-elimination savings.
Forcing all optional fields into the header via sentinel: We tested putting every optional primitive field in the schema header with ~ for absent entries (instead of the option-gated approach proposed above). This works but adds ~ tokens to rows where fields are rarely present, partially offsetting the savings. The option-gated design (absentSentinel: "off" by default) lets conservative implementations get most of the value by restricting headers to universally-present fields. The sentinel earns its keep specifically when dot-path inlining would otherwise be blocked by null/empty sub-objects in a minority of children.
Backward Compatibility
- New syntax: {fields}*: does not conflict with any existing TOON v3.0 production. A v3.0 decoder encountering this syntax would fail to parse (no silent misinterpretation).
- Existing documents unchanged: No v3.0-valid document changes meaning under this extension.
- Opt-in: Encoders can choose whether to apply object schema headers. A conservative encoder can produce valid TOON v4.0 output identical to v3.0 by never using the feature.
Reference Implementation
A Python prototype (encoder + decoder, ~650 LOC) is available. It implements automatic schema discovery, dot-path key folding, absent sentinel, cost-benefit gating, and full round-trip verification. The implementation has been through code review, which caught and fixed 5 correctness bugs (3 critical: unquoted ~ sentinel leaking into non-schema decode paths, quoted keys with ": " causing wrong key/value splits, and non-deterministic field ordering from set iteration). Post-review, 27/29 files in our test corpus pass round-trip (2 failures are a pre-existing list-item decoder bug unrelated to this extension). We are happy to contribute this as a reference or test fixture.
Appendix: Full Example
JSON Input
{
"version": "2.0",
"catalog": {
"widget-a": {
"name": "Alpha Widget",
"status": "active",
"price": 9.99,
"metrics": { "views": 1200, "sales": 45 },
"tags": ["popular", "new"]
},
"widget-b": {
"name": "Beta Widget",
"status": "discontinued",
"price": 4.50,
"metrics": { "views": 300, "sales": 2 },
"tags": ["clearance"]
},
"widget-c": {
"name": "Gamma Widget",
"status": "active",
"price": 19.99,
"metrics": null,
"tags": []
}
}
}
TOON v3.0 Output (no object schema)
version: "2.0"
catalog:
  widget-a:
    name: Alpha Widget
    status: active
    price: 9.99
    metrics:
      views: 1200
      sales: 45
    tags[2]: popular,new
  widget-b:
    name: Beta Widget
    status: discontinued
    price: 4.5
    metrics:
      views: 300
      sales: 2
    tags[1]: clearance
  widget-c:
    name: Gamma Widget
    status: active
    price: 19.99
    metrics: null
    tags[0]:
TOON v4.0 Output (object schema, absentSentinel: "off")
Without the sentinel, the encoder restricts the header to fields that are primitive in all three children. Since widget-c has metrics: null, the metrics dot-paths are excluded:
version: "2.0"
catalog{name,status,price}*:
  widget-a: Alpha Widget,active,9.99
    metrics:
      views: 1200
      sales: 45
    tags[2]: popular,new
  widget-b: Beta Widget,discontinued,4.5
    metrics:
      views: 300
      sales: 2
    tags[1]: clearance
  widget-c: Gamma Widget,active,19.99
    metrics: null
    tags[0]:
Key repetition eliminated for name, status, price. The metrics sub-object still repeats its keys per child because the encoder cannot inline it without the sentinel.
TOON v4.0 Output (object schema, absentSentinel: "on")
With the sentinel, the encoder can inline metrics.* fields and use ~ for widget-c:
version: "2.0"
catalog{name,status,price,metrics.views,metrics.sales}*:
  widget-a: Alpha Widget,active,9.99,1200,45
    tags[2]: popular,new
  widget-b: Beta Widget,discontinued,4.5,300,2
    tags[1]: clearance
  widget-c: Gamma Widget,active,19.99,~,~
    metrics: null
    tags[0]:
All five header fields (name, status, price, metrics.views, metrics.sales) appear once, in the header, instead of once per child. widget-c uses ~ for absent metrics sub-fields and expands metrics: null below the row to preserve the original null value (distinct from absent -- the key exists, its value is null).