RFC: Object Schema Headers (Nest-Collapse) for Keyed Object Collections #290

@metafishTV

Hey y'all!
I've been working on a project with Claude Code & I recently found out about TOON, so I decided to see if I couldn't use it to save tokens around corners during hooks & processes. I'm working on a plugin that has a lot of nested JSON structure & found that the tokens saved using TOON weren't really worth doing a conversion of like, 3-400 files, lol. The savings were maybe 10-13%, or something.

So! I figured, why not ask Claude if there was any practical way the table-collapse y'all use in TOON could be scaled to work as a sort of 'nest-collapse'. After a while, the results came back pretty good, it seems. I'm not a coder per se (I'm using Claude Code to work on a research project), so I'm not the best at seeing if this is 100% up to speed, but I figured I'd send it your way. The document is all generated by Claude, for total transparency.

I considered using it, but I thought that it might be better to just send it your way & follow the project, rather than implement this fork & watch your beta from the sidelines, so I decided not to implement it. The time it would take to convert all the files anyways, then, essentially, not be able to update without reforking any updates y'all made, didn't seem to make a lot of sense.

So, this is for you! If you can do something with it, great! If not, no harm, no foul.
Cheers, friends, & good luck with everything.
metafish.

---

RFC: Object Schema Headers (Nest-Collapse)

Target: TOON spec v4.0
Status: Proposal with working prototype and test results
Author: metafish
Date: 2026-03-24


Problem

TOON v3.0's table-collapse ([N]{fields}:) eliminates redundant key repetition in arrays of uniform objects. This is the format's strongest compression feature. But the JSON data model uses keyed objects as containers at least as often as arrays. Configuration registries, catalogs, keyed record stores, and nested state files all use the pattern:

{
  "entries": {
    "entry-a": { "status": "active", "count": 5, "label": "Alpha" },
    "entry-b": { "status": "paused", "count": 0, "label": "Beta" },
    "entry-c": { "status": "active", "count": 12, "label": "Gamma" }
  }
}

TOON v3.0 encodes this as:

entries:
  entry-a:
    status: active
    count: 5
    label: Alpha
  entry-b:
    status: paused
    count: 0
    label: Beta
  entry-c:
    status: active
    count: 12
    label: Gamma

Every key name (status, count, label) repeats N times. With 150 entries and 13 primitive fields each, that is ~1,950 redundant key tokens. Table-collapse cannot help because the container is an object, not an array.
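
As a rough check on that figure, the sketch below counts key occurrences rather than tokens (a simplification, since real token counts depend on the tokenizer). The function name is hypothetical and not part of any TOON library.

```python
def key_occurrences(n_children: int, n_fields: int) -> dict:
    """Count how often field names are written in nested vs. header-declared form."""
    nested = n_children * n_fields   # every child repeats every key
    declared_once = n_fields         # a schema header names each key once
    return {
        "nested": nested,
        "declared_once": declared_once,
        "redundant": nested - declared_once,
    }


# 150 entries with 13 primitive fields each, as in the paragraph above.
print(key_occurrences(150, 13))
# {'nested': 1950, 'declared_once': 13, 'redundant': 1937}
```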

Scale of the Problem

We tested against a real-world corpus of 28 JSON files used in an LLM plugin system (session state, manifests, registries, schemas, hook configs, test fixtures). The three largest files -- a 154-entry source manifest (130K tokens), a 318-entry cross-reference registry (32K tokens), and a session state file (9K tokens) -- account for 91% of per-session token load.

TOON v3.0 saves 9.8-15.2% on these files (12.8-15.2% on the two largest). Token decomposition shows 12.2% of the manifest's JSON tokens are redundant key names that TOON currently cannot eliminate.

Proposed Extension

Syntax

Add one production rule (obj-schema) and extend header to accept it:

; Existing
header       = [key] bracket-seg [fields-seg] ":"
bracket-seg  = "[" 1*DIGIT [delimsym] "]"
fields-seg   = "{" fieldname *( delim fieldname ) "}"

; New
header       = [key] ( bracket-seg [fields-seg] / obj-schema ) ":"
obj-schema   = fields-seg "*"

The * suffix on a fields segment without a bracket segment distinguishes object schema headers from array table headers. No ambiguity: [N]{fields}: is an array, {fields}*: is an object schema.
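
To illustrate how a decoder might tell the two header forms apart, here is a minimal regex-based sketch. The patterns are assumptions made for illustration: keys are simplified to "any run without brackets, braces, or colons", and the optional delimiter symbol inside the brackets is taken to be any single non-digit character. A real implementation would follow the full key and quoting grammar.

```python
import re

# Hypothetical, simplified patterns (not the normative grammar).
ARRAY_HEADER = re.compile(
    r'^(?P<key>[^\[\]{}:]*)\[(?P<n>\d+)[^\]\d]?\](?:\{(?P<fields>[^{}]*)\})?:'
)
OBJ_SCHEMA = re.compile(r'^(?P<key>[^\[\]{}:]*)\{(?P<fields>[^{}]+)\}\*:')


def classify_header(line: str) -> str:
    """Return 'object-schema', 'array', or 'other' for one TOON line."""
    text = line.strip()
    if OBJ_SCHEMA.match(text):
        return "object-schema"   # {fields}*: with no bracket segment
    if ARRAY_HEADER.match(text):
        return "array"           # [N]: or [N]{fields}:
    return "other"


for sample in ("entries{status,count,label}*:", "rows[3]{id,name}:", "status: active"):
    print(sample, "->", classify_header(sample))
```

Because the object form requires the `*` and forbids a bracket segment, the two patterns cannot match the same line.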

Encoding

key{f1,f2,...,fN}*: declares that children of key are objects whose primitive fields map positionally to the header.

entries{status,count,label}*:
  entry-a: active,5,Alpha
  entry-b: paused,0,Beta
  entry-c: active,12,Gamma

Three entries, three fields declared once, zero key repetition.
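
A minimal encoder sketch for this simplest case follows. It assumes every child has the same all-primitive fields and that no string needs quoting (no delimiters, colons, or number-like strings); the function names are hypothetical.

```python
def format_cell(value) -> str:
    """Render one primitive as a row cell (quoting rules omitted in this sketch)."""
    if value is None:
        return "null"
    if isinstance(value, bool):          # check bool before int: bool is an int subclass
        return "true" if value else "false"
    return str(value)


def encode_object_schema(key: str, children: dict) -> str:
    """Encode a dict of uniform, all-primitive child dicts as one schema block."""
    if not children:
        raise ValueError("need at least one child to derive the header")
    fields = list(next(iter(children.values())))   # field order from the first child
    lines = [key + "{" + ",".join(fields) + "}*:"]
    for name, child in children.items():
        if set(child) != set(fields):
            raise ValueError(f"child {name!r} does not match the header fields")
        row = ",".join(format_cell(child[f]) for f in fields)
        lines.append(f"  {name}: {row}")
    return "\n".join(lines)


entries = {
    "entry-a": {"status": "active", "count": 5, "label": "Alpha"},
    "entry-b": {"status": "paused", "count": 0, "label": "Beta"},
    "entry-c": {"status": "active", "count": 12, "label": "Gamma"},
}
print(encode_object_schema("entries", entries))
```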

Mixed primitive and non-primitive fields: Only primitive-valued fields go in the header. Non-primitive fields (nested objects, arrays) expand below each child's row at increased indentation:

sources{name,type,version}*:
  alpha: Alpha,library,2.1
    dependencies[2]: beta,gamma
    config:
      debug: false
  beta: Beta,service,1.0
    dependencies[0]:
    config:
      debug: true

Dot-path key folding (composing with S13.4): Nested sub-objects that are themselves uniform and all-primitive can be inlined via dot notation:

sources{name,type,metrics.accuracy,metrics.latency}*:
  alpha: Alpha,library,0.95,12
  beta: Beta,service,0.88,45

Decodes (with expandPaths="safe") to:

{
  "sources": {
    "alpha": { "name": "Alpha", "type": "library", "metrics": { "accuracy": 0.95, "latency": 12 } },
    "beta": { "name": "Beta", "type": "service", "metrics": { "accuracy": 0.88, "latency": 45 } }
  }
}
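
The unfolding step can be sketched as below. This is a simplified stand-in for path expansion, not the spec's exact expandPaths="safe" rules: it splits on dots and raises if a path segment collides with a non-object value. The function name is hypothetical.

```python
def unfold_paths(flat: dict) -> dict:
    """Turn {'metrics.accuracy': 0.95} into {'metrics': {'accuracy': 0.95}}."""
    nested: dict = {}
    for path, value in flat.items():
        parts = path.split(".")
        node = nested
        for part in parts[:-1]:
            node = node.setdefault(part, {})
            if not isinstance(node, dict):
                # A shorter path already stored a primitive here.
                raise ValueError(f"path conflict at {part!r} in {path!r}")
        node[parts[-1]] = value
    return nested


row = {"name": "Alpha", "type": "library", "metrics.accuracy": 0.95, "metrics.latency": 12}
print(unfold_paths(row))
```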

Absent Sentinel (absentSentinel option)

Real-world keyed collections rarely have perfectly uniform schemas. Optional fields, nullable sub-objects, and evolving record shapes all produce children where some fields exist in most entries but not all. Without a way to represent "field not present" in a positional row, encoders must either:

  • (a) Restrict the schema header to only fields present in every child (conservative, loses savings on partial-coverage fields), or
  • (b) Use null as a stand-in for absence (lossy -- conflates "field is null" with "field doesn't exist")

We propose an optional absent sentinel ~, governed by a new encoder/decoder option absentSentinel (analogous to keyFolding and expandPaths in S13):

notes{source,status,date,origin}*:
  note-1: design_doc,active,2026-03-24,internal
  note-2: external,merged,2026-03-20,~

~ means the field does not exist in the decoded object for that child. The five cases in a schema row:

| Row content | Decoded as |
| --- | --- |
| value | The value (type-inferred per S4) |
| (empty between delimiters) | "" (empty string) |
| ~ | Field absent from this child's object |
| null | Null |
| "~" | Literal string "~" (quoted to escape) |

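The row-content cases above can be sketched as a single cell decoder. The type inference here is a simplified stand-in for S4 (escape sequences inside quoted strings are not handled), and `ABSENT` and `decode_cell` are hypothetical names.

```python
import re

ABSENT = object()   # marker meaning "do not add this key to the decoded object"
NUMBER = re.compile(r'-?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?')


def decode_cell(token: str, absent_sentinel: str = "off"):
    """Decode one already-split row cell."""
    if len(token) >= 2 and token[0] == '"' and token[-1] == '"':
        return token[1:-1]                 # quoted: always a string, so "~" stays literal
    if token == "":
        return ""                          # empty between delimiters
    if token == "~":
        return ABSENT if absent_sentinel == "on" else "~"
    if token == "null":
        return None
    if token in ("true", "false"):
        return token == "true"
    if NUMBER.fullmatch(token):
        return float(token) if any(c in token for c in ".eE") else int(token)
    return token                           # anything else is an unquoted string


print(decode_cell("~", "on") is ABSENT, decode_cell("~"), decode_cell('"~"', "on"))
```

Returning a dedicated marker object rather than `None` keeps "absent" distinct from "null" all the way through the decoder.
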
Option semantics (absentSentinel):

| Mode | Encoder behavior | Decoder behavior |
| --- | --- | --- |
| "off" (default) | MUST NOT emit ~. Schema headers include only fields present in all children. | ~ decoded as literal string "~". |
| "on" | MAY emit ~ for absent fields. Schema headers may include fields present in >= 50% of children. | ~ omits the field from the decoded object. |

When absentSentinel is "off", the extension still works -- encoders simply restrict schema headers to universally-present fields and expand optional fields below each row. This is the conservative path and handles most real-world heterogeneity. The sentinel adds incremental value for dot-path-inlined sub-objects that are null or empty in some children, where the alternative is falling back to per-child key repetition for an entire sub-object.
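
One way an encoder might choose header fields under each mode is sketched below. It is an illustration under assumptions, not the prototype's actual discovery logic: fields are kept in first-seen order (a list rather than a set, so output is deterministic), and the function name is hypothetical.

```python
PRIMITIVES = (str, int, float, bool, type(None))


def select_header_fields(children: dict, absent_sentinel: str = "off") -> list:
    """Pick fields eligible for the schema header."""
    order = []
    for child in children.values():
        for field in child:
            if field not in order:
                order.append(field)       # first-seen order, no set iteration

    total = len(children)
    selected = []
    for field in order:
        holders = [c for c in children.values() if field in c]
        if not all(isinstance(c[field], PRIMITIVES) for c in holders):
            continue                      # non-primitive somewhere: expand below rows
        if absent_sentinel == "on":
            eligible = 2 * len(holders) >= total   # present in >= 50% of children
        else:
            eligible = len(holders) == total       # present in every child
        if eligible:
            selected.append(field)
    return selected


notes = {
    "note-1": {"source": "design_doc", "status": "active", "date": "2026-03-24", "origin": "internal"},
    "note-2": {"source": "external", "status": "merged", "date": "2026-03-20"},
}
print(select_header_fields(notes, "off"))
print(select_header_fields(notes, "on"))
```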

Our test data: A 154-entry manifest where 16 entries have metrics: null and 43 have metrics: {}. Without the sentinel, the encoder cannot inline metrics.* fields via dot-path (they don't exist in all children), losing ~6 percentage points of savings. With the sentinel, the encoder puts ~,~,~,~,~,~ in those 59 rows and expands metrics: null or metrics: below to preserve the original value. Net effect: 21.5% savings (with sentinel, tested) vs ~15-17% (without, estimated).

Decoding Rules

  1. Line matches key{fields}*: at depth D -> enter object-schema mode for depth D+1 children
  2. Each child line name: v1,v2,... at depth D+1:
    • Split values by active delimiter (respecting quoting per S7)
    • Map positionally to header fields
    • ~ values: when absentSentinel is "on", skip field (do not include key in decoded object). When "off", decode as literal string "~".
    • Dot-path fields: unfold via path expansion (S13.4)
  3. Lines at depth D+2 below a child row: parse as additional key-value pairs, merge into the child object
  4. Object-schema mode ends when a line at depth <= D is encountered
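
The four rules can be sketched as a small decoder. It makes several simplifying assumptions: two-space indentation, comma as the active delimiter, no quoted cells containing delimiters, child names without colons, and only primitive or empty-object `key: value` lines below a row. Dot-path fields are left as dotted keys here, and all names are hypothetical.

```python
import re

HEADER = re.compile(r'^(?P<key>[^{}:\s]+)\{(?P<fields>[^{}]+)\}\*:$')
NUMBER = re.compile(r'-?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?')


def parse_scalar(tok: str):
    """Simplified stand-in for S4 type inference."""
    if len(tok) >= 2 and tok[0] == '"' and tok[-1] == '"':
        return tok[1:-1]
    if tok == "null":
        return None
    if tok in ("true", "false"):
        return tok == "true"
    if NUMBER.fullmatch(tok):
        return float(tok) if any(c in tok for c in ".eE") else int(tok)
    return tok


def depth_of(line: str, indent: int = 2) -> int:
    return (len(line) - len(line.lstrip(" "))) // indent


def decode_schema_block(lines, absent_sentinel: str = "off") -> dict:
    """Decode one object-schema block whose header is lines[0]."""
    match = HEADER.match(lines[0].strip())
    if not match:
        raise ValueError("first line is not an object schema header")
    depth = depth_of(lines[0])                      # rule 1: header at depth D
    fields = match["fields"].split(",")
    children, current = {}, None
    for line in lines[1:]:
        if not line.strip():
            continue
        d = depth_of(line)
        if d <= depth:                              # rule 4: block ends
            break
        name, _, rest = line.strip().partition(":")
        rest = rest.strip()
        if d == depth + 1:                          # rule 2: positional child row
            cells = rest.split(",")
            if len(cells) != len(fields):
                raise ValueError(f"row {name!r} has {len(cells)} cells, expected {len(fields)}")
            current = {}
            for field, cell in zip(fields, cells):
                if cell == "~" and absent_sentinel == "on":
                    continue                        # field absent for this child
                current[field] = parse_scalar(cell)
            children[name] = current
        elif d == depth + 2 and current is not None:  # rule 3: merge expanded pair
            current[name] = {} if rest == "" else parse_scalar(rest)
        else:
            raise ValueError(f"unsupported nesting in sketch: {line!r}")
    return {match["key"]: children}


doc = [
    "notes{source,status,date,origin}*:",
    "  note-1: design_doc,active,2026-03-24,internal",
    "  note-2: external,merged,2026-03-20,~",
]
print(decode_schema_block(doc, "on"))
```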

Disambiguation: Empty String vs Empty Object

This is already well-defined in TOON v3.0 but bears emphasizing in the context of expanded (non-schema) fields below a child row:

  • key: "" -> empty string value (key present, value is "")
  • key: (bare, no value) -> empty object (key present, value is {})

Encoders MUST use key: "" for empty string values in expanded fields. This prevents the decoder from misinterpreting an empty string as an empty object.

In the schema row itself, empty between delimiters is always empty string (per existing S9 inline array semantics). Empty object cannot appear in a schema row because schema fields are primitive-only.

Conformance

Encoder Requirements

  • MUST apply object schema headers only when all children of the target object are themselves objects
  • MUST include in the header only fields that are primitive-valued in every child where they exist. When absentSentinel is "off", a header field must also be present in all children; when absentSentinel is "on", it must be present in >= 50% of children
  • MUST expand non-primitive fields below the child's row at increased indentation
  • MUST use key: "" (not key:) for empty string values in expanded fields
  • When absentSentinel is "on": MUST use ~ for fields absent from a specific child. When a dot-path-inlined sub-object is null or empty, MUST use ~ for all its dot-path fields AND expand the original value (e.g., metrics: null) below the row to preserve it.
  • When absentSentinel is "off" (default): MUST NOT emit ~. MUST restrict schema headers to fields present in all children.
  • MAY use dot-path notation to inline uniform all-primitive nested sub-objects
  • SHOULD apply a cost-benefit gate: only use object schema when (n_children - 1) * n_fields > n_fields + overhead. Token savings must exceed header cost.
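
The cost-benefit gate in the last bullet can be written directly from the stated inequality. This sketch uses field counts as a proxy for tokens, and the `overhead` default is a placeholder assumption (roughly the braces, `*`, and colon), not a measured value; a real encoder might count tokens with an actual tokenizer.

```python
def should_use_object_schema(n_children: int, n_fields: int, overhead: int = 3) -> bool:
    """Apply the gate (n_children - 1) * n_fields > n_fields + overhead."""
    return (n_children - 1) * n_fields > n_fields + overhead


print(should_use_object_schema(154, 13))   # large uniform collection
print(should_use_object_schema(2, 3))      # tiny fixture
```

With these placeholder numbers the gate accepts the 154-entry manifest and rejects a two-entry fixture, which matches the "slight degradation on small files" case the gate is meant to avoid.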

Decoder Requirements

  • MUST recognize {fields}*: as an object schema header (distinct from [N]{fields}: array header)
  • MUST map child row values positionally to header fields
  • When absentSentinel is "on": MUST treat ~ as field-absent (do not include key in decoded object)
  • When absentSentinel is "off" (default): MUST decode ~ as the literal string "~"
  • MUST unfold dot-path fields via path expansion when expandPaths is enabled
  • MUST merge expanded fields (lines below child row) into the child object
  • MUST apply type inference (S4) to row values and expanded values identically

New Options (extending S13)

| Option | Applies to | Values | Default |
| --- | --- | --- | --- |
| absentSentinel | Encoder + Decoder | "off", "on" | "off" |

Test Results

We built a general-purpose encoder and decoder implementing this proposal and tested against 28 real-world JSON files.

Primary Targets

| File | Entries | JSON tokens | TOON v3.0 | TOON + Object Schema | Round-trip |
| --- | --- | --- | --- | --- | --- |
| Source manifest (154 entries, 1270 sub-entries) | 154 | 129,992 | 15.2% savings | 21.5% savings | PASS |
| Cross-reference registry (318 entries, 6 key-set shapes) | 318 | 32,341 | 12.8% savings | 25.9% savings | PASS |
| Session state (heterogeneous) | N/A | 9,244 | 9.8% savings | 9.8% savings | PASS |

The cross-reference registry has 6 distinct key-set shapes across 318 entries (optional fields: origin, cross_references, merged_into, merged_date). The encoder discovers the common primitive fields (source, description, status, date) and puts them in the header. Optional primitive fields expand below each child's row. The absent sentinel was not needed here because optional fields are handled by expansion.

Generalization

| Category | Files | Result |
| --- | --- | --- |
| Benefit (>1pp uplift) | 5 | Registries, catalogs, uniform-schema collections |
| Neutral (-0.5 to +1pp) | 15 | Flat configs, primitive arrays, small files |
| Slight degradation | 3 | Small fixtures where header overhead exceeds savings |
| Round-trip failure | 2 | Unrelated list-item decoder bug (not caused by this extension) |

Object schema headers help when the file contains keyed collections of uniform-ish objects. They are correctly neutral on flat structures, arrays, and small files when the encoder applies a cost-benefit gate.

Token Decomposition

For the 130K-token source manifest:

| Token category | Tokens | % |
| --- | --- | --- |
| Actual values | 55,486 | 42.7% |
| Structural overhead (eliminated by TOON v3.0) | 49,368 | 38.0% |
| Redundant key repetition (target of this proposal) | 15,820 | 12.2% |
| Unique identifiers | 9,293 | 7.2% |

TOON v3.0 handles the 38%. This proposal handles the 12.2%. The remaining 49.9% is irreducible content.

What We Tested and Rejected

Schema definitions (@schema S{fields}* declared once, referenced as @S*:): Saves 0.9% on top of object schema headers. Not worth the added spec complexity. Field headers are short and repetition cost is small relative to the key-elimination savings.

Forcing all optional fields into the header via sentinel: We tested putting every optional primitive field in the schema header with ~ for absent entries (instead of the option-gated approach proposed above). This works but adds ~ tokens to rows where fields are rarely present, partially offsetting the savings. The option-gated design (absentSentinel: "off" by default) lets conservative implementations get most of the value by restricting headers to universally-present fields. The sentinel earns its keep specifically when dot-path inlining would otherwise be blocked by null/empty sub-objects in a minority of children.

Backward Compatibility

  • New syntax: {fields}*: does not conflict with any existing TOON v3.0 production. A v3.0 decoder encountering this syntax would fail to parse (no silent misinterpretation).
  • Existing documents unchanged: No v3.0-valid document changes meaning under this extension.
  • Opt-in: Encoders can choose whether to apply object schema headers. A conservative encoder can produce valid TOON v4.0 output identical to v3.0 by never using the feature.

Reference Implementation

A Python prototype (encoder + decoder, ~650 LOC) is available. It implements automatic schema discovery, dot-path key folding, absent sentinel, cost-benefit gating, and full round-trip verification. The implementation has been through code review, which caught and fixed 5 correctness bugs (3 critical: unquoted ~ sentinel leaking into non-schema decode paths, quoted keys with ": " causing wrong key/value splits, and non-deterministic field ordering from set iteration). Post-review, 27/29 files in our test corpus pass round-trip (2 failures are a pre-existing list-item decoder bug unrelated to this extension). We are happy to contribute this as a reference or test fixture.


Appendix: Full Example

JSON Input

{
  "version": "2.0",
  "catalog": {
    "widget-a": {
      "name": "Alpha Widget",
      "status": "active",
      "price": 9.99,
      "metrics": { "views": 1200, "sales": 45 },
      "tags": ["popular", "new"]
    },
    "widget-b": {
      "name": "Beta Widget",
      "status": "discontinued",
      "price": 4.50,
      "metrics": { "views": 300, "sales": 2 },
      "tags": ["clearance"]
    },
    "widget-c": {
      "name": "Gamma Widget",
      "status": "active",
      "price": 19.99,
      "metrics": null,
      "tags": []
    }
  }
}

TOON v3.0 Output (no object schema)

version: "2.0"
catalog:
  widget-a:
    name: Alpha Widget
    status: active
    price: 9.99
    metrics:
      views: 1200
      sales: 45
    tags[2]: popular,new
  widget-b:
    name: Beta Widget
    status: discontinued
    price: 4.5
    metrics:
      views: 300
      sales: 2
    tags[1]: clearance
  widget-c:
    name: Gamma Widget
    status: active
    price: 19.99
    metrics: null
    tags[0]:

TOON v4.0 Output (object schema, absentSentinel: "off")

Without the sentinel, the encoder restricts the header to fields that are primitive in all three children. Since widget-c has metrics: null, the metrics dot-paths are excluded:

version: "2.0"
catalog{name,status,price}*:
  widget-a: Alpha Widget,active,9.99
    metrics:
      views: 1200
      sales: 45
    tags[2]: popular,new
  widget-b: Beta Widget,discontinued,4.5
    metrics:
      views: 300
      sales: 2
    tags[1]: clearance
  widget-c: Gamma Widget,active,19.99
    metrics: null
    tags[0]:

Key repetition eliminated for name, status, price. The metrics sub-object still repeats its keys per child because the encoder cannot inline it without the sentinel.

TOON v4.0 Output (object schema, absentSentinel: "on")

With the sentinel, the encoder can inline metrics.* fields and use ~ for widget-c:

version: "2.0"
catalog{name,status,price,metrics.views,metrics.sales}*:
  widget-a: Alpha Widget,active,9.99,1200,45
    tags[2]: popular,new
  widget-b: Beta Widget,discontinued,4.5,300,2
    tags[1]: clearance
  widget-c: Gamma Widget,active,19.99,~,~
    metrics: null
    tags[0]:

All five schema fields (name, status, price, metrics.views, metrics.sales) appear once, in the header, instead of once per child. widget-c uses ~ for absent metrics sub-fields and expands metrics: null below the row to preserve the original null value (distinct from absent -- the key exists, its value is null).
