RFC: Object Schema Headers (Nest-Collapse) for Keyed Object Collections #290

Hey y'all!
I've been working on a project with Claude Code & I recently found out about TOON, so I decided to see if I could use it to save tokens around corners during hooks & processes. I'm working on a plugin that has a lot of nested JSON structure & found that the tokens saved using TOON weren't really worth converting, like, 300-400 files, lol. The savings were maybe 10-13%, or something.

So! I figured, why not ask Claude if there was any practical way the table-collapse y'all use in TOON could be scaled to work as a sort of 'nest-collapse'. After a while, the results came back pretty good, it seems. I'm not a coder per se (I'm using Claude Code to work on a research project), so I'm not the best at seeing if this is 100% up to speed, but I figured I'd send it your way. The document is all generated by Claude, for total transparency.

I considered using it, but I thought that it might be better to just send it your way & follow the project, rather than implement this fork & watch your beta from the sidelines, so I decided not to implement it. The time it would take to convert all the files anyway, & then essentially not be able to update without re-forking whenever y'all made changes, didn't seem to make a lot of sense.

So, this is for you! If you can do something with it, great! If not, no harm, no foul.
Cheers, friends, & good luck with everything.
metafish.

---

# RFC: Object Schema Headers (Nest-Collapse)

**Target**: TOON spec v4.0
**Status**: Proposal with working prototype and test results
**Author**: metafish
**Date**: 2026-03-24

---

## Problem

TOON v3.0's table-collapse (`[N]{fields}:`) eliminates redundant key repetition in arrays of uniform objects. This is the format's strongest compression feature. But the JSON data model uses keyed objects as containers at least as often as arrays. Configuration registries, catalogs, keyed record stores, and nested state files all use the pattern:

```json
{
  "entries": {
    "entry-a": { "status": "active", "count": 5, "label": "Alpha" },
    "entry-b": { "status": "paused", "count": 0, "label": "Beta" },
    "entry-c": { "status": "active", "count": 12, "label": "Gamma" }
  }
}
```

TOON v3.0 encodes this as:

```toon
entries:
  entry-a:
    status: active
    count: 5
    label: Alpha
  entry-b:
    status: paused
    count: 0
    label: Beta
  entry-c:
    status: active
    count: 12
    label: Gamma
```

Every key name (`status`, `count`, `label`) repeats once per entry. With 150 entries and 13 primitive fields each, that is roughly 150 x 13 = 1,950 key occurrences, of which all but the first 13 are redundant. Table-collapse cannot help because the container is an object, not an array.

### Scale of the Problem

We tested against a real-world corpus of 28 JSON files used in an LLM plugin system (session state, manifests, registries, schemas, hook configs, test fixtures). The three largest files -- a 154-entry source manifest (130K tokens), a 318-entry cross-reference registry (32K tokens), and a session state file (9K tokens) -- account for 91% of per-session token load.

TOON v3.0 saves 9.8-15.2% on these files (15.2% on the manifest, 12.8% on the registry, 9.8% on the session state). Token decomposition shows 12.2% of the manifest's JSON tokens are redundant key names that TOON currently cannot eliminate.

## Proposed Extension

### Syntax

Revise the `header` rule and add one production (`obj-schema`):

```abnf
; Existing
header       = [key] bracket-seg [fields-seg] ":"
bracket-seg  = "[" 1*DIGIT [delimsym] "]"
fields-seg   = "{" fieldname *( delim fieldname ) "}"

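; New (replaces the existing header rule; a header is either an array
; header with an optional fields segment, or an object schema header)
header       = [key] ( bracket-seg [fields-seg] / obj-schema ) ":"
obj-schema   = fields-seg "*"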
```

The `*` suffix on a fields segment without a bracket segment distinguishes object schema headers from array table headers. No ambiguity: `[N]{fields}:` is an array, `{fields}*:` is an object schema.
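As a rough illustration of that distinction, a line classifier could look like the sketch below. The regular expressions and the `classify_header` name are assumptions made for this example; they ignore quoted keys and are not taken from the TOON reference parser.

```python
import re

# Hypothetical, simplified patterns: keys and field names are unquoted tokens.
_FIELDS = r"\{(?P<fields>[^{}]*)\}"
ARRAY_HEADER = re.compile(
    r"^(?P<key>[^\[\]{}:]*)\[(?P<n>\d+)[^\]]*\](?:" + _FIELDS + r")?:"
)
OBJECT_SCHEMA_HEADER = re.compile(r"^(?P<key>[^\[\]{}:]*)" + _FIELDS + r"\*:")


def classify_header(line: str) -> str:
    """Return 'object-schema', 'array', or 'other' for a single TOON line."""
    stripped = line.strip()
    if OBJECT_SCHEMA_HEADER.match(stripped):
        return "object-schema"  # key{f1,f2}*:
    if ARRAY_HEADER.match(stripped):
        return "array"          # key[N]: or key[N]{f1,f2}:
    return "other"
```

Because the object form has no bracket segment and the array form has no `*`, the two patterns cannot match the same line.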

### Encoding

`key{f1,f2,...,fN}*:` declares that children of `key` are objects whose primitive fields map positionally to the header.

```toon
entries{status,count,label}*:
  entry-a: active,5,Alpha
  entry-b: paused,0,Beta
  entry-c: active,12,Gamma
```

Three entries, three fields declared once, zero key repetition.
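A minimal encoder for this uniform, all-primitive case might look like the following sketch. The function names are hypothetical, the collection is assumed to be non-empty, and string quoting is deliberately left out, so this is not the prototype's actual code.

```python
def format_primitive(value):
    """Render a JSON primitive as a TOON cell (quoting rules omitted for brevity)."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # check bool before numbers: bool is an int subclass
        return "true" if value else "false"
    return str(value)


def encode_object_schema(key, children, delimiter=",", indent="  "):
    """Encode a dict of uniform, all-primitive objects under one schema header."""
    fields = list(next(iter(children.values())))  # field order from the first child
    for name, child in children.items():
        if set(child) != set(fields):
            raise ValueError(f"child {name!r} does not match the header fields")
    lines = [f"{key}{{{delimiter.join(fields)}}}*:"]
    for name, child in children.items():
        row = delimiter.join(format_primitive(child[f]) for f in fields)
        lines.append(f"{indent}{name}: {row}")
    return "\n".join(lines)
```

Called with the `entries` object from the Problem section, this sketch produces the four-line block shown above.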

**Mixed primitive and non-primitive fields**: Only primitive-valued fields go in the header. Non-primitive fields (nested objects, arrays) expand below each child's row at increased indentation:

```toon
sources{name,type,version}*:
  alpha: Alpha,library,2.1
    dependencies[2]: beta,gamma
    config:
      debug: false
  beta: Beta,service,1.0
    dependencies[0]:
    config:
      debug: true
```
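One way to picture the split is a helper that separates each child into its positional row and the fields left to expand. This is a sketch with a hypothetical `split_child` helper; the `alpha` object is inferred from the TOON example above rather than taken from real data.

```python
def split_child(child, header_fields):
    """Separate one child object into its positional row and its expanded fields."""
    row = [child[name] for name in header_fields]
    for name, value in zip(header_fields, row):
        if isinstance(value, (dict, list)):
            raise TypeError(f"header field {name!r} is not primitive")
    # Everything not named in the header is expanded below the row.
    expanded = {k: v for k, v in child.items() if k not in header_fields}
    return row, expanded


# Assumed JSON shape of the `alpha` child in the example above.
alpha = {
    "name": "Alpha",
    "type": "library",
    "version": 2.1,
    "dependencies": ["beta", "gamma"],
    "config": {"debug": False},
}
row, expanded = split_child(alpha, ["name", "type", "version"])
```

Here `row` holds the three header values and `expanded` holds `dependencies` and `config`, which an encoder would then write at increased indentation.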

**Dot-path key folding** (composing with S13.4): Nested sub-objects that are themselves uniform and all-primitive can be inlined via dot notation:

```toon
sources{name,type,metrics.accuracy,metrics.latency}*:
  alpha: Alpha,library,0.95,12
  beta: Beta,service,0.88,45
```

Decodes (with `expandPaths="safe"`) to:

```json
{
  "sources": {
    "alpha": { "name": "Alpha", "type": "library", "metrics": { "accuracy": 0.95, "latency": 12 } },
    "beta": { "name": "Beta", "type": "service", "metrics": { "accuracy": 0.88, "latency": 45 } }
  }
}
```
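The unfolding step can be sketched as below. `expand_paths` is a hypothetical helper, and its collision check is only a stand-in for whatever `expandPaths="safe"` requires in the TOON spec.

```python
def expand_paths(flat):
    """Unfold dot-path keys such as 'metrics.accuracy' into nested dicts."""
    out = {}
    for path, value in flat.items():
        *parents, leaf = path.split(".")
        node = out
        for part in parents:
            node = node.setdefault(part, {})
            if not isinstance(node, dict):
                raise ValueError(f"path {path!r} collides with a non-object value")
        if leaf in node:
            raise ValueError(f"path {path!r} collides with an existing key")
        node[leaf] = value
    return out
```

Applied to the decoded `alpha` row (`name`, `type`, `metrics.accuracy`, `metrics.latency`), it yields the nested `metrics` object shown in the JSON above.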

### Absent Sentinel (`absentSentinel` option)

Real-world keyed collections rarely have perfectly uniform schemas. Optional fields, nullable sub-objects, and evolving record shapes all produce children where some fields exist in most entries but not all. Without a way to represent "field not present" in a positional row, encoders must either:

- (a) Restrict the schema header to only fields present in *every* child (conservative, loses savings on partial-coverage fields), or
- (b) Use `null` as a stand-in for absence (lossy -- conflates "field is null" with "field doesn't exist")

We propose an **optional** absent sentinel `~`, governed by a new encoder/decoder option `absentSentinel` (analogous to `keyFolding` and `expandPaths` in S13):

```toon
notes{source,status,date,origin}*:
  note-1: design_doc,active,2026-03-24,internal
  note-2: external,merged,2026-03-20,~
```

`~` means the field does not exist in the decoded object for that child. The five cases a decoder must distinguish in a schema row:

| Row content | Decoded as |
|---|---|
| `value` | The value (type-inferred per S4) |
| (empty between delimiters) | `""` (empty string) |
| `~` | Field absent from this child's object |
| `null` | Null |
| `"~"` | Literal string "~" (quoted to escape) |

**Option semantics** (`absentSentinel`):

| Mode | Encoder behavior | Decoder behavior |
|---|---|---|
| `"off"` (default) | MUST NOT emit `~`. Schema headers include only fields present in all children. | `~` decoded as literal string "~". |
| `"on"` | MAY emit `~` for absent fields. Schema headers may include fields present in >= 50% of children. | `~` omits the field from the decoded object. |

When `absentSentinel` is `"off"`, the extension still works -- encoders simply restrict schema headers to universally-present fields and expand optional fields below each row. This is the conservative path and handles most real-world heterogeneity. The sentinel adds incremental value for dot-path-inlined sub-objects that are null or empty in some children, where the alternative is falling back to per-child key repetition for an entire sub-object.
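To make the row semantics concrete, here is a small decoding sketch for a single schema row. All names (`split_row`, `decode_cell`, `decode_row`) are hypothetical, type inference is reduced to null, booleans, and plain numbers, and escape handling is minimal, so treat it as an illustration rather than a conforming decoder.

```python
import re

NUMBER = re.compile(r"-?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?")


def split_row(row, delimiter=","):
    """Split a row into (text, was_quoted) cells; escapes are handled only minimally."""
    cells, buf, quoted, in_quotes = [], [], False, False

    def flush():
        text = "".join(buf)
        cells.append((text if quoted else text.strip(), quoted))

    chars = iter(row)
    for ch in chars:
        if in_quotes:
            if ch == "\\":
                buf.append(next(chars, ""))  # keep the escaped character literally
            elif ch == '"':
                in_quotes = False
            else:
                buf.append(ch)
        elif ch == '"':
            in_quotes = quoted = True
        elif ch == delimiter:
            flush()
            buf, quoted = [], False
        else:
            buf.append(ch)
    flush()
    return cells


def decode_cell(text, quoted):
    """Simplified type inference: quoted cells are always strings."""
    if quoted:
        return text
    if text in ("null", "true", "false"):
        return {"null": None, "true": True, "false": False}[text]
    if NUMBER.fullmatch(text):
        return float(text) if any(c in text for c in ".eE") else int(text)
    return text


def decode_row(fields, row, absent_sentinel="off"):
    """Map row cells to header fields, dropping `~` cells when the option is on."""
    cells = split_row(row)
    if len(cells) != len(fields):
        raise ValueError("row length does not match header")
    out = {}
    for field, (text, quoted) in zip(fields, cells):
        if text == "~" and not quoted and absent_sentinel == "on":
            continue  # field is absent from this child
        out[field] = decode_cell(text, quoted)
    return out
```

With `absent_sentinel="on"`, the `note-2` row above decodes without an `origin` key; with `"off"`, `origin` decodes to the string `"~"`.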

**Our test data**: A 154-entry manifest where 16 entries have `metrics: null` and 43 have `metrics: {}`. Without the sentinel, the encoder cannot inline `metrics.*` fields via dot-path (they don't exist in all children), losing ~6 percentage points of savings. With the sentinel, the encoder puts `~,~,~,~,~,~` in those 59 rows and expands `metrics: null` or `metrics:` below to preserve the original value. Net effect: 21.5% savings (with sentinel, tested) vs ~15-17% (without, estimated).

### Decoding Rules

1. Line matches `key{fields}*:` at depth D -> enter object-schema mode for depth D+1 children
2. Each child line `name: v1,v2,...` at depth D+1:
   - Split values by active delimiter (respecting quoting per S7)
   - Map positionally to header fields
   - `~` values: when `absentSentinel` is `"on"`, skip field (do not include key in decoded object). When `"off"`, decode as literal string `"~"`.
   - Dot-path fields: unfold via path expansion (S13.4)
3. Lines at depth D+2 below a child row: parse as additional key-value pairs, merge into the child object
4. Object-schema mode ends when a line at depth <= D is encountered
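The four rules can be sketched as a small line-based decoder. This is a simplified illustration under several assumptions: two-space indentation, no quoting, values kept as strings (no type inference), and expanded lines limited to scalar `key: value` pairs. `decode_object_schema` is a hypothetical name.

```python
import re

HEADER = re.compile(r"^(?P<key>[^{}:\s]+)\{(?P<fields>[^{}]*)\}\*:$")


def depth_of(line, indent=2):
    """Indentation depth, assuming `indent` spaces per level."""
    return (len(line) - len(line.lstrip(" "))) // indent


def decode_object_schema(lines, absent_sentinel="off"):
    """Decode one object-schema block that starts at lines[0]."""
    header = HEADER.match(lines[0].strip())
    if header is None:
        raise ValueError("first line is not an object schema header")
    base = depth_of(lines[0])                      # rule 1: header at depth D
    fields = header["fields"].split(",")
    result, current = {}, None
    for line in lines[1:]:
        if not line.strip():
            continue
        depth = depth_of(line)
        if depth <= base:                          # rule 4: block ends
            break
        name, _, rest = line.strip().partition(":")
        rest = rest.strip()
        if depth == base + 1:                      # rule 2: child row
            values = rest.split(",")
            if len(values) != len(fields):
                raise ValueError(f"row {name!r} does not match header length")
            current = {}
            for field, value in zip(fields, values):
                if value == "~" and absent_sentinel == "on":
                    continue                       # absent field
                current[field] = value
            result[name] = current
        elif depth == base + 2 and current is not None:
            current[name] = rest                   # rule 3: merge expanded field
        else:
            raise ValueError(f"unexpected indentation: {line!r}")
    return {header["key"]: result}
```

A full decoder would hand the depth D+2 lines to the regular TOON parser so nested objects and arrays merge into the child as well.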

### Disambiguation: Empty String vs Empty Object

This is already well-defined in TOON v3.0 but bears emphasizing in the context of expanded (non-schema) fields below a child row:

- `key: ""` -> empty string value (key present, value is `""`)
- `key:` (bare, no value) -> empty object (key present, value is `{}`)

Encoders MUST use `key: ""` for empty string values in expanded fields. This prevents the decoder from misinterpreting an empty string as an empty object.

In the schema row itself, empty between delimiters is always empty string (per existing S9 inline array semantics). Empty object cannot appear in a schema row because schema fields are primitive-only.
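A tiny round-trip sketch of the two empty forms in expanded fields is shown below. Both helper names are hypothetical, and the decoder side assumes the bare `key:` line has no nested lines beneath it.

```python
def encode_expanded_empty(key, value):
    """Emit an expanded field line for the two easily confused empty values."""
    if value == "":
        return f'{key}: ""'  # empty string must be quoted
    if value == {}:
        return f"{key}:"     # bare key means empty object
    raise ValueError("this sketch only covers '' and {}")


def decode_expanded_empty(line):
    """Inverse of encode_expanded_empty for a single line with no children."""
    key, _, rest = line.strip().partition(":")
    rest = rest.strip()
    if rest == "":
        return key, {}
    if rest == '""':
        return key, ""
    raise ValueError("this sketch only covers '' and {}")
```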

## Conformance

### Encoder Requirements

- MUST apply object schema headers only when all children of the target object are themselves objects
- MUST include in the header only fields that are primitive-valued in every child where they exist, and that are present in all children (when `absentSentinel` is `"off"`) or in >= 50% of children (when `absentSentinel` is `"on"`)
- MUST expand non-primitive fields below the child's row at increased indentation
- MUST use `key: ""` (not `key:`) for empty string values in expanded fields
- When `absentSentinel` is `"on"`: MUST use `~` for fields absent from a specific child. When a dot-path-inlined sub-object is null or empty, MUST use `~` for all its dot-path fields AND expand the original value (e.g., `metrics: null`) below the row to preserve it.
- When `absentSentinel` is `"off"` (default): MUST NOT emit `~`. MUST restrict schema headers to fields present in all children.
- MAY use dot-path notation to inline uniform all-primitive nested sub-objects
- SHOULD apply a cost-benefit gate: only use object schema when `(n_children - 1) * n_fields > n_fields + overhead`. Token savings must exceed header cost.
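
The cost-benefit gate in the last requirement can be written directly from its inequality. In this sketch the function name and the default `overhead=2` are assumptions; the proposal does not fix a value for the header overhead.

```python
def should_use_object_schema(n_children, n_fields, overhead=2):
    """Return True when repeated keys saved exceed the cost of the schema header.

    Left side: key occurrences removed from every child after the first.
    Right side: field names written once in the header plus fixed overhead
    (an assumed constant for the braces and `*`).
    """
    return (n_children - 1) * n_fields > n_fields + overhead
```

With these assumptions, three children with three fields pass the gate (6 > 5), while a single child never does.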

### Decoder Requirements

- MUST recognize `{fields}*:` as an object schema header (distinct from `[N]{fields}:` array header)
- MUST map child row values positionally to header fields
- When `absentSentinel` is `"on"`: MUST treat `~` as field-absent (do not include key in decoded object)
- When `absentSentinel` is `"off"` (default): MUST decode `~` as the literal string `"~"`
- MUST unfold dot-path fields via path expansion when `expandPaths` is enabled
- MUST merge expanded fields (lines below child row) into the child object
- MUST apply type inference (S4) to row values and expanded values identically

### New Options (extending S13)

| Option | Applies to | Values | Default |
|---|---|---|---|
| `absentSentinel` | Encoder + Decoder | `"off"`, `"on"` | `"off"` |

## Test Results

We built a general-purpose encoder and decoder implementing this proposal and tested against 28 real-world JSON files.

### Primary Targets

| File | Entries | JSON tokens | TOON v3.0 | TOON + Object Schema | Round-trip |
|---|---|---|---|---|---|
| Source manifest (154 entries, 1270 sub-entries) | 154 | 129,992 | 15.2% savings | **21.5% savings** | PASS |
| Cross-reference registry (318 entries, 6 key-set shapes) | 318 | 32,341 | 12.8% savings | **25.9% savings** | PASS |
| Session state (heterogeneous) | N/A | 9,244 | 9.8% savings | 9.8% savings | PASS |

The cross-reference registry has 6 distinct key-set shapes across 318 entries (optional fields: origin, cross_references, merged_into, merged_date). The encoder discovers the common primitive fields (source, description, status, date) and puts them in the header. Optional primitive fields expand below each child's row. The absent sentinel was not needed here because optional fields are handled by expansion.
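
Field discovery of this kind might be sketched as follows. `discover_header_fields` and the sample `registry` are hypothetical (the real registry is not reproduced here); the coverage rule follows the Conformance section, and first-seen ordering is used to keep the header deterministic.

```python
def discover_header_fields(children, absent_sentinel="off", min_coverage=0.5):
    """Pick header fields: primitive wherever present, and sufficiently common."""
    order = []            # first-seen order keeps the header deterministic
    counts = {}
    non_primitive = set()
    for child in children.values():
        for key, value in child.items():
            if key not in counts:
                counts[key] = 0
                order.append(key)
            counts[key] += 1
            if isinstance(value, (dict, list)):
                non_primitive.add(key)
    total = len(children)
    # "off": field must be in every child; "on": in at least min_coverage of them.
    needed = total if absent_sentinel == "off" else min_coverage * total
    return [k for k in order if k not in non_primitive and counts[k] >= needed]


# Made-up entries shaped like the registry described above.
registry = {
    "ref-1": {"source": "a", "description": "x", "status": "active",
              "date": "2026-03-01", "origin": "internal"},
    "ref-2": {"source": "b", "description": "y", "status": "merged",
              "date": "2026-03-02", "merged_into": "ref-1",
              "cross_references": ["ref-1"]},
    "ref-3": {"source": "c", "description": "z", "status": "active",
              "date": "2026-03-03", "origin": "external"},
}
```

On this sample, the default mode selects `source`, `description`, `status`, and `date`; with the sentinel on, `origin` (present in two of three entries) also qualifies.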

### Generalization

| Category | Files | Result |
|---|---|---|
| Benefit (>1pp uplift) | 5 | Registries, catalogs, uniform-schema collections |
| Neutral (-0.5 to +1pp) | 15 | Flat configs, primitive arrays, small files |
| Slight degradation | 3 | Small fixtures where header overhead exceeds savings |
| Round-trip failure | 2 | Unrelated list-item decoder bug (not caused by this extension) |

Object schema headers help when the file contains keyed collections of uniform-ish objects. They are correctly neutral on flat structures, arrays, and small files when the encoder applies a cost-benefit gate.

### Token Decomposition

For the 130K-token source manifest:

| Token category | Tokens | % |
|---|---|---|
| Actual values | 55,486 | 42.7% |
| Structural overhead (eliminated by TOON v3.0) | 49,368 | 38.0% |
| **Redundant key repetition (target of this proposal)** | **15,820** | **12.2%** |
| Unique identifiers | 9,293 | 7.2% |

TOON v3.0 handles the 38%. This proposal handles the 12.2%. The remaining 49.9% (actual values plus unique identifiers) is irreducible content.

### What We Tested and Rejected

**Schema definitions** (`@schema S{fields}*` declared once, referenced as `@S*:`): Saves 0.9% on top of object schema headers. Not worth the added spec complexity. Field headers are short and repetition cost is small relative to the key-elimination savings.

**Forcing all optional fields into the header via sentinel**: We tested putting every optional primitive field in the schema header with `~` for absent entries (instead of the option-gated approach proposed above). This works but adds `~` tokens to rows where fields are rarely present, partially offsetting the savings. The option-gated design (`absentSentinel: "off"` by default) lets conservative implementations get most of the value by restricting headers to universally-present fields. The sentinel earns its keep specifically when dot-path inlining would otherwise be blocked by null/empty sub-objects in a minority of children.

## Backward Compatibility

- **New syntax**: `{fields}*:` does not conflict with any existing TOON v3.0 production. A v3.0 decoder encountering this syntax would fail to parse (no silent misinterpretation).
- **Existing documents unchanged**: No v3.0-valid document changes meaning under this extension.
- **Opt-in**: Encoders can choose whether to apply object schema headers. A conservative encoder can produce valid TOON v4.0 output identical to v3.0 by never using the feature.

## Reference Implementation

A Python prototype (encoder + decoder, ~650 LOC) is available. It implements automatic schema discovery, dot-path key folding, absent sentinel, cost-benefit gating, and full round-trip verification. The implementation has been through code review, which caught and fixed 5 correctness bugs (3 critical: unquoted `~` sentinel leaking into non-schema decode paths, quoted keys with `": "` causing wrong key/value splits, and non-deterministic field ordering from set iteration). Post-review, 27/29 files in our test corpus pass round-trip (2 failures are a pre-existing list-item decoder bug unrelated to this extension). We are happy to contribute this as a reference or test fixture.

---

## Appendix: Full Example

### JSON Input

```json
{
  "version": "2.0",
  "catalog": {
    "widget-a": {
      "name": "Alpha Widget",
      "status": "active",
      "price": 9.99,
      "metrics": { "views": 1200, "sales": 45 },
      "tags": ["popular", "new"]
    },
    "widget-b": {
      "name": "Beta Widget",
      "status": "discontinued",
      "price": 4.50,
      "metrics": { "views": 300, "sales": 2 },
      "tags": ["clearance"]
    },
    "widget-c": {
      "name": "Gamma Widget",
      "status": "active",
      "price": 19.99,
      "metrics": null,
      "tags": []
    }
  }
}
```

### TOON v3.0 Output (no object schema)

```toon
version: "2.0"
catalog:
  widget-a:
    name: Alpha Widget
    status: active
    price: 9.99
    metrics:
      views: 1200
      sales: 45
    tags[2]: popular,new
  widget-b:
    name: Beta Widget
    status: discontinued
    price: 4.5
    metrics:
      views: 300
      sales: 2
    tags[1]: clearance
  widget-c:
    name: Gamma Widget
    status: active
    price: 19.99
    metrics: null
    tags[0]:
```

### TOON v4.0 Output (object schema, `absentSentinel: "off"`)

Without the sentinel, the encoder restricts the header to fields that are primitive in all three children. Since `widget-c` has `metrics: null`, the metrics dot-paths are excluded:

```toon
version: "2.0"
catalog{name,status,price}*:
  widget-a: Alpha Widget,active,9.99
    metrics:
      views: 1200
      sales: 45
    tags[2]: popular,new
  widget-b: Beta Widget,discontinued,4.5
    metrics:
      views: 300
      sales: 2
    tags[1]: clearance
  widget-c: Gamma Widget,active,19.99
    metrics: null
    tags[0]:
```

Key repetition eliminated for `name`, `status`, `price`. The `metrics` sub-object still repeats its keys per child because the encoder cannot inline it without the sentinel.

### TOON v4.0 Output (object schema, `absentSentinel: "on"`)

With the sentinel, the encoder can inline `metrics.*` fields and use `~` for `widget-c`:

```toon
version: "2.0"
catalog{name,status,price,metrics.views,metrics.sales}*:
  widget-a: Alpha Widget,active,9.99,1200,45
    tags[2]: popular,new
  widget-b: Beta Widget,discontinued,4.5,300,2
    tags[1]: clearance
  widget-c: Gamma Widget,active,19.99,~,~
    metrics: null
    tags[0]:
```

The keys `name`, `status`, `price`, `views`, and `sales` now appear only in the header instead of once per child that has them. `widget-c` uses `~` for the absent `metrics` sub-fields and expands `metrics: null` below the row to preserve the original null value (distinct from absent -- the key exists, its value is null).
