70 changes: 70 additions & 0 deletions docs/adr/0004-image-shape-ownership.md
@@ -0,0 +1,70 @@
---
status: proposed
---

# Image shape and S3-access index lookup is owned by format handlers via a fixed-offset Essentials packet

A SIPI module that needs the intrinsic shape of a service master — `(img_w, img_h, tile_w, tile_h, clevels, numpages, nc, bps)` — calls `SipiIO::read_shape(origpath) → SipiImgInfo` on the appropriate format handler. The cache holds no shape data; `SipiCache::SizeRecord`, `SipiCache::sizetable`, and `SipiCache::getSize()` are removed. The two consumers today are canonical-URL computation (always — it needs `img_w`/`img_h` to resolve pixel-coord cache keys) and the decode-memory-budget peak estimator (cache-miss only — the full struct).

`read_shape` is the existing `SipiIO::getDim(filepath) → SipiImgInfo` virtual, renamed for self-documentation: the return type carries full image shape (dimensions + tile + clevels + numpages + nc + bps + orientation + mimetype + sub-image resolutions), not just dimensions. No new virtual method is added; existing overrides are renamed.
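
As a rough illustration of the renamed virtual and its two consumers, the sketch below uses a trimmed-down, hypothetical `SipiImgInfo` holding only the eight shape fields named above. The field types, `FakeHandler`, and `naive_decoded_bytes` are assumptions made for this example and are not taken from the SIPI sources.

```cpp
#include <cstdint>
#include <string>

// Hypothetical, trimmed-down shape record: only the eight fields listed in this ADR.
// The real SipiImgInfo also carries orientation, mimetype and sub-image resolutions.
struct SipiImgInfo {
    std::uint32_t img_w = 0, img_h = 0;    // full-resolution pixel dimensions
    std::uint32_t tile_w = 0, tile_h = 0;  // tile dimensions
    std::uint32_t clevels = 0;             // number of resolution levels
    std::uint32_t numpages = 0;            // number of pages
    std::uint32_t nc = 0;                  // number of channels
    std::uint32_t bps = 0;                 // bits per sample
};

class SipiIO {
public:
    virtual ~SipiIO() = default;
    // Formerly getDim(): same return type, clearer name.
    virtual SipiImgInfo read_shape(const std::string &origpath) = 0;
};

// Stand-in handler that returns canned values; it exists only to show the call site.
class FakeHandler final : public SipiIO {
public:
    SipiImgInfo read_shape(const std::string &) override {
        SipiImgInfo info;
        info.img_w = 4000;
        info.img_h = 3000;
        info.nc = 3;
        info.bps = 8;
        return info;
    }
};

// Assumed, deliberately naive estimate of the uncompressed size in bytes.
// The real peak estimator may weigh tiles, levels and pages differently.
inline std::uint64_t naive_decoded_bytes(const SipiImgInfo &s) {
    const std::uint64_t bytes_per_sample = (s.bps + 7) / 8;
    return static_cast<std::uint64_t>(s.img_w) * s.img_h * s.nc * bytes_per_sample;
}
```

A canonical-URL consumer would read only `img_w` and `img_h` from the result, while an estimator-style consumer needs `nc` and `bps` as well, which is why the full struct is returned.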

The **service master format handlers** — `SipiIOJ2k` (JP2) today, `SipiIOTiff` (pyramidal mode) after the planned migration — implement `read_shape` via a fast path: they read shape *and file-structure offsets* from a dedicated **Essentials packet** at a known fixed prefix offset within the service master file. The packet's contents broaden from "shape only" to:

- **Image shape** (8 fields): `img_w, img_h, tile_w, tile_h, clevels, numpages, nc, bps`.
- **File-structure offsets** (per-format):
  - Pyramidal TIFF: per-IFD byte offset + compressed size for each pyramid level.
  - JP2: codestream-box offset + per-resolution-level offsets within the codestream.
- **ICC profile bytes** (existing).
- **Identity** (existing): original filename, mimetype, hash type, pixel checksum.

Other format handlers (`SipiIOPng`, `SipiIOJpeg`, plain `SipiIOTiff`) implement `read_shape` via standard format-native header parsing only. They are not in the server hot path: they are used for IIIF output writes in server mode (post-decode encoding to JPEG / PNG / TIFF; latency dominated by the encode itself), and for reading input files of arbitrary format in CLI mode (the read side of conversion; offline, not S3-bound). Neither use case warrants the Essentials-packet fast path.

CLI-mode conversion writes the Essentials packet with shape + structure offsets. Server-mode reads benefit from the fast path. The Essentials-packet schema is format-agnostic — same CBOR wire format ([ADR-0005](./0005-essentials-packet-versioned-binary-serialization.md)) for JP2 and pyramidal TIFF — but the *embedding mechanism* is handler-specific (a JP2 UUID box positioned after the JP2 signature + FTYP boxes; a tag in the first IFD of a TIFF, reachable via the file-header offset at bytes 4-7).
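
To make the TIFF side of the embedding concrete, here is a minimal sketch of how a reader could find the first IFD from an already-fetched file prefix. It relies only on the classic TIFF header layout (byte-order mark at bytes 0-1, magic number 42 at bytes 2-3, first-IFD offset at bytes 4-7). The function name `first_ifd_offset` is hypothetical, and BigTIFF, which uses a different header, is deliberately not handled.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Returns the byte offset of the first IFD of a classic TIFF, or std::nullopt
// if the buffer does not start with a classic TIFF header.
inline std::optional<std::uint32_t> first_ifd_offset(const std::vector<unsigned char> &prefix) {
    if (prefix.size() < 8) return std::nullopt;

    const bool little = prefix[0] == 'I' && prefix[1] == 'I';  // "II": little-endian
    const bool big = prefix[0] == 'M' && prefix[1] == 'M';     // "MM": big-endian
    if (!little && !big) return std::nullopt;

    // Assemble an n-byte unsigned integer in the file's byte order,
    // starting with the most significant byte.
    auto read_uint = [&](std::size_t pos, std::size_t n) -> std::uint32_t {
        std::uint32_t value = 0;
        for (std::size_t i = 0; i < n; ++i) {
            const std::size_t idx = little ? pos + n - 1 - i : pos + i;
            value = (value << 8) | prefix[idx];
        }
        return value;
    };

    if (read_uint(2, 2) != 42) return std::nullopt;  // classic TIFF magic number
    return read_uint(4, 4);                          // offset of the first IFD
}
```

Whether the first IFD itself falls inside the fetched prefix depends on how the writer lays out the file, so a caller would still need to compare the returned offset against the prefix length.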

We accept this for four coupled reasons.

**1. Shape is intrinsic to the image, not derivable from cache state.** The cache's earlier shape memoization (`SipiCache::sizetable`) was parasitic — populated only as a side effect of `add()`, never independently persisted, vestigial after eviction, absent for un-cached origpaths. The right-shaped optimisation lives at the format-handler layer, not the cache layer. See [Probe 1](../deep-modules.md#probe-1--sipicache).

**2. The S3 transition is a forcing function (3-6 months out).** Service masters are accessed remotely *today* (NFS-mounted ZFS spinning disk; each read is a network round trip with seek penalties on spinning disk). The S3 transition makes every read an HTTP **range GET** (~1-10ms per round trip, no seek but with TLS + auth overhead). The packet-at-fixed-offset design allows SIPI's pre-decode logic to fetch the packet with **one** range GET of a known prefix (e.g. the first 64KB of the file), then **one** targeted range GET for the data SIPI actually needs to decode. Without the packet, SIPI must walk format-native structures (TIFF IFD chains, JP2 box hierarchies), and each parse step is a separate range GET, racking up 5+ round trips per request, multiplied by pyramid depth. The latency difference is roughly 5ms vs. 50ms on the server hot path. NFS already pays a fraction of this cost today; under S3 it becomes the dominant cost.

**3. The packet's file-position must be fixed-prefix-readable.** For JP2: a UUID box positioned after the JP2 signature box and FTYP box. For pyramidal TIFF: a tag in the first IFD (which is reachable in the first 64KB by virtue of TIFF's header pointing to it at bytes 4-7). Both formats accommodate this without breaking spec compliance. If 64KB isn't enough for outliers (large embedded ICC, many pyramid levels), either bump the prefix size globally or include a `packet_size` field at a known fixed offset for a worst-case-bounded second range GET.
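
The `packet_size` escape hatch can be sketched as a small range calculation. The sketch assumes the reader already knows the packet's byte offset and its declared size, and that the first GET covered `prefix_bytes` bytes from the start of the file. All names here (`ByteRange`, `remaining_packet_range`, `to_range_header`) are hypothetical.

```cpp
#include <algorithm>
#include <cstdint>
#include <optional>
#include <string>

// Inclusive byte range, matching the HTTP "Range: bytes=first-last" convention.
struct ByteRange {
    std::uint64_t first;
    std::uint64_t last;
};

constexpr std::uint64_t kPrefixBytes = 64 * 1024;  // the 64KB example prefix

// Returns the part of the packet that lies beyond the already-fetched prefix,
// or std::nullopt if the prefix covers the whole packet.
inline std::optional<ByteRange> remaining_packet_range(std::uint64_t packet_offset,
                                                       std::uint64_t packet_size,
                                                       std::uint64_t prefix_bytes = kPrefixBytes) {
    if (packet_size == 0) return std::nullopt;
    const std::uint64_t packet_end = packet_offset + packet_size;  // exclusive end
    if (packet_end <= prefix_bytes) return std::nullopt;           // no second GET needed
    return ByteRange{std::max(packet_offset, prefix_bytes), packet_end - 1};
}

// Formats the range as the value of an HTTP Range request header.
inline std::string to_range_header(const ByteRange &r) {
    return "bytes=" + std::to_string(r.first) + "-" + std::to_string(r.last);
}
```

With this shape, the common case costs no extra round trip, and an outlier costs exactly one more range GET with a bound known in advance.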

**4. Existing service masters lacking the packet (or lacking the structure-offset additions) fall back to format-native structure walking.** Backward compatibility is preserved; the fast path activates incrementally as files are re-processed. No mass re-conversion required.
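
The fallback rule can be expressed as a small wrapper: try the packet, and only walk the native structure when the packet yields nothing. This is a sketch under stated assumptions; `PacketShape`, `ShapeSource`, and the two reader callbacks are hypothetical stand-ins for whatever the handlers actually expose.

```cpp
#include <functional>
#include <optional>
#include <string>

// Hypothetical minimal shape, reduced to two fields for the example.
struct PacketShape {
    unsigned img_w = 0;
    unsigned img_h = 0;
};

enum class ShapeSource { EssentialsPacket, NativeWalk };

struct ShapeResult {
    PacketShape shape;
    ShapeSource source;  // recorded so callers or metrics can see which path ran
};

// The packet reader returns std::nullopt when the packet is absent or incomplete.
using PacketReader = std::function<std::optional<PacketShape>(const std::string &)>;
// The native reader always produces a shape by walking format-native structures.
using NativeReader = std::function<PacketShape(const std::string &)>;

inline ShapeResult read_shape_with_fallback(const std::string &origpath,
                                            const PacketReader &from_packet,
                                            const NativeReader &from_native) {
    if (const auto shape = from_packet(origpath)) {
        return {*shape, ShapeSource::EssentialsPacket};  // fast path
    }
    return {from_native(origpath), ShapeSource::NativeWalk};  // slow but always available
}
```

Recording the source alongside the shape is a design choice of this sketch, not something the ADR requires; it would make the incremental activation of the fast path observable.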

## Considered Options

- **Keep the parasitic shape memoization in the cache** — rejected. The two structs (`SizeRecord` and `CacheRecord`) overlap by construction (same fields, different keys); `sizetable` is populated only by `add()`, never independently persisted, vestigial after eviction, absent for un-cached origpaths. Over S3 the cache-as-shape-source design has no architectural advantage that wouldn't also exist for the format handler — and the memoization couples shape lookup to cache state, which is wrong.

- **Memoize shape inside the peak-memory estimator** — rejected. The estimator is not the only consumer; canonical-URL computation also needs `img_w`/`img_h` on every request. Putting the memoization in the estimator would either duplicate the lookup at the canonical-URL site or force the canonical-URL site to depend on the estimator. Both are worse than the format-handler-as-source design.

- **JP2-specific Essentials-packet fast path only, no pyramidal TIFF** — rejected. Pyramidal TIFF is the planned successor to JP2 as the service master format. Designing the packet carrier as JP2-specific would require duplicating it for TIFF immediately. Defining the schema once at the Essentials-packet layer and letting both `SipiIOJ2k` and `SipiIOTiff` consume it is uniform.

- **Add the shape-read fast path to every format handler** — rejected. PNG, JPEG, and non-pyramidal TIFF handlers are used either for IIIF output writes (server mode, post-decode) or for reading arbitrary-format input files in CLI mode (offline). Neither path is server-hot. Format-native header parsing is sufficient.

- **Limit the packet to image shape only (no file-structure offsets)** — rejected. Without offsets, SIPI must still walk format-native structures over S3, so the optimization is only a half-measure. The marginal cost of including offsets in the packet is small (a few hundred bytes for typical pyramids); the marginal benefit is large (5+ round trips → 1).

- **Defer the S3 design to a separate future ADR** — rejected. The decisions are coupled (where the packet lives, what's in it, how SIPI reads it). Splitting risks designing the shape-only packet now, then rediscovering S3 constraints later and reopening the file format. The 3-6 month S3 horizon is too close to defer the design.

## Consequences

- **`SipiCache` shrinks**. `SizeRecord`, `sizetable`, `getSize()`, and the bug-prone non-cleanup of `sizetable` on eviction (`purge()` and `remove()` don't touch it — vestigial growth) all go away. The cache becomes a pure representation cache. See [Probe 1](../deep-modules.md#probe-1--sipicache).

- **`SipiIO::getDim` is renamed to `read_shape`**. The existing virtual already returns full shape via `SipiImgInfo`; the rename is for self-documentation. Each subclass's override is renamed to match. No new virtual method is added. See [Probe 3](../deep-modules.md#probe-3--format_handlers-renamed-from-formats).

- **Service master format handlers (`SipiIOJ2k` + pyramidal `SipiIOTiff`) get the Essentials-packet fast path** in `read_shape`. Implementation: range-read a fixed prefix (e.g. first 64KB), parse the packet from its known location, return shape from the packet's `shape` section. Fallback to format-native parsing if the packet is absent or lacks the requested fields.

- **The Essentials packet schema gains image-shape AND file-structure-offset fields**. Wire format is CBOR per ADR-0005. The packet's role broadens from "preserve cross-conversion identity" to "preserve identity *and* serve as the S3-access index for the file." Old service masters without the new fields fall back to format-native parsing.

- **`SipiHttpServer.cpp` request flow simplifies**. The call site at line 1571 becomes `format_handler->read_shape(infile)`, returning a full `SipiImgInfo` with `nc`/`bps` populated (the "remain 0" overestimate comment at line 1563 disappears). The decode-memory-budget admission check gets accurate inputs as a happy side effect.

- **A `SourceReader` abstraction is needed** to wrap local FS / NFS / S3 access uniformly. Today format handlers call libtiff / Kakadu / libjpeg / libpng with file paths; under S3 they need a stream-or-range abstraction. Out of scope for this ADR; flagged as the natural follow-on (likely a separate ADR + module under `src/source_reader/` or similar). The Essentials-packet design in this ADR is independent of which `SourceReader` implementation runs underneath — same packet location, same parsing.

- **`SipiLua.cpp` is not directly affected**. The Lua admin surface does not touch `getSize` or `read_shape`.

- **Approval-test surface is unchanged**. Shape and structure-offset fields are read-only metadata; rendered image bytes are unaffected.

- **Migration path for existing service masters**: existing JP2s and TIFFs in production lacking the packet (or lacking the structure-offset additions) fall back to format-native parsing — slower over S3 but functionally correct. New conversions populate the packet. No mass re-conversion required; fast path activates incrementally as files are re-processed. This is the load-bearing operational property — given the 100K-master-file install base ([ADR-0005](./0005-essentials-packet-versioned-binary-serialization.md)'s longevity invariant), mass re-encoding is not feasible.

- **The decision is coupled with [ADR-0005](./0005-essentials-packet-versioned-binary-serialization.md)** (CBOR wire format). The CBOR choice matters more under S3 — every byte of header sits inside a range-GET span, and forward-compat allows future schema additions (e.g. per-tile offset tables for ultra-tile-heavy access patterns) without re-encoding.

- **Glossary deltas land in [`UBIQUITOUS_LANGUAGE.md`](../../UBIQUITOUS_LANGUAGE.md)** in the batched edit pass: `Image shape`, `Operating mode` / `Server mode` / `CLI mode`, `Service master` / `Service master format`, `Archival master` / `Archival master format`, `Pyramidal TIFF`, `Object storage`, `Range GET`, `Codec` (sharpened), `read_shape` (rename of `getDim`), and a sharpened `Essentials packet` definition. Tracked in the [glossary delta register](../deep-modules.md#glossary-delta-register).
53 changes: 53 additions & 0 deletions docs/adr/0005-essentials-packet-versioned-binary-serialization.md
@@ -0,0 +1,53 @@
---
status: proposed
---

# Essentials packet adopts a versioned, self-describing binary wire format (CBOR)

The Essentials packet's wire format migrates from pipe-delimited text to **CBOR (RFC 8949)** with a top-level `format_version` field. New SIPI versions read every prior `format_version` they support; legacy text-format packets are detected by the absence of a CBOR-tagged byte sequence and parsed via the legacy reader indefinitely. New packets are always written in the CBOR form.
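
One way the detection could work is sketched below. It assumes, as an illustration only, that new packets start with the self-described CBOR tag 55799 from RFC 8949, whose encoding is the three bytes `0xD9 0xD9 0xF7`; the ADR itself does not fix which tag is used. The helper names and the legacy field layout are likewise hypothetical.

```cpp
#include <algorithm>
#include <string>
#include <vector>

enum class WireFormat { Cbor, LegacyText };

// Assumption for this sketch: CBOR packets are prefixed with the self-described
// CBOR tag 55799, which encodes as the bytes D9 D9 F7.
inline WireFormat detect_wire_format(const std::vector<unsigned char> &bytes) {
    static constexpr unsigned char kMagic[3] = {0xD9, 0xD9, 0xF7};
    if (bytes.size() >= 3 && std::equal(kMagic, kMagic + 3, bytes.begin())) {
        return WireFormat::Cbor;
    }
    return WireFormat::LegacyText;  // anything else goes to the legacy reader
}

// Splits a legacy packet on '|'. There is no escape mechanism, so a '|'
// inside a field silently produces an extra field.
inline std::vector<std::string> split_legacy_fields(const std::string &text) {
    std::vector<std::string> fields;
    std::string current;
    for (const char c : text) {
        if (c == '|') {
            fields.push_back(current);
            current.clear();
        } else {
            current.push_back(c);
        }
    }
    fields.push_back(current);
    return fields;
}
```

Splitting a two-field packet whose first field is the filename `my|file.tif` yields three fields instead of two, which is the kind of corruption the text format cannot guard against.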

We accept this because SIPI's preservation guarantee is that conversions don't lose metadata, and SIPI's installed base is *hundreds of thousands* of master files. Mass re-encoding to flip serialization formats is not operationally feasible at that scale; the wire format must therefore be forward-evolvable forever — new fields, new optimizations, new types — without requiring coordination across the existing master-file population. The current pipe-delimited format has zero versioning, no schema, no escaping, and no field discovery; ADR-0004 alone (adding eight image-shape fields) already pushes it past its design limits, and any subsequent schema change would compound the fragility. Biting this bullet now, while the install base is smaller than it will be at any future point, is strictly cheaper than deferring.

CBOR is chosen over the obvious alternatives:

- **Protocol Buffers** — solves forward-compat via field numbering, but adds a `protoc` codegen step and a `.proto` schema file to the build. The dep + workflow friction is not justified for a packet that's a few KB at most.
- **MessagePack** — functionally equivalent to CBOR with comparable forward-compat semantics, but it is not an IETF standard. CBOR's preservation-community traction (used by COSE / CWT, JOSE successor stacks, IoT data formats, and increasingly by IIIF-adjacent specs) is the tiebreaker.
- **Custom versioned binary** — tightest bytes per packet, but reinventing schema-evolution semantics (length-prefixing, type tagging, default-value rules for missing fields) at a 100K-master-file horizon is the kind of decision a longevity-driven codebase should not bet on. CBOR's primitives are exactly the schema-evolution semantics we would otherwise reinvent.
- **JSON** — text-based; defeats the compactness goal in image headers and reintroduces the same escaping fragility that motivated this ADR.

Forward-compatibility in CBOR works at two levels and we use both:

- **Field-level**: additive schema changes (new fields) require no `format_version` bump. Readers ignore unknown CBOR map keys; writers add new keys at will. This covers the common case (e.g. ADR-0004's image-shape fields, future colour-space hints, future provenance fields).
- **Format-level**: breaking changes (field type changes, semantic redefinition) bump `format_version`. The new SIPI version is responsible for retaining a migrating reader for every prior `format_version` it claims to support. `format_version` is the *only* thing a reader checks before dispatching to a per-version parser; everything else is field-level evolution.
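
The two levels can be sketched against an already-decoded packet. To keep the example library-free, a `std::map` of string keys to unsigned integers stands in for the decoded CBOR map; that stand-in, the key names, and the `ShapeFields` struct are assumptions for illustration, not the actual schema.

```cpp
#include <cstdint>
#include <map>
#include <optional>
#include <string>

// Stand-in for a decoded CBOR map with text keys and unsigned integer values.
using DecodedMap = std::map<std::string, std::uint64_t>;

struct ShapeFields {
    std::uint64_t img_w = 0;
    std::uint64_t img_h = 0;
    std::uint64_t nc = 0;
};

// Field-level evolution: a missing key falls back to a default value.
inline std::uint64_t get_or(const DecodedMap &m, const std::string &key, std::uint64_t fallback) {
    const auto it = m.find(key);
    return it == m.end() ? fallback : it->second;
}

// Reads only the keys this version knows; unknown keys are never inspected,
// so packets written by newer versions still parse.
inline ShapeFields parse_v1(const DecodedMap &m) {
    ShapeFields s;
    s.img_w = get_or(m, "img_w", 0);
    s.img_h = get_or(m, "img_h", 0);
    s.nc = get_or(m, "nc", 0);
    return s;
}

// Format-level evolution: format_version is the only field checked before dispatch.
inline std::optional<ShapeFields> parse_packet(const DecodedMap &m) {
    const auto version = m.find("format_version");
    if (version == m.end()) return std::nullopt;
    switch (version->second) {
        case 1:
            return parse_v1(m);
        default:
            return std::nullopt;  // unsupported version: treat as no usable packet
    }
}
```

A later `format_version` would add a `case 2` with its own parser while `parse_v1` stays in place, which is the retention obligation described above.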

## Considered Options

- **Stay pipe-delimited; add ADR-0004 fields by appending more pipe segments** — rejected. Pushes past the format's design limits when we already know the next schema change is coming. The pipe-delimited reader has no escape mechanism (an `origname` containing `|` corrupts the parse, a latent bug today on macOS/Linux, where filesystems allow `|`), no version discriminator, and no defined rule for unknown fields. Each future schema change pays the same cost; biting the bullet later is strictly more expensive than biting it now.

- **Adopt Protocol Buffers** — rejected. The codegen + schema-file workflow is appropriate for service interfaces but heavyweight for a single embedded metadata packet. Forward-compat via field numbering is real but the tooling overhead is not.

- **Adopt MessagePack** — rejected. Functionally equivalent to CBOR but lacks IETF standardization. No technical advantage; CBOR is the conservative choice for a 10+ year preservation horizon.

- **Custom versioned binary format** — rejected. Reinvents what CBOR's spec already nails down (canonical encoding, type tagging, length-prefixing, integer/float/string/array/map primitives). The maintenance cost of an in-house format compounds across the lifetime of the master-file population.

- **Adopt CBOR but without a `format_version` field**, relying entirely on field-level forward-compat — rejected. Field-level evolution covers most cases but cannot handle semantic redefinition (e.g. if `numpages` semantics ever change for some IIIF-spec reason). A coarse-grained version field is cheap insurance against the cases field-level evolution can't handle.

## Consequences

- **`SipiEssentials::parse(bytes)` becomes a dispatcher**: probe the leading bytes for a CBOR tag → CBOR parse via the chosen library; otherwise fall back to the legacy pipe-delimited parser. The legacy parser is retained indefinitely (longevity invariant). If both parsers fail, treat as no Essentials packet present (same as today's "is_set = false" path).

- **`SipiEssentials::serialize()` always emits CBOR** with `format_version = 1` for the initial cutover. The class's in-memory representation is unchanged; only the wire format moves. The existing `operator std::string()` and `operator<<` overloads — which the [Probe 2 deep-module analysis](../deep-modules.md#probe-2--metadata) flagged as shallowness leaks anyway — are removed in favour of an explicit `serialize() → std::vector<unsigned char>`. The on-disk artefact is *bytes*, not a `std::string`; the existing API's typing was a mistake.

- **New build dep**: a CBOR library. Candidates include `tinycbor` (lightweight, C, used by IoT/embedded), `jsoncons` (header-only C++, broad format support), or an in-tree minimal encoder (CBOR is small enough that a few hundred lines covers our needs). Choice deferred to implementation time; not an architectural decision.

- **`SipiEssentials::parse()` becomes fallible in a way the current API doesn't expose** (the dispatcher needs to report which parser was used, and CBOR-parse can fail in more ways than text-parse). Consider switching the API to `std::expected<SipiEssentials, ParseError>` to match `cpp-style-guide.md`'s preference; this also aligns with the Rust target.

- **Approval-test goldens for image-header bytes change**: where the test asserts on the embedded packet bytes (any test that round-trips a master file through the encoder and inspects the header), the goldens are regenerated alongside ADR-0004's image-shape-field addition. Do both changes in one PR so the approval suite churns only once.

- **Existing master files keep working unchanged**. Their pipe-delimited packets continue to parse via the legacy reader. The CBOR fast path activates incrementally as files are re-processed (CLI conversions, format-conversion writes), without any mass re-encoding event. This is the load-bearing operational property of this ADR.

- **The pipe-delimited fragility goes away for new packets**: filenames with `|`, no schema versioning, no field discovery, no escape semantics — all replaced by CBOR's well-defined encoding rules. Old packets remain fragile but, by definition, are read-only at this point (they're already on disk; nothing new will be written in the legacy format).

- **The `metadata/` Bazel package documented in [Probe 2](../deep-modules.md#probe-2--metadata) gains a CBOR-library dep but no visibility change.** Consumers (`SipiImage`, format handlers) see no API surface change at the call sites that already use `SipiEssentials` getters/setters. Only `parse` / `serialize` change shape.

- **Future schema additions become low-friction**. Adding a new field is: (a) extend the in-memory struct, (b) write the new key in `serialize()`, (c) read the new key (with a sensible default) in `parse()`. No `format_version` bump, no migration tooling, no coordinated deploys. This is the operational property the 100K-master-file horizon requires.
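
To give a feel for the "in-tree minimal encoder" option and for how cheap adding a key is, here is a sketch of an encoder that covers just enough of RFC 8949 for a flat map of text keys to unsigned integers. It is not the proposed implementation: the function names are hypothetical, the self-described tag prefix is an assumption, and real packets would also need byte strings (ICC profile) and nested arrays (offset tables).

```cpp
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

using Bytes = std::vector<unsigned char>;

// Writes a CBOR head: a 3-bit major type plus an unsigned argument,
// using the shortest of the 0, 1, 2, 4 or 8 byte argument encodings.
inline void put_head(Bytes &out, unsigned major, std::uint64_t n) {
    const auto mt = static_cast<unsigned char>(major << 5);
    if (n < 24) {
        out.push_back(static_cast<unsigned char>(mt | n));
        return;
    }
    const int extra = n <= 0xFF ? 1 : n <= 0xFFFF ? 2 : n <= 0xFFFFFFFFULL ? 4 : 8;
    const unsigned char info = extra == 1 ? 24 : extra == 2 ? 25 : extra == 4 ? 26 : 27;
    out.push_back(static_cast<unsigned char>(mt | info));
    for (int shift = 8 * (extra - 1); shift >= 0; shift -= 8) {
        out.push_back(static_cast<unsigned char>((n >> shift) & 0xFF));  // big-endian
    }
}

// Encodes a flat map of text keys to unsigned integers, in the given order.
inline Bytes serialize_uint_map(const std::vector<std::pair<std::string, std::uint64_t>> &fields) {
    Bytes out;
    put_head(out, 6, 55799);          // assumed self-described CBOR tag: D9 D9 F7
    put_head(out, 5, fields.size());  // major type 5: map with this many pairs
    for (const auto &field : fields) {
        put_head(out, 3, field.first.size());  // major type 3: text string length
        out.insert(out.end(), field.first.begin(), field.first.end());
        put_head(out, 0, field.second);        // major type 0: unsigned integer
    }
    return out;
}
```

Under this sketch, step (b) of adding a field amounts to appending one more pair to the list passed to `serialize_uint_map`; existing readers that do not know the key are unaffected.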