diff --git a/docs/adr/0004-image-shape-ownership.md b/docs/adr/0004-image-shape-ownership.md new file mode 100644 index 00000000..80052e37 --- /dev/null +++ b/docs/adr/0004-image-shape-ownership.md @@ -0,0 +1,70 @@ +--- +status: proposed +--- + +# Image shape and S3-access index lookup is owned by format handlers via a fixed-offset Essentials packet + +A SIPI module that needs the intrinsic shape of a service master — `(img_w, img_h, tile_w, tile_h, clevels, numpages, nc, bps)` — calls `SipiIO::read_shape(origpath) → SipiImgInfo` on the appropriate format handler. The cache holds no shape data; `SipiCache::SizeRecord`, `SipiCache::sizetable`, and `SipiCache::getSize()` are removed. The two consumers today are canonical-URL computation (always — it needs `img_w`/`img_h` to resolve pixel-coord cache keys) and the decode-memory-budget peak estimator (cache-miss only — the full struct). + +`read_shape` is the existing `SipiIO::getDim(filepath) → SipiImgInfo` virtual, renamed for self-documentation: the return type carries full image shape (dimensions + tile + clevels + numpages + nc + bps + orientation + mimetype + sub-image resolutions), not just dimensions. No new virtual method is added; existing overrides are renamed. + +The **service master format handlers** — `SipiIOJ2k` (JP2) today, `SipiIOTiff` (pyramidal mode) after the planned migration — implement `read_shape` via a fast path: they read shape *and file-structure offsets* from a dedicated **Essentials packet** at a known fixed prefix offset within the service master file. The packet's contents broaden from "shape only" to: + +- **Image shape** (8 fields): `img_w, img_h, tile_w, tile_h, clevels, numpages, nc, bps`. +- **File-structure offsets** (per-format): + - Pyramidal TIFF: per-IFD byte offset + compressed size for each pyramid level. + - JP2: codestream-box offset + per-resolution-level offsets within the codestream. +- **ICC profile bytes** (existing). +- **Identity** (existing): original filename, mimetype, hash type, pixel checksum. + +Other format handlers (`SipiIOPng`, `SipiIOJpeg`, plain `SipiIOTiff`) implement `read_shape` via standard format-native header parsing only. They are not in the server hot path: they are used for IIIF output writes in server mode (post-decode encoding to JPEG / PNG / TIFF; latency dominated by the encode itself), and for reading input files of arbitrary format in CLI mode (the read side of conversion; offline, not S3-bound). Neither use case warrants the Essentials-packet fast path. + +CLI-mode conversion writes the Essentials packet with shape + structure offsets. Server-mode reads benefit from the fast path. The Essentials-packet schema is format-agnostic — same CBOR wire format ([ADR-0005](./0005-essentials-packet-versioned-binary-serialization.md)) for JP2 and pyramidal TIFF — but the *embedding mechanism* is handler-specific (a JP2 UUID box positioned after the JP2 signature + FTYP boxes; a tag in the first IFD of a TIFF, reachable via the file-header offset at bytes 4-7). + +We accept this for four coupled reasons. + +**1. Shape is intrinsic to the image, not derivable from cache state.** The cache's earlier shape memoization (`SipiCache::sizetable`) was parasitic — populated only as a side effect of `add()`, never independently persisted, vestigial after eviction, absent for un-cached origpaths. The right-shaped optimisation lives at the format-handler layer, not the cache layer. See [Probe 1](../deep-modules.md#probe-1--sipicache). + +**2. The S3 transition is a forcing function (3-6 months out).** Service masters are accessed remotely *today* (NFS-mounted ZFS spinning disk; each read is a network round trip with seek penalties on spinning disk). The S3 transition makes every read an HTTP **range GET** (~1-10ms per round trip, no seek but with TLS + auth overhead). The packet-at-fixed-offset design allows SIPI's pre-decode logic to fetch the packet with **one** range GET of a known prefix (e.g. the first 64KB of the file), then **one** targeted range GET for the data SIPI actually needs to decode. Without the packet, SIPI must walk format-native structures (TIFF IFD chains, JP2 box hierarchies) — each parse step is a separate range GET — racking up 5+ round trips per request, multiplied by pyramid depth. The latency difference is roughly 5ms vs. 50ms on the server hot path. NFS already pays a fraction of this cost today; S3 makes it the dominant load-bearing factor. + +**3. The packet's file-position must be fixed-prefix-readable.** For JP2: a UUID box positioned after the JP2 signature box and FTYP box. For pyramidal TIFF: a tag in the first IFD (which is reachable in the first 64KB by virtue of TIFF's header pointing to it at bytes 4-7). Both formats accommodate this without breaking spec compliance. If 64KB isn't enough for outliers (large embedded ICC, many pyramid levels), either bump the prefix size globally or include a `packet_size` field at a known fixed offset for a worst-case-bounded second range GET. + +**4. Existing service masters lacking the packet (or lacking the structure-offset additions) fall back to format-native structure walking.** Backward compatibility is preserved; the fast path activates incrementally as files are re-processed. No mass re-conversion required. + +## Considered Options + +- **Keep the parasitic shape memoization in the cache** — rejected. The two structs (`SizeRecord` and `CacheRecord`) overlap by construction (same fields, different keys); `sizetable` is populated only by `add()`, never independently persisted, vestigial after eviction, absent for un-cached origpaths. Over S3 the cache-as-shape-source design has no architectural advantage that wouldn't also exist for the format handler — and the memoization couples shape lookup to cache state, which is wrong. + +- **Memoize shape inside the peak-memory estimator** — rejected. The estimator is not the only consumer; canonical-URL computation also needs `img_w`/`img_h` on every request. Putting the memoization in the estimator would either duplicate the lookup at the canonical-URL site or force the canonical-URL site to depend on the estimator. Both worse than the format-handler-as-source design. + +- **JP2-specific Essentials-packet fast path only, no pyramidal TIFF** — rejected. Pyramidal TIFF is the planned successor to JP2 as the service master format. Designing the packet carrier as JP2-specific would require duplicating it for TIFF immediately. Defining the schema once at the Essentials-packet layer and letting both `SipiIOJ2k` and `SipiIOTiff` consume it is uniform. + +- **Add the shape-read fast path to every format handler** — rejected. PNG, JPEG, and non-pyramidal TIFF handlers are used either for IIIF output writes (server mode, post-decode) or for reading arbitrary-format input files in CLI mode (offline). Neither path is server-hot. Format-native header parsing is sufficient. + +- **Limit the packet to image shape only (no file-structure offsets)** — rejected. Without offsets, SIPI must walk format-native structures over S3 — the optimization is half-measure. The marginal cost of including offsets in the packet is small (a few hundred bytes for typical pyramids); the marginal benefit is large (5+ round trips → 1). + +- **Defer the S3 design to a separate future ADR** — rejected. The decisions are coupled (where the packet lives, what's in it, how SIPI reads it). Splitting risks designing the shape-only packet now, then rediscovering S3 constraints later and reopening the file format. The 3-6 month S3 horizon is too close to defer the design. + +## Consequences + +- **`SipiCache` shrinks**. `SizeRecord`, `sizetable`, `getSize()`, and the bug-prone non-cleanup of `sizetable` on eviction (`purge()` and `remove()` don't touch it — vestigial growth) all go away. The cache becomes a pure representation cache. See [Probe 1](../deep-modules.md#probe-1--sipicache). + +- **`SipiIO::getDim` is renamed to `read_shape`**. Existing virtual already returns full shape via `SipiImgInfo`; the rename is for self-documentation. Each subclass's override updates name. No new virtual method added. See [Probe 3](../deep-modules.md#probe-3--format_handlers-renamed-from-formats). + +- **Service master format handlers (`SipiIOJ2k` + pyramidal `SipiIOTiff`) get the Essentials-packet fast path** in `read_shape`. Implementation: range-read a fixed prefix (e.g. first 64KB), parse the packet from its known location, return shape from the packet's `shape` section. Fallback to format-native parsing if the packet is absent or lacks the requested fields. + +- **The Essentials packet schema gains image-shape AND file-structure-offset fields**. Wire format is CBOR per ADR-0005. The packet's role broadens from "preserve cross-conversion identity" to "preserve identity *and* serve as the S3-access index for the file." Old service masters without the new fields fall back to format-native parsing. + +- **`SipiHttpServer.cpp` request flow simplifies**. The call site at line 1571 becomes `format_handler->read_shape(infile)`, returning a full `SipiImgInfo` with `nc`/`bps` populated (the "remain 0" overestimate comment at line 1563 disappears). The decode-memory-budget admission check gets accurate inputs as a happy side effect. + +- **A `SourceReader` abstraction is needed** to wrap local FS / NFS / S3 access uniformly. Today format handlers call libtiff / Kakadu / libjpeg / libpng with file paths; under S3 they need a stream-or-range abstraction. Out of scope for this ADR; flagged as the natural follow-on (likely a separate ADR + module under `src/source_reader/` or similar). The Essentials-packet design in this ADR is independent of which `SourceReader` implementation runs underneath — same packet location, same parsing. + +- **`SipiLua.cpp` is not directly affected**. The Lua admin surface does not touch `getSize` or `read_shape`. + +- **Approval-test surface is unchanged**. Shape and structure-offset fields are read-only metadata; rendered image bytes are unaffected. + +- **Migration path for existing service masters**: existing JP2s and TIFFs in production lacking the packet (or lacking the structure-offset additions) fall back to format-native parsing — slower over S3 but functionally correct. New conversions populate the packet. No mass re-conversion required; fast path activates incrementally as files are re-processed. This is the load-bearing operational property — given the 100K-master-file install base ([ADR-0005](./0005-essentials-packet-versioned-binary-serialization.md)'s longevity invariant), mass re-encoding is not feasible. + +- **The decision is coupled with [ADR-0005](./0005-essentials-packet-versioned-binary-serialization.md)** (CBOR wire format). The CBOR choice matters more under S3 — every byte of header sits inside a range-GET span, and forward-compat allows future schema additions (e.g. per-tile offset tables for ultra-tile-heavy access patterns) without re-encoding. + +- **Glossary deltas land in [`UBIQUITOUS_LANGUAGE.md`](../../UBIQUITOUS_LANGUAGE.md)** in the batched edit pass: `Image shape`, `Operating mode` / `Server mode` / `CLI mode`, `Service master` / `Service master format`, `Archival master` / `Archival master format`, `Pyramidal TIFF`, `Object storage`, `Range GET`, `Codec` (sharpened), `read_shape` (rename of `getDim`), and a sharpened `Essentials packet` definition. Tracked in the [glossary delta register](../deep-modules.md#glossary-delta-register). diff --git a/docs/adr/0005-essentials-packet-versioned-binary-serialization.md b/docs/adr/0005-essentials-packet-versioned-binary-serialization.md new file mode 100644 index 00000000..c8c0aa6f --- /dev/null +++ b/docs/adr/0005-essentials-packet-versioned-binary-serialization.md @@ -0,0 +1,53 @@ +--- +status: proposed +--- + +# Essentials packet adopts a versioned, self-describing binary wire format (CBOR) + +The Essentials packet's wire format migrates from pipe-delimited text to **CBOR (RFC 8949)** with a top-level `format_version` field. New SIPI versions read every prior `format_version` they support; legacy text-format packets are detected by the absence of a CBOR-tagged byte sequence and parsed via the legacy reader indefinitely. New packets are always written in the CBOR form. + +We accept this because SIPI's preservation guarantee is that conversions don't lose metadata, and SIPI's installed base is *hundreds of thousands* of master files. Mass re-encoding to flip serialization formats is not operationally feasible at that scale; the wire format must therefore be forward-evolvable forever — new fields, new optimizations, new types — without requiring coordination across the existing master-file population. The current pipe-delimited format has zero versioning, no schema, no escaping, and no field discovery; ADR-0004 alone (adding eight image-shape fields) already pushes it past its design limits, and any subsequent schema change would compound the fragility. Biting this bullet now, while the install base is smaller than it will be at any future point, is strictly cheaper than deferring. + +CBOR is chosen over the obvious alternatives: + +- **Protocol Buffers** — solves forward-compat via field numbering, but adds a `protoc` codegen step and a `.proto` schema file to the build. The dep + workflow friction is not justified for a packet that's a few KB at most. +- **MessagePack** — functionally equivalent to CBOR with comparable forward-compat semantics, but no IETF standard. CBOR's preservation-community traction (used by COSE / CWT, JOSE successor stacks, IoT data formats, and increasingly by IIIF-adjacent specs) is the tiebreaker. +- **Custom versioned binary** — tightest bytes per packet, but reinventing schema-evolution semantics (length-prefixing, type tagging, default-value rules for missing fields) at a 100K-master-file horizon is the kind of decision a longevity-driven codebase should not bet on. CBOR's primitives are exactly the schema-evolution semantics we would otherwise reinvent. +- **JSON** — text-based; defeats the compactness goal in image headers and reintroduces the same escaping fragility that motivated this ADR. + +Forward-compatibility in CBOR works at two levels and we use both: + +- **Field-level**: additive schema changes (new fields) require no `format_version` bump. Readers ignore unknown CBOR map keys; writers add new keys at will. This covers the common case (e.g. ADR-0004's image-shape fields, future colour-space hints, future provenance fields). +- **Format-level**: breaking changes (field type changes, semantic redefinition) bump `format_version`. The new SIPI version is responsible for retaining a migrating reader for every prior `format_version` it claims to support. `format_version` is the *only* thing a reader checks before dispatching to a per-version parser; everything else is field-level evolution. + +## Considered Options + +- **Stay pipe-delimited; add ADR-0004 fields by appending more pipe segments** — rejected. Pushes past the format's design limits when we already know the next schema change is coming. The pipe-delimited reader has no escape mechanism (an `origname` containing `|` corrupts parse — latent bug today on macOS/Linux where filesystems allow `|`), no version discriminator, and no defined rule for unknown fields. Each future schema change pays the same cost; biting the bullet later is strictly more expensive than biting it now. + +- **Adopt Protocol Buffers** — rejected. The codegen + schema-file workflow is appropriate for service interfaces but heavyweight for a single embedded metadata packet. Forward-compat via field numbering is real but the tooling overhead is not. + +- **Adopt MessagePack** — rejected. Functionally equivalent to CBOR but lacks IETF standardization. No technical advantage; CBOR is the conservative choice for a 10+ year preservation horizon. + +- **Custom versioned binary format** — rejected. Reinvents what CBOR's spec already nails down (canonical encoding, type tagging, length-prefixing, integer/float/string/array/map primitives). The maintenance cost of an in-house format compounds across the lifetime of the master-file population. + +- **Adopt CBOR but without a `format_version` field**, relying entirely on field-level forward-compat — rejected. Field-level evolution covers most cases but cannot handle semantic redefinition (e.g. if `numpages` semantics ever change for some IIIF-spec reason). A coarse-grained version field is cheap insurance against the cases field-level evolution can't handle. + +## Consequences + +- **`SipiEssentials::parse(bytes)` becomes a dispatcher**: probe the leading bytes for a CBOR tag → CBOR parse via the chosen library; otherwise fall back to the legacy pipe-delimited parser. The legacy parser is retained indefinitely (longevity invariant). If both parsers fail, treat as no Essentials packet present (same as today's "is_set = false" path). + +- **`SipiEssentials::serialize()` always emits CBOR** with `format_version = 1` for the initial cutover. The class's in-memory representation is unchanged; only the wire format moves. The existing `operator std::string()` and `operator<<` overloads — which the [Probe 2 deep-module analysis](../deep-modules.md#probe-2--metadata) flagged as shallowness leaks anyway — are removed in favour of an explicit `serialize() → std::vector`. The on-disk artefact is *bytes*, not a `std::string`; the existing API's typing was a mistake. + +- **New build dep**: a CBOR library. Candidates include `tinycbor` (lightweight, C, used by IoT/embedded), `jsoncons` (header-only C++, broad format support), or an in-tree minimal encoder (CBOR is small enough that a few hundred lines covers our needs). Choice deferred to implementation time; not an architectural decision. + +- **`SipiEssentials::parse()` becomes fallible in a way the current API doesn't expose** (the dispatcher needs to report which parser was used, and CBOR-parse can fail in more ways than text-parse). Consider switching the API to `std::expected` to match `cpp-style-guide.md`'s preference. Aligns with the Rust target. + +- **Approval-test goldens for image-header bytes change**: where the test asserts on the embedded packet bytes (any test that round-trips a master file through the encoder and inspects the header), the goldens are regenerated alongside ADR-0004's image-shape-field addition. Do both changes in one PR so the approval-suite churn is single. + +- **Existing master files keep working unchanged**. Their pipe-delimited packets continue to parse via the legacy reader. The CBOR fast path activates incrementally as files are re-processed (CLI conversions, format-conversion writes), without any mass re-encoding event. This is the load-bearing operational property of this ADR. + +- **The pipe-delimited fragility goes away for new packets**: filenames with `|`, no schema versioning, no field discovery, no escape semantics — all replaced by CBOR's well-defined encoding rules. Old packets remain fragile but, by definition, are read-only at this point (they're already on disk; nothing new will be written in the legacy format). + +- **The `metadata/` Bazel package documented in [Probe 2](../deep-modules.md#probe-2--metadata) gains a CBOR-library dep but no visibility change.** Consumers (`SipiImage`, format handlers) see no API surface change at the call sites that already use `SipiEssentials` getters/setters. Only `parse` / `serialize` change shape. + +- **Future schema additions become low-friction**. Adding a new field is: (a) extend the in-memory struct, (b) write the new key in `serialize()`, (c) read the new key (with a sensible default) in `parse()`. No `format_version` bump, no migration tooling, no coordinated deploys. This is the operational property the 100K-master-file horizon requires. diff --git a/docs/adr/0006-format-handler-api-modernization.md b/docs/adr/0006-format-handler-api-modernization.md new file mode 100644 index 00000000..5c5dd7f1 --- /dev/null +++ b/docs/adr/0006-format-handler-api-modernization.md @@ -0,0 +1,91 @@ +--- +status: proposed +--- + +# Format handler API modernization (Rust-aligned, transitional) + +The format-handler API (`Sipi::SipiIO` base class + `SipiIOJ2k` / `SipiIOTiff` / `SipiIOJpeg` / `SipiIOPng` subclasses) is modernized as a coordinated set of changes that bring the C++ surface closer to the Rust idioms it will eventually be ported to under [ADR-0001](./0001-shttps-as-strangler-fig-target.md): + +1. **`std::expected` for fallible operations** — replacing `bool` returns + out-parameters and the tri-state `SipiImgInfo::success` enum. +2. **C++ default arguments instead of overload sets** for `read()` and `write()` — replacing the five `read()` overloads in `SipiIO.h`. +3. **A typed sum type for output sinks** — replacing the magic-string filepath sentinels (`"-"` for stdout, `"HTTP"` for the HTTP server output) with `std::variant`. +4. **Per-codec typed parameter structs in a `std::variant`** — replacing the stringly-typed `SipiCompressionParams = std::unordered_map` with `std::variant`. +5. **A SIPI-local `SIPI_TRY` macro** providing Rust-`?`-style early return for `std::expected` chains — addressing the loudest ergonomic gap with `std::expected` without adopting Abseil. + +We accept these because the [Rust-aligned, transitional C++](../deep-modules.md#method-invariants-dont-drift-on-these-across-sessions) method invariant mandates that C++ improvements should preference patterns that translate cleanly to Rust over patterns that would harden the C++ for the long term. Each of the five changes maps onto an established Rust idiom: + +| C++ pattern (current) | C++ pattern (target) | Rust equivalent (post-port) | +| --- | --- | --- | +| `bool read(...)` + out-params | `std::expected` | `Result<(), IoError>` | +| `SipiImgInfo::success` enum | `std::expected` | `Result` | +| 5 `read()` overloads | C++ default args | Default args / Builder | +| Magic strings (`"-"`, `"HTTP"`) | `std::variant` | Rust enum | +| `unordered_map` params | `std::variant` | Rust enum | +| `if (e) ... else e.error() ...` | `SIPI_TRY(var, expr)` | `let var = expr?;` | + +The `SIPI_TRY` macro is the only piece that doesn't survive the Rust port (Rust has the `?` operator natively). Everything else becomes a mechanical rename when the relevant module is reimplemented in Rust. + +## SIPI_TRY macro definition + +```cpp +// src/util/expected_macros.h +#define SIPI_TRY(var, expr) \ + auto&& _sipi_try_##var = (expr); \ + if (!_sipi_try_##var) \ + return std::unexpected(std::move(_sipi_try_##var).error()); \ + auto&& var = *std::move(_sipi_try_##var) +``` + +Usage: + +```cpp +std::expected render_thumbnail(const std::string& path) { + SIPI_TRY(handler, choose_handler(path)); // Result ? + SIPI_TRY(shape, handler->read_shape(path)); // Result ? + SIPI_TRY(decoded, handler->decode(path)); // Result ? + return scale(decoded, shape, kThumbSize); +} +``` + +Functionally equivalent to: + +```rust +fn render_thumbnail(path: &Path) -> Result { + let handler = choose_handler(path)?; + let shape = handler.read_shape(path)?; + let decoded = handler.decode(path)?; + Ok(scale(decoded, shape, THUMB_SIZE)) +} +``` + +## Considered Options + +- **Adopt Abseil's `absl::StatusOr` instead of `std::expected`** — rejected per the [Rust-aligned, transitional C++](../deep-modules.md#method-invariants-dont-drift-on-these-across-sessions) method invariant. Adopting Abseil would be a substantive C++-only commitment whose benefits (`StatusOr`, `flat_hash_map`, `StrCat`, etc.) all need re-evaluation for the Rust rewrite. `std::expected` maps directly onto Rust's `Result`; `absl::StatusOr` does not (Rust has nothing like `absl::Status`). + +- **Defer the modernization to the Rust rewrite itself** — rejected. The strangler-fig migration is incremental, not big-bang. Each Rust-rewritten module needs to interoperate with still-C++ surrounding modules through stable shims; the cleaner the C++ API at the seam, the cleaner the shim. Modernizing now reduces shim complexity later. + +- **Status quo (keep current API)** — rejected. The five identified issues compound at every consumer (`SipiHttpServer`, `SipiImage`, `SipiLua`, every format handler), making the eventual refactor strictly more expensive. + +- **Adopt `boost::leaf` or `boost::outcome`** — rejected. Neither maps cleanly to Rust's `Result`; both add a Boost dependency for marginal C++-only ergonomic wins. + +- **Custom `sipi::Result` type** — rejected. Would duplicate `std::expected` for negligible gain; deviates from the stdlib path the broader C++ ecosystem is converging on. + +## Consequences + +- **Every fallible function in `format_handlers/`** is rewritten to return `std::expected`. Callers update from `if (!handler->read(...)) { ... }` to `SIPI_TRY` or monadic `.and_then` chains. + +- **`SipiImgInfo::success` enum is removed**. The "DIMS" partial-success case becomes a `bool partial` field on the success branch: `std::expected` where `SipiImgInfo` carries `partial = true` when only dimensions could be read (no full shape). + +- **`SipiCompressionParams` becomes a `std::variant`** of per-codec parameter structs. The existing `SipiCompressionParamName` enum collapses into the variant alternatives. Each format handler's `write` accepts only its own variant alternative; mismatch is a compile-time error via `std::get` or `std::visit`. + +- **Magic-string filepaths are replaced with `std::variant`** at the `read()` / `write()` surface. Internal handling of stdout-vs-HTTP-vs-file is unchanged; only the API typing tightens. Resolves the ambiguity where `"HTTP"` was a special filename sentinel. + +- **`SIPI_TRY` macro lives at `src/util/expected_macros.h`** (or wherever a small util module lands). Single header, no implementation. + +- **Migration is staged**, not big-bang: one PR per format handler after `format_handlers/` is promoted to a Bazel package per [ADR-0003](./0003-module-co-located-source-and-tests.md). Tracked as child issues under the `format_handlers` parent issue in Linear. + +- **No behaviour change is intended** by this ADR alone. All five changes are API-shape-only. Decode/encode internals stay unchanged. Approval-test surface is unaffected. + +- **Rust port preview**: when a `format_handlers/` module is eventually rewritten in Rust under the strangler-fig plan, the API's shape requires no redesign — only a syntax translation. This is the load-bearing benefit of the modernization. + +- **Coupled with [ADR-0004](./0004-image-shape-ownership.md)**: `read_shape` (the renamed `getDim`) is the first method to land the new return-type style — `std::expected`. The `read_shape` modernization rides ADR-0004's amendment commit; the rest of the format-handler API follows in subsequent PRs. diff --git a/docs/adr/0007-sipiimage-decomposition.md b/docs/adr/0007-sipiimage-decomposition.md new file mode 100644 index 00000000..5bbc5375 --- /dev/null +++ b/docs/adr/0007-sipiimage-decomposition.md @@ -0,0 +1,70 @@ +--- +status: proposed +--- + +# `SipiImage` decomposition into `image/` + `image_processing/` + +The `SipiImage` god-object (~2,526 lines of `.cpp` + `.hpp`, six distinct responsibility groups in one class) is decomposed into: + +- **`src/image/`** — a pure value type (geometry, photometric, RAII pixel buffer, metadata composite) with ~15 public methods. No image-processing behaviour, no I/O facade, no HTTP integration. +- **`src/image_processing/`** — free functions over `const Image&` for crop, scale, rotate, colour conversion, channel ops, bit-depth reduction, dithering, watermark application, comparison, arithmetic. + +Three concerns leave the class entirely: + +- The static `io` registry (extension → format handler) moves to `format_handlers/` (registry of self). +- The `shttps::Connection*` field disappears (replaced by `OutputSink::HttpSink` per ADR-0006). +- The `app14_transform` JPEG-specific marker field moves into the JPEG decode pipeline (consumed at decode, inverted before `Image` is "complete"; downstream sees standard CMYK). + +We accept this for five reasons. + +**1. Six responsibility groups in one class is the textbook god-object shape.** Image data, metadata, format I/O facade, image processing, HTTP integration, and arithmetic are conceptually separable; co-locating them costs every consumer (heavy header transitively-includes, broken encapsulation via 5 friend declarations, raw-pointer pixel ownership requiring manual memory-management ceremony, leaky JPEG-specific field in the universal type). The accumulation is historical (the class is the SIPI core since 2016 and has absorbed every responsibility that touched images) rather than designed. + +**2. Free functions over a value type is the Rust-aligned shape.** `image.crop(...)` doesn't survive the Rust port directly (Rust doesn't have inheritance-based dispatch on values). `Sipi::processing::crop(image, ...)` maps cleanly to a Rust trait method or free function. Per the [Rust-aligned, transitional C++ method invariant](../deep-modules.md#method-invariants-dont-drift-on-these-across-sessions). + +**3. The HTTP-output dual-write optimization is preserved through `OutputSink::TeeSink` composition.** Today's encoder writes its byte stream simultaneously to the HTTP socket *and* a cache file. The current implementation uses the magic-string filepath sentinel `"HTTP"` to encode the destination, with the cache-write happening outside the format handler. Under ADR-0006's typed `OutputSink` variant, a `TeeSink` alternative composes other sinks: the request handler hands the format handler one sink — `TeeSink{HttpSink, FilePath}` — and the format handler writes once, with each chunk broadcast to all sub-sinks. Same semantic; cleaner location of the tee (inside the format handler's write loop, not in the request handler orchestrating two passes). Generalises naturally — an `S3Sink` alternative joining the tee covers write-through to remote storage if that ever lands. + +**4. Migration is staged into 8 small PRs**, each independently reversible. The first is a *non-code audit* of every internal use of the raw `pixels` pointer (catalog read / write / pointer-pass / arithmetic / `new[]`/`delete[]` cases) plus a benchmark of representative decode + scale + encode workloads. Output is a doc; gates the rest. This addresses the legitimate concern that pixel-buffer code may do "crazy stuff" depending on raw-pointer semantics — the audit produces explicit confidence (or surfaces an actual blocker) before any byte moves. + +**5. Performance is preserved.** `std::vector::operator[]` and `byte_ptr[i]` compile to identical machine code in optimized builds; `vec.data()` returns a `byte*` indistinguishable from the raw pointer for any C-API use (libtiff, libpng, libjpeg-turbo, Kakadu). The existing pixel-touching algorithms (Floyd-Steinberg dither, in-place colour conversion, libtiff strip writes, etc.) run at identical speed. No measurable regression expected; verified by the audit PR's benchmark before any swap. Fallback: if benchmarking shows any regression, use `std::unique_ptr` instead — same RAII property, no `vector` metadata, ownership-transferable semantics. + +## Considered Options + +- **Keep `SipiImage` as a god-object** — rejected. Six responsibility groups is the textbook over-loaded class; every consumer pays for behaviours they don't use. The Rust port has nothing to map this onto. + +- **Split into many tiny modules (one per behaviour)** — rejected. `image_processing/` as a single Bazel package with multiple sub-headers (`crop.h`, `scale.h`, etc.) is sufficient. Per-operation packages would multiply Bazel boilerplate without adding boundary value; nothing meaningful is enforced by separating `crop` from `scale`. + +- **Subclass-based decomposition** (e.g., `Image` base, `ProcessableImage` derived) — rejected. Inheritance hierarchies don't survive the Rust port (Rust uses traits, not class hierarchies). Free-functions-over-value-type maps to Rust traits cleanly. + +- **Keep `byte *pixels`; replace only ownership semantics** — rejected. Manual `new[]`/`delete[]` ownership requires explicit deep-copy ctor, move ctor, move-assignment, and dtor; ~100 lines of mechanical code that vanishes with `vector`. The maintenance cost is permanent. + +- **Use `std::unique_ptr` instead of `std::vector`** — kept as fallback. `vector` is preferred for inherent size-tracking (consistency check vs. claimed dimensions); `unique_ptr` is the second-best if benchmarking shows any regression. The audit PR (step 1) decides empirically. + +- **Defer the decomposition to the Rust rewrite itself** — rejected. The strangler-fig migration is incremental; each Rust-rewritten module needs to interoperate with still-C++ surrounding modules through stable shims. Decomposing now reduces the size of the eventual Rust translation unit and the shim surface. + +## Consequences + +- **`SipiImage` shrinks dramatically**. Public API ~15 methods (geometry, pixel access, metadata accessors, move/copy semantics) instead of ~50. Heavy header transitively-includes (shttps, format-handler types, 5 metadata standards) reduce to ~3 forward declarations + the value-type basics. Bazel `--strict_deps` becomes practical at the `image/` boundary. + +- **Free-function-style image processing**. `image.crop(...)` becomes `Sipi::processing::crop(image, ...)`. ~12 method-to-free-function rewrites at every call site. Lua bindings (`SipiLua.cpp`) need updating; thin facade methods may be retained on `Image` purely for binding ergonomics if Probe 6 finds it cleaner. + +- **No more friend classes**. The 5 `friend class` declarations (`SipiIcc` + 4 format handlers) all go away. Format handlers gain a public `pixels_writable()` API + metadata setters; `SipiIcc::iccFormatter` gains 1-2 new public Image accessors (whatever it needed via friend access today). + +- **No more raw `byte *pixels`**. Replaced by `std::vector` (or `std::unique_ptr` if benchmarking demands). RAII eliminates the explicit deep-copy ctor / move ctor / move-assignment / dtor dance — the compiler's defaults work correctly with a vector member. `cpp-style-guide.md`'s "no raw owning new/delete" rule honoured. + +- **No more HTTP coupling in `SipiImage`**. The `shttps::Connection*` field disappears; `connection()` accessor methods deleted. The HTTP-output sink is a parameter to format-handler `write()` per ADR-0006, not a property of `Image`. Resolves the SIPI ↔ shttps boundary leak this represents. + +- **Static `io` registry leaves**. `Sipi::format_handlers::dispatch(extension) → SipiIO*` (or similar) lives in the format-handler module, where it conceptually belongs. + +- **`app14_transform` field removed**. The JPEG handler inverts CMYK/YCCK at decode time so downstream code sees standard CMYK; no transient flag needed on the universal Image type. + +- **TeeSink for dual-write preservation**. ADR-0006's `OutputSink` variant gains a `TeeSink { std::vector outputs; }` alternative. Tracked in [DEV-6382](https://linear.app/dasch/issue/DEV-6382). + +- **Migration is gated on [DEV-6341](https://linear.app/dasch/issue/DEV-6341) reaching Y+6** for the Bazel-package-dependent steps (move static `io` map, remove friends with public API, extract image-processing). The non-Bazel-dependent steps (raw-pointer audit, vector swap, `app14_transform` removal) can land in CMake era. + +- **Approval-test surface unchanged**. Behaviour preservation is intended throughout the decomposition. Approval goldens stay valid. + +- **Lua-binding surface ([SipiLua.cpp](../../src/SipiLua.cpp))** — Probe 6 resolves. Potentially the Lua API exposes `image:crop(...)` as method-style calls; if so, the Lua binding layer absorbs the C++-method-to-free-function translation transparently. Probe 6 outcome may add facade methods on `Image` purely for binding convenience. + +- **Glossary deltas land in [`UBIQUITOUS_LANGUAGE.md`](../../UBIQUITOUS_LANGUAGE.md)** in the batched edit pass: add **Image processing** (umbrella for the free-function module). Sharpen **Image** (the code-level class becomes a narrow value type; domain term stays correct). + +- **Tracked in Linear** under a new parent issue (sibling to [DEV-6374](https://linear.app/dasch/issue/DEV-6374)) with 8 children corresponding to the staged sub-PRs. Inter-issue dependencies modelled via Linear's `blockedBy` relations. diff --git a/docs/deep-modules.md b/docs/deep-modules.md new file mode 100644 index 00000000..4c520b41 --- /dev/null +++ b/docs/deep-modules.md @@ -0,0 +1,454 @@ +--- +status: in-progress +--- + +# Deep Modules — working design log + +A multi-session exercise to identify Deep Modules (Ousterhout, *A Philosophy of Software Design*) in the SIPI codebase, using the ubiquitous language as a probe and extending the language as gaps surface. The output is the input list and seam shape for the Bazel package layout introduced by [ADR-0003](./adr/0003-module-co-located-source-and-tests.md). + +This file is the durable artifact of the conversation. It is intended to be resumed cold in a later session — every decision and convention needed to continue is written down here. + +## Goal + +Decide which Bazel packages SIPI should ship as, by treating each candidate package as a Deep Module: a directory whose public-header surface is small relative to the complexity hidden behind it. The deliverables, accumulated across sessions, are: + +1. A list of Deep Module candidates with verdicts (deep / shallow / scattered / mis-bounded / god-object) and concrete actions (keep / split / merge / extract / rename). +2. A patch list against [`UBIQUITOUS_LANGUAGE.md`](../UBIQUITOUS_LANGUAGE.md) for terms to add, sharpen, or retire — surfaced as side-effects of the module probes. +3. A short list of follow-up ADRs for decisions that meet the bar (hard to reverse, surprising without context, real trade-off). + +## Scope + +**Unit of "module" = Bazel package.** Per ADR-0003, a SIPI module is a directory under `src/` with co-located `.cpp` / `.h` / `*_test.cpp` and its own `BUILD.bazel` declaring a `cc_library` with explicit `hdrs` and `visibility`. The module's interface is the headers in `hdrs`; its depth is what sits behind them. + +This unit was chosen over two alternatives: + +- **Class-level** (e.g. is `SipiCache` itself deep?). Useful follow-up *inside* a package; not the load-bearing decision. Falls out from getting package boundaries right. +- **Context-level** (split or merge bounded contexts). Already operated on via [`CONTEXT-MAP.md`](../CONTEXT-MAP.md) (SIPI ↔ shttps); a third context is a separate decision tracked by [ADR-0001](./adr/0001-shttps-as-strangler-fig-target.md). + +**Why this matters now.** ADR-0003's reframing: AI-throughput coding has shifted the human's role from writing code to defining and policing architecture. Bazel's `cc_library` + `--strict_deps` + `package_group` + `visibility` turns the package list into build-graph invariants — a forbidden `#include` fails analysis, not code review. Picking the right package list is therefore a leverage point: every package boundary we draw is a rule the build will then enforce automatically. + +## Hypothesis + +> Every term in [`UBIQUITOUS_LANGUAGE.md`](../UBIQUITOUS_LANGUAGE.md) should correspond to either: +> +> 1. a Deep Module that already exists in code (verdict: **deep, keep**), +> 2. a Deep Module that should exist but is scattered across files today (verdict: **scattered, extract**), +> 3. a pure vocabulary item with no module shape (verdict: **language-only**). +> +> Plus a fourth bucket: **modules in code with no glossary term** — these reveal where the language must be *extended*. + +The (2)-bucket cases are the highest-value refactors: the language predicts a module the code hasn't built yet. The (1)-bucket cases tell us which boundaries are healthy and should be promoted to Bazel packages as-is. The (3)-bucket cases sharpen the language. The fourth bucket is where this exercise feeds back into the glossary. + +## Methodology — probe-and-extend + +Chosen over sweep-first. + +- **Sweep-first** (rejected): walk the codebase systematically, surface every concept lacking a glossary name, complete the language, *then* run the completed glossary against the code to find Deep Modules. Thorough; long; risks producing a list disconnected from refactor priorities. +- **Probe-and-extend** (chosen): take the existing glossary as a partial map, run each term against the code, and *whenever a gap forces itself on us* — "I can't classify this because we don't have a name for X" — add X to the glossary in the same pass. Concepts get named in the order they matter for the Deep Module decision. + +Mitigation for (2)'s blind spots: at the end of the term-driven probes, do a short directory-listing-only sweep to confirm no top-level subsystem went un-touched. + +## Probe template + +Each probe produces one row in the [Probe register](#probe-register) below. Fields: + +| Field | What it captures | +| --- | --- | +| **1. Module name** | The future Bazel package (directory under `src/`). | +| **2. Glossary term(s)** | Which `UBIQUITOUS_LANGUAGE.md` term(s) the module implements; flagged if the term is missing. | +| **3. Public interface** | The small set of headers proposed for `hdrs` — what callers see. List the headers, not the symbols. | +| **4. Private surface** | What stays inside: private headers + `.cpp` files. Just the count + a one-line note on what they implement. | +| **5. Depth signal** | Ousterhout-style note: does the interface hide complexity, or pass it through? (Pass-through methods, leaked types, large `hdrs` count, callers needing internal knowledge — all shallow signals.) Include a rough public:private ratio. | +| **6. Verdict** | One of: `deep` / `shallow` / `scattered` / `mis-bounded` / `god-object` / `language-only`. | +| **7. Action** | Concrete next step: `keep` / `split into A,B,C` / `merge with X` / `extract X from Y` / `rename to Z`. Include the rationale in one sentence. | +| **8. Glossary delta** | Terms to add, sharpen, or retire as a side-effect of this probe. Empty if none. | + +Probe rows accumulate; they are not rewritten as the analysis progresses (history of thinking is part of the artifact). If a later probe revises an earlier verdict, append a `**Revised:** — …` line to the original row rather than editing the verdict in place. + +## Probe order + +Chosen to validate the template on easy ground first, then test gap-detection, then tackle the worst offender once vocabulary is loaded: + +1. **`SipiCache`** — likely textbook deep module (small public surface: store/fetch/evict; deep internals: dual-limit LRU + crash recovery). Validates the template. +2. **Identifier resolution / path traversal validation** — scattered-concept candidate. Glossary mentions it inline (`Image root` / `Identifier` definitions) but no obvious file owns it. Tests the (2)-bucket detector. +3. **`SipiImage` (~80 KB hpp+cpp)** — prime god-object suspect. Highest leverage; messiest probe. By this point we should have enough vocabulary to decompose it without flailing. +4. **Subsystem directories that already exist as candidates**: `formats/`, `handlers/`, `iiifparser/`, `metadata/`. Likely already module-shaped; the probe confirms or sharpens their boundaries. +5. **`SipiHttpServer` (~84 KB)** — second-largest file. Probably mostly route logic that should live in `handlers/`; what's left is the server lifecycle. +6. **`SipiLua` (~60 KB)** — large, but bounded by the Lua FFI surface. Probably one module; the question is whether it splits along the three Lua entry points (Init / Preflight / Route handler). +7. **`Backpressure` cluster**: `SipiMemoryBudget` + `SipiRateLimiter` + (admission gate). Glossary defines `Backpressure` as the umbrella — the module probably exists in concept but is split across files. +8. **Curiosities with no glossary term**: `PhpSession`, `Salsah`, `Template`, `SipiReport`, `Logger`. Determine which are live, which are dead, which need a glossary term. +9. **Operational surface**: `SipiMetrics`, `SipiSentry`, `SipiPeakMemory`, `SipiConnectionMetricsAdapter`, config (`SipiConf`). May coalesce into one observability module. +10. **Top-level**: `sipi.cpp` (entry point, ~63 KB) — the CLI vs server-mode split. Likely thin once the rest is modularized; confirm. + +After step 10, the closing sweep: list every directory and loose file under `src/` and confirm each has been classified. + +## Probe register + +Rows are added as probes complete. Format: one section per probe with the 8 template fields. + + + +## Probe 1 — `SipiCache` + +**1. Module name.** `src/cache/{cache.h, cache.cpp, cache_test.cpp}` (currently top-level `src/SipiCache.{cpp,h}` + `include/SipiCache.h`). Future Bazel package `//src/cache:cache`. + +**2. Glossary terms.** Implements **Cache** (defined). Uses **Cache key** + **Canonical URL** (defined). Surfaces three glossary gaps — see [Glossary delta register](#glossary-delta-register). + +**3. Public interface (proposed `hdrs`).** `cache.h` exposing class `Cache` with: +- `Cache(cachedir, max_size_bytes, max_files)` / `~Cache` +- `check(origpath, canonical) → BlockedScope` (RAII pin; replaces `block_file=true` + manual `deblock`) +- `add(origpath, canonical, cachepath, img_w, img_h, tile_w, tile_h, clevels, numpages)` +- `remove(canonical) → bool` +- `purge() → int` +- `getNewCacheFileName() → string` +- `loop(worker, userdata, sort_method)` — admin/Lua iteration +- `stats() → Stats` — collapses the five getters +- Public types kept: `CacheRecord` (callback contract for `loop`), `SortMethod`, `ProcessOneCacheFile`, `Stats`, `BlockedScope`. + +**4. Private surface.** `cache.cpp` + per-module unit test. Internals retained (not in `hdrs`): +- LRU eviction with dual-limit (size + file-count) and 80% low-water mark. +- Crash recovery: orphan scan when `.sipicache` is missing; index-corruption detection (size %% record-size); first-time directory creation. +- Persistence: `FileCacheRecord` on-disk format (now private), serialize on shutdown, deserialize on startup. +- Concurrency: one mutex + atomic counters for cache-used-bytes / nfiles. +- Pinned-file refcount map (`blocked_files`); `BlockedScope` owns the lifecycle. +- `tcompare` (timestamp comparison) — moved from public to private. +- `clearCacheDir` static helper. +- `_compare_access_time_*` / `_compare_fsize_*` static comparators. + +**5. Depth signal.** Core is a genuine deep module — dual-limit LRU + crash recovery + atomic eviction with low-water mark + persistence + concurrency, all behind a small conceptual surface. Today's `include/SipiCache.h` has four shallowness leaks that the proposed `hdrs` removes: +- `tcompare()` is public (pure internal utility). +- `check(block_file=true)` paired with `deblock(name)` — manual pairing, locking concern leaks (RAII `BlockedScope` fixes this). +- `getCacheDir()` exposes filesystem layout to Lua admin scripts (audit during `SipiLua` probe; either remove or fold into `stats()`). +- Five separate getters (`getCacheUsedBytes`, `getMaxCacheSize`, `getNfiles`, `getMaxNfiles`, `getCacheDir`) → one `Stats stats()` struct. +- `FileCacheRecord` (on-disk `char[256]` format) is in the public header but only `SipiCache` itself uses it. + +Plus one larger finding (see #6/#7) — the cache held two responsibilities, and the second is being amputated entirely rather than extracted. + +**6. Verdict.** `deep` post-refactor; today, `deep with shallow leaks and a co-located non-cache responsibility`. + +**7. Action.** + +a. **Promote to Bazel package** per ADR-0003. Co-locate `cache_test.cpp` from current `test/unit/cache/cache.cpp`. + +b. **Tighten public interface** per #3: RAII `BlockedScope`, collapse getters to `Stats stats()`, privatize `FileCacheRecord` and `tcompare`. Mechanical; updates `SipiHttpServer.cpp` and `SipiLua.cpp` call sites. + +c. **Amputate the image-shape responsibility** — delete `SizeRecord`, `sizetable`, and `getSize()` from `SipiCache`. The two structs (`SizeRecord` and `CacheRecord`) overlap because they hold the same data with different keys; `sizetable` was a parasitic side-effect index populated only by `add()`, never independently persisted, surviving eviction (`purge` and `remove` don't clean it up — bug, made moot by the deletion), and never populated for un-cached origpaths. It was barely a cache. Image shape lookup moves to format handlers — see [ADR-0004](./adr/0004-image-shape-ownership.md). + +d. **Audit the Lua admin surface** during the `SipiLua` probe: `SipiLua.cpp` currently uses `getCacheDir`, `loop`, `remove`, `purge`, `add`. Decide whether the Lua-facing API is the same as the C++ public API or a thin facade. The `getCacheDir` exposure is the clearest case for separation. + +**8. Glossary delta.** See [Glossary delta register](#glossary-delta-register) for the full set. Summary: +- Add **Image shape** — intrinsic shape of a source image; read by format handlers. +- Add **Operating mode** with **Server mode** + **CLI mode** sub-terms — the asymmetry between which format handler reads vs. writes is architecturally load-bearing. +- Add **Cache pin** (provisional) — confirm during `SipiHttpServer` probe. +- Sharpen **Essentials packet** — extend schema with image-shape fields per ADR-0004. + +**9. Open questions for later probes.** +- `getCacheDir`'s removal or retention depends on the Lua admin surface (Probe `SipiLua`). +- The exact set of fields in `Stats` depends on what the `/metrics` endpoint and `SipiLua` consumers actually need (Probes `Operational surface` and `SipiLua`). +- `BlockedScope`'s API — pure RAII, or does it expose `blocked()` for the caller to detect "all-blocked, can't add"? — depends on Probe `SipiHttpServer` flow analysis. + +## Probe 2 — `metadata/` + +**1. Module name.** `src/metadata/` (existing). Future Bazel package `//src/metadata:metadata`. One package, not split per glossary umbrella — the language distinguishes Embedded metadata vs. Essentials packet, but in code they share consumers (every format handler + `SipiImage`) and a real layering boundary would not pay back the visibility/dep multiplication. + +**2. Glossary terms.** Implements the **Preservation metadata** umbrella with both subordinate parts: **Embedded metadata** (EXIF, IPTC, XMP, ICC) via `SipiExif`/`SipiIptc`/`SipiXmp`/`SipiIcc`, and **Essentials packet** via `SipiEssentials`. Surfaces two glossary deltas — one sharpening, one new term — see register. + +**3. Public interface (proposed `hdrs`).** + +| Header | Class | Concern | +| --- | --- | --- | +| `metadata/essentials.h` | `SipiEssentials` | SIPI-owned preservation packet schema | +| `metadata/exif.h` | `SipiExif` | EXIF wrapper (exiv2) | +| `metadata/icc.h` | `SipiIcc` | ICC profile wrapper (lcms2) — `iccBytes()` chokepoint per ADR-0002 | +| `metadata/iptc.h` | `SipiIptc` | IPTC wrapper (exiv2) | +| `metadata/xmp.h` | `SipiXmp` | XMP wrapper (exiv2) | + +The umbrella structure shows up as header organisation, not as separate Bazel packages. + +**4. Private surface.** `metadata/internal/icc_normalization.h` (formerly `SipiIccDetail.h`) with `visibility = ["//src/metadata:__pkg__", "//test/unit/sipiicc:__pkg__"]`. Plus the `.cpp` files. Test seam pattern documented (see glossary delta). + +**5. Depth signal.** Mixed at the per-class level: + +| Class | hdr / cpp lines | Verdict | Reason | +| --- | --- | --- | --- | +| `SipiIcc` | 175 / 664 | **deep** | 1:4 ratio; `iccBytes()` chokepoint is the textbook Ousterhout hidden-complexity case (per ADR-0002) | +| `SipiXmp` | 93 / 166 | reasonable | depth in exiv2 RDF/XML wrapping | +| `SipiIptc` | 66 / 59 | reasonable | small but bounded | +| `SipiEssentials` | 209 / 213 | **shallow** | 1:1 ratio; inline `operator<<`, inline `operator std::string()`, 17 getter/setter pairs, pipe-delimited serialization in header | +| `SipiExif` | **321 / 150** | **shallow** | header > cpp; 22 inline `assign_val` template overloads + `typeid`-dispatched template `addKeyVal` pull the entire exiv2 type universe into every consumer's TU | + +Plus one module-level leak: **third-party headers (``, ``) appear in every public metadata header**. Every format handler, `SipiImage.hpp`, and `SipiSentry.h` get exiv2 + lcms2 in their compilation graph and see raw `Exiv2::`/`cms*` types in method signatures. Bazel `--strict_deps` will require either re-exporting these as transitive deps (anti-pattern) or hiding them via pImpl. One fix at the metadata boundary pays back at every consumer. + +**6. Verdict.** Module-level `deep` (foundation layer of image processing — many consumers, real third-party-binding depth), with `shallow` leaks at the per-class level (Exif, Essentials) and a third-party-type leakage at the public-header boundary. + +**7. Action.** + +a. **Promote to Bazel package** `//src/metadata:metadata` per ADR-0003. Co-locate `*_test.cpp` from current `test/unit/sipiicc/`. + +b. **Move `SipiIccDetail.h` → `metadata/internal/icc_normalization.h`** with `visibility` restricted to the package itself + the ICC unit test target. Use as the canonical **Test seam** reference in the glossary. + +c. **Refactor `SipiExif` header**: move all inline `assign_val` and template `addKeyVal` definitions to `.cpp` with explicit instantiations for the types we use (`std::string`, `int`/`long`/`float`/`double` and their vectors, `Exiv2::Rational`/`URational`). Replace `typeid` dispatch with C++20 concepts or explicit specialization (matches the Rust-alignment direction discussed earlier). Hide `Exiv2::Rational` behind a SIPI-defined `Rational = std::pair` if it doesn't break call sites materially. + +d. **Refactor `SipiEssentials`**: tighten public interface to one `parse(bytes) → SipiEssentials` (free function or static factory) + one `serialize() → std::vector` + a small struct of accessors. Drop the inline string-conversion operators. Adopt versioned binary wire format per ADR-0005. Add image-shape fields per ADR-0004 (which lands inside the new format, not the legacy one). + +e. **Modernize C-pointer ownership across the module**: drop the dual-overload pattern (`unsigned char* xxxBytes(unsigned int& len)` + `std::vector xxxBytes()`); keep only the vector form. Matches `cpp-style-guide.md` "no raw owning new/delete" and aligns with Rust's no-raw-pointer-ownership rule. + +f. **Defer**: `SipiImage`-`SipiIcc` friend-class coupling (Probe 3); `SipiSentry::SipiIcc` dependency (Probe 9 — possibly vestigial); Lua-exposure surface (Probe 6). + +**8. Glossary delta.** See [Glossary delta register](#glossary-delta-register). Two adds: +- **Sharpen `Essentials packet`** — note pipe-delimited fragility + planned versioned-binary successor per ADR-0005. +- **Add `Test seam`** — header in module's `internal/` subdirectory with restricted visibility. + +**9. Open questions for later probes.** +- The `SipiImage.hpp:645` `friend class SipiIcc` — necessary coupling, or removable when `iccFormatter(SipiImage*)` is split? Probe 3 (`SipiImage`). +- `include/SipiSentry.h` includes `metadata/SipiIcc.h` — what does Sentry need from ICC? Probably context-data attachment for crash reports; might be vestigial. Probe 9. +- Lua admin surface (`SipiLua`) — does it directly manipulate metadata, or only consume it through `SipiImage`? Probe 6. +- ADR-0005 implementation choice between `tinycbor` / `jsoncons` / a small in-tree CBOR encoder — defer to implementation time; not an architectural decision. + +## Probe 4 — `SipiImage` (god-object decomposition) + +**1. Module name.** Two new Bazel packages, splitting the current `src/SipiImage.{cpp,hpp}` (~2,526 lines combined): + +| New package | Concern | +| --- | --- | +| `src/image/` | Pure value type: geometry, photometric, RAII pixel buffer, metadata composite. ~15 public methods. | +| `src/image_processing/` | Free functions over `const Image&`: crop, scale, rotate, colour convert, channel ops, bit-depth reduction, dithering, watermark application, comparison, arithmetic. | + +Plus three concerns leave the class entirely: +- Static `io` registry → `format_handlers/` (registry of self). +- `shttps::Connection*` field → disappears (subsumed by `OutputSink::HttpSink` per ADR-0006). +- `app14_transform` JPEG marker field → moves into the JPEG decode pipeline (consumed at decode, inverted before `Image` is "complete"). + +**2. Glossary terms.** Implements **Image** (the code-level class; domain-level term unchanged). Surfaces **Image processing** as a new umbrella term. Touches `metadata/`, `format_handlers/`, and the shttps boundary. + +**3. Public interface (proposed `hdrs`).** + +``` +src/image/image.h — class Image (geometry, pixel access, metadata accessors, RAII pixel buffer) +src/image/image_metadata.h — ImageMetadata composite (5 standards bundle; possibly inlined into Image) + +src/image_processing/crop.h +src/image_processing/scale.h — one scale() taking ScalingQuality; replaces 3 named methods +src/image_processing/rotate.h +src/image_processing/color_convert.h — convertYCC2RGB, convert_to_icc +src/image_processing/channel_ops.h — remove_channel, remove_extra_samples +src/image_processing/bit_depth.h — to_8bps, to_bitonal +src/image_processing/watermark.h — DEFERRED to Probe 5 / Watermark glossary entry +src/image_processing/arithmetic.h — operator-, operator+, operator==, compare (or test-only) +``` + +**4. Private surface.** `.cpp` files implementing the above. The IO map / format dispatch moves to `format_handlers/`. The 5 friend classes (4 format handlers + `SipiIcc`) all go away — replaced by public `pixels_writable()` API + metadata setters + 1-2 new accessors for `iccFormatter`. The raw `byte *pixels` becomes `std::vector` (or `unique_ptr` if benchmarking shows any regression — the audit PR decides). + +**5. Depth signal.** Six distinct responsibility groups in one class — textbook god-object: + +| Group | Surface | +| --- | --- | +| Image data container | `nx, ny, nc, bps, pixels (raw!), es, orientation, photo` | +| Metadata holder | 4 `shared_ptr` + `SipiEssentials` value | +| Format I/O facade | static `io` map + `read()` / `readOriginal()` / `write()` / `getDim(filepath)` | +| Image processing | `crop`, `scale`/`scaleFast`/`scaleMedium`, `rotate`, `convertYCC2RGB`, `convertToIcc`, `removeChannel`, `removeExtraSamples`, `to8bps`, `toBitonal`, `add_watermark`, `set_topleft` (12 methods) | +| HTTP integration | `conobj` raw `shttps::Connection*` + `connection()` accessors | +| Algebra / comparison | `operator-=`, `operator-`, `operator+=`, `operator+`, `operator==`, `compare`, `operator<<` | + +Plus **specific code smells**: +- Raw `byte *pixels` (manual `new[]`/`delete[]`; reason for explicit deep-copy + move semantics; violates `cpp-style-guide.md`'s "no raw owning new/delete"). +- 5 `friend class` declarations (encapsulation deliberately broken). +- `app14_transform` field — JPEG-specific Adobe APP14 marker on the universal type. +- Static `io` registry — misplaced concern (registry of format handlers belongs in `format_handlers/`). +- Three scale methods (`scale`, `scaleMedium`, `scaleFast`) — separate names for the same operation at different qualities; unstable API. +- Two `getDim` methods with different semantics (filepath query vs. instance state out-params). +- Inconsistent metadata accessors (`getExif`, `getIcc`, `getXmp` exist; **no `getIptc`** despite `iptc` member). +- Heavy header includes — every consumer transitively pulls shttps + format-handler types + 5 metadata + exiv2 + lcms2. + +**6. Verdict.** **`god-object`** — unambiguous. Highest-leverage decomposition target in the codebase. + +**7. Action.** Per [ADR-0007](./adr/0007-sipiimage-decomposition.md). Eight staged sub-PRs (each independently reversible), tracked under a new Linear parent issue: + +a. **Audit** every internal use of `pixels` (catalog read / write / pointer-pass / arithmetic / `new[]`/`delete[]` cases). Output is a doc + benchmark; no code change. Gates the rest. + +b. **Replace `byte *pixels` with `std::vector`**. RAII; eliminates the explicit copy/move/dtor dance. Performance verified at parity in (a)'s benchmark. + +c. **Remove `app14_transform` field**. JPEG handler inverts CMYK/YCCK at decode time; downstream sees standard CMYK. + +d. **Move static `io` map** to `format_handlers/`. SipiImage stops being a registry of format handlers. + +e. **Remove `conobj` field + `connection()` accessors**. Coupled with [DEV-6382](https://linear.app/dasch/issue/DEV-6382) (OutputSink with TeeSink for dual-write — see ADR-0006 / ADR-0007). + +f. **Add public `pixels_writable()` API + metadata setters**; remove the 4 format-handler friend declarations. + +g. **Extract image-processing methods** to `src/image_processing/` free functions over `const Image&`. ~12 method-to-free-function rewrites at every call site. + +h. **Remove `SipiIcc` friend** (probably needs 1-2 new public Image accessors so `iccFormatter` works without internal access). Move arithmetic operators (`operator-`, `operator+`, `operator==`, `compare`) to free functions in `image_processing/arithmetic.{h,cpp}` or test-only. + +Bazel package promotion (steps d, f, g specifically) gated on [DEV-6341](https://linear.app/dasch/issue/DEV-6341) reaching Y+6. Steps a, b, c can land in CMake era. + +**8. Glossary delta.** Add **Image processing** (umbrella for the free-function module). Sharpen **Image** (the code-level class becomes a narrow value type post-refactor; domain term stays correct). + +**9. Open questions for later probes.** + +- **Lua-binding surface** ([SipiLua.cpp](../src/SipiLua.cpp), Probe 6): the Lua API likely exposes `image:crop(...)` style method-call syntax; if so, the Lua binding layer absorbs the C++-method-to-free-function translation transparently. May require keeping thin facade methods on `Image` purely for binding ergonomics. Probe 6 resolves. +- **Watermark module** (Probe 5): the watermark loading + applying logic might live entirely in route handlers (close to where `Permission` decides to apply it) rather than in `image_processing/`. Defer placement. +- **`ImageMetadata` composite** — own type or just members of `Image`? Decide during implementation; tightly bundled either way. +- **TeeSink composition** preserves the dual-write optimization (encoder writes to HTTP socket *and* cache file simultaneously). Documented in ADR-0006 + ADR-0007; implemented in DEV-6382. + +## Probe 3 — `format_handlers/` (renamed from `formats/`) + +**1. Module name.** `src/format_handlers/` (renamed from `src/formats/` per Probe 1 follow-up). Future Bazel package `//src/format_handlers:format_handlers`. Co-located source / headers / `*_test.cpp` per ADR-0003. The `SipiIO` base-class header migrates from top-level `include/SipiIO.h` into the same package. + +**2. Glossary terms.** Implements **Format handler** (defined). Uses **Codec** (defined; sharpened — see register). The earlier "Master format" term splits into **Service master format** (this module's read-and-write target) and **Archival master format** (out of SIPI server's read path; CLI mode produces in the future workflow). Surfaces the `getDim` → `read_shape` rename plus four new glossary entries — see register. + +**3. Public interface (proposed `hdrs`).** + +| Header | Concern | +| --- | --- | +| `format_handlers/sipi_io.h` | base class `SipiIO` + public types (`SipiImgInfo`, `ScalingMethod` / `ScalingQuality`, `Orientation`, `SubImageInfo`, `SipiCompressionParamName` / `SipiCompressionParams`) | +| `format_handlers/sipi_io_j2k.h` | `SipiIOJ2k` — service-master-format handler (JP2) | +| `format_handlers/sipi_io_tiff.h` | `SipiIOTiff` — bidirectional (pyramidal = service master, plain = output) | +| `format_handlers/sipi_io_jpeg.h` | `SipiIOJpeg` — output-only | +| `format_handlers/sipi_io_png.h` | `SipiIOPng` — output-only | + +**4. Private surface.** Five `.cpp` files totalling 5,461 lines: J2k (1292), Jpeg (1301), Png (629), Tiff (2239). Per-handler `*_test.cpp` files added during package promotion (today: zero unit tests, only approval coverage — explicit ADR-0003 consequence). Per-handler private statics (`write_basic_tags`, `write_subfile`, `parse_photoshop`, etc.) move to anonymous namespaces in their `.cpp`. + +**5. Depth signal.** Per-class header:cpp ratios are textbook deep: + +| Class | hdr / cpp | Verdict | +| --- | --- | --- | +| `SipiIOJ2k` | 64 / 1292 (1:20) | **deep** | +| `SipiIOJpeg` | 66 / 1301 (1:20) | **deep** | +| `SipiIOPng` | 66 / 629 (1:10) | **deep** | +| `SipiIOTiff` | 119 / 2239 (1:19) | **deep** | +| `SipiIO` (base) | 183 / — | bloated public-type header in front of a small virtual contract | + +**Layering oddities** (will fail Bazel `--strict_deps` until ADR-0003 colocation lands): + +- `include/SipiIO.h` is at top-level, not in `include/formats/`. Colocation moves it into the package. +- Format-handler headers do `#include "../../src/SipiImage.hpp"` — cross-tree relative include from `include/formats/` to `src/`, *and* pull the full `SipiImage` definition when only `SipiImage*` is used in signatures. +- Inconsistency: `SipiIOJpeg.h` uses `#include "SipiImage.hpp"` (no `../../`) where the other three use the relative path. + +**API modernization opportunities** (tracked in ADR-0006): + +1. Five `read()` overloads for default arguments → C++ default args. +2. `bool read()` returns → `std::expected`. +3. `SipiImgInfo::success` tri-state enum → `std::expected`. +4. `SipiCompressionParams = unordered_map` (stringly-typed) → `std::variant`. +5. Magic-string filepaths (`"-"` for stdout, `"HTTP"` for HTTP-server output) → `std::variant`. + +**The big ADR-0004 correction.** `SipiIO::getDim(filepath) → SipiImgInfo` *already exists* and returns full image shape (`width, height, tile_w, tile_h, clevels, nc, bps, numpages, orientation, mimetype, resolutions`). ADR-0004's `read_shape` IS `getDim`, renamed. The actual change ADR-0004 makes is *"give the existing `getDim` an Essentials-packet fast path in service-master-format handlers"* — not *"add a new method."* The "nc/bps remain 0" overestimate at `SipiHttpServer.cpp:1563-1566` was a `SipiCache::SizeRecord` limitation, not a `getDim` limitation; ADR-0004 fixes it as a happy side effect. ADR-0004 amended in this same commit. + +**Master/output asymmetry is operational, not class-level.** `SipiIOTiff` handles both pyramidal (service master) and non-pyramidal (output) TIFF. The Essentials-packet fast path is keyed by *packet presence* at runtime, not by class. `SipiIOJ2k` is service-master-only by current operational policy. `SipiIOJpeg` + `SipiIOPng` are output-only by current operational policy (server mode never reads them as source files). + +**Misplaced symbol.** `SipiIOTiff.h:22` declares free function `read_watermark(wmfile, nx, ny, nc) → unsigned char*`. Loads a watermark image. Not TIFF-specific; belongs in a watermark module (pending `Watermark` glossary entry from Probe 1) or absorbed into the request handler that uses it. Defer to Probe 5. + +**Doc-string bugs (code-quality, not architectural).** `SipiIOJ2k.h` says *"JPEG 2000 files using libtiff"* (wrong: Kakadu). `SipiIOJpeg.h` says *"JPEG 2000 files using libtiff"* (wrong on both counts). `SipiIOPng.h` says *"libtiff makes extensive use of `lseek`"* (wrong library). Fix in a one-PR cleanup; tracked as a Linear child issue. + +**6. Verdict.** Module-level **`deep`** — 5,461 lines of codec implementation behind ~315 lines of public headers, codec-specific knowledge cleanly hidden, one-way dependency on `metadata/` + `iiif_parser/`. The base class header is wide but bounded; doesn't undermine the verdict. + +**7. Action.** + +a. **Promote to Bazel package** `//src/format_handlers:format_handlers` per ADR-0003 + Probe 1 follow-up rename. Add per-handler unit tests during promotion (ADR-0003 consequence list). + +b. **Move headers**: `include/SipiIO.h` → `src/format_handlers/sipi_io.h`. `include/formats/SipiIO*.h` + `src/formats/SipiIO*.cpp` → co-located in `src/format_handlers/sipi_io_*.{h,cpp}`. + +c. **Forward-declare `SipiImage`** in headers; full include only in `.cpp`s. Eliminates `../../src/` cross-tree includes. + +d. **Implement ADR-0004** inside the existing `getDim()` virtual, then rename `getDim` → `read_shape` for self-documentation. + +e. **Modernize the API per ADR-0006** — five changes, staged one PR per format handler after package promotion. + +f. **Relocate `read_watermark`** — defer to Probe 5 / `Watermark` glossary entry. + +g. **Defer**: `SipiImage` heavy-include + friend-class coupling (Probe 4); magic `"HTTP"` sentinel resolution (Probe 5). + +**8. Glossary delta.** Five adds + four sharpenings + one rename — see [Glossary delta register](#glossary-delta-register). The most consequential change is splitting the earlier `Master file` / `Master format` rows into `Service master` / `Service master format` (in-server-path) plus new `Archival master` / `Archival master format` (out-of-server-path; SIPI CLI produces, OAIS archive stores). + +**9. Open questions for later probes.** + +- `read_watermark` final placement — Probe 5 + `Watermark` glossary entry. +- `ScalingQuality` placement: it's in the format-handler API but conceptually a rendering concern — Probe 4 (`SipiImage`) may relocate. +- Magic `"HTTP"` filename sentinel: how should "write to HTTP response stream" be modelled once `SipiHttpServer` is decomposed? Probe 5. +- Archival-master workflow: which actor performs the archival → service-master conversion, and where does that code live? Surfaces only when Probe N hits it. +- The `SipiImage` ↔ `SipiIcc` friend coupling (Probe 2 finding) — does it survive the format-handler refactor? Probe 4. + + + +## Glossary delta register + +Pending edits to [`UBIQUITOUS_LANGUAGE.md`](../UBIQUITOUS_LANGUAGE.md), accumulated from probe side-effects. Applied in a single editing pass at the end (or at natural batching points), so the glossary changes once with full context, not once per probe. + +| Term | Action | Source probe | Note | +| --- | --- | --- | --- | +| **Image shape** | add | Probe 1 | Intrinsic shape of a source image: `(img_w, img_h, tile_w, tile_h, clevels, numpages, nc, bps)`. Read by a format handler from the master file. Replaces the parasitic `SipiCache::SizeRecord`. Per ADR-0004. | +| **Operating mode** | add (umbrella) | Probe 1 | Two sub-terms — `Server mode` and `CLI mode`. The asymmetry between which format handler dominates read vs. write is architecturally load-bearing for the service-master-format fast path. | +| **Server mode** | add | Probe 1, refined Probe 3 | Long-running HTTP server. Reads service masters in service master format from `image root`; writes IIIF representations to the cache. The hot path for service-master-format shape reads. | +| **CLI mode** | add | Probe 1, refined Probe 3 | One-shot invocation. **Today**: reads arbitrary source format; writes a service master in service master format with full Essentials packet. **Future direction**: writes archival masters for OAIS-compliant external archive storage; the archival → service-master conversion is a separate workflow step (actor TBD; surfaces only when a later probe hits it). | +| **Service master** | add | Probe 1 (renamed from `Master file`, Probe 3) | The on-disk file under `image root` that SIPI's server reads to fulfill IIIF requests. Codebase variables `infile` / `origpath`. Currently produced by SIPI CLI mode; future workflow has it derived from an `archival master` by a separate conversion step. | +| **Service master format** | add | Probe 1 (renamed from `Master format`, Probe 3) | The format of service masters. Optimized for fast random-access IIIF serving. Currently JP2; pyramidal TIFF is the planned successor. | +| **Archival master** | add | Probe 3 | The format-stable, lossless preservation copy of an image, stored in the OAIS-compliant external archive. Distinct from a `service master` by *purpose*: archival masters prioritize preservation properties (format stability, lossless or near-lossless compression, no service-side optimizations); service masters prioritize fast random access. SIPI CLI mode produces archival masters in the future workflow; SIPI server mode does not read them. | +| **Archival master format** | add | Probe 3 | The format of archival masters. Per user direction, plain (non-pyramidal) TIFF for format-stability reasons — pyramids are a service-side optimization, not a preservation property, and archival policy explicitly rejects them. | +| **Pyramidal TIFF** | add (Probe 1), refined Probe 3 | Probe 1, refined Probe 3 | Multi-resolution TIFF variant storing the same image at multiple decode levels in a single file. Supports efficient decode-level selection without full-resolution decoding. **A service-master-format only** — pyramids are a service-side optimization rejected by the archival master format. Planned successor to JP2 as the sole service master format. | +| **Cache pin** | add (provisional) | Probe 1 | Per-file in-use refcount that prevents eviction while a representation is being served. Currently `SipiCache::blocked_files` + `check(block_file=true)` + `deblock`; refactor to RAII `BlockedScope`. Confirm name during `SipiHttpServer` probe. | +| **Essentials packet** | sharpen | Probe 1 | Existing definition lists original filename, mimetype, hash type, pixel checksum, optional ICC. Extend schema with **image-shape fields** so server-mode shape lookup can read them at known offset rather than parsing the codestream / TIFF tags. Per ADR-0004. The schema is format-agnostic; the embedding mechanism (JP2 box vs TIFF tag) is handler-specific. | +| **Route handler** | sharpen → umbrella | naming discussion (Probe 1 follow-up) | Promote to umbrella term: URL-pattern-bound request logic. Two sub-types depending on implementation language (`C++ route handler`, `Lua route handler`). Resolves the term overload between the existing Lua-only definition and the planned `route_handlers/` C++ directory. | +| **C++ route handler** | add | naming discussion (Probe 1 follow-up) | A `shttps::RequestHandler` callback registered at server startup. Examples: `iiif_handler`, `file_handler`. Compiled in. Lives in `src/route_handlers/` post-Probe 5 refactor. Note: `shttps::RequestHandler` remains the framework type; *C++ route handler* is the SIPI-side domain term for instances of it. | +| **Lua route handler** | add | naming discussion (Probe 1 follow-up) | A Lua script bound to a URL pattern, loaded dynamically. Examples: upload, admin endpoints in `scripts/`. (Existing "Route handler" glossary entry redirects here.) | +| **Format handler** | sharpen | Probe 1 follow-up | Existing definition is correct; note the directory rename `formats/` → `format_handlers/` for self-documentation, matching the glossary term directly. | +| **Essentials packet** | sharpen further | Probe 2 | Current wire format is pipe-delimited text without escaping or schema versioning — brittle (any `\|` in `origname` corrupts parse) and not forward-evolvable. Per ADR-0005 the wire format migrates to versioned CBOR; the in-memory schema is the same. ADR-0004's image-shape additions land in the new wire format. | +| **Test seam** | add | Probe 2 | A header deliberately kept in a module's `internal/` subdirectory with `visibility` restricted to that module + that module's tests. Used to expose pure helpers for explicit testing without broadening production coupling. Canonical example: `metadata/internal/icc_normalization.h` (formerly `SipiIccDetail.h`). The pattern replaces comment-as-policy ("No production code outside X should include this header") with a build-graph invariant. | +| **Codec** | sharpen | Probe 3 | Existing definition is correct. Sharpen with the canonical four-codec list: Kakadu (JP2), libtiff (TIFF), libpng (PNG), libjpeg (JPEG). Note `webp` is in the project's external-deps set but no `SipiIOWebp` class exists today. | +| **`read_shape`** (Format handler API) | rename | Probe 3 | The existing `SipiIO::getDim(filepath) → SipiImgInfo` virtual is renamed to `read_shape` for self-documentation: `getDim` implies just dimensions, but the return type carries full image shape (dimensions + tile + clevels + numpages + nc + bps + orientation + mimetype + sub-image resolutions). Per ADR-0004 (amended). | +| **Object storage** | add | Probe 3 follow-up | The production access model for service masters in the 3-6 month direction: SIPI server reads service masters via S3 range GETs over HTTP. Today's transitional state is NFS-mounted ZFS spinning disk (still network-accessed; round-trip costs already exist). Local-filesystem `image root` becomes the dev/test scenario only. The `cache/` module stays local in both states (performance optimization; cached representations on the hot path can't pay remote-access cost). | +| **Range GET** | add | Probe 3 follow-up | An HTTP `GET` on an S3 object with a `Range:` header bounding the byte range to fetch. The unit of S3 access. Each range GET is a network round-trip (~1-10ms typical); minimizing them is the load-bearing perf goal once SIPI moves to S3. Per ADR-0004: pre-decode reads aim for *one* range GET to fetch the Essentials packet (with shape + file-structure offsets), then *one* targeted range GET for the data SIPI actually needs. | +| **Essentials packet** | sharpen further | Probe 3 | Per ADR-0004 (expanded scope): the packet's role broadens from "shape cache" to "shape + S3-access file-structure index." The load-bearing rationale shifts to remote storage (S3 in 3-6 months, NFS today). Contents: image shape (8 fields), per-format file-structure offsets (TIFF: per-IFD offset/size; JP2: codestream + per-resolution offsets), ICC profile, original filename / mimetype / hash / pixel checksum. Wire format CBOR (ADR-0005). Position: a known fixed prefix offset in the file (TIFF tag in the first IFD; JP2 UUID box near the start) so SIPI can fetch with one range GET of the prefix. | +| **Image processing** | add | Probe 4 | Umbrella term for the free-function module (`src/image_processing/`) over `const Image&`: crop, scale, rotate, colour conversion, channel ops, bit-depth reduction, dithering, watermark application, comparison, arithmetic. Replaces the ~12 image-processing methods on the `SipiImage` god-object per ADR-0007. Free-function-over-value-type maps cleanly to Rust traits in the eventual port. | +| **Image** | sharpen | Probe 4 | The domain-level term ("a pixel-bearing artefact processed through the IIIF pipeline") stays correct conceptually. The *code-level* `Image` class becomes a narrow value type post-refactor (geometry + photometric + RAII pixel buffer + metadata composite; ~15 public methods) per ADR-0007. Image-processing behaviour moves to free functions in `image_processing/`. | +| **Tee sink** | add (provisional) | Probe 4 | Composition primitive in the `OutputSink` variant per ADR-0006: `TeeSink { std::vector outputs; }` broadcasts each output chunk to multiple sub-sinks. Preserves SIPI's existing dual-write optimization (encoder writes simultaneously to HTTP socket + cache file). Generalises to write-through to S3 / other sinks. Provisional naming — confirm during DEV-6382 implementation. | + +## Candidate gaps already spotted (pre-probe) + +Concepts I expect we'll need to add or sharpen, surfaced from re-reading existing materials before any code archaeology. These are *predictions* — confirm or discard during probes. + +- **Watermark** — appears as a sub-field of `Permission`, never elevated. Has cache-key implications. +- **Page** — embedded inside `Identifier`'s definition ("page ordinal for multi-page resources"). Multi-page TIFF / PDF handling is likely its own concern. +- **Path resolution / sandboxing** — "with traversal validation" appears inside `Image root`'s definition. The validation is a real act, likely a deep module. +- **Configuration** — `Config` not a domain term; `SipiConf` exists in code. +- **Conversion** — the project overview's verb of record ("efficient image format conversions while preserving metadata"); not in the glossary. +- **Mime detection** — libmagic is a dep; mime-typing has security implications. Unnamed. +- **Sentry / error reporting** — operational; mentioned in code, absent from language. +- **Tile** — IIIF tiles are a derived form of `Region` with strong cache implications. Unnamed. +- **CLI mode vs server mode** — the binary has two operating modes; in project overview, not in glossary. + +## Cross-references + +- [`UBIQUITOUS_LANGUAGE.md`](../UBIQUITOUS_LANGUAGE.md) — canonical SIPI glossary. The probe input. +- [`CONTEXT-MAP.md`](../CONTEXT-MAP.md) — SIPI ↔ shttps boundary. Bounds the scope of this exercise to SIPI-side modules; shttps is treated as one external module. +- [`CONTEXT.md`](../CONTEXT.md) — SIPI-side seam types. Names the four primary seam types (`Server`, `Connection`, `RequestHandler`, `LuaServer`). +- [ADR-0001 — shttps as strangler-fig target](./adr/0001-shttps-as-strangler-fig-target.md) — long-term direction; constrains how we reshape SIPI ↔ shttps seams. +- [ADR-0003 — Module-co-located source and tests](./adr/0003-module-co-located-source-and-tests.md) — defines the unit ("Bazel package"), the file layout, and the `--strict_deps` enforcement model. Read first. +- [`shttps/CONTEXT.md`](../shttps/CONTEXT.md) — out of scope for this exercise *except* to confirm that the SIPI-side modules respect the documented seam. + +## Resume protocol + +To continue this exercise in a later session: + +1. Read this file end-to-end. Then re-read [`UBIQUITOUS_LANGUAGE.md`](../UBIQUITOUS_LANGUAGE.md) and [ADR-0003](./adr/0003-module-co-located-source-and-tests.md). That is the minimum cold-start context. +2. Find the next un-probed item in [Probe order](#probe-order). If a probe is partially complete, its row in the [Probe register](#probe-register) will say so. +3. Run the probe by reading the relevant headers and `.cpp` files, then fill in the 8 template fields. Append `## Probe N — ` between the `` / `` markers. +4. If the probe surfaces a glossary gap, add a row to the [Glossary delta register](#glossary-delta-register) (do *not* edit `UBIQUITOUS_LANGUAGE.md` in the same pass — batch language edits). +5. If the probe surfaces a decision that meets the ADR bar (hard to reverse, surprising without context, real trade-off), draft the ADR under `docs/adr/`. The probe row should reference it. +6. Continue until [Probe order](#probe-order) is exhausted, then run the closing directory sweep. +7. Apply the [Glossary delta register](#glossary-delta-register) to `UBIQUITOUS_LANGUAGE.md` in one batched edit. +8. The output of the exercise is: the populated probe register, the patched glossary, and a short list of new ADRs. The Bazel migration plan referenced in ADR-0003 (Y+8a..Y+8e) consumes this output to choose its per-PR scope. + +## Method invariants (don't drift on these across sessions) + +- **Bazel package = the unit of module.** Don't slip into class-level analysis except as a follow-up *inside* a package. +- **Probe-and-extend, not sweep-first.** Don't start cataloguing concepts that haven't surfaced from a probe. +- **Append, don't rewrite.** Probe rows are history. If a verdict revises, add a `**Revised:**` line; don't edit the original. +- **Batch glossary edits.** Add rows to the delta register per probe; apply to `UBIQUITOUS_LANGUAGE.md` in one editing pass. +- **ADRs are sparing.** Only when all three of (hard to reverse, surprising without context, real trade-off) hold. +- **Bound by [ADR-0001](./adr/0001-shttps-as-strangler-fig-target.md).** Reshaping SIPI ↔ shttps seams is bounded by the strangler-fig direction. SIPI-side modules can absorb work re-homed *from* shttps (see `CONTEXT.md` "Re-homing schedule"), but new SIPI → shttps coupling is out of scope. +- **Aligned with the Bazel migration's Y+8 tracker ([DEV-6353](https://linear.app/dasch/issue/DEV-6353)).** The per-module Bazel-package promotions surfaced by these probes (one per module: `cache/`, `metadata/`, `format_handlers/`, `iiif_parser/`, …) are exactly Y+8a..Y+8e in the migration plan — five sequential mechanical PRs, gated on Y+7 cleanup ([DEV-6349](https://linear.app/dasch/issue/DEV-6349)) merging. Probe outcomes that promote existing directories to Bazel packages should be considered Y+8 work and dependency-modelled against DEV-6353. Probe outcomes that *create new packages* (e.g. `image/` and `image_processing/` from Probe 4) are also Y+8-positioned but extend the plan's currently enumerated scope; they may land as additional sub-PRs (Y+8f, etc.) when DEV-6353 is decomposed (per its note: "Decompose when DEV-6349 is in QA"). +- **Module directory naming.** `snake_case` for compound words (`iiif_parser/`, `format_handlers/`, `route_handlers/`, `image_shape/`); single word otherwise (`cache/`, `metadata/`, `backpressure/`). Plural for collections of sibling types (`format_handlers/` — four format-handler classes; `route_handlers/` — multiple route-callback functions); singular for topics, mass nouns, or single concepts (`cache/`, `metadata/`, `backpressure/`, `iiif_parser/`). The `name` answers *what kind of thing this directory is about*, not *how many things are inside*. Aligns with Rust target, Google C++ Style Guide, Abseil / Chromium / BDE conventions. The `.cpp` / `.h` *file*-naming convention (PascalCase `SipiCache.cpp` vs snake_case `cache.cpp` vs BDE-style `sipi_cache.cpp`) is a separate, deferred decision — out of scope for the deep-modules exercise; will get its own ADR if and when it lands. +- **Disambiguate overloaded terms with umbrella + sub-types.** When a glossary term naturally covers multiple variants (different implementation language, different layer, different lifecycle), promote it to an umbrella in `UBIQUITOUS_LANGUAGE.md` and define each variant as a sub-type. Example: `Route handler` umbrella with `C++ route handler` + `Lua route handler` sub-types. This is preferred over silent overload (the source of `CONTEXT.md`'s mid-paragraph clarifications about `RequestHandler` vs `Route handler` and the two `file_handler`s). +- **Rust-aligned, transitional C++.** SIPI's C++ codebase is transitional ahead of the strangler-fig migration to Rust ([ADR-0001](./adr/0001-shttps-as-strangler-fig-target.md)). When choosing between a more-ergonomic C++ pattern that won't survive the Rust port and a less-ergonomic one that will, prefer the latter. Examples: `std::expected` over `absl::StatusOr` (the former maps directly to Rust's `Result`; the latter implies adopting Abseil, a C++-only commitment); `std::variant` over inheritance hierarchies (maps to Rust enums / sum types); RAII + `unique_ptr` over exception-based ownership (maps to Rust's move semantics). Cosmetic ergonomic gaps (e.g. `std::expected`'s lack of a `?` operator) are addressed by small SIPI-local helpers (e.g. a `SIPI_TRY` macro), not by adopting upstream libraries that don't outlive the C++ codebase. +- **Remote-access discipline.** Service masters are accessed remotely — NFS-mounted ZFS today, S3 in 3-6 months. Format-handler implementations and pre-decode logic minimize I/O operations: ideally one fixed-offset prefix read to fetch the Essentials packet (shape + file-structure offsets), then one targeted read for the data needed. Walking IFD chains, parsing box hierarchies one box at a time, or doing repeated small reads to discover offsets are anti-patterns — each is a network round trip. Local cache stays local (performance layer); only service-master-source reads pay remote-access cost. Per ADR-0004's expanded scope.