This document turns the meeting-derived architecture into an implementation sequence. It is not a project-management document. It exists to answer one question:
What should be built first so the pipeline reaches a coherent, testable shape quickly?
Deliverables:
- choose the standardized collated convergence format
- decide the exact v1 processed-data entry scope
- document where v1 stops downstream
Why first:
- this unblocks branching and channel design
- without this, analysis and processed-entry work will drift
Deliverables:
- tall raw-image samplesheet schema
- reusable metadata validation
- task-level `load_data.csv` generation
- working illumination correction
- working assay-development branch
- working analysis branch
Why second:
- this is the canonical path
- the processed-data entrypoint should reuse the downstream half of this path rather than invent its own
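A minimal sketch of how the tall samplesheet, the reusable validation, and `load_data.csv` generation could fit together. The samplesheet column names (`plate`, `well`, `site`, `channel`, `path`) and the `Orig` image-name prefix are assumptions for illustration, not a decided schema. The `FileName_`, `PathName_`, and `Metadata_` column prefixes follow the CellProfiler LoadData convention.

```python
import csv
import io
from pathlib import PurePosixPath

# Hypothetical tall-samplesheet columns: one row per image file.
REQUIRED = ("plate", "well", "site", "channel", "path")


def validate_rows(rows):
    """Return a list of error strings; an empty list means the sheet is usable."""
    errors = []
    seen = set()
    for i, row in enumerate(rows, start=1):
        missing = [k for k in REQUIRED if not row.get(k)]
        if missing:
            errors.append(f"row {i}: missing {', '.join(missing)}")
            continue
        key = (row["plate"], row["well"], row["site"], row["channel"])
        if key in seen:
            errors.append(f"row {i}: duplicate image for {key}")
        seen.add(key)
    return errors


def to_load_data_csv(rows):
    """Pivot tall rows (one per image) into wide rows (one per site)."""
    sites = {}
    channels = set()
    for row in rows:
        key = (row["plate"], row["well"], row["site"])
        wide = sites.setdefault(key, {
            "Metadata_Plate": row["plate"],
            "Metadata_Well": row["well"],
            "Metadata_Site": row["site"],
        })
        image_path = PurePosixPath(row["path"])
        wide[f"FileName_Orig{row['channel']}"] = image_path.name
        wide[f"PathName_Orig{row['channel']}"] = str(image_path.parent)
        channels.add(row["channel"])

    # Sorted channels and sorted site keys keep the output deterministic.
    fields = ["Metadata_Plate", "Metadata_Well", "Metadata_Site"]
    for channel in sorted(channels):
        fields += [f"FileName_Orig{channel}", f"PathName_Orig{channel}"]

    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    for key in sorted(sites):
        writer.writerow(sites[key])
    return out.getvalue()
```

Keeping validation and CSV generation as two small functions means every stage can call the same pair instead of writing its own metadata-to-CSV logic.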
Deliverables:
- explicit stop after assay-development outputs
- parameterized resume/continue behavior
- documented QC expectations
Why now:
- the meetings repeatedly point to this as a high-value usability pattern
- it reduces wasted compute and fits real cell-painting workflows
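The stop and resume behavior can be reasoned about as a slice over an ordered stage list. The sketch below is illustrative only: the stage names, their order, and the `stop_after` / `resume_from` parameter names are assumptions, not settled pipeline parameters.

```python
# Assumed stage order; the real workflow may name or split these differently.
STAGES = ("illumination", "assay_development", "analysis", "collation")


def plan_stages(stop_after=None, resume_from=None):
    """Return the stages to run for a given stop/resume request."""
    for name in (stop_after, resume_from):
        if name is not None and name not in STAGES:
            raise ValueError(f"unknown stage: {name}")
    start = STAGES.index(resume_from) if resume_from else 0
    end = STAGES.index(stop_after) if stop_after else len(STAGES) - 1
    if start > end:
        raise ValueError("resume_from must not come after stop_after")
    return list(STAGES[start:end + 1])
```

A first run with `plan_stages(stop_after="assay_development")` would halt for QC review, and a second run with `plan_stages(resume_from="analysis")` would continue from there.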
Deliverables:
- deterministic `Cytotable` handoff from analysis outputs
- stable collated output naming
- tests asserting the collated contract
Why now:
- this is the dependency boundary for both entrypoints
- it lets the repo reach a coherent MVP even if later downstream work is deferred
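One way to make the collated contract testable is to pin down the output name and the minimum column set in code. This is a sketch under stated assumptions: the `<batch>__<plate>.collated.<ext>` naming pattern and the required column list are hypothetical, and the file extension is left as a parameter because the collated format has not been chosen yet.

```python
import re

# Hypothetical minimum columns every collated artifact must carry.
REQUIRED_COLUMNS = ("Metadata_Plate", "Metadata_Well")


def collated_name(batch, plate, ext):
    """Build a stable output name from batch and plate identifiers."""
    parts = [re.sub(r"[^A-Za-z0-9._-]+", "_", p.strip()) for p in (batch, plate)]
    if not all(parts):
        raise ValueError("batch and plate must be non-empty")
    return f"{parts[0]}__{parts[1]}.collated.{ext.lstrip('.')}"


def missing_columns(columns):
    """Return the required columns that are absent, in contract order."""
    present = set(columns)
    return [c for c in REQUIRED_COLUMNS if c not in present]
```

A contract test then reduces to asserting that `missing_columns(...)` is empty for every emitted artifact and that its filename equals `collated_name(...)` for the same inputs.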
Deliverables:
- manifest/schema for processed inputs
- format discrimination or explicit format parameter
- conversion path into the standardized collated representation
- shared downstream path after normalization
Why after raw path:
- the processed entrypoint should target an existing stable internal representation
- otherwise both entrypoints will evolve at once and become harder to test
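The choice between format discrimination and an explicit format parameter does not have to be exclusive. The sketch below lets an explicit value win and falls back to the file suffix otherwise. The supported formats and suffix mapping are placeholders, since the v1 processed-data scope is still to be decided.

```python
from pathlib import PurePosixPath

# Placeholder mapping; the real v1 scope may support fewer or different forms.
SUFFIX_TO_FORMAT = {".sqlite": "sqlite", ".csv": "csv", ".parquet": "parquet"}


def resolve_format(path, explicit=None):
    """Return the processed-input format, preferring an explicit parameter."""
    if explicit is not None:
        if explicit not in SUFFIX_TO_FORMAT.values():
            raise ValueError(f"unsupported format: {explicit}")
        return explicit
    suffix = PurePosixPath(path).suffix.lower()
    if suffix not in SUFFIX_TO_FORMAT:
        raise ValueError(f"cannot infer format from {path!r}; pass it explicitly")
    return SUFFIX_TO_FORMAT[suffix]
```

Failing loudly on an unknown suffix keeps unsupported historical forms out of the normalization path rather than guessing at them.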
Deliverables:
- decide v1 endpoint: `Cytotable`, `Cytotable + QC`, or `pycytominer`
- if included, implement downstream profile-processing subworkflows against the collated contract
Why last:
- meetings indicate this is still moving
- the repo needs a clean data boundary before layering more downstream logic
- main workflow should branch only at the top-level entrypoint
- downstream of collation, both paths should share code
- modules should accept strongly typed metadata maps plus staged paths
- avoid large implicit directory inputs where a manifest-driven input is possible
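As an illustration of "strongly typed metadata maps plus staged paths", a module input can be modelled as a validated metadata record paired with an explicit list of files. The field names below are assumptions chosen for the example, not the pipeline's actual metadata keys.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Tuple


@dataclass(frozen=True)
class ImageMeta:
    """Hypothetical metadata record; fields are validated on construction."""
    batch: str
    plate: str
    well: str
    site: int

    def __post_init__(self):
        if not (self.batch and self.plate and self.well):
            raise ValueError("batch, plate, and well must be non-empty")
        if not isinstance(self.site, int) or self.site < 1:
            raise ValueError("site must be a positive integer")


@dataclass(frozen=True)
class StagedInput:
    """Metadata plus the exact files a module receives, never a whole directory."""
    meta: ImageMeta
    paths: Tuple[PurePosixPath, ...]

    def __post_init__(self):
        if not self.paths:
            raise ValueError("at least one staged path is required")
```

Because the paths are listed explicitly, a manifest can be checked against what a module actually received, which is not possible with an implicit directory input.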
The repo should carry both:
- spec tests for the intended grouped interfaces and emitted artifacts
- regression tests for bugs found while wiring `load_data.csv`, staging, and grouping
Expected test focus:
- raw-image entrypoint
- processed-data entrypoint
- illumination grouping
- assay-development grouping
- analysis grouping
- collated convergence artifact
- resume/QC stop behavior where practical
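The grouping items in this list lend themselves to small spec tests once the grouping keys are written down. The sketch below is hypothetical: the per-stage keys are assumptions made for illustration, and the real grouping rules may differ.

```python
from collections import defaultdict

# Assumed grouping keys per stage; adjust once the interfaces are settled.
GROUP_KEYS = {
    "illumination": ("batch", "plate"),
    "assay_development": ("batch", "plate", "well"),
    "analysis": ("batch", "plate", "well", "site"),
}


def group_images(images, stage):
    """Group image paths by the metadata keys assumed for the given stage."""
    keys = GROUP_KEYS[stage]
    groups = defaultdict(list)
    for image in images:
        groups[tuple(image[k] for k in keys)].append(image["path"])
    # Sort keys and members so the grouping is reproducible across runs.
    return {key: sorted(paths) for key, paths in sorted(groups.items())}
```

A spec test would feed a fixed image list through `group_images` for each stage and assert the exact group keys and members.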
The public docs should explain:
- what inputs are accepted
- which workflow stage each entrypoint skips
- what standardized format the workflow converges on
- what parts of the workflow are intentionally deferred
If v1 accepts every historical processed form directly, the normalization logic will dominate the implementation.
If each stage invents its own metadata-to-CSV logic, the pipeline will become hard to reason about and harder to test.
If the workflow is built as a straight-through pipeline first, retrofitting human QC gates later will be awkward.
Several meetings called out snapshot and dev-environment churn. The implementation order should keep tests passing while the interfaces settle.
An MVP is reached when all of the following are true:
- raw-image entrypoint works end-to-end
- assay-development outputs can be reviewed before continuation
- analysis outputs collate into the standardized format
- processed-data entrypoint can enter at one clearly documented supported point
- both entrypoints are covered by nf-test
- docs describe supported contracts without ambiguity
- richer processed-data auto-detection
- additional segmentation backends
- broader custom-channel logic
- broader pooled-workflow reuse
- heavier downstream visualization/profile-analysis features