feat: add pathogen.json schema validation and defect detection to rebuild pipeline #413

Merged
ivan-aksamentov merged 13 commits into master from feat/dataset-validation
Mar 4, 2026

ivan-aksamentov commented Mar 4, 2026

Add JSON schema validation for pathogen.json during dataset rebuild. Validate against the Nextclade schema (fetched from GitHub or loaded from a local directory via --nextclade-schemas-dir), detect known defects with severity, impact and migration guidance, and surface unknown properties with fuzzy matching for typos and misplacement. Output GitHub Actions annotations and a grouped defect summary at the end of each rebuild.

  • Add scripts/lib/schema.py with schema validation, known-defect detection, fuzzy property matching, and CI annotation output
  • Add 9 new migration scripts (008-016) to fix defects: misplaced mutation labels, legacy v2 fields, renamed QC parameters, typos, and misplaced placementMaskRanges
  • Move existing migration scripts (000-007) from scripts/ to migrations/ with consistent naming and updated imports
  • Integrate validation into scripts/rebuild with --nextclade-schemas-dir flag and ValidationContext threading
  • Add JsonPath path-tracking wrapper and type validation to dict_get/dict_set in scripts/lib/container.py
  • Add colored logging with level-based colors, path abbreviation, and NO_COLOR support
  • Add jsonschema and rapidfuzz>=3.0 dependencies

Sibling PR in software: nextstrain/nextclade#1754

- Add type checking before dict traversal to catch malformed JSON early
- Raise TypeError with actionable path info instead of cryptic AttributeError
- Protect both dict_get and dict_set with shared _assert_dict helper
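The type guard described in the three bullets above can be sketched roughly as follows. This is a minimal illustration, not the code in scripts/lib/container.py: the signatures of `dict_get`, `dict_set` and `_assert_dict`, the list-of-keys path format, and the error wording are all assumptions.

```python
from typing import Any, Sequence


def _assert_dict(obj: Any, path: Sequence[str], depth: int) -> None:
    """Raise TypeError naming the JSON path at which a non-object was found."""
    if not isinstance(obj, dict):
        where = "." + ".".join(path[:depth]) if depth else "(root)"
        full = "." + ".".join(path)
        raise TypeError(
            f"Expected an object at {where} while accessing {full}, "
            f"but found {type(obj).__name__}"
        )


def dict_get(obj: Any, path: Sequence[str], default: Any = None) -> Any:
    """Read a nested value; return `default` if a key is missing."""
    for depth, key in enumerate(path):
        _assert_dict(obj, path, depth)  # fail early on malformed JSON
        if key not in obj:
            return default
        obj = obj[key]
    return obj


def dict_set(obj: Any, path: Sequence[str], value: Any) -> None:
    """Write a nested value, creating intermediate objects as needed."""
    if not path:
        raise ValueError("path must contain at least one key")
    for depth, key in enumerate(path[:-1]):
        _assert_dict(obj, path, depth)
        obj = obj.setdefault(key, {})
    _assert_dict(obj, path, len(path) - 1)
    obj[path[-1]] = value
```

With a guard like this, reading through a string where an object was expected reports the offending path (for example `.qc.frameShifts`) instead of failing with an `AttributeError` deep inside the traversal.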
- Validate pathogen.json against the official Nextclade schema before processing
- Use the matching branch from nextstrain/nextclade if it exists, else master
- Provide clear error messages with path and violation details
- Rename scripts to consistent `migrate_NNN_*` pattern
- Update imports from `lib.*` to `scripts.lib.*`

Allow specifying a local directory containing Nextclade schema files instead of fetching them from GitHub. Both absolute and relative paths are supported.

Add targeted checks, with actionable warnings, for common data defects
that the generic schema validator cannot catch:

- Misplaced mutation label maps (root level instead of mutLabels)
- Legacy v2 nucMutLabelMapReverse fields
- Nonexistent QC scoreWeight on missingData/mixedSites
- Renamed qc.frameShifts.ignoreFrameShifts
- Nonexistent qc.divergence rule
- Renamed geneOrderPreference
- Misplaced placementMaskRanges (belongs in tree.json)
- Misplaced alignmentParams.includeReference/includeNearestNodeInfo
- Typo alignmentParams.excessBandwith

Each warning explains why the field is wrong, what the impact is,
and which migration script fixes it.
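
A targeted check of this kind could look roughly like the sketch below. The table holds only a few of the paths listed above, the explanatory strings paraphrase that list, and the names `KNOWN_DEFECTS` and `find_known_defects` are hypothetical. The real checkers in scripts/lib/schema.py also report impact and the fixing migration, which are omitted here because this page does not state them per defect.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical subset of known defects: (JSON path, what is wrong with it).
KNOWN_DEFECTS: List[Tuple[Tuple[str, ...], str]] = [
    (("placementMaskRanges",), "misplaced, belongs in tree.json"),
    (("geneOrderPreference",), "property has been renamed"),
    (("qc", "divergence"), "no such QC rule exists"),
    (("alignmentParams", "excessBandwith"), "property name contains a typo"),
]


def find_known_defects(pathogen: Dict[str, Any]) -> List[str]:
    """Return one warning string per known-defect path present in the data."""
    warnings = []
    for path, problem in KNOWN_DEFECTS:
        node: Any = pathogen
        for key in path:
            if not isinstance(node, dict) or key not in node:
                break  # path is absent, so this defect does not apply
            node = node[key]
        else:
            # Loop finished without break: every key of the path exists.
            warnings.append("." + ".".join(path) + ": " + problem)
    return warnings
```

Unlike a generic schema error, each entry carries a hand-written explanation, which is what makes the warning actionable.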

Move placementMaskRanges from pathogen.json to tree.json at
.meta.extensions.nextclade.placement_mask_ranges where Nextclade
reads it. Preserves tree.json formatting (compact or indented).
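
The move could be sketched as below. This is an illustration under assumptions, not the actual migration script: the function name is hypothetical, the files are passed as strings to keep the example self-contained, and the "contains a newline means indented" heuristic and the 2-space indent are guesses at how formatting might be preserved.

```python
import json
from typing import Tuple


def move_placement_mask_ranges(pathogen_text: str, tree_text: str) -> Tuple[str, str]:
    """Move placementMaskRanges into tree.json; return the new file contents."""
    pathogen = json.loads(pathogen_text)
    if "placementMaskRanges" not in pathogen:
        return pathogen_text, tree_text  # nothing to migrate

    tree = json.loads(tree_text)
    ranges = pathogen.pop("placementMaskRanges")

    # Target location: .meta.extensions.nextclade.placement_mask_ranges
    nextclade = (
        tree.setdefault("meta", {})
        .setdefault("extensions", {})
        .setdefault("nextclade", {})
    )
    nextclade["placement_mask_ranges"] = ranges

    # Assumed heuristic: a newline inside the text means the file was indented.
    if "\n" in tree_text.strip():
        tree_out = json.dumps(tree, indent=2)
    else:
        tree_out = json.dumps(tree, separators=(",", ":"))

    return json.dumps(pathogen, indent=2), tree_out
```

Keeping a compact tree.json compact matters mostly for diff size, since these files can be large.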
…stream hints

- Add Severity enum (ERROR, WARNING, INFO) to classify defect impact
- Add Defect dataclass with structured fields: problem, impact, migration, upstream_fix
- Add DefectReport to track defects per dataset with upstream repo inference
- Print summary at end of rebuild grouped by severity with upstream hints
- Format messages as human-readable sentences instead of pipe-separated fields
- Deduplicate reports when validation runs multiple times per file
- Color severity levels (green INFO, yellow WARNING, red ERROR)
- Transform absolute paths to relative paths from project root
- Highlight paths in grey for visual separation
- Add line numbers to validation warnings pointing to JSON location
- Support GITHUB_ACTIONS env for CI annotation format with line numbers
- Respect NO_COLOR convention
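
The structured defect record and the two output modes could be sketched like this. The field names `problem`, `impact`, `migration` and `upstream_fix` come from the list above. Everything else (the `line` field, the `format_defect` helper, the message layout, and mapping INFO to the `notice` command) is an assumption made for illustration. The `::warning file=...,line=...::message` form is the GitHub Actions workflow command syntax.

```python
import os
from dataclasses import dataclass
from enum import Enum
from typing import Mapping, Optional


class Severity(Enum):
    # Values double as GitHub Actions workflow command names (assumed mapping).
    ERROR = "error"
    WARNING = "warning"
    INFO = "notice"


@dataclass(frozen=True)
class Defect:
    severity: Severity
    problem: str
    impact: str
    migration: str
    upstream_fix: str
    line: int  # assumption: 1-based line of the offending key in the JSON file


_COLORS = {
    Severity.ERROR: "\033[31m",    # red
    Severity.WARNING: "\033[33m",  # yellow
    Severity.INFO: "\033[32m",     # green
}
_GREY = "\033[90m"
_RESET = "\033[0m"


def format_defect(defect: Defect, file: str,
                  env: Optional[Mapping[str, str]] = None) -> str:
    """Format one defect as a CI annotation or as a (possibly colored) log line."""
    env = os.environ if env is None else env
    message = f"{defect.problem} {defect.impact} Fix: {defect.migration}"
    if env.get("GITHUB_ACTIONS") == "true":
        return f"::{defect.severity.value} file={file},line={defect.line}::{message}"
    tag = defect.severity.name
    location = f"{file}:{defect.line}"
    if not env.get("NO_COLOR"):  # NO_COLOR set and non-empty disables color
        tag = f"{_COLORS[defect.severity]}{tag}{_RESET}"
        location = f"{_GREY}{location}{_RESET}"
    return f"{tag} {location} {message}"
```

Passing `env` explicitly keeps the formatter easy to test without touching the process environment.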

Restructure print_defect_summary() to group reports by dataset
rather than by severity level, showing per-file defect details
with severity tags for easier triage.
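
Grouping by dataset rather than by severity might look like the following sketch. It is not the real print_defect_summary(): the tuple-based input, the `format_defect_summary` name, and the output layout are hypothetical, chosen only to show per-file grouping with severity tags and deduplication.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

SEVERITY_ORDER = {"ERROR": 0, "WARNING": 1, "INFO": 2}


def format_defect_summary(reports: Iterable[Tuple[str, str, str]]) -> str:
    """Group (dataset_path, severity, message) triples by dataset."""
    grouped: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for dataset, severity, message in reports:
        entry = (severity, message)
        if entry not in grouped[dataset]:  # drop duplicates from repeated validation
            grouped[dataset].append(entry)

    lines = []
    for dataset in sorted(grouped):
        lines.append(dataset)
        # Most severe first; sorted() is stable, so ties keep report order.
        for severity, message in sorted(grouped[dataset],
                                        key=lambda e: SEVERITY_ORDER[e[0]]):
            lines.append(f"  [{severity}] {message}")
    return "\n".join(lines)
```

Listing each file once with all of its defects lets a maintainer fix one dataset at a time instead of cross-referencing severity sections.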
- Skip re-validation of already-processed files
- Suppress generic unknown-property warnings for paths covered by defect checkers
- Remove module-level `_defect_reports` dict and accessor functions
- Add `ValidationContext` dataclass to hold validation state
- Pass context through call chain: main → process_one_collection → validate_pathogen_json
- Make `ctx` parameter required to prevent silent failures
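
Replacing module-level state with an explicit context object can be sketched as follows. The names `ValidationContext`, `main`, `process_one_collection` and `validate_pathogen_json` come from the bullets above. Their signatures, the fields of the context, and the placeholder validation body are assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class ValidationContext:
    """Holds validation state for one rebuild run (fields are hypothetical)."""
    validated: Set[str] = field(default_factory=set)
    reports: Dict[str, List[str]] = field(default_factory=dict)


def validate_pathogen_json(path: str, ctx: ValidationContext) -> bool:
    """Validate one file; return False if it was already processed."""
    if path in ctx.validated:
        return False  # skip re-validation of already-processed files
    ctx.validated.add(path)
    ctx.reports.setdefault(path, [])  # real checks would append defects here
    return True


def process_one_collection(paths: List[str], ctx: ValidationContext) -> None:
    for path in paths:
        validate_pathogen_json(path, ctx)


def main(collections: List[List[str]]) -> ValidationContext:
    ctx = ValidationContext()  # created once, threaded through every call
    for paths in collections:
        process_one_collection(paths, ctx)
    return ctx
```

Because `ctx` has no default value, forgetting to pass it fails loudly with a `TypeError` at the call site rather than silently writing into hidden global state.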
- Build schema index mapping property names to valid JSON paths
- Suggest corrections for typos using RapidFuzz (80% threshold)
- Detect misplaced properties by finding valid locations elsewhere
- Resolve $ref references during schema traversal for accurate paths
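
The schema index and the suggestion logic could be sketched as below. Note the substitution: the PR uses RapidFuzz with an 80% threshold, whereas this sketch uses the standard library's `difflib.get_close_matches` with `cutoff=0.8` as a stand-in so that it runs without extra dependencies. The two similarity metrics are not identical. The function names, the index layout, and the toy schema shape are assumptions; only local `#/...` references are resolved, and cyclic references are not handled.

```python
import difflib
from typing import Any, Dict, List, Optional


def build_index(schema: Any, root: Any = None, prefix: str = "",
                index: Optional[Dict[str, List[str]]] = None) -> Dict[str, List[str]]:
    """Map each property name to the JSON paths where the schema allows it."""
    root = schema if root is None else root
    index = {} if index is None else index
    if not isinstance(schema, dict):
        return index  # boolean schemas carry no properties
    # Resolve local references such as "#/definitions/Foo" before descending.
    while "$ref" in schema:
        node = root
        for part in schema["$ref"].lstrip("#/").split("/"):
            node = node[part]
        schema = node
    for name, sub in schema.get("properties", {}).items():
        path = f"{prefix}.{name}"
        index.setdefault(name, []).append(path)
        build_index(sub, root, path, index)
    return index


def suggest(name: str, index: Dict[str, List[str]]) -> Optional[str]:
    """Explain an unknown property as misplaced, a likely typo, or neither."""
    if name in index:
        return f"misplaced: valid at {', '.join(index[name])}"
    close = difflib.get_close_matches(name, list(index), n=1, cutoff=0.8)
    if close:
        return f"possible typo of '{close[0]}' (valid at {index[close[0]][0]})"
    return None
```

For a schema that defines `excessBandwidth` under `alignmentParams` (the presumed correct spelling of the typo listed earlier), `suggest("excessBandwith", index)` would point at that property, while a name valid elsewhere in the schema is reported as misplaced.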
ivan-aksamentov deployed to refs/pull/413/merge with GitHub Actions on March 4, 2026 20:41 (Active)
ivan-aksamentov merged commit 1ff3273 into master on Mar 4, 2026
2 checks passed
ivan-aksamentov deleted the feat/dataset-validation branch on March 4, 2026 20:58