feat: add pathogen.json schema validation and defect detection to rebuild pipeline #413
Merged
ivan-aksamentov merged 13 commits into master on Mar 4, 2026
- Add type checking before dict traversal to catch malformed JSON early
- Raise TypeError with actionable path info instead of cryptic AttributeError
- Protect both dict_get and dict_set with shared _assert_dict helper
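The type check described above can be sketched as follows. This is a minimal illustration, not the code in `scripts/lib/container.py`: the signatures of `dict_get`, `dict_set` and `_assert_dict`, and the wording of the error message, are assumptions.

```python
from typing import Any, Sequence


def _assert_dict(obj: Any, path: Sequence[str], depth: int) -> None:
    """Raise TypeError naming the path prefix at which a non-dict was found."""
    if not isinstance(obj, dict):
        where = ".".join(path[:depth]) or "<root>"
        full = ".".join(path)
        raise TypeError(
            f"expected an object at '{where}' while resolving '{full}', "
            f"found {type(obj).__name__}"
        )


def dict_get(obj: Any, path: Sequence[str], default: Any = None) -> Any:
    """Read a nested value; a missing key yields `default`, a non-dict raises."""
    cur = obj
    for depth, key in enumerate(path):
        _assert_dict(cur, path, depth)
        if key not in cur:
            return default
        cur = cur[key]
    return cur


def dict_set(obj: Any, path: Sequence[str], value: Any) -> None:
    """Write a nested value, creating intermediate dicts as needed."""
    cur = obj
    for depth, key in enumerate(path[:-1]):
        _assert_dict(cur, path, depth)
        cur = cur.setdefault(key, {})
    _assert_dict(cur, path, len(path) - 1)
    cur[path[-1]] = value
```

With this shape, traversing `["qc", "missingData", "scoreWeight"]` through a document where `qc.missingData` is a string fails with a message that names `qc.missingData`, rather than with an `AttributeError` about `str`.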
- Validate pathogen.json against official Nextclade schema before processing
- Use matching branch from nextstrain/nextclade if exists, else master
- Provides clear error messages with path and violation details
- Rename scripts to consistent `migrate_NNN_*` pattern
- Update imports from `lib.*` to `scripts.lib.*`
Allows specifying a local directory containing nextclade schema files instead of fetching from GitHub. Supports both absolute and relative paths.
Add targeted checks, with actionable warnings, for common data defects that the generic schema validator cannot catch:

- Misplaced mutation label maps (root level instead of mutLabels)
- Legacy v2 nucMutLabelMapReverse fields
- Nonexistent QC scoreWeight on missingData/mixedSites
- Renamed qc.frameShifts.ignoreFrameShifts
- Nonexistent qc.divergence rule
- Renamed geneOrderPreference
- Misplaced placementMaskRanges (belongs in tree.json)
- Misplaced alignmentParams.includeReference/includeNearestNodeInfo
- Typo alignmentParams.excessBandwith

Each warning explains why the field is wrong, what the impact is, and which migration script fixes it.
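One way to express such targeted checks is a table of known-bad JSON paths. The sketch below is illustrative only: `KNOWN_DEFECTS` and `find_known_defects` are hypothetical names, the table covers only some of the paths listed above, and the explanations are shortened paraphrases of that list.

```python
from typing import Any, List, Tuple

# (JSON path, short explanation). Paths are taken from the list above;
# the table and its wording are hypothetical.
KNOWN_DEFECTS: List[Tuple[Tuple[str, ...], str]] = [
    (("qc", "missingData", "scoreWeight"), "nonexistent QC field"),
    (("qc", "mixedSites", "scoreWeight"), "nonexistent QC field"),
    (("qc", "frameShifts", "ignoreFrameShifts"), "field was renamed"),
    (("qc", "divergence"), "nonexistent QC rule"),
    (("geneOrderPreference",), "field was renamed"),
    (("placementMaskRanges",), "misplaced, belongs in tree.json"),
    (("alignmentParams", "excessBandwith"), "typo in field name"),
]


def _has_path(doc: Any, path: Tuple[str, ...]) -> bool:
    """Return True if every key along `path` exists in nested dicts."""
    cur = doc
    for key in path:
        if not isinstance(cur, dict) or key not in cur:
            return False
        cur = cur[key]
    return True


def find_known_defects(pathogen: Any) -> List[str]:
    """Return one warning string per known defect present in the document."""
    return [
        f"{'.'.join(path)}: {why}"
        for path, why in KNOWN_DEFECTS
        if _has_path(pathogen, path)
    ]
```

A table keeps each check to one line and makes it straightforward to attach further fields, such as the impact and the migration script, to each entry.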
Move placementMaskRanges from pathogen.json to tree.json at .meta.extensions.nextclade.placement_mask_ranges where Nextclade reads it. Preserves tree.json formatting (compact or indented).
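A minimal sketch of such a migration, operating on JSON text rather than files. The function names are hypothetical, the `begin`/`end` shape of a range in the usage example is an assumption, and the indentation detection assumes space-indented JSON.

```python
import json
from typing import Optional, Tuple


def detect_indent(text: str) -> Optional[int]:
    """Return the indent width of space-indented JSON, or None if compact."""
    for line in text.splitlines()[1:]:
        if line.startswith(" ") and line.strip():
            return len(line) - len(line.lstrip(" "))
    return None


def dump_like(obj: object, original_text: str) -> str:
    """Serialize `obj` using the same style (compact or indented) as the original."""
    indent = detect_indent(original_text)
    if indent is None:
        return json.dumps(obj, separators=(",", ":"))
    return json.dumps(obj, indent=indent)


def move_placement_mask_ranges(pathogen_text: str, tree_text: str) -> Tuple[str, str]:
    """Move placementMaskRanges into .meta.extensions.nextclade of the tree."""
    pathogen = json.loads(pathogen_text)
    if "placementMaskRanges" not in pathogen:
        return pathogen_text, tree_text  # nothing to migrate
    tree = json.loads(tree_text)
    ranges = pathogen.pop("placementMaskRanges")
    nextclade = (
        tree.setdefault("meta", {})
        .setdefault("extensions", {})
        .setdefault("nextclade", {})
    )
    nextclade["placement_mask_ranges"] = ranges
    return dump_like(pathogen, pathogen_text), dump_like(tree, tree_text)
```

For example, a compact `pathogen.json` stays on one line after the field is removed, while a two-space-indented `tree.json` is written back with two-space indentation.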
…stream hints

- Add Severity enum (ERROR, WARNING, INFO) to classify defect impact
- Add Defect dataclass with structured fields: problem, impact, migration, upstream_fix
- Add DefectReport to track defects per dataset with upstream repo inference
- Print summary at end of rebuild grouped by severity with upstream hints
- Format messages as human-readable sentences instead of pipe-separated fields
- Deduplicate reports when validation runs multiple times per file
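The data model named above might look roughly like this. Only the names `Severity`, `Defect`, `DefectReport` and the four field names come from the commit message; the field types, the `message()` and `add()` methods, and the sentence wording are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Severity(Enum):
    ERROR = "error"
    WARNING = "warning"
    INFO = "info"


@dataclass(frozen=True)
class Defect:
    severity: Severity
    problem: str
    impact: str
    migration: Optional[str] = None     # script that fixes the defect, if any
    upstream_fix: Optional[str] = None  # hint for the upstream dataset repo

    def message(self) -> str:
        """Render the defect as sentences rather than pipe-separated fields."""
        parts = [self.problem, self.impact]
        if self.migration:
            parts.append(f"Run {self.migration} to fix it.")
        if self.upstream_fix:
            parts.append(f"Upstream: {self.upstream_fix}")
        return " ".join(parts)


@dataclass
class DefectReport:
    dataset: str
    defects: List[Defect] = field(default_factory=list)

    def add(self, defect: Defect) -> None:
        """Record a defect once, even if validation runs again for the same file."""
        if defect not in self.defects:
            self.defects.append(defect)
```

Making `Defect` a frozen dataclass gives it value equality, so deduplication reduces to a membership test.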
- Color severity levels (green INFO, yellow WARNING, red ERROR)
- Transform absolute paths to relative paths from project root
- Highlight paths in grey for visual separation
- Add line numbers to validation warnings pointing to JSON location
- Support GITHUB_ACTIONS env for CI annotation format with line numbers
- Respect NO_COLOR convention
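The output rules above can be sketched in a single formatting function. This is an assumption-laden illustration: `format_defect` and its parameters are hypothetical, the mapping of INFO to the `notice` workflow command is a guess, and paths are treated as POSIX paths located under the project root.

```python
import os
from pathlib import PurePosixPath
from typing import Mapping, Optional

ANSI = {"INFO": "\033[32m", "WARNING": "\033[33m", "ERROR": "\033[31m"}
GREY = "\033[90m"
RESET = "\033[0m"
# GitHub Actions workflow commands; INFO -> notice is an assumption.
GHA_COMMAND = {"INFO": "notice", "WARNING": "warning", "ERROR": "error"}


def format_defect(
    severity: str,
    file: str,
    line: int,
    message: str,
    project_root: str,
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Format one defect for CI annotations, plain text, or colored terminals."""
    env = os.environ if env is None else env
    # Raises ValueError if `file` is not located under `project_root`.
    rel = str(PurePosixPath(file).relative_to(project_root))
    if env.get("GITHUB_ACTIONS") == "true":
        return f"::{GHA_COMMAND[severity]} file={rel},line={line}::{message}"
    if env.get("NO_COLOR"):  # set and non-empty disables ANSI colors
        return f"{severity} {rel}:{line} {message}"
    return f"{ANSI[severity]}{severity}{RESET} {GREY}{rel}:{line}{RESET} {message}"
```

Passing `env` explicitly keeps the function easy to test; callers would normally leave it unset so that `os.environ` is used.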
Restructure print_defect_summary() to group reports by dataset rather than by severity level, showing per-file defect details with severity tags for easier triage.
- Skip re-validation of already-processed files
- Suppress generic unknown-property warnings for paths covered by defect checkers
- Remove module-level `_defect_reports` dict and accessor functions
- Add `ValidationContext` dataclass to hold validation state
- Pass context through call chain: main → process_one_collection → validate_pathogen_json
- Make `ctx` parameter required to prevent silent failures
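The refactoring from module-level state to an explicit context can be sketched as below. The function names follow the call chain named above, but their parameters, return values and bodies are placeholders, and the fields of `ValidationContext` are assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Optional


@dataclass
class ValidationContext:
    """Validation state that previously lived in a module-level dict."""
    schemas_dir: Optional[str] = None
    reports: Dict[str, List[str]] = field(default_factory=dict)


def validate_pathogen_json(path: str, ctx: ValidationContext) -> bool:
    """Return True if the file was validated now, False if it was skipped."""
    if path in ctx.reports:
        return False  # already processed during this rebuild
    ctx.reports[path] = []  # a real validator would append defect messages here
    return True


def process_one_collection(paths: Iterable[str], ctx: ValidationContext) -> int:
    """Validate every file in a collection; return how many were newly validated."""
    return sum(validate_pathogen_json(p, ctx) for p in paths)


def main(collections: List[List[str]]) -> ValidationContext:
    ctx = ValidationContext()  # created once, then passed down explicitly
    for paths in collections:
        process_one_collection(paths, ctx)
    return ctx
```

Because `ctx` has no default value, a call site that forgets to pass it fails immediately with a `TypeError` instead of silently validating against fresh, empty state.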
- Build schema index mapping property names to valid JSON paths
- Suggest corrections for typos using RapidFuzz (80% threshold)
- Detect misplaced properties by finding valid locations elsewhere
- Resolve $ref references during schema traversal for accurate paths
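The indexing and suggestion steps above can be sketched with the standard library. Note the substitution: this example uses `difflib.get_close_matches` with a 0.8 cutoff as a stand-in for RapidFuzz, whose scorers and 0 to 100 scale differ, so results will not match exactly. The function names and the schema fragment in the usage note are hypothetical, and only local `#/...` references are resolved.

```python
import difflib
from typing import Any, Dict, FrozenSet, List, Optional


def build_index(schema: Dict[str, Any]) -> Dict[str, List[str]]:
    """Map each property name to the dotted JSON paths where the schema allows it."""
    index: Dict[str, List[str]] = {}

    def resolve(node: Dict[str, Any]) -> Dict[str, Any]:
        # Follow local references such as "#/definitions/Name".
        while "$ref" in node:
            target: Any = schema
            for part in node["$ref"].split("/")[1:]:
                target = target[part]
            node = target
        return node

    def walk(node: Dict[str, Any], prefix: str, seen: FrozenSet[int]) -> None:
        node = resolve(node)
        if id(node) in seen:  # guard against recursive definitions
            return
        for name, child in node.get("properties", {}).items():
            path = f"{prefix}.{name}" if prefix else name
            index.setdefault(name, []).append(path)
            walk(child, path, seen | {id(node)})

    walk(schema, "", frozenset())
    return index


def suggest_typo_fix(name: str, index: Dict[str, List[str]]) -> Optional[str]:
    """Return the closest known property name, or None if nothing is close enough."""
    matches = difflib.get_close_matches(name, list(index), n=1, cutoff=0.8)
    return matches[0] if matches else None


def valid_locations(name: str, index: Dict[str, List[str]]) -> List[str]:
    """Return the paths where a known property name is valid (misplacement check)."""
    return index.get(name, [])
```

With a schema whose `alignmentParams` property refers to a definition containing `excessBandwidth`, the misspelled `excessBandwith` resolves to the correct name, and a property found at the wrong level can be reported together with its valid paths.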
This was referenced Mar 5, 2026
Add JSON schema validation for pathogen.json during dataset rebuild. Validate against the Nextclade schema (fetched from GitHub or loaded from a local directory via `--nextclade-schemas-dir`), detect known defects with severity, impact and migration guidance, and surface unknown properties with fuzzy matching for typos and misplacement. Output GitHub Actions annotations and a grouped defect summary at the end of each rebuild.

- `scripts/lib/schema.py` with schema validation, known-defect detection, fuzzy property matching, and CI annotation output
- `placementMaskRanges` migration
- `scripts/` to `migrations/` with consistent naming and updated imports
- `scripts/rebuild` with `--nextclade-schemas-dir` flag and `ValidationContext` threading
- `JsonPath` path-tracking wrapper and type validation to `dict_get`/`dict_set` in `scripts/lib/container.py`
- `NO_COLOR` support
- `jsonschema` and `rapidfuzz>=3.0` dependencies

Sibling PR in software: nextstrain/nextclade#1754