
UX consistency: multi-oracle + causal reports use enriched CHIP labels + shared percentile format#27

Open
lucapinello wants to merge 89 commits into main from fix/2026-04-20-v13-ux-consistency

Conversation

@lucapinello
Contributor

v13 docs/content consistency sweep caught two reports bleeding through raw AlphaGenome catalog assay_ids instead of the enriched display names every other chorus report uses. A user seeing CHIP:CEBPA:HepG2 in the SORT1 variant report would see CHIP_TF/EFO:0001187 TF ChIP-seq CEBPA genetically modified… (60-char raw assay_id) in the multi-oracle consensus matrix or the causal drill-down — exactly the polish issue that erodes trust.

What changed

MultiOracleReport (chorus/analysis/multi_oracle_report.py)

  • _consensus_rows now captures description alongside assay_id
  • MD + HTML consensus table prefers description over raw assay_id
  • Per-oracle drill-down table uses description, not <code>{assay_id}</code>
  • Percentile via shared _fmt_percentile (≥99th / near-zero instead of +100.0% / -100.0%) — matches every other chorus report
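
The shared formatter's behavior could be sketched as follows (illustrative only — the real `_fmt_percentile` signature in chorus may differ):

```python
def fmt_percentile(pct):
    """Shared percentile formatter (sketch of the behavior described
    above). Extremes collapse to labels so tables never show a
    misleading '+100.0%' / '-100.0%'."""
    if pct is None:
        return "—"                    # no background distribution
    if abs(pct) >= 0.99:
        return "≥99th"                # saturated percentile
    if abs(pct) < 0.01:
        return "near-zero"
    return f"{pct * 100:+.1f}%"
```

With this rule, `fmt_percentile(1.0)` renders as "≥99th" and mid-range values keep the signed-percent form.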

CausalResult HTML (chorus/analysis/causal.py)

  • "Strongest track" line shows enriched label as primary; raw assay_id demoted to secondary <code> tag only when different
  • Per-layer breakdown table renders description instead of <code>assay_id</code>
  • Percentile column uses _fmt_percentile
  • IGV track labels also route through description so the IGV panel reads "CHIP:CEBPA:HepG2 ref" / "CHIP:CEBPA:HepG2 alt" instead of the 60-char raw AlphaGenome catalog id
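
The labeling rule in the bullets above could be sketched like this (the helper name is hypothetical; only the description-primary / raw-id-secondary behavior comes from this PR):

```python
import html

def track_label(description, assay_id):
    """Enriched label rule: description is the primary label; the raw
    assay_id is demoted to a secondary <code> tag only when it differs
    from the description."""
    if not description:
        return f"<code>{html.escape(assay_id)}</code>"   # nothing enriched
    label = html.escape(description)
    if assay_id != description:
        label += f" <code>{html.escape(assay_id)}</code>"
    return label
```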

Before / After on committed SORT1 examples

| Check | Before | After |
| --- | --- | --- |
| Multi-oracle HTML: raw `CHIP_TF/EFO:0001187` mentions | 3 | 0 |
| Multi-oracle HTML: enriched `CHIP:CEBPA:HepG2` | 1 | 4 |
| Multi-oracle HTML: "+100.0%" percentile format | present | gone |
| Multi-oracle HTML: "≥99th" format | absent | 8 |
| Causal HTML: raw mentions | 30 | 3 (IGV-tracks-only, now routed through description) |
| Causal HTML: "+100.0%" | present | gone |
| Causal HTML: "≥99th"/"near-zero" | absent | 39 |

Regenerated the committed SORT1 causal + SORT1 multi-oracle examples so the repo reflects the new labels.

Test

tests/test_analysis.py::TestMultiOracleReport::test_uses_enriched_description_not_raw_assay_id — asserts both MD and HTML render "CHIP:CEBPA:HepG2" and not "TF ChIP-seq CEBPA genetically modified"; also asserts the HTML carries "≥99th" but no "+100.0%".

Verified

  • `pytest tests/ --ignore=test_smoke_predict.py -m "not integration"` — 329 passed (was 328)
  • Enriched labels visible in regenerated SORT1 causal HTML + multi-oracle HTML
  • No raw assay_ids leak through tables anywhere in chorus's report suite

🤖 Generated with Claude Code

lucapinello and others added 30 commits March 25, 2026 13:39
…und distributions

- New chorus/analysis module: multi-layer scoring (scorers.py), variant reports,
  quantile normalization, batch scoring, causal prioritization with enriched
  HTML tables (gene, cell type, per-layer score columns, top-3 IGV signal tracks),
  cell type discovery, region swap, integration simulation
- New chorus/analysis/build_backgrounds.py: variant effect and baseline signal
  background distributions for quantile normalization, with batch GPU scripts
- 8 application examples with full outputs: variant analysis (SORT1, TERT, BCL11A,
  FTO across AlphaGenome/Enformer/ChromBPNet), causal prioritization, batch scoring,
  cell type discovery, sequence engineering (region swap + integration simulation)
- Validation against AlphaGenome paper: SORT1 confirmed, TERT partially confirmed
  (ELF1 limitation documented), HBG2 not reproduced in K562 or monocytes (ISM vs
  log2FC methodology difference documented with side-by-side comparison)
- Fix mamba PATH resolution in environment runner and manager
- Add gene_name, cell_type fields to CausalVariantScore and BatchVariantScore
- 500 common SNPs BED file for background computation
- 91 tests covering all analysis components

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The annotation module was re-parsing the full 1GB GENCODE GTF file on every
call to get_genes_in_region, get_gene_tss, and get_gene_exons (~11s each).
Now the GTF is loaded once per feature type (gene/transcript/exon) and cached
as a DataFrame for the process lifetime. Exon lookups use a groupby index
for O(1) gene-name access.

Before: 11,000ms per query (full GTF scan)
After:  0.03s genes, 0.04s TSS, 1.5ms exons (cached)

Full analysis test suite now completes in 2 min (was timing out at 10+ min).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ores

Single-process AlphaGenome script that extracts all 3,763 valid tracks
(711 cell types × 6 output types) from each forward pass. Same GPU time
as K562-only (~55 min total on A100) but yields comprehensive per-layer
distributions across all cell types.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Comprehensive background distribution builder:
- 10K random SNPs from hg38 reference across all autosomes
- 20K protein-coding gene TSS positions (promoter state baselines)
- 5K random genomic positions (general baseline)
- Parallel GPU execution: --part variants --gpu 0 / --part baselines --gpu 1
- All 3,763 AlphaGenome tracks extracted per forward pass

Expected output: ~37M variant scores + ~94M baseline samples in ~18 hours.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…n counting

Covers all AlphaGenome output types:
- Window-based: DNASE, ATAC, CHIP_TF, CHIP_HISTONE, CAGE, PROCAP,
  SPLICE_SITES, SPLICE_SITE_USAGE
- Exon-counting: RNA_SEQ (sum across merged protein-coding exons per gene)
- All backgrounds unsigned (abs magnitude) for quantile ranking

Pre-loads GENCODE v48 gene annotations and builds spatial index for fast
exon lookup within prediction windows.
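
The exon-counting rule for RNA_SEQ could be sketched as summing track signal over merged exon intervals that fall inside the prediction window (a simplification — the real binning and coordinate handling may differ):

```python
import numpy as np

def rna_seq_score(track, exons, window_start, bin_size=1):
    """Sum signal across merged, non-overlapping exon intervals inside
    the prediction window (sketch of the exon-counting rule)."""
    total = 0.0
    for start, end in exons:
        lo = max(0, (start - window_start) // bin_size)   # exon start bin
        hi = max(lo, (end - window_start) // bin_size)    # exon end bin (excl.)
        total += float(track[lo:hi].sum())
    return total
```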

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…haul

Analysis framework:
- PerTrackNormalizer with per-track CDFs (effect, activity, perbin) for all 6 oracles
- Auto-download backgrounds from HuggingFace on oracle load
- AnalysisRequest dataclass preserves user's original prompt on every report
- Magnitude-gated interpretation labels ("Very strong" requires |effect| > 0.7)
- Top-10-per-layer cap in markdown reports with truncation footer
- Biological interpretation + suggested next steps on all 14 example outputs
- Literature caveats where oracle predictions diverge from published biology

Bug fixes:
- Sequence.slice() missing self argument in interval.py (broke Enformer predictions)
- oracle_name="oracle" placeholder in region_swap, integration, discovery
- Cell-type column bloat in batch_scoring and causal (thousands of cell types)
- Corrupted igv.min.js in 15 HTML files from misplaced injection into JS string
- predict() called with string region instead of (chrom, start, end) tuple

MCP server:
- All 8 critical tools accept user_prompt and forward it into reports
- _safe_tool decorator returns structured {"error", "error_type"} on failure
- Improved docstrings: score_variant_batch (variant dict schema), discover_variant_cell_types (runtime + cell count), fine_map_causal_variant (composite formula + output columns)
- Causal table shows Top Layer column; batch scoring resolves track IDs to human-readable names

Application examples (14 folders, all with MD/JSON/TSV/HTML):
- Regenerated all variant_analysis, validation, discovery, causal, batch, sequence_engineering
- Every report has Analysis Request header + Interpretation section
- Cleaned stale intermediate files (5 removed)
- IGV browser verified working in headless Chrome

Documentation:
- README: "Start here" applications callout, updated MCP tools list, MCP walkthrough link
- API_DOCUMENTATION: application layer section (all 6 functions + AnalysisRequest)
- MCP_WALKTHROUGH.md: 5 example conversations showing natural-language usage
- Natural-language framing notes on all 7 category READMEs
- Fixed AlphaGenome HF URL, clarified environment.yml vs chorus-base.yml
- Notebook install banners for comprehensive/advanced (all 6 oracles required)

Scripts:
- Internal scripts moved to scripts/internal/
- regenerate_examples.py + regenerate_remaining_examples.py for reproducible output generation
- scripts/README.md updated with public script descriptions

Testing:
- 268 tests passed (including new magnitude-gate and causal-table tests)
- All 3 notebooks executed end-to-end (Enformer, all 6 oracles, multi-oracle analysis)
- IGV browser rendering verified via Selenium in headless Chrome
- MCP server startup verified

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
These are replaced by the per-oracle build_backgrounds_*.py scripts.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove stale outputs containing machine-specific paths (/Users/lp698/...)
and runtime-specific logs. Notebooks should be committed clean so new
users run them fresh in their own environment.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Replace /PHShome/lp698/chorus with REPO_ROOT (computed from __file__)
  in all 8 public scripts (6 build_backgrounds + 2 regenerate)
- Clear stale notebook output cells containing machine-specific paths

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ll traceability

Batch scoring:
- Per-track columns (one per assay:cell_type) with raw score + percentile
- track_scores dict preserved on BatchVariantScore for programmatic access
- display_mode parameter: "by_assay" (default), "by_cell_type", "summary"
- Track ID footnotes for tracing back to oracle data
- oracle_name parameter fixes normalizer CDF lookup (was returning None)

Causal prioritization:
- Per-track columns replacing generic "Max Effect / Top Layer"
- Each cell shows raw effect + percentile for each scored track
- track_scores dict on CausalVariantScore

Report infrastructure:
- report_title field: "Region Swap Analysis Report", "Integration Simulation Report"
- modification_region: IGV highlights full replaced/inserted region (not 2-3bp)
- modification_description: documents what was inserted/replaced and its length
- has_quantile scoping fix (UnboundLocalError on empty allele_scores)

All examples regenerated with biologically specific tracks:
- SORT1: HepG2 DNASE + CEBPA + CEBPB + H3K27ac + CAGE (reproduces Musunuru)
- BCL11A: K562 DNASE + GATA1 + TAL1 + H3K27ac + CAGE (reproduces Bauer)
- FTO: HepG2 tracks (nearest metabolic cell type available)
- TERT: K562 tracks
- Validation: forced HepG2 CEBP tracks matching the AlphaGenome paper

Every report carries the user's original prompt (Analysis Request block).
All 13 examples verified: MD + JSON + TSV + HTML, prompt present, 268 tests pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…racle table

- Remove ChromBPNet loading and "Combining oracles" sections (belong in main README)
- Rename "Window" to "Output window" + add Resolution column
- Separate Effect percentile and Activity percentile explanations
- Add recommendation to start with AlphaGenome
- Remove Python API details (get_normalizer) that don't belong here

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- .mcp.json: drop the /data/pinello/... PATH hardcoding so new users can
  use the file as-is from `curl` in any environment. mamba resolves the
  chorus env without an explicit PATH override.
- README.md: add LDlink token setup section under Troubleshooting —
  fine_map_causal_variant auto-fetch path was silently failing for users
  without a free LDlink API key.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Changes informed by a fresh walkthrough of the README from the perspective
of a brand-new user:

- Reorder Installation: Fresh Install now comes before Upgrading (a first-
  time reader no longer sees "remove existing envs" before they install)
- Consolidate the Fresh Install block to cover env create, pip install,
  chorus setup --oracle enformer, and chorus genome download hg38 so a
  user copying the block ends up actually ready to run the Quick Start
- Clarify that the root environment.yml is what you install and the
  per-oracle YAMLs in environments/ are internal to `chorus setup`
- Quick Start: point to examples/single_oracle_quickstart.ipynb for users
  who prefer a notebook, and call out the setup prerequisite explicitly
- Annotate the ENCFF413AHU track ID in the DNase snippet so users know
  what it is before the Discovering Tracks section explains it
- HF_TOKEN: note that Claude Code inherits env from the shell where
  `claude` is started (the MCP server is spawned by that shell)
- Add a "Further reading" section linking the docs/ folder — previously
  API_DOCUMENTATION, METHOD_REFERENCE, VISUALIZATION_GUIDE, and
  IMPLEMENTATION_GUIDE were all invisible to a README-only reader
- Remove REQUIREMENTS_CHECKLIST.md (internal audit scratch file)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…gitignore

Real issues caught by a deeper audit pass and fixed:

- **Duplicate CAGE column headers**: batch scoring tables rendered two
  "CAGE:HepG2" columns because both + and - strand tracks have
  identical description fields. _track_display_name now appends (+) / (-)
  when the assay_id carries a strand suffix, producing unique column
  labels in markdown, HTML, and DataFrame outputs.
- **UnboundLocalError in _build_html_report**: has_quantile /
  has_baseline were defined inside a for-loop that doesn't execute for
  empty allele_scores, causing .to_html() to crash on minimally-populated
  reports. Initialise both before the loop (matches the markdown fix).
- **docs/RELEASE_CHECKLIST.md**: internal QA checklist with stale
  metrics (references 128 tests when we have 280). Removed — internal
  docs shouldn't live in user-facing docs/.
- **API_DOCUMENTATION / METHOD_REFERENCE overlap**: added reciprocal
  callouts clarifying that API_DOCUMENTATION is authoritative and
  METHOD_REFERENCE is a one-line cheat sheet.
- **logs/ not ignored**: 82 MB of run logs at risk of being committed.
  Added to .gitignore.

Test coverage added (+12 tests, 268 → 280 passed):
  - TestReportMetadataFields: report_title, modification_region,
    modification_description rendering in MD / HTML / dict
  - TestBatchDisplayModes: by_assay, by_cell_type, track-ID footnote,
    CAGE strand disambiguation in DataFrame columns
  - TestSafeToolDecorator: passthrough, exception → error dict,
    function name preservation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Columns and TSV headers now show CAGE:HepG2 (+) and CAGE:HepG2 (-)
instead of two identical CAGE:HepG2 columns.
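
The disambiguation could look like the following sketch (the catalog's actual strand-suffix convention is an assumption here):

```python
def track_display_name(description, assay_id):
    """Append a strand marker when the raw assay_id carries a strand
    suffix, so + and - strand tracks get distinct column labels."""
    name = description or assay_id
    if assay_id.endswith("+"):
        return f"{name} (+)"
    if assay_id.endswith("-"):
        return f"{name} (-)"
    return name
```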

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
A new user could miss this entirely — the previous mention was a single
feature bullet ("auto-downloaded from HuggingFace") with no detail.
This section spells out:

- The backgrounds turn raw log2FC into the effect/activity percentiles
  shown in every report
- They're fetched on first oracle use from the public HF dataset
  lucapinello/chorus-backgrounds and cached in ~/.chorus/backgrounds/
- File sizes per oracle (so users with limited disk know what to expect)
- **No HF_TOKEN required** for backgrounds (only AlphaGenome model is gated)
- LDlink token is separate and only needed for causal auto-fetch
- Optional pre-download snippet for users who want to avoid the first-use
  wait

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds a comprehensive reference appendix covering:

- What the backgrounds are (effect %ile vs activity %ile vs per-bin)
  and why they exist (turn raw log2FC into genome-aware metrics)
- How they were calculated:
    * Variant effect distribution: 10K random SNPs × all tracks with
      layer-specific scoring formulas (log2FC, logFC, diff)
    * Activity distribution: ~31.5K positions (random intergenic +
      ENCODE SCREEN cCREs + protein-coding TSSs + gene-body midpoints)
    * Per-bin distribution: 32 random bins per position for IGV scaling
    * RNA-seq exon-precise sampling rule
    * CAGE summary routing rule
- Sample sizes per oracle (track count, samples per track, NPZ size)
- Python API usage with verified signatures (get_pertrack_normalizer,
  download_pertrack_backgrounds, effect_percentile, activity_percentile,
  perbin_floor_rescale_batch)
- MCP / Claude usage (auto-attached, zero-config)
- Documented ranges and a sanity-check rule of thumb for interpretation
- How to reproduce or extend the backgrounds via build_backgrounds_*.py

All function signatures in the appendix were verified against the actual
implementation before committing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Local-only IDE/agent state (settings, scheduled tasks lock) — per-
developer, not for the repo.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…pies

- AUDIT_PROMPT.md: systematic end-to-end audit script for a new machine,
  with REPLACE_* placeholders for HF_TOKEN and LDLINK_TOKEN.
- .gitignore: block any *_WITH_TOKENS.md or AUDIT_PROMPT_WITH_TOKENS*
  file from ever being staged, since filled-in copies contain secrets.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Records what worked and what did not on a fresh macOS 15.7.4 / arm64
clone of chorus-applications: full install, all 6 oracle smoke-predicts,
286/286 pytest pass, 3 example notebooks (0 errors), 22 MCP tools registered,
6/6 application tools producing correct outputs (rs12740374 SORT1 case
reproduces the published Musunuru-2010 finding), 18/19 application HTML
reports IGV-verified via headless Chrome, ChromBPNet smoke build completed
end-to-end on CPU.

Top issues a macOS user hits, ranked, with one-or-two-line fixes:
  1. No Apple GPU (MPS / Metal / jax-metal) auto-detect — frameworks are
     installed but borzoi/sei/legnet only check torch.cuda.is_available(),
     SEI hardcodes map_location='cpu', chrombpnet/enformer envs lack
     tensorflow-metal. Verified Borzoi runs on MPS in 4.3 s when forced.
  2. SEI Zenodo download via stdlib urllib at ~80 KB/s — 3.2 GB tar takes
     ~11 h. curl -C - -L recovers it in ~30 min.
  3. fine_map_causal_variant rsID-only crash (KeyError 'chrom' at
     causal.py:355). Workaround: pass "chr1:pos REF>ALT" form.
  4. Two-mamba-installs MAMBA_ROOT_PREFIX gotcha breaks chorus health.
  5. Notebooks need explicit `python -m ipykernel install --user --name chorus`.
  6. SEI download has no single-flight lock — concurrent inits race.

Verdict: production-ready with caveats. None of the issues block correctness;
all are operational or one-line code fixes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Addresses every actionable item in audits/2026-04-14_macos_arm64.md.
All changes are platform-conditional — Linux CUDA paths are unchanged.

PyTorch oracles (borzoi, sei, legnet) — auto-detect MPS on Apple Silicon
  - Both the in-process loader (chorus/oracles/{borzoi,sei,legnet}.py) and
    the subprocess templates ({borzoi,sei,legnet}_source/templates/{load,
    predict}_template.py) now resolve `device is None` (or the new 'auto'
    sentinel) as: cuda > mps > cpu. Linux + CUDA box hits the cuda branch
    first, no behavior change there.
  - SEI: replaced the hard `map_location='cpu'` device pin (the value is
    still used to load weights to host memory before .to(device), which is
    the standard pattern across torch versions and works for mps too).
  - Sei BSplineTransformation lazily moved its spline matrix only when
    `input.is_cuda`. Generalized to any non-CPU device so the matmul works
    on MPS as well. Verified: 286/286 pytest still pass.
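
The cuda > mps > cpu resolution order could be sketched as (function name illustrative; the loaders' real signatures may differ):

```python
import torch

def resolve_device(device=None):
    """Resolve a device request as cuda > mps > cpu, honoring an
    explicit user choice (sketch of the auto-detect order above)."""
    if device not in (None, "auto"):
        return torch.device(device)          # explicit choice wins
    if torch.cuda.is_available():
        return torch.device("cuda")
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")
```

A Linux CUDA box hits the cuda branch first, so behavior there is unchanged; Apple Silicon falls through to mps.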

TensorFlow oracles (chrombpnet, enformer) — Metal backend on Apple Silicon
  - chorus/core/platform.py macos_arm64 adapter now adds
    `tensorflow-metal>=1.1.0` to pip_add. Once installed, Apple's plugin
    registers a 'GPU' physical device, so the oracles' existing
    tf.config.list_physical_devices('GPU') auto-detect picks it up with no
    code change. Linux paths don't see the macos_arm64 adapter so CUDA stays
    intact.

JAX oracle (alphagenome) — unchanged
  - Already explicitly skips Metal in auto-detect (jax-metal still missing
    `default_memory_space` for AlphaGenome). README updated to document
    this trade-off.

MCP fix — fine_map_causal_variant rsID-only crash
  - Calling `fine_map_causal_variant(lead_variant="rs12740374")` previously
    raised KeyError: 'chrom' at chorus/analysis/causal.py:355 because
    `_parse_lead_variant("rs12740374")` returns {"id": ...} only.
  - Backfill chrom/pos/ref/alt onto the sentinel from the LDlink response
    (which always carries them) before invoking prioritize_causal_variants.
  - Verified end-to-end: rs12740374 ranked #1 with composite=1.000 of 12 LD
    variants on AlphaGenome (matches the published Musunuru-2010 finding).

SEI Zenodo download — chunked + resume + single-flight lock
  - Replaced urllib.request.urlretrieve with a stdlib chunked urlopen loop
    that supports HTTP Range resume and an fcntl exclusive lock so two
    concurrent SeiOracle inits don't race the same partial file. Original
    observed throughput on macOS was ~80 KB/s (would take ~11 hours for the
    3.2 GB tar); the new path resumes interrupted downloads and progress-
    logs every 100 MB.

README — macOS troubleshooting + Apple GPU policy table + kernel install
  - Documented the two-mamba-installs MAMBA_ROOT_PREFIX gotcha that breaks
    `chorus health` when the new chorus env lands in a different mamba root
    than the per-oracle envs.
  - Added the per-oracle macOS GPU support matrix (MPS / Metal / CPU) with
    explicit `device=` examples.
  - Added the missing `python -m ipykernel install --user --name chorus`
    step to Fresh Install so examples/*.ipynb find the chorus kernel.

Validation on macOS 15.7.4 / Apple Silicon (CPU + MPS + Metal):
  - 286/286 pytest pass (incl. all 6 oracle smoke-predict tests)
  - chorus.create_oracle('borzoi') auto-detects mps:0
  - chorus.create_oracle('sei')    auto-detects mps:0 + smoke-predict ok
  - chrombpnet env now reports tf.config.list_physical_devices('GPU') = [GPU:0]
  - fine_map_causal_variant(lead_variant='rs12740374') ranks rs12740374
    composite=1.000 of 12 LD variants

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…EI resumable download, rsID backfill)

Verified on Linux CUDA: 285/285 code tests pass. AlphaGenome smoke errors are due to an expired HF token in the chorus-alphagenome env (unrelated to this PR).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- tests/test_mcp.py: TestFineMapRsidBackfill verifies fine_map_causal_variant
  backfills chrom/pos/ref/alt when a caller passes only an rsID lead_variant.
  Regression test for the macOS audit crash (KeyError: 'chrom').
- examples/*.ipynb: Re-executed all three notebooks end-to-end on Linux CUDA
  to refresh outputs against the merged audit branch.

Full suite now: 286/286 tests pass (including alphagenome real-oracle smoke).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Regeneration on Linux CUDA GPU 1 (GPU 0 was full):
- AlphaGenome variant + validation (5 examples) — 28 min
- Remaining AlphaGenome (batch/causal/discovery/seq) — ~14 min
- Enformer SORT1 — 1 example
- ChromBPNet SORT1 — 1 example
Discovery HTML filenames now use oracle_name "alphagenome" (was placeholder "oracle").

Verification:
- 289/289 tests pass across combined runs (6 oracle smoke tests green on GPU 1)
- Selenium screenshot sweep: 18/19 HTML render cleanly; 1 "NO-IGV" is the
  batch_scoring HTML which is a scoring table by design (no browser view)
- Hardcoded /PHShome/lp698/chorus paths in notebook log outputs redacted to
  /path/to/chorus

.gitignore: ignore examples/applications/**/*_screenshot.png so selenium
artifacts don't pollute the repo.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…lper

Adds audits/2026-04-15_macos_arm64_post_merge.md — the second-pass
end-to-end audit on a fully wiped + fresh-cloned install, after the
PR #7 macOS-support changes were merged. Every fix from v1 is
confirmed working on a clean setup:

  * chrombpnet + enformer envs now pull in tensorflow-metal
    automatically → `Auto-detected 1 GPU(s) … name: METAL`
  * borzoi/sei/legnet auto-detect mps:0
  * fine_map_causal_variant("rs12740374") rsID-only returns
    rs12740374 composite=0.963 of 12 LD variants (was KeyError in v1)
  * analyze_variant_multilayer reproduces Musunuru-2010 biology
    (CEBPA strong binding gain +0.37, DNASE strong opening +0.43)
  * 286/286 pytest, 0 notebook errors, 19/19 IGV reports ok

Two download-reliability findings surfaced on this clean run (both
pre-existing, both same bug-class as the SEI fix that landed in
PR #7): chorus/utils/genome.py stalled at 36% of the hg38 download,
and chorus/oracles/chrombpnet.py has no single-flight lock so two
concurrent callers race the ENCODE tar and hit EOFError.

This commit also adds chorus/utils/http.py — the resume+lock helper
that previously lived inside SeiOracle, now extracted as a shared
stdlib-only utility so genome + chrombpnet can reuse it. The
sei.py helper shim keeps the old public API working.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…elper

Three call sites fetch large files from the public internet with plain
urllib.request.urlretrieve (no resume, no concurrency lock). The
2026-04-15 v2 audit on a fresh install hit two of them the hard way:
UCSC cut the hg38 connection at ~36% of the 938 MB download
(urllib.error.URLError: retrieval incomplete: got only 363743871 out
of 983659424 bytes), and two concurrent callers of
_download_chrombpnet_model raced the same partial ENCODE .tar.gz so
one read it mid-write and hit
  EOFError: Compressed file ended before the end-of-stream marker was
           reached
inside tarfile.extractall.

Re-use the resume+lock helper introduced for SEI in PR #7, lifted
into chorus/utils/http.py in the preceding commit:

  chorus/oracles/sei.py
    _download_with_resume staticmethod becomes a thin shim that
    forwards to chorus.utils.http.download_with_resume. No behaviour
    change and no API break.

  chorus/utils/genome.py
    GenomeManager.download_genome swaps urllib.request.urlretrieve
    for download_with_resume. Fixes the UCSC stall observed in the
    v2 audit; partial .fa.gz is now resumable across retries.

  chorus/oracles/chrombpnet.py
    _download_chrombpnet_model (ENCODE tar) and _download_jaspar_motif
    (JASPAR motif) both route through download_with_resume. The fcntl
    lock on <dest>.lock serialises concurrent callers so the pytest
    smoke fixture and a background build_backgrounds_chrombpnet.py
    job can no longer corrupt each other's download.

All three changes are platform-agnostic; the helper is stdlib-only
(urllib + fcntl). Linux CUDA is not touched.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…S audit

v2 audit confirmed post-merge macOS works end-to-end (286/286 tests,
19/19 HTML, reproduces Musunuru-2010 biology, rsID backfill verified).

Adds chorus/utils/http.py (resume + fcntl-lock) and routes hg38 genome +
chrombpnet ENCODE tar + JASPAR motif downloads through it. SEI helper
becomes a shim for backward-compat.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Third-pass audit going one level deeper than v1 (pre-merge smoke) and
v2 (post-merge fresh install). Scope: 14 application examples × 19 HTML
reports × 6 per-track normalizer NPZs + the scoring/normalization stack.

Read-only deliverable — the fixes identified here belong in a separate
follow-up PR after review.

Findings (5, ranked by severity):

  1. HIGH — chrombpnet_pertrack.npz:DNASE:hindbrain has 0 background
     samples and an all-zeros CDF. PerTrackNormalizer.effect_percentile()
     silently returns 1.0 for every raw_score (including 0.0) because
     np.searchsorted on a zeros row ranks everything at the end, and
     _get_denominator falls through to cdf_width=10000 when counts[idx]=0.
     Same bug class as the v2 concurrent-download race that landed in
     PR #8 — the hindbrain model download failed silently and left a
     zero-count reservoir. Impact: any variant scored against
     DNASE:hindbrain in ChromBPNet gets a false "100th percentile".

  2. MEDIUM — every committed HTML report loads igv.min.js from
     cdn.jsdelivr.net at view time. 2/19 reports flaked on
     net::ERR_CERT_AUTHORITY_INVALID during this audit; any user
     behind a corporate proxy / airgapped network / jsdelivr outage
     will see IGV silently fail with no fallback. No SRI either.

  3-5. LOW — documentation improvements:
     - TERT_promoter example doesn't caveat that C228T's published
       biology is melanoma-specific; K562 result (all negative) is
       correctly modelled but reads as "no effect" without context
     - AlphaGenome DNASE vs ChromBPNet ATAC disagree on rs12740374
       direction in HepG2 (+0.45 vs -0.11); no application note
       teaches this real cross-oracle divergence
     - HBG2_HPFH footer notes BCL11A/ZBTB7A catalog absence; could
       be tightened

Normalization stack verified clean:
  - CDF monotonicity: 0 bad rows across 18,159 tracks × 10,000 points
  - signed_flags match LAYER_CONFIGS.signed exactly (AG 667 RNA-seq,
    Borzoi 1543 stranded RNA, SEI 40/40 regulatory_classification,
    LegNet 3/3 MPRA; Enformer 0 signed is correct — no RNA-seq)
  - Build-vs-scoring window_bp bit-identical via shared LAYER_CONFIGS
  - Pseudocount/formula: _compute_effect reproduces reference
    implementation with diff=0.0 across all test cases
  - perbin_floor_rescale_batch math verified at all edges
  - Edge cases: unknown oracle → None, unknown track → None,
    raw=0 → 0.0, raw=huge → 1.0

Phase A rerun on 4 AlphaGenome literature-checked cases (SORT1, TERT,
FTO, BCL11A) confirms biology is preserved but results are NOT bit-
identical — raw_score drift ~1-2% on dominant tracks, larger quantile
swings on near-zero tracks due to AlphaGenome's JAX CPU non-
determinism. No committed example is stale. Noise-floor handling for
|raw_score| < ~1e-3 added to follow-up recommendation list.

Artifacts:
  - audits/2026-04-16_application_and_normalization_audit.md (main report)
  - audits/2026-04-16_screenshots/*.png (19 full-page PNGs)
  - audits/2026-04-16_data/*.json (per-app cards + normalization/selenium/rerun JSON)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ixes

Addresses the findings in audits/2026-04-16_application_and_normalization_audit.md
(PR #9). Three categories of change:

1. Delete two example applications the audit recommends removing:

   - examples/applications/variant_analysis/TERT_promoter/
     C228T is a melanoma-specific gain-of-function mutation; the example
     runs it in K562 (erythroleukemia) and shows all-negative effects.
     The biology is correct for the model but inverts the published
     direction. Rather than add a "wrong cell type" caveat, drop the
     example — SORT1 / FTO / BCL11A cover variant_analysis without
     teaching the reader a misleading result.

   - examples/applications/validation/HBG2_HPFH/
     Already self-documented as "Not reproduced" in
     validation/README.md: BCL11A / ZBTB7A aren't in AlphaGenome's track
     catalog, so the repressor-loss mechanism isn't visible. Keeping a
     "validation failed" example alongside the working
     SORT1_rs12740374_with_CEBP confuses readers. Drop it.

   Also updated: root README.md (replaces HBG2_HPFH link with
   SORT1_rs12740374_with_CEBP), examples/applications/variant_analysis/README.md
   (drops TERT prompt + section), examples/applications/validation/README.md
   (drops HBG2 row + section + reproduce snippet),
   scripts/regenerate_examples.py + scripts/internal/inject_analysis_request.py
   (both lose their TERT_promoter/HBG2_HPFH entries).

2. Normalizer: guard against zero-count CDF rows
   (chorus/analysis/normalization.py).

   Audit finding #1 (HIGH): the committed chrombpnet_pertrack.npz has
   DNASE:hindbrain with effect_counts[idx] == 0 and a zero-filled CDF
   row. effect_percentile() / activity_percentile() silently returned
   1.0 for every raw_score (including 0.0) because np.searchsorted on
   a zeros row returns len(row) for any non-negative probe and the
   denominator falls through to cdf_width. Same bug-class as the v2
   chrombpnet concurrent-download race that landed in PR #8 — the
   hindbrain ENCODE tar must have failed to extract cleanly during the
   original background build.

   New private helper _has_samples() returns False when
   counts[idx] == 0, which makes _lookup / _lookup_batch return None.
   Callers already render None as "—" in MD/HTML tables, so users now
   see "no background" instead of a silent false "100th percentile".
   Counts-less NPZs (older format, no counts field) are treated as
   valid — no regression.
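The guard reduces to a small sketch — names here are illustrative stand-ins for the private _has_samples / _lookup helpers, assuming a per-track counts array and one CDF row per track:

```python
import numpy as np

# Minimal illustration of the guard; has_samples / lookup_percentile are
# stand-ins for the private _has_samples / _lookup helpers.
def has_samples(counts, idx):
    # Counts-less NPZs (older format, no counts field) are treated as valid.
    return counts is None or counts[idx] > 0

def lookup_percentile(cdf_row, counts, idx, raw_score):
    if not has_samples(counts, idx):
        return None  # callers render None as "—", not a false 100th percentile
    # On an all-zeros CDF row, searchsorted returns len(row) for any
    # non-negative probe — exactly the silent-1.0 failure being guarded.
    pos = np.searchsorted(cdf_row, raw_score, side="right")
    return pos / len(cdf_row)
```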

3. Report: suppress quantile_score when raw_score is in the noise floor
   (chorus/analysis/variant_report.py).

   Audit finding #6 (LOW): when |raw_score| < 1e-3 the effect CDF is
   so densely clustered around 0 that a 1-2% raw-score drift can swing
   the quantile by 0.5+ (observed in the Phase A rerun: committed
   quantile=1.0 vs rerun=0.21 for a CEBPB track with raw_score ~1e-4).
   Set quantile_score = None in that regime so the HTML/MD tables
   render "—" and readers don't misread noise as signal. Threshold
   chosen conservatively to cover both log2fc (pc=1.0) and logfc RNA
   (pc=0.001) without hiding real effects.
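The suppression rule itself is tiny — the function name below is illustrative; in the real code the check sits inside the report's normalization step:

```python
# Conservative threshold covering both log2fc (pc=1.0) and logfc RNA (pc=0.001).
NOISE_FLOOR = 1e-3

def suppress_noise_quantile(raw_score, quantile_score):
    # Sketch: when |raw| is in the noise floor, the effect CDF is so densely
    # clustered at 0 that the quantile is meaningless — render "—" instead.
    if quantile_score is not None and abs(raw_score) < NOISE_FLOOR:
        return None  # MD/HTML tables render None as "—"
    return quantile_score
```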

4. IGV.js: lazy-download the bundle into ~/.chorus/lib on first use
   (chorus/analysis/_igv_report.py + chorus/analysis/causal.py).

   Audit finding #2 (MEDIUM): reports embed a <script src="..."> to
   cdn.jsdelivr.net that gets evaluated every time the HTML is opened
   in a browser. Any viewer on an airgapped network / corporate proxy
   that MITMs TLS / during a jsdelivr outage sees IGV silently fail
   (2/19 audit reports hit ERR_CERT_AUTHORITY_INVALID). The local-
   cache code path already existed but was opt-in (user had to drop a
   file in ~/.chorus/lib/igv.min.js manually).

   New _ensure_igv_local() helper runs on the first report generation
   and populates the cache via chorus.utils.http.download_with_resume
   (the helper that landed in v2 PR #8). Reports written after the
   first successful download inline the JS directly — self-contained
   HTML that opens anywhere without network. Download failure is
   logged at WARNING and the CDN <script> tag is used as fallback,
   preserving the current behaviour for anyone who can't reach
   jsdelivr at generation time.
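The lazy-cache pattern can be sketched as follows — URL, cache path, and function name are illustrative, and the real helper goes through chorus.utils.http.download_with_resume rather than raw urllib:

```python
import logging
import shutil
import urllib.request
from pathlib import Path

IGV_CDN = "https://cdn.jsdelivr.net/npm/igv/dist/igv.min.js"  # illustrative URL
CACHE = Path.home() / ".chorus" / "lib" / "igv.min.js"

def ensure_igv_local(cache=CACHE, url=IGV_CDN):
    """Sketch: populate the local IGV.js cache once, on first report build."""
    if cache.exists():
        return cache  # already cached — no network
    try:
        cache.parent.mkdir(parents=True, exist_ok=True)
        with urllib.request.urlopen(url, timeout=30) as resp, open(cache, "wb") as out:
            shutil.copyfileobj(resp, out)
        return cache
    except OSError as exc:
        logging.warning("IGV.js download failed (%s); using CDN <script> fallback", exc)
        return None  # caller emits the CDN <script src> tag instead
```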

All changes are platform-agnostic; all 287 pytest tests continue to
pass; the fixes are verified behaviourally:

  >>> norm.effect_percentile('chrombpnet', 'DNASE:hindbrain', 0.0)
  None                      # was: 1.0
  >>> norm.effect_percentile('chrombpnet', 'DNASE:HepG2', 0.0)
  0.0                       # unchanged

  >>> ts = TrackScore(raw_score=0.0005, ...)
  >>> _apply_normalization(ts, ...); ts.quantile_score
  None                      # noise floor

See audits/2026-04-16_application_and_normalization_audit.md (PR #9)
for full context, per-app screenshots, and the Phase A / B / C
methodology behind each finding.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
lucapinello and others added 29 commits April 17, 2026 11:14
Screenshot review found causal prioritization MD+HTML still used the
old '(100%)' / '(95%)' percentile format. The variant_report tables
and batch_scoring tables already used '≥99th' / '0.95' / '≤1st' via
_fmt_percentile (added in the audit pass). Causal was the last
remaining site.
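A hypothetical stand-in for _fmt_percentile shows the shared convention (exact thresholds and signature are assumptions):

```python
def fmt_percentile(p):
    """Render a [0, 1] percentile, clamping the tails to qualitative labels
    so CPU non-determinism near the extremes never shows as '+100.0%'."""
    if p is None:
        return "—"      # no background distribution available
    if p >= 0.99:
        return "≥99th"
    if p <= 0.01:
        return "≤1st"
    return f"{p:.2f}"
```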

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two scary-looking warnings surfaced while reading notebook cell outputs
in the v7 audit. Neither is a real problem but both alarm users:

1. chorus/core/base.py:323 — case-sensitive compare of reference allele
   vs genome. pyfaidx returns lowercase for softmasked (repetitive)
   regions; users always pass uppercase. The previous code fired
   'Provided reference allele is not the same as the genome reference'
   on every variant in a softmasked locus (e.g. GATA1 TSS in quickstart
   notebook cell 39, comprehensive notebook cells 35 and 51). Now uses
   .upper() on both sides; also includes the actual allele pair in the
   warning message so users can confirm.
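The fix amounts to a case-insensitive compare with an informative message — sketched here with an illustrative function name, not the actual chorus API:

```python
def reference_allele_matches(provided, genome_base, warn=print):
    # pyfaidx returns lowercase bases for softmasked (repetitive) regions,
    # so compare case-insensitively and include both alleles in the warning.
    if provided.upper() != genome_base.upper():
        warn(f"Provided reference allele ({provided!r}) is not the same as "
             f"the genome reference ({genome_base!r})")
        return False
    return True
```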

2. chorus/core/result.py:104 — 'Unknown implementation' warning fired
   for every Sei track (Stem cell / Multi-tissue / H3K4me3 etc.) that
   isn't in the hardcoded assay_type registry. The generic fallback
   works correctly; the warning was just noise. Downgraded to
   logger.debug.

Scientific review of outputs:
- SORT1 rs12740374: predictions match Musunuru 2010 mechanism (CEBPA/B
  binding gain, DNASE opening, H3K27ac gain, CAGE TSS increase) ✓
- BCL11A rs1427407: TAL1 binding loss + DNASE closing in K562 ✓
- FTO rs1421085: minimal effects in HepG2 (expected — adipose tissue) ✓
- TERT chr5:1295046 T>G: E2F1 binding gain + TERT TSS CAGE increase ✓
- SORT1 causal: rs12740374 ranks #1 composite=0.964 ✓

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ght CI

v8 found zero action items but called out four scenarios it did not
exercise. This PR closes them.

Fast suite (299 → 303): tests/test_error_recovery.py
- HF download ConnectionError → graceful return + warning log
- download_with_resume .partial file resume via Range header
- AlphaGenome missing HF_TOKEN → friendly actionable error
- Missing oracle env → "chorus setup" hint + graceful fallback
All four mocks, run in ~1.5 s, no network.

Integration suite (gated by @pytest.mark.integration):
tests/test_integration.py
- SEI + LegNet CDF download from HF dataset (v8 didn't trigger these
  because no regen workflow uses sei/legnet) — verified the NPZs
  load and pass monotonicity + p50/p95/p99 + counts checks
- ChromBPNet ATAC:K562 fresh download from ENCODE (~500 MB, 8 min)
  to a tmp dir — verifies the shared download_with_resume helper
  end-to-end without touching the 37 GB real cache
- First E2E test of the MCP server: spawn chorus-mcp stdio
  subprocess via fastmcp Client, call list_oracles + load_oracle +
  analyze_variant_multilayer on SORT1 rs12740374 with real
  AlphaGenome predict (~4.5 min)

Light CI: .github/workflows/tests.yml
- Runs fast suite on every PR and push to main/chorus-applications
- Linux ubuntu-latest + Miniforge + mamba + pip install -e .
- Skips smoke tests (~10 GB models exceed runner disk) and
  integration tests (too slow for per-PR feedback)
- workflow_dispatch for manual maintainer runs

pytest.ini: registers the `integration` marker.

Verified:
- pytest -m "not integration" → 303 passed, 4 deselected (58.9 s)
- pytest -m integration → 4 passed (13 min total)
- Fresh ATAC:K562 tarball streamed with .partial + fcntl lock, fold 0
  weights loaded into TF, 2114 bp predict returns finite values
- chorus-mcp subprocess round-trips analyze_variant_multilayer
  response back to the client

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…unts

Full sweep of user-facing docs after the multi-pass audit series. 11 fixes:

BLOCKS_USER (wrong biology in example READMEs):
- variant_analysis/SORT1_chrombpnet/README.md: was claiming "+0.441 Strong
  opening", actual is "-0.111 Moderate closing". Replaced with correct
  values + cross-oracle divergence explanation.
- sequence_engineering/region_swap/README.md: entire scenario was wrong
  (promoter swap at chr1:1000500 vs actual SORT1 enhancer replacement at
  chr1:109274500). Rewrote from scratch to match actual example_output.md.
- sequence_engineering/integration_simulation/README.md: wrong direction
  (DNASE -0.900 vs actual +4.22), wrong filename. Rewrote.
- variant_analysis/SORT1_enformer/README.md: stale HepG2-focused numbers
  but actual is discovery-mode. Rewrote with top hits from current output.

CONFUSING (stale numbers):
- causal_prioritization/README.md:116: composite 0.898 → 0.964
- batch_scoring/README.md example table: old 4-column format → new
  per-track Ref/Alt/log2FC/Effect %ile format
- causal SORT1_locus/example_output.md: (100%) → (≥99th) via in-place
  patch (percentile format now consistent with other reports)
- validation/README.md:33,58: stale TERT CAGE +0.120 → +0.34

Track count drift (pick-one):
- 5,930 / 5930 → 5,731 across docs/API_DOCUMENTATION.md,
  docs/variant_analysis_framework.md, README.md, examples/applications/
- 230 bp → 200 bp for LegNet input size across same files

POLISH:
- docs/IMPLEMENTATION_GUIDE.md tree: removed "# Placeholder" markers on
  borzoi/chrombpnet/sei (all implemented); added analysis/ and mcp/
  directories; updated utils/ to reflect current files.
- multilayer_variant_analysis.md: "Quantile range" header → "Effect
  percentile range" to match the rest of the repo's terminology.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Full fresh-install audit at fbaef50 after wiping 13.2 GB (7 mamba envs,
~/.chorus/, HF chorus models). This pass focused on reading the
actual content of every example output, not just error counts.

Four findings, all low-to-medium environmental:

1. MEDIUM: TF Hub's /var/folders/.../tfhub_modules/ cache survives
   chorus teardowns. A stale partial download from a prior session
   made Enformer smoke fail with "'saved_model.pb' nor
   'saved_model.pbtxt'". Clearing it fixes; README should document.

2. MEDIUM (regression from v8): On this audit's SSL-MITM'd network,
   stdlib urllib in download_with_resume fails on
   cdn.jsdelivr.net/igv.min.js with cert-verify error. 6/16 HTMLs
   landed on the CDN <script> fallback. huggingface_hub's httpx
   works through the same proxy — robust fix is to mirror
   igv.min.js on the HF chorus-backgrounds dataset and fall back
   there when urllib fails.

3. LOW: FTO README promises adipose tracks but example runs with
   HepG2 (documented in the prompt only, not the README).

4. LOW: Notebooks run via `jupyter nbconvert` without
   `mamba activate chorus` emit 20-60 `bgzip is not installed` ERROR
   lines per notebook from coolbox. bgzip IS in the env — PATH just
   isn't inherited. Plots render via in-memory fallback; user sees
   scary error spam.

Verified:
- 303/303 fast pytest (17.5 s)
- 6/6 oracle smoke (after tfhub clear)
- 12/12 examples regenerate, max Δeff 0.036 (CPU non-det)
- 0 orphan HTMLs after parallel regen (v6 API fix live)
- 3 notebooks execute, 0 errors, plots render (despite bgzip noise)
- 16/16 HTMLs show correct biology in spot-check vs literature
- 6/6 CDFs pass monotonicity/p50/p95/p99/counts
  (first audit to empirically verify sei + legnet CDFs;
   v9 integration test now automates)

Biology confirmed on: SORT1 rs12740374 (Musunuru 2010 CEBP mechanism),
BCL11A rs1427407 (TAL1 disruption in K562), TERT chr5:1295046
(E2F1 + CAGE activation), region_swap (enhancer removal closes
chromatin), integration_simulation (CMV insertion opens chromatin).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ip PATH

All 4 findings from audit PR #20. Fast suite 303 → 308.

1. MEDIUM: TF Hub corrupt-cache recovery (enformer.py + load_template.py)
   When an earlier Enformer download was interrupted, tfhub_modules/
   keeps a directory with no saved_model.pb. hub.load then raises with
   the bad path in the message. Added _load_enformer_with_tfhub_recovery
   that parses the error, wipes the bad dir, and retries once. Applied
   to both in-process (_load_direct) and subprocess (template) paths.
   README Troubleshooting now documents the manual workaround and the
   auto-recovery behavior.
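The recovery path can be sketched as follows; the error-message regex and the signature are assumptions standing in for _load_enformer_with_tfhub_recovery:

```python
import re
import shutil
from pathlib import Path

def load_with_tfhub_recovery(load_fn, handle):
    """Sketch: on a corrupt-cache error, parse the bad tfhub_modules path
    out of the message, wipe that directory, and retry exactly once."""
    try:
        return load_fn(handle)
    except OSError as exc:
        match = re.search(r"(/\S*tfhub_modules/[^\s']+)", str(exc))
        if match is None:
            raise  # unrelated errors propagate unchanged
        bad = Path(match.group(1))
        # When the message names a file inside the module dir, the dir
        # missing saved_model.pb is the path's parent.
        shutil.rmtree(bad if bad.is_dir() else bad.parent, ignore_errors=True)
        return load_fn(handle)  # retry exactly once
```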

2. MEDIUM (v8 regression): IGV JS fallback via huggingface_hub
   (_igv_report.py). On SSL-MITM networks stdlib urllib rejects the
   proxy cert and CDN fetch of igv.min.js fails. Added secondary
   fallback via hf_hub_download from lucapinello/chorus-backgrounds
   — huggingface_hub uses httpx+certifi which works through the same
   proxies that block urllib. Graceful no-op if the HF file doesn't
   exist yet (existing CDN <script> fallback still kicks in). Dataset
   upload of igv.min.js to the HF repo is a separate one-time task
   for the maintainer; until then this code path silently downgrades
   to the current behavior.

3. LOW: FTO README adipose claim vs HepG2 reality
   (examples/applications/variant_analysis/FTO_rs1421085/README.md).
   Rewrote the Tracks section to accurately state that the committed
   example uses HepG2 as a "nearest metabolic" proxy and shows what a
   no-signal call looks like. Included ready-to-use adipose-track
   assay_ids for users who want the biologically ideal run.

4. LOW: bgzip/tabix PATH when nbconvert skips mamba activate
   (chorus/__init__.py). Prepend sys.executable's bin/ to PATH at
   chorus import time. coolbox's subprocess calls to bgzip/tabix now
   succeed instead of emitting 20-60 ERROR lines per notebook and
   falling back to TabFileReaderInMemory. Cheap and idempotent
   (only prepended if not already present).
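Fix 4 is a few lines at import time; this sketch mirrors the described behaviour with an illustrative function name:

```python
import os
import sys

def ensure_env_bin_on_path():
    """Sketch: make bgzip/tabix from the active env visible to subprocesses
    even when the env was never activated (e.g. bare `jupyter nbconvert`)."""
    env_bin = os.path.dirname(sys.executable)
    parts = os.environ.get("PATH", "").split(os.pathsep)
    if env_bin not in parts:  # idempotent: prepend at most once
        os.environ["PATH"] = env_bin + os.pathsep + os.environ.get("PATH", "")
```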

Tests: 5 new in tests/test_error_recovery.py
- test_corrupt_cache_is_cleared_and_retry_succeeds
- test_unrelated_errors_propagate_unchanged
- test_hf_fallback_when_cdn_fails
- test_returns_none_when_both_fail
- test_env_bin_on_path_after_import

Verified: pytest -m "not integration" → 308 passed (was 303).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The v8 audit added an fcntl lock around the OUTER ENCODE tarball
extraction. The v9 scorched-earth audit revealed that two concurrent
callers (MCP load_oracle + a jupyter notebook kernel both asking for
ChromBPNet) still raced on the INNER loop that extracts the three
per-fold subtarballs (bias_scaled / chrombpnet / chrombpnet_nobias),
producing:

    FileExistsError: [Errno 17] File exists:
    '.../downloads/chrombpnet/DNASE_HepG2/models/fold_0/chrombpnet_nobias'

Fix: re-acquire the lock around the inner loop and skip any t_out
directory that's already populated (empty-check via os.listdir).
Also fix the bare except to preserve OSError → Exception.
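The fixed inner loop can be sketched as follows (signature and paths illustrative, standing in for the ChromBPNet per-fold extraction):

```python
import fcntl
import os
import tarfile

def extract_subtarballs(subtars, lock_path):
    """Sketch: extract (tar_path, out_dir) pairs under an exclusive lock,
    skipping any out_dir a concurrent caller already populated."""
    with open(lock_path, "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)  # serialise concurrent extractors
        try:
            for tar_path, out_dir in subtars:
                if os.path.isdir(out_dir) and os.listdir(out_dir):
                    continue  # already populated — avoid FileExistsError races
                with tarfile.open(tar_path) as tf:
                    tf.extractall(out_dir)
        finally:
            fcntl.flock(lock, fcntl.LOCK_UN)
```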

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fresh-install audit at e99fd66 verifying all 4 v10 fixes on a truly
clean slate. Teardown: 14.2 GB including tfhub_modules/ this time.

All 4 v10 fixes verified live:
- Fix #1 (tfhub recovery): code path exists + first-install smoke
  passes on wiped tfhub cache.
- Fix #2 (IGV HF fallback): 0/16 HTMLs fell back to CDN on the same
  SSL-MITM network that had 6/16 fallbacks in v10.
- Fix #3 (FTO README): accurate HepG2 framing + adipose assay_ids
  block for the ideal run.
- Fix #4 (bgzip PATH): 0 'bgzip is not installed' lines across 235
  notebook cells (v10 had 20/34/60 per notebook).

One minor regression exposed: Fix #4 makes tabix findable, which
reveals a pre-existing bug where download_gencode leaves a stale
.tbi file that coolbox's `tabix -p gff` rejects with "index file
exists". Workaround = delete .tbi; NB1 retry succeeded. Proposed
3-line follow-up fix to annotations.py documented in the report.

Also verified:
- 308/308 pytest on fresh env (17.3 s)
- 6/6 oracle smoke (7 min 2 s) — first Enformer fresh-install with
  wiped tfhub cache
- 12/12 regen within AlphaGenome CPU non-determinism tolerance
- 0 orphan HTMLs after parallel regen
- 3 notebooks: 0 errors, 0 warnings, 0 bgzip spam
- 16/16 HTMLs clean in Selenium
- FTO README spot-check confirms Fix #3 committed correctly

After 11 audit passes — the last two have surfaced no actual chorus
bugs, only environmental quirks (tfhub cache, SSL MITM, PATH
inheritance, stale .tbi).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…efreshes GTF

v11 audit exposed a pre-existing bug masked by the pre-Fix-#4 state where
tabix was not on PATH: when download_annotation refreshes a GTF, leftover
coolbox artefacts (file.gtf.bgz + file.gtf.bgz.tbi from a previous session)
point at byte offsets in the old .bgz that no longer match the new one.
coolbox then calls tabix -p gff file.bgz on its next GTF() read, tabix
refuses to overwrite without -f, and the notebook cell crashes with:

    CalledProcessError: Command '['tabix', '-p', 'gff', ...]'
    returned non-zero exit status 1.

Fix: in AnnotationManager.download_annotation, after sort_annotation
writes the fresh GTF, unlink any stale .bgz / .bgz.tbi / .gz.tbi
sharing the same stem. coolbox then regenerates them cleanly on first
GTF() call. Three extra unlink() calls on a not-hot path.
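The cleanup reduces to a few unlink calls — sketched with an illustrative helper name; filenames follow the commit description:

```python
from pathlib import Path

def unlink_stale_gtf_indexes(gtf_path):
    """Sketch: after writing a fresh GTF, remove leftover coolbox artefacts
    whose byte offsets point into the previous .bgz."""
    gtf = Path(gtf_path)
    for suffix in (".bgz", ".bgz.tbi", ".gz.tbi"):
        stale = gtf.with_name(gtf.name + suffix)
        stale.unlink(missing_ok=True)  # coolbox regenerates on next GTF() read
```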

Unit test: TestStaleGTFIndexCleanup in tests/test_error_recovery.py
mocks requests.get + sort_annotation, primes the annotations dir with
stale .bgz and .tbi, and verifies both are removed after
download_annotation returns.

Verified: pytest -m "not integration" → 309 passed (was 308).

Without this fix, a notebook run after any annotation refresh
(download_gencode() called twice across sessions, or a newer GENCODE
version pulled) hits the tabix error on first coolbox visualization cell.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tall

Full cold-start on /srv/local/lp698/chorus-audit-v9:
- 7 chorus envs wiped + rebuilt via chorus setup
- ~/.chorus + ~/.cache/huggingface wiped + re-downloaded
- 314/314 tests pass (308 code + 6 oracle smoke)
- 3 notebooks: 0 errors across 235 cells, 32 plots
- All 13 examples regenerated fresh
- Selenium: 16/16 HTML reports CLEAN

Includes 2026-04-17_v9_scorched_earth_audit.md report with
scientific content review — every prediction cross-checked against
published literature (Musunuru 2010 for SORT1, Bauer 2013 for BCL11A,
Claussnitzer 2015 for FTO, etc.). All match textbook biology.

One bug found + fixed during audit: ChromBPNet nested-tar race
(commit 7834d3c, merged earlier).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ance, multi-oracle

Every report HTML (variant, causal, discovery, region-swap, integration,
batch, validation) now opens with a shared "How to read this report"
glossary that names each layer's effect formula (log2FC / lnFC / Δ).
Per-layer table headings carry a formula chip; summary strings cite the
specific track and cell type that drove each headline number; the causal
SORT1 report merges its old Rankings + Details sections into one
expandable table with per-layer top-track provenance.

New MultiOracleReport renders a cross-oracle consensus matrix for one
variant scored by several oracles (example: SORT1 rs12740374 scored with
ChromBPNet + LegNet + AlphaGenome), flagging where models agree or
disagree on direction per layer.

Shared logic lives in chorus/analysis/_report_glossary.py so renderers
stay consistent; VariantReport.from_dict enables JSON rehydration used by
scripts/rerender_examples.py to refresh HTML without re-running oracles.
14 new tests pin the new behaviour; all 333 suite tests pass.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…sensus

The consensus matrix previously rendered a single oracle's direction as
"all ↑" / "all ↓" — technically correct (all reporting oracles agree
trivially when there's one) but misleading to users who see "all ↑"
and assume multiple oracles concurred. v11 content-review audit
caught this on the committed SORT1 multi-oracle example, where
chrombpnet, legnet, and alphagenome are specialists — each reports a
different subset of layers — and the three "all ↑" rows in the matrix
were actually single-oracle votes.

Change:
- _consensus_rows now emits "single_gain" / "single_loss" when
  exactly one oracle reports a direction, separate from
  "consensus_gain" / "consensus_loss" which now require ≥2 oracles.
- Markdown renders as "only ↑ (n=1)" / "only ↓ (n=1)"; HTML uses
  "↑ only (n=1)" / "↓ only (n=1)" with a new neutral-grey
  .agree-single CSS class so users visually distinguish trivial
  single-voter layers from real cross-oracle consensus.
- Existing "all ↑" / "all ↓" labels still fire when 2+ oracles agree.

Regenerated SORT1 rs12740374 multi-oracle example (it was the direct
case that exposed the bug): the three previously-"all ↑" single-voter
layers (TF binding, histone marks, CAGE — all AlphaGenome-only) now
correctly read "only ↑ (n=1)". The "disagree" chromatin row (AG vs
ChromBPNet) and any future ≥2-oracle consensus rows are unchanged.

Tests: +1 regression test (test_single_voter_layer_uses_n1_label_not_all)
that verifies both agreement dict values ("single_gain"/"single_loss")
and the user-visible strings in MD + HTML. All existing multi-oracle
tests still pass (they all use ≥2 oracles per layer, so the
consensus_gain / consensus_loss path is untouched).

Verified: pytest -m "not integration" → 325 passed (was 324).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The causal report's IGV renderer was writing signal tracks directly from
raw oracle predictions with per-track autoscale, while the variant-report
IGV was already running a PerTrackNormalizer floor-subtract + rescale to
[0, 3.0] (1.0 = genome-wide p99 peak). That inconsistency meant two
reports from the same run had incomparable y-axes.

Extract the rescale step into ``_igv_report.apply_floor_rescale`` and
call it from both ``build_igv_html`` and ``_build_causal_igv`` so every
IGV panel in every report uses the same scaling by default. Users who
want raw dynamics can opt out via ``CausalResult._igv_raw`` (mirrors the
existing ``VariantReport._igv_raw`` knob).

Regenerated the SORT1 causal example; 36 signal-track panels now use
scaled (min=0, max=3) instead of raw autoscale.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…consensus

The multi-oracle consensus matrix now renders 'only ↑ (n=1)' for layers
covered by a single oracle and reserves 'all ↑' / 'all ↓' for ≥2 agreeing
oracles — avoiding the misleading 'all ↑' badge on SORT1 TF/histone/CAGE
rows where only AlphaGenome contributes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…order

Two small fixes to the rerender path:

- When a dir has one JSON but a non-default HTML filename (e.g. the Enformer
  variant_analysis report used ``rs12740374_...`` rather than the
  ``chr1_...`` default), the rerender now matches on oracle_name so we
  keep the existing name instead of writing an orphan file next to it.

- The multi-oracle consolidator loops oracles in the canonical order
  (specialists → generalist: chrombpnet, legnet, alphagenome) matching
  scripts/regenerate_multioracle.py, so consensus-matrix columns don't
  shuffle between a full regen and a pure-JSON rerender. Also synthesises
  a multi-oracle AnalysisRequest instead of reusing the first per-oracle
  one — otherwise the rendered prompt block read as a single-oracle run.

Re-rendered the multi-oracle example to confirm output matches the full
regen (only the timestamp differs).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
First audit pass to deliberately exercise all three user-facing
modalities — Python library, committed examples, MCP server over
stdio — on the SAME variant and verify outputs are bit-identical.
Teardown: 14.2 GB (envs + ~/.chorus + HF models + tfhub cache).

Key result — cross-modality consistency:
  Regenerating SORT1 rs12740374 analysis via (A) the Python library
  regen scripts and (B) fastmcp Client calling chorus-mcp over
  stdio produces:
    Track       regen        MCP         Δ    labels
    DNASE       +0.4315     +0.4315    0.0000  identical
    CEBPA       +0.3712     +0.3712    0.0000  identical
    CEBPB       +0.2822     +0.2822    0.0000  identical
    H3K27ac     +0.1660     +0.1660    0.0000  identical

  Numbers match to 4 decimal places and descriptions are byte-
  identical ("DNASE:HepG2", "CHIP:CEBPA:HepG2", etc.). Users moving
  between Python scripts, the MCP server running under Claude, and
  the committed example reports will not see any discrepancies.

Two findings, both environmental:

1. MEDIUM: Enformer regen re-creates f15d926-deleted files.
   scripts/regenerate_examples.py ENFORMER_EXAMPLES still has two
   entries targeting validation/SORT1_rs12740374_with_CEBP/chr1_...
   that f15d926 deleted. On regen, 2 orphans appear in git status.
   Fix: drop the two entries.

2. MEDIUM (recurring v10): SSL-MITM networks make the CDN fetch of
   igv.min.js fail; v10 Fix #2's HF fallback can't activate because
   igv.min.js is not yet uploaded to lucapinello/chorus-backgrounds.
   When parallel regens start with a cold ~/.chorus/lib/ cache, the
   earliest-written HTMLs get CDN <script> tags (4/18 this run).
   Fix: one-time upload of igv.min.js to the HF dataset — code
   already ready.

Other verified:
- 326/326 pytest on fresh env (18 s)
- 6/6 oracle smoke (7m 36s) — Enformer tfhub fresh download clean
- 12/12 examples regenerate within non-det tolerance
- 18/18 HTMLs carry the new "How to read" glossary (f15d926)
- 235 notebook cells across 3 NBs, 0 errors, 0 bgzip spam (Fix #4)
- v12 n=1 label fix live — multi-oracle single-voter rows now read
  "only ↑ (n=1)" not "all ↑"

Deferred (same as v8-v11): Linux/CUDA, hosted deployment,
clinical validation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…files

Two v12-audit findings, both environmental / regen-script level.

#1 Drop f15d926-deleted entries from ENFORMER_EXAMPLES
=======================================================
scripts/regenerate_examples.py had two ENFORMER_EXAMPLES dicts that
wrote to validation/SORT1_rs12740374_with_CEBP/chr1_..._enformer_*.html
files that commit f15d926 deleted as redundant. After a fresh regen,
git status showed 2 untracked orphan HTMLs. Removed the two dict
entries; the SORT1_enformer/ directory already covers Enformer
discovery for this variant.

#2 Bundle igv.min.js as a package resource
===========================================
Every chorus report inlines ~1.3 MB of IGV.js so the committed HTMLs
are self-contained (offline-viewable, proxy-proof, air-gap-proof).
Previously the file was lazy-downloaded from cdn.jsdelivr.net on
first use; on SSL-MITM networks that download fails, and the v10 HF
fallback can't activate because igv.min.js was never uploaded to
lucapinello/chorus-backgrounds. When multiple regen scripts run in
parallel with a cold ~/.chorus/lib/, the earliest HTMLs got CDN
<script> tags (4/18 in the v12 audit).

New resolution order in _ensure_igv_local:
  1. chorus/analysis/static/igv.min.js — bundled with the package.
     Always present in a standard install; no I/O, no network.
  2. ~/.chorus/lib/igv.min.js — legacy cache from older installs.
  3. CDN via stdlib urllib (existing).
  4. HuggingFace dataset via huggingface_hub (existing).

Bundled file adds 1.3 MB to the package — noise next to the GB-scale
oracle deps. setup.py package_data now ships analysis/static/*.js.
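The first two steps of the resolution order can be sketched with importlib.resources — an illustrative stand-in for _ensure_igv_local, with the CDN and HF network steps elided:

```python
from importlib import resources
from pathlib import Path

def resolve_igv_js(legacy_cache=Path.home() / ".chorus" / "lib" / "igv.min.js"):
    """Sketch: bundled package resource first, then the legacy cache;
    None means the caller proceeds to the CDN → HuggingFace fallbacks."""
    try:
        bundled = resources.files("chorus.analysis") / "static" / "igv.min.js"
        if bundled.is_file():
            return bundled  # standard install: no I/O beyond the package itself
    except ModuleNotFoundError:
        pass  # stripped install — fall through to the cache
    if legacy_cache.exists():
        return legacy_cache
    return None
```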

Tests: 2 new in TestIGVBundledResource
- test_bundled_igv_js_is_present_in_package — wheel includes the file
- test_ensure_igv_local_returns_bundled_without_network — monkeypatches
  all network fallbacks to AssertionError, verifies bundled path is
  returned without touching them.

Existing TestIGVFallbackViaHuggingFace tests updated to monkeypatch
_IGV_BUNDLED to a missing path so the CDN → HF chain is still
exercised (simulates a stripped install).

Verified:
- pytest -m "not integration" → 328 passed (was 326)
- Wiped ~/.chorus/lib/ + confirmed _ensure_igv_local returns the
  package-bundled path instantly with no network call.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…files

Eliminates the cold-cache race where regens running in parallel produced
HTMLs with CDN <script> tags instead of inlined IGV.js. Bundles
igv.min.js as a package resource (chorus/analysis/static/igv.min.js)
with resolution order bundled → legacy cache → CDN → HF. Also drops
the two ENFORMER_EXAMPLES entries that wrote to files deleted in
f15d926.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds audits/2026-04-20_v12_full_ux_consistency_audit.md documenting
the first cross-modality audit pass (library regen vs MCP over stdio
→ bit-identical scores on SORT1 rs12740374) that uncovered the two
v12 findings now fixed in the companion v12-polish merge.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…s + shared percentile format

v13 docs/content consistency sweep found two reports bleeding
through raw AlphaGenome catalog assay_ids instead of the enriched
display names used everywhere else in chorus. A user who saw
"CHIP:CEBPA:HepG2" in a variant report would see
"CHIP_TF/EFO:0001187 TF ChIP-seq CEBPA genetically modified…" in
the multi-oracle consensus matrix or the causal drill-down — an
inconsistency that's exactly the kind of polish issue that makes
users lose trust.

Affected reports:

1. MultiOracleReport (multi_oracle_report.py)
   - _consensus_rows now captures ``description`` alongside assay_id
   - MD render prefers description over raw assay_id
   - HTML render same, plus percentile via _fmt_percentile ("≥99th"
     instead of "+100.0%") — matches the format used by every other
     chorus report
   - Per-oracle drill-down table uses description, not <code>assay_id</code>

2. CausalResult HTML (causal.py)
   - "Strongest track" line shows enriched label as the primary
     user-facing name, raw assay_id demoted to a secondary <code>
     tag only when description differs
   - Per-layer breakdown table renders description instead of raw
     assay_id
   - Percentile column uses _fmt_percentile ("≥99th" / "near-zero"
     instead of "+100.0%")
   - IGV track labels use description so the panel matches the
     table rows ("CHIP:CEBPA:HepG2 ref" / " alt" instead of the
     60-char raw assay_id)

Regenerated SORT1 causal + SORT1 multi-oracle committed outputs:
- multi-oracle HTML: raw CHIP_TF references 3 → 0; enriched
  references 1 → 4
- causal HTML: raw references 30 → 3 (remaining 3 are inside
  IGV "name" attributes that the fix now also routes through
  description — they read like "CHIP:CEBPA:HepG2 (#1 rs12740374)")
- "+100.0%" percentile format → 0; "≥99th" format → 39 (causal) + 8
  (multi)

Test: tests/test_analysis.py::TestMultiOracleReport::
test_uses_enriched_description_not_raw_assay_id — asserts both the
MD and HTML render "CHIP:CEBPA:HepG2" and NOT
"TF ChIP-seq CEBPA genetically modified", and that the HTML carries
"≥99th" but no "+100.0%".

Verified: pytest -m "not integration" → 329 passed (was 328).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>