
audit: 2026-04-17 v10 fresh-install + content-review audit #20

Open
lucapinello wants to merge 65 commits into main from
audit/2026-04-17-v10-fresh-install-content-review

Conversation

@lucapinello
Contributor

Fresh-install audit at fbaef50 with deep content review of every example output — biology direction, summary-sentence-vs-table consistency, label coherence — not just pass/fail counters.

Four findings (all low-to-medium environmental)

  1. MEDIUM — TF Hub's /var/folders/.../tfhub_modules/ cache survives rm -rf ~/.chorus/. A stale partial download made the Enformer smoke test fail (neither 'saved_model.pb' nor 'saved_model.pbtxt' found). Clearing the cache fixes it; worth documenting in README Troubleshooting.

  2. MEDIUM (regression from v8) — On SSL-MITM networks, stdlib urllib fails to fetch cdn.jsdelivr.net/igv.min.js with a certificate-verification error; 6/16 HTMLs landed on the CDN <script> fallback. huggingface_hub's httpx+certifi stack works through the same proxy, so the robust fix is to mirror igv.min.js on the HF dataset and fall back there.

  3. LOW — FTO_rs1421085/README.md promises adipose tracks, but the example actually uses HepG2 (the prompt only asked for the "nearest metabolic" cell type).

  4. LOW — Notebooks emit 20–60 "[ERROR] bgzip is not installed" lines each when run via jupyter nbconvert without a preceding mamba activate. bgzip is in the env; the PATH just isn't inherited. Plots still render via the TabFileReaderInMemory fallback.
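The mirror-and-fall-back fix proposed in finding 2 could look roughly like this (a sketch only — the full CDN URL and the mirrored filename on the lucapinello/chorus-backgrounds dataset are assumptions, not the repo's actual report code):

```python
import urllib.error
import urllib.request
from pathlib import Path


def fetch_igv_js(
    cdn_url: str = "https://cdn.jsdelivr.net/npm/igv/dist/igv.min.js",
) -> bytes:
    """Try the CDN first; on SSL/network failure fall back to a copy
    mirrored on the HF dataset, fetched via huggingface_hub (whose
    certifi-backed client survived the same SSL-MITM proxy)."""
    try:
        with urllib.request.urlopen(cdn_url, timeout=30) as resp:
            return resp.read()
    except (urllib.error.URLError, OSError):
        # Hypothetical mirror location — requires igv.min.js to be
        # uploaded to the dataset first.
        from huggingface_hub import hf_hub_download

        path = hf_hub_download(
            repo_id="lucapinello/chorus-backgrounds",
            repo_type="dataset",
            filename="igv.min.js",
        )
        return Path(path).read_bytes()
```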

Verified (beyond numbers)

  • Biology direction on every named example matches literature: SORT1 (Musunuru 2010 CEBP gain), BCL11A (TAL1 disruption in K562), TERT (E2F1 + CAGE activation), region_swap (enhancer removal closes), integration_simulation (CMV promoter opens).
  • Summary sentences match their tables in all 12 MDs.
  • 6/6 CDFs pass empirical checks (first audit with all 6 verified empirically in one pass — sei + legnet were skipped in v6/v8).
  • Zero orphan HTMLs after parallel regen (v6 API fix still holds).
  • Notebook plots render correctly in all 3 notebooks despite bgzip error spam.
  • 303/303 pytest on fresh env (17.5 s).
  • 6/6 oracle smoke passes (after tfhub cache clear).

One content-review observation worth flagging

The discovery example (SORT1_cell_type_screen) ranks LNCaP, a prostate cancer line, as the #1 hit (+1.91 DNASE), above every liver cell type, even though rs12740374 is a known liver eQTL. The root cause is that LNCaP has a very low baseline SORT1 DNase signal at this locus, so the relative log2FC is inflated. A newcomer reading the README would be surprised; one added paragraph explaining this teaching moment would help.
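The inflation mechanism is easy to demonstrate with toy numbers (illustrative values only, not the example's actual signals):

```python
import math

# Same absolute DNase gain, two different baselines: a relative
# log2 fold-change explodes when the reference signal is near zero,
# which is the LNCaP situation at this locus.
gain = 0.5
for cell, baseline in [("high-baseline liver-like", 8.0),
                       ("low-baseline LNCaP-like", 0.15)]:
    log2fc = math.log2((baseline + gain) / baseline)
    print(f"{cell}: log2FC = {log2fc:+.2f}")
# high baseline  -> +0.09
# near-zero baseline -> +2.12
```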

Full report: audits/2026-04-17_v10_content_review_audit.md

🤖 Generated with Claude Code

lucapinello and others added 30 commits March 25, 2026 13:39
…und distributions

- New chorus/analysis module: multi-layer scoring (scorers.py), variant reports,
  quantile normalization, batch scoring, causal prioritization with enriched
  HTML tables (gene, cell type, per-layer score columns, top-3 IGV signal tracks),
  cell type discovery, region swap, integration simulation
- New chorus/analysis/build_backgrounds.py: variant effect and baseline signal
  background distributions for quantile normalization, with batch GPU scripts
- 8 application examples with full outputs: variant analysis (SORT1, TERT, BCL11A,
  FTO across AlphaGenome/Enformer/ChromBPNet), causal prioritization, batch scoring,
  cell type discovery, sequence engineering (region swap + integration simulation)
- Validation against AlphaGenome paper: SORT1 confirmed, TERT partially confirmed
  (ELF1 limitation documented), HBG2 not reproduced in K562 or monocytes (ISM vs
  log2FC methodology difference documented with side-by-side comparison)
- Fix mamba PATH resolution in environment runner and manager
- Add gene_name, cell_type fields to CausalVariantScore and BatchVariantScore
- 500 common SNPs BED file for background computation
- 91 tests covering all analysis components

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The annotation module was re-parsing the full 1GB GENCODE GTF file on every
call to get_genes_in_region, get_gene_tss, and get_gene_exons (~11s each).
Now the GTF is loaded once per feature type (gene/transcript/exon) and cached
as a DataFrame for the process lifetime. Exon lookups use a groupby index
for O(1) gene-name access.

Before: 11,000ms per query (full GTF scan)
After:  0.03s genes, 0.04s TSS, 1.5ms exons (cached)

Full analysis test suite now completes in 2 min (was timing out at 10+ min).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
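A minimal sketch of the caching pattern described above, assuming hypothetical function names (the module's real API differs): parse the GTF once per feature type, keep the DataFrame for the process lifetime, and pre-group exons by gene name for O(1) lookup.

```python
from functools import lru_cache

import pandas as pd

GTF_COLS = ["chrom", "source", "feature", "start", "end",
            "score", "strand", "frame", "attributes"]


@lru_cache(maxsize=None)
def load_gtf_features(path: str, feature: str) -> pd.DataFrame:
    """Parse the GTF once per feature type; cached for the process."""
    df = pd.read_csv(path, sep="\t", comment="#", names=GTF_COLS)
    return df[df["feature"] == feature].copy()


@lru_cache(maxsize=None)
def exon_index(path: str) -> dict:
    """Group exons by gene name once, giving O(1) per-gene access."""
    exons = load_gtf_features(path, "exon").copy()
    exons["gene_name"] = exons["attributes"].str.extract(
        r'gene_name "([^"]+)"', expand=False)
    return dict(tuple(exons.groupby("gene_name")))
```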
…ores

Single-process AlphaGenome script that extracts all 3,763 valid tracks
(711 cell types × 6 output types) from each forward pass. Same GPU time
as K562-only (~55 min total on A100) but yields comprehensive per-layer
distributions across all cell types.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Comprehensive background distribution builder:
- 10K random SNPs from hg38 reference across all autosomes
- 20K protein-coding gene TSS positions (promoter state baselines)
- 5K random genomic positions (general baseline)
- Parallel GPU execution: --part variants --gpu 0 / --part baselines --gpu 1
- All 3,763 AlphaGenome tracks extracted per forward pass

Expected output: ~37M variant scores + ~94M baseline samples in ~18 hours.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…n counting

Covers all AlphaGenome output types:
- Window-based: DNASE, ATAC, CHIP_TF, CHIP_HISTONE, CAGE, PROCAP,
  SPLICE_SITES, SPLICE_SITE_USAGE
- Exon-counting: RNA_SEQ (sum across merged protein-coding exons per gene)
- All backgrounds unsigned (abs magnitude) for quantile ranking

Pre-loads GENCODE v48 gene annotations and builds spatial index for fast
exon lookup within prediction windows.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
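The two scoring modes can be sketched as follows (a simplification with hypothetical names — the real builder also handles per-layer formulas and merged protein-coding exons):

```python
import numpy as np


def window_score(track: np.ndarray, center: int, window: int) -> float:
    """Window-based summary: sum signal in a fixed window around the
    position (DNASE/ATAC/ChIP/CAGE-style layers)."""
    half = window // 2
    return float(track[max(0, center - half):center + half].sum())


def exon_score(track: np.ndarray, exons: list) -> float:
    """Exon-counting summary for RNA_SEQ: sum across exon intervals
    (start, end) that fall inside the prediction window."""
    return float(sum(track[s:e].sum() for s, e in exons))
```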
…haul

Analysis framework:
- PerTrackNormalizer with per-track CDFs (effect, activity, perbin) for all 6 oracles
- Auto-download backgrounds from HuggingFace on oracle load
- AnalysisRequest dataclass preserves user's original prompt on every report
- Magnitude-gated interpretation labels ("Very strong" requires |effect| > 0.7)
- Top-10-per-layer cap in markdown reports with truncation footer
- Biological interpretation + suggested next steps on all 14 example outputs
- Literature caveats where oracle predictions diverge from published biology

Bug fixes:
- Sequence.slice() missing self argument in interval.py (broke Enformer predictions)
- oracle_name="oracle" placeholder in region_swap, integration, discovery
- Cell-type column bloat in batch_scoring and causal (thousands of cell types)
- Corrupted igv.min.js in 15 HTML files from misplaced injection into JS string
- predict() called with string region instead of (chrom, start, end) tuple

MCP server:
- All 8 critical tools accept user_prompt and forward it into reports
- _safe_tool decorator returns structured {"error", "error_type"} on failure
- Improved docstrings: score_variant_batch (variant dict schema), discover_variant_cell_types (runtime + cell count), fine_map_causal_variant (composite formula + output columns)
- Causal table shows Top Layer column; batch scoring resolves track IDs to human-readable names

Application examples (14 folders, all with MD/JSON/TSV/HTML):
- Regenerated all variant_analysis, validation, discovery, causal, batch, sequence_engineering
- Every report has Analysis Request header + Interpretation section
- Cleaned stale intermediate files (5 removed)
- IGV browser verified working in headless Chrome

Documentation:
- README: "Start here" applications callout, updated MCP tools list, MCP walkthrough link
- API_DOCUMENTATION: application layer section (all 6 functions + AnalysisRequest)
- MCP_WALKTHROUGH.md: 5 example conversations showing natural-language usage
- Natural-language framing notes on all 7 category READMEs
- Fixed AlphaGenome HF URL, clarified environment.yml vs chorus-base.yml
- Notebook install banners for comprehensive/advanced (all 6 oracles required)

Scripts:
- Internal scripts moved to scripts/internal/
- regenerate_examples.py + regenerate_remaining_examples.py for reproducible output generation
- scripts/README.md updated with public script descriptions

Testing:
- 268 tests passed (including new magnitude-gate and causal-table tests)
- All 3 notebooks executed end-to-end (Enformer, all 6 oracles, multi-oracle analysis)
- IGV browser rendering verified via Selenium in headless Chrome
- MCP server startup verified

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
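The magnitude gate above can be sketched like this. Only the |effect| > 0.7 threshold for "Very strong" is stated in the commit; the lower cutoffs here are illustrative placeholders, not the repo's actual values.

```python
def interpretation_label(effect: float) -> str:
    """Magnitude-gated interpretation label (sketch)."""
    mag = abs(effect)
    if mag > 0.7:          # documented gate for "Very strong"
        return "Very strong"
    if mag > 0.3:          # illustrative
        return "Strong"
    if mag > 0.1:          # illustrative
        return "Moderate"
    return "Weak"
```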
These are replaced by the per-oracle build_backgrounds_*.py scripts.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove stale outputs containing machine-specific paths (/Users/lp698/...)
and runtime-specific logs. Notebooks should be committed clean so new
users run them fresh in their own environment.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Replace /PHShome/lp698/chorus with REPO_ROOT (computed from __file__)
  in all 8 public scripts (6 build_backgrounds + 2 regenerate)
- Clear stale notebook output cells containing machine-specific paths

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ll traceability

Batch scoring:
- Per-track columns (one per assay:cell_type) with raw score + percentile
- track_scores dict preserved on BatchVariantScore for programmatic access
- display_mode parameter: "by_assay" (default), "by_cell_type", "summary"
- Track ID footnotes for tracing back to oracle data
- oracle_name parameter fixes normalizer CDF lookup (was returning None)

Causal prioritization:
- Per-track columns replacing generic "Max Effect / Top Layer"
- Each cell shows raw effect + percentile for each scored track
- track_scores dict on CausalVariantScore

Report infrastructure:
- report_title field: "Region Swap Analysis Report", "Integration Simulation Report"
- modification_region: IGV highlights full replaced/inserted region (not 2-3bp)
- modification_description: documents what was inserted/replaced and its length
- has_quantile scoping fix (UnboundLocalError on empty allele_scores)

All examples regenerated with biologically specific tracks:
- SORT1: HepG2 DNASE + CEBPA + CEBPB + H3K27ac + CAGE (reproduces Musunuru)
- BCL11A: K562 DNASE + GATA1 + TAL1 + H3K27ac + CAGE (reproduces Bauer)
- FTO: HepG2 tracks (nearest metabolic cell type available)
- TERT: K562 tracks
- Validation: forced HepG2 CEBP tracks matching the AlphaGenome paper

Every report carries the user's original prompt (Analysis Request block).
All 13 examples verified: MD + JSON + TSV + HTML, prompt present, 268 tests pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…racle table

- Remove ChromBPNet loading and "Combining oracles" sections (belong in main README)
- Rename "Window" to "Output window" + add Resolution column
- Separate Effect percentile and Activity percentile explanations
- Add recommendation to start with AlphaGenome
- Remove Python API details (get_normalizer) that don't belong here

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- .mcp.json: drop the /data/pinello/... PATH hardcoding so new users can
  use the file as-is from `curl` in any environment. mamba resolves the
  chorus env without an explicit PATH override.
- README.md: add LDlink token setup section under Troubleshooting —
  fine_map_causal_variant auto-fetch path was silently failing for users
  without a free LDlink API key.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Changes informed by a fresh walkthrough of the README from the perspective
of a brand-new user:

- Reorder Installation: Fresh Install now comes before Upgrading (a first-
  time reader no longer sees "remove existing envs" before they install)
- Consolidate the Fresh Install block to cover env create, pip install,
  chorus setup --oracle enformer, and chorus genome download hg38 so a
  user copying the block ends up actually ready to run the Quick Start
- Clarify that the root environment.yml is what you install and the
  per-oracle YAMLs in environments/ are internal to `chorus setup`
- Quick Start: point to examples/single_oracle_quickstart.ipynb for users
  who prefer a notebook, and call out the setup prerequisite explicitly
- Annotate the ENCFF413AHU track ID in the DNase snippet so users know
  what it is before the Discovering Tracks section explains it
- HF_TOKEN: note that Claude Code inherits env from the shell where
  `claude` is started (the MCP server is spawned by that shell)
- Add a "Further reading" section linking the docs/ folder — previously
  API_DOCUMENTATION, METHOD_REFERENCE, VISUALIZATION_GUIDE, and
  IMPLEMENTATION_GUIDE were all invisible to a README-only reader
- Remove REQUIREMENTS_CHECKLIST.md (internal audit scratch file)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…gitignore

Real issues caught by a deeper audit pass and fixed:

- **Duplicate CAGE column headers**: batch scoring tables rendered two
  "CAGE:HepG2" columns because both + and - strand tracks have
  identical description fields. _track_display_name now appends (+) / (-)
  when the assay_id carries a strand suffix, producing unique column
  labels in markdown, HTML, and DataFrame outputs.
- **UnboundLocalError in _build_html_report**: has_quantile /
  has_baseline were defined inside a for-loop that doesn't execute for
  empty allele_scores, causing .to_html() to crash on minimally-populated
  reports. Initialise both before the loop (matches the markdown fix).
- **docs/RELEASE_CHECKLIST.md**: internal QA checklist with stale
  metrics (references 128 tests when we have 280). Removed — internal
  docs shouldn't live in user-facing docs/.
- **API_DOCUMENTATION / METHOD_REFERENCE overlap**: added reciprocal
  callouts clarifying that API_DOCUMENTATION is authoritative and
  METHOD_REFERENCE is a one-line cheat sheet.
- **logs/ not ignored**: 82 MB of run logs at risk of being committed.
  Added to .gitignore.

Test coverage added (+12 tests, 268 → 280 passed):
  - TestReportMetadataFields: report_title, modification_region,
    modification_description rendering in MD / HTML / dict
  - TestBatchDisplayModes: by_assay, by_cell_type, track-ID footnote,
    CAGE strand disambiguation in DataFrame columns
  - TestSafeToolDecorator: passthrough, exception → error dict,
    function name preservation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Columns and TSV headers now show CAGE:HepG2 (+) and CAGE:HepG2 (-)
instead of two identical CAGE:HepG2 columns.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
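The disambiguation logic amounts to something like the sketch below. How the strand suffix is encoded in the assay id is an assumption here (a trailing +/-); the real _track_display_name may parse it differently.

```python
def track_display_name(assay_id: str, description: str) -> str:
    """Append (+)/(-) when the assay id carries a strand suffix so two
    stranded tracks with identical descriptions get unique columns."""
    if assay_id.endswith("+"):
        return f"{description} (+)"
    if assay_id.endswith("-"):
        return f"{description} (-)"
    return description
```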
A new user could miss this entirely — the previous mention was a single
feature bullet ("auto-downloaded from HuggingFace") with no detail.
This section spells out:

- The backgrounds turn raw log2FC into the effect/activity percentiles
  shown in every report
- They're fetched on first oracle use from the public HF dataset
  lucapinello/chorus-backgrounds and cached in ~/.chorus/backgrounds/
- File sizes per oracle (so users with limited disk know what to expect)
- **No HF_TOKEN required** for backgrounds (only AlphaGenome model is gated)
- LDlink token is separate and only needed for causal auto-fetch
- Optional pre-download snippet for users who want to avoid the first-use
  wait

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds a comprehensive reference appendix covering:

- What the backgrounds are (effect %ile vs activity %ile vs per-bin)
  and why they exist (turn raw log2FC into genome-aware metrics)
- How they were calculated:
    * Variant effect distribution: 10K random SNPs × all tracks with
      layer-specific scoring formulas (log2FC, logFC, diff)
    * Activity distribution: ~31.5K positions (random intergenic +
      ENCODE SCREEN cCREs + protein-coding TSSs + gene-body midpoints)
    * Per-bin distribution: 32 random bins per position for IGV scaling
    * RNA-seq exon-precise sampling rule
    * CAGE summary routing rule
- Sample sizes per oracle (track count, samples per track, NPZ size)
- Python API usage with verified signatures (get_pertrack_normalizer,
  download_pertrack_backgrounds, effect_percentile, activity_percentile,
  perbin_floor_rescale_batch)
- MCP / Claude usage (auto-attached, zero-config)
- Documented ranges and a sanity-check rule of thumb for interpretation
- How to reproduce or extend the backgrounds via build_backgrounds_*.py

All function signatures in the appendix were verified against the actual
implementation before committing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Local-only IDE/agent state (settings, scheduled tasks lock) — per-
developer, not for the repo.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…pies

- AUDIT_PROMPT.md: systematic end-to-end audit script for a new machine,
  with REPLACE_* placeholders for HF_TOKEN and LDLINK_TOKEN.
- .gitignore: block any *_WITH_TOKENS.md or AUDIT_PROMPT_WITH_TOKENS*
  file from ever being staged, since filled-in copies contain secrets.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Records what worked and what did not on a fresh macOS 15.7.4 / arm64
clone of chorus-applications: full install, all 6 oracle smoke-predicts,
286/286 pytest pass, 3 example notebooks (0 errors), 22 MCP tools registered,
6/6 application tools producing correct outputs (rs12740374 SORT1 case
reproduces the published Musunuru-2010 finding), 18/19 application HTML
reports IGV-verified via headless Chrome, ChromBPNet smoke build completed
end-to-end on CPU.

Top issues a macOS user hits, ranked, with one-or-two-line fixes:
  1. No Apple GPU (MPS / Metal / jax-metal) auto-detect — frameworks are
     installed but borzoi/sei/legnet only check torch.cuda.is_available(),
     SEI hardcodes map_location='cpu', chrombpnet/enformer envs lack
     tensorflow-metal. Verified Borzoi runs on MPS in 4.3 s when forced.
  2. SEI Zenodo download via stdlib urllib at ~80 KB/s — 3.2 GB tar takes
     ~11 h. curl -C - -L recovers it in ~30 min.
  3. fine_map_causal_variant rsID-only crash (KeyError 'chrom' at
     causal.py:355). Workaround: pass "chr1:pos REF>ALT" form.
  4. Two-mamba-installs MAMBA_ROOT_PREFIX gotcha breaks chorus health.
  5. Notebooks need explicit `python -m ipykernel install --user --name chorus`.
  6. SEI download has no single-flight lock — concurrent inits race.

Verdict: production-ready with caveats. None of the issues block correctness;
all are operational or one-line code fixes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Addresses every actionable item in audits/2026-04-14_macos_arm64.md.
All changes are platform-conditional — Linux CUDA paths are unchanged.

PyTorch oracles (borzoi, sei, legnet) — auto-detect MPS on Apple Silicon
  - Both the in-process loader (chorus/oracles/{borzoi,sei,legnet}.py) and
    the subprocess templates ({borzoi,sei,legnet}_source/templates/{load,
    predict}_template.py) now resolve `device is None` (or the new 'auto'
    sentinel) as: cuda > mps > cpu. Linux + CUDA box hits the cuda branch
    first, no behavior change there.
  - SEI: replaced the hard `map_location='cpu'` device pin (the value is
    still used to load weights to host memory before .to(device), which is
    the standard pattern across torch versions and works for mps too).
  - Sei BSplineTransformation lazily moved its spline matrix only when
    `input.is_cuda`. Generalized to any non-CPU device so the matmul works
    on MPS as well. Verified: 286/286 pytest still pass.

TensorFlow oracles (chrombpnet, enformer) — Metal backend on Apple Silicon
  - chorus/core/platform.py macos_arm64 adapter now adds
    `tensorflow-metal>=1.1.0` to pip_add. Once installed, Apple's plugin
    registers a 'GPU' physical device, so the oracles' existing
    tf.config.list_physical_devices('GPU') auto-detect picks it up with no
    code change. Linux paths don't see the macos_arm64 adapter so CUDA stays
    intact.

JAX oracle (alphagenome) — unchanged
  - Already explicitly skips Metal in auto-detect (jax-metal still missing
    `default_memory_space` for AlphaGenome). README updated to document
    this trade-off.

MCP fix — fine_map_causal_variant rsID-only crash
  - Calling `fine_map_causal_variant(lead_variant="rs12740374")` previously
    raised KeyError: 'chrom' at chorus/analysis/causal.py:355 because
    `_parse_lead_variant("rs12740374")` returns {"id": ...} only.
  - Backfill chrom/pos/ref/alt onto the sentinel from the LDlink response
    (which always carries them) before invoking prioritize_causal_variants.
  - Verified end-to-end: rs12740374 ranked #1 with composite=1.000 of 12 LD
    variants on AlphaGenome (matches the published Musunuru-2010 finding).

SEI Zenodo download — chunked + resume + single-flight lock
  - Replaced urllib.request.urlretrieve with a stdlib chunked urlopen loop
    that supports HTTP Range resume and an fcntl exclusive lock so two
    concurrent SeiOracle inits don't race the same partial file. Original
    observed throughput on macOS was ~80 KB/s (would take ~11 hours for the
    3.2 GB tar); the new path resumes interrupted downloads and progress-
    logs every 100 MB.

README — macOS troubleshooting + Apple GPU policy table + kernel install
  - Documented the two-mamba-installs MAMBA_ROOT_PREFIX gotcha that breaks
    `chorus health` when the new chorus env lands in a different mamba root
    than the per-oracle envs.
  - Added the per-oracle macOS GPU support matrix (MPS / Metal / CPU) with
    explicit `device=` examples.
  - Added the missing `python -m ipykernel install --user --name chorus`
    step to Fresh Install so examples/*.ipynb find the chorus kernel.

Validation on macOS 15.7.4 / Apple Silicon (CPU + MPS + Metal):
  - 286/286 pytest pass (incl. all 6 oracle smoke-predict tests)
  - chorus.create_oracle('borzoi') auto-detects mps:0
  - chorus.create_oracle('sei')    auto-detects mps:0 + smoke-predict ok
  - chrombpnet env now reports tf.config.list_physical_devices('GPU') = [GPU:0]
  - fine_map_causal_variant(lead_variant='rs12740374') ranks rs12740374
    composite=1.000 of 12 LD variants

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
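The cuda > mps > cpu resolution described above boils down to the following order of checks (a dependency-free sketch — the boolean flags stand in for torch.cuda.is_available() and torch.backends.mps.is_available()):

```python
def resolve_device(requested=None, *, cuda_available=False,
                   mps_available=False) -> str:
    """An explicit device request wins; None or 'auto' resolves in the
    order cuda > mps > cpu, so Linux+CUDA behavior is unchanged."""
    if requested not in (None, "auto"):
        return requested
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"
```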
…EI resumable download, rsID backfill)

Verified on Linux CUDA: 285/285 code tests pass. The AlphaGenome smoke test errors due to an expired HF token in the chorus-alphagenome env (unrelated to this PR).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- tests/test_mcp.py: TestFineMapRsidBackfill verifies fine_map_causal_variant
  backfills chrom/pos/ref/alt when a caller passes only an rsID lead_variant.
  Regression test for the macOS audit crash (KeyError: 'chrom').
- examples/*.ipynb: Re-executed all three notebooks end-to-end on Linux CUDA
  to refresh outputs against the merged audit branch.

Full suite now: 286/286 tests pass (including alphagenome real-oracle smoke).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Regeneration on Linux CUDA GPU 1 (GPU 0 was full):
- AlphaGenome variant + validation (5 examples) — 28 min
- Remaining AlphaGenome (batch/causal/discovery/seq) — ~14 min
- Enformer SORT1 — 1 example
- ChromBPNet SORT1 — 1 example
Discovery HTML filenames now use oracle_name "alphagenome" (was placeholder "oracle").

Verification:
- 289/289 tests pass across combined runs (6 oracle smoke tests green on GPU 1)
- Selenium screenshot sweep: 18/19 HTML render cleanly; 1 "NO-IGV" is the
  batch_scoring HTML which is a scoring table by design (no browser view)
- Hardcoded /PHShome/lp698/chorus paths in notebook log outputs redacted to
  /path/to/chorus

.gitignore: ignore examples/applications/**/*_screenshot.png so selenium
artifacts don't pollute the repo.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…lper

Adds audits/2026-04-15_macos_arm64_post_merge.md — the second-pass
end-to-end audit on a fully wiped + fresh-cloned install, after the
PR #7 macOS-support changes were merged. Every fix from v1 is
confirmed working on a clean setup:

  * chrombpnet + enformer envs now pull in tensorflow-metal
    automatically → `Auto-detected 1 GPU(s) … name: METAL`
  * borzoi/sei/legnet auto-detect mps:0
  * fine_map_causal_variant("rs12740374") rsID-only returns
    rs12740374 composite=0.963 of 12 LD variants (was KeyError in v1)
  * analyze_variant_multilayer reproduces Musunuru-2010 biology
    (CEBPA strong binding gain +0.37, DNASE strong opening +0.43)
  * 286/286 pytest, 0 notebook errors, 19/19 IGV reports ok

Two download-reliability findings surfaced on this clean run (both
pre-existing, both same bug-class as the SEI fix that landed in
PR #7): chorus/utils/genome.py stalled at 36% of the hg38 download,
and chorus/oracles/chrombpnet.py has no single-flight lock so two
concurrent callers race the ENCODE tar and hit EOFError.

This commit also adds chorus/utils/http.py — the resume+lock helper
that previously lived inside SeiOracle, now extracted as a shared
stdlib-only utility so genome + chrombpnet can reuse it. The
sei.py helper shim keeps the old public API working.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…elper

Three call sites fetch large files from the public internet with plain
urllib.request.urlretrieve (no resume, no concurrency lock). The
2026-04-15 v2 audit on a fresh install hit two of them the hard way:
UCSC cut the hg38 connection at ~36% of the 938 MB download
(urllib.error.URLError: retrieval incomplete: got only 363743871 out
of 983659424 bytes), and two concurrent callers of
_download_chrombpnet_model raced the same partial ENCODE .tar.gz so
one read it mid-write and hit
  EOFError: Compressed file ended before the end-of-stream marker was
           reached
inside tarfile.extractall.

Re-use the resume+lock helper introduced for SEI in PR #7, lifted
into chorus/utils/http.py in the preceding commit:

  chorus/oracles/sei.py
    _download_with_resume staticmethod becomes a thin shim that
    forwards to chorus.utils.http.download_with_resume. No behaviour
    change and no API break.

  chorus/utils/genome.py
    GenomeManager.download_genome swaps urllib.request.urlretrieve
    for download_with_resume. Fixes the UCSC stall observed in the
    v2 audit; partial .fa.gz is now resumable across retries.

  chorus/oracles/chrombpnet.py
    _download_chrombpnet_model (ENCODE tar) and _download_jaspar_motif
    (JASPAR motif) both route through download_with_resume. The fcntl
    lock on <dest>.lock serialises concurrent callers so the pytest
    smoke fixture and a background build_backgrounds_chrombpnet.py
    job can no longer corrupt each other's download.

All three changes are platform-agnostic; the helper is stdlib-only
(urllib + fcntl). Linux CUDA is not touched.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
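The Range-resume plus fcntl single-flight pattern can be sketched as below. This is a minimal stdlib-only illustration of the technique the helper uses, not chorus.utils.http's actual implementation (no progress logging, retries, or size validation).

```python
import fcntl
import os
import urllib.request


def download_with_resume(url: str, dest: str, chunk: int = 1 << 20) -> None:
    """Chunked download with HTTP Range resume; an exclusive fcntl lock
    on <dest>.lock serialises concurrent callers so nobody reads a
    partial file mid-write."""
    with open(dest + ".lock", "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)      # single-flight
        if os.path.exists(dest):
            return                            # another caller finished it
        part = dest + ".part"
        offset = os.path.getsize(part) if os.path.exists(part) else 0
        headers = {"Range": f"bytes={offset}-"} if offset else {}
        req = urllib.request.Request(url, headers=headers)
        with urllib.request.urlopen(req) as resp, open(part, "ab") as out:
            while True:
                buf = resp.read(chunk)
                if not buf:
                    break
                out.write(buf)
        os.replace(part, dest)                # atomic publish
```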
…S audit

v2 audit confirmed post-merge macOS works end-to-end (286/286 tests,
19/19 HTML, reproduces Musunuru-2010 biology, rsID backfill verified).

Adds chorus/utils/http.py (resume + fcntl-lock) and routes hg38 genome +
chrombpnet ENCODE tar + JASPAR motif downloads through it. SEI helper
becomes a shim for backward-compat.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Third-pass audit going one level deeper than v1 (pre-merge smoke) and
v2 (post-merge fresh install). Scope: 14 application examples × 19 HTML
reports × 6 per-track normalizer NPZs + the scoring/normalization stack.

Read-only deliverable — the fixes identified here belong in a separate
follow-up PR after review.

Findings (5, ranked by severity):

  1. HIGH — chrombpnet_pertrack.npz:DNASE:hindbrain has 0 background
     samples and an all-zeros CDF. PerTrackNormalizer.effect_percentile()
     silently returns 1.0 for every raw_score (including 0.0) because
     np.searchsorted on a zeros row ranks everything at the end, and
     _get_denominator falls through to cdf_width=10000 when counts[idx]=0.
     Same bug class as the v2 concurrent-download race that landed in
     PR #8 — the hindbrain model download failed silently and left a
     zero-count reservoir. Impact: any variant scored against
     DNASE:hindbrain in ChromBPNet gets a false "100th percentile".

  2. MEDIUM — every committed HTML report loads igv.min.js from
     cdn.jsdelivr.net at view time. 2/19 reports flaked on
     net::ERR_CERT_AUTHORITY_INVALID during this audit; any user
     behind a corporate proxy / airgapped network / jsdelivr outage
     will see IGV silently fail with no fallback. No SRI either.

  3-5. LOW — documentation improvements:
     - TERT_promoter example doesn't caveat that C228T's published
       biology is melanoma-specific; K562 result (all negative) is
       correctly modelled but reads as "no effect" without context
     - AlphaGenome DNASE vs ChromBPNet ATAC disagree on rs12740374
       direction in HepG2 (+0.45 vs -0.11); no application note
       teaches this real cross-oracle divergence
     - HBG2_HPFH footer notes BCL11A/ZBTB7A catalog absence; could
       be tightened

Normalization stack verified clean:
  - CDF monotonicity: 0 bad rows across 18,159 tracks × 10,000 points
  - signed_flags match LAYER_CONFIGS.signed exactly (AG 667 RNA-seq,
    Borzoi 1543 stranded RNA, SEI 40/40 regulatory_classification,
    LegNet 3/3 MPRA; Enformer 0 signed is correct — no RNA-seq)
  - Build-vs-scoring window_bp bit-identical via shared LAYER_CONFIGS
  - Pseudocount/formula: _compute_effect reproduces reference
    implementation with diff=0.0 across all test cases
  - perbin_floor_rescale_batch math verified at all edges
  - Edge cases: unknown oracle → None, unknown track → None,
    raw=0 → 0.0, raw=huge → 1.0

Phase A rerun on 4 AlphaGenome literature-checked cases (SORT1, TERT,
FTO, BCL11A) confirms biology is preserved but results are NOT bit-
identical — raw_score drift ~1-2% on dominant tracks, larger quantile
swings on near-zero tracks due to AlphaGenome's JAX CPU non-
determinism. No committed example is stale. Noise-floor handling for
|raw_score| < ~1e-3 added to follow-up recommendation list.

Artifacts:
  - audits/2026-04-16_application_and_normalization_audit.md (main report)
  - audits/2026-04-16_screenshots/*.png (19 full-page PNGs)
  - audits/2026-04-16_data/*.json (per-app cards + normalization/selenium/rerun JSON)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
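The HIGH finding's failure mode is reproducible in a few lines of numpy (10,000 matches the CDF width mentioned in the audit):

```python
import numpy as np

# An all-zeros CDF row ranks every non-negative probe at the very end,
# which reads as "100th percentile" after dividing by the row width.
cdf_row = np.zeros(10_000)
for raw in (0.0, 0.5, 3.0):
    pct = np.searchsorted(cdf_row, raw, side="right") / cdf_row.size
    print(f"raw={raw}: percentile={pct}")  # 1.0 every time
```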
…ixes

Addresses the findings in audits/2026-04-16_application_and_normalization_audit.md
(PR #9). Three categories of change:

1. Delete two example applications the audit recommends removing:

   - examples/applications/variant_analysis/TERT_promoter/
     C228T is a melanoma-specific gain-of-function mutation; the example
     runs it in K562 (erythroleukemia) and shows all-negative effects.
     The biology is correct for the model but inverts the published
     direction. Rather than add a "wrong cell type" caveat, drop the
     example — SORT1 / FTO / BCL11A cover variant_analysis without
     teaching the reader a misleading result.

   - examples/applications/validation/HBG2_HPFH/
     Already self-documented as "Not reproduced" in
     validation/README.md: BCL11A / ZBTB7A aren't in AlphaGenome's track
     catalog, so the repressor-loss mechanism isn't visible. Keeping a
     "validation failed" example alongside the working
     SORT1_rs12740374_with_CEBP confuses readers. Drop it.

   Also updated: root README.md (replaces HBG2_HPFH link with
   SORT1_rs12740374_with_CEBP), examples/applications/variant_analysis/README.md
   (drops TERT prompt + section), examples/applications/validation/README.md
   (drops HBG2 row + section + reproduce snippet),
   scripts/regenerate_examples.py + scripts/internal/inject_analysis_request.py
   (both lose their TERT_promoter/HBG2_HPFH entries).

2. Normalizer: guard against zero-count CDF rows
   (chorus/analysis/normalization.py).

   Audit finding #1 (HIGH): the committed chrombpnet_pertrack.npz has
   DNASE:hindbrain with effect_counts[idx] == 0 and a zero-filled CDF
   row. effect_percentile() / activity_percentile() silently returned
   1.0 for every raw_score (including 0.0) because np.searchsorted on
   a zeros row returns len(row) for any non-negative probe and the
   denominator falls through to cdf_width. Same bug-class as the v2
   chrombpnet concurrent-download race that landed in PR #8 — the
   hindbrain ENCODE tar must have failed to extract cleanly during the
   original background build.

   New private helper _has_samples() returns False when
   counts[idx] == 0, which makes _lookup / _lookup_batch return None.
   Callers already render None as "—" in MD/HTML tables, so users now
   see "no background" instead of a silent false "100th percentile".
   Counts-less NPZs (older format, no counts field) are treated as
   valid — no regression.
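   A minimal sketch of the guard, assuming illustrative names
   (_has_samples, cdf_row, counts) rather than chorus's actual
   internals; stdlib bisect_right mirrors np.searchsorted(...,
   side="right") on a sorted row:

   ```python
   from bisect import bisect_right

   def _has_samples(counts, idx):
       """False when the background CDF row was built from zero samples."""
       if counts is None:        # older NPZ format without a counts field
           return True           # treated as valid, so no regression
       return counts[idx] > 0

   def effect_percentile(cdf_row, counts, idx, raw_score):
       if not _has_samples(counts, idx):
           return None           # callers already render None as "—"
       # On an all-zeros row, bisect_right returns len(row) for any
       # non-negative probe: the source of the silent "100th percentile".
       pos = bisect_right(cdf_row, abs(raw_score))
       return pos / len(cdf_row)
   ```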

3. Report: suppress quantile_score when raw_score is in the noise floor
   (chorus/analysis/variant_report.py).

   Audit finding #6 (LOW): when |raw_score| < 1e-3 the effect CDF is
   so densely clustered around 0 that a 1-2% raw-score drift can swing
   the quantile by 0.5+ (observed in the Phase A rerun: committed
   quantile=1.0 vs rerun=0.21 for a CEBPB track with raw_score ~1e-4).
   Set quantile_score = None in that regime so the HTML/MD tables
   render "—" and readers don't misread noise as signal. Threshold
   chosen conservatively to cover both log2fc (pc=1.0) and logfc RNA
   (pc=0.001) without hiding real effects.
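   The suppression rule reduces to a small check; this is a hypothetical
   helper mirroring the logic described for _apply_normalization, not
   the actual implementation:

   ```python
   NOISE_FLOOR = 1e-3  # threshold from the description above

   def suppress_noise_quantile(raw_score, quantile_score):
       # Below the noise floor the effect CDF is too dense for the
       # quantile to be stable, so drop it; tables render None as "—".
       if abs(raw_score) < NOISE_FLOOR:
           return None
       return quantile_score
   ```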

4. IGV.js: lazy-download the bundle into ~/.chorus/lib on first use
   (chorus/analysis/_igv_report.py + chorus/analysis/causal.py).

   Audit finding #2 (MEDIUM): reports embed a <script src="..."> to
   cdn.jsdelivr.net that gets evaluated every time the HTML is opened
   in a browser. Any viewer on an airgapped network / corporate proxy
   that MITMs TLS / during a jsdelivr outage sees IGV silently fail
   (2/19 audit reports hit ERR_CERT_AUTHORITY_INVALID). The local-
   cache code path already existed but was opt-in (user had to drop a
   file in ~/.chorus/lib/igv.min.js manually).

   New _ensure_igv_local() helper runs on the first report generation
   and populates the cache via chorus.utils.http.download_with_resume
   (the helper that landed in v2 PR #8). Reports written after the
   first successful download inline the JS directly — self-contained
   HTML that opens anywhere without network. Download failure is
   logged at WARNING and the CDN <script> tag is used as fallback,
   preserving the current behaviour for anyone who can't reach
   jsdelivr at generation time.
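   The lazy-cache flow can be sketched as follows; the `download`
   parameter stands in for chorus.utils.http.download_with_resume
   (signature assumed to be download(url, dest_path)), and the CDN URL
   is an assumption:

   ```python
   import logging
   from pathlib import Path

   log = logging.getLogger(__name__)
   IGV_CDN_URL = "https://cdn.jsdelivr.net/npm/igv/dist/igv.min.js"

   def ensure_igv_local(download, cache_dir=None):
       """Return a cached igv.min.js path, or None to use the CDN tag."""
       cache = Path(cache_dir or Path.home() / ".chorus" / "lib") / "igv.min.js"
       if cache.exists():
           return cache                  # already cached: inline the JS
       cache.parent.mkdir(parents=True, exist_ok=True)
       try:
           download(IGV_CDN_URL, cache)
           return cache
       except Exception as exc:          # any fetch failure
           log.warning("igv.min.js download failed (%s); "
                       "falling back to CDN <script> tag", exc)
           return None                   # caller keeps current behaviour
   ```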

All changes are platform-agnostic; all 287 pytest tests continue to
pass; the fixes are verified behaviourally:

  >>> norm.effect_percentile('chrombpnet', 'DNASE:hindbrain', 0.0)
  None                      # was: 1.0
  >>> norm.effect_percentile('chrombpnet', 'DNASE:HepG2', 0.0)
  0.0                       # unchanged

  >>> ts = TrackScore(raw_score=0.0005, ...)
  >>> _apply_normalization(ts, ...); ts.quantile_score
  None                      # noise floor

See audits/2026-04-16_application_and_normalization_audit.md (PR #9)
for full context, per-app screenshots, and the Phase A / B / C
methodology behind each finding.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
lucapinello and others added 28 commits April 16, 2026 04:08
From the 2026-04-16 new-user usability audit:

HIGH:
- H1: Fix wrong FTO coordinate in variant_analysis/README.md
  (53800954 → 53767042, matching all other references)
- H2: Add @_safe_tool to all 22 MCP tools (was missing on 14;
  unhandled exceptions now return structured error dicts)
- H3: Add htslib to environment.yml (provides bgzip for coolbox)
- H4: Fix python_requires to >=3.10 in setup.py (code uses 3.10+
  syntax like str | None)

MEDIUM:
- M2: Add README.md to marquee SORT1_rs12740374 example directory
- M8: Harmonize discover_variant to accept alt_alleles: list[str]
  (was singular alt_allele, inconsistent with other discovery tools)
- M9: Fix upgrade instruction order in README (remove oracle envs
  before removing the base chorus env that provides the CLI)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Second-pass usability audit findings:

1. Badge colors now match interpretation labels (was: red "Minimal effect"
   badges because _score_color_class used percentile directly; now derives
   class from the interpretation string which applies raw-score gating)

2. IGV CHIP track names now show TF/mark (was: all "CHIP:HepG2"; now uses
   _track_description enrichment for "CHIP:CEBPA:HepG2" etc.)

3. Percentile display: ≥99th / ≤1st instead of "1.000" / "0.000" to
   avoid implying false precision when the background CDF is saturated
   (correct behavior — random SNPs mostly have near-zero effects)

4. getting_started() MCP prompt now recommends high-level tools first
   (analyze_variant_multilayer, discover_variant, score_variant_batch,
   fine_map_causal_variant) instead of only listing low-level primitives

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Second-pass usability audit cleanup:

C5: Discovery sub-reports now carry AnalysisRequest with user prompt
    (regen script patches each per-cell-type report)
C6: Borzoi targets file: strip /home/drk/tillage/datasets/human/ prefix
    from file column (upstream training paths, not used at inference)
C7: HTML <title> now includes report_title + gene_name + position
    (e.g. "Multi-Layer Variant Report — SORT1 — chr1:109274968")

P8: Remove scripts/internal/ from repo (8 machine-specific dev scripts)
P9: Remove audits/2026-04-16_screenshots/ (~18 MB of PNGs)
    Both added to .gitignore to prevent re-commit.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
BLOCKS_USER:
- Add missing __init__.py to 5 oracle source/template dirs (borzoi,
  enformer, alphagenome templates) — non-editable pip install would
  fail with ModuleNotFoundError
- Fix setup.py package_data: replace invalid ../environments/* escape
  with proper per-package data globs + data_files for env YAMLs

CONFUSING:
- Misspelled oracle name now raises ValueError listing valid names
  (was: misleading "not yet implemented" message)
- LegNet cell_types in list_tracks: add WTC11 (was: only HepG2, K562)
- list_tracks unknown oracle now lists valid names in error
- README: fix AlphaGenome track count 5,930 → 5,731 (actual loaded)
- README: fix Sei class count 41 → 40
- README: fix Borzoi track count 7,610 → 7,611
- README: fix oracle.list_tracks() → list_tracks() MCP tool

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Cherry-picked from audit/2026-04-16-fresh-install-v4 (preserving our
second/third-pass audit fixes which that branch didn't have).

The macOS CPU-forcing guard in AlphaGenome's load and predict templates
only fired when device was None or started with "cpu". Callers passing
device='cuda:0' bypassed it, letting jax-metal initialize and crash
with "UNIMPLEMENTED: default_memory_space".

Fix: on Darwin, always force JAX_PLATFORMS=cpu unless the caller
explicitly requests Metal. Applied to all three code paths:
- alphagenome.py:_load_direct (env var set before import jax)
- load_template.py
- predict_template.py

Includes macOS v4 audit report: clean-slate install, 6 oracle GPU
verification, 12 example regenerations, 3 notebooks (0 errors), 13
HTML Selenium checks, 7-check normalization audit (all pass).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Walked through Chorus as 4 personas (clinician, bioinformatician,
PhD student, computational biologist). Key changes:

For clinicians:
- Add "Key terms" glossary box at top of README (oracle, track,
  assay_id, effect percentile, log2FC — defined before first use)
- Reword Start-here table: "I have a variant but don't know the
  relevant tissue" instead of bioinformatics framing
- Add Interpretation sections to SORT1, BCL11A, FTO example outputs
  with clinical/biological narrative (LDL cholesterol, sickle cell,
  tissue-specificity explanation)

For bioinformaticians:
- Add VCF parsing snippet to batch scoring README
- Document oracle_name param for normalization
- Note AlphaGenome full track IDs vs short names

For contributors:
- Replace Borzoi→mymodel throughout CONTRIBUTING.md (Borzoi is
  already implemented, was confusing)
- Update Current Priorities to reflect all 6 oracles done

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
User feedback: batch scoring only showed the effect (log2FC + percentile)
per track, hiding the absolute ref and alt values. A +0.4 effect could
mean 10→14 (active region) or 0.001→0.0014 (noise) — impossible to tell
without seeing both alleles.
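The ambiguity is easy to see numerically: the same log2 fold change
arises from very different absolute signal levels.

```python
import math

# Identical ~+0.49 log2FC from an active region and from noise-level
# signal; the effect column alone cannot distinguish them.
active = math.log2(14 / 10)           # 10 -> 14
noise = math.log2(0.0014 / 0.001)     # 0.001 -> 0.0014
assert abs(active - noise) < 1e-9
```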

All three output formats now show full transparency per track:

- TSV/DataFrame: columns _ref, _alt, _log2fc, _effect_pctile, _activity_pctile
  (was: _raw, _pctile, _activity)
- Markdown: 4 sub-columns per track (Ref | Alt | log2FC | Effect %ile)
- HTML: grouped header with colspan, same 4 sub-columns per track

Also uses ≥99th / ≤1st percentile display from the earlier audit fix.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 new READMEs:
- causal_prioritization/SORT1_locus/
- validation/SORT1_rs12740374_with_CEBP/
- validation/TERT_chr5_1295046/

2 new interpretation sections:
- SORT1_chrombpnet: notes cross-oracle comparison with AlphaGenome
- SORT1_enformer: notes cross-tissue DNASE pattern + 114kb window

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
All 12 application examples regenerated on GPU 0 with:
- AlphaGenome: 4 variant + 1 validation + batch + causal + discovery +
  2 seq engineering + TERT validation (28 min + 15 min)
- Enformer: 3 SORT1 examples (5 min)
- ChromBPNet: 1 SORT1 example (2 min)

New in regenerated outputs:
- ≥99th / ≤1st percentile display (no more "1.000")
- Badge colors match interpretation labels in HTML
- IGV CHIP track names show TF/mark (CHIP:CEBPA:HepG2)
- HTML titles include report_title + gene + position
- Self-contained igv.min.js (no CDN dependency)
- Batch scoring TSV: expanded ref/alt/log2fc/pctile columns

Re-added interpretation sections to 5 variant analysis examples
(SORT1, BCL11A, FTO, SORT1_chrombpnet, SORT1_enformer) after
regeneration overwrote them.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Root cause: discover_variant_effects() writes HTML internally BEFORE
analysis_request is patched in regen scripts. Fixed by re-writing HTML
after patching, targeting the exact filenames.

Removed 6 stale HTML files:
- 3 discovery sub-reports (per-cell-type) that duplicated the main
  discovery report but lacked user prompt
- 1 orphaned CELSR2 validation report (unreferenced)
- 1 enformer validation duplicate (enformer_report.html, kept RAW_autoscale)
- 1 enformer discovery duplicate in SORT1_enformer dir

Final screenshot audit: 13/13 HTML reports CLEAN — all have Analysis
Request with user prompt, correct badge colors, enriched CHIP track
names, self-contained IGV, and ≥99th percentile display.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… docstring

From the read-only v5 audit (PR #12) — 3 low-severity findings:

1. MEDIUM: tests/test_analysis.py referenced old batch_scoring column
   names ('_raw', '_pctile') that commit 01d8446 renamed to '_ref',
   '_alt', '_log2fc', '_effect_pctile', '_activity_pctile'. Updated
   assertions to match current scheme; added checks for _ref and _alt
   that weren't previously verified. 279 → 281 passing.

2. MEDIUM: scripts/regenerate_examples.py hardcoded
   "N HepG2/K562 tracks" regardless of actual cell type. Added
   "cell_type" to each of the 4 AlphaGenome variant examples, and
   the tracks_requested string is now derived from that field. Also
   mechanically patched the 4 affected committed example outputs
   (SORT1, BCL11A, FTO, SORT1_CEBP) in MD/JSON/HTML so readers see
   the correct per-example label (e.g. "6 K562 tracks" for BCL11A)
   without waiting for the next regen. All 12 occurrences removed.

3. MINOR: chorus/analysis/batch_scoring.py:82 docstring still listed
   the old _raw / _pctile columns. Updated to reflect current output
   schema.

Verified:
- pytest tests/ --ignore=tests/test_smoke_predict.py → 281 passed
- Selenium re-render of the 4 patched HTMLs confirms the new labels
  appear and no "HepG2/K562" stragglers remain.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…hrough

From the post-v5-merge audit:

C1: MCP server tracks_requested now derives cell-type label dynamically
    (matches regen script output). Added _describe_tracks_requested
    helper that inspects the variant_result to extract cell types;
    labels uniform cell-type sets as "N HepG2 tracks" and mixed as
    "N tracks". Applied to all 5 tools using the pattern.

C2/C3: Added 15 new tests covering _fmt_percentile (≥99th boundary),
    _score_color_class (interpretation-label-based color), resolved
    device detection (nvidia-smi probe), and _describe_tracks_requested
    (uniform vs mixed cell-type labels). Test count 235 → 250.

P2: MCP_WALKTHROUGH.md — added "Manage loaded oracles" tip covering
    oracle_status and unload_oracle. Fixed stale 5,930 → 5,731 track
    count for AlphaGenome.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Root cause (2 layers):

1. discover_variant_effects and discover_and_report wrote HTML to
   output_path BEFORE AnalysisRequest could be attached (since the
   functions didn't accept one). Regen scripts then attached the
   prompt post-hoc and did a second report.to_html() under a pretty
   filename, leaving the first HTML behind as an orphan.
2. Commit df7d613 deleted the 3 per-cell-type discovery HTMLs from
   the repo, but those files are the actual drill-down output of
   discover_and_report — they were mislabeled as duplicates. There
   is no 'main' discovery HTML; the per-cell-type HTMLs are it.

Clean fix (no glob+remove hack):

- discovery.py: discover_variant_effects gains `analysis_request`
  and `output_filename` kwargs; discover_and_report gains
  `user_prompt` and `tool_name` kwargs. When provided, the
  AnalysisRequest is attached before the first HTML write.
- regenerate_examples.py::regenerate_enformer_discovery and
  regenerate_remaining_examples.py::{regen_discovery, regen_tert_chr5}
  now pass the AnalysisRequest + pretty filename in. The post-hoc
  report.to_html() rewrite and the per-cell-type for-loop that
  patched analysis_request by glob are removed.
- The 3 legitimate discovery HTMLs are re-committed to the repo
  with the user prompt baked in on first write.
- tests/test_analysis.py: 2 new tests verify the new kwargs exist.
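The shape of the fix, attaching the request metadata before the single
HTML write, can be sketched as follows; all names are illustrative, not
the actual discovery.py signatures:

```python
def discover_and_report(variant, user_prompt=None, tool_name=None,
                        output_filename=None, write_html=None):
    """Toy sketch: the AnalysisRequest is attached BEFORE the first
    (and only) HTML write, so no orphan file under a default name
    is ever produced."""
    report = {"variant": variant}
    if user_prompt is not None:
        report["analysis_request"] = {"prompt": user_prompt,
                                      "tool": tool_name}
    filename = output_filename or f"{variant}_report.html"
    if write_html is not None:
        write_html(report, filename)   # single write, pretty filename
    return report, filename
```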

Verified:
  - pytest 298 passed (was 296)
  - SORT1_enformer dir contains only rs12740374_SORT1_enformer_report.html
    (no orphan chr*.html)
  - All 3 discovery sub-reports render the "Screen all cell types…"
    prompt in their Analysis Request section
  - `git status` after a fresh regen is clean (no untracked files)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Read-only audit from a first-time user's perspective. The install
path, Minimal Working Example, MCP walkthrough, and CLI all work
out of the box; four low-to-medium polish items identified, all
documentation drift:

- MEDIUM: `chorus list` shows phantom `base` oracle as "✗ Not
  installed" because it scans environments/chorus-*.yml including
  chorus-base.yml (which the install instructions tell users NOT
  to install directly).
- MEDIUM: docs/MCP_WALKTHROUGH.md:38 uses wrong kwarg
  `alt_allele="T"` (singular string); actual signature is
  `alt_alleles: list[str]`. The next example in the same file
  uses the correct form.
- LOW: examples/advanced_multi_oracle_analysis.ipynb cell 1 has
  stale subtitle "using the Enformer oracle" copy-pasted from
  single_oracle_quickstart.ipynb (this is the multi-oracle NB).
- LOW: SORT1_rs12740374/README.md "Key results" table shows
  percentiles 99/98/95/90/88 but example_output.md has all five
  tracks at ≥99th after the v5 _fmt_percentile update.

Verified clean: README Minimal Working Example runs verbatim;
all 10 MCP tools introspect with docstrings + user_prompt; error
messages (`Unknown oracle`, `Valid names`, etc.) are actionable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Four polish fixes from the first-user UX audit (PR #16):

1. MEDIUM: `chorus list` no longer shows a phantom `base` entry.
   EnvironmentManager.list_available_oracles() now filters out
   chorus-base.yml (an internal template — the user-facing base
   env is 'chorus' from root environment.yml). Added a guard in
   `chorus setup --oracle base` that prints a helpful message
   pointing to `mamba env create -f environment.yml` instead of
   silently trying to create a chorus-base env. Test added in
   tests/test_core.py.

2. MEDIUM: docs/MCP_WALKTHROUGH.md:38 — fixed `alt_allele="T"`
   (wrong, singular string) → `alt_alleles=["T"]` (plural, list)
   to match the actual MCP tool signature. The adjacent Example 2
   on line 70 already used the correct form; now both match.

3. LOW: examples/advanced_multi_oracle_analysis.ipynb cell 1 —
   replaced the stale "using the Enformer oracle" subtitle
   (copy-pasted from single_oracle_quickstart) with a proper
   multi-oracle description that matches the title.

4. LOW: examples/applications/variant_analysis/SORT1_rs12740374/
   README.md — "Key results" table had stale graduated
   percentiles (99/98/95/90/88); current example_output.md has
   all five tracks at ≥99th after the v5 _fmt_percentile update.
   Refreshed the table with current effect sizes and enriched
   track names (CHIP:CEBPA:HepG2 etc.), added an explanation of
   why the top bucket shows as "≥99th".

Verified: pytest 299 passed (was 298); `chorus list` output no
longer includes the phantom 'base' entry; `chorus setup --oracle
base` prints the friendly error message.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Clean-slate audit after v7 UX fixes merged. Deleted 13 GB (7 mamba
envs, ~/.chorus/, HF chorus models) before starting; preserved only
hg38.fa and the repo-internal ChromBPNet/Sei/LegNet weights.

First full audit with zero findings. Every v5/v6/v7 fix held up:

- v5: batch_scoring columns + cell-type label + docstring → live
- v6: discovery API kwargs (analysis_request, output_filename,
  user_prompt) → live and verified zero-orphan after fresh regen
  (git status --short | grep ^?? returns empty)
- v7: phantom `base` filter + walkthrough alt_alleles + NB3 subtitle
  + SORT1 README ≥99th percentiles → live

Results:
- pytest: 299/299 passed on fresh base env (17.7 s)
- smoke: 6/6 oracles pass in 6 min 1 s (AG+Borzoi re-downloaded from HF)
- regen: all 12 examples reproduce within AG CPU non-det tolerance
  (max Δeff 0.036, ChromBPNet 0.0001, Enformer 0.000)
- notebooks: 129 code cells across 3 NBs, 0 errors, 0 warnings, 0 stale
- HTML: 16/16 reports audit clean in Selenium (0 SEVERE, 0 CDN)
- CDFs: 4/6 downloaded, all pass monotonicity + counts + p50/p95/p99

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Full cold-start audit at /srv/local/lp698/chorus-audit-v7:
- All 7 chorus envs wiped and rebuilt from scratch via chorus setup
- hg38 + HF backgrounds re-downloaded fresh
- 307/307 tests pass (301 code + 6 oracle smoke)
- 3 notebooks: 0 errors across 235 cells
- All 13 examples regenerated (AG + Enformer + ChromBPNet)
- Selenium: 16/16 HTML reports CLEAN

Includes 2026-04-17_v7_scorched_earth_audit.md report.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Screenshot review found causal prioritization MD+HTML still used the
old '(100%)' / '(95%)' percentile format. The variant_report tables
and batch_scoring tables already used '≥99th' / '0.95' / '≤1st' via
_fmt_percentile (added in the audit pass). Causal was the last
remaining site.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two scary-looking warnings surfaced while reading notebook cell outputs
in the v7 audit. Neither is a real problem but both alarm users:

1. chorus/core/base.py:323 — case-sensitive compare of reference allele
   vs genome. pyfaidx returns lowercase for softmasked (repetitive)
   regions; users always pass uppercase. The previous code fired
   'Provided reference allele is not the same as the genome reference'
   on every variant in a softmasked locus (e.g. GATA1 TSS in quickstart
   notebook cell 39, comprehensive notebook cells 35 and 51). Now uses
   .upper() on both sides; also includes the actual allele pair in the
   warning message so users can confirm.

2. chorus/core/result.py:104 — 'Unknown implementation' warning fired
   for every Sei track (Stem cell / Multi-tissue / H3K4me3 etc.) that
   isn't in the hardcoded assay_type registry. The generic fallback
   works correctly; the warning was just noise. Downgraded to
   logger.debug.
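   Fix 1 amounts to a case-insensitive comparison; a minimal sketch of
   the behaviour (function name and signature are illustrative, not the
   actual chorus/core/base.py code):

   ```python
   import warnings

   def check_reference_allele(provided, genome_base):
       # pyfaidx returns lowercase bases in softmasked (repetitive)
       # regions, so a naive equality test fired on every variant in
       # such a locus; compare case-insensitively and report the pair.
       if provided.upper() != genome_base.upper():
           warnings.warn(
               f"Provided reference allele {provided!r} is not the same "
               f"as the genome reference {genome_base!r}"
           )
           return False
       return True
   ```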

Scientific review of outputs:
- SORT1 rs12740374: predictions match Musunuru 2010 mechanism (CEBPA/B
  binding gain, DNASE opening, H3K27ac gain, CAGE TSS increase) ✓
- BCL11A rs1427407: TAL1 binding loss + DNASE closing in K562 ✓
- FTO rs1421085: minimal effects in HepG2 (expected — adipose tissue) ✓
- TERT chr5:1295046 T>G: E2F1 binding gain + TERT TSS CAGE increase ✓
- SORT1 causal: rs12740374 ranks #1 composite=0.964 ✓

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ght CI

v8 found zero action items but called out four scenarios it did not
exercise. This PR closes them.

Fast suite (299 → 303): tests/test_error_recovery.py
- HF download ConnectionError → graceful return + warning log
- download_with_resume .partial file resume via Range header
- AlphaGenome missing HF_TOKEN → friendly actionable error
- Missing oracle env → "chorus setup" hint + graceful fallback
All four use mocks, run in ~1.5 s, no network.

Integration suite (gated by @pytest.mark.integration):
tests/test_integration.py
- SEI + LegNet CDF download from HF dataset (v8 didn't trigger these
  because no regen workflow uses sei/legnet) — verified the NPZs
  load and pass monotonicity + p50/p95/p99 + counts checks
- ChromBPNet ATAC:K562 fresh download from ENCODE (~500 MB, 8 min)
  to a tmp dir — verifies the shared download_with_resume helper
  end-to-end without touching the 37 GB real cache
- First E2E test of the MCP server: spawn chorus-mcp stdio
  subprocess via fastmcp Client, call list_oracles + load_oracle +
  analyze_variant_multilayer on SORT1 rs12740374 with real
  AlphaGenome predict (~4.5 min)

Light CI: .github/workflows/tests.yml
- Runs fast suite on every PR and push to main/chorus-applications
- Linux ubuntu-latest + Miniforge + mamba + pip install -e .
- Skips smoke tests (~10 GB models exceed runner disk) and
  integration tests (too slow for per-PR feedback)
- workflow_dispatch for manual maintainer runs

pytest.ini: registers the `integration` marker.

Verified:
- pytest -m "not integration" → 303 passed, 4 deselected (58.9 s)
- pytest -m integration → 4 passed (13 min total)
- Fresh ATAC:K562 tarball streamed with .partial + fcntl lock, fold 0
  weights loaded into TF, 2114 bp predict returns finite values
- chorus-mcp subprocess round-trips analyze_variant_multilayer
  response back to the client

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…unts

Full sweep of user-facing docs after the multi-pass audit series. 11 fixes:

BLOCKS_USER (wrong biology in example READMEs):
- variant_analysis/SORT1_chrombpnet/README.md: was claiming "+0.441 Strong
  opening", actual is "-0.111 Moderate closing". Replaced with correct
  values + cross-oracle divergence explanation.
- sequence_engineering/region_swap/README.md: entire scenario was wrong
  (promoter swap at chr1:1000500 vs actual SORT1 enhancer replacement at
  chr1:109274500). Rewrote from scratch to match actual example_output.md.
- sequence_engineering/integration_simulation/README.md: wrong direction
  (DNASE -0.900 vs actual +4.22), wrong filename. Rewrote.
- variant_analysis/SORT1_enformer/README.md: stale HepG2-focused numbers
  but actual is discovery-mode. Rewrote with top hits from current output.

CONFUSING (stale numbers):
- causal_prioritization/README.md:116: composite 0.898 → 0.964
- batch_scoring/README.md example table: old 4-column format → new
  per-track Ref/Alt/log2FC/Effect %ile format
- causal SORT1_locus/example_output.md: (100%) → (≥99th) via in-place
  patch (percentile format now consistent with other reports)
- validation/README.md:33,58: stale TERT CAGE +0.120 → +0.34

Track count drift (pick-one):
- 5,930 / 5930 → 5,731 across docs/API_DOCUMENTATION.md,
  docs/variant_analysis_framework.md, README.md, examples/applications/
- 230 bp → 200 bp for LegNet input size across same files

POLISH:
- docs/IMPLEMENTATION_GUIDE.md tree: removed "# Placeholder" markers on
  borzoi/chrombpnet/sei (all implemented); added analysis/ and mcp/
  directories; updated utils/ to reflect current files.
- multilayer_variant_analysis.md: "Quantile range" header → "Effect
  percentile range" to match the rest of the repo's terminology.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Full fresh-install audit at fbaef50 after wiping 13.2 GB (7 mamba envs,
~/.chorus/, HF chorus models). This pass focused on reading the
actual content of every example output, not just error counts.

Four findings, all low-to-medium environmental:

1. MEDIUM: TF Hub's /var/folders/.../tfhub_modules/ cache survives
   chorus teardowns. A stale partial download from a prior session
   made Enformer smoke fail with "'saved_model.pb' nor
   'saved_model.pbtxt'". Clearing it fixes; README should document.

2. MEDIUM (regression from v8): On this audit's SSL-MITM'd network,
   stdlib urllib in download_with_resume fails on
   cdn.jsdelivr.net/igv.min.js with cert-verify error. 6/16 HTMLs
   landed on the CDN <script> fallback. huggingface_hub's httpx
   works through the same proxy — robust fix is to mirror
   igv.min.js on the HF chorus-backgrounds dataset and fall back
   there when urllib fails.

3. LOW: FTO README promises adipose tracks but example runs with
   HepG2 (documented in the prompt only, not the README).

4. LOW: Notebooks run via `jupyter nbconvert` without
   `mamba activate chorus` emit 20-60 `bgzip is not installed` ERROR
   lines per notebook from coolbox. bgzip IS in the env — PATH just
   isn't inherited. Plots render via in-memory fallback; user sees
   scary error spam.

Verified:
- 303/303 fast pytest (17.5 s)
- 6/6 oracle smoke (after tfhub clear)
- 12/12 examples regenerate, max Δeff 0.036 (CPU non-det)
- 0 orphan HTMLs after parallel regen (v6 API fix live)
- 3 notebooks execute, 0 errors, plots render (despite bgzip noise)
- 16/16 HTMLs show correct biology in spot-check vs literature
- 6/6 CDFs pass monotonicity/p50/p95/p99/counts
  (first audit to empirically verify sei + legnet CDFs;
   v9 integration test now automates)

Biology confirmed on: SORT1 rs12740374 (Musunuru 2010 CEBP mechanism),
BCL11A rs1427407 (TAL1 disruption in K562), TERT chr5:1295046
(E2F1 + CAGE activation), region_swap (enhancer removal closes
chromatin), integration_simulation (CMV insertion opens chromatin).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
lucapinello pushed a commit that referenced this pull request Apr 17, 2026
…ip PATH

All 4 findings from audit PR #20. Fast suite 303 → 308.

1. MEDIUM: TF Hub corrupt-cache recovery (enformer.py + load_template.py)
   When an earlier Enformer download was interrupted, tfhub_modules/
   keeps a directory with no saved_model.pb. hub.load then raises with
   the bad path in the message. Added _load_enformer_with_tfhub_recovery
   that parses the error, wipes the bad dir, and retries once. Applied
   to both in-process (_load_direct) and subprocess (template) paths.
   README Troubleshooting now documents the manual workaround and the
   auto-recovery behavior.

2. MEDIUM (v8 regression): IGV JS fallback via huggingface_hub
   (_igv_report.py). On SSL-MITM networks stdlib urllib rejects the
   proxy cert and CDN fetch of igv.min.js fails. Added secondary
   fallback via hf_hub_download from lucapinello/chorus-backgrounds
   — huggingface_hub uses httpx+certifi which works through the same
   proxies that block urllib. Graceful no-op if the HF file doesn't
   exist yet (existing CDN <script> fallback still kicks in). Dataset
   upload of igv.min.js to the HF repo is a separate one-time task
   for the maintainer; until then this code path silently downgrades
   to the current behavior.

3. LOW: FTO README adipose claim vs HepG2 reality
   (examples/applications/variant_analysis/FTO_rs1421085/README.md).
   Rewrote the Tracks section to accurately state that the committed
   example uses HepG2 as a "nearest metabolic" proxy and shows what a
   no-signal call looks like. Included ready-to-use adipose-track
   assay_ids for users who want the biologically ideal run.

4. LOW: bgzip/tabix PATH when nbconvert skips mamba activate
   (chorus/__init__.py). Prepend sys.executable's bin/ to PATH at
   chorus import time. coolbox's subprocess calls to bgzip/tabix now
   succeed instead of emitting 20-60 ERROR lines per notebook and
   falling back to TabFileReaderInMemory. Cheap and idempotent
   (only prepended if not already present).
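   The PATH prepend in fix 4 is a few lines; a sketch of the
   import-time behaviour (the real code lives inline in
   chorus/__init__.py rather than in a named function):

   ```python
   import os
   import sys
   from pathlib import Path

   def ensure_env_bin_on_path():
       # Prepend the interpreter's bin/ directory so subprocess calls
       # (bgzip, tabix) resolve even when the env was never activated.
       # Idempotent: only prepended if not already present.
       env_bin = str(Path(sys.executable).parent)
       current = os.environ.get("PATH", "")
       if env_bin not in current.split(os.pathsep):
           os.environ["PATH"] = env_bin + os.pathsep + current
   ```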

Tests: 5 new in tests/test_error_recovery.py
- test_corrupt_cache_is_cleared_and_retry_succeeds
- test_unrelated_errors_propagate_unchanged
- test_hf_fallback_when_cdn_fails
- test_returns_none_when_both_fail
- test_env_bin_on_path_after_import

Verified: pytest -m "not integration" → 308 passed (was 303).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>