
Annot demux refactor #73

Merged
AyushSemwal merged 12 commits into huishenlab:annot_demux_refactor from AyushSemwal:annot_demux_refactor
May 14, 2026

@AyushSemwal
Member

  • Default annotation model switched from 10x3p_sc_ont_013 (Conv+CRF) to 10x3p_sc_ont_016 (CNN-BiLSTM-CRF, 3 conv × 64 filters, 1 BiLSTM × 32 units). Empirically validated to ~525 kb on 4× L40S.
  • GPU stability hardening — bounded join→SIGTERM→SIGKILL worker shutdown to prevent NCCL ring corruption across jobs, predict_with_backoff now raises on stale model after K.clear_session(), XLA disabled (tf2crf Viterbi incompatibility on TF 2.15), CUDA_VISIBLE_DEVICES no longer blanked when no GPUs found, model state persisted across length bins.
  • QC enhancements — new --gene-body-bed flag accepting RSeQC-style BED12 (replaces auto-extraction from GTF); single-pass BAM scan is significantly faster.
  • Training/simulation — new --min-flank/--max-flank for terminal cDNA flanks, separate from interior --min/max-spacer; pre-flight read-length math corrected in assess-model.
  • Observability — every pipeline stage reports peak memory + elapsed time; logs and artifacts include the model name.
  • Bug fixes: barcode-correct falls back to the bundled seq_orders.yaml when only --model-name is given; extract_annotated_seqs no longer crashes on empty Starts.
  • Documentation: resource_requirements.qmd re-anchored on _016 (bpt 6,644 → 2,916, k 73.4 → 32.2 KB/bp); new --max-batch-size per-VRAM-tier tuning guidance; XLA references removed; tentative-figures disclaimer added pending per-GPU benchmarking.
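
To illustrate the CUDA_VISIBLE_DEVICES change in the GPU hardening bullet, here is a minimal sketch of the "only set it when GPUs were found" rule. The function name `export_visible_gpus` and its signature are hypothetical and are not taken from scripts/annotate_new_data.py; the point is only that an empty detection result leaves the inherited value untouched instead of overwriting it with "".

```python
import os
from typing import MutableMapping, Sequence


def export_visible_gpus(
    gpu_ids: Sequence[int],
    env: MutableMapping[str, str] = os.environ,
) -> None:
    """Pin subprocesses to gpu_ids, or leave the inherited setting alone.

    Hypothetical helper: the real code may detect and export GPUs differently.
    """
    if gpu_ids:
        env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    # No GPUs detected: deliberately do nothing. Writing "" here would hide
    # devices that the parent process had already made visible.
```

Passing a plain dict as `env` keeps the helper easy to unit-test without touching the real process environment.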

New commands

barcode_correct — standalone Levenshtein correction against a whitelist; resumable, optional inline demux via
--run-demux.
demux_reads — standalone FASTA/FASTQ export from annotations (demuxed or bulk).
generate_whitelist — whitelist-free cell-barcode discovery via knee detection + deletion-neighborhood near-dup merging.
qc_metrics — annotation + BAM-level QC, knee plots, boxplots, MultiQC TSVs, HTML report. ~3.9k lines, no PyArrow.
assess_model — per-segment accuracy assessment for trained models with a generated report.
featurecounts — gene-level count matrix from per-cell BAMs.
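
For readers unfamiliar with whitelist-based correction, the idea behind barcode_correct can be sketched as follows. This is an illustration under stated assumptions, not the code in scripts/barcode_correction.py: the function names, the `max_dist` default, and the policy of rejecting ties are all hypothetical.

```python
from typing import Iterable, Optional


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) via row-wise DP."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def correct_barcode(
    observed: str,
    whitelist: Iterable[str],
    max_dist: int = 1,  # assumed threshold, for illustration only
) -> Optional[str]:
    """Return the unique closest whitelist barcode within max_dist, else None."""
    best, best_dist, tied = None, max_dist + 1, False
    for candidate in whitelist:
        d = levenshtein(observed, candidate)
        if d < best_dist:
            best, best_dist, tied = candidate, d, False
        elif d == best_dist and candidate != best:
            tied = True  # two different candidates are equally close
    return None if tied or best is None else best
```

The linear scan is quadratic per comparison and only meant to show the logic; a production implementation would typically use a compiled edit-distance library or an index over the whitelist.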
Pipeline restructure

You can run barcode_correct and demux_reads against existing annotations — useful for resuming, re-running with different thresholds, or the whitelist-free flow.
annotate_reads and visualize accept preprocessed directories directly.
User-configurable bin width in preprocessing.
Option to split concatenated reads during annotation.
Checkpoint/resume across annotate, BC correct, and demux. Optional chunk cleanup. annotations_valid.parquet is auto-removed once the corrected file exists.
Barcode correction handles UMI-less protocols and arbitrary barcode-type combos. All FASTA/FASTQ outputs gzipped.
Model / training

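REG and HYB paths dropped; models are CRF-only. 10x3p_sc_ont_013 became the default at this stage of the refactor and was later superseded by 10x3p_sc_ont_016 (see the first bullet of this description).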
seq_orders and training_params moved from TSV → YAML with a cleaner schema.
Dynamic batch sizing now accounts for all layers, not just conv.
Training artifacts versioned and folder naming cleaned up.
QC

qc_metrics.py computes QC metrics at the annotation and BAM levels and renders the corresponding plots (knee plots, boxplots).
Dedup / BAM

UMI dedup significantly faster.
Minor tweaks to split_bam.
Packaging & CI

setup.py → pyproject.toml. Moved to setuptools-scm for versioning.
Fixed the upstream CI breakage: tag_regex + git_describe_command so legacy non-PEP440 tags like v0.2.1_tf2.15.0 are stripped to 0.2.1.
Added a Docker publish workflow.
pytest now runs on push/PR to dev and annot_demux_refactor.
New unit tests for checkpoint chunk size, _version, and the seq_orders YAML refactor. Integration tests expanded.
Added container_runtime.py helper.
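
The tag-stripping behaviour described above (v0.2.1_tf2.15.0 → 0.2.1) can be checked in isolation with a regex like the one below. The pattern is an assumption written for this example and is not copied from pyproject.toml; `version_from_tag` is likewise a hypothetical helper used only to exercise it.

```python
import re
from typing import Optional

# Illustrative pattern: optional "v", a dotted numeric version captured as
# "version", then an optional "_<suffix>" that is discarded.
LEGACY_TAG = re.compile(r"^v?(?P<version>\d+(?:\.\d+)*)(?:_.+)?$")


def version_from_tag(tag: str) -> Optional[str]:
    """Return the PEP 440-friendly part of a tag, or None if it does not match."""
    match = LEGACY_TAG.match(tag)
    return match.group("version") if match else None
```

A small unit test over the repository's existing tags is a cheap way to catch a pattern that silently stops matching after a tag-naming change.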
Docs

Quarto site reorganized into per-stage pages: preprocessing, annotation, barcode/demux, align/dedup, QC, split BAM, visualization, featurecounts.
New model-assessment guide; model-training and read-simulation docs rewritten.
Quick start, usage, install, resource-requirements pages refreshed.
- scripts/barcode_correction.py: when only model_name is given (no explicit
  seq_order_file), default to the bundled utils/seq_orders.yaml instead of
  falling through to the column-overlap heuristic.
- scripts/extract_annotated_seqs.py: guard read[Starts[0]:Ends[0]] with a
  truthiness check on Starts to prevent IndexError when a barcode label is
  structurally absent in a "valid"-architecture read.
- scripts/annotate_new_data.py:
  * Only set CUDA_VISIBLE_DEVICES when GPUs are actually found, instead of
    blanking it to "" and masking parent-primed devices in subprocesses.
  * Disable global XLA JIT — tf2crf Viterbi is not XLA-compatible and caused
    intermittent "Unexpected Event status: 1" miscompiles on TF 2.15.
  * In predict_with_backoff, raise immediately when rebuild_model_fn is None
    after K.clear_session(); also raise on rebuild failure. Looping on a
    stale model after session clear is a documented driver-corruption path.
- wrappers/annotate_reads_wrap.py:
  * Replace unbounded worker.join() with a bounded ladder (join 5s →
    SIGTERM 2s → SIGKILL 2s → close). SIGKILL via SLURM cgroup mid-NCCL
    collective is a known way to leave the ring corrupt across jobs.
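
The bounded shutdown ladder (join 5s → SIGTERM 2s → SIGKILL 2s → close) can be sketched with the standard multiprocessing API as below. This is not the code in wrappers/annotate_reads_wrap.py: the function name `shutdown_worker` and its return strings are hypothetical, and only the timeouts are taken from the description above.

```python
import multiprocessing as mp


def shutdown_worker(
    worker: mp.Process,
    join_s: float = 5.0,
    term_s: float = 2.0,
    kill_s: float = 2.0,
) -> str:
    """Stop a worker with escalating force and report which step succeeded."""
    worker.join(join_s)      # step 1: wait for a clean exit
    if not worker.is_alive():
        worker.close()
        return "exited"

    worker.terminate()       # step 2: SIGTERM on POSIX
    worker.join(term_s)
    if not worker.is_alive():
        worker.close()
        return "terminated"

    worker.kill()            # step 3: SIGKILL on POSIX
    worker.join(kill_s)
    if worker.is_alive():
        # close() raises ValueError on a live process, so skip it and let
        # the caller decide how to report a worker that will not die.
        return "stuck"
    worker.close()
    return "killed"
```

Every wait is bounded, so the parent cannot hang indefinitely on a wedged worker, and each escalation step gives the process a chance to exit before the next signal is sent.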
Distinguishes terminal flanks (random cDNA at the two read ends, used to
mimic ONT adapter-flank length) from interior spacers (chimeric junctions
between concatenated fragments). Previously both used the RN token and
shared --min/max-spacer; now they're RN_FLANK vs RN_SPACER with independent
ranges.

- scripts/simulate_training_data.py:
  * generate_segment / generate_valid_read accept flank_range.
  * _build_structure_order_and_patterns emits RN_FLANK at the two ends and
    RN_SPACER between concatenated copies.
  * RC / reverse / transform helpers updated to recognize both tokens.
  * Drops the implicit min(length, 50) cap; bound is now user-controlled.
- main.py: adds --min-flank/--max-flank Typer options to simulate_data and
  assess_model.
- wrappers/simulate_data_wrap.py, wrappers/evaluate_model_wrap.py: thread
  the new args through. evaluate_model_wrap also fixes the pre-flight
  read-length overhead math: was (repeat+1) * max_spacer, now
  2 * max_flank + max(0, repeat-1) * max_spacer.
- docs/webpages/model_training/simulate_data_cli.qmd: documents the flags.
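
The corrected pre-flight overhead can be compared with the old formula directly. The two formulas are taken from the commit message above; the function names are hypothetical, and reading `repeat` as the number of concatenated copies is an assumption.

```python
def overhead_old(repeat: int, max_spacer: int) -> int:
    """Previous estimate: every gap, terminal or interior, counted as a spacer."""
    return (repeat + 1) * max_spacer


def overhead_new(repeat: int, max_flank: int, max_spacer: int) -> int:
    """Corrected estimate: two terminal flanks plus (repeat - 1) interior spacers."""
    return 2 * max_flank + max(0, repeat - 1) * max_spacer
```

With purely illustrative values repeat=3, max_flank=100, max_spacer=50, the old formula gives 200 while the new one gives 300, so the old pre-flight check would under-budget whenever flanks are longer than spacers.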
@AyushSemwal AyushSemwal merged commit ddc65c9 into huishenlab:annot_demux_refactor May 14, 2026
1 check passed
@codecov

codecov Bot commented May 14, 2026

Codecov Report

❌ Patch coverage is 62.50000% with 138 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| scripts/qc_metrics.py | 35.07% | 87 Missing ⚠️ |
| wrappers/annotate_reads_wrap.py | 30.43% | 16 Missing ⚠️ |
| scripts/annotate_new_data.py | 30.00% | 14 Missing ⚠️ |
| wrappers/qc_metrics_wrap.py | 77.41% | 7 Missing ⚠️ |
| scripts/available_gpus.py | 57.14% | 6 Missing ⚠️ |
| scripts/annotate_reads.py | 50.00% | 4 Missing ⚠️ |
| wrappers/evaluate_model_wrap.py | 76.92% | 3 Missing ⚠️ |
| wrappers/generate_whitelist_wrap.py | 91.66% | 1 Missing ⚠️ |

| Files with missing lines | Coverage Δ |
|---|---|
| main.py | 91.73% <ø> (ø) |
| scripts/barcode_correction.py | 60.60% <100.00%> (+0.92%) ⬆️ |
| scripts/discover_barcodes.py | 73.05% <100.00%> (ø) |
| scripts/extract_annotated_seqs.py | 72.86% <100.00%> (+0.10%) ⬆️ |
| scripts/simulate_training_data.py | 88.05% <100.00%> (+0.05%) ⬆️ |
| scripts/split_bam_file.py | 78.69% <ø> (-0.19%) ⬇️ |
| utils/__init__.py | 60.00% <100.00%> (+2.85%) ⬆️ |
| wrappers/align_wrap.py | 96.55% <100.00%> (+0.25%) ⬆️ |
| wrappers/barcode_correction_wrap.py | 68.40% <100.00%> (+0.84%) ⬆️ |
| wrappers/dedup_wrap.py | 81.81% <100.00%> (+0.86%) ⬆️ |

... and 16 more