Annot demux refactor by AyushSemwal · Pull Request #73 · huishenlab/tranquillyzer

AyushSemwal · 2026-05-14T19:11:08Z

Default annotation model switched from 10x3p_sc_ont_013 (Conv+CRF) to 10x3p_sc_ont_016 (CNN-BiLSTM-CRF, 3 conv × 64 filters, 1 BiLSTM × 32 units). Empirically validated to ~525 kb on 4× L40S.
GPU stability hardening — bounded join→SIGTERM→SIGKILL worker shutdown to prevent NCCL ring corruption across jobs, predict_with_backoff now raises on stale model after K.clear_session(), XLA disabled (tf2crf Viterbi incompatibility on TF 2.15), CUDA_VISIBLE_DEVICES no longer blanked when no GPUs found, model state persisted across length bins.
QC enhancements — new --gene-body-bed flag accepting RSeQC-style BED12 (replaces auto-extraction from GTF); single-pass BAM scan is significantly faster.
Training/simulation — new --min-flank/--max-flank for terminal cDNA flanks, separate from interior --min/max-spacer; pre-flight read-length math corrected in assess-model.
Observability — every pipeline stage reports peak memory + elapsed time; logs and artifacts include the model name.
Bug fixes — barcode-correct falls back to bundled seq_orders.yaml when only --model-name is given; extract_annotated_seqs no longer crashes on empty Starts.
Documentation — resource_requirements.qmd re-anchored on _016 (bpt 6,644 → 2,916, k 73.4 → 32.2 KB/bp); new --max-batch-size per-VRAM-tier tuning guidance; XLA references removed; tentative-figures disclaimer added pending per-GPU benchmarking.

New commands

- barcode_correct: standalone Levenshtein correction against a whitelist; resumable, with optional inline demux via --run-demux.
- demux_reads: standalone FASTA/FASTQ export from annotations (demuxed or bulk).
- generate_whitelist: whitelist-free cell-barcode discovery via knee detection + deletion-neighborhood near-dup merging.
- qc_metrics: annotation + BAM-level QC, knee plots, boxplots, MultiQC TSVs, HTML report. ~3.9k lines, no PyArrow.
- assess_model: per-segment accuracy assessment for trained models, with a generated report.
- featurecounts: gene-level count matrix from per-cell BAMs.

Pipeline restructure

- You can run barcode_correct and demux_reads against existing annotations, which is useful for resuming, re-running with different thresholds, or the whitelist-free flow.
- annotate_reads and visualize accept preprocessed directories directly.
- User-configurable bin width in preprocessing.
- Option to split concatenated reads during annotation.
- Checkpoint/resume across annotate, BC correct, and demux. Optional chunk cleanup.
- annotations_valid.parquet auto-removed once the corrected file exists.
- Barcode correction handles UMI-less protocols and arbitrary barcode-type combos.
- All FASTA/FASTQ outputs gzipped.

Model / training

- REG and HYB paths dropped. CRF-only. 10x3p_sc_ont_013 is the new default.
- seq_orders and training_params moved from TSV → YAML with a cleaner schema.
- Dynamic batch sizing now accounts for all layers, not just conv.
- Training artifacts versioned and folder naming cleaned up.

QC

- qc_metrics.py computes and visualizes multiple QC plots at various levels.

Dedup / BAM

- UMI dedup significantly faster.
- Minor tweaks to split_bam.

Packaging & CI

- setup.py → pyproject.toml.
- Moved to setuptools-scm for versioning.
- Fixed the upstream CI breakage: tag_regex + git_describe_command so legacy non-PEP440 tags like v0.2.1_tf2.15.0 are stripped to 0.2.1.
- Added a Docker publish workflow.
- pytest now runs on push/PR to dev and annot_demux_refactor.
- New unit tests for checkpoint chunk size, _version, and the seq_orders YAML refactor.
- Integration tests expanded.
- Added container_runtime.py helper.

Docs

- Quarto site reorganized into per-stage pages: preprocessing, annotation, barcode/demux, align/dedup, QC, split BAM, visualization, featurecounts.
- New model-assessment guide; model-training and read-simulation docs rewritten.
- Quick start, usage, install, resource-requirements pages refreshed.
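The tag-stripping fix listed under Packaging & CI can be illustrated with a small regex sketch. The pattern and the helper name `strip_legacy_tag` below are assumptions made for illustration; they are not the `tag_regex` actually configured in this PR, which is not shown in the description. The sketch only demonstrates the idea of capturing the leading numeric version and discarding a `_suffix` that PEP 440 would reject.

```python
import re

# Hypothetical pattern (not the repository's real tag_regex): an optional
# leading "v", a dotted numeric version, then an optional "_<anything>" suffix.
TAG_RE = re.compile(r"^v?(?P<version>\d+(?:\.\d+)*)(?:_.+)?$")


def strip_legacy_tag(tag: str) -> str:
    """Return the numeric version part of a git tag, dropping any _suffix."""
    match = TAG_RE.match(tag)
    if match is None:
        raise ValueError(f"unrecognized tag format: {tag!r}")
    return match.group("version")


print(strip_legacy_tag("v0.2.1_tf2.15.0"))  # 0.2.1
```

Tags without a suffix pass through unchanged apart from the leading `v`, so the same pattern would accept both legacy and clean tags.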

updated schema

- scripts/barcode_correction.py: when only model_name is given (no explicit seq_order_file), default to the bundled utils/seq_orders.yaml instead of falling through to the column-overlap heuristic.
- scripts/extract_annotated_seqs.py: guard read[Starts[0]:Ends[0]] with a truthiness check on Starts to prevent IndexError when a barcode label is structurally absent in a "valid"-architecture read.
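The guard described for extract_annotated_seqs.py might look roughly like the sketch below. The function name `extract_segment` and the choice to return an empty string are assumptions for illustration; the commit message only states that the slice is protected by a truthiness check on Starts.

```python
from typing import List


def extract_segment(read: str, starts: List[int], ends: List[int]) -> str:
    """Return the first annotated segment, or "" when the label is absent.

    Hypothetical helper: the real code slices read[Starts[0]:Ends[0]] inline.
    """
    # An empty coordinate list means the label was never placed on this read.
    # Indexing starts[0] would raise IndexError, so test truthiness first.
    if not starts or not ends:
        return ""
    return read[starts[0]:ends[0]]


print(extract_segment("ACGTACGT", [2], [6]))  # GTAC
print(repr(extract_segment("ACGTACGT", [], [])))  # ''
```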

- scripts/annotate_new_data.py:
  * Only set CUDA_VISIBLE_DEVICES when GPUs are actually found, instead of blanking it to "" and masking parent-primed devices in subprocesses.
  * Disable global XLA JIT: tf2crf Viterbi is not XLA-compatible and caused intermittent "Unexpected Event status: 1" miscompiles on TF 2.15.
  * In predict_with_backoff, raise immediately when rebuild_model_fn is None after K.clear_session(); also raise on rebuild failure. Looping on a stale model after session clear is a documented driver-corruption path.
- wrappers/annotate_reads_wrap.py:
  * Replace unbounded worker.join() with a bounded ladder (join 5s → SIGTERM 2s → SIGKILL 2s → close). SIGKILL via SLURM cgroup mid-NCCL collective is a known way to leave the ring corrupt across jobs.
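The bounded shutdown ladder for workers can be sketched with the standard `multiprocessing.Process` API. This is a minimal illustration under assumptions, not the code in annotate_reads_wrap.py: the function name `shutdown_worker`, its parameters, and the returned outcome strings are hypothetical, and only the 5s/2s/2s timings come from the commit message.

```python
import multiprocessing as mp


def shutdown_worker(worker: mp.Process,
                    join_s: float = 5.0,
                    term_s: float = 2.0,
                    kill_s: float = 2.0) -> str:
    """Stop a worker with bounded waits; report which step ended it."""
    # Step 1: give the worker a bounded chance to exit on its own.
    worker.join(join_s)
    if not worker.is_alive():
        outcome = "joined"
    else:
        # Step 2: ask politely (SIGTERM on POSIX) and wait again.
        worker.terminate()
        worker.join(term_s)
        if not worker.is_alive():
            outcome = "terminated"
        else:
            # Step 3: force the exit (SIGKILL on POSIX) from inside the job.
            worker.kill()
            worker.join(kill_s)
            outcome = "killed"
    # close() raises ValueError on a live process, so only release the
    # handle once the worker has really gone.
    if not worker.is_alive():
        worker.close()
    return outcome
```

A caller would start the worker as usual, for example `p = mp.Process(target=run_bin); p.start()` with a hypothetical `run_bin` target, and call `shutdown_worker(p)` in place of a bare `p.join()`. Every wait has a timeout, so the parent cannot hang indefinitely on a stuck worker.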

Distinguishes terminal flanks (random cDNA at the two read ends, used to mimic ONT adapter-flank length) from interior spacers (chimeric junctions between concatenated fragments). Previously both used the RN token and shared --min/max-spacer; now they're RN_FLANK vs RN_SPACER with independent ranges.

- scripts/simulate_training_data.py:
  * generate_segment / generate_valid_read accept flank_range.
  * _build_structure_order_and_patterns emits RN_FLANK at the two ends and RN_SPACER between concatenated copies.
  * RC / reverse / transform helpers updated to recognize both tokens.
  * Drops the implicit min(length, 50) cap; bound is now user-controlled.
- main.py: adds --min-flank/--max-flank Typer options to simulate_data and assess_model.
- wrappers/simulate_data_wrap.py, wrappers/evaluate_model_wrap.py: thread the new args through. evaluate_model_wrap also fixes the pre-flight read-length overhead math: was (repeat+1) * max_spacer, now 2 * max_flank + max(0, repeat-1) * max_spacer.
- docs/webpages/model_training/simulate_data_cli.qmd: documents the flags.
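The two overhead formulas quoted above can be compared directly. In this sketch the function names are hypothetical, and `repeat` is assumed to be the number of concatenated copies in a simulated read. Under that assumption a read has two terminal flanks and `repeat - 1` interior junctions, which is why the old formula's `repeat + 1` random segments split into `2 + (repeat - 1)` once flanks and spacers have separate bounds.

```python
def overhead_old(repeat: int, max_spacer: int) -> int:
    """Previous bound: every random segment sized by the spacer limit."""
    return (repeat + 1) * max_spacer


def overhead_new(repeat: int, max_flank: int, max_spacer: int) -> int:
    """Corrected bound: two terminal flanks plus the interior spacers."""
    return 2 * max_flank + max(0, repeat - 1) * max_spacer


# Illustrative values only: 3 copies, flanks up to 100 bp, spacers up to 50 bp.
print(overhead_old(3, 50))       # 200
print(overhead_new(3, 100, 50))  # 300
```

When `max_flank` equals `max_spacer` and `repeat` is at least 1, the two formulas agree; they diverge as soon as the flank range differs from the spacer range.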


codecov · 2026-05-14T19:22:31Z

Codecov Report

❌ Patch coverage is 62.50000% with 138 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| scripts/qc_metrics.py | 35.07% | 87 Missing ⚠️ |
| wrappers/annotate_reads_wrap.py | 30.43% | 16 Missing ⚠️ |
| scripts/annotate_new_data.py | 30.00% | 14 Missing ⚠️ |
| wrappers/qc_metrics_wrap.py | 77.41% | 7 Missing ⚠️ |
| scripts/available_gpus.py | 57.14% | 6 Missing ⚠️ |
| scripts/annotate_reads.py | 50.00% | 4 Missing ⚠️ |
| wrappers/evaluate_model_wrap.py | 76.92% | 3 Missing ⚠️ |
| wrappers/generate_whitelist_wrap.py | 91.66% | 1 Missing ⚠️ |

| Files with missing lines | Coverage Δ | |
|---|---|---|
| main.py | `91.73% <ø> (ø)` | |
| scripts/barcode_correction.py | `60.60% <100.00%> (+0.92%)` | ⬆️ |
| scripts/discover_barcodes.py | `73.05% <100.00%> (ø)` | |
| scripts/extract_annotated_seqs.py | `72.86% <100.00%> (+0.10%)` | ⬆️ |
| scripts/simulate_training_data.py | `88.05% <100.00%> (+0.05%)` | ⬆️ |
| scripts/split_bam_file.py | `78.69% <ø> (-0.19%)` | ⬇️ |
| utils/__init__.py | `60.00% <100.00%> (+2.85%)` | ⬆️ |
| wrappers/align_wrap.py | `96.55% <100.00%> (+0.25%)` | ⬆️ |
| wrappers/barcode_correction_wrap.py | `68.40% <100.00%> (+0.84%)` | ⬆️ |
| wrappers/dedup_wrap.py | `81.81% <100.00%> (+0.86%)` | ⬆️ |

... and 16 more


AyushSemwal added 12 commits April 17, 2026 12:39

Merge pull request huishenlab#70 from huishenlab/annot_demux_refactor

2fa3378

updated schema

Merge pull request huishenlab#71 from huishenlab/dev

263f51d

updated schema

docs: document --min-flank/--max-flank in assess_model.qmd

98a1cb3

meta: include model name in log output and artifacts

10a6796

feat(qc): switch gene body coverage to --gene-body-bed; speed up BAM …

da8b1d9

…scan

feat(gpu): separate detected vs in-use logging; persist model state a…

1bac018

…cross bins

feat(logging): report peak memory + elapsed time across

c831433

feat: switch default model to 10x3p_sc_ont_016; doc: --max-batch-size…

382c5fd

… tuning guidance

AyushSemwal merged commit ddc65c9 into huishenlab:annot_demux_refactor May 14, 2026
1 check passed

AyushSemwal merged 12 commits into huishenlab:annot_demux_refactor from AyushSemwal:annot_demux_refactor
