Annot demux refactor #73
Merged
AyushSemwal merged 12 commits on May 14, 2026
New commands
- barcode_correct: standalone Levenshtein correction against a whitelist; resumable, optional inline demux via --run-demux.
- demux_reads: standalone FASTA/FASTQ export from annotations (demuxed or bulk).
- generate_whitelist: whitelist-free cell-barcode discovery via knee detection + deletion-neighborhood near-dup merging.
- qc_metrics: annotation + BAM-level QC, knee plots, boxplots, MultiQC TSVs, HTML report. ~3.9k lines, no PyArrow.
- assess_model: per-segment accuracy assessment for trained models with a generated report.
- featurecounts: gene-level count matrix from per-cell BAMs.

Pipeline restructure
- You can run barcode_correct and demux_reads against existing annotations; useful for resuming, re-running with different thresholds, or the whitelist-free flow.
- annotate_reads and visualize accept preprocessed directories directly.
- User-configurable bin width in preprocessing.
- Option to split concatenated reads during annotation.
- Checkpoint/resume across annotate, BC correct, and demux. Optional chunk cleanup.
- annotations_valid.parquet auto-removed once the corrected file exists.
- Barcode correction handles UMI-less protocols and arbitrary barcode-type combos.
- All FASTA/FASTQ outputs gzipped.

Model / training
- REG and HYB paths dropped. CRF-only. 10x3p_sc_ont_013 is the new default.
- seq_orders and training_params moved from TSV → YAML with a cleaner schema.
- Dynamic batch sizing now accounts for all layers, not just conv.
- Training artifacts versioned and folder naming cleaned up.

QC
- qc_metrics.py computes and visualizes multiple QC plots at various levels.

Dedup / BAM
- UMI dedup significantly faster.
- Minor tweaks to split_bam.

Packaging & CI
- setup.py → pyproject.toml. Moved to setuptools-scm for versioning.
- Fixed the upstream CI breakage: tag_regex + git_describe_command so legacy non-PEP440 tags like v0.2.1_tf2.15.0 are stripped to 0.2.1.
- Added a Docker publish workflow.
- pytest now runs on push/PR to dev and annot_demux_refactor.
- New unit tests for checkpoint chunk size, _version, and the seq_orders YAML refactor. Integration tests expanded.
- Added container_runtime.py helper.

Docs
- Quarto site reorganized into per-stage pages: preprocessing, annotation, barcode/demux, align/dedup, QC, split BAM, visualization, featurecounts.
- New model-assessment guide; model-training and read-simulation docs rewritten.
- Quick start, usage, install, resource-requirements pages refreshed.
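For readers unfamiliar with whitelist-based correction, the idea behind barcode_correct can be sketched roughly as follows. This is an illustrative sketch only, not the code in this PR: the function names, the max_dist default, and the tie-handling rule (ambiguous matches are rejected) are all assumptions made for the example.

```python
from typing import Iterable, Optional


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the DP row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # drop a char of a
                               current[j - 1] + 1,       # drop a char of b
                               previous[j - 1] + cost))  # match or substitute
        previous = current
    return previous[-1]


def correct_barcode(observed: str,
                    whitelist: Iterable[str],
                    max_dist: int = 2) -> Optional[str]:
    """Return the unique closest whitelist barcode within max_dist, else None.

    Hypothetical helper: rejecting ties is an assumption of this sketch.
    """
    best, best_dist, tie = None, max_dist + 1, False
    for candidate in whitelist:
        d = levenshtein(observed, candidate)
        if d < best_dist:
            best, best_dist, tie = candidate, d, False
        elif d == best_dist and candidate != best:
            tie = True
    return None if best is None or tie else best
```

A brute-force scan like this is quadratic in barcode length per comparison and linear in whitelist size, so a real pipeline would likely use a faster library or an index; the sketch only shows the decision rule.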
updated schema
- scripts/barcode_correction.py: when only model_name is given (no explicit seq_order_file), default to the bundled utils/seq_orders.yaml instead of falling through to the column-overlap heuristic.
- scripts/extract_annotated_seqs.py: guard read[Starts[0]:Ends[0]] with a truthiness check on Starts to prevent IndexError when a barcode label is structurally absent in a "valid"-architecture read.
- scripts/annotate_new_data.py:
* Only set CUDA_VISIBLE_DEVICES when GPUs are actually found, instead of
blanking it to "" and masking parent-primed devices in subprocesses.
* Disable global XLA JIT — tf2crf Viterbi is not XLA-compatible and caused
intermittent "Unexpected Event status: 1" miscompiles on TF 2.15.
* In predict_with_backoff, raise immediately when rebuild_model_fn is None
after K.clear_session(); also raise on rebuild failure. Looping on a
stale model after session clear is a documented driver-corruption path.
- wrappers/annotate_reads_wrap.py:
* Replace unbounded worker.join() with a bounded ladder (join 5s →
SIGTERM 2s → SIGKILL 2s → close). SIGKILL via SLURM cgroup mid-NCCL
collective is a known way to leave the ring corrupt across jobs.
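The bounded shutdown ladder described for annotate_reads_wrap.py can be sketched along these lines. This is a minimal sketch under stated assumptions, not the wrapper's actual code: stop_worker and its return values are hypothetical, and the worker is assumed to expose the multiprocessing.Process interface (join, is_alive, terminate, kill, close).

```python
def stop_worker(worker, join_s=5.0, term_s=2.0, kill_s=2.0):
    """Escalate join -> SIGTERM -> SIGKILL, each with a bounded wait.

    `worker` is assumed to behave like multiprocessing.Process.
    Returns the stage at which the escalation stopped (hypothetical labels).
    """
    worker.join(timeout=join_s)       # give the worker time to exit cleanly
    stage = "joined"
    if worker.is_alive():
        worker.terminate()            # SIGTERM on POSIX
        worker.join(timeout=term_s)
        stage = "terminated"
    if worker.is_alive():
        worker.kill()                 # SIGKILL on POSIX
        worker.join(timeout=kill_s)
        stage = "killed"
    if not worker.is_alive():
        worker.close()                # close() raises if still running
    return stage
```

The point of the ladder is that no step waits indefinitely, so the parent decides when escalation happens instead of leaving it to the scheduler.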
Distinguishes terminal flanks (random cDNA at the two read ends, used to
mimic ONT adapter-flank length) from interior spacers (chimeric junctions
between concatenated fragments). Previously both used the RN token and
shared --min/max-spacer; now they're RN_FLANK vs RN_SPACER with independent
ranges.
- scripts/simulate_training_data.py:
* generate_segment / generate_valid_read accept flank_range.
* _build_structure_order_and_patterns emits RN_FLANK at the two ends and
RN_SPACER between concatenated copies.
* RC / reverse / transform helpers updated to recognize both tokens.
* Drops the implicit min(length, 50) cap; bound is now user-controlled.
- main.py: adds --min-flank/--max-flank Typer options to simulate_data and
assess_model.
- wrappers/simulate_data_wrap.py, wrappers/evaluate_model_wrap.py: thread
the new args through. evaluate_model_wrap also fixes the pre-flight
read-length overhead math: was (repeat+1) * max_spacer, now
2 * max_flank + max(0, repeat-1) * max_spacer.
- docs/webpages/model_training/simulate_data_cli.qmd: documents the flags.
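The corrected pre-flight overhead formula can be compared with the old one in a few lines. The function names and the example values are hypothetical; only the two formulas come from the commit message above.

```python
def overhead_old(repeat: int, max_spacer: int) -> int:
    """Previous estimate: every gap, including both ends, counted as a spacer."""
    return (repeat + 1) * max_spacer


def overhead_new(repeat: int, max_flank: int, max_spacer: int) -> int:
    """Two terminal flanks plus one spacer per junction between copies."""
    return 2 * max_flank + max(0, repeat - 1) * max_spacer


# Illustrative values only: 3 concatenated copies, flanks up to 50 bp,
# interior spacers up to 20 bp.
print(overhead_old(3, 20))       # 80
print(overhead_new(3, 50, 20))   # 140
```

With a single copy (repeat = 1) there are no interior junctions, so only the two flanks contribute; the max(0, ...) term keeps the spacer count from going negative when repeat is 0.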
- 10x3p_sc_ont_013 (Conv+CRF) to 10x3p_sc_ont_016 (CNN-BiLSTM-CRF, 3 conv × 64 filters, 1 BiLSTM × 32 units). Empirically validated to ~525 kb on 4× L40S.
- predict_with_backoff now raises on stale model after K.clear_session(), XLA disabled (tf2crf Viterbi incompatibility on TF 2.15), CUDA_VISIBLE_DEVICES no longer blanked when no GPUs found, model state persisted across length bins.
- --gene-body-bed flag accepting RSeQC-style BED12 (replaces auto-extraction from GTF); single-pass BAM scan is significantly faster.
- --min-flank/--max-flank for terminal cDNA flanks, separate from interior --min/max-spacer; pre-flight read-length math corrected in assess-model.
- barcode-correct falls back to bundled seq_orders.yaml when only --model-name is given; extract_annotated_seqs no longer crashes on empty Starts.
- resource_requirements.qmd re-anchored on _016 (bpt 6,644 → 2,916, k 73.4 → 32.2 KB/bp); new --max-batch-size per-VRAM-tier tuning guidance; XLA references removed; tentative-figures disclaimer added pending per-GPU benchmarking.
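The CUDA_VISIBLE_DEVICES change summarized above amounts to "only write the variable when there is something to write". A minimal sketch of that rule, assuming GPU discovery has already produced a list of device indices; export_visible_gpus and its env parameter are hypothetical names, not functions from annotate_new_data.py:

```python
import os
from typing import MutableMapping, Optional, Sequence


def export_visible_gpus(gpu_ids: Sequence[int],
                        env: Optional[MutableMapping[str, str]] = None) -> bool:
    """Set CUDA_VISIBLE_DEVICES only when at least one GPU was found.

    Returns True if the variable was written. When gpu_ids is empty the
    environment is left untouched, so a value inherited from a parent
    process is not masked with an empty string.
    """
    env = os.environ if env is None else env
    if not gpu_ids:
        return False
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    return True
```

Passing a plain dict as env makes the behavior easy to check without touching the real process environment.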