Skip to content

Zig drop-in replacement for bcrham and ig-sw (zero C++ deps)#339

Open
matsen wants to merge 14 commits intomainfrom
zig-core-integration
Open

Zig drop-in replacement for bcrham and ig-sw (zero C++ deps)#339
matsen wants to merge 14 commits intomainfrom
zig-core-integration

Conversation

@matsen
Copy link
Collaborator

@matsen matsen commented Mar 11, 2026

Summary

Wires in matsengrp/partis-zig-core as a zero-dependency drop-in replacement for both the C++ bcrham binary and the C ig-sw binary. Both Zig binaries are selected at runtime via environment variables, so the default C++ path is fully preserved.

Changes

Three hunks in partitiondriver.py (~6 lines total):

  • get_hmm_cmd_str: PARTIS_ZIG_CORE_EXE env var overrides the bcrham path when set
  • print_partition_dbgfo: guard — Zig binary doesn't emit bcrham timing/debug strings
  • check_wait_times: guard — same reason

Two hunks in waterer.py:

  • __init__: skip ig-sw binary existence check when PARTIS_ZIG_IGSW_EXE is set
  • get_ig_sw_cmd_str: PARTIS_ZIG_IGSW_EXE env var overrides the ig-sw path when set

One new script:

  • bin/install-zig-backend.sh: downloads Zig if needed, clones partis-zig-core, builds both static binaries, prints the export lines

Install

bin/install-zig-backend.sh
# copy-paste the two `export PARTIS_ZIG_CORE_EXE=...` and `export PARTIS_ZIG_IGSW_EXE=...` lines printed by the script
partis annotate ...                 # now uses both Zig binaries

No GSL, OpenBLAS, yaml-cpp, or other C++ library dependencies. Works on Linux x86_64 and macOS ARM64 out of the box.

Validation

Correctness: Ran partis cache-parameters on 36 BCR sequences from test/ref-results/test/simu.yaml with both backends:

  • Gene calls (V/D/J): identical across all 36 sequences
  • Deletion lengths, insertion sequences: identical
  • All 26 estimated parameter CSVs: byte-for-bit identical

ig-sw (ksw.zig): Verified on macOS ARM64 — 36/36 sequences pass smith-waterman with the Zig ig-sw, matching the C binary. Note: ksw.zig replaces the SSE2-dependent ksw.c with cross-platform @Vector SIMD, enabling builds on ARM without x86 intrinsics.

Test suite: test/test.py --quick passes with both Zig binaries substituted.

Equivalence: 18 checkpoint-level JSON comparisons between C++ and Zig all pass.

Backward compatibility

Unset both PARTIS_ZIG_CORE_EXE and PARTIS_ZIG_IGSW_EXE → original C++ bcrham and C ig-sw used, no behaviour change.

matsen and others added 6 commits March 10, 2026 04:41
When PARTIS_ZIG_CORE_EXE is set, partitiondriver uses that executable
instead of packages/ham/bcrham. This enables end-to-end testing of the
Zig port (partis-zig-core) against the partis Python test suite without
modifying any existing behavior.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The Zig binary doesn't emit bcrham's timing and debug strings to stdout,
so print_partition_dbgfo() and check_wait_times() crash when trying to
summarize None values. Skip both when PARTIS_ZIG_CORE_EXE is set.

Also use top-level `os` import instead of local `import os as _os`.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
All three zig-core-integration hunks now have inline comments explaining
the change (the other two guards already had comments).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Downloads Zig 0.15.2 if not present, clones matsengrp/partis-zig-core,
builds a static binary with no GSL/OpenBLAS deps, and prints the
PARTIS_ZIG_CORE_EXE export line to use it.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- waterer.py: skip ig-sw existence check when PARTIS_ZIG_IGSW_EXE is set
- waterer.py: use PARTIS_ZIG_IGSW_EXE binary in get_ig_sw_cmd_str when set
- bin/install-zig-backend.sh: print PARTIS_ZIG_IGSW_EXE export alongside
  existing PARTIS_ZIG_CORE_EXE export

With both env vars set, partis runs end-to-end with no C binaries.
validate-partis-suite.sh --quick PASS with full Zig pipeline.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@matsen
Copy link
Collaborator Author

matsen commented Mar 11, 2026

Verified on Linux x86_64

bin/install-zig-backend.sh
export PARTIS_ZIG_CORE_EXE=.../zig-out/bin/partis-zig-core
export PARTIS_ZIG_IGSW_EXE=.../zig-out/bin/partis-zig-igsw
partis annotate --infname test/ref-results/test/simu.yaml \
  --parameter-dir test/ref-results/test/parameters/simu \
  --outfname /tmp/zig-test-out.yaml

Build succeeded, annotate ran to completion: 36/36 sequences annotated.

@matsen matsen changed the title Zig drop-in replacement for bcrham (no GSL/OpenBLAS deps) Zig drop-in replacement for bcrham and ig-sw (zero C++ deps) Mar 11, 2026
@psathyrella
Copy link
Owner

This looks great! Thanks for working on this.

A few quick thoughts

  • I'm not sure I understand why it's worth making the zig binary without debug printing or timing info. The debug printing is really the basis for figuring out what's going on with weird samples and finding bugs in a way that can't be replicated with debuggers. The timing info is ingested by other code so I think is also likely to break things if we change or remove it.
  • cache-parameters only runs viterbi, and I didn't run a coverage tool but it seems like since this doesn't run forward or hierarchical agglomeration it probably only uses ~half the ham code. I mentioned elsewhere the test commands we should run, so I'm assuming this is just an initial partial test implementation.
  • the packages/ham/ git submodule seems to point to a nonexistent ham commit -- I imagine this is supposed to point to the zig ham repo? Also it looks like ig-sw is still pointing to, well, ig-sw so I think the zig version isn't plugged in yet? In either case I didn't figure out how to have a look through any @copilot actual zig code.
  • I assume the environment variable approach to switching is the best for reasons I don't understand, but I would just note that I don't think anything else in partis is configured with environment variables, so it might be good to at least document it carefully, if not switch some other things to use it for more consistency.

Copy link

Copilot AI commented Mar 11, 2026

@psathyrella I've opened a new pull request, #340, to work on those changes. Once the pull request is ready, I'll request review from you.

@matsen
Copy link
Collaborator Author

matsen commented Mar 11, 2026

Hey Duncan! Thanks for the comment. I'll work on all these bits.

@matsen
Copy link
Collaborator Author

matsen commented Mar 11, 2026

@psathyrella if it seems like this has legs, I wonder what you would think about bringing the Zig code to be part of the partis repo. It would simplify install and future development.

matsen and others added 4 commits March 11, 2026 14:42
… CLI args

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- test/test.py: add --bcrham-binary and --ig-sw-binary passthrough args
- partitiondriver.py: remove path-comparison guards in print_partition_dbgfo() and check_wait_times() now that Zig binary emits bcrham-compatible debug strings to stdout

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
partitiondriver.py / utils.py:
- check_wait_times: skip timing check if any bcrham time is None
- print_partition_dbgfo: show 'n/a' when timing is None
- summarize_bcrham_dbgstrs: filter None before min/max

zig-backend/partis-zig-core: bump to commit with use-after-free, double-free,
  missing cache writes, and ksw zero-length panic fixes

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@matsen
Copy link
Collaborator Author

matsen commented Mar 12, 2026

Here's a progress update after running the full test suite (test/test.py) with both --bcrham-binary and --ig-sw-binary pointing to the Zig binaries. All tests through cache-parameters-data now pass (green). The simulate step fails because R is not installed in this environment — same behavior as with the C++ binaries.

Bugs found and fixed

glomerator.zig — memory safety:

  1. Use-after-free in calculateNaiveSeq: was returning a []const u8 slice into result.best_event.?.naive_seq, but defer result.deinit() freed that memory before the caller could use it. Fixed by changing the return type to ![]u8 and dupe-ing the string before the defer fires.

  2. Double-free in cluster: a shallow clone of initial_partition deferred freeing its string items, but those same strings were owned by self.initial_partition. The initial variable was dead code — removed it entirely.

  3. writeCacheFile never called: cacheNaiveSeqs and cluster both computed naive sequences but never wrote them to output_cachefname, leaving 0-byte hmm_cached_info.csv files in subworkdirs. merge_files skips 0-byte files, so the cache was silently lost and the non-empty subworkdirs caused os.rmdir to fail with ENOTEMPTY. Fixed by calling writeCacheFile() in both paths.

  4. Wrong buffered-writer API in writeCacheFile: used f.writer() without a buffer argument; fixed to f.writer(&wr_buf) with an explicit flush().

ksw.zig — integer overflow panics on zero-length sequences:

The test data (test/mishmash.fa) includes a repetitive crap-1 NUKES sequence that results in zero-length query or target sequences being passed to the alignment functions. Zig debug mode catches the unsigned underflow that C silently handled as a signed negative:

  • ksw_u8 and ksw_i16: slen = 0H0[slen - 1] panics. Fixed with early if (slen == 0) return g_defr.
  • ksw_global: tlen_u = 0tlen_u - 1 panics in the backtrack phase. Fixed with early if (qlen_u == 0 or tlen_u == 0) return 0.

partitiondriver.py / utils.py — Python None guards:

The Zig bcrham doesn't emit the same timing line as C++ bcrham, so bcrham_time is None. Added guards in summarize_bcrham_dbgstrs (filter None before min/max), print_partition_dbgfo (show "n/a"), and check_wait_times (skip the check).

Test run command

test/test.py \
  --bcrham-binary zig-backend/partis-zig-core/zig-out/bin/partis-zig-core \
  --ig-sw-binary  zig-backend/partis-zig-core/zig-out/bin/partis-zig-igsw

All 9 runnable steps pass. The submodule has been pushed to matsengrp/partis-zig-core on branch zig-core-integration.

… fixes

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@matsen
Copy link
Collaborator Author

matsen commented Mar 12, 2026

--paired test update — all paired tests now pass (through cache-parameters-data). simulate still fails due to R not being installed, same as before.

Two more bugs fixed while getting --paired working:

dp_handler.zig — GPA memory leak (errdeferdefer for qstrs)

In fillTrellis, when the cached-trellis branch is taken, qstrs is populated but ownership is never transferred to a TrellisEntry. The errdefer only fires on error, so on normal return qstrs leaked. Changed to defer — safe because the no-cache branch sets qstrs = .{} before storing, so the defer runs on an empty list.

trellis.zig / dp_handler.zig — dangling self-referential _ptr fields (GPF)

Trellis stores both owned data (e.g. viterbi_indices: ArrayListUnmanaged(i32)) and pointer-to-active-data (e.g. viterbi_indices_ptr: ?*ArrayListUnmanaged(i32)). The non-cached viterbi path sets viterbi_indices_ptr = &self.viterbi_indices (self-referential).

TrellisEntry stores Trellis by value inside an ArrayListUnmanaged. When a second entry is appended for the same gene (different kset, no query string match), the list may reallocate its backing store, copying all entries to a new address and freeing the old one. The first entry's viterbi_indices_ptr then points into freed memory — a dangling pointer causing a General Protection Exception when the cached trellis path is taken for a subsequent kset.

Fix: add Trellis.fixupPtrs() which resets all _ptr fields to point to self.*, and call it on all entries after every cache_list.append.

Also added length == 0 and bounds guards to viterbiIndicesAt / endingViterbiLogProbAt / endingForwardLogProbAt to prevent usize underflow when zero-length k_j ksets are tried on light-chain sequences.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants