
perf: remove memory bottleneck in stages 5 and 6 #1328

Draft

0xAndoroid wants to merge 11 commits into main from test/ram

Conversation

@0xAndoroid
Collaborator

No description provided.

@quangvdao
Contributor

It seems like these changes will affect performance. Should we gate this and/or make the number of delayed binding rounds for RaPolynomial more dynamic?

Signed-off-by: Andrew Tretyakov <42178850+0xAndoroid@users.noreply.github.com>
Behind the `monitor` feature, use tikv-jemallocator as the global
allocator with dirty_decay_ms:0 and muzzy_decay_ms:0. This forces
immediate page return to the OS so RSS accurately reflects live
heap usage — the system allocator holds freed pages indefinitely,
inflating RSS from ~2 GB (actual) to ~5 GB (watermark).
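
A minimal sketch of what this wiring could look like, assuming the tikv-jemallocator crate behind an optional `monitor` feature (the feature/crate wiring below is illustrative, not copied from the repo; note that tikv-jemalloc-sys supports baking in allocator options at build time via its `JEMALLOC_SYS_WITH_MALLOC_CONF` environment variable):

```rust
// Cargo.toml (sketch):
// [features]
// monitor = ["dep:tikv-jemallocator"]
//
// [dependencies]
// tikv-jemallocator = { version = "0.5", optional = true }

// Behind the `monitor` feature, install jemalloc as the global allocator.
#[cfg(feature = "monitor")]
#[global_allocator]
static GLOBAL: tikv_jemallocator::Jemalloc = tikv_jemallocator::Jemalloc;

// Build with the decay options baked in, e.g.:
//   JEMALLOC_SYS_WITH_MALLOC_CONF="dirty_decay_ms:0,muzzy_decay_ms:0" \
//       cargo build --features monitor
// With both decay times at 0, jemalloc purges freed pages back to the OS
// immediately, so RSS tracks live heap instead of the allocation watermark.
```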
Stage 5: Replace dense ra_polys materialization (8×T×32B = 2GB at 2^23)
with lazy RaPolynomial<u16, F> using combined expanding-table lookups.
Each virtual RA poly stores a 64K-entry combined table + per-cycle u16
keys (4B/cycle). Automatically materializes to dense after 3 cycle
rounds at T/8 length (0.25GB). Falls back to dense for large K configs.

Stage 6: Add SharedRaRound4 to SharedRaPolynomials state machine,
delaying materialization by one extra round. This staggers the
booleanity materialization (now at T/16) from InstructionRaSumcheck's
materialization (at T/8), preventing simultaneous peak allocations.
Booleanity dense polys: 1.31GB → 0.66GB.
Delays RaPolynomial materialization by one more round (Round3→Round4
instead of Round3→RoundN). Dense polys materialize at T/16 instead of
T/8, halving peak allocation during stage 6 InstructionRa transition.
SharedRaPolynomials now stores Arc<Vec<RaIndices>> instead of owned
Vec<RaIndices>. BooleanitySumcheckProver::initialize returns the
shared Arc so InstructionRaSumcheckProver can reuse it. Currently
InstructionRa still creates transposed per-poly indices from the
shared data; full deduplication requires changing the product-sum
evaluation functions.
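
The sharing pattern can be sketched as follows (all names here are illustrative stand-ins, not the real prover types):

```rust
use std::sync::Arc;

// Per-cycle RA indices are built once and shared between provers via Arc
// instead of each prover owning its own copy.
#[derive(Clone, Copy)]
struct RaIndices([Option<u8>; 4]); // illustrative stand-in

struct BooleanityProver {
    indices: Arc<Vec<RaIndices>>,
}

struct InstructionRaProver {
    indices: Arc<Vec<RaIndices>>,
}

impl BooleanityProver {
    /// Build the indices once and return the Arc so that a later prover
    /// can reuse the same allocation.
    fn initialize(trace_len: usize) -> (Self, Arc<Vec<RaIndices>>) {
        let indices = Arc::new(vec![RaIndices([Some(0); 4]); trace_len]);
        (Self { indices: Arc::clone(&indices) }, indices)
    }
}

fn main() {
    let (booleanity, shared) = BooleanityProver::initialize(8);
    let instruction_ra = InstructionRaProver { indices: shared };
    // Both provers point at the same allocation: no duplicated storage.
    assert!(Arc::ptr_eq(&booleanity.indices, &instruction_ra.indices));
    println!("shared: {} entries", instruction_ra.indices.len());
}
```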
InstructionRaSumcheckProver now uses SharedRaPolynomials directly
instead of creating 32 separate transposed Vec<Option<u8>> arrays.
The shared non-transposed Vec<RaIndices> is accessed via
get_bound_coeff(poly_idx, j) during the sumcheck. New
compute_shared_ra_sum_of_products_evals_d{4,8,16} functions in
mles_product_sum.rs provide the same eval_prod pattern but read
from SharedRaPolynomials.
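
In spirit, the access pattern looks like the sketch below (hypothetical types; `u64` stands in for the field type, and the real `compute_shared_ra_sum_of_products_evals_d{4,8,16}` functions are specialized per polynomial count):

```rust
// Rather than 32 transposed per-poly arrays, each product term reads the
// shared, non-transposed per-cycle data through get_bound_coeff(poly, j).
struct SharedRa {
    d: usize,               // number of RA polynomials multiplied per cycle
    coeffs: Vec<Vec<u64>>,  // coeffs[j][poly_idx]; non-transposed layout
}

impl SharedRa {
    fn get_bound_coeff(&self, poly_idx: usize, j: usize) -> u64 {
        self.coeffs[j][poly_idx]
    }
}

/// Sum over cycles j of the product of the d bound coefficients at j,
/// mirroring the eval_prod pattern without transposed copies.
fn sum_of_products(ra: &SharedRa) -> u64 {
    (0..ra.coeffs.len())
        .map(|j| (0..ra.d).map(|p| ra.get_bound_coeff(p, j)).product::<u64>())
        .sum()
}

fn main() {
    let ra = SharedRa { d: 2, coeffs: vec![vec![2, 3], vec![4, 5]] };
    // 2*3 + 4*5 = 26
    assert_eq!(sum_of_products(&ra), 26);
    println!("{}", sum_of_products(&ra));
}
```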
RamRaVirtualSumcheckProver now reads RAM RA indices from the shared
Arc<Vec<RaIndices>> instead of re-reading the trace. Saves 0.13 GB
of transposed RAM index storage and avoids a full trace iteration.
Further delays booleanity materialization to T/32, staggering it from
InstructionRa's Round4→RoundN at T/16. This prevents simultaneous
dense poly allocation spikes in stage 6.
The LazyTraceIterator holds the emulator state (including HashMap-based
memory) which persists until the prover is dropped. After streaming
witness commitment finishes, this data is never used again. Drop it
immediately to free the emulator's memory map.
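
A toy illustration of the lifetime fix (names are hypothetical stand-ins for the real iterator and commitment code):

```rust
use std::collections::HashMap;

// The iterator owns the emulator state, including a HashMap-backed memory
// map. Dropping it as soon as witness commitment finishes releases that
// memory instead of keeping it alive until the prover itself is dropped.
struct LazyTraceIter {
    memory: HashMap<u64, u8>, // emulator memory map (illustrative)
}

fn commit_witness(it: &mut LazyTraceIter) -> usize {
    it.memory.len() // stand-in for streaming the witness commitment
}

fn main() {
    let mut it = LazyTraceIter {
        memory: (0..1000u64).map(|addr| (addr, 0u8)).collect(),
    };
    let committed = commit_witness(&mut it);
    // Explicitly drop the iterator now that commitment is done, freeing
    // the emulator's memory map at the earliest possible point.
    drop(it);
    assert_eq!(committed, 1000);
    println!("committed {committed} entries");
}
```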
Move tikv-jemallocator from monitor-only to the prover feature so
jemalloc with aggressive page purging is used in all builds, not
just profiling. This ensures RSS accurately reflects live heap.
… state machine

Replace hand-unrolled Round1-4 (RaPolynomial) and Round1-5
(SharedRaPolynomials) with a single TableRound type per enum.
Each bind doubles table groups; materialization triggers at a
configurable threshold (8 / 16 groups respectively).

-515 lines, no behavioral change.
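
The unified round type can be sketched roughly as follows (a simplified illustration, not the actual enum):

```rust
// Instead of hand-unrolled Round1..RoundN variants, a single TableRound
// doubles its table-group count on every bind and signals when the
// configurable materialization threshold is reached.
struct TableRound {
    groups: usize,    // doubles on every bind
    threshold: usize, // e.g. 8 for RaPolynomial, 16 for SharedRaPolynomials
}

impl TableRound {
    /// Bind one sumcheck variable; returns true when it is time to
    /// materialize to a dense polynomial.
    fn bind(&mut self) -> bool {
        self.groups *= 2;
        self.groups >= self.threshold
    }
}

fn main() {
    let mut round = TableRound { groups: 1, threshold: 8 };
    let mut binds = 0;
    loop {
        binds += 1;
        if round.bind() {
            break;
        }
    }
    // 1 -> 2 -> 4 -> 8: materialization triggers on the third bind.
    assert_eq!(binds, 3);
    assert_eq!(round.groups, 8);
    println!("materialized after {binds} binds");
}
```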
@0xAndoroid
Collaborator Author

@quangvdao yes. I'm still experimenting with this at different trace lengths.
