
perf: remove memory bottleneck in stages 5 and 6 #1328

Draft

0xAndoroid wants to merge 11 commits into main from test/ram

Conversation

@0xAndoroid
Collaborator

No description provided.

@quangvdao
Contributor

It seems like these changes will affect performance. Should we gate this and/or make the number of delayed binding rounds for RaPolynomial more dynamic?

Signed-off-by: Andrew Tretyakov <42178850+0xAndoroid@users.noreply.github.com>
Behind the `monitor` feature, use tikv-jemallocator as the global
allocator with dirty_decay_ms:0 and muzzy_decay_ms:0. This forces
immediate page return to the OS so RSS accurately reflects live
heap usage — the system allocator holds freed pages indefinitely,
inflating RSS from ~2 GB (actual) to ~5 GB (watermark).
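
A minimal sketch of what this wiring could look like, assuming the tikv-jemallocator crate behind an optional `monitor` feature (the feature/crate wiring below is illustrative, not copied from the repo; note that tikv-jemalloc-sys supports baking in allocator options at build time via its `JEMALLOC_SYS_WITH_MALLOC_CONF` environment variable):

```rust
// Cargo.toml (sketch):
// [features]
// monitor = ["dep:tikv-jemallocator"]
//
// [dependencies]
// tikv-jemallocator = { version = "0.5", optional = true }

// Behind the `monitor` feature, install jemalloc as the global allocator.
#[cfg(feature = "monitor")]
#[global_allocator]
static GLOBAL: tikv_jemallocator::Jemalloc = tikv_jemallocator::Jemalloc;

// Build with the decay options baked in, e.g.:
//   JEMALLOC_SYS_WITH_MALLOC_CONF="dirty_decay_ms:0,muzzy_decay_ms:0" \
//       cargo build --features monitor
// With both decay times at 0, jemalloc purges freed pages back to the OS
// immediately, so RSS tracks live heap instead of the allocation watermark.
```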
Stage 5: Replace dense ra_polys materialization (8×T×32B = 2GB at 2^23)
with lazy RaPolynomial<u16, F> using combined expanding-table lookups.
Each virtual RA poly stores a 64K-entry combined table + per-cycle u16
keys (4B/cycle). Automatically materializes to dense after 3 cycle
rounds at T/8 length (0.25GB). Falls back to dense for large K configs.

Stage 6: Add SharedRaRound4 to SharedRaPolynomials state machine,
delaying materialization by one extra round. This staggers the
booleanity materialization (now at T/16) from InstructionRaSumcheck's
materialization (at T/8), preventing simultaneous peak allocations.
Booleanity dense polys: 1.31GB → 0.66GB.
Delays RaPolynomial materialization by one more round (Round3→Round4
instead of Round3→RoundN). Dense polys materialize at T/16 instead of
T/8, halving peak allocation during stage 6 InstructionRa transition.
SharedRaPolynomials now stores Arc<Vec<RaIndices>> instead of owned
Vec<RaIndices>. BooleanitySumcheckProver::initialize returns the
shared Arc so InstructionRaSumcheckProver can reuse it. Currently
InstructionRa still creates transposed per-poly indices from the
shared data; full deduplication requires changing the product-sum
evaluation functions.
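
The sharing pattern can be sketched as follows (all names here are illustrative stand-ins, not the real prover types):

```rust
use std::sync::Arc;

// Per-cycle RA indices are built once and shared between provers via Arc
// instead of each prover owning its own copy.
#[derive(Clone, Copy)]
struct RaIndices([Option<u8>; 4]); // illustrative stand-in

struct BooleanityProver {
    indices: Arc<Vec<RaIndices>>,
}

struct InstructionRaProver {
    indices: Arc<Vec<RaIndices>>,
}

impl BooleanityProver {
    /// Build the indices once and return the Arc so that a later prover
    /// can reuse the same allocation.
    fn initialize(trace_len: usize) -> (Self, Arc<Vec<RaIndices>>) {
        let indices = Arc::new(vec![RaIndices([Some(0); 4]); trace_len]);
        (Self { indices: Arc::clone(&indices) }, indices)
    }
}

fn main() {
    let (booleanity, shared) = BooleanityProver::initialize(8);
    let instruction_ra = InstructionRaProver { indices: shared };
    // Both provers point at the same allocation: no duplicated storage.
    assert!(Arc::ptr_eq(&booleanity.indices, &instruction_ra.indices));
    println!("shared: {} entries", instruction_ra.indices.len());
}
```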
InstructionRaSumcheckProver now uses SharedRaPolynomials directly
instead of creating 32 separate transposed Vec<Option<u8>> arrays.
The shared non-transposed Vec<RaIndices> is accessed via
get_bound_coeff(poly_idx, j) during the sumcheck. New
compute_shared_ra_sum_of_products_evals_d{4,8,16} functions in
mles_product_sum.rs provide the same eval_prod pattern but read
from SharedRaPolynomials.
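
In spirit, the access pattern looks like the sketch below (hypothetical types; `u64` stands in for the field type, and the real `compute_shared_ra_sum_of_products_evals_d{4,8,16}` functions are specialized per polynomial count):

```rust
// Rather than 32 transposed per-poly arrays, each product term reads the
// shared, non-transposed per-cycle data through get_bound_coeff(poly, j).
struct SharedRa {
    d: usize,               // number of RA polynomials multiplied per cycle
    coeffs: Vec<Vec<u64>>,  // coeffs[j][poly_idx]; non-transposed layout
}

impl SharedRa {
    fn get_bound_coeff(&self, poly_idx: usize, j: usize) -> u64 {
        self.coeffs[j][poly_idx]
    }
}

/// Sum over cycles j of the product of the d bound coefficients at j,
/// mirroring the eval_prod pattern without transposed copies.
fn sum_of_products(ra: &SharedRa) -> u64 {
    (0..ra.coeffs.len())
        .map(|j| (0..ra.d).map(|p| ra.get_bound_coeff(p, j)).product::<u64>())
        .sum()
}

fn main() {
    let ra = SharedRa { d: 2, coeffs: vec![vec![2, 3], vec![4, 5]] };
    // 2*3 + 4*5 = 26
    assert_eq!(sum_of_products(&ra), 26);
    println!("{}", sum_of_products(&ra));
}
```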
RamRaVirtualSumcheckProver now reads RAM RA indices from the shared
Arc<Vec<RaIndices>> instead of re-reading the trace. Saves 0.13 GB
of transposed RAM index storage and avoids a full trace iteration.
Further delays booleanity materialization to T/32, staggering it from
InstructionRa's Round4→RoundN at T/16. This prevents simultaneous
dense poly allocation spikes in stage 6.
The LazyTraceIterator holds the emulator state (including HashMap-based
memory) which persists until the prover is dropped. After streaming
witness commitment finishes, this data is never used again. Drop it
immediately to free the emulator's memory map.
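
A toy illustration of the lifetime fix (names are hypothetical stand-ins for the real iterator and commitment code):

```rust
use std::collections::HashMap;

// The iterator owns the emulator state, including a HashMap-backed memory
// map. Dropping it as soon as witness commitment finishes releases that
// memory instead of keeping it alive until the prover itself is dropped.
struct LazyTraceIter {
    memory: HashMap<u64, u8>, // emulator memory map (illustrative)
}

fn commit_witness(it: &mut LazyTraceIter) -> usize {
    it.memory.len() // stand-in for streaming the witness commitment
}

fn main() {
    let mut it = LazyTraceIter {
        memory: (0..1000u64).map(|addr| (addr, 0u8)).collect(),
    };
    let committed = commit_witness(&mut it);
    // Explicitly drop the iterator now that commitment is done, freeing
    // the emulator's memory map at the earliest possible point.
    drop(it);
    assert_eq!(committed, 1000);
    println!("committed {committed} entries");
}
```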
Move tikv-jemallocator from monitor-only to the prover feature so
jemalloc with aggressive page purging is used in all builds, not
just profiling. This ensures RSS accurately reflects live heap.
… state machine

Replace hand-unrolled Round1-4 (RaPolynomial) and Round1-5
(SharedRaPolynomials) with a single TableRound type per enum.
Each bind doubles table groups; materialization triggers at a
configurable threshold (8 / 16 groups respectively).

-515 lines, no behavioral change.
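
The unified round type can be sketched roughly as follows (a simplified illustration, not the actual enum):

```rust
// Instead of hand-unrolled Round1..RoundN variants, a single TableRound
// doubles its table-group count on every bind and signals when the
// configurable materialization threshold is reached.
struct TableRound {
    groups: usize,    // doubles on every bind
    threshold: usize, // e.g. 8 for RaPolynomial, 16 for SharedRaPolynomials
}

impl TableRound {
    /// Bind one sumcheck variable; returns true when it is time to
    /// materialize to a dense polynomial.
    fn bind(&mut self) -> bool {
        self.groups *= 2;
        self.groups >= self.threshold
    }
}

fn main() {
    let mut round = TableRound { groups: 1, threshold: 8 };
    let mut binds = 0;
    loop {
        binds += 1;
        if round.bind() {
            break;
        }
    }
    // 1 -> 2 -> 4 -> 8: materialization triggers on the third bind.
    assert_eq!(binds, 3);
    assert_eq!(round.groups, 8);
    println!("materialized after {binds} binds");
}
```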
@0xAndoroid
Collaborator Author

@quangvdao yes. I'm still experimenting with this at different trace lengths.
