perf: remove memory bottleneck in stages 5 and 6 #1328
Draft
0xAndoroid wants to merge 11 commits into main from
Conversation
Contributor
It seems like these changes will affect performance. Should we gate this and/or make the number of delayed binding rounds for
Signed-off-by: Andrew Tretyakov <42178850+0xAndoroid@users.noreply.github.com>
Behind the `monitor` feature, use tikv-jemallocator as the global allocator with `dirty_decay_ms:0` and `muzzy_decay_ms:0`. This forces freed pages to be returned to the OS immediately, so RSS accurately reflects live heap usage; the system allocator holds freed pages indefinitely, inflating RSS from ~2 GB (actual live heap) to ~5 GB (allocation watermark).
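As a rough sketch (not compiled here; it requires the `tikv-jemallocator` crate, and the exact feature gating is this PR's), the wiring might look like:

```rust
// Gate the allocator swap behind the `monitor` feature.
#[cfg(feature = "monitor")]
#[global_allocator]
static GLOBAL: tikv_jemallocator::Jemalloc = tikv_jemallocator::Jemalloc;

// jemalloc reads its options from an exported `malloc_conf` symbol.
// dirty_decay_ms:0 / muzzy_decay_ms:0 tell it to return dirty and
// muzzy pages to the OS immediately instead of decaying them lazily.
#[cfg(feature = "monitor")]
#[allow(non_upper_case_globals)]
#[export_name = "malloc_conf"]
pub static malloc_conf: &[u8] = b"dirty_decay_ms:0,muzzy_decay_ms:0\0";
```

The `export_name = "malloc_conf"` pattern is the standard way to pass compile-time options to jemalloc from Rust.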
Stage 5: Replace the dense `ra_polys` materialization (8 × T × 32 B = 2 GB at T = 2^23) with a lazy `RaPolynomial<u16, F>` using combined expanding-table lookups. Each virtual RA poly stores a 64K-entry combined table plus per-cycle u16 keys (4 B/cycle), and automatically materializes to dense form after 3 cycle rounds, at length T/8 (0.25 GB). Falls back to dense for large-K configs.

Stage 6: Add `SharedRaRound4` to the `SharedRaPolynomials` state machine, delaying materialization by one extra round. This staggers the booleanity materialization (now at T/16) from `InstructionRaSumcheck`'s materialization (at T/8), preventing simultaneous peak allocations. Booleanity dense polys: 1.31 GB → 0.66 GB.
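The lazy representation above can be sketched as follows. This is illustrative only: `RaPoly`, `get`, and `materialize` are hypothetical names, and `f64` stands in for the field type `F`.

```rust
// A virtual RA polynomial stores a small combined lookup table plus
// one u16 key per cycle, and only expands to a dense vector on demand.
enum RaPoly {
    // Combined table (64K entries in the real prover) + per-cycle keys.
    Lazy { table: Vec<f64>, keys: Vec<u16> },
    // Fully materialized coefficients, produced after enough binding
    // rounds have shrunk the effective length.
    Dense(Vec<f64>),
}

impl RaPoly {
    // Read coefficient j without forcing materialization.
    fn get(&self, j: usize) -> f64 {
        match self {
            RaPoly::Lazy { table, keys } => table[keys[j] as usize],
            RaPoly::Dense(coeffs) => coeffs[j],
        }
    }

    // Expand the lookup into a dense vector (done once, after the
    // cycle rounds have reduced the length to T/8).
    fn materialize(&mut self) {
        if let RaPoly::Lazy { table, keys } = self {
            let dense = keys.iter().map(|&k| table[k as usize]).collect();
            *self = RaPoly::Dense(dense);
        }
    }
}
```

The memory win comes from the lazy arm: 4 B of key per cycle instead of a full 32 B field element per cycle per polynomial.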
Delays `RaPolynomial` materialization by one more round: the state machine now transitions Round3→Round4 instead of Round3→RoundN. Dense polys materialize at length T/16 instead of T/8, halving peak allocation during the stage 6 `InstructionRa` transition.
`SharedRaPolynomials` now stores `Arc<Vec<RaIndices>>` instead of an owned `Vec<RaIndices>`. `BooleanitySumcheckProver::initialize` returns the shared `Arc` so `InstructionRaSumcheckProver` can reuse it. Currently `InstructionRa` still creates transposed per-poly indices from the shared data; full deduplication requires changing the product-sum evaluation functions.
`InstructionRaSumcheckProver` now uses `SharedRaPolynomials` directly instead of creating 32 separate transposed `Vec<Option<u8>>` arrays. The shared non-transposed `Vec<RaIndices>` is accessed via `get_bound_coeff(poly_idx, j)` during the sumcheck. New `compute_shared_ra_sum_of_products_evals_d{4,8,16}` functions in `mles_product_sum.rs` provide the same `eval_prod` pattern but read from `SharedRaPolynomials`.
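A minimal sketch of the shared, non-transposed access pattern; the layout of `RaIndices` and the stand-in coefficient mapping here are assumptions for illustration, not the real types.

```rust
// One entry per cycle, holding the index for each of D polynomials
// (D = 4 here for brevity; None means the coefficient is zero).
struct RaIndices([Option<u8>; 4]);

// A single Arc-shared, non-transposed index array replaces 32
// separate transposed Vec<Option<u8>> arrays.
struct SharedRaPolynomials {
    indices: std::sync::Arc<Vec<RaIndices>>,
}

impl SharedRaPolynomials {
    // Analogue of get_bound_coeff(poly_idx, j): look up cycle j's
    // index for polynomial poly_idx. The real code maps the index
    // through a bound table; a raw cast stands in for that here.
    fn get_bound_coeff(&self, poly_idx: usize, j: usize) -> f64 {
        match self.indices[j].0[poly_idx] {
            Some(k) => k as f64,
            None => 0.0,
        }
    }
}
```

The design choice is to pay an indexing indirection per access in exchange for storing the indices once, cycle-major, rather than once per polynomial.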
`RamRaVirtualSumcheckProver` now reads RAM RA indices from the shared `Arc<Vec<RaIndices>>` instead of re-reading the trace. Saves 0.13 GB of transposed RAM index storage and avoids a full trace iteration.
Further delays booleanity materialization to T/32, staggering it from `InstructionRa`'s Round4→RoundN at T/16. This prevents simultaneous dense-poly allocation spikes in stage 6.
The `LazyTraceIterator` holds the emulator state (including the HashMap-based memory), which persists until the prover is dropped. After streaming witness commitment finishes, this data is never used again; drop it immediately to free the emulator's memory map.
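The pattern can be sketched as follows; the names (`Prover`, `emulator_memory`, `commit_streaming_witness`) are hypothetical stand-ins for the real structures around `LazyTraceIterator`.

```rust
use std::collections::HashMap;

struct Prover {
    // Emulator state needed only during streaming witness commitment.
    // Wrapping it in Option lets us drop it early without making the
    // rest of the struct's lifetime awkward.
    emulator_memory: Option<HashMap<u64, u8>>,
}

impl Prover {
    fn commit_streaming_witness(&mut self) {
        // ... stream the witness using self.emulator_memory ...

        // Once commitment finishes this data is never read again, so
        // free it now instead of when the whole Prover is dropped.
        self.emulator_memory = None;
    }
}
```

Setting the `Option` to `None` drops the map (and its backing allocation) at that point, not at the end of the prover's lifetime.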
Move tikv-jemallocator from the monitor-only feature to the `prover` feature so jemalloc with aggressive page purging is used in all prover builds, not just profiling builds. This ensures RSS accurately reflects live heap usage.
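As a rough Cargo.toml sketch (the version number and exact feature names here are assumptions, not taken from the PR):

```toml
[dependencies]
# Optional so it is only compiled in when a feature pulls it in.
tikv-jemallocator = { version = "0.6", optional = true }

[features]
# Previously jemalloc was wired up only under `monitor`.
monitor = ["dep:tikv-jemallocator"]
# Now the prover feature pulls it in too, so every prover build
# gets jemalloc with aggressive page purging.
prover = ["dep:tikv-jemallocator"]
```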
… state machine Replace the hand-unrolled Round1–4 (`RaPolynomial`) and Round1–5 (`SharedRaPolynomials`) variants with a single `TableRound` type per enum. Each bind doubles the table groups; materialization triggers at a configurable threshold (8 and 16 groups respectively). −515 lines, no behavioral change.
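The doubling-and-threshold logic can be sketched like this, assuming heavily simplified state (the real `TableRound` also carries the table data itself):

```rust
// One TableRound value replaces the hand-unrolled Round1..Round4 /
// Round1..Round5 enum variants.
struct TableRound {
    groups: usize,    // number of table groups accumulated so far
    threshold: usize, // e.g. 8 for RaPolynomial, 16 for SharedRaPolynomials
}

impl TableRound {
    fn new(threshold: usize) -> Self {
        TableRound { groups: 1, threshold }
    }

    // One binding round: double the group count, and report whether
    // the materialization threshold has been reached.
    fn bind(&mut self) -> bool {
        self.groups *= 2;
        self.groups >= self.threshold
    }
}
```

Making the threshold a parameter is what lets one type serve both enums: the per-enum round count becomes data instead of distinct variants.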
Collaborator
Author
@quangvdao yes. I'm still experimenting with this at different trace lengths.