perf(order_book): use try_emplace for baseline price levels by div0rce · Pull Request #138 · div0rce/quant-systems-lab

div0rce · 2026-06-24T23:16:56Z

Summary

Flamegraph-driven optimization of the hot path. The flamegraph's hot std::_Rb_tree::_M_emplace_unique + pmr allocation sits in the baseline OrderBook::level_for:

bids_.emplace(price, Level{...});   // before

std::map::emplace allocates a map node and constructs the value before checking for a duplicate key, then frees the node when the price level already exists. With a bounded price band the levels almost always already exist in steady state, so every resting order paid a wasted pmr node alloc+free.

try_emplace (C++17) does not allocate or construct when the key is present:

bids_.try_emplace(price);           // after

The intrusive store already uses try_emplace (append_resting/its level_for); this brings the baseline path in line.

Why it's safe (determinism preserved)

Semantics are identical: an absent level is still inserted empty with the map's pmr resource (scoped-allocator propagation — no explicit allocator arg needed), and erase_level_if_empty still prunes emptied levels, so best_bid/best_ask and the "no empty levels" invariant are unchanged.

Differential/determinism fixtures byte-identical across g++/clang++ and vs the committed copies.
OCaml differential oracle passes.
make check 265/265, make asan 265/265.

Measured win (controlled A/B)

Release -O3, baseline storage, qsl-bench profile 3 (the steady-state deep-book workload the flamegraph profiles), same host, back-to-back, 5 runs each:

	ops/sec	median
before (`emplace`)	8.05 / 8.03 / 7.85 / 7.93 / 8.06 M	~8.03M
after (`try_emplace`)	8.46 / 8.55 / 8.47 / 8.44 / 8.46 M	~8.46M

Non-overlapping ranges → ~+5% throughput, real (not noise). Back-to-back A/B controls for machine load.

Why `results/latest.txt` is unchanged

The committed micro-benchmark order_book add/mod/cancel churns single-order levels (each cancel empties the level, so the next add re-creates it → the key is absent every time and try_emplace cannot help), so it does not exercise the optimized existing-level fast path. The win is on the steady-state path with multiple orders per level. Regenerating latest.txt would only inject run-to-run variance (verified: a quiet-canary run leaves those scenarios within noise), so it is intentionally left at its quiet-machine snapshot rather than presenting variance as change.

🤖 Generated with Claude Code

The flamegraph's hot _Rb_tree::_M_emplace_unique + pmr allocation is in the baseline OrderBook::level_for: std::map::emplace(price, Level{...}) allocates a map node and constructs the value *before* checking for a duplicate key, then frees the node when the price level already exists. With a bounded price band the levels almost always already exist in steady state, so every resting order paid a wasted pmr node alloc+free. try_emplace (C++17) does not allocate or construct when the key is present — the intrusive store already uses it; this brings the baseline path in line. Semantics are identical: an absent level is still inserted empty with the map's pmr resource (scoped-allocator propagation), and erase_level_if_empty still prunes it, so best_bid/best_ask and the "no empty levels" invariant are unchanged. Measured A/B (Release -O3, baseline storage, qsl-bench profile 3s, same host, 5 runs each, non-overlapping ranges): ~8.03M -> ~8.46M ops/sec, ~+5%. Determinism preserved: fixtures byte-identical across g++/clang++ and vs the committed copies; OCaml differential passes. make check/asan 265/265. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

coderabbitai · 2026-06-24T23:17:04Z

Warning

Review limit reached

@div0rce, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 6 minutes and 11 seconds. Learn how PR review limits work.

Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file).

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based credits.

🚦 How do rate limits work?

CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan review availability.

For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, additional reviews become available more gradually as earlier reviews age out of the rolling window.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 9488137b-540c-41bf-972e-fcbd0a62bfbe

📥 Commits

Reviewing files that changed from the base of the PR and between c6740e5 and 9c8daa3.

📒 Files selected for processing (1)

src/engine/order_book.cpp

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch perf/orderbook-try-emplace

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands.}

codescene-delta-analysis

Our agent can fix these. Install it.

Gates Passed
6 Quality Gates Passed

Quality Gate Profile: Pay Down Tech Debt
Install CodeScene MCP: safeguard and uplift AI-generated code. Catch issues early with our IDE extension and CLI tool.

#139) Refreshes the committed flamegraph so its Source digest matches current src/ (PRs #136 #137 #138; Dirty inputs: no). The try_emplace optimization (#138) is visible: the previously top-3 std::_Rb_tree::_M_emplace_unique allocation frame (~1137 samples) is gone from the hot stacks, which are now the actual matching work — cancel, match_baseline, rest, new_limit. ~20k samples, zero [unknown]. Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>

…ct (#147) * docs: overhaul all stale docs for the post-v0.2.1 (v0.2.2) state Full staleness audit of every prose doc against current main. The anchors were frozen at v0.2.1 / 263 tests / "no active milestone" while 12 PRs (#135-#146) had merged and are unreleased. - Resume anchors (PROGRESS.md, HANDOFF.md): Current state / Last action / Next action / test count (263->270) brought current; the two duplicate frozen anchors and the stale macOS benchmark numbers fixed; a dated v0.2.2 log entry. - CLAUDE.md + AGENTS.md: the post-M35 roadmap-memory section now records the post-v0.2.1 hardening + perf wave (identical edit in both). - CHANGELOG.md: new [0.2.2] section (decoder enum rejection #136, network/CLI hardening #137/#140/#141/#143, real UBSan abort gate #142, ocaml diff_report #144, try_emplace ~+5% #138, index load-factor ~+18.6% #145). CMakeLists version 0.2.1 -> 0.2.2. - README: benchmark/flamegraph/limitations sections reflect the engine wins (measured on the profile workload, not the micro-bench table) and the gateway hardening; release_readiness 270/270 + UBSan gate + v0.2.2 scope. - Networking docs (socket_gateway, socket_hardening): connection cap, EINTR retry, transient-accept survival, fd-exhaustion handling, UDP send-error counting. replay_and_recovery: decode_command now rejects out-of-domain enums. binary_protocol/differential_testing/fix_protocol/SECURITY/recruiting_notes/ CONTRIBUTING: smaller accuracy updates. results/README: add the three socket artifacts. make check 270/270. Stale results/*.txt provenance digests regenerated separately. pool_backed_storage.md table follows its artifact regeneration. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * results: regenerate stale provenance artifacts for the v0.2.2 source The post-v0.2.1 source changes (#135-#146) left 14 results/*.txt with stale Source digests (the authoritative staleness signal per the provenance rules, not commit-hash drift). Regenerated via their make targets so each declares Dirty inputs: no against current HEAD: differential, pool_backed_storage, allocator_experiment, recovery_benchmarks, false_sharing_study, perf_stat_linux (partial PMU, QSL_PERF_ALLOW_PARTIAL), perf_report_linux, numa_affinity_study (linux-constrained), socket_load_summary, socket_profile_loopback, socket_stress_summary, dpdk_environment, nic_offload_environment. docs/pool_backed_storage.md: refreshed the median table, digest reference, and qualitative ordering from the regenerated artifact (contiguous fastest on four of five workloads; intrusive leads dense). The baseline rows now include the try_emplace (#138) and index load-factor (#145) wins. Honesty notes: these were measured on a thermally-warmed M2 from a long session, so absolute values run higher than a cool-host snapshot — the relative orderings and provenance digests are the load-bearing content, and every artifact carries a hardware/build-dependence caveat. latest.txt is regenerated separately on a cooled host to keep its headline micro-benchmark numbers representative. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * results: regenerate latest.txt (cool host) and sync README/PROGRESS numbers latest.txt regenerated on a thermally-recovered host so the headline micro-bench numbers are representative (protocol canary 16.1 ns/op), with a fresh Source digest (Dirty inputs: no) matching current source. The README benchmark table and the PROGRESS measured-results section are aligned to it: order_book ~90, protocol ~16, gateway ~102, matching ~91, replay ~101 ns. The matching/replay rows are slightly faster than the prior committed run (the v0.2.2 engine wins showing on the resting-order paths); the order_book micro-bench is unchanged (near-empty index, so it does not exercise the load-factor win). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * results: re-regenerate env-check artifacts against the final v0.2.2 docs dpdk_environment.txt and nic_offload_environment.txt include README.md (and CLAUDE.md/AGENTS.md/results/README.md) in their digest scope, so the README number-sync re-staled them. Regenerated against the final committed docs; Dirty inputs: no. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>

…e wins (#148) Adds the missing performance-evidence report. Profiles the matching-engine hot path with Linux perf + flamegraphs on ARM64 (Apple M2, Fedora Asahi), confirms order-book insertion + matching dominate, and documents the before -> after change in latency, throughput, and CPU counters for the two v0.2.2 optimizations (#138 try_emplace level_for, #145 index load-factor 0.25). Headline (qsl-perfeval, steady-state deep book, baseline storage, Release): throughput 8.89M -> 11.13M orders/sec (+25.2%) p99 latency 250ns -> 208ns (-16.8%) cycles/order 348.2 -> 288.4 (-17.2%) instr/order 1239 -> 1143 (-7.8%); IPC 3.56 -> 3.96 branch-miss 2.02% -> 1.81% allocs/order 1.106 -> 1.106 (UNCHANGED) cache-miss unavailable (Apple Silicon PMU lacks cache counters; #90) Honesty: the counters correct the original #138 rationale — the win is fewer cycles/instructions per order (shorter hash probe chains + no throwaway per-insert pmr::list construction), NOT fewer allocations (libstdc++ map::emplace checks the key before allocating). Latency includes ~12ns steady_clock overhead (reported); cache-miss rate is reported unavailable, never estimated. New tooling: - apps/qsl-perfeval: a dedicated evidence harness (separate binary so its global operator-new alloc counter + per-op timing cannot perturb qsl-bench/latest.txt). Reports orders/sec, mean/p50/p99 latency, allocations/order; run under perf stat/record for counters + flamegraphs. - docs/performance/{before,after}.svg (perf call-graph flamegraphs), docs/performance/perf-stat.txt (raw counters + metadata + #90 caveat). - qsl_perfeval_smoke CTest. make check/asan 271/271; CodeScene clean; determinism unaffected (no engine change here). Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>

codescene-delta-analysis Bot approved these changes Jun 24, 2026

View reviewed changes

div0rce merged commit 1094c84 into main Jun 24, 2026
8 checks passed

div0rce deleted the perf/orderbook-try-emplace branch June 24, 2026 23:19

div0rce mentioned this pull request Jun 24, 2026

perf(flamegraph): regenerate artifact after order-book/network changes #139

Merged

div0rce mentioned this pull request Jun 25, 2026

perf(order_book): cap the order-index hash load factor for shorter probe chains #145

Merged

div0rce mentioned this pull request Jun 25, 2026

docs: PERFORMANCE.md — before/after perf evidence for the v0.2.2 engine wins #148

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf(order_book): use try_emplace for baseline price levels#138

perf(order_book): use try_emplace for baseline price levels#138
div0rce merged 1 commit into
mainfrom
perf/orderbook-try-emplace

div0rce commented Jun 24, 2026

Uh oh!

coderabbitai Bot commented Jun 24, 2026

Review limit reached

Uh oh!

codescene-delta-analysis Bot left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

div0rce commented Jun 24, 2026

Summary

Why it's safe (determinism preserved)

Measured win (controlled A/B)

Why results/latest.txt is unchanged

Uh oh!

coderabbitai Bot commented Jun 24, 2026

Review limit reached

Uh oh!

codescene-delta-analysis Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Why `results/latest.txt` is unchanged