perf(order_book): cap the order-index hash load factor for shorter probe chains#145
Conversation
…obe chains The order index_ (OrderId -> Locator unordered_map) is the busiest data structure on the engine hot path: every new_limit does a duplicate-id find + a resting insert, every cancel/modify does a find + erase, and every maker fill does an erase — 1-4 point lookups per engine op. At the default max_load_factor of 1.0 a busy book runs the index near fully loaded, so probe chains (and thus every lookup) are long. Capping max_load_factor at 0.25 keeps the table sparse and probe chains short. Measured A/B (Release -O3, baseline storage, qsl-bench profile 3s, same host, 5 runs each, non-overlapping ranges): ~8.50M -> ~10.08M ops/sec, ~+18.6%. A load-factor sweep showed the win plateaus below ~0.25 (0.5:+10%, 0.25:+18%, 0.125:+20%), so 0.25 captures most of it for a modest memory tradeoff (more empty buckets) rather than benchmark-tuning a fixed bucket count. Determinism preserved: index_ is used only for find/insert/erase/size — never iterated for output — so changing its bucket count cannot affect emitted events or snapshots (those iterate the ordered bids_/asks_ maps). Verified: fixtures byte-identical across g++/clang++ and vs the committed copies; OCaml differential passes. make check/asan 270/270 (asan now under the strict UBSan gate). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughThe ChangesOrder book index load factor
Sequence Diagram(s)Estimated code review effort🎯 1 (Trivial) | ⏱️ ~2 minutes Poem
🚥 Pre-merge checks | ✅ 3 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Our agent can fix these. Install it.
Gates Passed
6 Quality Gates Passed
Quality Gate Profile: Pay Down Tech Debt
Install CodeScene MCP: safeguard and uplift AI-generated code. Catch issues early with our IDE extension and CLI tool.
… index load-factor win (#146) Refreshes the committed flamegraph so its Source digest matches current src/ and cmake/ after PRs #140-#145 (Dirty inputs: no). ~20k samples, zero [unknown]. The index load-factor cap (#145) is visible: the order-index point-lookup frames (contains/cancel find) are lighter relative to the inherent matching work. Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a8a9767e49
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| contiguous_(storage == Storage::Contiguous ? std::make_unique<ContiguousStore>() : nullptr) { | ||
| // The order index is on the hot path: every new/cancel/modify/fill does 1-4 point lookups on | ||
| // it. Capping the load factor at 0.25 (vs the default 1.0) keeps probe chains short, which | ||
| // measurably speeds the whole engine on a busy book — a measured ~+18% on the steady-state |
There was a problem hiding this comment.
Record the benchmark claim in generated results
This newly hand-written source comment introduces a ~+18% performance number, but the committed benchmark artifacts were not regenerated for this source change: results/flamegraph.txt still reports the old flamegraph-benchmark digest, while this edit changes that digest because src/ is in the artifact's provenance inputs. docs/benchmarking.md says performance numbers must be produced by the committed harness and recorded under results/; as-is, reviewers only have stale artifacts for the optimized code, so either commit a metadata-rich generated result for this change or remove the numeric claim from the comment.
Useful? React with 👍 / 👎.
…ct (#147) * docs: overhaul all stale docs for the post-v0.2.1 (v0.2.2) state Full staleness audit of every prose doc against current main. The anchors were frozen at v0.2.1 / 263 tests / "no active milestone" while 12 PRs (#135-#146) had merged and are unreleased. - Resume anchors (PROGRESS.md, HANDOFF.md): Current state / Last action / Next action / test count (263->270) brought current; the two duplicate frozen anchors and the stale macOS benchmark numbers fixed; a dated v0.2.2 log entry. - CLAUDE.md + AGENTS.md: the post-M35 roadmap-memory section now records the post-v0.2.1 hardening + perf wave (identical edit in both). - CHANGELOG.md: new [0.2.2] section (decoder enum rejection #136, network/CLI hardening #137/#140/#141/#143, real UBSan abort gate #142, ocaml diff_report #144, try_emplace ~+5% #138, index load-factor ~+18.6% #145). CMakeLists version 0.2.1 -> 0.2.2. - README: benchmark/flamegraph/limitations sections reflect the engine wins (measured on the profile workload, not the micro-bench table) and the gateway hardening; release_readiness 270/270 + UBSan gate + v0.2.2 scope. - Networking docs (socket_gateway, socket_hardening): connection cap, EINTR retry, transient-accept survival, fd-exhaustion handling, UDP send-error counting. replay_and_recovery: decode_command now rejects out-of-domain enums. binary_protocol/differential_testing/fix_protocol/SECURITY/recruiting_notes/ CONTRIBUTING: smaller accuracy updates. results/README: add the three socket artifacts. make check 270/270. Stale results/*.txt provenance digests regenerated separately. pool_backed_storage.md table follows its artifact regeneration. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * results: regenerate stale provenance artifacts for the v0.2.2 source The post-v0.2.1 source changes (#135-#146) left 14 results/*.txt with stale Source digests (the authoritative staleness signal per the provenance rules, not commit-hash drift). Regenerated via their make targets so each declares Dirty inputs: no against current HEAD: differential, pool_backed_storage, allocator_experiment, recovery_benchmarks, false_sharing_study, perf_stat_linux (partial PMU, QSL_PERF_ALLOW_PARTIAL), perf_report_linux, numa_affinity_study (linux-constrained), socket_load_summary, socket_profile_loopback, socket_stress_summary, dpdk_environment, nic_offload_environment. docs/pool_backed_storage.md: refreshed the median table, digest reference, and qualitative ordering from the regenerated artifact (contiguous fastest on four of five workloads; intrusive leads dense). The baseline rows now include the try_emplace (#138) and index load-factor (#145) wins. Honesty notes: these were measured on a thermally-warmed M2 from a long session, so absolute values run higher than a cool-host snapshot — the relative orderings and provenance digests are the load-bearing content, and every artifact carries a hardware/build-dependence caveat. latest.txt is regenerated separately on a cooled host to keep its headline micro-benchmark numbers representative. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * results: regenerate latest.txt (cool host) and sync README/PROGRESS numbers latest.txt regenerated on a thermally-recovered host so the headline micro-bench numbers are representative (protocol canary 16.1 ns/op), with a fresh Source digest (Dirty inputs: no) matching current source. The README benchmark table and the PROGRESS measured-results section are aligned to it: order_book ~90, protocol ~16, gateway ~102, matching ~91, replay ~101 ns. The matching/replay rows are slightly faster than the prior committed run (the v0.2.2 engine wins showing on the resting-order paths); the order_book micro-bench is unchanged (near-empty index, so it does not exercise the load-factor win). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * results: re-regenerate env-check artifacts against the final v0.2.2 docs dpdk_environment.txt and nic_offload_environment.txt include README.md (and CLAUDE.md/AGENTS.md/results/README.md) in their digest scope, so the README number-sync re-staled them. Regenerated against the final committed docs; Dirty inputs: no. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
…e wins (#148) Adds the missing performance-evidence report. Profiles the matching-engine hot path with Linux perf + flamegraphs on ARM64 (Apple M2, Fedora Asahi), confirms order-book insertion + matching dominate, and documents the before -> after change in latency, throughput, and CPU counters for the two v0.2.2 optimizations (#138 try_emplace level_for, #145 index load-factor 0.25). Headline (qsl-perfeval, steady-state deep book, baseline storage, Release): throughput 8.89M -> 11.13M orders/sec (+25.2%) p99 latency 250ns -> 208ns (-16.8%) cycles/order 348.2 -> 288.4 (-17.2%) instr/order 1239 -> 1143 (-7.8%); IPC 3.56 -> 3.96 branch-miss 2.02% -> 1.81% allocs/order 1.106 -> 1.106 (UNCHANGED) cache-miss unavailable (Apple Silicon PMU lacks cache counters; #90) Honesty: the counters correct the original #138 rationale — the win is fewer cycles/instructions per order (shorter hash probe chains + no throwaway per-insert pmr::list construction), NOT fewer allocations (libstdc++ map::emplace checks the key before allocating). Latency includes ~12ns steady_clock overhead (reported); cache-miss rate is reported unavailable, never estimated. New tooling: - apps/qsl-perfeval: a dedicated evidence harness (separate binary so its global operator-new alloc counter + per-op timing cannot perturb qsl-bench/latest.txt). Reports orders/sec, mean/p50/p99 latency, allocations/order; run under perf stat/record for counters + flamegraphs. - docs/performance/{before,after}.svg (perf call-graph flamegraphs), docs/performance/perf-stat.txt (raw counters + metadata + #90 caveat). - qsl_perfeval_smoke CTest. make check/asan 271/271; CodeScene clean; determinism unaffected (no engine change here). Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
Summary
Flamegraph-driven optimization. After the
try_emplacewin (#138) removed the price-level allocation churn, the remaining hot frames are the order index_ (OrderId -> Locatorunordered_map) point lookups — it's the busiest structure on the engine hot path:new_limit: duplicate-idfind+ restinginsertcancel/modify:find+eraseerase→ 1–4
index_lookups per engine op. At the defaultmax_load_factorof 1.0 a busy book runs the table near fully loaded, so probe chains (and thus every lookup) are long.Change
Cap
max_load_factorat 0.25, keeping the table sparse and probe chains short.Measured win (controlled A/B)
Release
-O3, baseline storage,qsl-bench profile 3(steady-state deep book), same host, back-to-back, 5 runs each:~+18.6%, non-overlapping ranges. A load-factor sweep showed the win plateaus below ~0.25 (0.5→+10%, 0.25→+18%, 0.125→+20%), so 0.25 captures most of it for a modest memory tradeoff (more empty buckets) — chosen as a principled load-factor policy rather than benchmark-tuning a fixed bucket count.
Determinism preserved
index_is used only forfind/insert/erase/size— never iterated for output — so changing its bucket count cannot affect emitted events or snapshots (those iterate the orderedbids_/asks_maps;resting_orders()usesindex_.size()only as areservehint). Verified: fixtures byte-identical across g++/clang++ and vs the committed copies; OCaml differential passes.Verification
make check270/270,make asan270/270 (now under the strict UBSan gate from #142), determinism byte-identical, CodeScene delta clean.🤖 Generated with Claude Code
Summary by CodeRabbit