perf(order_book): cap the order-index hash load factor for shorter probe chains by div0rce · Pull Request #145 · div0rce/quant-systems-lab

div0rce · 2026-06-25T01:55:15Z

Summary

Flamegraph-driven optimization. After the try_emplace win (#138) removed the price-level allocation churn, the remaining hot frames are the order index_ (OrderId -> Locator unordered_map) point lookups — it's the busiest structure on the engine hot path:

new_limit: duplicate-id find + resting insert
cancel/modify: find + erase
every maker fill: erase

→ 1–4 index_ lookups per engine op. At the default max_load_factor of 1.0 a busy book runs the table near fully loaded, so probe chains (and thus every lookup) are long.

Change

Cap max_load_factor at 0.25, keeping the table sparse and probe chains short.

Measured win (controlled A/B)

Release -O3, baseline storage, qsl-bench profile 3 (steady-state deep book), same host, back-to-back, 5 runs each:

Configuration	ops/sec (5 runs)	Median
before (load factor 1.0)	8.42 / 8.54 / 8.48 / 8.55 / 8.50 M	~8.50M
after (load factor 0.25)	9.99 / 9.80 / 10.21 / 10.08 / 10.21 M	~10.08M

~+18.6%, non-overlapping ranges. A load-factor sweep showed the win plateaus below ~0.25 (0.5→+10%, 0.25→+18%, 0.125→+20%), so 0.25 captures most of it for a modest memory tradeoff (more empty buckets) — chosen as a principled load-factor policy rather than benchmark-tuning a fixed bucket count.

Determinism preserved

index_ is used only for find/insert/erase/size — never iterated for output — so changing its bucket count cannot affect emitted events or snapshots (those iterate the ordered bids_/asks_ maps; resting_orders() uses index_.size() only as a reserve hint). Verified: fixtures byte-identical across g++/clang++ and vs the committed copies; OCaml differential passes.

Verification

make check 270/270, make asan 270/270 (now under the strict UBSan gate from #142), determinism byte-identical, CodeScene delta clean.

🤖 Generated with Claude Code

Summary by CodeRabbit

Refactor
- Improved internal order book initialization to use a more conservative map sizing strategy, which may help keep performance more consistent under higher data volumes.

…obe chains

The order index_ (OrderId -> Locator unordered_map) is the busiest data structure on the engine hot path: every new_limit does a duplicate-id find + a resting insert, every cancel/modify does a find + erase, and every maker fill does an erase, i.e. 1-4 point lookups per engine op. At the default max_load_factor of 1.0 a busy book runs the index near fully loaded, so probe chains (and thus every lookup) are long. Capping max_load_factor at 0.25 keeps the table sparse and probe chains short.

Measured A/B (Release -O3, baseline storage, qsl-bench profile 3s, same host, 5 runs each, non-overlapping ranges): ~8.50M -> ~10.08M ops/sec, ~+18.6%. A load-factor sweep showed the win plateaus below ~0.25 (0.5:+10%, 0.25:+18%, 0.125:+20%), so 0.25 captures most of it for a modest memory tradeoff (more empty buckets) rather than benchmark-tuning a fixed bucket count.

Determinism preserved: index_ is used only for find/insert/erase/size and is never iterated for output, so changing its bucket count cannot affect emitted events or snapshots (those iterate the ordered bids_/asks_ maps). Verified: fixtures byte-identical across g++/clang++ and vs the committed copies; OCaml differential passes. make check/asan 270/270 (asan now under the strict UBSan gate).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

coderabbitai · 2026-06-25T01:55:27Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: cfea497a-a1f4-4dce-9a30-182b43e4c937

📥 Commits

Reviewing files that changed from the base of the PR and between 70757b9 and a8a9767.

📒 Files selected for processing (1)

src/engine/order_book.cpp

📝 Walkthrough

The OrderBook constructor now sets index_ to a max load factor of 0.25F during initialization. No public interface or other constructor behavior changes are present.

Changes

Order book index load factor

Layer / File(s)	Summary
Constructor index load-factor setting `src/engine/order_book.cpp`	`OrderBook` now calls `index_.max_load_factor(0.25F)` after `index_` is initialized.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~2 minutes

Poem

🐰 I hopped by the book with a twitch of my nose,
Set index_ to 0.25F where the spring breeze blows.
One tiny tweak, neat and bright,
In the burrow of code, it feels just right.

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Description check	⚠️ Warning	The description covers summary, changes, and verification, but it omits required template sections like Milestone, DoD, Tests, and Notes.	Add the Milestone, Definition of Done, Tests, and Notes/decisions sections from the template, including the required checklist items and commands run.

✅ Passed checks (3 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly states the order book hash load-factor change and its performance goal.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.


codescene-delta-analysis

Gates Passed
6 Quality Gates Passed

Quality Gate Profile: Pay Down Tech Debt

… index load-factor win (#146) Refreshes the committed flamegraph so its Source digest matches current src/ and cmake/ after PRs #140-#145 (Dirty inputs: no). ~20k samples, zero [unknown]. The index load-factor cap (#145) is visible: the order-index point-lookup frames (contains/cancel find) are lighter relative to the inherent matching work. Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a8a9767e49

chatgpt-codex-connector · 2026-06-25T01:59:27Z

+      contiguous_(storage == Storage::Contiguous ? std::make_unique<ContiguousStore>() : nullptr) {
+    // The order index is on the hot path: every new/cancel/modify/fill does 1-4 point lookups on
+    // it. Capping the load factor at 0.25 (vs the default 1.0) keeps probe chains short, which
+    // measurably speeds the whole engine on a busy book — a measured ~+18% on the steady-state


Record the benchmark claim in generated results

This newly hand-written source comment introduces a ~+18% performance number, but the committed benchmark artifacts were not regenerated for this source change: results/flamegraph.txt still reports the old flamegraph-benchmark digest, while this edit changes that digest because src/ is in the artifact's provenance inputs. docs/benchmarking.md says performance numbers must be produced by the committed harness and recorded under results/; as-is, reviewers only have stale artifacts for the optimized code, so either commit a metadata-rich generated result for this change or remove the numeric claim from the comment.

…ct (#147)

* docs: overhaul all stale docs for the post-v0.2.1 (v0.2.2) state

Full staleness audit of every prose doc against current main. The anchors were frozen at v0.2.1 / 263 tests / "no active milestone" while 12 PRs (#135-#146) had merged and are unreleased.

- Resume anchors (PROGRESS.md, HANDOFF.md): Current state / Last action / Next action / test count (263->270) brought current; the two duplicate frozen anchors and the stale macOS benchmark numbers fixed; a dated v0.2.2 log entry.
- CLAUDE.md + AGENTS.md: the post-M35 roadmap-memory section now records the post-v0.2.1 hardening + perf wave (identical edit in both).
- CHANGELOG.md: new [0.2.2] section (decoder enum rejection #136, network/CLI hardening #137/#140/#141/#143, real UBSan abort gate #142, ocaml diff_report #144, try_emplace ~+5% #138, index load-factor ~+18.6% #145). CMakeLists version 0.2.1 -> 0.2.2.
- README: benchmark/flamegraph/limitations sections reflect the engine wins (measured on the profile workload, not the micro-bench table) and the gateway hardening; release_readiness 270/270 + UBSan gate + v0.2.2 scope.
- Networking docs (socket_gateway, socket_hardening): connection cap, EINTR retry, transient-accept survival, fd-exhaustion handling, UDP send-error counting. replay_and_recovery: decode_command now rejects out-of-domain enums. binary_protocol/differential_testing/fix_protocol/SECURITY/recruiting_notes/CONTRIBUTING: smaller accuracy updates. results/README: add the three socket artifacts.

make check 270/270. Stale results/*.txt provenance digests regenerated separately. pool_backed_storage.md table follows its artifact regeneration.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* results: regenerate stale provenance artifacts for the v0.2.2 source

The post-v0.2.1 source changes (#135-#146) left 14 results/*.txt with stale Source digests (the authoritative staleness signal per the provenance rules, not commit-hash drift). Regenerated via their make targets so each declares Dirty inputs: no against current HEAD: differential, pool_backed_storage, allocator_experiment, recovery_benchmarks, false_sharing_study, perf_stat_linux (partial PMU, QSL_PERF_ALLOW_PARTIAL), perf_report_linux, numa_affinity_study (linux-constrained), socket_load_summary, socket_profile_loopback, socket_stress_summary, dpdk_environment, nic_offload_environment.

docs/pool_backed_storage.md: refreshed the median table, digest reference, and qualitative ordering from the regenerated artifact (contiguous fastest on four of five workloads; intrusive leads dense). The baseline rows now include the try_emplace (#138) and index load-factor (#145) wins.

Honesty notes: these were measured on a thermally-warmed M2 from a long session, so absolute values run higher than a cool-host snapshot; the relative orderings and provenance digests are the load-bearing content, and every artifact carries a hardware/build-dependence caveat. latest.txt is regenerated separately on a cooled host to keep its headline micro-benchmark numbers representative.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* results: regenerate latest.txt (cool host) and sync README/PROGRESS numbers

latest.txt regenerated on a thermally-recovered host so the headline micro-bench numbers are representative (protocol canary 16.1 ns/op), with a fresh Source digest (Dirty inputs: no) matching current source. The README benchmark table and the PROGRESS measured-results section are aligned to it: order_book ~90, protocol ~16, gateway ~102, matching ~91, replay ~101 ns. The matching/replay rows are slightly faster than the prior committed run (the v0.2.2 engine wins showing on the resting-order paths); the order_book micro-bench is unchanged (near-empty index, so it does not exercise the load-factor win).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* results: re-regenerate env-check artifacts against the final v0.2.2 docs

dpdk_environment.txt and nic_offload_environment.txt include README.md (and CLAUDE.md/AGENTS.md/results/README.md) in their digest scope, so the README number-sync re-staled them. Regenerated against the final committed docs; Dirty inputs: no.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>

…e wins (#148)

Adds the missing performance-evidence report. Profiles the matching-engine hot path with Linux perf + flamegraphs on ARM64 (Apple M2, Fedora Asahi), confirms order-book insertion + matching dominate, and documents the before -> after change in latency, throughput, and CPU counters for the two v0.2.2 optimizations (#138 try_emplace level_for, #145 index load-factor 0.25).

Headline (qsl-perfeval, steady-state deep book, baseline storage, Release):

throughput 8.89M -> 11.13M orders/sec (+25.2%)
p99 latency 250ns -> 208ns (-16.8%)
cycles/order 348.2 -> 288.4 (-17.2%)
instr/order 1239 -> 1143 (-7.8%); IPC 3.56 -> 3.96
branch-miss 2.02% -> 1.81%
allocs/order 1.106 -> 1.106 (UNCHANGED)
cache-miss unavailable (Apple Silicon PMU lacks cache counters; #90)

Honesty: the counters correct the original #138 rationale: the win is fewer cycles/instructions per order (shorter hash probe chains + no throwaway per-insert pmr::list construction), NOT fewer allocations (libstdc++ map::emplace checks the key before allocating). Latency includes ~12ns steady_clock overhead (reported); cache-miss rate is reported unavailable, never estimated.

New tooling:
- apps/qsl-perfeval: a dedicated evidence harness (separate binary so its global operator-new alloc counter + per-op timing cannot perturb qsl-bench/latest.txt). Reports orders/sec, mean/p50/p99 latency, allocations/order; run under perf stat/record for counters + flamegraphs.
- docs/performance/{before,after}.svg (perf call-graph flamegraphs), docs/performance/perf-stat.txt (raw counters + metadata + #90 caveat).
- qsl_perfeval_smoke CTest.

make check/asan 271/271; CodeScene clean; determinism unaffected (no engine change here).

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>

codescene-delta-analysis Bot approved these changes Jun 25, 2026

div0rce merged commit ebc1c95 into main Jun 25, 2026
8 checks passed

div0rce deleted the perf/orderbook-index-load-factor branch June 25, 2026 01:57

div0rce mentioned this pull request Jun 25, 2026

perf(flamegraph): regenerate artifact after round-2 fixes and index load-factor win #146

Merged

chatgpt-codex-connector Bot reviewed Jun 25, 2026

div0rce mentioned this pull request Jun 25, 2026

docs: PERFORMANCE.md — before/after perf evidence for the v0.2.2 engine wins #148

Merged

div0rce merged 1 commit into main from perf/orderbook-index-load-factor
