Skip to content

perf(order_book): cap the order-index hash load factor for shorter probe chains#145

Merged
div0rce merged 1 commit into
mainfrom
perf/orderbook-index-load-factor
Jun 25, 2026
Merged

perf(order_book): cap the order-index hash load factor for shorter probe chains#145
div0rce merged 1 commit into
mainfrom
perf/orderbook-index-load-factor

Conversation

@div0rce

@div0rce div0rce commented Jun 25, 2026

Copy link
Copy Markdown
Owner

Summary

Flamegraph-driven optimization. After the try_emplace win (#138) removed the price-level allocation churn, the remaining hot frames are the order index_ (OrderId -> Locator unordered_map) point lookups — it's the busiest structure on the engine hot path:

  • new_limit: duplicate-id find + resting insert
  • cancel/modify: find + erase
  • every maker fill: erase

→ 1–4 index_ lookups per engine op. At the default max_load_factor of 1.0 a busy book runs the table near fully loaded, so probe chains (and thus every lookup) are long.

Change

Cap max_load_factor at 0.25, keeping the table sparse and probe chains short.

Measured win (controlled A/B)

Release -O3, baseline storage, qsl-bench profile 3 (steady-state deep book), same host, back-to-back, 5 runs each:

ops/sec median
before (load factor 1.0) 8.42 / 8.54 / 8.48 / 8.55 / 8.50 M ~8.50M
after (load factor 0.25) 9.99 / 9.80 / 10.21 / 10.08 / 10.21 M ~10.08M

~+18.6%, non-overlapping ranges. A load-factor sweep showed the win plateaus below ~0.25 (0.5→+10%, 0.25→+18%, 0.125→+20%), so 0.25 captures most of it for a modest memory tradeoff (more empty buckets) — chosen as a principled load-factor policy rather than benchmark-tuning a fixed bucket count.

Determinism preserved

index_ is used only for find/insert/erase/size — never iterated for output — so changing its bucket count cannot affect emitted events or snapshots (those iterate the ordered bids_/asks_ maps; resting_orders() uses index_.size() only as a reserve hint). Verified: fixtures byte-identical across g++/clang++ and vs the committed copies; OCaml differential passes.

Verification

make check 270/270, make asan 270/270 (now under the strict UBSan gate from #142), determinism byte-identical, CodeScene delta clean.

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Refactor
    • Improved internal order book initialization to use a more conservative map sizing strategy, which may help keep performance more consistent under higher data volumes.

…obe chains

The order index_ (OrderId -> Locator unordered_map) is the busiest data structure
on the engine hot path: every new_limit does a duplicate-id find + a resting
insert, every cancel/modify does a find + erase, and every maker fill does an
erase — 1-4 point lookups per engine op. At the default max_load_factor of 1.0 a
busy book runs the index near fully loaded, so probe chains (and thus every
lookup) are long.

Capping max_load_factor at 0.25 keeps the table sparse and probe chains short.
Measured A/B (Release -O3, baseline storage, qsl-bench profile 3s, same host,
5 runs each, non-overlapping ranges): ~8.50M -> ~10.08M ops/sec, ~+18.6%. A
load-factor sweep showed the win plateaus below ~0.25 (0.5:+10%, 0.25:+18%,
0.125:+20%), so 0.25 captures most of it for a modest memory tradeoff (more empty
buckets) rather than benchmark-tuning a fixed bucket count.

Determinism preserved: index_ is used only for find/insert/erase/size — never
iterated for output — so changing its bucket count cannot affect emitted events
or snapshots (those iterate the ordered bids_/asks_ maps). Verified: fixtures
byte-identical across g++/clang++ and vs the committed copies; OCaml differential
passes. make check/asan 270/270 (asan now under the strict UBSan gate).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented Jun 25, 2026

Copy link
Copy Markdown

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: cfea497a-a1f4-4dce-9a30-182b43e4c937

📥 Commits

Reviewing files that changed from the base of the PR and between 70757b9 and a8a9767.

📒 Files selected for processing (1)
  • src/engine/order_book.cpp

📝 Walkthrough

Walkthrough

The OrderBook constructor now sets index_ to a max load factor of 0.25F during initialization. No public interface or other constructor behavior changes are present.

Changes

Order book index load factor

Layer / File(s) Summary
Constructor index load-factor setting
src/engine/order_book.cpp
OrderBook now calls index_.max_load_factor(0.25F) after index_ is initialized.

Sequence Diagram(s)

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~2 minutes

Poem

🐰 I hopped by the book with a twitch of my nose,
Set index_ to 0.25F where the spring breeze blows.
One tiny tweak, neat and bright,
In the burrow of code, it feels just right.

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Description check ⚠️ Warning The description covers summary, changes, and verification, but it omits required template sections like Milestone, DoD, Tests, and Notes. Add the Milestone, Definition of Done, Tests, and Notes/decisions sections from the template, including the required checklist items and commands run.
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly states the order book hash load-factor change and its performance goal.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch perf/orderbook-index-load-factor

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@codescene-delta-analysis codescene-delta-analysis Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Our agent can fix these. Install it.

Gates Passed
6 Quality Gates Passed

Quality Gate Profile: Pay Down Tech Debt
Install CodeScene MCP: safeguard and uplift AI-generated code. Catch issues early with our IDE extension and CLI tool.

@div0rce div0rce merged commit ebc1c95 into main Jun 25, 2026
8 checks passed
@div0rce div0rce deleted the perf/orderbook-index-load-factor branch June 25, 2026 01:57
div0rce added a commit that referenced this pull request Jun 25, 2026
… index load-factor win (#146)

Refreshes the committed flamegraph so its Source digest matches current src/ and
cmake/ after PRs #140-#145 (Dirty inputs: no). ~20k samples, zero [unknown].
The index load-factor cap (#145) is visible: the order-index point-lookup frames
(contains/cancel find) are lighter relative to the inherent matching work.

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a8a9767e49

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/engine/order_book.cpp
contiguous_(storage == Storage::Contiguous ? std::make_unique<ContiguousStore>() : nullptr) {
// The order index is on the hot path: every new/cancel/modify/fill does 1-4 point lookups on
// it. Capping the load factor at 0.25 (vs the default 1.0) keeps probe chains short, which
// measurably speeds the whole engine on a busy book — a measured ~+18% on the steady-state

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Record the benchmark claim in generated results

This newly hand-written source comment introduces a ~+18% performance number, but the committed benchmark artifacts were not regenerated for this source change: results/flamegraph.txt still reports the old flamegraph-benchmark digest, while this edit changes that digest because src/ is in the artifact's provenance inputs. docs/benchmarking.md says performance numbers must be produced by the committed harness and recorded under results/; as-is, reviewers only have stale artifacts for the optimized code, so either commit a metadata-rich generated result for this change or remove the numeric claim from the comment.

Useful? React with 👍 / 👎.

div0rce added a commit that referenced this pull request Jun 25, 2026
…ct (#147)

* docs: overhaul all stale docs for the post-v0.2.1 (v0.2.2) state

Full staleness audit of every prose doc against current main. The anchors were
frozen at v0.2.1 / 263 tests / "no active milestone" while 12 PRs (#135-#146)
had merged and are unreleased.

- Resume anchors (PROGRESS.md, HANDOFF.md): Current state / Last action /
  Next action / test count (263->270) brought current; the two duplicate frozen
  anchors and the stale macOS benchmark numbers fixed; a dated v0.2.2 log entry.
- CLAUDE.md + AGENTS.md: the post-M35 roadmap-memory section now records the
  post-v0.2.1 hardening + perf wave (identical edit in both).
- CHANGELOG.md: new [0.2.2] section (decoder enum rejection #136, network/CLI
  hardening #137/#140/#141/#143, real UBSan abort gate #142, ocaml diff_report
  #144, try_emplace ~+5% #138, index load-factor ~+18.6% #145). CMakeLists
  version 0.2.1 -> 0.2.2.
- README: benchmark/flamegraph/limitations sections reflect the engine wins
  (measured on the profile workload, not the micro-bench table) and the gateway
  hardening; release_readiness 270/270 + UBSan gate + v0.2.2 scope.
- Networking docs (socket_gateway, socket_hardening): connection cap, EINTR
  retry, transient-accept survival, fd-exhaustion handling, UDP send-error
  counting. replay_and_recovery: decode_command now rejects out-of-domain enums.
  binary_protocol/differential_testing/fix_protocol/SECURITY/recruiting_notes/
  CONTRIBUTING: smaller accuracy updates. results/README: add the three socket
  artifacts.

make check 270/270. Stale results/*.txt provenance digests regenerated
separately. pool_backed_storage.md table follows its artifact regeneration.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* results: regenerate stale provenance artifacts for the v0.2.2 source

The post-v0.2.1 source changes (#135-#146) left 14 results/*.txt with stale
Source digests (the authoritative staleness signal per the provenance rules,
not commit-hash drift). Regenerated via their make targets so each declares
Dirty inputs: no against current HEAD: differential, pool_backed_storage,
allocator_experiment, recovery_benchmarks, false_sharing_study, perf_stat_linux
(partial PMU, QSL_PERF_ALLOW_PARTIAL), perf_report_linux, numa_affinity_study
(linux-constrained), socket_load_summary, socket_profile_loopback,
socket_stress_summary, dpdk_environment, nic_offload_environment.

docs/pool_backed_storage.md: refreshed the median table, digest reference, and
qualitative ordering from the regenerated artifact (contiguous fastest on four of
five workloads; intrusive leads dense). The baseline rows now include the
try_emplace (#138) and index load-factor (#145) wins.

Honesty notes: these were measured on a thermally-warmed M2 from a long session,
so absolute values run higher than a cool-host snapshot — the relative orderings
and provenance digests are the load-bearing content, and every artifact carries
a hardware/build-dependence caveat. latest.txt is regenerated separately on a
cooled host to keep its headline micro-benchmark numbers representative.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* results: regenerate latest.txt (cool host) and sync README/PROGRESS numbers

latest.txt regenerated on a thermally-recovered host so the headline micro-bench
numbers are representative (protocol canary 16.1 ns/op), with a fresh Source
digest (Dirty inputs: no) matching current source. The README benchmark table
and the PROGRESS measured-results section are aligned to it: order_book ~90,
protocol ~16, gateway ~102, matching ~91, replay ~101 ns. The matching/replay
rows are slightly faster than the prior committed run (the v0.2.2 engine wins
showing on the resting-order paths); the order_book micro-bench is unchanged
(near-empty index, so it does not exercise the load-factor win).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* results: re-regenerate env-check artifacts against the final v0.2.2 docs

dpdk_environment.txt and nic_offload_environment.txt include README.md (and
CLAUDE.md/AGENTS.md/results/README.md) in their digest scope, so the README
number-sync re-staled them. Regenerated against the final committed docs;
Dirty inputs: no.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
div0rce added a commit that referenced this pull request Jun 25, 2026
…e wins (#148)

Adds the missing performance-evidence report. Profiles the matching-engine hot
path with Linux perf + flamegraphs on ARM64 (Apple M2, Fedora Asahi), confirms
order-book insertion + matching dominate, and documents the before -> after
change in latency, throughput, and CPU counters for the two v0.2.2 optimizations
(#138 try_emplace level_for, #145 index load-factor 0.25).

Headline (qsl-perfeval, steady-state deep book, baseline storage, Release):
  throughput   8.89M -> 11.13M orders/sec  (+25.2%)
  p99 latency  250ns -> 208ns              (-16.8%)
  cycles/order 348.2 -> 288.4              (-17.2%)
  instr/order  1239  -> 1143               (-7.8%); IPC 3.56 -> 3.96
  branch-miss  2.02% -> 1.81%
  allocs/order 1.106 -> 1.106              (UNCHANGED)
  cache-miss   unavailable (Apple Silicon PMU lacks cache counters; #90)

Honesty: the counters correct the original #138 rationale — the win is fewer
cycles/instructions per order (shorter hash probe chains + no throwaway per-insert
pmr::list construction), NOT fewer allocations (libstdc++ map::emplace checks the
key before allocating). Latency includes ~12ns steady_clock overhead (reported);
cache-miss rate is reported unavailable, never estimated.

New tooling:
- apps/qsl-perfeval: a dedicated evidence harness (separate binary so its global
  operator-new alloc counter + per-op timing cannot perturb qsl-bench/latest.txt).
  Reports orders/sec, mean/p50/p99 latency, allocations/order; run under perf
  stat/record for counters + flamegraphs.
- docs/performance/{before,after}.svg (perf call-graph flamegraphs),
  docs/performance/perf-stat.txt (raw counters + metadata + #90 caveat).
- qsl_perfeval_smoke CTest.

make check/asan 271/271; CodeScene clean; determinism unaffected (no engine
change here).

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant