docs: PERFORMANCE.md — before/after perf evidence for the v0.2.2 engine wins by div0rce · Pull Request #148 · div0rce/quant-systems-lab

div0rce · 2026-06-25T03:04:27Z

Summary

Adds the missing performance-evidence report. It profiles the matching-engine hot path with Linux perf + flamegraphs on ARM64 (Apple M2, Fedora Asahi), confirms order-book insertion + matching are the dominant cost, and documents the before → after change in latency, throughput, and CPU counters for the two v0.2.2 optimizations (#138 try_emplace in level_for, #145 index max_load_factor 0.25).

Headline (qsl-perfeval, steady-state deep book, baseline storage, Release)

| Metric | Before | After | Δ |
|---|---|---|---|
| Throughput (orders/sec) | 8.89 M | 11.13 M | +25.2 % |
| p99 latency | 250 ns | 208 ns | −16.8 % |
| Cycles / order | 348.2 | 288.4 | −17.2 % |
| Instructions / order | 1239 | 1143 | −7.8 % (IPC 3.56→3.96) |
| Branch-miss rate | 2.02 % | 1.81 % | −0.21 pp |
| Allocations / order | 1.106 | 1.106 | unchanged |
| Cache-miss rate | unavailable | unavailable | n/a (#90) |
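
The Δ column and the IPC figures can be re-derived from the raw per-order counters. The following is a minimal sketch of that arithmetic; the helper names `pct_change` and `ipc` are illustrative assumptions and are not taken from the harness.

```cpp
#include <cassert>

// Relative change from `before` to `after`, in percent (hypothetical helper).
double pct_change(double before, double after) {
    assert(before != 0.0);
    return (after - before) / before * 100.0;
}

// Instructions per cycle from per-order counters (hypothetical helper).
double ipc(double instructions_per_order, double cycles_per_order) {
    assert(cycles_per_order > 0.0);
    return instructions_per_order / cycles_per_order;
}
```

For example, `pct_change(8.89, 11.13)` is about +25.2 and `ipc(1239, 348.2)` is about 3.56, matching the table rows above.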

Honest mechanism (the point of measuring)

The hardware counters correct the original #138 rationale: the win is fewer cycles/instructions per order (shorter hash-probe chains + no throwaway per-insert pmr::list construction), not fewer allocations — libstdc++ std::map::emplace checks the key before allocating, so allocs/order is unchanged. perf report pins it: level_for 21.3%→17.5% (try_emplace), contains 3.6%→1.3% and OrderBook::cancel 16.0%→13.2% (load-factor). Cache-miss rate is reported unavailable, never estimated — the Apple Silicon PMU doesn't expose cache counters (#90).
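
To illustrate the mechanism described above, here is a small sketch under stated assumptions: `Level` is a hypothetical stand-in for the per-price list, and `level_for_emplace`, `level_for_try_emplace`, and `make_index` are illustrative names, not the engine's real functions. It shows that `emplace` with a pre-built value constructs a throwaway object on every call, that `try_emplace` constructs only when the key is absent, and how a lower `max_load_factor` is set on a hash index.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <unordered_map>

// Hypothetical price level. It counts default constructions so the
// difference between the two insertion styles is observable.
struct Level {
    static inline std::size_t constructed = 0;
    Level() { ++constructed; }
};

using Levels = std::map<std::int64_t, Level>;

// Builds a temporary Level on every call, even when the key already exists.
Level& level_for_emplace(Levels& levels, std::int64_t price) {
    return levels.emplace(price, Level{}).first->second;
}

// Constructs the Level in place only when the key is absent.
Level& level_for_try_emplace(Levels& levels, std::int64_t price) {
    return levels.try_emplace(price).first->second;
}

// Order-id index with a lowered maximum load factor: more buckets per
// element, so probe chains stay short at the cost of extra memory.
std::unordered_map<std::uint64_t, std::int64_t> make_index(std::size_t expected) {
    std::unordered_map<std::uint64_t, std::int64_t> index;
    index.max_load_factor(0.25F);
    index.reserve(expected);
    return index;
}
```

Looking up the same price three times constructs three `Level` objects through `level_for_emplace` and one through `level_for_try_emplace`. Whether node allocation also differs depends on the standard library, which is the point the counters above make for libstdc++.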

What's added

PERFORMANCE.md: the report (table, flamegraphs, perf report/annotate, methodology, reproduction, tuning rationale).
apps/qsl-perfeval: a dedicated evidence harness, built as a separate binary so its global operator new allocation counter + per-op timing can't perturb qsl-bench/results/latest.txt. Reports orders/sec, mean/p50/p99 latency, allocations/order; run under perf stat/record for counters + flamegraphs.
docs/performance/: before.svg, after.svg (perf call-graph flamegraphs), perf-stat.txt (raw counters + full metadata: command, flags, hardware, kernel, schedutil governor, perf version, and the #90 cache-counter caveat).
qsl_perfeval_smoke CTest.
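
Because the harness times each operation individually, the cost of reading the clock is part of every latency sample. The sketch below shows one way such a timer overhead could be estimated: the median gap between two back-to-back `std::chrono::steady_clock::now()` calls. The function name `timer_overhead_ns` and the sample count are assumptions; the PR does not show how qsl-perfeval measures this.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Median gap, in nanoseconds, between two consecutive steady_clock reads.
// Hypothetical helper; the real harness may measure this differently.
std::int64_t timer_overhead_ns(std::size_t samples = 10001) {
    using clock = std::chrono::steady_clock;
    if (samples == 0) {
        return 0;
    }
    std::vector<std::int64_t> gaps;
    gaps.reserve(samples);
    for (std::size_t i = 0; i < samples; ++i) {
        const auto t0 = clock::now();
        const auto t1 = clock::now();
        gaps.push_back(
            std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count());
    }
    // nth_element places the median at the middle position without a full sort.
    const auto mid = gaps.begin() + static_cast<std::ptrdiff_t>(gaps.size() / 2);
    std::nth_element(gaps.begin(), mid, gaps.end());
    return *mid;
}
```

Reporting this figure next to the latency percentiles lets a reader separate clock cost from engine cost without silently subtracting it.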

Verification

make check / make asan 271/271 (the perfeval operator new override coexists with ASan); CodeScene clean; no engine change in this PR, so determinism is unaffected.

🤖 Generated with Claude Code

Summary by CodeRabbit

New Features
- Added a new performance benchmark command with throughput and latency reporting.
- Added a smoke test to verify the benchmark output format.
- Added performance documentation with measured before/after results and reproduction steps.
Bug Fixes
- Improved order-book hot-path efficiency, resulting in better cycles per order and higher IPC.
- Reduced allocation and lookup overhead during benchmarked workloads.

…e wins

Adds the missing performance-evidence report. Profiles the matching-engine hot path with Linux perf + flamegraphs on ARM64 (Apple M2, Fedora Asahi), confirms order-book insertion + matching dominate, and documents the before -> after change in latency, throughput, and CPU counters for the two v0.2.2 optimizations (#138 try_emplace level_for, #145 index load-factor 0.25).

Headline (qsl-perfeval, steady-state deep book, baseline storage, Release):
  throughput 8.89M -> 11.13M orders/sec (+25.2%)
  p99 latency 250ns -> 208ns (-16.8%)
  cycles/order 348.2 -> 288.4 (-17.2%)
  instr/order 1239 -> 1143 (-7.8%); IPC 3.56 -> 3.96
  branch-miss 2.02% -> 1.81%
  allocs/order 1.106 -> 1.106 (UNCHANGED)
  cache-miss unavailable (Apple Silicon PMU lacks cache counters; #90)

Honesty: the counters correct the original #138 rationale: the win is fewer cycles/instructions per order (shorter hash probe chains + no throwaway per-insert pmr::list construction), NOT fewer allocations (libstdc++ map::emplace checks the key before allocating). Latency includes ~12ns steady_clock overhead (reported); cache-miss rate is reported unavailable, never estimated.

New tooling:
- apps/qsl-perfeval: a dedicated evidence harness (separate binary so its global operator-new alloc counter + per-op timing cannot perturb qsl-bench/latest.txt). Reports orders/sec, mean/p50/p99 latency, allocations/order; run under perf stat/record for counters + flamegraphs.
- docs/performance/{before,after}.svg (perf call-graph flamegraphs), docs/performance/perf-stat.txt (raw counters + metadata + #90 caveat).
- qsl_perfeval_smoke CTest.

make check/asan 271/271; CodeScene clean; determinism unaffected (no engine change here).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

coderabbitai · 2026-06-25T03:04:39Z

📝 Walkthrough

Walkthrough

Adds a new qsl-perfeval benchmark binary, wires it into the build and tests, and records performance evidence for matching-engine changes with before/after perf data and reproduction steps.

Changes

Performance evidence harness

| Layer / File(s) | Summary |
|---|---|
| Build target and smoke test `CMakeLists.txt`, `tests/CMakeLists.txt` | Adds the `qsl-perfeval` executable target and a CTest smoke case that runs it in latency mode and checks for `perfeval: latency_ns` output. |
| Benchmark scaffolding `apps/qsl-perfeval/main.cpp` | Adds the benchmark harness setup, including allocation-counting `new`/`delete` overrides, timing helpers, steady-state order-flow state, and latency-stat helpers. |
| Benchmark modes and CLI `apps/qsl-perfeval/main.cpp` | Implements the throughput and latency loops, `--latency`/orders parsing, and the `main` entry point that prints summary metrics. |
| Performance evidence reports `PERFORMANCE.md`, `docs/performance/perf-stat.txt` | Adds the matching-engine performance report, perf-stat before/after counters, profiling notes, reproduction commands, and load-factor sweep results. |

Sequence Diagram(s)

sequenceDiagram
  participant user as User
  participant main as qsl-perfeval main
  participant flow as PerfFlow
  participant engine as qsl_core matching engine
  participant out as stdout
  user->>main: run qsl-perfeval [--latency] [orders]
  main->>main: parse_orders(argc, argv, latency)
  alt throughput mode
    main->>flow: run_throughput(orders)
    flow->>engine: submit new_limit and cancel oldest order
    engine-->>flow: order ids
    flow-->>main: orders_per_sec, allocs_per_order
  else latency mode
    main->>flow: run_latency(orders)
    flow->>engine: submit new_limit and cancel oldest order
    engine-->>flow: order ids
    flow-->>main: latency_ns stats and timer overhead
  end
  main->>out: print summary metrics

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

codescene-delta-analysis

Poem

A rabbit hopped through perf and time,
With qsl-perfeval in tidy rhyme. 🐇
It counted hops and clocks so neat,
Then printed metrics crisp and sweet.

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Description check | ⚠️ Warning | It covers the work well, but it omits required template sections like Milestone, DoD checklist, Tests commands, and Notes/decisions. | Add the Milestone section, the DoD checklist, a Tests block with commands run, and a Notes/decisions section matching the template. |

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title is concise and accurately highlights the main change: adding performance evidence for v0.2.2 engine wins. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


codescene-delta-analysis

Gates Passed
6 Quality Gates Passed

Quality Gate Profile: Pay Down Tech Debt

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 40f51d1333


chatgpt-codex-connector · 2026-06-25T03:09:35Z

+    // deep.
+    void retire_oldest(core::OrderId oid) {
+        sink += eng.cancel(sym, ring[head]).size();
+        ring[head] = oid;


Track only resting ids in the perf ring

When the random new_limit crosses enough liquidity, the submitted order can fully fill and never rest, but this code still parks that id in the ring. In that workload the later cancel for that slot is a no-op, so the harness is not measuring the documented ~512-deep book with one real maintenance cancel per order; I reproduced this with the same flow at ~159 resting orders after warmup and ~78% cancel failures over 1M cycles, which invalidates the before/after evidence built from this harness.


chatgpt-codex-connector · 2026-06-25T03:09:35Z

+void *operator new(std::size_t n) {
+    g_allocs.fetch_add(1, std::memory_order_relaxed);
+    if (void *p = std::malloc(n)) {
+        return p;
+    }
+    throw std::bad_alloc();
+}


Count aligned allocations in the harness

This replacement only intercepts the unaligned operator new path, but the baseline book allocates its pmr list/map/unordered_map nodes through std::pmr::new_delete_resource, which on libstdc++ uses aligned allocation (operator new(size_t, align_val_t)) as shown in the committed flamegraphs. Those allocations bypass g_allocs, so allocs_per_order is underreported and the allocation conclusion in PERFORMANCE.md is not trustworthy; locally, adding aligned overloads changed the same run from 1.1059 to 2.6804 allocations/order.
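
A minimal sketch of the idea behind this comment, assuming a C++17 toolchain on a POSIX-like system: replace both the plain and the `std::align_val_t` forms of the global `operator new` so that over-aligned requests are counted as well. The counter name `g_allocs` follows the quoted diff; everything else is illustrative and is not the harness's actual code.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <new>

std::atomic<std::uint64_t> g_allocs{0};

// Plain form: the only path the original override intercepted.
void* operator new(std::size_t n) {
    g_allocs.fetch_add(1, std::memory_order_relaxed);
    if (void* p = std::malloc(n != 0 ? n : 1)) {
        return p;
    }
    throw std::bad_alloc();
}

// Over-aligned form: requests arriving here bypass the plain overload.
void* operator new(std::size_t n, std::align_val_t al) {
    g_allocs.fetch_add(1, std::memory_order_relaxed);
    const auto a = static_cast<std::size_t>(al);
    // Round the size up to a multiple of the alignment for aligned_alloc.
    const std::size_t rounded = (n + a - 1) / a * a;
    if (void* p = std::aligned_alloc(a, rounded != 0 ? rounded : a)) {
        return p;
    }
    throw std::bad_alloc();
}

// Matching deallocation functions; both allocation paths are freed with free().
void operator delete(void* p) noexcept { std::free(p); }
void operator delete(void* p, std::size_t) noexcept { std::free(p); }
void operator delete(void* p, std::align_val_t) noexcept { std::free(p); }
void operator delete(void* p, std::size_t, std::align_val_t) noexcept { std::free(p); }
```

The array and nothrow forms are not replaced here because their default implementations forward to the functions above. The extra work in the aligned path is itself a perturbation, which is one reason to keep allocation counting out of timing runs.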


chatgpt-codex-connector · 2026-06-25T03:09:35Z

+Build:
+  Compiler:   GCC (c++) 16.1.1 20260515 (Red Hat 16.1.1-2)
+  Flags:      Release (-O3 -DNDEBUG) + -fno-omit-frame-pointer -g   (CMake "flamegraph" preset)
+  Binary:     build/flamegraph/qsl-perfeval


Add source-digest provenance to this artifact

AGENTS.md says benchmark/profiling artifacts must record source-digest provenance because the source digest, not a commit hash, is the artifact identity after the migration. This new perf-stat artifact has command/build/host metadata but omits Provenance version, Source digest, Source digest scope, and Dirty inputs, so the before/after numbers cannot be stale-checked against the exact qsl-perfeval/source inputs.


coderabbitai

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@apps/qsl-perfeval/main.cpp`:
- Around line 107-128: `submit()` is adding every generated `OrderId` to `ring`,
but some orders can fully match and never rest, which makes `retire_oldest()`
cancel invalid ids and lets the harness drift from the intended steady depth.
Update the `submit`/`warmup`/`retire_oldest` flow so `ring` only stores
confirmed resting orders (or otherwise distinguish passive from aggressive
submissions), and ensure only resting ids are rotated and canceled while
maintaining the target `kRing` depth.
- Around line 149-151: The p99 calculation in the latency summary is using an
off-by-one index, so it can select the last sample instead of the zero-based
99th percentile position. Update the logic in the stats computation around the
res.p99_ns assignment to clamp the computed index to the last valid element and
convert the percentile position to zero-based indexing before indexing into lat.
- Around line 207-217: Reject malformed order counts in parse_orders by
validating the full argv token before accepting the result of std::strtoull,
since partial parses and negative-looking inputs are currently treated as valid
counts. Update parse_orders in main.cpp to check that the entire string was
consumed via the end pointer from std::strtoull, and reject tokens that look
like options rather than numeric values before using the parsed order count; if
validation fails, fall back to a usage/error path instead of returning the
parsed value.

In `@PERFORMANCE.md`:
- Around line 83-93: The fenced markdown blocks in PERFORMANCE.md are unlabeled,
so update both fences to use the text language tag. In the performance dump and
the methodology section referenced by the review comment, keep the content
unchanged but label each triple-backtick fence as text so markdownlint passes
and readers get plain-text highlighting.


ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 0a9a8c53-5bd4-47e7-ac7b-2ad5b2ff49ed

📥 Commits

Reviewing files that changed from the base of the PR and between 5569eec and 40f51d1.

⛔ Files ignored due to path filters (2)

docs/performance/after.svg is excluded by !**/*.svg
docs/performance/before.svg is excluded by !**/*.svg

📒 Files selected for processing (5)

CMakeLists.txt
PERFORMANCE.md
apps/qsl-perfeval/main.cpp
docs/performance/perf-stat.txt
tests/CMakeLists.txt

coderabbitai · 2026-06-25T03:12:30Z

+    core::OrderId submit() {
+        const std::uint64_t r = splitmix64(state);
+        const auto side = ((r & 1U) != 0U) ? core::Side::Buy : core::Side::Sell;
+        const auto price = kBase + static_cast<core::Price>((r >> 1) % kBand);
+        const auto qty = 1 + static_cast<core::Quantity>((r >> 8) % 8);
+        const core::OrderId oid = id++;
+        sink += eng.new_limit(sym, oid, side, price, qty, core::TimeInForce::GTC).size();
+        return oid;
+    }
+
+    // Cancel the oldest resting order and park the new id in its slot, holding the book ~kRing
+    // deep.
+    void retire_oldest(core::OrderId oid) {
+        sink += eng.cancel(sym, ring[head]).size();
+        ring[head] = oid;
+        head = (head + 1) % kRing;
+    }
+
+    void warmup() {
+        while (ring.size() < kRing) {
+            ring.push_back(submit());
+        }


🎯 Functional Correctness | 🟠 Major | 🏗️ Heavy lift

Track only actually resting orders in ring.

submit() records every new OrderId, but marketable orders can fully fill and never rest. Once one of those ids lands in ring, retire_oldest() starts canceling non-existent orders, so book depth drifts with match rate instead of staying near kRing. That changes the workload this harness claims to measure.
Track only confirmed resting orders, or maintain passive depth separately from aggressive flow.
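
One way to picture the suggested fix is a ring that admits an id only after confirming the order actually rests. The sketch below uses a hypothetical `ToyEngine` in which an order either rests or fully fills on arrival; `RestingRing`, `track`, and `failed_cancels` are illustrative names, not the harness's real types.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Hypothetical toy engine: an order either rests or fully fills on arrival.
struct ToyEngine {
    std::unordered_set<std::uint64_t> resting;

    void new_limit(std::uint64_t id, bool fully_fills) {
        if (!fully_fills) {
            resting.insert(id);
        }
    }
    bool contains(std::uint64_t id) const { return resting.count(id) != 0; }
    bool cancel(std::uint64_t id) { return resting.erase(id) != 0; }
};

// Ring of resting order ids that holds the book about `capacity` deep.
struct RestingRing {
    std::size_t capacity;
    std::vector<std::uint64_t> slots;
    std::size_t head = 0;
    std::size_t failed_cancels = 0;

    explicit RestingRing(std::size_t cap) : capacity(cap) {
        assert(cap > 0);
        slots.reserve(cap);
    }

    void track(ToyEngine& eng, std::uint64_t id) {
        if (!eng.contains(id)) {
            return;  // filled on arrival: there is nothing to cancel later
        }
        if (slots.size() < capacity) {
            slots.push_back(id);  // still warming up
            return;
        }
        if (!eng.cancel(slots[head])) {
            ++failed_cancels;  // the oldest id was no longer resting
        }
        slots[head] = id;
        head = (head + 1) % capacity;
    }
};
```

In this toy model a resting order is never consumed by later flow, so every cancel succeeds. In a real book an id already in the ring can still be filled by a later marketable order, so some failed cancels may remain even with this check.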


coderabbitai · 2026-06-25T03:12:30Z

+    res.mean_ns = static_cast<std::uint32_t>(sum / lat.size());
+    res.p50_ns = lat[lat.size() / 2];
+    res.p99_ns = lat[(lat.size() * 99) / 100];


🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Fix the zero-based p99 index.

lat[(lat.size() * 99) / 100] selects the 100th element for a 100-sample run, so p99 is reported as the max. The published latency evidence will be inflated by one bucket.

Proposed fix

- res.p99_ns = lat[(lat.size() * 99) / 100];
+ res.p99_ns = lat[((lat.size() - 1) * 99) / 100];
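
The index arithmetic in the proposed fix can be checked in isolation. Below is a minimal sketch, assuming the samples are already sorted in ascending order; `percentile_index` and `percentile` are illustrative helper names that do not appear in the harness.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Zero-based index of the pct-th percentile among n sorted samples,
// using the ((n - 1) * pct) / 100 form from the proposed fix.
std::size_t percentile_index(std::size_t n, std::size_t pct) {
    assert(n > 0 && pct <= 100);
    return ((n - 1) * pct) / 100;
}

// Reads the percentile from an ascending-sorted, non-empty sample vector.
std::uint32_t percentile(const std::vector<std::uint32_t>& sorted, std::size_t pct) {
    return sorted[percentile_index(sorted.size(), pct)];
}
```

For 100 samples, the original `(n * 99) / 100` yields index 99 (the maximum), while this form yields index 98. For a single sample it yields index 0, so the result never runs past the end of the vector.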


coderabbitai · 2026-06-25T03:12:30Z

+std::uint64_t parse_orders(int argc, char **argv, bool latency) {
+    for (int i = 1; i < argc; ++i) {
+        const std::string a = argv[i];
+        if (a != "--latency") {
+            const std::uint64_t n = std::strtoull(a.c_str(), nullptr, 10);
+            if (n > 0) {
+                return n;
+            }
+        }
+    }
+    return latency ? 5'000'000ULL : 60'000'000ULL;


🩺 Stability & Availability | 🟠 Major

Reject malformed order counts instead of truncating them.

std::strtoull reports the first unconverted character through its end pointer (endptr), but this code passes nullptr and never checks it. For inputs like 123abc (parsed as 123) or -1 (which wraps to ULLONG_MAX), the code currently accepts the partial or negated value. Since this number feeds directly into reserve() and loop bounds, a typo causes silent OOMs or unexpectedly long runs instead of failing with a usage error.

Validate that the entire string was consumed by checking *endptr == '\0', and reject tokens that look like options rather than counts (e.g., argv[i][0] == '-') before using the parsed value.
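
As an alternative to checking the `strtoull` end pointer, whole-token validation can be sketched with `std::from_chars`, which for unsigned types accepts neither a sign nor leading whitespace. The function name `parse_order_count` and the choice to reject zero are assumptions made for illustration.

```cpp
#include <charconv>
#include <cstdint>
#include <optional>
#include <string_view>
#include <system_error>

// Parses a strictly positive decimal count; returns nullopt for anything
// else (trailing garbage, a sign, empty input, overflow, or zero).
std::optional<std::uint64_t> parse_order_count(std::string_view token) {
    std::uint64_t value = 0;
    const char* first = token.data();
    const char* last = first + token.size();
    const auto [ptr, ec] = std::from_chars(first, last, value, 10);
    if (ec != std::errc{} || ptr != last || value == 0) {
        return std::nullopt;
    }
    return value;
}
```

A caller can treat `std::nullopt` as a usage error and exit with a non-zero status rather than falling back to a default count.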


coderabbitai · 2026-06-25T03:12:30Z

+```
+                              BEFORE        AFTER
+MatchingEngine::new_limit     80.1 %        83.2 %
+  OrderBook::add_limit        69.5 %        74.7 %
+    OrderBook::match_baseline 25.7 %        32.0 %   <- matching
+    OrderBook::rest           33.3 %        31.8 %   <- insertion
+      OrderBook::level_for    21.3 %  ->    17.5 %   <- #138 try_emplace
+  OrderBook::contains          3.6 %  ->     1.3 %   <- #145 load-factor (dup-id lookup)
+MatchingEngine::cancel        18.2 %        15.8 %
+  OrderBook::cancel           16.0 %  ->    13.2 %   <- #145 load-factor (find + erase)
+```


📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Label the plain-text fences.

Both fenced blocks are unlabeled, so markdownlint flags them and readers lose syntax highlighting. Please mark the perf dump and methodology block as text.

Suggested fix

-``` +```text

Apply the same text label to both unlabeled fences.

Also applies to: 113-120

🧰 Tools

🪛 markdownlint-cli2 (0.22.1)

[warning] 83-83: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


Source: Linters/SAST tools

….1.0 perf baseline (#150)

* fix: address ignored CodeRabbit review findings (qsl-bench, tcp_server, perfeval, flamegraph)

Six real findings CodeRabbit flagged on merged PRs and we had not actioned:

- qsl-bench profile_seconds_from_args (#135): !(seconds > 0.0) does not reject inf/nan from strtod, and converting a non-finite/unbounded double to clock_type::duration is UB. Now requires std::isfinite and clamps to 3600s.
- tcp_server transient_accept_errno (#140): omitted ENOPROTOOPT and EOPNOTSUPP, which Linux accept(2) returns as already-pending per-connection errors and the epoll path already retries. They were wrongly fatal in the threaded acceptor; the set now matches is_transient_accept_error().
- qsl-perfeval ring (#148): submit() ringed every id, but marketable orders that fully fill never rest, so retire cancelled non-existent ids and book depth drifted with match rate. track() now checks eng.contains and rings only resting orders, holding the book genuinely ~kRing deep.
- qsl-perfeval p99 (#148): (n*99)/100 selected the max for small n; now uses the zero-based ((n-1)*99)/100 (and p50 likewise).
- qsl-perfeval parse_orders (#148): std::strtoull accepted "123abc"/"-1" (wraps to a huge count feeding reserve()/the loop). Now std::from_chars validates the whole token and rejects malformed input with a usage error + exit 2 (test added).
- flamegraph.py [unknown] folding (#135): rewritten with a precise, identified rationale. The [unknown] frames are fp-unwinding artifacts (glibc 2.43 malloc fast paths do not preserve x29 -> a spurious frame between resolved operator-new and _mid_memalign/_int_malloc/cfree; plus the vDSO clock_gettime leaf). Each is folded into its resolved caller so every rendered frame is a real symbol and the true operator-new -> malloc chain is revealed. DWARF was verified worse (mangles the _start asm entry into ~3 unknowns/stack).

make check/asan 272/272; CodeScene clean. Perf evidence (PERFORMANCE.md, flamegraphs) regenerated with the fixed harness in a follow-up commit.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* perf: regenerate evidence with the fixed harness and zero-unknown flamegraphs

Re-measured PERFORMANCE.md after the qsl-perfeval ring fix (track only resting orders) on the same host, back-to-back before/after:

  throughput 9.25M -> 10.76M orders/sec (+16.3%)
  cycles/order 345.7 -> 297.3 (-14.0%)
  instr/order 1246 -> 1144 (-8.2%); IPC 3.60 -> 3.85
  branch-miss 1.86% -> 1.69%
  allocs/order 1.108 -> 1.108 (unchanged)
  p50/p99 ~83 / ~208 ns both (latency distribution unchanged)
  cache-miss unavailable (Apple Silicon PMU; #90)

Honesty corrections from the counters: the win is throughput + cycles/order, not the new_limit latency tail (the earlier 250->208 p99 was thermal warm-up; steady state is ~208 both). Allocations are unchanged (libstdc++ map::emplace checks before allocating). New section identifies every [unknown] frame (fp glibc-malloc boundary artifact + vDSO clock_gettime leaf) and documents that DWARF is worse; all flamegraphs now render with ZERO [unknown] and the libc malloc internals resolved. results/flamegraph.svg regenerated (0 unknown, Dirty inputs: no). PERFORMANCE.md and perf-stat.txt are em-dash-free.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* docs: purge every em/en dash repo-wide and refresh stale README perf numbers

Removed every em dash (U+2014) and en dash (U+2013) from all 66 tracked text files (CLAUDE.md, AGENTS.md, MILESTONES.md, PROGRESS.md, HANDOFF.md, every doc, CHANGELOG, source-comment prose, scripts, tests), replacing with readable ASCII punctuation (comma / colon / period / hyphen by context). 0 em/en dashes remain repo-wide (ocaml/test/fixtures verified clean already). Source changes are comment-only; build + check stays 272/272.

Refreshed the README "numbers" block, which still carried the pre-remeasurement figures, to match the corrected PERFORMANCE.md (fixed qsl-perfeval harness): throughput 9.25M -> 10.76M (+16%), cycles/order 345.7 -> 297.3 (-14%), IPC 3.60 -> 3.85, branch-miss 1.86% -> 1.69%, allocs unchanged. Dropped the misleading p99 row (the latency distribution is unchanged; the win is throughput and cycles/order). Tests badge and quality table 271 -> 272.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* perf: reframe the headline comparison to v0.1.0 baseline -> v0.2.2

The performance evidence now compares the FIRST RELEASE (v0.1.0) to current, the cumulative engine change, instead of just the two most recent micro-opts. The v0.1.0 column was produced by porting the same qsl-perfeval harness into a git worktree at the v0.1.0 tag (identical MatchingEngine API) and running the same release preset back-to-back on the same host.

Cumulative (v0.1.0 -> v0.2.2, baseline storage):

  allocations/order 4.094 -> 1.108 (-73.0%) <- dominant cumulative win
  branch-miss rate 2.05% -> 1.69% (-17.5% relative)
  throughput 10.54M -> 10.99M (+4.3%)
  cycles/order 304.5 -> 290.7 (-4.5%); IPC 3.84 -> 3.94
  p50/p99 latency ~83/~209 -> ~83/~208 ns (unchanged)
  cache-miss unavailable (Apple Silicon PMU; #90)

Honest mechanism: the big change since v0.1.0 is allocation traffic (the storage/PMR work), cut 73% even on the default baseline path; throughput/cycles move modestly because the baseline hot path is bound by the ordered-map and intrusive-list ops, not allocation count. The two recent micro-opts (try_emplace, load-factor 0.25) are kept as a labeled v0.2.2 sub-analysis with their perf-report evidence. before.svg is now the v0.1.0 flamegraph, after.svg v0.2.2; both render with ZERO [unknown] and the libc malloc internals resolved. README and perf-stat.txt updated to the v0.1.0 baseline; all em-dash-free.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* docs: add mermaid diagrams to seven docs that lacked them

Visual diagrams (all em-dash-free, all special-char labels quoted):

- matching_rules: order matching decision flow (limit/market, cross/rest, IOC)
- binary_protocol: frame decode + reject pipeline
- persistence: append -> app buffer -> page cache -> disk durability layers
- ocaml_verifier: C++ vs OCaml differential + shrink pipeline
- concurrency_model: SPSC pipeline (input -> engine -> publisher) with backpressure
- memory_ordering: producer/consumer happens-before (release synchronizes-with acquire)
- socket_gateway: TcpServer accept loop (transient retry, fd-exhaustion backoff, cap shedding)

Docs that already had mermaid (architecture, differential_testing, replay_and_recovery, property_testing, benchmarking, README) are unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(perfeval): count over-aligned allocations; correct the -73% claim to -35%

Adversarial self-review found a measurement bug in the allocation counter: the global operator-new override only intercepted operator new(size_t) and the array form, missing operator new(size_t, align_val_t). v0.2.2's storage makes ~1.56 OVER-ALIGNED allocations/order (v0.1.0 makes none), so they were uncounted and allocs/order was reported as 1.108 when the true figure is 2.670. The headline "-73% allocations since v0.1.0" was therefore wrong; the real cut is -34.8%.

Fixes:

- Override every operator new/delete variant (plain + aligned) so the count is complete. The aligned override adds a little work per allocation, which would perturb cycle/throughput numbers, so allocation counting is now compile-time opt-in (QSL_PERFEVAL_COUNT_ALLOCS) behind a second CMake target, qsl-perfeval-allocs. The default qsl-perfeval leaves the allocator untouched (pure performance) and prints allocs_per_order=n/a.
- Re-measured both versions cleanly: performance from qsl-perfeval (no instrumentation), allocations from qsl-perfeval-allocs. Frequency-independent counts are the load-bearing metrics (wall-clock throughput is schedutil-noisy).

Corrected cumulative v0.1.0 -> v0.2.2 (baseline storage):

  allocations/order 4.094 -> 2.670 (-34.8%)
  cycles/order 310.7 -> 289.5 (-6.8%)
  instructions/order 1215 -> 1157 (-4.7%); IPC 3.91 -> 4.00
  branch-miss rate 2.01% -> 1.68% (-16.3% relative)
  p50/p99 latency 83/209 ns both (unchanged)
  cache-miss unavailable (Apple Silicon PMU; #90)

PERFORMANCE.md, perf-stat.txt, README updated and document the earlier mistake openly. make check/asan 272/272; CodeScene clean; all em-dash-free.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* results: refresh env-check provenance after the doc/em-dash changes

dpdk_environment.txt and nic_offload_environment.txt carry several docs (CLAUDE.md, AGENTS.md, MILESTONES.md, results/README.md) in their digest scope; regenerated so they stay Dirty inputs: no after the em-dash purge and edits.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix: address CodeRabbit's review of this PR (incl. two tables my purge broke)

CodeRabbit flagged four issues on #150; all real, all fixed:

- qsl-perfeval: `qsl-perfeval 1000 2000` silently ran with 2000. Reject a second order count as ambiguous. Extracted parse_args() so main stays under the cyclomatic-complexity threshold (CodeScene).
- concurrency_model.md + fix_protocol.md: the em-dash purge turned an em-dash table cell into ", ", producing a malformed `|, |` row. Restored the cells ("nothing" consumed / "(none)" internal field). Repo-wide scan confirms these were the only two corrupted table cells.
- persistence.md: the new durability mermaid node said "Durable, survives power loss", overclaiming vs the doc's best-effort fsync contract. Softened.
- release_readiness.md: a wrapped line started with "#32" (markdownlint heading false-positive); kept the reference inline.

make check 272/272; CodeScene clean; 0 em/en dashes repo-wide.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>

* chore(release): cut v0.2.2

Finalize the v0.2.2 changelog entry to include this session's documentation overhaul (#147), performance-evidence report (#148), README rebuild (#149), and the bug/style/mermaid sweep (#150) on top of the post-v0.2.1 hardening + perf wave (#135-#146). Fix the test count (272/272) and flip the v0.2.2 resume/release anchors (PROGRESS.md, release_readiness.md) from "in preparation" to released.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* docs: reconcile test count to 272/272 across release records (CodeRabbit)

CodeRabbit flagged docs/release_readiness.md still showing 270/270 while CHANGELOG/PROGRESS said 272/272. The two perfeval tests added this session took the count 270 -> 272, so the current-state and verification claims were stale. Updated all current-state references (release_readiness verification table, PROGRESS.md status + both summary blocks, the CLAUDE.md/AGENTS.md roadmap memory kept in sync, HANDOFF.md), and flipped the v0.2.2 "being cut / next action" phrasing to released. The one remaining 270/270 is a dated entry under PROGRESS.md "Decision log additions" (a correct historical snapshot). All em-dash-free.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>

codescene-delta-analysis Bot approved these changes Jun 25, 2026

chatgpt-codex-connector Bot reviewed Jun 25, 2026

coderabbitai Bot reviewed Jun 25, 2026

div0rce merged commit 64645fb into main Jun 25, 2026
8 checks passed

div0rce deleted the perf/performance-evidence-report branch June 25, 2026 03:13

coderabbitai Bot mentioned this pull request Jun 25, 2026

fix: extinguish ignored review bugs, purge em dashes, add mermaid, v0.1.0 perf baseline #150

Merged

div0rce mentioned this pull request Jun 25, 2026

chore(release): cut v0.2.2 #151

Merged
