fix(gateway,feed): harden network paths — EINTR, accept fairness, conn cap, UDP send errors#137
Conversation
…n cap, UDP send errors A second Codex bug-hunt pass over the concurrency/network code (the first pass covered the engine/replay/protocol) found four network-path robustness issues: - tcp_server EINTR (major-correctness): write_all() and the serve_connection() read loop treated a signal-interrupted send()/read() (errno==EINTR) as a broken connection and dropped the peer. EINTR is retryable; the epoll path already retries. Both now retry via a read_retry() helper and an EINTR branch in write_all(). - tcp_server unbounded threads (resource starvation): one std::thread per connection with no cap let slow/idle clients exhaust threads/fds/memory. Added TcpServerOptions::max_active_connections (0 = unlimited, default) with load-shed admission control: at the cap a freshly accepted connection is closed instead of spawning another worker. - epoll accept fairness (resource starvation): accept_connections() drained the whole backlog each event-loop turn, so a connection flood could starve already ready client read/write events. Added EpollServerOptions::max_accepts_per_tick (default 64); the listener is level-triggered so remaining backlog is accepted on later turns — no connections lost. - udp_feed send errors (minor): sendto() return was ignored, hiding publish failures. UdpPublisher now counts them (send_failures()); the best-effort feed still does not stop. Refactored the touched functions into small helpers (read_retry, at_capacity, spawn_worker, accepts_per_tick_limit) to keep them under the CodeScene complexity thresholds (delta clean, both files stay 10.0). Tests: new "tcp server caps concurrent connections and sheds load past the cap" exercises the admission control (held slot -> over-cap connection is shed -> slot frees -> re-admits), race-robust via bounded retry. make check 265/265, make asan 265/265, make tsan 20/20. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
Warning Review limit reached
More reviews will be available in 14 minutes and 59 seconds. Learn how PR review limits work. Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file). ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based credits. 🚦 How do rate limits work?CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan review availability. For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, additional reviews become available more gradually as earlier reviews age out of the rolling window. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (7)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Gates Passed
6 Quality Gates Passed
See analysis details in CodeScene
Quality Gate Profile: Pay Down Tech Debt
Install CodeScene MCP: safeguard and uplift AI-generated code. Catch issues early with our IDE extension and CLI tool.
#139) Refreshes the committed flamegraph so its Source digest matches current src/ (PRs #136 #137 #138; Dirty inputs: no). The try_emplace optimization (#138) is visible: the previously top-3 std::_Rb_tree::_M_emplace_unique allocation frame (~1137 samples) is gone from the hot stacks, which are now the actual matching work — cancel, match_baseline, rest, new_limit. ~20k samples, zero [unknown]. Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
…xhaustion (#140) A multi-agent bug-hunt (16 finders, 3-lens adversarial verify) found that the accept paths still treated recoverable errors as fatal — the same EINTR class #137 fixed for read()/send(), but on accept() itself: - tcp_server (major): TcpServer::serve_listen_socket did `if (conn < 0) break;`, unconditionally exiting the accept loop. This is the DEFAULT gateway transport. Two reachable triggers: EINTR (any signal to the acceptor thread — debugger, the repo's own perf workflows, job control) and ECONNABORTED (a client that RSTs between the kernel completing the handshake and accept() returning, e.g. connect + SO_LINGER{1,0} close). Either silently and permanently disabled the whole gateway (run() even returns true). Now accept_retry() classifies transient errnos (EINTR/ECONNABORTED/EPROTO/...) and keeps serving; only a genuinely fatal listener error stops the loop. Mirrors the epoll path's is_transient_accept_error(). - epoll_server (minor): accept_one() mapped EMFILE/ENFILE (fd exhaustion) to Fatal, tearing down the whole event loop and dropping every existing client. Now it disarms the level-triggered listener (so it stops busy-spinning epoll_wait — adding EMFILE to the transient Retry set would spin) and keeps serving; close_client_entry re-arms the listener when a client fd frees. Refactored serve_listen_socket into accept_retry/reached_accept_limit helpers and made transient_accept_errno table-driven to stay under CodeScene thresholds (delta clean). make check/asan 265/265, tsan 20/20. Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
…ct (#147) * docs: overhaul all stale docs for the post-v0.2.1 (v0.2.2) state Full staleness audit of every prose doc against current main. The anchors were frozen at v0.2.1 / 263 tests / "no active milestone" while 12 PRs (#135-#146) had merged and are unreleased. - Resume anchors (PROGRESS.md, HANDOFF.md): Current state / Last action / Next action / test count (263->270) brought current; the two duplicate frozen anchors and the stale macOS benchmark numbers fixed; a dated v0.2.2 log entry. - CLAUDE.md + AGENTS.md: the post-M35 roadmap-memory section now records the post-v0.2.1 hardening + perf wave (identical edit in both). - CHANGELOG.md: new [0.2.2] section (decoder enum rejection #136, network/CLI hardening #137/#140/#141/#143, real UBSan abort gate #142, ocaml diff_report #144, try_emplace ~+5% #138, index load-factor ~+18.6% #145). CMakeLists version 0.2.1 -> 0.2.2. - README: benchmark/flamegraph/limitations sections reflect the engine wins (measured on the profile workload, not the micro-bench table) and the gateway hardening; release_readiness 270/270 + UBSan gate + v0.2.2 scope. - Networking docs (socket_gateway, socket_hardening): connection cap, EINTR retry, transient-accept survival, fd-exhaustion handling, UDP send-error counting. replay_and_recovery: decode_command now rejects out-of-domain enums. binary_protocol/differential_testing/fix_protocol/SECURITY/recruiting_notes/ CONTRIBUTING: smaller accuracy updates. results/README: add the three socket artifacts. make check 270/270. Stale results/*.txt provenance digests regenerated separately. pool_backed_storage.md table follows its artifact regeneration. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * results: regenerate stale provenance artifacts for the v0.2.2 source The post-v0.2.1 source changes (#135-#146) left 14 results/*.txt with stale Source digests (the authoritative staleness signal per the provenance rules, not commit-hash drift). Regenerated via their make targets so each declares Dirty inputs: no against current HEAD: differential, pool_backed_storage, allocator_experiment, recovery_benchmarks, false_sharing_study, perf_stat_linux (partial PMU, QSL_PERF_ALLOW_PARTIAL), perf_report_linux, numa_affinity_study (linux-constrained), socket_load_summary, socket_profile_loopback, socket_stress_summary, dpdk_environment, nic_offload_environment. docs/pool_backed_storage.md: refreshed the median table, digest reference, and qualitative ordering from the regenerated artifact (contiguous fastest on four of five workloads; intrusive leads dense). The baseline rows now include the try_emplace (#138) and index load-factor (#145) wins. Honesty notes: these were measured on a thermally-warmed M2 from a long session, so absolute values run higher than a cool-host snapshot — the relative orderings and provenance digests are the load-bearing content, and every artifact carries a hardware/build-dependence caveat. latest.txt is regenerated separately on a cooled host to keep its headline micro-benchmark numbers representative. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * results: regenerate latest.txt (cool host) and sync README/PROGRESS numbers latest.txt regenerated on a thermally-recovered host so the headline micro-bench numbers are representative (protocol canary 16.1 ns/op), with a fresh Source digest (Dirty inputs: no) matching current source. The README benchmark table and the PROGRESS measured-results section are aligned to it: order_book ~90, protocol ~16, gateway ~102, matching ~91, replay ~101 ns. The matching/replay rows are slightly faster than the prior committed run (the v0.2.2 engine wins showing on the resting-order paths); the order_book micro-bench is unchanged (near-empty index, so it does not exercise the load-factor win). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * results: re-regenerate env-check artifacts against the final v0.2.2 docs dpdk_environment.txt and nic_offload_environment.txt include README.md (and CLAUDE.md/AGENTS.md/results/README.md) in their digest scope, so the README number-sync re-staled them. Regenerated against the final committed docs; Dirty inputs: no. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
Summary
A second independent Codex bug-hunt pass — focused on the concurrency/network code the first pass (#136) didn't cover — found four network-path robustness issues. The core engine, SPSC ring, FIX adapter, and storage modes were reviewed and reported clean.
Bugs fixed
tcp_serverEINTR (correctness).write_all()and theserve_connection()read loop treated a signal-interruptedsend()/read()(errno == EINTR) as a broken connection and dropped the peer.EINTRis retryable, and the epoll path already retries. Both now retry (via aread_retry()helper + anEINTRbranch inwrite_all()).tcp_serverunbounded threads (resource starvation). Onestd::threadper connection with no cap let slow/idle clients exhaust threads/fds/memory. AddedTcpServerOptions::max_active_connections(0 = unlimited, the default) with load-shed admission control: at the cap a freshly accepted connection is closed rather than spawning another worker.accept_connections()drained the entire backlog each event-loop turn, so a connection flood could starve already-ready client read/write events. AddedEpollServerOptions::max_accepts_per_tick(default 64). The listener is level-triggered, so any remaining backlog is accepted on later turns — no connections are lost.udp_feedsend errors (minor).sendto()'s return was ignored, hiding publish failures.UdpPublishernow counts them viasend_failures(); the best-effort feed still does not stop on a drop.Notes
max_active_connectionsdefaults to unlimited, andmax_accepts_per_tick's 64 only bounds pathological floods.read_retry,at_capacity,spawn_worker,accepts_per_tick_limit) to stay under the CodeScene complexity thresholds (delta clean, both files remain 10.0).Verification
tcp server caps concurrent connections and sheds load past the capexercises admission control (held slot → over-cap connection shed → slot frees → re-admits), made race-robust with a bounded retry.make check265/265,make asan265/265,make tsan20/20; cap test 0 failures over repeated runs.🤖 Generated with Claude Code