fix(evr): bound ConfigRequest DB/cache load + per-session inbound rate limit#501
fix(evr): bound ConfigRequest DB/cache load + per-session inbound rate limit#501thesprockee wants to merge 5 commits into
Conversation
configRequest cached on the client-controlled Type+":"+ID while the storage read keys on Type only, so a remote caller holding the public ServerKey could vary ID every packet to guarantee a cache miss and force one DB read per message; when a stored Config:<Type> object existed the same trick grew the sync.Map without bound. Key the lookup on Type only (matching the storage read), cache the negative result so distinct-ID floods collapse to a single read, and replace the unbounded sync.Map with a bounded (256-entry) TTL'd LRU. Behavior for the four valid Type:ID default pairs and real stored config is preserved. Reachability and amplification are covered by tests that drive the real handler with a stubbed storage read (no live Postgres). Refs BUGS.md SEC-1. Co-Authored-By: Andrew Bates <a@sprock.io> Signed-off-by: Spritz Metis Sprock <spritz@sprock.io>
The Consume loop dispatched every parsed EVR message with no throttle, so one established socket could submit messages as fast as the TCP window allows. Connection-level rate limits do not throttle messages inside an already-established socket, so a per-message flood on one allowed socket force-multiplies SEC-1 into DB exhaustion. Add a per-session token-bucket limiter, checked before dispatch. Default 100 msg/s sustained with a 200 burst (config: socket.inbound_message_ rate_limit_per_sec / _burst; 0 disables). The server treats ~1.3 msg/s as "regularly active" (PingBackoffThreshold/PingPeriodMs), so the default is ~75x a normally active session and generous for connect-time bursts, while capping a single socket far below the thousands/sec an unthrottled flood can push. Drops are audited, never silent: a counter per drop and a throttled warn log naming the client IP, dropped message type, and running total. Refs BUGS.md SEC-2. Co-Authored-By: Andrew Bates <a@sprock.io> Signed-off-by: Spritz Metis Sprock <spritz@sprock.io>
There was a problem hiding this comment.
Pull request overview
This PR hardens the EchoVR (EVR) real-time WebSocket message path against per-session resource exhaustion by (1) fixing ConfigRequest cache keying/unbounded growth and (2) adding a per-session inbound message rate limiter to prevent message-flood amplification on expensive handlers.
Changes:
- Replace the ConfigRequest cache with a bounded, TTL’d LRU keyed by config Type (matching the storage read key) and cache negative lookups to avoid repeated DB hits (SEC-1).
- Add a per-session token-bucket inbound rate limiter enforced in the EVR IncomingLoop, plus new socket config knobs and audit logging/metrics (SEC-2).
- Add regression tests covering both SEC-1 (cache behavior/cap/reachability) and SEC-2 (limiter throttling + disabled path).
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| server/session_ws.go | Enforces the per-session inbound limiter before dispatching parsed EVR messages. |
| server/evr_pipeline_config.go | Implements the bounded TTL+LRU cache keyed on config Type and refactors stored-config loading. |
| server/evr_pipeline_config_sec_test.go | Adds SEC-1 regression tests for cache amplification/growth/cap and behavior preservation. |
| server/evr_inbound_ratelimit.go | Introduces the per-session token-bucket limiter with drop accounting and log throttling. |
| server/evr_inbound_ratelimit_test.go | Adds SEC-2 unit tests for throttling behavior and the disabled (nil) limiter path. |
| server/config.go | Adds socket config fields and defaults for inbound per-session message rate limiting. |
| if allowed > burst+perSec { | ||
| t.Fatalf("SEC-2: limiter allowed %d, well above burst %d (cap ineffective)", allowed, burst) | ||
| } |
…ministic test) Reword the three SEC-1 code comments in evr_pipeline_config.go to be self-contained for public readers: drop the "BUGS.md" pointer (that file is gitignored and absent from the repo, so the reference resolved to nothing) while keeping the SEC-1 label and the inline invariant description. Make TestSEC2_InboundRateLimiter_ThrottlesFlood deterministic. rate.Limiter .Allow() reads time.Now() per call, so tokens replenish at perSec as the loop runs; the old upper bound (allowed > burst+perSec) implicitly assumed the loop finished within ~1s of refill and could fail on a slow or -race machine. Assert against the token-bucket guarantee computed from the measured loop duration (burst + perSec*elapsed) instead. Verified stable: 20x runs under -race pass. Co-Authored-By: Andrew Bates <a@sprock.io> Signed-off-by: Spritz Metis Sprock <spritz@sprock.io>
| // with a 200 burst is ~75x a normally active session — generous for | ||
| // connect-time bursts (config/document/login) yet caps a single socket | ||
| // far below the thousands/sec an unthrottled flood can push. | ||
| InboundMessageRateLimitPerSec: 100, |
There was a problem hiding this comment.
analyze the production logs to find out what the maximum ingres message rate is that we've seen in the past month. figure out what 3.5σ looks like, and set the value to that.
There was a problem hiding this comment.
Done — value is now data-derived, not guessed: InboundMessageRateLimitPerSec 100 → 9.
Source: one month of production nakama debug logs on fortytwo. Every inbound EVR message logs "msg":"Received message" with its sid at debug level (the global logger runs at debug in prod), and the limiter checks allow() once per that same parsed message in IncomingLoop — so the log is a 1:1 record of what the limiter counts, not a proxy.
Extraction (streamed, read-only, per-session per-wall-clock-second message count → distribution of msgs/sec/session):
# 29 log files (17 .zst + 12 .log + live), ts-filtered to the trailing month
awk to bucket key=(sid, ts-truncated-to-second); count++ ; flush per second;
accumulate n, Σx, Σx², max over all (sid,second) buckets with ≥1 message
Result:
| metric | value |
|---|---|
| window | 2026-06-03 → 2026-07-03 (30d) |
| sample | n = 10,782,318 active session-seconds / 19,895,343 msgs |
| mean | 1.845 msg/s/session |
| σ | 1.790 msg/s/session |
| max | 21 msg/s/session (single peak) |
| mean + 3.5σ | 8.11 → 9 |
Chosen: 9 (ceil of 8.11 — round up so the cap never sits below the 3.5σ legit boundary). The Burst stays 200, ~10x the observed 21 msg/s peak, so a legitimate connect-time burst is never clipped while 9 msg/s sustained still caps a flood. Distribution is right-skewed (exponential-ish), so 3.5σ lands at 9 even though the tail reaches 21; the token-bucket burst absorbs those sub-sustained spikes. TestSEC2_DefaultInboundRateLimit_IsDataDerived pins the value.
Limitation (honest): counts EVR-format inbound only (the dominant path for game clients); standard rtapi messages use a different log site and are negligible here. Bucketing is by wall-clock second, so a burst split across a second boundary is counted as two buckets — conservative (never inflates the peak).
0a3233494
There was a problem hiding this comment.
probably should just move rate limiting to the session_ws.go
There was a problem hiding this comment.
Done in 88ecbe614.
Checked current placement first: the SEC-2 enforcement already lived in session_ws.go (IncomingLoop calls s.inboundLimiter.allow() / logDue() / droppedTotal() per inbound message, and NewSessionWS constructs newInboundRateLimiter(...)). The only piece sitting elsewhere was the inboundRateLimiter type + its methods, which lived in a standalone server/evr_inbound_ratelimit.go. Nothing rate-limiting-related lived in evr_pipeline_config.go (that file is the ConfigRequest cache / SEC-1).
Consolidated per your intent:
- moved the
inboundRateLimitertype andnewInboundRateLimiter/allow/droppedTotal/logDueintosession_ws.go, right next to the loop that enforces them (added thegolang.org/x/time/rateimport there); - deleted
server/evr_inbound_ratelimit.go; - renamed the test
evr_inbound_ratelimit_test.go→session_ws_ratelimit_test.goto match.
No behavior change — go test -race ./server/ -run TestSEC2_InboundRateLimiter still green (throttle + disabled cases).
Andrew (PR #501 review): "probably should just move rate limiting to the session_ws.go". The SEC-2 per-session token-bucket enforcement already lives in session_ws.go IncomingLoop; only the inboundRateLimiter type and its methods sat in a separate evr_inbound_ratelimit.go. Move them next to the loop that enforces them and rename the test file to match. No behavior change. Co-Authored-By: Andrew Bates <a@sprock.io> Signed-off-by: Spritz Metis Sprock <spritz@sprock.io>
Andrew (PR #501 review): "analyze the production logs to find out what the maximum ingress message rate is that we've seen in the past month. figure out what 3.5σ looks like, and set the value to that." Measured per-session inbound EVR message rate from one month of production debug logs on fortytwo (each inbound message logs "Received message" with its sid — the same messages the limiter counts): window : 2026-06-03 → 2026-07-03 (30 days) sample : n = 10,782,318 active session-seconds, 19,895,343 messages mean : 1.845 msg/s/session σ : 1.790 msg/s/session max : 21 msg/s/session (single peak; absorbed by the 200 burst) 3.5σ : mean + 3.5σ = 8.11 → rounded up to 9 The previous default of 100 was a guess. 9 msg/s sustained is the 3.5σ point of real traffic; the 200 burst (~10x the observed 21 msg/s peak) still passes any legitimate connect-time burst. Test pins the value so it cannot silently drift. Co-Authored-By: Andrew Bates <a@sprock.io> Signed-off-by: Spritz Metis Sprock <spritz@sprock.io>
Summary
Hardens the EVR real-time message path against per-session resource exhaustion. Two related fixes.
SEC-1 — ConfigRequest cache bypass + unbounded growth
The config cache was keyed on the client-controlled
Type + ":" + ID, but the storage read keys only onType. So every request with a freshIDmissed the cache and forced aStorageReadObjectscall, and — when a storedConfig:<Type>object exists — the cache grew one entry per distinctIDwithout bound.ConfigRequestis reachable without user auth.Fix: key the cache on
Typeonly (matching the storage read) and add a bounded LRU (256 entries, TTL). Distinct-IDrequests for the sameTypenow collapse to one read and one entry. Negative results are cached too. The four built-in defaulttype:idpairs and real stored config are preserved.SEC-2 — no per-session inbound message throttle
The Consume loop dispatched every parsed message with no rate limit, so a single established socket could submit messages at line rate and force-multiply any cheap-request/expensive-handler pair.
Fix: a per-session token-bucket limiter checked before dispatch. Default 100 msg/s sustained, burst 200 (
socket.inbound_message_rate_limit_per_sec/_burst;0disables). Anchored on the server's own ~1.3 msg/s "regularly active" signal, so ~75x headroom for legitimate play. Drops are audited (counter + throttled warn with client IP, message type, running total) — never silent.Tests
Red→green for both: distinct-ID amplification (N reads → 1), unbounded growth (N entries → 1), cache size cap, and limiter throttle/disabled.
gofmt,go vet, and the SEC-1/SEC-2 tests pass under-race.Deployment
None in this PR.
🤖 Generated with Claude Code