fix(evr): bound ConfigRequest DB/cache load + per-session inbound rate limit by thesprockee · Pull Request #501 · EchoTools/nakama

thesprockee · 2026-07-03T11:29:38Z

Summary

Hardens the EVR real-time message path against per-session resource exhaustion. Two related fixes.

SEC-1 — ConfigRequest cache bypass + unbounded growth

The config cache was keyed on the client-controlled Type + ":" + ID, but the storage read keys only on Type. So every request with a fresh ID missed the cache and forced a StorageReadObjects call, and — when a stored Config:<Type> object exists — the cache grew one entry per distinct ID without bound. ConfigRequest is reachable without user auth.

Fix: key the cache on Type only (matching the storage read) and add a bounded LRU (256 entries, TTL). Distinct-ID requests for the same Type now collapse to one read and one entry. Negative results are cached too. The four built-in default type:id pairs and real stored config are preserved.

SEC-2 — no per-session inbound message throttle

The Consume loop dispatched every parsed message with no rate limit, so a single established socket could submit messages at line rate and force-multiply any cheap-request/expensive-handler pair.

Fix: a per-session token-bucket limiter checked before dispatch. Default 100 msg/s sustained, burst 200 (socket.inbound_message_rate_limit_per_sec / _burst; 0 disables). Anchored on the server's own ~1.3 msg/s "regularly active" signal, so ~75x headroom for legitimate play. Drops are audited (counter + throttled warn with client IP, message type, running total) — never silent.

Tests

Red→green for both: distinct-ID amplification (N reads → 1), unbounded growth (N entries → 1), cache size cap, and limiter throttle/disabled. gofmt, go vet, and the SEC-1/SEC-2 tests pass under -race.

Deployment

None in this PR.

🤖 Generated with Claude Code

configRequest cached on the client-controlled Type+":"+ID while the storage read keys on Type only, so a remote caller holding the public ServerKey could vary ID every packet to guarantee a cache miss and force one DB read per message; when a stored Config:<Type> object existed the same trick grew the sync.Map without bound. Key the lookup on Type only (matching the storage read), cache the negative result so distinct-ID floods collapse to a single read, and replace the unbounded sync.Map with a bounded (256-entry) TTL'd LRU. Behavior for the four valid Type:ID default pairs and real stored config is preserved. Reachability and amplification are covered by tests that drive the real handler with a stubbed storage read (no live Postgres). Refs BUGS.md SEC-1. Co-Authored-By: Andrew Bates <a@sprock.io> Signed-off-by: Spritz Metis Sprock <spritz@sprock.io>

The Consume loop dispatched every parsed EVR message with no throttle, so one established socket could submit messages as fast as the TCP window allows. Connection-level rate limits do not throttle messages inside an already-established socket, so a per-message flood on one allowed socket force-multiplies SEC-1 into DB exhaustion. Add a per-session token-bucket limiter, checked before dispatch. Default 100 msg/s sustained with a 200 burst (config: socket.inbound_message_ rate_limit_per_sec / _burst; 0 disables). The server treats ~1.3 msg/s as "regularly active" (PingBackoffThreshold/PingPeriodMs), so the default is ~75x a normally active session and generous for connect-time bursts, while capping a single socket far below the thousands/sec an unthrottled flood can push. Drops are audited, never silent: a counter per drop and a throttled warn log naming the client IP, dropped message type, and running total. Refs BUGS.md SEC-2. Co-Authored-By: Andrew Bates <a@sprock.io> Signed-off-by: Spritz Metis Sprock <spritz@sprock.io>

Copilot

Pull request overview

This PR hardens the EchoVR (EVR) real-time WebSocket message path against per-session resource exhaustion by (1) fixing ConfigRequest cache keying/unbounded growth and (2) adding a per-session inbound message rate limiter to prevent message-flood amplification on expensive handlers.

Changes:

Replace the ConfigRequest cache with a bounded, TTL’d LRU keyed by config Type (matching the storage read key) and cache negative lookups to avoid repeated DB hits (SEC-1).
Add a per-session token-bucket inbound rate limiter enforced in the EVR IncomingLoop, plus new socket config knobs and audit logging/metrics (SEC-2).
Add regression tests covering both SEC-1 (cache behavior/cap/reachability) and SEC-2 (limiter throttling + disabled path).

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

Show a summary per file

File	Description
server/session_ws.go	Enforces the per-session inbound limiter before dispatching parsed EVR messages.
server/evr_pipeline_config.go	Implements the bounded TTL+LRU cache keyed on config Type and refactors stored-config loading.
server/evr_pipeline_config_sec_test.go	Adds SEC-1 regression tests for cache amplification/growth/cap and behavior preservation.
server/evr_inbound_ratelimit.go	Introduces the per-session token-bucket limiter with drop accounting and log throttling.
server/evr_inbound_ratelimit_test.go	Adds SEC-2 unit tests for throttling behavior and the disabled (nil) limiter path.
server/config.go	Adds socket config fields and defaults for inbound per-session message rate limiting.

+	if allowed > burst+perSec {
+		t.Fatalf("SEC-2: limiter allowed %d, well above burst %d (cap ineffective)", allowed, burst)
+	}


…ministic test) Reword the three SEC-1 code comments in evr_pipeline_config.go to be self-contained for public readers: drop the "BUGS.md" pointer (that file is gitignored and absent from the repo, so the reference resolved to nothing) while keeping the SEC-1 label and the inline invariant description. Make TestSEC2_InboundRateLimiter_ThrottlesFlood deterministic. rate.Limiter .Allow() reads time.Now() per call, so tokens replenish at perSec as the loop runs; the old upper bound (allowed > burst+perSec) implicitly assumed the loop finished within ~1s of refill and could fail on a slow or -race machine. Assert against the token-bucket guarantee computed from the measured loop duration (burst + perSec*elapsed) instead. Verified stable: 20x runs under -race pass. Co-Authored-By: Andrew Bates <a@sprock.io> Signed-off-by: Spritz Metis Sprock <spritz@sprock.io>

thesprockee · 2026-07-03T12:08:29Z

+		// with a 200 burst is ~75x a normally active session — generous for
+		// connect-time bursts (config/document/login) yet caps a single socket
+		// far below the thousands/sec an unthrottled flood can push.
+		InboundMessageRateLimitPerSec: 100,


analyze the production logs to find out what the maximum ingres message rate is that we've seen in the past month. figure out what 3.5σ looks like, and set the value to that.

Done — value is now data-derived, not guessed: InboundMessageRateLimitPerSec 100 → 9.

Source: one month of production nakama debug logs on fortytwo. Every inbound EVR message logs "msg":"Received message" with its sid at debug level (the global logger runs at debug in prod), and the limiter checks allow() once per that same parsed message in IncomingLoop — so the log is a 1:1 record of what the limiter counts, not a proxy.

Extraction (streamed, read-only, per-session per-wall-clock-second message count → distribution of msgs/sec/session):

# 29 log files (17 .zst + 12 .log + live), ts-filtered to the trailing month awk to bucket key=(sid, ts-truncated-to-second); count++ ; flush per second; accumulate n, Σx, Σx², max over all (sid,second) buckets with ≥1 message

Result:

metric value

window 2026-06-03 → 2026-07-03 (30d)

sample n = 10,782,318 active session-seconds / 19,895,343 msgs

mean 1.845 msg/s/session

σ 1.790 msg/s/session

max 21 msg/s/session (single peak)

mean + 3.5σ 8.11 → 9

Chosen: 9 (ceil of 8.11 — round up so the cap never sits below the 3.5σ legit boundary). The Burst stays 200, ~10x the observed 21 msg/s peak, so a legitimate connect-time burst is never clipped while 9 msg/s sustained still caps a flood. Distribution is right-skewed (exponential-ish), so 3.5σ lands at 9 even though the tail reaches 21; the token-bucket burst absorbs those sub-sustained spikes. TestSEC2_DefaultInboundRateLimit_IsDataDerived pins the value.

Limitation (honest): counts EVR-format inbound only (the dominant path for game clients); standard rtapi messages use a different log site and are negligible here. Bucketing is by wall-clock second, so a burst split across a second boundary is counted as two buckets — conservative (never inflates the peak).

0a3233494

thesprockee · 2026-07-03T12:09:02Z

probably should just move rate limiting to the session_ws.go

Done in 88ecbe614.

Checked current placement first: the SEC-2 enforcement already lived in session_ws.go (IncomingLoop calls s.inboundLimiter.allow() / logDue() / droppedTotal() per inbound message, and NewSessionWS constructs newInboundRateLimiter(...)). The only piece sitting elsewhere was the inboundRateLimiter type + its methods, which lived in a standalone server/evr_inbound_ratelimit.go. Nothing rate-limiting-related lived in evr_pipeline_config.go (that file is the ConfigRequest cache / SEC-1).

Consolidated per your intent:

moved the inboundRateLimiter type and newInboundRateLimiter / allow / droppedTotal / logDue into session_ws.go, right next to the loop that enforces them (added the golang.org/x/time/rate import there);

deleted server/evr_inbound_ratelimit.go;

renamed the test evr_inbound_ratelimit_test.go → session_ws_ratelimit_test.go to match.

No behavior change — go test -race ./server/ -run TestSEC2_InboundRateLimiter still green (throttle + disabled cases).

Andrew (PR #501 review): "probably should just move rate limiting to the session_ws.go". The SEC-2 per-session token-bucket enforcement already lives in session_ws.go IncomingLoop; only the inboundRateLimiter type and its methods sat in a separate evr_inbound_ratelimit.go. Move them next to the loop that enforces them and rename the test file to match. No behavior change. Co-Authored-By: Andrew Bates <a@sprock.io> Signed-off-by: Spritz Metis Sprock <spritz@sprock.io>

Andrew (PR #501 review): "analyze the production logs to find out what the maximum ingress message rate is that we've seen in the past month. figure out what 3.5σ looks like, and set the value to that." Measured per-session inbound EVR message rate from one month of production debug logs on fortytwo (each inbound message logs "Received message" with its sid — the same messages the limiter counts): window : 2026-06-03 → 2026-07-03 (30 days) sample : n = 10,782,318 active session-seconds, 19,895,343 messages mean : 1.845 msg/s/session σ : 1.790 msg/s/session max : 21 msg/s/session (single peak; absorbed by the 200 burst) 3.5σ : mean + 3.5σ = 8.11 → rounded up to 9 The previous default of 100 was a guess. 9 msg/s sustained is the 3.5σ point of real traffic; the 200 burst (~10x the observed 21 msg/s peak) still passes any legitimate connect-time burst. Test pins the value so it cannot silently drift. Co-Authored-By: Andrew Bates <a@sprock.io> Signed-off-by: Spritz Metis Sprock <spritz@sprock.io>

Spritz Metis Sprock and others added 2 commits July 1, 2026 20:42

Copilot AI review requested due to automatic review settings July 3, 2026 11:29

Copilot started reviewing on behalf of thesprockee July 3, 2026 11:30 View session

Copilot AI reviewed Jul 3, 2026

View reviewed changes

thesprockee commented Jul 3, 2026

View reviewed changes

thesprockee marked this pull request as draft July 3, 2026 12:09

Spritz Metis Sprock and others added 2 commits July 3, 2026 08:56

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(evr): bound ConfigRequest DB/cache load + per-session inbound rate limit#501

fix(evr): bound ConfigRequest DB/cache load + per-session inbound rate limit#501
thesprockee wants to merge 5 commits into
mainfrom
sec/configrequest-dos

thesprockee commented Jul 3, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

thesprockee Jul 3, 2026

Uh oh!

thesprockee Jul 3, 2026

Uh oh!

thesprockee Jul 3, 2026

Uh oh!

thesprockee Jul 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

metric	value
window	2026-06-03 → 2026-07-03 (30d)
sample	n = 10,782,318 active session-seconds / 19,895,343 msgs
mean	1.845 msg/s/session
σ	1.790 msg/s/session
max	21 msg/s/session (single peak)
mean + 3.5σ	8.11 → 9

Uh oh!

Conversation

thesprockee commented Jul 3, 2026

Summary

SEC-1 — ConfigRequest cache bypass + unbounded growth

SEC-2 — no per-session inbound message throttle

Tests

Deployment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

thesprockee Jul 3, 2026

Choose a reason for hiding this comment

Uh oh!

thesprockee Jul 3, 2026

Choose a reason for hiding this comment

Uh oh!

thesprockee Jul 3, 2026

Choose a reason for hiding this comment

Uh oh!

thesprockee Jul 3, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants