Skip to content

fix(evr): bound ConfigRequest DB/cache load + per-session inbound rate limit#501

Draft
thesprockee wants to merge 5 commits into
mainfrom
sec/configrequest-dos
Draft

fix(evr): bound ConfigRequest DB/cache load + per-session inbound rate limit#501
thesprockee wants to merge 5 commits into
mainfrom
sec/configrequest-dos

Conversation

@thesprockee

Copy link
Copy Markdown
Member

Summary

Hardens the EVR real-time message path against per-session resource exhaustion. Two related fixes.

SEC-1 — ConfigRequest cache bypass + unbounded growth

The config cache was keyed on the client-controlled Type + ":" + ID, but the storage read keys only on Type. So every request with a fresh ID missed the cache and forced a StorageReadObjects call, and — when a stored Config:<Type> object exists — the cache grew one entry per distinct ID without bound. ConfigRequest is reachable without user auth.

Fix: key the cache on Type only (matching the storage read) and add a bounded LRU (256 entries, TTL). Distinct-ID requests for the same Type now collapse to one read and one entry. Negative results are cached too. The four built-in default type:id pairs and real stored config are preserved.

SEC-2 — no per-session inbound message throttle

The Consume loop dispatched every parsed message with no rate limit, so a single established socket could submit messages at line rate and force-multiply any cheap-request/expensive-handler pair.

Fix: a per-session token-bucket limiter checked before dispatch. Default 100 msg/s sustained, burst 200 (socket.inbound_message_rate_limit_per_sec / _burst; 0 disables). Anchored on the server's own ~1.3 msg/s "regularly active" signal, so ~75x headroom for legitimate play. Drops are audited (counter + throttled warn with client IP, message type, running total) — never silent.

Tests

Red→green for both: distinct-ID amplification (N reads → 1), unbounded growth (N entries → 1), cache size cap, and limiter throttle/disabled. gofmt, go vet, and the SEC-1/SEC-2 tests pass under -race.

Deployment

None in this PR.

🤖 Generated with Claude Code

Spritz Metis Sprock and others added 2 commits July 1, 2026 20:42
configRequest cached on the client-controlled Type+":"+ID while the
storage read keys on Type only, so a remote caller holding the public
ServerKey could vary ID every packet to guarantee a cache miss and force
one DB read per message; when a stored Config:<Type> object existed the
same trick grew the sync.Map without bound.

Key the lookup on Type only (matching the storage read), cache the
negative result so distinct-ID floods collapse to a single read, and
replace the unbounded sync.Map with a bounded (256-entry) TTL'd LRU.
Behavior for the four valid Type:ID default pairs and real stored config
is preserved. Reachability and amplification are covered by tests that
drive the real handler with a stubbed storage read (no live Postgres).

Refs BUGS.md SEC-1.

Co-Authored-By: Andrew Bates <a@sprock.io>
Signed-off-by: Spritz Metis Sprock <spritz@sprock.io>
The Consume loop dispatched every parsed EVR message with no throttle,
so one established socket could submit messages as fast as the TCP
window allows. Connection-level rate limits do not throttle messages
inside an already-established socket, so a per-message flood on one
allowed socket force-multiplies SEC-1 into DB exhaustion.

Add a per-session token-bucket limiter, checked before dispatch. Default
100 msg/s sustained with a 200 burst (config: socket.inbound_message_
rate_limit_per_sec / _burst; 0 disables). The server treats ~1.3 msg/s
as "regularly active" (PingBackoffThreshold/PingPeriodMs), so the
default is ~75x a normally active session and generous for connect-time
bursts, while capping a single socket far below the thousands/sec an
unthrottled flood can push. Drops are audited, never silent: a counter
per drop and a throttled warn log naming the client IP, dropped message
type, and running total.

Refs BUGS.md SEC-2.

Co-Authored-By: Andrew Bates <a@sprock.io>
Signed-off-by: Spritz Metis Sprock <spritz@sprock.io>
Copilot AI review requested due to automatic review settings July 3, 2026 11:29

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR hardens the EchoVR (EVR) real-time WebSocket message path against per-session resource exhaustion by (1) fixing ConfigRequest cache keying/unbounded growth and (2) adding a per-session inbound message rate limiter to prevent message-flood amplification on expensive handlers.

Changes:

  • Replace the ConfigRequest cache with a bounded, TTL’d LRU keyed by config Type (matching the storage read key) and cache negative lookups to avoid repeated DB hits (SEC-1).
  • Add a per-session token-bucket inbound rate limiter enforced in the EVR IncomingLoop, plus new socket config knobs and audit logging/metrics (SEC-2).
  • Add regression tests covering both SEC-1 (cache behavior/cap/reachability) and SEC-2 (limiter throttling + disabled path).

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
server/session_ws.go Enforces the per-session inbound limiter before dispatching parsed EVR messages.
server/evr_pipeline_config.go Implements the bounded TTL+LRU cache keyed on config Type and refactors stored-config loading.
server/evr_pipeline_config_sec_test.go Adds SEC-1 regression tests for cache amplification/growth/cap and behavior preservation.
server/evr_inbound_ratelimit.go Introduces the per-session token-bucket limiter with drop accounting and log throttling.
server/evr_inbound_ratelimit_test.go Adds SEC-2 unit tests for throttling behavior and the disabled (nil) limiter path.
server/config.go Adds socket config fields and defaults for inbound per-session message rate limiting.

Comment thread server/evr_pipeline_config.go Outdated
Comment thread server/evr_pipeline_config.go Outdated
Comment thread server/evr_pipeline_config.go Outdated
Comment thread server/evr_inbound_ratelimit_test.go Outdated
Comment on lines +38 to +40
if allowed > burst+perSec {
t.Fatalf("SEC-2: limiter allowed %d, well above burst %d (cap ineffective)", allowed, burst)
}
…ministic test)

Reword the three SEC-1 code comments in evr_pipeline_config.go to be
self-contained for public readers: drop the "BUGS.md" pointer (that file is
gitignored and absent from the repo, so the reference resolved to nothing)
while keeping the SEC-1 label and the inline invariant description.

Make TestSEC2_InboundRateLimiter_ThrottlesFlood deterministic. rate.Limiter
.Allow() reads time.Now() per call, so tokens replenish at perSec as the loop
runs; the old upper bound (allowed > burst+perSec) implicitly assumed the loop
finished within ~1s of refill and could fail on a slow or -race machine. Assert
against the token-bucket guarantee computed from the measured loop duration
(burst + perSec*elapsed) instead. Verified stable: 20x runs under -race pass.

Co-Authored-By: Andrew Bates <a@sprock.io>
Signed-off-by: Spritz Metis Sprock <spritz@sprock.io>
Comment thread server/config.go Outdated
// with a 200 burst is ~75x a normally active session — generous for
// connect-time bursts (config/document/login) yet caps a single socket
// far below the thousands/sec an unthrottled flood can push.
InboundMessageRateLimitPerSec: 100,

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

analyze the production logs to find out what the maximum ingres message rate is that we've seen in the past month. figure out what 3.5σ looks like, and set the value to that.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — value is now data-derived, not guessed: InboundMessageRateLimitPerSec 100 → 9.

Source: one month of production nakama debug logs on fortytwo. Every inbound EVR message logs "msg":"Received message" with its sid at debug level (the global logger runs at debug in prod), and the limiter checks allow() once per that same parsed message in IncomingLoop — so the log is a 1:1 record of what the limiter counts, not a proxy.

Extraction (streamed, read-only, per-session per-wall-clock-second message count → distribution of msgs/sec/session):

# 29 log files (17 .zst + 12 .log + live), ts-filtered to the trailing month
awk to bucket key=(sid, ts-truncated-to-second); count++ ; flush per second;
accumulate n, Σx, Σx², max over all (sid,second) buckets with ≥1 message

Result:

metric value
window 2026-06-03 → 2026-07-03 (30d)
sample n = 10,782,318 active session-seconds / 19,895,343 msgs
mean 1.845 msg/s/session
σ 1.790 msg/s/session
max 21 msg/s/session (single peak)
mean + 3.5σ 8.11 → 9

Chosen: 9 (ceil of 8.11 — round up so the cap never sits below the 3.5σ legit boundary). The Burst stays 200, ~10x the observed 21 msg/s peak, so a legitimate connect-time burst is never clipped while 9 msg/s sustained still caps a flood. Distribution is right-skewed (exponential-ish), so 3.5σ lands at 9 even though the tail reaches 21; the token-bucket burst absorbs those sub-sustained spikes. TestSEC2_DefaultInboundRateLimit_IsDataDerived pins the value.

Limitation (honest): counts EVR-format inbound only (the dominant path for game clients); standard rtapi messages use a different log site and are negligible here. Bucketing is by wall-clock second, so a burst split across a second boundary is counted as two buckets — conservative (never inflates the peak).

0a3233494

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably should just move rate limiting to the session_ws.go

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 88ecbe614.

Checked current placement first: the SEC-2 enforcement already lived in session_ws.go (IncomingLoop calls s.inboundLimiter.allow() / logDue() / droppedTotal() per inbound message, and NewSessionWS constructs newInboundRateLimiter(...)). The only piece sitting elsewhere was the inboundRateLimiter type + its methods, which lived in a standalone server/evr_inbound_ratelimit.go. Nothing rate-limiting-related lived in evr_pipeline_config.go (that file is the ConfigRequest cache / SEC-1).

Consolidated per your intent:

  • moved the inboundRateLimiter type and newInboundRateLimiter / allow / droppedTotal / logDue into session_ws.go, right next to the loop that enforces them (added the golang.org/x/time/rate import there);
  • deleted server/evr_inbound_ratelimit.go;
  • renamed the test evr_inbound_ratelimit_test.gosession_ws_ratelimit_test.go to match.

No behavior change — go test -race ./server/ -run TestSEC2_InboundRateLimiter still green (throttle + disabled cases).

@thesprockee thesprockee marked this pull request as draft July 3, 2026 12:09
Spritz Metis Sprock and others added 2 commits July 3, 2026 08:56
Andrew (PR #501 review): "probably should just move rate limiting to the
session_ws.go". The SEC-2 per-session token-bucket enforcement already lives
in session_ws.go IncomingLoop; only the inboundRateLimiter type and its methods
sat in a separate evr_inbound_ratelimit.go. Move them next to the loop that
enforces them and rename the test file to match. No behavior change.

Co-Authored-By: Andrew Bates <a@sprock.io>
Signed-off-by: Spritz Metis Sprock <spritz@sprock.io>
Andrew (PR #501 review): "analyze the production logs to find out what the
maximum ingress message rate is that we've seen in the past month. figure out
what 3.5σ looks like, and set the value to that."

Measured per-session inbound EVR message rate from one month of production
debug logs on fortytwo (each inbound message logs "Received message" with its
sid — the same messages the limiter counts):

  window : 2026-06-03 → 2026-07-03 (30 days)
  sample : n = 10,782,318 active session-seconds, 19,895,343 messages
  mean   : 1.845 msg/s/session
  σ      : 1.790 msg/s/session
  max    : 21 msg/s/session (single peak; absorbed by the 200 burst)
  3.5σ   : mean + 3.5σ = 8.11  → rounded up to 9

The previous default of 100 was a guess. 9 msg/s sustained is the 3.5σ point of
real traffic; the 200 burst (~10x the observed 21 msg/s peak) still passes any
legitimate connect-time burst. Test pins the value so it cannot silently drift.

Co-Authored-By: Andrew Bates <a@sprock.io>
Signed-off-by: Spritz Metis Sprock <spritz@sprock.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants