fix: increase recovery window for replication connection #1745
filipecabaco merged 2 commits into main
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the review settings.
📝 Walkthrough

Repairs replication recovery to use a timestamped window: state now records replication_recovery_started_at (cleared on success) and enforces a 2-hour max recovery window. Backoff configuration keys are renamed to backoff_min/backoff_max/backoff_type, and a schedule_replication_retry helper schedules retry attempts using backoff. On replication down, replication_conn is cleared and recovery starts; recovery attempts perform DB checks and either restart replication (resetting backoff) or schedule retries. ReplicationConnection now returns :replication_slot_in_use for an in-use slot. A GitHub Actions E2E workflow was added. Realtime tests were updated and expanded (including changed node disconnect logic and duplicated replication-recovery test blocks).

Sequence Diagram

```mermaid
sequenceDiagram
    participant Client as Replication Peer
    participant Connect as Realtime.Tenants.Connect
    participant DB as Postgres / DB Check
    participant Backoff as Backoff Manager
    participant Scheduler as Retry Scheduler
    Client->>Connect: replication connection drops
    Connect->>Connect: log recovery opened<br/>set replication_recovery_started_at
    Note over Connect: recovery window = 2 hours
    Connect->>Connect: is window elapsed?
    alt window expired
        Connect->>Client: terminate/shutdown
    else window active
        Connect->>DB: perform DB check (walsender/slot)
        alt DB indicates ready
            Connect->>Client: attempt to start replication
            alt start succeeds
                Connect->>Backoff: reset backoff
                Connect->>Connect: clear replication_recovery_started_at
            else start fails
                Backoff->>Backoff: compute next delay (backoff_min/backoff_max/backoff_type)
                Backoff->>Scheduler: schedule_replication_retry(delay)
                Scheduler->>Connect: enqueue next retry
            end
        else DB shows old walsender or error
            Backoff->>Backoff: compute next delay
            Backoff->>Scheduler: schedule_replication_retry(delay)
            Scheduler->>Connect: enqueue next retry
        end
    end
```
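The retry loop in the diagram can be pictured as a pair of small GenServer helpers. This is a minimal sketch, not the PR's actual implementation: it assumes the Erlang `:backoff` library, and the function and state-key names are taken from the walkthrough above.

```elixir
# Hedged sketch of the recovery loop described in the walkthrough.
# Assumes a GenServer-based Connect process holding a :backoff record
# in state.backoff; names mirror the walkthrough, not the actual source.
defp schedule_replication_retry(%{backoff: backoff} = state) do
  # On failure, backoff grows toward backoff_max and returns the next delay
  {delay, backoff} = :backoff.fail(backoff)
  Process.send_after(self(), :recover_replication_connection, delay)
  %{state | backoff: backoff}
end

defp replication_recovered(%{backoff: backoff} = state) do
  # On success, reset the backoff and clear the recovery window timestamp
  {_start, backoff} = :backoff.succeed(backoff)
  %{state | backoff: backoff, replication_recovery_started_at: nil}
end
```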
Force-pushed from 8dfe917 to 09de5f0
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@lib/realtime/tenants/connect.ex`:
- Around line 399-405: The code resets replication_recovery_started_at on every
replication crash, which prolongs the recovery window; update the state
assignment that currently sets replication_recovery_started_at to
System.monotonic_time(:millisecond) so it only initializes that timestamp when a
recovery is first entered (i.e., when state.replication_recovery_started_at is
nil or not set). Keep the existing updates to replication_connection_pid,
replication_connection_reference and backoff, and ensure elsewhere that
replication_recovery_started_at is cleared when recovery completes so the logic
can start a fresh window on the next recovery.
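A minimal sketch of that init-once assignment, assuming the state keys named in the comment (the other state updates it mentions are elided):

```elixir
# Only stamp the window start when first entering recovery; subsequent
# crashes within the same window keep the original timestamp.
started_at =
  state.replication_recovery_started_at || System.monotonic_time(:millisecond)

state = %{state | replication_recovery_started_at: started_at}
```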
- Around line 416-430: The recovery probe is currently matching any
pg_stat_activity row with application_name='realtime_replication_connection' and
can be blocked by other tenants; update
handle_info(:recover_replication_connection) to scope the Postgrex.query (the
`@replication_connection_query` used in the Postgrex.query call) to this
connection's replication slot (e.g., pass state.replication_slot as a parameter
or add a WHERE condition for slot_name) and change the subsequent pattern match
branches so the {:query, {:ok, %{num_rows: _}}} branch only fires when rows
returned belong to that specific slot; ensure the Postgrex.query call and its
result handling (in handle_info) reference the replication_slot from state so
only the matching slot blocks recovery.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Central YAML (base), Organization UI (inherited)
Review profile: CHILL
Plan: Pro
Run ID: 26269bce-f6da-4105-bed7-5cca16dd1a25
📒 Files selected for processing (6)
- .github/workflows/e2e_tests.yml
- lib/realtime/tenants/connect.ex
- lib/realtime/tenants/replication_connection.ex
- test/realtime/monitoring/latency_test.exs
- test/realtime/tenants/connect_test.exs
- test/realtime/tenants/replication_connection_test.exs
Force-pushed from b049202 to 5bbfd82
Actionable comments posted: 1
♻️ Duplicate comments (1)
lib/realtime/tenants/connect.ex (1)
412-445: ⚠️ Potential issue | 🔴 Critical: Scope the recovery probe to the actual replication slot.
Line 430 still treats any pg_stat_activity row with application_name='realtime_replication_connection' as "old walsender still exiting". One unrelated replication session on the same Postgres instance can keep this tenant in recovery until the 2-hour window expires and then shut the whole connect process down. This needs a slot-scoped check before backing off.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@lib/realtime/tenants/connect.ex` around lines 412-445: the recovery probe currently uses the module constant `@replication_connection_query`, which only checks application_name and can be falsely triggered by unrelated walsenders. Change the probe in handle_info(:recover_replication_connection, state) to scope the check to this tenant's replication slot (use state.replication_slot) by querying pg_replication_slots (e.g., SELECT 1 FROM pg_replication_slots WHERE slot_name = $1 AND active_pid IS NOT NULL) with Postgrex.query(db_conn_pid, ..., [state.replication_slot]) instead of `@replication_connection_query`, and keep the existing branches that call start_replication_connection, schedule_replication_retry, log_error/log_warning and update backoff/replication_recovery_started_at as before.
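A slot-scoped probe along the lines the comment suggests might look like the fragment below. It is a sketch only, intended to sit inside handle_info(:recover_replication_connection, state); the helper and field names (db_conn_pid, state.replication_slot, start_replication_connection, schedule_replication_retry) are taken from the review comment, not verified against the source.

```elixir
# Sketch: probe only this tenant's slot instead of any walsender whose
# application_name happens to match.
slot_query =
  "SELECT 1 FROM pg_replication_slots WHERE slot_name = $1 AND active_pid IS NOT NULL"

case Postgrex.query(db_conn_pid, slot_query, [state.replication_slot]) do
  # No active walsender holds this slot: safe to restart replication
  {:ok, %Postgrex.Result{num_rows: 0}} ->
    start_replication_connection(state)

  # The old walsender for *this* slot is still exiting: back off and retry
  {:ok, %Postgrex.Result{num_rows: _}} ->
    schedule_replication_retry(state)

  {:error, reason} ->
    Logger.error("Replication recovery probe failed: #{inspect(reason)}")
    schedule_replication_retry(state)
end
```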
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/realtime/tenants/connect_test.exs`:
- Around line 758-779: The test block in test/realtime/tenants/connect_test.exs
is misformatted (failing mix format) — re-run mix format and commit the
resulting changes to align the new test with surrounding describe blocks and
project style; specifically open the test named "recovery resets
replication_recovery_started_at on successful reconnection" (which calls
Connect.lookup_or_start_connection, Connect.ready?,
ReplicationConnection.whereis, assert_replication_status, and Connect.shutdown)
and fix indentation/spacing to match the file's existing formatting conventions,
then run `mix format` to ensure the file is properly formatted before
committing.
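The fix itself is mechanical; something like the following, with the file path taken from the comment:

```bash
# Re-format the offending test file, then verify the whole project passes
mix format test/realtime/tenants/connect_test.exs
mix format --check-formatted
```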
---
Duplicate comments:
In `@lib/realtime/tenants/connect.ex`:
- Around line 412-445: The recovery probe currently uses the module constant
`@replication_connection_query` which only checks application_name and can be
falsely triggered by unrelated walsenders; change the probe in
handle_info(:recover_replication_connection, state) to scope the check to this
tenant's replication slot (use state.replication_slot) by querying
pg_replication_slots (e.g., SELECT 1 FROM pg_replication_slots WHERE slot_name =
$1 AND active_pid IS NOT NULL) with Postgrex.query(db_conn_pid, ...,
[state.replication_slot]) instead of `@replication_connection_query`, and keep the
existing branches that call start_replication_connection,
schedule_replication_retry, log_error/log_warning and update
backoff/replication_recovery_started_at as before.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Central YAML (base), Organization UI (inherited)
Review profile: CHILL
Plan: Pro
Run ID: 2dfbd559-4616-4777-9e0e-448c9b233bf5
📒 Files selected for processing (6)
- .github/workflows/e2e_tests.yml
- lib/realtime/tenants/connect.ex
- lib/realtime/tenants/replication_connection.ex
- test/realtime/monitoring/latency_test.exs
- test/realtime/tenants/connect_test.exs
- test/realtime/tenants/replication_connection_test.exs
🚧 Files skipped from review as they are similar to previous changes (1)
- lib/realtime/tenants/replication_connection.ex
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/workflows/e2e_tests.yml:
- Around line 60-64: The workflow currently passes the same inputs.test filter
to both matrix jobs so both partitions run identical suites; update the dispatch
handling so each job intersects inputs.test with its matrix.partition.tests (or
collapse the matrix into a single job for manual runs) before invoking "bun run
check" / test/e2e/realtime-check.ts; specifically, compute allowedTests =
intersection(split(inputs.test), split(matrix.partition.tests)) and only pass
--test=allowedTests (skip the flag when empty) to bun run check, ensuring
matrix.partition.tests is respected per partition.
- Around line 118-126: The health-check loop that probes Realtime with curl
currently breaks on success but does not fail the step if all 30 attempts time
out; modify the loop/step that contains the curl probe so that after the
for-loop completes without a successful curl (the block that echoes "Attempt
$i/30..." and sleeps), you emit a clear failure message and exit with a non-zero
status (e.g., call exit 1) to fail the job immediately; update the same shell
block that prints "Waiting for Realtime to be healthy..." so it fails fast when
the service never becomes healthy.
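A fail-fast version of that probe loop might look like the following. The endpoint path, attempt count, and sleep interval are assumptions for illustration, not the workflow's actual values:

```bash
echo "Waiting for Realtime to be healthy..."
healthy=false
for i in $(seq 1 30); do
  # Hypothetical health endpoint; substitute the workflow's actual URL
  if curl -sf http://localhost:4000/api/ping > /dev/null; then
    healthy=true
    break
  fi
  echo "Attempt $i/30..."
  sleep 2
done

# Fail the step explicitly if the service never came up
if [ "$healthy" != true ]; then
  echo "Realtime never became healthy after 30 attempts" >&2
  exit 1
fi
```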
- Around line 109-112: Replace the hardcoded SECRET_KEY_BASE with a per-run
generated value: add a prep step (e.g., id: gen_secret) that runs something like
SECRET=$(openssl rand -hex 64) and writes it to GITHUB_OUTPUT (echo
"secret=$SECRET" >> $GITHUB_OUTPUT), then use that output in the docker run env
entry (replace SECRET_KEY_BASE=UpNV... with SECRET_KEY_BASE=${{
steps.gen_secret.outputs.secret }}). Update the docker run invocation to
reference the generated secret and ensure the gen_secret step runs before the
docker run step.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Central YAML (base), Organization UI (inherited)
Review profile: CHILL
Plan: Pro
Run ID: 7e26fc4c-726e-42b4-8cb9-662fd9b86734
📒 Files selected for processing (6)
- .github/workflows/e2e_tests.yml
- lib/realtime/tenants/connect.ex
- lib/realtime/tenants/replication_connection.ex
- test/realtime/monitoring/latency_test.exs
- test/realtime/tenants/connect_test.exs
- test/realtime/tenants/replication_connection_test.exs
```yaml
partition:
  - name: functional
    tests: "connection,broadcast,broadcast-replay,presence,authorization,postgres-changes,broadcast-changes"
  - name: load
    tests: "load-postgres-changes,load-presence,load-broadcast,load-broadcast-from-db,load-broadcast-replay"
```
workflow_dispatch currently runs the requested suites twice.
When inputs.test is set, both matrix jobs ignore matrix.partition.tests and pass the same --test filter to bun run check. Since test/e2e/realtime-check.ts applies that filter directly, dispatching connection, functional, load, etc. will execute the same selection in both jobs. Either collapse the matrix for manual runs or intersect the requested categories with each partition before invoking the runner.
Also applies to: 138-150
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.github/workflows/e2e_tests.yml around lines 60 - 64, The workflow currently
passes the same inputs.test filter to both matrix jobs so both partitions run
identical suites; update the dispatch handling so each job intersects
inputs.test with its matrix.partition.tests (or collapse the matrix into a
single job for manual runs) before invoking "bun run check" /
test/e2e/realtime-check.ts; specifically, compute allowedTests =
intersection(split(inputs.test), split(matrix.partition.tests)) and only pass
--test=allowedTests (skip the flag when empty) to bun run check, ensuring
matrix.partition.tests is respected per partition.
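One way to implement that intersection inside the matrix job's shell step; a sketch only, with the variable handling assumed and only the `bun run check` invocation taken from the comment:

```bash
# Sketch: intersect the manual dispatch filter with this partition's tests.
# inputs.test and matrix.partition.tests are comma-separated category lists.
REQUESTED="${{ inputs.test }}"
PARTITION="${{ matrix.partition.tests }}"

ALLOWED=$(comm -12 \
  <(tr ',' '\n' <<<"$REQUESTED" | sort) \
  <(tr ',' '\n' <<<"$PARTITION" | sort) | paste -sd, -)

if [ -n "$ALLOWED" ]; then
  bun run check --test="$ALLOWED"
else
  echo "No requested tests fall in this partition; skipping."
fi
```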
```sh
  -e API_JWT_SECRET="$JWT_SECRET" \
  -e METRICS_JWT_SECRET="$JWT_SECRET" \
  -e SECRET_KEY_BASE=UpNVntn3cDxHJpq99YMc1T1AQgQpc8kfYTuRgBiYa15BLrx8etQoXz3gZv1/u2oq \
  -e ERL_AFLAGS="-proto_dist inet_tcp" \
```
Generate SECRET_KEY_BASE per run instead of committing a fixed value.
This workflow now stores a reusable app secret in the repo. Even for CI-only runs, that weakens the signing boundary and keeps secret scanners noisy. Generate it inside the step and pass the generated value to docker run instead.
Suggested change

```diff
  JWT_SECRET=$(echo "$STATUS" | jq -r '.jwt_secret')
  DB_URL=$(echo "$STATUS" | jq -r '.db_url')
  DB_HOST=$(echo "$DB_URL" | sed 's|.*@\(.*\):.*|\1|')
  DB_PORT=$(echo "$DB_URL" | sed 's|.*:\([0-9]*\)/.*|\1|')
  NETWORK=$(docker network ls --filter name=supabase --format '{{.Name}}' | head -1)
+ SECRET_KEY_BASE=$(openssl rand -hex 32)
  docker run -d \
    --name supabase_realtime_local \
    --network "$NETWORK" \
    -p 4000:4000 \
    -e PORT=4000 \
    -e DB_HOST="$DB_HOST" \
    -e DB_PORT="$DB_PORT" \
    -e DB_USER=supabase_admin \
    -e DB_PASSWORD=postgres \
    -e DB_NAME=postgres \
    -e DB_ENC_KEY=1234567890123456 \
    -e DB_AFTER_CONNECT_QUERY='SET search_path TO _realtime' \
    -e API_JWT_SECRET="$JWT_SECRET" \
    -e METRICS_JWT_SECRET="$JWT_SECRET" \
-   -e SECRET_KEY_BASE=UpNVntn3cDxHJpq99YMc1T1AQgQpc8kfYTuRgBiYa15BLrx8etQoXz3gZv1/u2oq \
+   -e SECRET_KEY_BASE="$SECRET_KEY_BASE" \
    -e ERL_AFLAGS="-proto_dist inet_tcp" \
    -e DNS_NODES="''" \
    -e APP_NAME=realtime \
    -e SEED_SELF_HOST=true \
    realtime:local
```

🧰 Tools
🪛 Gitleaks (8.30.0)
[high] 111-111: Detected a Generic API Key, potentially exposing access to various services and sensitive operations.
(generic-api-key)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.github/workflows/e2e_tests.yml around lines 109 - 112, Replace the
hardcoded SECRET_KEY_BASE with a per-run generated value: add a prep step (e.g.,
id: gen_secret) that runs something like SECRET=$(openssl rand -hex 64) and
writes it to GITHUB_OUTPUT (echo "secret=$SECRET" >> $GITHUB_OUTPUT), then use
that output in the docker run env entry (replace SECRET_KEY_BASE=UpNV... with
SECRET_KEY_BASE=${{ steps.gen_secret.outputs.secret }}). Update the docker run
invocation to reference the generated secret and ensure the gen_secret step runs
before the docker run step.
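As a YAML sketch of that prep step (the step id and output name are the comment's suggestions, not existing workflow keys):

```yaml
# Generate a throwaway SECRET_KEY_BASE once per run
- name: Generate secret key base
  id: gen_secret
  run: echo "secret=$(openssl rand -hex 64)" >> "$GITHUB_OUTPUT"

# Later, in the docker run step, reference the generated value:
#   -e SECRET_KEY_BASE="${{ steps.gen_secret.outputs.secret }}" \
```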
Force-pushed from d05e76d to bf65361
Force-pushed from bf65361 to aabd45c
🎉 This PR is included in version 2.78.12 🎉 The release is available on GitHub releases. Your semantic-release bot 📦🚀
```elixir
  ) do
    Logger.warning("Max replication connection attempts reached, terminating connection")
    {:stop, :shutdown, state}

@max_replication_recovery_ms :timer.hours(2)
```
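The 2-hour constant gates recovery roughly like this; a sketch, with the elapsed-time check and shutdown branch assumed from the walkthrough rather than copied from the diff:

```elixir
# Sketch: give up once the recovery window has elapsed
elapsed =
  System.monotonic_time(:millisecond) - state.replication_recovery_started_at

if elapsed > @max_replication_recovery_ms do
  Logger.warning("Max replication connection attempts reached, terminating connection")
  {:stop, :shutdown, state}
else
  {:noreply, schedule_replication_retry(state)}
end
```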
Why not 24h, if there's exponential backoff?
What kind of change does this PR introduce?
Increase the recovery window for the replication connection.