fix: increase recovery window for replication connection #1745

Merged

filipecabaco merged 2 commits into main from fix/wal-replication-backoff
Mar 9, 2026
Conversation

@filipecabaco
Member

What kind of change does this PR introduce?

increase recovery window for replication connection

@coderabbitai

coderabbitai Bot commented Mar 9, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Replication recovery now uses a timestamped window: state records replication_recovery_started_at (cleared on success) and enforces a 2‑hour maximum recovery window. Backoff configuration keys were renamed to backoff_min/backoff_max/backoff_type, and a schedule_replication_retry helper schedules retry attempts using backoff. When replication goes down, replication_conn is cleared and recovery starts; recovery attempts perform DB checks and either restart replication (resetting backoff) or schedule retries. ReplicationConnection now returns :replication_slot_in_use for an in-use slot. A GitHub Actions E2E workflow was added. Realtime tests were updated and expanded (including changed node disconnect logic and duplicated replication-recovery test blocks).

Sequence Diagram

sequenceDiagram
    participant Client as Replication Peer
    participant Connect as Realtime.Tenants.Connect
    participant DB as Postgres / DB Check
    participant Backoff as Backoff Manager
    participant Scheduler as Retry Scheduler

    Client->>Connect: replication connection drops
    Connect->>Connect: log recovery opened\nset replication_recovery_started_at
    Note over Connect: recovery window = 2 hours

    Connect->>Connect: is window elapsed?
    alt window expired
        Connect->>Client: terminate/shutdown
    else window active
        Connect->>DB: perform DB check (walsender/slot)
        alt DB indicates ready
            Connect->>Client: attempt to start replication
            alt start succeeds
                Connect->>Backoff: reset backoff
                Connect->>Connect: clear replication_recovery_started_at
            else start fails
                Backoff->>Backoff: compute next delay (backoff_min/backoff_max/backoff_type)
                Backoff->>Scheduler: schedule_replication_retry(delay)
                Scheduler->>Connect: enqueue next retry
            end
        else DB shows old walsender or error
            Backoff->>Backoff: compute next delay
            Backoff->>Scheduler: schedule_replication_retry(delay)
            Scheduler->>Connect: enqueue next retry
        end
    end
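The window-and-backoff flow in the diagram can be sketched in plain shell arithmetic. This is an illustrative model only: the real implementation is Elixir, the key names backoff_min/backoff_max mirror the renamed config keys and the 2-hour cap comes from the PR, but the concrete delay values are assumptions.

```shell
backoff_min=100        # ms, first retry delay (assumed value)
backoff_max=60000      # ms, cap on the exponential delay (assumed value)
max_window_ms=7200000  # 2 hours, the maximum recovery window from the PR

# next_delay ATTEMPT -> exponential delay in ms, clamped to backoff_max
next_delay() {
  d=$(( backoff_min * (1 << $1) ))
  [ "$d" -gt "$backoff_max" ] && d=$backoff_max
  echo "$d"
}

# window_expired ELAPSED_MS -> "yes" once the 2-hour window has passed
window_expired() {
  [ "$1" -ge "$max_window_ms" ] && echo yes || echo no
}

next_delay 0            # first retry fires after backoff_min
next_delay 16           # deep into recovery, clamped to backoff_max
window_expired 7200000  # at 2 hours the connect process shuts down
```

Once window_expired answers yes, the real code stops scheduling retries and terminates the connect process, matching the "terminate/shutdown" branch in the diagram.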



Comment @coderabbitai help to get the list of available commands and usage tips.

@filipecabaco filipecabaco force-pushed the fix/wal-replication-backoff branch from 8dfe917 to 09de5f0 Compare March 9, 2026 17:50

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@lib/realtime/tenants/connect.ex`:
- Around line 399-405: The code resets replication_recovery_started_at on every
replication crash which prolongs the recovery window; update the state
assignment that currently sets replication_recovery_started_at to
System.monotonic_time(:millisecond) so it only initializes that timestamp when a
recovery is first entered (i.e., when state.replication_recovery_started_at is
nil or not set). Keep the existing updates to replication_connection_pid,
replication_connection_reference and backoff, and ensure elsewhere that
replication_recovery_started_at is cleared when recovery completes so the logic
can start a fresh window on the next recovery.
- Around line 416-430: The recovery probe is currently matching any
pg_stat_activity row with application_name='realtime_replication_connection' and
can be blocked by other tenants; update
handle_info(:recover_replication_connection) to scope the Postgrex.query (the
`@replication_connection_query` used in the Postgrex.query call) to this
connection's replication slot (e.g., pass state.replication_slot as a parameter
or add a WHERE condition for slot_name) and change the subsequent pattern match
branches so the {:query, {:ok, %{num_rows: _}}} branch only fires when rows
returned belong to that specific slot; ensure the Postgrex.query call and its
result handling (in handle_info) reference the replication_slot from state so
only the matching slot blocks recovery.
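The first fix above — stamping replication_recovery_started_at only when recovery is first entered — reduces to a set-once update. A hypothetical shell sketch (the function and argument names are invented for illustration; the real code sets the field on the GenServer state):

```shell
# mark_recovery_started CURRENT NOW -> window start timestamp (ms)
# Keeps the existing timestamp if one is already set, so repeated crashes
# inside one recovery do not extend the 2-hour window.
mark_recovery_started() {
  if [ -z "$1" ]; then
    echo "$2"   # first crash: the recovery window opens now
  else
    echo "$1"   # subsequent crash: preserve the original window start
  fi
}

mark_recovery_started "" 1000     # window opens at t=1000
mark_recovery_started 1000 5000   # later crash keeps t=1000
```

Clearing the timestamp on successful reconnection (as the comment requires) is what lets the next outage open a fresh window.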

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 26269bce-f6da-4105-bed7-5cca16dd1a25

📥 Commits

Reviewing files that changed from the base of the PR and between 4431aa3 and 8dfe917.

📒 Files selected for processing (6)
  • .github/workflows/e2e_tests.yml
  • lib/realtime/tenants/connect.ex
  • lib/realtime/tenants/replication_connection.ex
  • test/realtime/monitoring/latency_test.exs
  • test/realtime/tenants/connect_test.exs
  • test/realtime/tenants/replication_connection_test.exs

Comment thread lib/realtime/tenants/connect.ex
Comment thread lib/realtime/tenants/connect.ex
@filipecabaco filipecabaco force-pushed the fix/wal-replication-backoff branch 2 times, most recently from b049202 to 5bbfd82 Compare March 9, 2026 18:06

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
lib/realtime/tenants/connect.ex (1)

412-445: ⚠️ Potential issue | 🔴 Critical

Scope the recovery probe to the actual replication slot.

Line 430 still treats any pg_stat_activity row with application_name='realtime_replication_connection' as “old walsender still exiting”. One unrelated replication session on the same Postgres instance can keep this tenant in recovery until the 2-hour window expires and then shut the whole connect process down. This needs a slot-scoped check before backing off.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@lib/realtime/tenants/connect.ex` around lines 412 - 445, The recovery probe
currently uses the module constant `@replication_connection_query` which only
checks application_name and can be falsely triggered by unrelated walsenders;
change the probe in handle_info(:recover_replication_connection, state) to scope
the check to this tenant's replication slot (use state.replication_slot) by
querying pg_replication_slots (e.g., SELECT 1 FROM pg_replication_slots WHERE
slot_name = $1 AND active_pid IS NOT NULL) with Postgrex.query(db_conn_pid, ...,
[state.replication_slot]) instead of `@replication_connection_query`, and keep the
existing branches that call start_replication_connection,
schedule_replication_retry, log_error/log_warning and update
backoff/replication_recovery_started_at as before.
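The slot-scoped probe the reviewer suggests can be illustrated with a hedged sketch: the SQL comes from the comment above; the decision helper and its names are hypothetical stand-ins for the pattern-match branches in handle_info.

```shell
# SQL from the review comment: a returned row means this tenant's own slot
# is still held by a walsender (active_pid is set), not just any connection
# with application_name='realtime_replication_connection'.
SLOT_QUERY='SELECT 1 FROM pg_replication_slots WHERE slot_name = $1 AND active_pid IS NOT NULL'

# probe_result NUM_ROWS -> next recovery action (names are illustrative)
probe_result() {
  if [ "$1" -gt 0 ]; then
    echo "schedule_retry"      # old walsender still holds the slot; back off
  else
    echo "start_replication"   # slot is free; attempt to restart replication
  fi
}

probe_result 1   # slot held: keep waiting
probe_result 0   # slot free: reconnect
```

Because the query is parameterized by slot_name, an unrelated replication session on the same Postgres instance can no longer pin this tenant in recovery until the 2-hour window expires.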
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/realtime/tenants/connect_test.exs`:
- Around line 758-779: The test block in test/realtime/tenants/connect_test.exs
is misformatted (failing mix format) — re-run mix format and commit the
resulting changes to align the new test with surrounding describe blocks and
project style; specifically open the test named "recovery resets
replication_recovery_started_at on successful reconnection" (which calls
Connect.lookup_or_start_connection, Connect.ready?,
ReplicationConnection.whereis, assert_replication_status, and Connect.shutdown)
and fix indentation/spacing to match the file's existing formatting conventions,
then run `mix format` to ensure the file is properly formatted before
committing.

---

Duplicate comments:
In `@lib/realtime/tenants/connect.ex`:
- Around line 412-445: The recovery probe currently uses the module constant
`@replication_connection_query` which only checks application_name and can be
falsely triggered by unrelated walsenders; change the probe in
handle_info(:recover_replication_connection, state) to scope the check to this
tenant's replication slot (use state.replication_slot) by querying
pg_replication_slots (e.g., SELECT 1 FROM pg_replication_slots WHERE slot_name =
$1 AND active_pid IS NOT NULL) with Postgrex.query(db_conn_pid, ...,
[state.replication_slot]) instead of `@replication_connection_query`, and keep the
existing branches that call start_replication_connection,
schedule_replication_retry, log_error/log_warning and update
backoff/replication_recovery_started_at as before.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 2dfbd559-4616-4777-9e0e-448c9b233bf5

📥 Commits

Reviewing files that changed from the base of the PR and between 09de5f0 and b049202.

📒 Files selected for processing (6)
  • .github/workflows/e2e_tests.yml
  • lib/realtime/tenants/connect.ex
  • lib/realtime/tenants/replication_connection.ex
  • test/realtime/monitoring/latency_test.exs
  • test/realtime/tenants/connect_test.exs
  • test/realtime/tenants/replication_connection_test.exs
🚧 Files skipped from review as they are similar to previous changes (1)
  • lib/realtime/tenants/replication_connection.ex

Comment thread test/realtime/tenants/connect_test.exs Outdated

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/e2e_tests.yml:
- Around line 60-64: The workflow currently passes the same inputs.test filter
to both matrix jobs so both partitions run identical suites; update the dispatch
handling so each job intersects inputs.test with its matrix.partition.tests (or
collapse the matrix into a single job for manual runs) before invoking "bun run
check" / test/e2e/realtime-check.ts; specifically, compute allowedTests =
intersection(split(inputs.test), split(matrix.partition.tests)) and only pass
--test=allowedTests (skip the flag when empty) to bun run check, ensuring
matrix.partition.tests is respected per partition.
- Around line 118-126: The health-check loop that probes Realtime with curl
currently breaks on success but does not fail the step if all 30 attempts time
out; modify the loop/step that contains the curl probe so that after the
for-loop completes without a successful curl (the block that echoes "Attempt
$i/30..." and sleeps), you emit a clear failure message and exit with a non-zero
status (e.g., call exit 1) to fail the job immediately; update the same shell
block that prints "Waiting for Realtime to be healthy..." so it fails fast when
the service never becomes healthy.
- Around line 109-112: Replace the hardcoded SECRET_KEY_BASE with a per-run
generated value: add a prep step (e.g., id: gen_secret) that runs something like
SECRET=$(openssl rand -hex 64) and writes it to GITHUB_OUTPUT (echo
"secret=$SECRET" >> $GITHUB_OUTPUT), then use that output in the docker run env
entry (replace SECRET_KEY_BASE=UpNV... with SECRET_KEY_BASE=${{
steps.gen_secret.outputs.secret }}). Update the docker run invocation to
reference the generated secret and ensure the gen_secret step runs before the
docker run step.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 7e26fc4c-726e-42b4-8cb9-662fd9b86734

📥 Commits

Reviewing files that changed from the base of the PR and between b049202 and 5bbfd82.

📒 Files selected for processing (6)
  • .github/workflows/e2e_tests.yml
  • lib/realtime/tenants/connect.ex
  • lib/realtime/tenants/replication_connection.ex
  • test/realtime/monitoring/latency_test.exs
  • test/realtime/tenants/connect_test.exs
  • test/realtime/tenants/replication_connection_test.exs

Comment thread .github/workflows/e2e_tests.yml Outdated
Comment on lines +60 to +64
    partition:
      - name: functional
        tests: "connection,broadcast,broadcast-replay,presence,authorization,postgres-changes,broadcast-changes"
      - name: load
        tests: "load-postgres-changes,load-presence,load-broadcast,load-broadcast-from-db,load-broadcast-replay"


⚠️ Potential issue | 🟡 Minor

workflow_dispatch currently runs the requested suites twice.

When inputs.test is set, both matrix jobs ignore matrix.partition.tests and pass the same --test filter to bun run check. Since test/e2e/realtime-check.ts applies that filter directly, dispatching connection, functional, load, etc. will execute the same selection in both jobs. Either collapse the matrix for manual runs or intersect the requested categories with each partition before invoking the runner.

Also applies to: 138-150

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/e2e_tests.yml around lines 60 - 64, The workflow currently
passes the same inputs.test filter to both matrix jobs so both partitions run
identical suites; update the dispatch handling so each job intersects
inputs.test with its matrix.partition.tests (or collapse the matrix into a
single job for manual runs) before invoking "bun run check" /
test/e2e/realtime-check.ts; specifically, compute allowedTests =
intersection(split(inputs.test), split(matrix.partition.tests)) and only pass
--test=allowedTests (skip the flag when empty) to bun run check, ensuring
matrix.partition.tests is respected per partition.
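The intersection step described above can be sketched in shell; intersect and its arguments are invented names for illustration, standing in for inputs.test and matrix.partition.tests.

```shell
# intersect REQUESTED ALLOWED -> comma-separated categories present in both
intersect() {
  requested=$1; allowed=$2; out=""
  for t in $(echo "$requested" | tr ',' ' '); do
    case ",$allowed," in
      *",$t,"*) out="${out:+$out,}$t" ;;   # keep only partition-owned tests
    esac
  done
  echo "$out"
}

# Only categories belonging to this partition survive the filter:
intersect "connection,load-presence" "connection,broadcast,presence"   # -> connection
```

bun run check would then receive --test with the intersection for its partition, or no --test flag at all when the intersection is empty, so neither matrix job runs the other partition's suites.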

Comment thread .github/workflows/e2e_tests.yml Outdated
Comment on lines +109 to +112
    -e API_JWT_SECRET="$JWT_SECRET" \
    -e METRICS_JWT_SECRET="$JWT_SECRET" \
    -e SECRET_KEY_BASE=UpNVntn3cDxHJpq99YMc1T1AQgQpc8kfYTuRgBiYa15BLrx8etQoXz3gZv1/u2oq \
    -e ERL_AFLAGS="-proto_dist inet_tcp" \


⚠️ Potential issue | 🟠 Major

Generate SECRET_KEY_BASE per run instead of committing a fixed value.

This workflow now stores a reusable app secret in the repo. Even for CI-only runs, that weakens the signing boundary and keeps secret scanners noisy. Generate it inside the step and pass the generated value to docker run instead.

Suggested change
           JWT_SECRET=$(echo "$STATUS" | jq -r '.jwt_secret')
           DB_URL=$(echo "$STATUS" | jq -r '.db_url')
           DB_HOST=$(echo "$DB_URL" | sed 's|.*@\(.*\):.*|\1|')
           DB_PORT=$(echo "$DB_URL" | sed 's|.*:\([0-9]*\)/.*|\1|')
           NETWORK=$(docker network ls --filter name=supabase --format '{{.Name}}' | head -1)
+          SECRET_KEY_BASE=$(openssl rand -hex 32)

           docker run -d \
             --name supabase_realtime_local \
             --network "$NETWORK" \
             -p 4000:4000 \
             -e PORT=4000 \
             -e DB_HOST="$DB_HOST" \
             -e DB_PORT="$DB_PORT" \
             -e DB_USER=supabase_admin \
             -e DB_PASSWORD=postgres \
             -e DB_NAME=postgres \
             -e DB_ENC_KEY=1234567890123456 \
             -e DB_AFTER_CONNECT_QUERY='SET search_path TO _realtime' \
             -e API_JWT_SECRET="$JWT_SECRET" \
             -e METRICS_JWT_SECRET="$JWT_SECRET" \
-            -e SECRET_KEY_BASE=UpNVntn3cDxHJpq99YMc1T1AQgQpc8kfYTuRgBiYa15BLrx8etQoXz3gZv1/u2oq \
+            -e SECRET_KEY_BASE="$SECRET_KEY_BASE" \
             -e ERL_AFLAGS="-proto_dist inet_tcp" \
             -e DNS_NODES="''" \
             -e APP_NAME=realtime \
             -e SEED_SELF_HOST=true \
             realtime:local
🧰 Tools
🪛 Gitleaks (8.30.0)

[high] 111-111: Detected a Generic API Key, potentially exposing access to various services and sensitive operations.

(generic-api-key)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/e2e_tests.yml around lines 109 - 112, Replace the
hardcoded SECRET_KEY_BASE with a per-run generated value: add a prep step (e.g.,
id: gen_secret) that runs something like SECRET=$(openssl rand -hex 64) and
writes it to GITHUB_OUTPUT (echo "secret=$SECRET" >> $GITHUB_OUTPUT), then use
that output in the docker run env entry (replace SECRET_KEY_BASE=UpNV... with
SECRET_KEY_BASE=${{ steps.gen_secret.outputs.secret }}). Update the docker run
invocation to reference the generated secret and ensure the gen_secret step runs
before the docker run step.
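The prep step from the prompt, sketched as a standalone script. GITHUB_OUTPUT is provided by the Actions runner; the /dev/null fallback is an assumption added so the sketch also runs outside CI.

```shell
# Generate a per-run secret instead of committing a fixed SECRET_KEY_BASE.
# openssl rand -hex 64 emits 64 random bytes as 128 hex characters.
SECRET=$(openssl rand -hex 64)
echo "secret=$SECRET" >> "${GITHUB_OUTPUT:-/dev/null}"
echo "generated ${#SECRET} hex characters"
```

The docker run step then references it as SECRET_KEY_BASE=${{ steps.gen_secret.outputs.secret }}, so no reusable secret ever lands in the repository and secret scanners stay quiet.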

Comment thread .github/workflows/e2e_tests.yml Outdated
@filipecabaco filipecabaco force-pushed the fix/wal-replication-backoff branch 2 times, most recently from d05e76d to bf65361 Compare March 9, 2026 19:14
@filipecabaco filipecabaco force-pushed the fix/wal-replication-backoff branch from bf65361 to aabd45c Compare March 9, 2026 19:24
@filipecabaco filipecabaco merged commit 82f5571 into main Mar 9, 2026
9 checks passed
@filipecabaco filipecabaco deleted the fix/wal-replication-backoff branch March 9, 2026 19:47
@kiwicopple
Member

🎉 This PR is included in version 2.78.12 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀

    ) do
      Logger.warning("Max replication connection attempts reached, terminating connection")
      {:stop, :shutdown, state}

    @max_replication_recovery_ms :timer.hours(2)


Why not 24h, if there's exponential backoff?
