WPB-19318: Ensure high-availability of the postgres cluster #807

sghosh23 · 2025-08-28T07:50:30Z

Please check out the system design doc for more info: https://wearezeta.atlassian.net/wiki/spaces/CUSSOPS/pages/2088108112/PostgreSQL+High+Availability+System+Design

Change type

Fix
Feature
Documentation
Security / Upgrade

Basic information

THIS CHANGE REQUIRES A DEPLOYMENT PACKAGE RELEASE
THIS CHANGE REQUIRES A WIRE-DOCS RELEASE

Testing

I ran/applied the changes myself, in a test environment.
The CI job attached to this repo will test it for me.

Tracking

I added a new entry in an appropriate subdirectory of changelog.d
I mentioned this PR in Jira, OR I mentioned the Jira ticket in this PR.
I mentioned this PR in one of the issues attached to one of our repositories.

Knowledge Transfer

An Asciinema session is attached to the Jira ticket.

Motivation

Objective

Reason

Use case

offline/postgresql-cluster.md

nix/pkgs/wire-binaries.nix

ansible/templates/postgresql/simple_fence.sh.j2

ansible/templates/postgresql/postgresql_primary.conf.j2

ansible/templates/postgresql/detect_rouge_primary.sh.j2

ansible/postgresql-deploy.yml

…sive docs
- Consolidate PostgreSQL configuration into single unified template
- Fix split-brain detection script (correct 'rouge' to 'rogue' typo)
- Add detailed HA features documentation with failover validation
- Include monitoring & event system documentation
- Add node_id and priority configuration parameters
- Add official repmgr and PostgreSQL documentation references
- Improve deployment commands and monitoring checks
- Enhance split-brain protection with advanced features

- Remove duplicate HA features list from Key Concepts section
- Remove duplicate monitoring system section from Configuration Options
- Fix incorrect numbering in monitoring commands (5 → 8)
- Consolidate monitoring information into single comprehensive section


mohitrajain · 2025-09-25T09:34:17Z

dumping status of services and logs

sudo systemctl status postgresql@17-main repmgrd@17-main detect-rogue-primary.timer -l --no-pager
● postgresql@17-main.service - PostgreSQL Cluster 17-main
     Loaded: loaded (/lib/systemd/system/postgresql@.service; enabled-runtime; vendor preset: enabled)
     Active: active (running) since Wed 2025-09-24 12:45:01 UTC; 20h ago
   Main PID: 18744 (postgres)
      Tasks: 7 (limit: 4532)
     Memory: 107.8M
        CPU: 9min 54.399s
     CGroup: /system.slice/system-postgresql.slice/postgresql@17-main.service
             ├─18744 /usr/lib/postgresql/17/bin/postgres -D /var/lib/postgresql/17/main -c unix_socket_directories=/var/run/postgresql -c config_file=/etc/postgresql/17/main/postgresql.conf
             ├─18745 "postgres: logger "
             ├─18746 "postgres: checkpointer "
             ├─18747 "postgres: background writer "
             ├─18748 "postgres: startup recovering 00000001000000000000000B"
             ├─18749 "postgres: walreceiver streaming 0/B4E1D80"
             └─18835 "postgres: repmgr repmgr 10.1.1.6(48288) idle"

Sep 24 12:44:58 cassandra-warm-mackerel systemd[1]: Starting PostgreSQL Cluster 17-main...
Sep 24 12:45:01 cassandra-warm-mackerel systemd[1]: Started PostgreSQL Cluster 17-main.

● repmgrd@17-main.service - Repmgr failover daemon (instance 17-main)
     Loaded: loaded (/etc/systemd/system/repmgrd@.service; enabled; vendor preset: enabled)
     Active: active (running) since Wed 2025-09-24 12:45:14 UTC; 20h ago
   Main PID: 18837 (repmgrd)
      Tasks: 1 (limit: 4532)
     Memory: 2.0M
        CPU: 19min 44.459s
     CGroup: /system.slice/system-repmgrd.slice/repmgrd@17-main.service
             └─18837 /usr/lib/postgresql/17/bin/repmgrd -f /etc/repmgr/17-main/repmgr.conf --daemonize

Sep 25 08:45:11 cassandra-warm-mackerel repmgrd[18837]: node "postgresql3" (ID: 3) monitoring upstream node "postgresql1" (ID: 1) in normal state
Sep 25 08:50:13 cassandra-warm-mackerel repmgrd[18837]: node "postgresql3" (ID: 3) monitoring upstream node "postgresql1" (ID: 1) in normal state
Sep 25 08:55:15 cassandra-warm-mackerel repmgrd[18837]: node "postgresql3" (ID: 3) monitoring upstream node "postgresql1" (ID: 1) in normal state
Sep 25 09:00:17 cassandra-warm-mackerel repmgrd[18837]: node "postgresql3" (ID: 3) monitoring upstream node "postgresql1" (ID: 1) in normal state
Sep 25 09:05:19 cassandra-warm-mackerel repmgrd[18837]: node "postgresql3" (ID: 3) monitoring upstream node "postgresql1" (ID: 1) in normal state
Sep 25 09:10:20 cassandra-warm-mackerel repmgrd[18837]: node "postgresql3" (ID: 3) monitoring upstream node "postgresql1" (ID: 1) in normal state
Sep 25 09:15:22 cassandra-warm-mackerel repmgrd[18837]: node "postgresql3" (ID: 3) monitoring upstream node "postgresql1" (ID: 1) in normal state
Sep 25 09:20:24 cassandra-warm-mackerel repmgrd[18837]: node "postgresql3" (ID: 3) monitoring upstream node "postgresql1" (ID: 1) in normal state
Sep 25 09:25:26 cassandra-warm-mackerel repmgrd[18837]: node "postgresql3" (ID: 3) monitoring upstream node "postgresql1" (ID: 1) in normal state
Sep 25 09:30:27 cassandra-warm-mackerel repmgrd[18837]: node "postgresql3" (ID: 3) monitoring upstream node "postgresql1" (ID: 1) in normal state

● detect-rogue-primary.timer - PostgreSQL Split-Brain Detection Timer
     Loaded: loaded (/etc/systemd/system/detect-rogue-primary.timer; enabled; vendor preset: enabled)
     Active: active (waiting) since Wed 2025-09-24 12:45:29 UTC; 20h ago
    Trigger: Thu 2025-09-25 09:33:30 UTC; 12s left
   Triggers: ● detect-rogue-primary.service
       Docs: man:systemd.timer(5)

Sep 24 12:45:29 cassandra-warm-mackerel systemd[1]: Stopping PostgreSQL Split-Brain Detection Timer...
Sep 24 12:45:29 cassandra-warm-mackerel systemd[1]: Started PostgreSQL Split-Brain Detection Timer.

mohitrajain · 2025-09-25T09:41:33Z

sudo -u postgres repmgr -f /etc/repmgr/17-main/repmgr.conf cluster show
 ID | Name        | Role    | Status    | Upstream    | Location | Priority | Timeline | Connection string                                                                
----+-------------+---------+-----------+-------------+----------+----------+----------+-----------------------------------------------------------------------------------
 1  | postgresql1 | primary | * running |             | default  | 150      | 1        | host=10.1.1.8 user=repmgr dbname=repmgr password=securepassword connect_timeout=2
 2  | postgresql2 | standby |   running | postgresql1 | default  | 100      | 1        | host=10.1.1.7 user=repmgr dbname=repmgr password=securepassword connect_timeout=2
 3  | postgresql3 | standby |   running | postgresql1 | default  | 50       | 1        | host=10.1.1.6 user=repmgr dbname=repmgr password=securepassword connect_timeout=2
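A minimal sketch of how a table like the one above could be checked for more than one running primary, which would suggest split-brain. This is not the detect-rogue-primary script from this PR; the function name `count_running_primaries` is hypothetical, and it assumes the `Role` and `Status` columns are the third and fourth `|`-separated fields as in the output above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count nodes that `repmgr cluster show` lists as a running primary.
# Reads the table on stdin and prints the count.
count_running_primaries() {
  awk -F'|' '
    NF >= 4 {
      role = $3; status = $4
      gsub(/[ *]/, "", role)     # drop padding and the "*" marker
      gsub(/[ *]/, "", status)
      if (role == "primary" && status == "running") n++
    }
    END { print n + 0 }
  '
}
```

It could be fed directly from the command used above, for example `sudo -u postgres repmgr -f /etc/repmgr/17-main/repmgr.conf cluster show | count_running_primaries`; any result other than 1 would warrant a closer look. Other status strings that repmgr may print are not handled here.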

mohitrajain · 2025-09-25T10:10:42Z

repmgr brings the postgresql service back up if it finds it stopped

sudo systemctl stop postgresql@17-main.service
root@cassandra-leading-eagle:~# sudo systemctl status postgresql@17-main.service
○ postgresql@17-main.service - PostgreSQL Cluster 17-main
     Loaded: loaded (/lib/systemd/system/postgresql@.service; enabled-runtime; vendor preset: enabled)
     Active: inactive (dead) since Thu 2025-09-25 10:08:27 UTC; 1s ago
    Process: 177096 ExecStop=/usr/bin/pg_ctlcluster --skip-systemctl-redirect -m fast 17-main stop (code>
   Main PID: 24886 (code=exited, status=0/SUCCESS)
        CPU: 35min 18.968s

Sep 24 12:44:06 cassandra-leading-eagle systemd[1]: Starting PostgreSQL Cluster 17-main...
Sep 24 12:44:09 cassandra-leading-eagle systemd[1]: Started PostgreSQL Cluster 17-main.
Sep 25 10:08:27 cassandra-leading-eagle systemd[1]: Stopping PostgreSQL Cluster 17-main...
Sep 25 10:08:27 cassandra-leading-eagle systemd[1]: postgresql@17-main.service: Deactivated successfully.
Sep 25 10:08:27 cassandra-leading-eagle systemd[1]: Stopped PostgreSQL Cluster 17-main.
Sep 25 10:08:27 cassandra-leading-eagle systemd[1]: postgresql@17-main.service: Consumed 35min 18.968s C>
root@cassandra-leading-eagle:~# sudo -u postgres repmgr -f /etc/repmgr/17-main/repmgr.conf cluster show
 ID | Name        | Role    | Status    | Upstream    | Location | Priority | Timeline | Connection string                                                                
----+-------------+---------+-----------+-------------+----------+----------+----------+-----------------------------------------------------------------------------------
 1  | postgresql1 | primary | * running |             | default  | 150      | 1        | host=10.1.1.8 user=repmgr dbname=repmgr password=securepassword connect_timeout=2
 2  | postgresql2 | standby |   running | postgresql1 | default  | 100      | 1        | host=10.1.1.7 user=repmgr dbname=repmgr password=securepassword connect_timeout=2
 3  | postgresql3 | standby |   running | postgresql1 | default  | 50       | 1        | host=10.1.1.6 user=repmgr dbname=repmgr password=securepassword connect_timeout=2
root@cassandra-leading-eagle:~# sudo systemctl status postgresql@17-main.service
● postgresql@17-main.service - PostgreSQL Cluster 17-main
     Loaded: loaded (/lib/systemd/system/postgresql@.service; enabled-runtime; vendor preset: enabled)
     Active: active (running) since Thu 2025-09-25 10:08:34 UTC; 19s ago
    Process: 177113 ExecStart=/usr/bin/pg_ctlcluster --skip-systemctl-redirect 17-main start (code=exite>
   Main PID: 177118 (postgres)
      Tasks: 14 (limit: 4532)
     Memory: 39.6M
        CPU: 1.399s
     CGroup: /system.slice/system-postgresql.slice/postgresql@17-main.service
             ├─177118 /usr/lib/postgresql/17/bin/postgres -D /var/lib/postgresql/17/main -c unix_socket_>
             ├─177119 "postgres: logger " "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "">
             ├─177120 "postgres: checkpointer " "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "">
             ├─177121 "postgres: background writer " "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" >
             ├─177123 "postgres: walwriter " "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "">
             ├─177124 "postgres: autovacuum launcher " "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ">
             ├─177125 "postgres: archiver " "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" >
             ├─177126 "postgres: logical replication launcher " "" "" "" "" "" "" "" "" "" "" "" "" "" ">
             ├─177127 "postgres: walsender repmgr 10.1.1.6(41268) streaming 0/C005EB0" "" "" "" "" "" "">
             ├─177129 "postgres: walsender repmgr 10.1.1.7(47974) streaming 0/C005EB0" "" "" "" "" "" "">
             ├─177160 "postgres: repmgr repmgr 10.1.1.6(34430) idle" "" "" "" "" "" "" "" "" "" "" "" "">
             ├─177163 "postgres: repmgr repmgr 10.1.1.7(47666) idle" "" "" "" "" "" "" "" "" "" "" "" "">
             ├─177165 "postgres: repmgr repmgr 10.1.1.8(50260) idle" "" "" "" "" "" "" "" "" "" "" "" "">
             └─177224 "postgres: wire-server wire-server 10.1.1.15(7997) authentication" "" "" "" "" "" >

Sep 25 10:08:31 cassandra-leading-eagle systemd[1]: Starting PostgreSQL Cluster 17-main...
Sep 25 10:08:34 cassandra-leading-eagle systemd[1]: Started PostgreSQL Cluster 17-main.

2025-09-25T10:07:19+00:00 INFO PROBE SUMMARY: 5 successful probes, 0 errors in last 5 seconds
2025-09-25T10:07:32+00:00 INFO PROBE SUMMARY: 5 successful probes, 0 errors in last 5 seconds
2025-09-25T10:07:44+00:00 INFO HISTORY CHECK: no gaps in last 20 seq
2025-09-25T10:07:44+00:00 INFO PROBE SUMMARY: 5 successful probes, 0 errors in last 5 seconds
2025-09-25T10:07:57+00:00 INFO PROBE SUMMARY: 5 successful probes, 0 errors in last 5 seconds
2025-09-25T10:08:10+00:00 INFO HISTORY CHECK: no gaps in last 20 seq
2025-09-25T10:08:10+00:00 INFO PROBE SUMMARY: 5 successful probes, 0 errors in last 5 seconds
2025-09-25T10:08:23+00:00 INFO PROBE SUMMARY: 5 successful probes, 0 errors in last 5 seconds
2025-09-25T10:08:32+00:00 WARNING Failed to fetch next seq; reconnecting...
2025-09-25T10:08:33+00:00 ERROR CONNECT/INIT failed; retry in 1s
2025-09-25T10:08:34+00:00 ERROR CONNECT/INIT failed; retry in 2s
2025-09-25T10:08:37+00:00 INFO Connected OK (schema/table ensured) host=postgresql-external-rw port=5432 db=wire-server sslmode=prefer client_id=d2152319-da41-4da0-94d8-0634c2d56683
2025-09-25T10:08:42+00:00 INFO HISTORY CHECK: no gaps in last 20 seq
2025-09-25T10:08:42+00:00 INFO PROBE SUMMARY: 5 successful probes, 1 errors in last 5 seconds
2025-09-25T10:08:54+00:00 INFO PROBE SUMMARY: 5 successful probes, 0 errors in last 5 seconds
2025-09-25T10:09:08+00:00 INFO HISTORY CHECK: no gaps in last 20 seq
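As a rough way to read the client-side impact out of a probe log like the one above, the following sketch measures the time between the first WARNING or ERROR line and the next "Connected OK" line. The function name `outage_seconds` is hypothetical, and the sketch assumes the log format shown above and that the outage does not cross midnight.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the seconds between the first WARNING/ERROR line
# and the following "Connected OK" line of a probe log read from stdin.
# Exits non-zero if no completed outage is found.
outage_seconds() {
  awk '
    function tod(ts,   t) {                 # seconds since midnight from an ISO 8601 timestamp
      split(substr(ts, 12, 8), t, ":")
      return t[1] * 3600 + t[2] * 60 + t[3]
    }
    ($2 == "WARNING" || $2 == "ERROR") && !seen { start = tod($1); seen = 1 }
    seen && !done && $2 == "INFO" && $3 == "Connected" && $4 == "OK" {
      print tod($1) - start
      done = 1
    }
    END { if (!done) exit 1 }
  '
}
```

Applied to the lines above (WARNING at 10:08:32, reconnect at 10:08:37) it prints 5. That figure covers this service restart only, not a full failover.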

mohitrajain · 2025-09-25T10:56:23Z

@sghosh23 we should add a note to the postgresql documentation that maintenance of the postgresql service requires repmgr to be stopped first; otherwise the state of the postgresql service can change during the maintenance.
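A sketch of the ordering this note implies, assuming the unit names from the status dump earlier in this thread (`postgresql@17-main`, `repmgrd@17-main`). The helper names and the `DRY_RUN` switch are hypothetical; by default the script only prints the commands it would run. It covers the local node only and does not address how repmgrd on the other nodes reacts if the stopped node is the primary.

```shell
#!/usr/bin/env bash
# Unit names as they appear in the status output in this thread.
PG_UNIT="postgresql@17-main"
REPMGRD_UNIT="repmgrd@17-main"

# Print the command when DRY_RUN=1 (the default), execute it otherwise.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Stop the failover daemon first so it cannot restart postgresql mid-maintenance.
begin_maintenance() {
  run sudo systemctl stop "$REPMGRD_UNIT"
  run sudo systemctl stop "$PG_UNIT"
}

# Bring postgresql back before handing control to repmgrd again.
end_maintenance() {
  run sudo systemctl start "$PG_UNIT"
  run sudo systemctl start "$REPMGRD_UNIT"
}
```

Calling `begin_maintenance` and later `end_maintenance` with `DRY_RUN=0` would execute the steps instead of printing them.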

mohitrajain · 2025-09-25T11:16:18Z

Can we please add documentation on how to reactivate a postgresql service that was masked by the detect-rogue-primary.timer? Also, let's mention the expected application downtime of about 4.5 mins when a failover happens.

sghosh23 · 2025-09-25T13:27:44Z

Can we please add documentation on how to reactivate a postgresql service that was masked by the detect-rogue-primary.timer? Also, let's mention the expected application downtime of about 4.5 mins when a failover happens.

We have already tested this part. I will add it to the doc.
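Until that documentation lands, a minimal sketch of what reactivating a masked unit could look like with plain systemctl. The function names are hypothetical, the unit name is taken from the status output in this thread, and the sketch does not cover re-synchronising a former primary before it is started; the documentation added in this PR remains the authoritative procedure.

```shell
#!/usr/bin/env bash
PG_UNIT="postgresql@17-main.service"

# Map the output of `systemctl is-enabled` to the action needed.
next_step_for() {
  case "$1" in
    masked|masked-runtime) echo "unmask-then-start" ;;
    *)                     echo "start" ;;
  esac
}

# Unmask the unit if needed, then start it.
recover_pg_service() {
  local state
  state="$(systemctl is-enabled "$PG_UNIT" 2>/dev/null || true)"
  if [ "$(next_step_for "$state")" = "unmask-then-start" ]; then
    sudo systemctl unmask "$PG_UNIT"
  fi
  sudo systemctl start "$PG_UNIT"
}
```

Before running `recover_pg_service`, it would be prudent to confirm with `repmgr cluster show` which node the cluster currently treats as primary.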

mohitrajain

Based on my testing (logged on the ticket), it looks good to me.

adding request for more documentation

mohitrajain

Can we please add some more documentation about repmgr behaviour during OS upgrades? For example, if someone upgrades the firmware or the kernel and that requires an OS reboot, what should be expected from the postgresql service in those situations? Do clients need to check manually each time that the postgresql service is up during upgrades, or will it come up automatically? And if it is not up, what do they need to do?

sonarqubecloud · 2025-10-15T07:34:05Z

Quality Gate passed

Issues
14 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

add pg failover automation with repmgr

68db610

sghosh23 mentioned this pull request Aug 28, 2025

WPB-19318: Ensure high-availability of the postgress cluster #801

Closed

12 tasks

sghosh23 added 2 commits August 28, 2025 12:19

Add a drop-in to guard the primary auto start

79eb0ec

add monitoring to detect split-brain and organize the playbooks

743a97d

sghosh23 marked this pull request as ready for review August 29, 2025 11:50

sghosh23 requested review from a team and julialongtin as code owners August 29, 2025 11:50

sghosh23 added 7 commits August 29, 2025 14:32

Update postgresql configuration and documentation

ab07fc3

Update the doc

e09ac6a

Merge branch 'master' into wpb-19318-pg-ha

57964fb

fix: typo on repmger.conf and update playbooks

ee0a531

debug: test deployment

9321edd

skip demo and mini build for now

5e57636

fix: set the right dns-resolver

759a7cf

mohitrajain requested changes Sep 16, 2025


sghosh23 added 7 commits September 18, 2025 18:06

Merge branch 'master' into wpb-19318-pg-ha

b519d48

docs: Clarify Kubernetes integration architecture

bc4b4c3

- PostgreSQL cluster runs independently, not integrated with endpoint-manager
- Explain postgres-endpoint-manager as separate component that monitors cluster externally
- Emphasize independent operation of cluster vs endpoint management

Optimize the doc

86a6e60

Optimize the doc to have a cleaner order of texts

10391bf

Update postgres document with full command paths

0d6347c

sghosh23 added 2 commits September 25, 2025 16:05

fix the repmgr reconnect time and adjust doc

e39dc15

update document

d69f358

mohitrajain previously approved these changes Sep 26, 2025


sghosh23 added 3 commits September 26, 2025 15:55

add postgresql-external values file for the CI

06ad1e7

add demo values

7885fe5

Merge branch 'master' into wpb-19318-pg-ha

3bef7d0

sghosh23 force-pushed the wpb-19318-pg-ha branch from 4a7d57b to 3bef7d0 October 2, 2025 14:36

sghosh23 mentioned this pull request Oct 6, 2025

WPB-19318: Add postgresql-cluster setup guide wireapp/wire-docs#80

Merged

8 tasks

mohitrajain self-requested a review October 7, 2025 09:08

mohitrajain requested changes Oct 7, 2025


Update with different cluster recovery scenario

c9fdf7d

mohitrajain approved these changes Oct 9, 2025


sghosh23 added 5 commits October 9, 2025 10:26

add instructions regarding rogue-detector and unmasking the pg service

8c5ec6a

store the postgresql secret as k8s secret

26a96d9

optimize the password management section

5d4e12b

sync k8s secrets

46fd118

refactor the sync command

c3226c8
