@sghosh23 sghosh23 commented Aug 28, 2025

Please check out the system design doc for more info: https://wearezeta.atlassian.net/wiki/spaces/CUSSOPS/pages/2088108112/PostgreSQL+High+Availability+System+Design

Change type

  • Fix
  • Feature
  • Documentation
  • Security / Upgrade

Basic information

  • THIS CHANGE REQUIRES A DEPLOYMENT PACKAGE RELEASE
  • THIS CHANGE REQUIRES A WIRE-DOCS RELEASE

Testing

  • I ran/applied the changes myself, in a test environment.
  • The CI job attached to this repo will test it for me.

Tracking

  • I added a new entry in an appropriate subdirectory of changelog.d
  • I mentioned this PR in Jira, OR I mentioned the Jira ticket in this PR.
  • I mentioned this PR in one of the issues attached to one of our repositories.

Knowledge Transfer

  • An Asciinema session is attached to the Jira ticket.

Motivation

Objective

Reason

Use case

@sghosh23 sghosh23 marked this pull request as ready for review August 29, 2025 11:50
@sghosh23 sghosh23 requested review from a team and julialongtin as code owners August 29, 2025 11:50
…sive docs

- Consolidate PostgreSQL configuration into single unified template
- Fix split-brain detection script (correct 'rouge' to 'rogue' typo)
- Add detailed HA features documentation with failover validation
- Include monitoring & event system documentation
- Add node_id and priority configuration parameters
- Add official repmgr and PostgreSQL documentation references
- Improve deployment commands and monitoring checks
- Enhance split-brain protection with advanced features
- Remove duplicate HA features list from Key Concepts section
- Remove duplicate monitoring system section from Configuration Options
- Fix incorrect numbering in monitoring commands (5 → 8)
- Consolidate monitoring information into single comprehensive section
- PostgreSQL cluster runs independently, not integrated with endpoint-manager
- Explain postgres-endpoint-manager as separate component that monitors cluster externally
- Emphasize independent operation of cluster vs endpoint management
@mohitrajain
Contributor

Dumping the status of the services and their logs:

sudo systemctl status postgresql@17-main repmgrd@17-main detect-rogue-primary.timer -l --no-pager
● postgresql@17-main.service - PostgreSQL Cluster 17-main
     Loaded: loaded (/lib/systemd/system/postgresql@.service; enabled-runtime; vendor preset: enabled)
     Active: active (running) since Wed 2025-09-24 12:45:01 UTC; 20h ago
   Main PID: 18744 (postgres)
      Tasks: 7 (limit: 4532)
     Memory: 107.8M
        CPU: 9min 54.399s
     CGroup: /system.slice/system-postgresql.slice/postgresql@17-main.service
             ├─18744 /usr/lib/postgresql/17/bin/postgres -D /var/lib/postgresql/17/main -c unix_socket_directories=/var/run/postgresql -c config_file=/etc/postgresql/17/main/postgresql.conf
             ├─18745 "postgres: logger "
             ├─18746 "postgres: checkpointer "
             ├─18747 "postgres: background writer "
             ├─18748 "postgres: startup recovering 00000001000000000000000B"
             ├─18749 "postgres: walreceiver streaming 0/B4E1D80"
             └─18835 "postgres: repmgr repmgr 10.1.1.6(48288) idle"

Sep 24 12:44:58 cassandra-warm-mackerel systemd[1]: Starting PostgreSQL Cluster 17-main...
Sep 24 12:45:01 cassandra-warm-mackerel systemd[1]: Started PostgreSQL Cluster 17-main.

● repmgrd@17-main.service - Repmgr failover daemon (instance 17-main)
     Loaded: loaded (/etc/systemd/system/repmgrd@.service; enabled; vendor preset: enabled)
     Active: active (running) since Wed 2025-09-24 12:45:14 UTC; 20h ago
   Main PID: 18837 (repmgrd)
      Tasks: 1 (limit: 4532)
     Memory: 2.0M
        CPU: 19min 44.459s
     CGroup: /system.slice/system-repmgrd.slice/repmgrd@17-main.service
             └─18837 /usr/lib/postgresql/17/bin/repmgrd -f /etc/repmgr/17-main/repmgr.conf --daemonize

Sep 25 08:45:11 cassandra-warm-mackerel repmgrd[18837]: node "postgresql3" (ID: 3) monitoring upstream node "postgresql1" (ID: 1) in normal state
Sep 25 08:50:13 cassandra-warm-mackerel repmgrd[18837]: node "postgresql3" (ID: 3) monitoring upstream node "postgresql1" (ID: 1) in normal state
Sep 25 08:55:15 cassandra-warm-mackerel repmgrd[18837]: node "postgresql3" (ID: 3) monitoring upstream node "postgresql1" (ID: 1) in normal state
Sep 25 09:00:17 cassandra-warm-mackerel repmgrd[18837]: node "postgresql3" (ID: 3) monitoring upstream node "postgresql1" (ID: 1) in normal state
Sep 25 09:05:19 cassandra-warm-mackerel repmgrd[18837]: node "postgresql3" (ID: 3) monitoring upstream node "postgresql1" (ID: 1) in normal state
Sep 25 09:10:20 cassandra-warm-mackerel repmgrd[18837]: node "postgresql3" (ID: 3) monitoring upstream node "postgresql1" (ID: 1) in normal state
Sep 25 09:15:22 cassandra-warm-mackerel repmgrd[18837]: node "postgresql3" (ID: 3) monitoring upstream node "postgresql1" (ID: 1) in normal state
Sep 25 09:20:24 cassandra-warm-mackerel repmgrd[18837]: node "postgresql3" (ID: 3) monitoring upstream node "postgresql1" (ID: 1) in normal state
Sep 25 09:25:26 cassandra-warm-mackerel repmgrd[18837]: node "postgresql3" (ID: 3) monitoring upstream node "postgresql1" (ID: 1) in normal state
Sep 25 09:30:27 cassandra-warm-mackerel repmgrd[18837]: node "postgresql3" (ID: 3) monitoring upstream node "postgresql1" (ID: 1) in normal state

● detect-rogue-primary.timer - PostgreSQL Split-Brain Detection Timer
     Loaded: loaded (/etc/systemd/system/detect-rogue-primary.timer; enabled; vendor preset: enabled)
     Active: active (waiting) since Wed 2025-09-24 12:45:29 UTC; 20h ago
    Trigger: Thu 2025-09-25 09:33:30 UTC; 12s left
   Triggers: ● detect-rogue-primary.service
       Docs: man:systemd.timer(5)

Sep 24 12:45:29 cassandra-warm-mackerel systemd[1]: Stopping PostgreSQL Split-Brain Detection Timer...
Sep 24 12:45:29 cassandra-warm-mackerel systemd[1]: Started PostgreSQL Split-Brain Detection Timer.

@mohitrajain
Contributor

sudo -u postgres repmgr -f /etc/repmgr/17-main/repmgr.conf cluster show
 ID | Name        | Role    | Status    | Upstream    | Location | Priority | Timeline | Connection string                                                                
----+-------------+---------+-----------+-------------+----------+----------+----------+-----------------------------------------------------------------------------------
 1  | postgresql1 | primary | * running |             | default  | 150      | 1        | host=10.1.1.8 user=repmgr dbname=repmgr password=securepassword connect_timeout=2
 2  | postgresql2 | standby |   running | postgresql1 | default  | 100      | 1        | host=10.1.1.7 user=repmgr dbname=repmgr password=securepassword connect_timeout=2
 3  | postgresql3 | standby |   running | postgresql1 | default  | 50       | 1        | host=10.1.1.6 user=repmgr dbname=repmgr password=securepassword connect_timeout=2
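
A quick way to turn the table above into a split-brain sanity check is to count the rows whose role is `primary` and whose status contains `running`. The helper below is a minimal sketch, not part of this PR: the function name `count_running_primaries` is hypothetical, and it assumes the pipe-separated table layout shown above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of this PR): reads the table printed by
# `repmgr cluster show` on stdin and prints how many rows have Role "primary"
# and a Status containing "running". Assumes the column layout shown above.
count_running_primaries() {
  awk -F'|' '
    NF >= 4 {
      role = $3
      status = $4
      gsub(/^[ \t]+|[ \t]+$/, "", role)   # trim padding around the role
      if (role == "primary" && status ~ /running/) n++
    }
    END { print n + 0 }
  '
}

# Intended use on a cluster node:
#   sudo -u postgres repmgr -f /etc/repmgr/17-main/repmgr.conf cluster show | count_running_primaries
```

A healthy cluster should yield 1. Any other value is worth investigating, although this check only reflects what the queried node can see.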

@mohitrajain
Contributor

mohitrajain commented Sep 25, 2025

repmgr brings the postgresql service back up if it finds the service stopped:

sudo systemctl stop postgresql@17-main.service
root@cassandra-leading-eagle:~# sudo systemctl status postgresql@17-main.service
● postgresql@17-main.service - PostgreSQL Cluster 17-main
     Loaded: loaded (/lib/systemd/system/postgresql@.service; enabled-runtime; vendor preset: enabled)
     Active: inactive (dead) since Thu 2025-09-25 10:08:27 UTC; 1s ago
    Process: 177096 ExecStop=/usr/bin/pg_ctlcluster --skip-systemctl-redirect -m fast 17-main stop (code>
   Main PID: 24886 (code=exited, status=0/SUCCESS)
        CPU: 35min 18.968s

Sep 24 12:44:06 cassandra-leading-eagle systemd[1]: Starting PostgreSQL Cluster 17-main...
Sep 24 12:44:09 cassandra-leading-eagle systemd[1]: Started PostgreSQL Cluster 17-main.
Sep 25 10:08:27 cassandra-leading-eagle systemd[1]: Stopping PostgreSQL Cluster 17-main...
Sep 25 10:08:27 cassandra-leading-eagle systemd[1]: postgresql@17-main.service: Deactivated successfully.
Sep 25 10:08:27 cassandra-leading-eagle systemd[1]: Stopped PostgreSQL Cluster 17-main.
Sep 25 10:08:27 cassandra-leading-eagle systemd[1]: postgresql@17-main.service: Consumed 35min 18.968s C>
root@cassandra-leading-eagle:~# sudo -u postgres repmgr -f /etc/repmgr/17-main/repmgr.conf cluster show
 ID | Name        | Role    | Status    | Upstream    | Location | Priority | Timeline | Connection string                                                                
----+-------------+---------+-----------+-------------+----------+----------+----------+-----------------------------------------------------------------------------------
 1  | postgresql1 | primary | * running |             | default  | 150      | 1        | host=10.1.1.8 user=repmgr dbname=repmgr password=securepassword connect_timeout=2
 2  | postgresql2 | standby |   running | postgresql1 | default  | 100      | 1        | host=10.1.1.7 user=repmgr dbname=repmgr password=securepassword connect_timeout=2
 3  | postgresql3 | standby |   running | postgresql1 | default  | 50       | 1        | host=10.1.1.6 user=repmgr dbname=repmgr password=securepassword connect_timeout=2
root@cassandra-leading-eagle:~# sudo systemctl status postgresql@17-main.service
● postgresql@17-main.service - PostgreSQL Cluster 17-main
     Loaded: loaded (/lib/systemd/system/postgresql@.service; enabled-runtime; vendor preset: enabled)
     Active: active (running) since Thu 2025-09-25 10:08:34 UTC; 19s ago
    Process: 177113 ExecStart=/usr/bin/pg_ctlcluster --skip-systemctl-redirect 17-main start (code=exite>
   Main PID: 177118 (postgres)
      Tasks: 14 (limit: 4532)
     Memory: 39.6M
        CPU: 1.399s
     CGroup: /system.slice/system-postgresql.slice/postgresql@17-main.service
             ├─177118 /usr/lib/postgresql/17/bin/postgres -D /var/lib/postgresql/17/main -c unix_socket_>
             ├─177119 "postgres: logger "
             ├─177120 "postgres: checkpointer "
             ├─177121 "postgres: background writer "
             ├─177123 "postgres: walwriter "
             ├─177124 "postgres: autovacuum launcher "
             ├─177125 "postgres: archiver "
             ├─177126 "postgres: logical replication launcher "
             ├─177127 "postgres: walsender repmgr 10.1.1.6(41268) streaming 0/C005EB0"
             ├─177129 "postgres: walsender repmgr 10.1.1.7(47974) streaming 0/C005EB0"
             ├─177160 "postgres: repmgr repmgr 10.1.1.6(34430) idle"
             ├─177163 "postgres: repmgr repmgr 10.1.1.7(47666) idle"
             ├─177165 "postgres: repmgr repmgr 10.1.1.8(50260) idle"
             └─177224 "postgres: wire-server wire-server 10.1.1.15(7997) authentication"

Sep 25 10:08:31 cassandra-leading-eagle systemd[1]: Starting PostgreSQL Cluster 17-main...
Sep 25 10:08:34 cassandra-leading-eagle systemd[1]: Started PostgreSQL Cluster 17-main.
2025-09-25T10:07:19+00:00 INFO PROBE SUMMARY: 5 successful probes, 0 errors in last 5 seconds
2025-09-25T10:07:32+00:00 INFO PROBE SUMMARY: 5 successful probes, 0 errors in last 5 seconds
2025-09-25T10:07:44+00:00 INFO HISTORY CHECK: no gaps in last 20 seq
2025-09-25T10:07:44+00:00 INFO PROBE SUMMARY: 5 successful probes, 0 errors in last 5 seconds
2025-09-25T10:07:57+00:00 INFO PROBE SUMMARY: 5 successful probes, 0 errors in last 5 seconds
2025-09-25T10:08:10+00:00 INFO HISTORY CHECK: no gaps in last 20 seq
2025-09-25T10:08:10+00:00 INFO PROBE SUMMARY: 5 successful probes, 0 errors in last 5 seconds
2025-09-25T10:08:23+00:00 INFO PROBE SUMMARY: 5 successful probes, 0 errors in last 5 seconds
2025-09-25T10:08:32+00:00 WARNING Failed to fetch next seq; reconnecting...
2025-09-25T10:08:33+00:00 ERROR CONNECT/INIT failed; retry in 1s
2025-09-25T10:08:34+00:00 ERROR CONNECT/INIT failed; retry in 2s
2025-09-25T10:08:37+00:00 INFO Connected OK (schema/table ensured) host=postgresql-external-rw port=5432 db=wire-server sslmode=prefer client_id=d2152319-da41-4da0-94d8-0634c2d56683
2025-09-25T10:08:42+00:00 INFO HISTORY CHECK: no gaps in last 20 seq
2025-09-25T10:08:42+00:00 INFO PROBE SUMMARY: 5 successful probes, 1 errors in last 5 seconds
2025-09-25T10:08:54+00:00 INFO PROBE SUMMARY: 5 successful probes, 0 errors in last 5 seconds
2025-09-25T10:09:08+00:00 INFO HISTORY CHECK: no gaps in last 20 seq
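
To put a number on the client-side interruption in a probe log like the one above, one option is to measure the time from the first WARNING or ERROR line to the next `Connected OK` line. The sketch below is an assumption-laden helper, not a tool from this PR: the name `outage_seconds` is hypothetical, it assumes the timestamp format shown above, and it assumes each outage starts and ends on the same UTC day.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads a probe log on stdin and prints, for each outage,
# the seconds between the first WARNING/ERROR line and the next
# "INFO Connected ..." line. Assumes timestamps like 2025-09-25T10:08:32+00:00
# and that an outage does not cross midnight.
outage_seconds() {
  awk '
    function secs(ts,   t, p) {
      split(ts, t, "T")                    # t[2] is "HH:MM:SS+00:00"
      split(substr(t[2], 1, 8), p, ":")    # p holds hours, minutes, seconds
      return p[1] * 3600 + p[2] * 60 + p[3]
    }
    ($2 == "WARNING" || $2 == "ERROR") && start == "" { start = secs($1) }
    $2 == "INFO" && $3 == "Connected" && start != "" {
      print secs($1) - start
      start = ""
    }
  '
}

# Intended use: outage_seconds < probe.log   (probe.log is a placeholder name)
```

This measures only from the first logged failure to the reconnect, so the real gap since the last successful probe can be slightly longer.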

@mohitrajain
Contributor

@sghosh23 we should leave a note in the PostgreSQL documentation about maintenance of the postgresql service: repmgr has to be stopped first, otherwise the state of the postgresql service can change during the maintenance.
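
The ordering suggested here could be sketched as follows. This is only an illustration under assumptions: the function names `run`, `begin_maintenance` and `end_maintenance` are hypothetical, the unit names are the ones from the status output earlier in this thread, and whether repmgrd on the other nodes or the detect-rogue-primary timer also need to be paused is not settled in this conversation.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a maintenance wrapper for a single node.
# Set DRY_RUN=1 to print the commands instead of running them.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Stop repmgrd first so it does not react while postgresql is down.
begin_maintenance() {
  run sudo systemctl stop repmgrd@17-main &&
  run sudo systemctl stop postgresql@17-main
}

# Bring postgresql back before handing control back to repmgrd.
end_maintenance() {
  run sudo systemctl start postgresql@17-main &&
  run sudo systemctl start repmgrd@17-main
}
```

Running `DRY_RUN=1; begin_maintenance` prints the two stop commands in order, which makes it easy to review the sequence before applying it to a real node.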

@mohitrajain
Contributor

Can we please add documentation on how to reactivate a postgresql service that was masked by the detect-rogue-primary.timer? Also, let's mention that an application should expect about 4.5 minutes of downtime when a failover happens.
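
As a starting point for that documentation, the systemd side of the recovery could look like the sketch below. It is an assumption, not the tested procedure from this PR: the function name `recovery_commands` is hypothetical, it only prints commands rather than running them, and it says nothing about whether the node must first be re-synced or rejoined as a standby before PostgreSQL is started again.

```shell
#!/usr/bin/env bash
# Hypothetical helper: given a unit name and the state reported by
# `systemctl is-enabled <unit>`, print the systemctl commands that would
# bring the unit back. It prints only; nothing is executed.
recovery_commands() {
  unit=$1
  state=$2
  if [ "$state" = "masked" ]; then
    # A masked unit cannot be started until the mask is removed.
    echo "sudo systemctl unmask $unit"
  fi
  echo "sudo systemctl start $unit"
}

# Intended use on the affected node:
#   recovery_commands postgresql@17-main "$(systemctl is-enabled postgresql@17-main)"
```

Starting a former rogue primary without first deciding its role could recreate the split-brain situation, so the printed commands should be reviewed against the cluster state before use.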

@sghosh23
Contributor Author

Can we please add documentation on how to reactivate a postgresql service that was masked by the detect-rogue-primary.timer? Also, let's mention that an application should expect about 4.5 minutes of downtime when a failover happens.

We have already tested this part, so I will add it to the doc.

mohitrajain previously approved these changes Sep 26, 2025
@mohitrajain mohitrajain left a comment

Based on my testing (logged on the ticket), it looks good to me.

@mohitrajain mohitrajain self-requested a review October 7, 2025 09:08
@mohitrajain mohitrajain dismissed their stale review October 7, 2025 09:09

adding request for more documentation

@mohitrajain mohitrajain left a comment

Can we please add some more documentation about repmgr behaviour during OS upgrades? For example, if someone performs a firmware upgrade or a kernel upgrade that requires an OS reboot, what should be expected from the postgresql service in those situations? Do clients need to manually verify that the postgresql service is up after each upgrade, or does it come back automatically? And if it is not up, what do they need to do?
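
Until that documentation exists, a post-reboot check could be as simple as the sketch below. It does not answer whether the services come back automatically. It is a hypothetical helper (the names `unit_states` and `report_inactive` are not from this PR) that lists which of the units named earlier in this thread are not active.

```shell
#!/usr/bin/env bash
# Hypothetical post-reboot check, using the unit names from this thread.

# Emit one "unit state" pair per line, as reported by `systemctl is-active`.
unit_states() {
  for unit in postgresql@17-main repmgrd@17-main detect-rogue-primary.timer; do
    echo "$unit $(systemctl is-active "$unit")"
  done
}

# Read "unit state" pairs on stdin, print the units that are not "active",
# and return a non-zero status if there is at least one.
report_inactive() {
  awk '$2 != "active" { print $1; bad = 1 } END { exit bad + 0 }'
}

# Intended use after a reboot:
#   unit_states | report_inactive || echo "some units need attention"
```

Splitting the query from the report keeps the reporting logic usable on saved output, for example when checking several nodes from one place.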
