Excessive timeline switches on managed cluster; async replication data-loss risk #1

@big-hill

Description

Observed

Our flux-pg-cluster instance is on timeline 12 today. That means 11 automatic failovers since
the cluster was bootstrapped about 3 weeks ago, i.e. roughly one leader change every two
days, none of them triggered manually.
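Spelling out the arithmetic (cluster age is approximate, so the rate is a rough figure):

```python
# Values taken from this report; "about 3 weeks" approximated as 21 days.
timeline = 12
cluster_age_days = 21

failovers = timeline - 1  # each timeline switch corresponds to one leader change
days_per_failover = cluster_age_days / failovers

print(failovers, round(days_per_failover, 1))  # -> 11 1.9
```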

The most recent occurred today (2026-05-01): the primary moved silently between two cluster
members while client applications were running. We only noticed by probing candidate nodes
manually with pg_is_in_recovery() from outside the cluster; we have no direct visibility
into how connected app processes adapted. The Patroni REST API on port 18008 was not
reachable from outside the Flux network, so we could not confirm via /primary.
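For reference, the manual probe looks roughly like this (host names are placeholders, not
actual Flux addresses; the Patroni REST check is shown commented out since that port was
unreachable for us):

```shell
# Probe each candidate node for primary status.
# pg_is_in_recovery() returns 'f' on the primary and 't' on replicas.
probe_node() {
  local host="$1"
  psql "host=$host dbname=postgres connect_timeout=3" -Atc 'SELECT pg_is_in_recovery();'
}

for host in node1.example node2.example node3.example; do
  echo "$host: $(probe_node "$host" 2>/dev/null || echo unreachable)"
  # Patroni REST equivalent (returns HTTP 200 only on the primary):
  # curl -s -o /dev/null -w '%{http_code}\n' "http://$host:18008/primary"
done
```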

Why this is a problem

flux-pg-cluster runs Patroni with synchronous_mode: off (asynchronous replication) by
default. Each of the 11 failovers is therefore a potential silent data-loss point:
transactions committed on the outgoing leader that had not yet replicated are discarded when
the new leader forks its timeline, even though clients received commit acknowledgements.

For a managed-DB offering positioned as production-ready, 11 silent data-loss windows in 3
weeks is concerning even if no actual loss has been observed.

Ask

  1. Is this failover rate expected? It looks indicative of either network instability between
    Flux DCS nodes, or aggressive default ttl/loop_wait thresholds.
  2. Can synchronous_mode: quorum (or at least synchronous_mode: on with
    synchronous_node_count: 1) be exposed as a config knob in flux-pg-cluster? Patroni
    supports both natively.
  3. Any recommended approach to surface "transactions on previous timeline lost in failover"
    as an alert? Currently we have no visibility into whether silent loss has happened.
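For item 2, the Patroni DCS settings we are asking flux-pg-cluster to expose would look
roughly like this (key names per the Patroni documentation; whether they can be passed
through Flux's config surface is exactly the question):

```yaml
synchronous_mode: quorum      # or: on
synchronous_node_count: 1     # replicas that must confirm each commit
# Optional: refuse to degrade to async when no sync replica is available
# synchronous_mode_strict: true
```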

(Hard reinstall data loss, restore UI and leader discovery have already been discussed with
@ali via Discord. This issue focuses narrowly on timeline switching frequency and
async-replication risk.)
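One possible building block for the alert in item 3: after a failover, the new timeline's
.history file records the LSN at which it branched off its parent. Comparing that
switchpoint against the last LSN the old primary was known to have reached (e.g. from
monitoring) bounds how much acknowledged WAL never made it. A minimal sketch; function
names and the sample values are mine, but the history-file format (parent TLI, switchpoint
LSN, reason) is PostgreSQL's:

```python
def parse_lsn(lsn: str) -> int:
    """Convert a PostgreSQL LSN like '0/5000098' into an integer byte position."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def lost_bytes(history: str, old_timeline: int, old_primary_lsn: str) -> int:
    """Bytes of WAL the old primary had beyond the switchpoint of its timeline.

    history: contents of the new timeline's .history file.
    """
    for line in history.strip().splitlines():
        parts = line.split()
        if len(parts) >= 2 and int(parts[0]) == old_timeline:
            switchpoint = parse_lsn(parts[1])
            return max(0, parse_lsn(old_primary_lsn) - switchpoint)
    raise ValueError(f"timeline {old_timeline} not found in history file")

# Made-up example: the old primary on timeline 11 had reached 0/70001A0,
# but timeline 12 branched at 0/7000150 -> 0x50 bytes were never replicated.
history = "11\t0/7000150\tno recovery target specified\n"
print(lost_bytes(history, 11, "0/70001A0"))  # -> 80
```

A nonzero result after a timeline switch is precisely the "transactions on previous
timeline lost in failover" condition we would like to alert on.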

Environment

flux-pg-cluster 3-node, Custom tier (2 vCPU, 6GB RAM, 80GB SSD per node), Europe region.
Cluster age about 3 weeks. Timeline progression: 1 → 12.

Hardware is well above the resource-exhaustion bar; each node has plenty of headroom. The
failover frequency is therefore more likely caused by network instability between Flux DCS
nodes or overly aggressive default Patroni timing thresholds than by node load.
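The timing knobs in question, with Patroni's upstream defaults (whether flux-pg-cluster
overrides them is unknown, hence point 1 of the ask):

```yaml
ttl: 30            # seconds the leader key lives in the DCS
loop_wait: 10      # seconds between HA loop iterations
retry_timeout: 10  # retry window for DCS and PostgreSQL operations
# A leader that cannot refresh its key within ttl is demoted, so flaky DCS
# connectivity converts directly into failovers.
```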

*report summarised with Claude.
