Observed
Our flux-pg-cluster instance is on timeline 12 today. That means 11 auto-failovers since
the cluster was bootstrapped about 3 weeks ago. Roughly one leader change every 3 days,
none triggered manually.
Most recent occurrence: today (2026-05-01), the primary moved silently between two cluster
members while client apps were running. We only noticed the change by probing candidate nodes
manually with pg_is_in_recovery() from outside the cluster; we have no direct visibility
into how connected app processes adapted. The Patroni REST API on port 18008 was not
reachable from outside the Flux network, so we could not confirm via /primary.
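For reference, this is roughly how the manual probe worked. A minimal sketch in Python, assuming psycopg2 is available and the PostgreSQL port is reachable from outside; the hostnames and user are placeholders, not the real Flux endpoints, and credentials are expected via PGPASSWORD / .pgpass:

```python
# probe_roles.py - rough reconstruction of the manual probe described above.
# Hostnames and the user below are placeholders, not the real Flux endpoints.
import psycopg2

CANDIDATE_NODES = ["node-a.example", "node-b.example", "node-c.example"]

for host in CANDIDATE_NODES:
    try:
        conn = psycopg2.connect(
            host=host, port=5432, dbname="postgres", user="probe", connect_timeout=3
        )
        try:
            with conn.cursor() as cur:
                # pg_is_in_recovery() is false only on the current primary.
                cur.execute("SELECT pg_is_in_recovery()")
                in_recovery = cur.fetchone()[0]
                # pg_control_checkpoint() reports the timeline this node is on.
                cur.execute("SELECT timeline_id FROM pg_control_checkpoint()")
                timeline = cur.fetchone()[0]
        finally:
            conn.close()
        role = "replica" if in_recovery else "primary"
        print(f"{host}: {role}, timeline {timeline}")
    except psycopg2.Error as exc:
        print(f"{host}: unreachable ({exc})")
```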
Why this is a problem
flux-pg-cluster runs Patroni with synchronous_mode: off (async replication) by default.
Each of the 11 failovers is therefore a potential silent data-loss point: transactions
committed on the outgoing leader but not yet replicated are gone once the incoming leader is
promoted, even though clients already received a commit acknowledgement.
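To make that window concrete: the amount of WAL the replicas have not yet replayed can be sampled on the primary at any moment. A minimal sketch with a placeholder DSN, again assuming psycopg2; under async replication, anything in this gap disappears if the primary dies at that instant:

```python
# replication_gap.py - sample how far the replicas trail the primary, in bytes of WAL.
# DSN is a placeholder; run against the current primary.
import psycopg2

conn = psycopg2.connect("host=primary.example port=5432 dbname=postgres user=probe")
try:
    with conn.cursor() as cur:
        cur.execute("""
            SELECT application_name,
                   pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_gap_bytes
            FROM pg_stat_replication
        """)
        for name, gap in cur.fetchall():
            # Under synchronous_mode: off, this gap is WAL that clients may already
            # consider committed but that would not survive a failover right now.
            print(f"{name}: trailing by {gap} bytes")
finally:
    conn.close()
```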
For a managed-DB offering positioned as production-ready, 11 silent data-loss windows in 3
weeks is concerning even if no actual loss has been observed.
Ask
- Is this failover rate expected? It looks indicative of either network instability between
Flux DCS nodes, or aggressive default ttl/loop_wait thresholds.
- Can synchronous_mode: quorum (or at least synchronous_mode: on with
synchronous_node_count: 1) be exposed as a config knob in flux-pg-cluster? Patroni supports
both natively. (A rough sketch of what toggling this via Patroni's dynamic configuration
would look like follows this list.)
- Any recommended approach to surface "transactions on previous timeline lost in failover"
as an alert? Currently we have no visibility into whether silent loss has happened. (A
minimal polling sketch we could run ourselves in the meantime also follows below.)
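On the synchronous_mode question: if the Patroni REST endpoint were reachable (it is not from outside the Flux network today) and the knob were exposed, toggling it would roughly amount to a PATCH against Patroni's dynamic configuration. A sketch with a placeholder URL; it also dumps ttl / loop_wait / retry_timeout, which bears on the first question about failover sensitivity, and it assumes a Patroni version that accepts synchronous_mode: quorum, as noted above:

```python
# patroni_sync_mode.py - sketch of inspecting and patching Patroni's dynamic config.
# The URL is a placeholder; the endpoint is only reachable from inside the Flux network.
import json
import urllib.request

PATRONI_URL = "http://patroni.internal.example:18008"

# Current dynamic configuration, including the failover timing knobs.
with urllib.request.urlopen(f"{PATRONI_URL}/config", timeout=5) as resp:
    current = json.load(resp)
print({k: current.get(k) for k in ("ttl", "loop_wait", "retry_timeout", "synchronous_mode")})

# Merge synchronous settings into the dynamic config (Patroni persists this in the DCS).
patch = json.dumps({"synchronous_mode": "quorum", "synchronous_node_count": 1}).encode()
req = urllib.request.Request(
    f"{PATRONI_URL}/config",
    data=patch,
    method="PATCH",
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req, timeout=5) as resp:
    print(resp.status, resp.read().decode())
```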
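On the alerting question, a minimal sketch of what we could run ourselves in the meantime: poll the timeline on each node and fire when it advances. Hostnames are placeholders. This only surfaces that a switch happened; proving which acknowledged commits were lost would additionally require comparing the old leader's last LSN with the divergence point in the new timeline's history file, which we cannot reach today.

```python
# timeline_alert.py - poll cluster nodes and alert when the PostgreSQL timeline advances.
# Placeholders throughout; wire the print() into real alerting (pager, webhook, ...).
import time
import psycopg2

NODES = ["node-a.example", "node-b.example", "node-c.example"]
last_seen: dict[str, int] = {}

def current_timeline(host: str) -> int:
    conn = psycopg2.connect(
        host=host, port=5432, dbname="postgres", user="probe", connect_timeout=3
    )
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT timeline_id FROM pg_control_checkpoint()")
            return cur.fetchone()[0]
    finally:
        conn.close()

while True:
    for host in NODES:
        try:
            tl = current_timeline(host)
        except psycopg2.Error:
            continue  # node unreachable this round; try again next loop
        if host in last_seen and tl > last_seen[host]:
            print(f"ALERT: {host} advanced from timeline {last_seen[host]} to {tl}")
        last_seen[host] = tl
    time.sleep(30)
```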
(Hard reinstall data loss, restore UI and leader discovery have already been discussed with
@ali via Discord. This issue focuses narrowly on timeline switching frequency and
async-replication risk.)
Environment
flux-pg-cluster 3-node, Custom tier (2 vCPU, 6GB RAM, 80GB SSD per node), Europe region.
Cluster age about 3 weeks. Timeline progression: 1 → 12.
Hardware is well above any resource-exhaustion threshold; each node has plenty of headroom.
The failover frequency is therefore more likely caused by network instability between Flux
DCS nodes or overly aggressive default Patroni timing thresholds than by node load.
*report summarised with Claude.