Excessive timeline switches on managed cluster; async replication data-loss risk #1

@big-hill

Description

Observed

Our flux-pg-cluster instance is on timeline 12 today. That means 11 automatic failovers since
the cluster was bootstrapped about 3 weeks ago, i.e. roughly one leader change every two
days, none of them triggered manually.
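Spelling out the arithmetic (cluster age is approximate, so the rate is a rough figure):

```python
# Values taken from this report; "about 3 weeks" approximated as 21 days.
timeline = 12
cluster_age_days = 21

failovers = timeline - 1  # each timeline switch corresponds to one leader change
days_per_failover = cluster_age_days / failovers

print(failovers, round(days_per_failover, 1))  # -> 11 1.9
```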

The most recent occurred today (2026-05-01): the primary moved silently between two cluster
members while client applications were running. We only noticed by probing candidate nodes
manually with pg_is_in_recovery() from outside the cluster; we have no direct visibility
into how connected app processes adapted. The Patroni REST API on port 18008 was not
reachable from outside the Flux network, so we could not confirm via /primary.
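For reference, the manual probe looks roughly like this (host names are placeholders, not
actual Flux addresses; the Patroni REST check is shown commented out since that port was
unreachable for us):

```shell
# Probe each candidate node for primary status.
# pg_is_in_recovery() returns 'f' on the primary and 't' on replicas.
probe_node() {
  local host="$1"
  psql "host=$host dbname=postgres connect_timeout=3" -Atc 'SELECT pg_is_in_recovery();'
}

for host in node1.example node2.example node3.example; do
  echo "$host: $(probe_node "$host" 2>/dev/null || echo unreachable)"
  # Patroni REST equivalent (returns HTTP 200 only on the primary):
  # curl -s -o /dev/null -w '%{http_code}\n' "http://$host:18008/primary"
done
```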

Why this is a problem

flux-pg-cluster runs Patroni with synchronous_mode: off (asynchronous replication) by
default. Each of the 11 failovers is therefore a potential silent data-loss point:
transactions committed on the outgoing leader that had not yet replicated are discarded when
the new leader forks its timeline, even though clients received commit acknowledgements.

For a managed-DB offering positioned as production-ready, 11 silent data-loss windows in 3
weeks is concerning even if no actual loss has been observed.

Ask

  1. Is this failover rate expected? It looks indicative of either network instability between
    Flux DCS nodes, or aggressive default ttl/loop_wait thresholds.
  2. Can synchronous_mode: quorum (or at least synchronous_mode: on with
    synchronous_node_count: 1) be exposed as a config knob in flux-pg-cluster? Patroni
    supports both natively.
  3. Any recommended approach to surface "transactions on previous timeline lost in failover"
    as an alert? Currently we have no visibility into whether silent loss has happened.
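For item 2, the Patroni DCS settings we are asking flux-pg-cluster to expose would look
roughly like this (key names per the Patroni documentation; whether they can be passed
through Flux's config surface is exactly the question):

```yaml
synchronous_mode: quorum      # or: on
synchronous_node_count: 1     # replicas that must confirm each commit
# Optional: refuse to degrade to async when no sync replica is available
# synchronous_mode_strict: true
```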

(Hard reinstall data loss, restore UI and leader discovery have already been discussed with
@ali via Discord. This issue focuses narrowly on timeline switching frequency and
async-replication risk.)
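One possible building block for the alert in item 3: after a failover, the new timeline's
.history file records the LSN at which it branched off its parent. Comparing that
switchpoint against the last LSN the old primary was known to have reached (e.g. from
monitoring) bounds how much acknowledged WAL never made it. A minimal sketch; function
names and the sample values are mine, but the history-file format (parent TLI, switchpoint
LSN, reason) is PostgreSQL's:

```python
def parse_lsn(lsn: str) -> int:
    """Convert a PostgreSQL LSN like '0/5000098' into an integer byte position."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def lost_bytes(history: str, old_timeline: int, old_primary_lsn: str) -> int:
    """Bytes of WAL the old primary had beyond the switchpoint of its timeline.

    history: contents of the new timeline's .history file.
    """
    for line in history.strip().splitlines():
        parts = line.split()
        if len(parts) >= 2 and int(parts[0]) == old_timeline:
            switchpoint = parse_lsn(parts[1])
            return max(0, parse_lsn(old_primary_lsn) - switchpoint)
    raise ValueError(f"timeline {old_timeline} not found in history file")

# Made-up example: the old primary on timeline 11 had reached 0/70001A0,
# but timeline 12 branched at 0/7000150 -> 0x50 bytes were never replicated.
history = "11\t0/7000150\tno recovery target specified\n"
print(lost_bytes(history, 11, "0/70001A0"))  # -> 80
```

A nonzero result after a timeline switch is precisely the "transactions on previous
timeline lost in failover" condition we would like to alert on.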

Environment

flux-pg-cluster 3-node, Custom tier (2 vCPU, 6GB RAM, 80GB SSD per node), Europe region.
Cluster age about 3 weeks. Timeline progression: 1 → 12.

Hardware is well above the resource-exhaustion bar; each node has plenty of headroom. The
failover frequency is therefore more likely caused by network instability between Flux DCS
nodes or overly aggressive default Patroni timing thresholds than by node load.
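The timing knobs in question, with Patroni's upstream defaults (whether flux-pg-cluster
overrides them is unknown, hence point 1 of the ask):

```yaml
ttl: 30            # seconds the leader key lives in the DCS
loop_wait: 10      # seconds between HA loop iterations
retry_timeout: 10  # retry window for DCS and PostgreSQL operations
# A leader that cannot refresh its key within ttl is demoted, so flaky DCS
# connectivity converts directly into failovers.
```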

*report summarised with Claude.
