Commit 6692045

Add more info re: WAL failover probe file, logging
Fixes DOC-14573
1 parent c400a46 commit 6692045

2 files changed: 8 additions, 0 deletions

src/current/_includes/v25.3/wal-failover-intro.md

Lines changed: 2 additions & 0 deletions
@@ -6,4 +6,6 @@ When WAL failover is enabled, CockroachDB will take the the following actions:

 - At node startup, each store is assigned another store to be its failover destination.
 - CockroachDB will begin monitoring the latency of all WAL writes. If latency to the WAL exceeds the value of the [cluster setting `storage.wal_failover.unhealthy_op_threshold`]({% link {{page.version.version}}/cluster-settings.md %}#setting-storage-wal-failover-unhealthy-op-threshold), the node will attempt to write WAL entries to a secondary store's volume.
+- While writing the WAL to the secondary store's volume, the node continuously checks whether the primary store's volume has recovered yet. To determine when it is safe to switch back, CockroachDB creates a small *probe file* on the primary store and periodically `fsync`s it. This file is an internal health‑check artifact created only when WAL failover is enabled; it contains no user data.
+- If CockroachDB cannot write the probe file to the primary store during the interval defined by [`COCKROACH_ENGINE_MAX_SYNC_DURATION_DEFAULT`]({% link {{ page.version.version }}/wal-failover.md %}#important-environment-variables), it emits a log line like the following: `disk stall detected: sync on file probe-file has been ongoing for 40.0s`
 - CockroachDB will update the [store status endpoint]({% link {{ page.version.version }}/monitoring-and-alerting.md %}#store-status-endpoint) at `/_status/stores` so you can monitor the store's status.
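
The probe-file check described in the added lines can be pictured with a small sketch. This is only an illustration of the general write-then-`fsync` timing idea and is not CockroachDB's implementation. The function names, the `threshold_s` parameter, and the file contents are hypothetical; the file name `probe-file` is borrowed from the sample log line.

```python
import os
import time


def probe_sync_latency(directory: str, name: str = "probe-file") -> float:
    """Write a few bytes to a probe file, fsync it, and return the elapsed seconds.

    Hypothetical helper: it only illustrates timing a write followed by fsync.
    """
    path = os.path.join(directory, name)
    start = time.monotonic()
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, b"probe")  # no user data, just a marker payload
        os.fsync(fd)            # force the write down to the storage device
    finally:
        os.close(fd)
    return time.monotonic() - start


def is_healthy(latency_s: float, threshold_s: float) -> bool:
    """Treat the volume as healthy when the probe finished under the threshold."""
    return latency_s < threshold_s
```

Because `os.fsync` blocks until the sync completes, this sketch can only report latency after the fact. Reporting a sync that "has been ongoing" for some time, as in the log line, would need a separate watchdog timer, which is omitted here.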

src/current/_includes/v25.3/wal-failover-metrics.md

Lines changed: 6 additions & 0 deletions
@@ -10,3 +10,9 @@ You can access these metrics via the following methods:

 - The [**Custom Chart** debug page]({% link {{ page.version.version }}/ui-custom-chart-debug-page.md %}) in [DB Console]({% link {{ page.version.version }}/ui-custom-chart-debug-page.md %}).
 - By [monitoring CockroachDB with Prometheus]({% link {{ page.version.version }}/monitor-cockroachdb-with-prometheus.md %}).
+
+In addition to metrics, logs help identify disk stalls during WAL failover. The following message indicates a disk stall on the primary store's volume:
+
+~~~
+disk stall detected: sync on file probe-file has been ongoing for 40.0s
+~~~
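
For readers who want to alert on this message, a minimal sketch of matching it in log text might look like the following. The regular expression is derived only from the single sample line above, so it is an assumption that real log lines keep this exact wording. The name `parse_disk_stall` is hypothetical.

```python
import re
from typing import Optional, Tuple

# Pattern inferred from the one sample message; the real format may vary.
STALL_RE = re.compile(
    r"disk stall detected: sync on file (?P<file>\S+) "
    r"has been ongoing for (?P<seconds>\d+(?:\.\d+)?)s"
)


def parse_disk_stall(line: str) -> Optional[Tuple[str, float]]:
    """Return (file name, stall duration in seconds), or None if the line does not match."""
    match = STALL_RE.search(line)
    if match is None:
        return None
    return match.group("file"), float(match.group("seconds"))
```

The function uses `search` rather than `match` so that any timestamp or severity prefix ahead of the message is ignored.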
