Commit 066bf25

Added runbook for existing alerts (#6)

This change adds pages to the troubleshooting runbook, which are referenced by NuoDB operator alerts.
---
title: "DatabaseComponentUnreadyReplicas"
description: "Database resource has a component with replicas which were declared to be unready"
summary: ""
date: 2025-06-05T13:52:09+03:00
lastmod: 2025-06-05T13:52:09+03:00
draft: false
weight: 100
toc: true
seo:
  title: "" # custom title (optional)
  description: "Database resource has a component with replicas which were declared to be unready" # custom description (recommended)
  canonical: "" # custom canonical URL (optional)
  noindex: false # false (default) or true
---

## Meaning

Database component has unready replicas.

{{< details "Full context" open >}}
Database resource has a component with replicas which were declared to be unready.
Database components impacted by this alert are Transaction Engines (TEs) and Storage Managers (SMs).
For example, a database is expected to have 2 TE replicas, but has fewer than that for a noticeable period of time.

On rare occasions, there may be more replicas than requested and the system did not clean them up.
{{< /details >}}

### Symptom

To manually evaluate the conditions for this alert, follow the steps below.

A database that has a component with unready replicas will have the `Ready` status condition set to `False`.
List all unready databases.

```sh
JSONPATH='{range .items[*]}{@.metadata.name}:{range @.status.conditions[?(@.type=="Ready")]}{@.type}={@.status}{"\n"}{end}{end}'
kubectl get database -o jsonpath="$JSONPATH" | grep "Ready=False"
```

Inspect the database component status and compare the `replicas` and `readyReplicas` fields.

```sh
kubectl get database <name> -o jsonpath='{.status.components}' | jq
```
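
To surface only the components whose replica counts disagree, the raw output can be filtered with `jq`. This is a sketch; the `storageManagers` and `transactionEngines` field names follow the `status.components` structure shown in the Example section of this page.

```sh
# Keep only components where replicas and readyReplicas differ
kubectl get database <name> -o jsonpath='{.status.components}' \
  | jq '[.storageManagers[]?, .transactionEngines[]?] | map(select(.replicas != .readyReplicas))'
```

An empty array means every component has the declared number of ready replicas.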

## Impact

Service degradation or unavailability.

A NuoDB database is fault-tolerant and remains available even if a certain number of database processes are down.
Depending on the database configuration, however, this might have an impact on the availability of certain data partitions (storage groups) or client applications using custom load-balancing rules.

## Diagnosis

- Check the database state using `kubectl describe database <name>`.
- Check the database component state and message.
- Check how many replicas are declared for this component.
- List and check the status of all pods associated with the database's Helm release.
- Check if there are issues with provisioning or attaching disks to pods.
- Check if the cluster-autoscaler is able to create new nodes.
- Check pod logs and identify issues during database process startup.
- Check the NuoDB process state.
  Kubernetes readiness probes require that the database processes are in `MONITORED:RUNNING` state.
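
The checklist above can be walked through with commands along these lines (a sketch; the database name `acme-db`, the pod name, and the `engine` container name are placeholders):

```sh
# Database state, component conditions, and messages
kubectl describe database acme-db

# Pods associated with the database's Helm release
RELEASE_NAME=$(kubectl get database acme-db -o jsonpath='{.spec.template.releaseName}')
kubectl get pods -l release=$RELEASE_NAME

# Recent warning events often reveal disk provisioning/attach or scheduling problems
kubectl get events --field-selector type=Warning --sort-by=.lastTimestamp

# Logs from a suspect pod (container name assumed to be "engine")
kubectl logs sm-acme-db-0 -c engine --tail=100
```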

### Scenarios

{{< details "Scenario 1: Pod in `Pending` status for a long time" >}}

Possible causes for a Pod not being scheduled:

- A container on the Pod requests a resource not available in the cluster
- The Pod has affinity rules that do not match any available worker node
- One of the containers mounts a volume provisioned in an availability zone (AZ) where no Kubernetes worker is available
- A Persistent volume claim (PVC) created for this Pod has a storage class that may be misconfigured or unusable
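
The scheduler records the concrete reason in the Pod's events, and PVC status reveals provisioning problems (a sketch; the pod name is a placeholder):

```sh
# Scheduling failures appear in the Pod's events
kubectl describe pod te-acme-db-5cd8b5f7c4-qnplm | grep -A10 'Events:'

# Check PVC status and storage class in the database namespace
kubectl get pvc
```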

{{< /details >}}

{{< details "Scenario 2: Pod in `CreateContainerConfigError` status for a long time" >}}

Possible causes for a container not being created:

- The container depends on a resource that does not exist yet (e.g. ConfigMap or Secret)
- NuoDB Control Plane external operator did not populate the database connection details yet

{{< /details >}}

{{< details "Scenario 3: Database process fails to join the domain" >}}

Upon startup, the main _engine_ container process communicates with the NuoDB Admin to register the database process with the domain and start it using the NuoDB binary.

Possible causes for unsuccessful startup during this phase are:

- Network issues prevent communication between the container entrypoint client scripts and NuoDB Admin REST API
- The NuoDB Admin layer is not available or has no Raft leader
- No Raft quorum in the NuoDB Admin prevents committing new Raft commands
- AP with ordinal 0 formed a separate domain. In case of catastrophic loss of the `admin-0` container (i.e. its durable domain state `raftlog` file is lost), it might form a new domain causing a split-brain scenario. For more information, see [Setting _bootstrapServers_ Helm value](https://github.com/nuodb/nuodb-helm-charts/blob/v3.10.0/stable/admin/values.yaml#L106).
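
Admin-layer availability and Raft leadership can be checked from inside an admin pod (a sketch; `admin-0` is a placeholder pod name):

```sh
# Lists the admin processes in the domain and their connectivity/Raft state
kubectl exec -ti admin-0 -- nuocmd show servers
```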

{{< /details >}}

{{< details "Scenario 4: Database process fails to join the database" >}}

Once started, a database process communicates with the rest of the database and executes an entry protocol.

Possible causes for unsuccessful startup during this phase are:

- Network issues prevent communication between NuoDB database processes
- No suitable entry node is available
- The database process binary version is too old

{{< /details >}}

{{< details "Scenario 5: An SM in `TRACKED` state for a long time" >}}

The database state might be `AWAITING_ARCHIVE_HISTORIES_MSG` indicating that the database leader assignment is in progress.
NuoDB Admin must collect archive history information from all provisioned archives on database cold start.
This requires all SM processes to start and connect to the NuoDB Admin within the configured timeout period.

Possible causes for unsuccessful leader assignment:

- Not all SMs have been scheduled by Kubernetes or not all SM processes have started
- Some of the SM pods are in `CrashLoopBackOff` state with long back-off
- There is a _defunct_ archive metadata provisioned in the domain which is not served by an actual SM
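
Provisioned archives, and whether a running SM currently serves each one, can be listed from inside an admin pod (a sketch; `admin-0` is a placeholder pod name):

```sh
# An archive entry with no associated running SM process suggests defunct archive metadata
kubectl exec -ti admin-0 -- nuocmd show archives
```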

{{< /details >}}

{{< details "Scenario 6: A TE in `TRACKED` state for a long time" >}}

A TE process joins the database via an entry node, which is normally the first SM that goes to `RUNNING` state.
NuoDB Admin performs synchronization tasks so that TEs are started after the entry node is available.

Possible causes for missing entry node:

- Database leader assignment is not performed after cold start. See _Scenario 5_
- The `UNPARTITIONED` storage group is not in `RUNNING` state

{{< /details >}}

{{< details "Scenario 7: SM in `CONFIGURED:RECOVERING_JOURNAL` state for a long time" >}}

Upon startup, SM processes perform a journal recovery.
This may be time consuming if there are many journal entries to recover.
The SM process reports the progress of the journal recovery, which is displayed in the `nuocmd show domain` output.

Possible causes for slow journal recovery:

- High latency of the archive disk caused by reaching the IOPS limit

{{< /details >}}

### Example

Get the database name and its namespace from the alert's labels.
Inspect the database state in the Kubernetes cluster.

```sh
kubectl get database acme-messaging-demo -n nuodb-cp-system
```

Notice that the `READY` status condition is `False`, which means that the database is in a degraded state.

```text
NAME                  TIER       VERSION   READY   SYNCED   DISABLED   AGE
acme-messaging-demo   n0.small   6.0.2     False   True     False      46h
```

Inspect the database components state.

```sh
kubectl get database acme-messaging-demo -o jsonpath='{.status.components}' | jq
```

The output below indicates issues with scheduling the `te-acme-messaging-demo-zfb77wc-5cd8b5f7c4-qnplm` Pod because of insufficient memory on the cluster.
The mismatch between `replicas` and `readyReplicas` for this component triggers this alert.

```json
{
  "lastUpdateTime": "2025-06-06T13:08:19Z",
  "storageManagers": [
    {
      "kind": "StatefulSet",
      "name": "sm-acme-messaging-demo-zfb77wc",
      "readyReplicas": 2,
      "replicas": 2,
      "state": "Ready",
      "version": "v1"
    }
  ],
  "transactionEngines": [
    {
      "kind": "Deployment",
      "message": "there is an active rollout for deployment/te-acme-messaging-demo-zfb77wc; pod/te-acme-messaging-demo-zfb77wc-5cd8b5f7c4-qnplm: 0/1 nodes are available: 1 Insufficient memory. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.",
      "name": "te-acme-messaging-demo-zfb77wc",
      "readyReplicas": 5,
      "replicas": 6,
      "state": "Updating",
      "version": "v1"
    }
  ]
}
```

If needed, drill down to the Pod resources associated with the database by using the command below.

```sh
RELEASE_NAME=$(kubectl get database acme-messaging-demo -o jsonpath='{.spec.template.releaseName}')
kubectl get pods -l release=$RELEASE_NAME
```

Obtain the NuoDB domain state by running [nuocmd show domain](https://doc.nuodb.com/nuodb/latest/reference-information/command-line-tools/nuodb-command/nuocmd-reference/#show-domain) and [nuocmd show database](https://doc.nuodb.com/nuodb/latest/reference-information/command-line-tools/nuodb-command/nuocmd-reference/#show-database) inside any NuoDB pod that has `Running` status.

```sh
SM_POD=$(kubectl get pod \
  -l release=${RELEASE_NAME},component=sm \
  --field-selector=status.phase==Running \
  -o jsonpath='{.items[0].metadata.name}')

kubectl exec -ti $SM_POD -- nuocmd show domain
```