Commit 1739c6c

Add kubestack troubleshooting (#2495)
Fixes elastic/opentelemetry#394
Fixes elastic/opentelemetry-dev#978

---------

Co-authored-by: Aleksandra Spilkowska <[email protected]>
1 parent eedc28d commit 1739c6c

2 files changed: +121 -0 lines changed
Lines changed: 120 additions & 0 deletions
@@ -0,0 +1,120 @@
1+
---
2+
navigation_title: Insufficient resources with Kube-Stack chart
3+
description: Learn what to do when the Kube-Stack chart is deployed with insufficient resources.
4+
applies_to:
5+
  stack:
6+
  serverless:
7+
    observability:
8+
  product:
9+
    edot_collector: ga
10+
products:
11+
  - id: cloud-serverless
12+
  - id: observability
13+
  - id: edot-collector
14+
---
15+
16+
# Insufficient resources issue with Kube-Stack Helm Chart
17+
18+
The OpenTelemetry Kube-Stack Helm Chart deploys multiple EDOT collectors with varying configurations based on the selected architecture and deployment mode. On larger clusters, the default Kubernetes resource limits might be insufficient.
19+
20+
## Symptoms
21+
22+
These symptoms are common when the Kube-Stack chart is deployed with insufficient resources:
23+
24+
- Collector Pods in a `CrashLoopBackOff`/`OOMKilled` state.
25+
- Cluster or Daemon Pods are unable to export data to the Gateway collector due to being `OOMKilled` (high memory usage).
26+
- Pods have logs similar to: `error internal/queue_sender.go:128 Exporting failed. Dropping data.`
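To confirm the last symptom on a specific Pod, you can count how often the export failure message appears in its logs. The following is a minimal sketch, not part of the chart: the helper names `count_export_failures` and `export_failures_for_pod` are hypothetical, and the wrapper is only defined here, not called.

```bash
# Hypothetical helper: count log lines that contain the export failure message.
# Reads Collector logs on stdin and prints the number of matching lines.
count_export_failures() {
  # grep -c exits non-zero when nothing matches, so "|| true" keeps the
  # helper usable in scripts that run with "set -e".
  grep -c 'Exporting failed' || true
}

# Hypothetical wrapper: $1 is the Pod name, $2 is the namespace.
export_failures_for_pod() {
  kubectl logs "$1" -n "$2" | count_export_failures
}
```

For example, `export_failures_for_pod <pod-name> opentelemetry-operator-system` prints the number of dropped-export messages for that Pod. A count that keeps growing suggests that the receiving Collector is still unavailable.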
27+
28+
## Resolution
29+
30+
Follow these steps to resolve the issue.
31+
32+
:::::{stepper}
33+
34+
::::{step} Check for OOMKilled Pods
35+
Run the following command to check the Pods:
36+
37+
```bash
38+
kubectl get pods -n opentelemetry-operator-system
39+
```
40+
41+
Look for any Pods in the `OOMKilled` state:
42+
43+
```
44+
NAME                                                              READY   STATUS             RESTARTS      AGE
45+
opentelemetry-kube-stack-cluster-stats-collector-7cd88c77drvj76   1/1     Running            0             49s
46+
opentelemetry-kube-stack-daemon-collector-pn4qj                   1/1     Running            0             47s
47+
opentelemetry-kube-stack-gateway-collector-85795c7965-wxqls       0/1     OOMKilled          3 (34s ago)   49s
48+
opentelemetry-kube-stack-gateway-collector-8cfdb59df-lgpbr        0/1     OOMKilled          3 (30s ago)   49s
49+
opentelemetry-kube-stack-gateway-collector-8cfdb59df-s7plz        0/1     CrashLoopBackOff   2 (17s ago)   34s
50+
opentelemetry-kube-stack-opentelemetry-operator-77d46bc4dbv2h6k   2/2     Running            0             3m14s
51+
```
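On clusters with many Pods, it can help to filter the listing down to the affected ones. This is a minimal sketch that assumes the default `kubectl get pods` column layout, where `STATUS` is the third column. The helper name `unhealthy_pods` is hypothetical.

```bash
# Hypothetical helper: print the names of Pods whose STATUS column is
# OOMKilled or CrashLoopBackOff. Reads `kubectl get pods` output on stdin.
unhealthy_pods() {
  # NR > 1 skips the header row; $1 is NAME and $3 is STATUS.
  awk 'NR > 1 && ($3 == "OOMKilled" || $3 == "CrashLoopBackOff") { print $1 }'
}
```

Against a live cluster, pipe the listing into the helper: `kubectl get pods -n opentelemetry-operator-system | unhealthy_pods`.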
52+
::::
53+
54+
::::{step} Verify the Pod last status
55+
56+
Run the following command to verify the last status of the Pod. In the output, check the `State` and `Last State` fields:
57+
58+
```bash
59+
kubectl describe pod -n opentelemetry-operator-system opentelemetry-kube-stack-gateway-collector-85795c7965-wxqls
60+
61+
    State:          Waiting
62+
      Reason:       CrashLoopBackOff
63+
    Last State:     Terminated
64+
      Reason:       OOMKilled
65+
      Exit Code:    137
66+
```
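Exit code 137 follows the common convention that a process ended by a signal reports 128 plus the signal number. 137 therefore corresponds to signal 9 (`SIGKILL`), which is what a container receives when it is killed for running out of memory. The sketch below decodes the exit code and shows one way to read the termination reason directly from the Pod status. Both function names are hypothetical, and the `kubectl` wrapper is only defined, not called.

```bash
# Hypothetical helper: map a container exit code to the signal that ended
# the process. Prints 0 when the exit code does not encode a signal.
signal_from_exit_code() {
  code=$1
  if [ "$code" -gt 128 ]; then
    echo $((code - 128))
  else
    echo 0
  fi
}

# Hypothetical wrapper: $1 is the Pod name, $2 is the namespace.
# Prints the reason recorded for the last terminated container state.
last_termination_reason() {
  kubectl get pod "$1" -n "$2" \
    -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'
}

signal_from_exit_code 137   # prints 9, the number of SIGKILL
```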
67+
::::
68+
69+
::::{step} Increase resource limits
70+
71+
Edit the `values.yaml` file used to deploy the corresponding Helm release. For the Gateway collector, ensure horizontal Pod autoscaling is turned on. The Gateway collector configuration should be similar to this:
72+
73+
```yaml
74+
gateway:
75+
  fullnameOverride: "opentelemetry-kube-stack-gateway"
76+
  suffix: gateway
77+
  replicas: 2
78+
  autoscaler:
79+
    minReplicas: 2 # Start with at least 2 replicas for better availability.
80+
    maxReplicas: 5 # Allow more scale-out if needed.
81+
    targetCPUUtilization: 70 # Scale when CPU usage exceeds 70%.
82+
    targetMemoryUtilization: 75 # Scale when memory usage exceeds 75%.
83+
```
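To get a feel for how these settings interact, the following sketch applies the scaling rule documented for the Kubernetes Horizontal Pod Autoscaler, `desiredReplicas = ceil(currentReplicas * currentUtilization / targetUtilization)`, and clamps the result to the configured minimum and maximum. It assumes that the chart's autoscaler settings end up in a standard Horizontal Pod Autoscaler. It ignores the autoscaler's tolerance and stabilization behavior, and the function name `desired_replicas` is hypothetical.

```bash
# Hypothetical helper: estimate the replica count the autoscaler aims for.
# Arguments: current replicas, current utilization (%), target utilization (%),
# minReplicas, maxReplicas.
desired_replicas() {
  current=$1; usage=$2; target=$3; min=$4; max=$5
  # Integer ceiling of (current * usage / target).
  desired=$(( (current * usage + target - 1) / target ))
  if [ "$desired" -lt "$min" ]; then desired=$min; fi
  if [ "$desired" -gt "$max" ]; then desired=$max; fi
  echo "$desired"
}

# 2 replicas at 90% memory utilization with a 75% target, limits 2 to 5.
desired_replicas 2 90 75 2 5   # prints 3
```

With the values above, sustained utilization far above the target is capped at `maxReplicas`. If the Pods are still `OOMKilled` at five replicas, raising the memory limit is the remaining option.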
84+
85+
If the autoscaler is already configured, or another Collector type is running out of memory, increase the resource limits in the corresponding Collector configuration section:
86+
87+
```yaml
88+
gateway:
89+
  fullnameOverride: "opentelemetry-kube-stack-gateway"
90+
  ...
91+
  resources:
92+
    limits:
93+
      cpu: 500m
94+
      memory: 20Mi
95+
    requests:
96+
      cpu: 100m
97+
      memory: 10Mi
98+
```
99+
100+
Make sure to update the resource limits within the correct Collector type section. Available types are: `gateway`, `daemon`, `cluster`, and `opentelemetry-operator`.
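Before changing a value, it can be useful to see which memory limit a running Pod actually has and to express it in a single unit for comparison. This is a rough sketch with hypothetical function names. The converter only understands the `Mi` and `Gi` suffixes, and the `kubectl` wrapper is only defined, not called.

```bash
# Hypothetical helper: convert a Kubernetes memory quantity that ends in
# Mi or Gi to a plain number of MiB. Other suffixes are rejected.
to_mib() {
  case "$1" in
    *Gi) echo $(( ${1%Gi} * 1024 )) ;;
    *Mi) echo "${1%Mi}" ;;
    *)   echo "unsupported quantity: $1" >&2; return 1 ;;
  esac
}

# Hypothetical wrapper: $1 is the Pod name, $2 is the namespace.
# Prints the memory limit of each container in the Pod.
applied_memory_limit() {
  kubectl get pod "$1" -n "$2" \
    -o jsonpath='{.spec.containers[*].resources.limits.memory}'
}

to_mib 2Gi    # prints 2048
to_mib 20Mi   # prints 20
```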
101+
::::
102+
103+
::::{step} Update the Helm release
104+
105+
Run the following command to update the Helm release:
106+
107+
```bash
108+
helm upgrade opentelemetry-kube-stack open-telemetry/opentelemetry-kube-stack --namespace opentelemetry-operator-system --values values.yaml --version '0.6.3'
109+
```
110+
111+
:::{note}
112+
The hard memory limit should be around 2GB.
113+
:::
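After the upgrade, you can check whether the Collector Pods have settled. The sketch below assumes the default `kubectl get pods` column layout and treats any status other than `Running` as unhealthy, which is stricter than some clusters need. The helper name `all_pods_running` is hypothetical.

```bash
# Hypothetical helper: succeed only if every listed Pod has STATUS Running.
# Reads `kubectl get pods` output on stdin.
all_pods_running() {
  awk 'BEGIN { bad = 0 }
       NR > 1 && $3 != "Running" { bad = 1 }
       END { exit bad }'
}
```

For example, `kubectl get pods -n opentelemetry-operator-system | all_pods_running && echo "all Pods running"`. To confirm that the release picked up the edited values, `helm get values opentelemetry-kube-stack --namespace opentelemetry-operator-system` prints the user-supplied values for the release.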
114+
::::
115+
:::::
116+
117+
## Resources
118+
119+
* [OpenTelemetry Kube-Stack Helm chart](https://github.com/open-telemetry/opentelemetry-helm-charts/tree/main/charts/opentelemetry-kube-stack)
120+
* [Elastic Stack Kubernetes Helm charts](https://github.com/elastic/helm-charts)

troubleshoot/ingest/opentelemetry/toc.yml

Lines changed: 1 addition & 0 deletions
@@ -5,6 +5,7 @@ toc:
5 5
  - file: edot-collector/index.md
6 6
    children:
7 7
      - file: edot-collector/collector-oomkilled.md
8+
      - file: edot-collector/insufficient-resources-kubestack.md
8 9
      - file: edot-collector/metadata.md
9 10
      - file: edot-collector/enable-debug-logging.md
10 11
      - file: edot-collector/collector-not-starting.md