---
sidebar_id: "alertmanager"
sidebar_position: 3
---

# ⎈ A Hands-On Guide: Setting Up Prometheus and AlertManager in Kubernetes with Custom Alerts 🛠️

#### *⇢ Understanding Prometheus & AlertManager Setup in Kubernetes with Custom Rules: A Comprehensive Guide*

---

![img](./img/alertmanager.png.webp)

Monitoring your Kubernetes cluster is crucial for maintaining the health and performance of your applications. In this guide, we’ll walk through setting up Prometheus and Alertmanager using Helm and configuring custom alert rules to monitor your cluster effectively.
If you haven’t already, I recommend checking out my previous blog post on Kubernetes monitoring using Prometheus and Grafana for a comprehensive overview of setting up Prometheus and Grafana.

---
### Prerequisites

Before we start, ensure you have the following:

- A running Kubernetes cluster.
- Helm installed on your local machine.

![img](./img/custom-alerts.png.gif)

---
### Step 1: Install Prometheus and Alertmanager using Helm

We’ll use the kube-prometheus-stack Helm chart from the Prometheus community. This chart includes Prometheus, Alertmanager, and Grafana, along with several pre-configured dashboards and alerting rules.
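
If the prometheus-community chart repository has not been added to your Helm client yet, add it first (a quick sketch; the repository URL is the one published by the Prometheus community):

```bash
# Add the prometheus-community Helm repository and refresh the local chart index
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
```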

First, create a custom-values.yaml file to specify our custom configurations:

```yaml
# custom-values.yaml
prometheus:
  service:
    type: NodePort
grafana:
  service:
    type: NodePort
alertmanager:
  service:
    type: NodePort
```
Next, install the kube-prometheus-stack using Helm:

```bash
helm upgrade --install kube-prometheus-stack prometheus-community/kube-prometheus-stack -f custom-values.yaml
```
This command will deploy Prometheus, Alertmanager, and Grafana to your cluster with the services exposed as NodePort.
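
It may take a minute or two for everything to start; a quick way to confirm the rollout (output will vary with your cluster) is:

```bash
# All monitoring pods should eventually reach the Running state
kubectl get pods
# The Prometheus, Alertmanager and Grafana services should be of type NodePort
kubectl get svc
```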

![img](./img/alert-manager-architecture.png.webp)

### Step 2: Verifying the Setup
To verify that Prometheus and Alertmanager are running correctly, you can access their web UIs. Since we exposed their services as NodePort, you can either use kubectl port-forward to access them locally or use the external IP of a cluster node together with the NodePort of the respective service.
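
For example, the services can be forwarded to localhost as shown below (a minimal sketch; the service names assume the release name `kube-prometheus-stack` used above and should be confirmed with `kubectl get svc`):

```bash
# Forward the monitoring UIs to the local machine (run each in its own terminal, or background them)
kubectl port-forward svc/kube-prometheus-stack-prometheus 9090:9090
kubectl port-forward svc/kube-prometheus-stack-alertmanager 9093:9093
kubectl port-forward svc/kube-prometheus-stack-grafana 3000:80
# Prometheus:   http://localhost:9090
# Alertmanager: http://localhost:9093
# Grafana:      http://localhost:3000
```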

For Prometheus:

![img](./img/prometheus-ui.png.webp)


For Alertmanager:

![img](./img/alertmanager-ui.png.webp)

For Grafana:

![img](./img/grafana-dashboard.png.webp)

**Access the default alert rules:**
To view the default rules and alerts, navigate to the Alerts section in the Prometheus UI:


![img](./img/alerts-in-prometheus-ui.png.webp)


Here we can see that three alerts are in the Firing state. These firing alerts can then be viewed and managed in the Alertmanager UI:

![img](./img/alerts-fired.png.webp)

### Step 3: Configuring Custom Alert Rules
From the steps above we can see that the default alerts are already configured in Prometheus and Alertmanager. Now, let’s add custom alert rules to monitor our Kubernetes cluster. We’ll create a PrometheusRule manifest to define these alerts.

Create a file named custom-alert-rules.yaml with the following content:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    app: kube-prometheus-stack
    app.kubernetes.io/instance: kube-prometheus-stack
    release: kube-prometheus-stack
  name: kube-pod-not-ready
spec:
  groups:
  - name: my-pod-demo-rules
    rules:
    - alert: KubernetesPodNotHealthy
      expr: sum by (namespace, pod) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed"}) > 0
      for: 1m
      labels:
        severity: critical
      annotations:
        summary: Kubernetes Pod not healthy (instance {{ $labels.instance }})
        description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} has been in a non-running state for longer than 1 minute.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
    - alert: KubernetesDaemonsetRolloutStuck
      expr: kube_daemonset_status_number_ready / kube_daemonset_status_desired_number_scheduled * 100 < 100 or kube_daemonset_status_desired_number_scheduled - kube_daemonset_status_current_number_scheduled > 0
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: Kubernetes DaemonSet rollout stuck (instance {{ $labels.instance }})
        description: "Some Pods of DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} are not scheduled or not ready\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
    - alert: ContainerHighCpuUtilization
      expr: (sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (pod, container) / sum(container_spec_cpu_quota{container!=""}/container_spec_cpu_period{container!=""}) by (pod, container) * 100) > 80
      for: 2m
      labels:
        severity: warning
      annotations:
        summary: Container High CPU utilization (instance {{ $labels.instance }})
        description: "Container CPU utilization is above 80%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
    - alert: ContainerHighMemoryUsage
      expr: (sum(container_memory_working_set_bytes{name!=""}) BY (instance, name) / sum(container_spec_memory_limit_bytes > 0) BY (instance, name) * 100) > 80
      for: 2m
      labels:
        severity: warning
      annotations:
        summary: Container High Memory usage (instance {{ $labels.instance }})
        description: "Container Memory usage is above 80%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
    - alert: KubernetesContainerOomKiller
      expr: (kube_pod_container_status_restarts_total - kube_pod_container_status_restarts_total offset 10m >= 1) and ignoring (reason) min_over_time(kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}[10m]) == 1
      for: 0m
      labels:
        severity: warning
      annotations:
        summary: Kubernetes Container oom killer (instance {{ $labels.instance }})
        description: "Container {{ $labels.container }} in pod {{ $labels.namespace }}/{{ $labels.pod }} has been OOMKilled {{ $value }} times in the last 10 minutes.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
    - alert: KubernetesPodCrashLooping
      expr: increase(kube_pod_container_status_restarts_total[1m]) > 3
      for: 2m
      labels:
        severity: warning
      annotations:
        summary: Kubernetes pod crash looping (instance {{ $labels.instance }})
        description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
Apply the manifest to your Kubernetes cluster:

```bash
kubectl apply -f custom-alert-rules.yaml
```
Once the PrometheusRule is created, check the newly created alerts in the Prometheus UI.

![img](./img/promethues-rule.png.webp)

That’s it! We have successfully added our new custom alerts to Alertmanager.
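
If you prefer the command line, the same check can be performed against the Prometheus rules API (a small sketch, assuming the port-forward from Step 2 is still running on localhost:9090 and that `jq` is installed):

```bash
# List the loaded rule groups and confirm that "my-pod-demo-rules" is among them
curl -s http://localhost:9090/api/v1/rules | jq -r '.data.groups[].name' | grep my-pod-demo-rules
```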

### Step 4: Test the Custom Rules
To ensure our custom alert rules are working correctly, we’ll simulate a failure by creating a pod with an incorrect image tag. This will help us verify that the alerts are triggered and properly reported in Alertmanager. The KubernetesPodNotHealthy alert is responsible for reporting this failure.


1. **Create a Pod with an Invalid Image**

This will simulate a failure by using an incorrect image tag:

```bash
kubectl run nginx-pod --image=nginx:lates3
```
Note: The correct tag is `latest`, so `lates3` is intentionally incorrect to cause the pod to fail.

2. **Verify the Pod Status**

Check the status of the pod to confirm that it is failing:

```bash
kubectl get pods nginx-pod
NAME READY STATUS RESTARTS AGE
nginx-pod 0/1 ImagePullBackOff 0 5m35s
```

You should see the pod in an ImagePullBackOff or ErrImagePull state. You can also describe the pod for more details:

```bash
kubectl describe pod nginx-pod
```
This will provide information about why the pod is failing.

3. **Check for Alerts in Alertmanager**

Since you have set up custom alert rules, these should trigger an alert when the pod fails. Look for alerts related to pod failures. The custom alerts you configured should appear in the Alertmanager interface.
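
The active alerts can also be listed directly from the Alertmanager API (a hedged example, assuming the port-forward from Step 2 is running on localhost:9093 and that `jq` is installed):

```bash
# Show the names of all alerts currently held by Alertmanager
curl -s http://localhost:9093/api/v2/alerts | jq -r '.[].labels.alertname'
```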

![img](./img/alert-triggered-on-prometheus.png.webp)

![img](./img/alert-triggered-on-alertmanager.png.webp)

This process ensures that your custom alerting rules are working correctly and that you are notified when a pod fails.
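
Once the alert has been verified, the test pod can be removed so that the alert resolves:

```bash
# Delete the deliberately broken test pod
kubectl delete pod nginx-pod
```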

### Step 5: Understanding Custom Alert Rules
To better understand how to create and customize alert rules, let’s break down one of the alert rules defined in our custom-alert-rules.yaml. We'll use the KubernetesPodNotHealthy alert as an example:

```yaml
- alert: KubernetesPodNotHealthy
  expr: sum by (namespace, pod) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed"}) > 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: Kubernetes Pod not healthy (instance {{ $labels.instance }})
    description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} has been in a non-running state for longer than 1 minute.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
#### Alert Fields Breakdown

- **alert:** The name of the alert (`KubernetesPodNotHealthy`).
- **expr:** The Prometheus expression to evaluate. This alert triggers if any pod in a `Pending`, `Unknown`, or `Failed` state is detected.
- **for:** The duration for which the condition must remain true before the alert fires (`1m`, i.e. 1 minute).
- **labels:** Additional labels to categorize the alert. In this case, we label it with a severity of `critical`.
- **annotations:** Descriptive information about the alert. These fields provide context when the alert is triggered:
  - **summary:** A brief description of the alert (`Kubernetes Pod not healthy (instance {{ $labels.instance }})`).
  - **description:** A detailed description that includes dynamic values from the alert labels (`Pod {{ $labels.namespace }}/{{ $labels.pod }} has been in a non-running state for longer than 1 minute.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}`).

These fields help to provide clarity and context when an alert is triggered, making it easier to diagnose and respond to issues in your cluster.
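
To experiment with the `expr` before wiring it into a rule, the expression can be evaluated directly against the Prometheus HTTP API (a sketch, assuming the Step 2 port-forward on localhost:9090):

```bash
# Run the alert expression as an ad-hoc query; any returned series are pods that would fire the alert
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=sum by (namespace, pod) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed"}) > 0'
```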


For more examples of custom Prometheus alert rules, you can refer to the [Awesome Prometheus Alerts](https://github.com/samber/awesome-prometheus-alerts) repository.

### Step 6: Cleanup
If you want to remove Prometheus, Alertmanager, and Grafana from your Kubernetes cluster, you can do so with the following commands:

1. **Uninstall the Helm Chart:**
```bash
helm uninstall kube-prometheus-stack
```
2. **Verify Resources Are Deleted:**
Check that the Prometheus, AlertManager, and Grafana resources have been removed:
```bash
kubectl get all -l release=kube-prometheus-stack
```
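
Note that resources created outside the Helm release are not removed by `helm uninstall`. The custom PrometheusRule from Step 3 (and, depending on the chart version, the CRDs installed by kube-prometheus-stack) can be cleaned up separately, for example:

```bash
# Remove the custom alert rules created in Step 3
kubectl delete -f custom-alert-rules.yaml
```
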
## Conclusion

In this guide, we have successfully set up Prometheus and Alertmanager in a Kubernetes cluster using Helm and configured custom alert rules to monitor the cluster’s health. We also explored the components of an alert rule to better understand how they work. This setup provides a robust monitoring solution that can be further extended and customized to suit your needs. For more examples of custom Prometheus alert rules, you can refer to the [Awesome Prometheus Alerts](https://github.com/samber/awesome-prometheus-alerts) repository.
