Kubernetes Observability Stack — Prometheus + Grafana + Alertmanager
A production-grade observability stack for EKS, deployed via Helm and managed through GitOps. It covers metrics collection (Prometheus), visualization (Grafana), and alerting (Alertmanager). Built for multi-tenant clusters, with per-namespace dashboards and team-scoped alert routing.
┌──────────────────────────────────────────────────────────────────────┐
│                             EKS Cluster                              │
│                                                                      │
│  ┌────────────────────────────────────────────────────────────────┐  │
│  │                      monitoring namespace                      │  │
│  │                                                                │  │
│  │ ┌───────────────┐   scrapes   ┌──────────────────────────────┐ │  │
│  │ │  Prometheus   │ ──────────► │ ServiceMonitor / PodMonitor  │ │  │
│  │ │   (metrics)   │             │   (per-namespace targets)    │ │  │
│  │ └───────┬───────┘             └──────────────────────────────┘ │  │
│  │         │ evaluates                                            │  │
│  │         ▼                                                      │  │
│  │ ┌───────────────┐    fires    ┌──────────────────────────────┐ │  │
│  │ │ PrometheusRule│ ──────────► │         Alertmanager         │ │  │
│  │ │ (alert rules) │             │  routes → Slack / PagerDuty  │ │  │
│  │ └───────────────┘             └──────────────────────────────┘ │  │
│  │                                                                │  │
│  │ ┌───────────────┐                                              │  │
│  │ │    Grafana    │ ◄── queries Prometheus datasource            │  │
│  │ │ (dashboards)  │     reads ConfigMaps for dashboard JSON      │  │
│  │ └───────────────┘                                              │  │
│  └────────────────────────────────────────────────────────────────┘  │
│                                                                      │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐            │
│  │  team-alpha  │    │  team-beta   │    │  team-gamma  │            │
│  │  (metrics)   │    │  (metrics)   │    │  (metrics)   │            │
│  └──────────────┘    └──────────────┘    └──────────────┘            │
└──────────────────────────────────────────────────────────────────────┘
                                   │
                    ┌──────────────┼──────────────┐
                    ▼              ▼              ▼
                  Slack        PagerDuty     CloudWatch
                (#alerts)      (on-call)    (audit trail)
kube-prometheus-stack — battle-tested Helm chart bundling Prometheus Operator, Grafana, Alertmanager, node-exporter, kube-state-metrics
ServiceMonitor/PodMonitor — per-team scrape configs using label selectors, no manual Prometheus config edits
PrometheusRules — alert rules for node health, pod crash loops, quota exhaustion, and API server latency
Grafana dashboards as code — dashboards stored as ConfigMaps, provisioned automatically at startup
Multi-tenant alert routing — Alertmanager routes alerts to team-specific Slack channels based on namespace label
Persistent storage — Prometheus data on EBS gp3, Grafana config on EBS gp3
IRSA — Prometheus uses IAM role via IRSA for CloudWatch remote write (no static credentials)
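The storage and IRSA items above map to a handful of keys in the kube-prometheus-stack values file. The fragment below is a minimal sketch, not the repository's actual `helm-values.yaml`: the volume sizes, the retention period, the IAM role ARN, and the existence of a StorageClass named `gp3` are all assumptions to adjust for your cluster.

```yaml
# Sketch of prometheus/config/helm-values.yaml (all values are illustrative)
prometheus:
  serviceAccount:
    create: true
    annotations:
      # Hypothetical role ARN; IRSA maps this service account to the IAM role
      eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/example-prometheus-role
  prometheusSpec:
    retention: 15d                    # assumed retention period
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: gp3       # assumes a gp3 StorageClass already exists
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 50Gi           # assumed size

grafana:
  persistence:
    enabled: true
    storageClassName: gp3             # same assumption as above
    size: 10Gi                        # assumed size
```

Annotating the service account keeps AWS credentials out of the values file and out of Kubernetes Secrets, which is the point of using IRSA.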
.
├── prometheus/
│ ├── config/ # kube-prometheus-stack Helm values
│ └── rules/ # PrometheusRule manifests (alert rules)
├── grafana/
│ ├── dashboards/ # Dashboard JSON stored as ConfigMaps
│ └── datasources/ # Grafana datasource config
├── alertmanager/ # Alertmanager routing config
└── manifests/ # ServiceMonitor and PodMonitor examples
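As a rough illustration of what lives under `manifests/`, a per-team ServiceMonitor might look like the sketch below. The application name, Service labels, and port name are hypothetical. The `release` label assumes the chart's default behavior of selecting only ServiceMonitors labeled with the Helm release name.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app                   # hypothetical name
  namespace: team-alpha
  labels:
    release: kube-prometheus-stack    # matches the Helm release name used in this README
spec:
  selector:
    matchLabels:
      app: example-app                # hypothetical label on the team's Service
  endpoints:
    - port: metrics                   # name of the Service port, assumed to be "metrics"
      path: /metrics
      interval: 30s
```

Because the Operator discovers targets from these objects, a team can add or remove scrape targets by applying manifests in its own namespace without touching the Prometheus configuration.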
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
kubectl create namespace monitoring
helm upgrade --install kube-prometheus-stack \
  prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --values prometheus/config/helm-values.yaml \
  --wait
# Apply custom alert rules
kubectl apply -f prometheus/rules/
# Apply dashboards
kubectl apply -f grafana/dashboards/
# Apply ServiceMonitors
kubectl apply -f manifests/
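For orientation, a rule file under `prometheus/rules/` could take roughly the following shape. This is a sketch of a pod crash-loop alert, not one of the repository's actual rules: the rule name, threshold, and durations are assumptions. The metric `kube_pod_container_status_restarts_total` is exported by kube-state-metrics, which the chart bundles.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: pod-health-rules              # hypothetical name
  namespace: monitoring
  labels:
    release: kube-prometheus-stack    # assumes the chart's default rule selector
spec:
  groups:
    - name: pod-health
      rules:
        - alert: PodCrashLooping
          # Fires when a container restarted more than 3 times in 15 minutes (assumed threshold)
          expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is restarting repeatedly"
```

The alert carries the `namespace` label from the underlying metric, which is what lets Alertmanager route it to the owning team.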
kubectl port-forward svc/kube-prometheus-stack-grafana \
  -n monitoring 3000:80
# Default username: admin
# Read the admin password from the chart's secret (change it immediately):
kubectl get secret kube-prometheus-stack-grafana \
  -n monitoring -o jsonpath="{.data.admin-password}" | base64 -d
Alerts are routed to teams based on the namespace label:
| Namespace  | Slack Channel      | PagerDuty       |
|------------|--------------------|-----------------|
| team-alpha | #team-alpha-alerts | team-alpha-pd   |
| team-beta  | #team-beta-alerts  | team-beta-pd    |
| team-gamma | #team-gamma-alerts | team-gamma-pd   |
| platform/* | #platform-alerts   | platform-sre-pd |
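The routing in the table can be approximated with an Alertmanager route tree that matches on the `namespace` label. The fragment below is a sketch under assumptions: receiver names and the webhook file path are hypothetical, only the Slack side is shown, and the platform channel is used as the fallback receiver rather than matching `platform/*` explicitly.

```yaml
route:
  receiver: platform-slack            # fallback when no team route matches
  group_by: [namespace, alertname]
  routes:
    - matchers:
        - namespace="team-alpha"
      receiver: team-alpha-slack
    - matchers:
        - namespace="team-beta"
      receiver: team-beta-slack
    - matchers:
        - namespace="team-gamma"
      receiver: team-gamma-slack

receivers:
  - name: platform-slack
    slack_configs:
      - channel: "#platform-alerts"
        api_url_file: /etc/alertmanager/secrets/slack/webhook-url   # hypothetical path
  - name: team-alpha-slack
    slack_configs:
      - channel: "#team-alpha-alerts"
        api_url_file: /etc/alertmanager/secrets/slack/webhook-url
  - name: team-beta-slack
    slack_configs:
      - channel: "#team-beta-alerts"
        api_url_file: /etc/alertmanager/secrets/slack/webhook-url
  - name: team-gamma-slack
    slack_configs:
      - channel: "#team-gamma-alerts"
        api_url_file: /etc/alertmanager/secrets/slack/webhook-url
```

PagerDuty receivers would follow the same pattern with `pagerduty_configs`, typically on a child route restricted to higher-severity alerts.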
| Dashboard                | What It Shows                                   |
|--------------------------|-------------------------------------------------|
| EKS Cluster Overview     | Node CPU/memory, pod count, API server latency  |
| Namespace Resource Usage | Per-team CPU/memory vs quota, pod saturation    |
| Pod Health               | Restart counts, OOMKill events, pending pods    |
| Kubernetes API Server    | Request rates, error rates, etcd latency        |
| Node Exporter            | Disk I/O, network throughput, filesystem usage  |
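Each dashboard is shipped as a ConfigMap that the Grafana sidecar loads. A stripped-down sketch is shown below. The ConfigMap name, dashboard `uid`, and empty panel list are placeholders, and the `grafana_dashboard: "1"` label assumes the chart's default sidecar settings.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: namespace-resource-usage-dashboard   # hypothetical name
  namespace: monitoring
  labels:
    grafana_dashboard: "1"                   # default label the dashboard sidecar watches for
data:
  namespace-resource-usage.json: |
    {
      "title": "Namespace Resource Usage",
      "uid": "namespace-resource-usage",
      "panels": []
    }
```

In practice the JSON body would be an export from the Grafana UI. Keeping it in a ConfigMap means dashboard changes go through the same review and GitOps flow as the rest of the stack.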