vila-brunette/k8s-observability-stack

Kubernetes Observability Stack — Prometheus + Grafana + Alertmanager

A production-grade observability stack for Amazon EKS, deployed with Helm and managed through GitOps. It covers metrics collection (Prometheus), visualization (Grafana), and alerting (Alertmanager). Built for multi-tenant clusters, with per-namespace dashboards and team-scoped alert routing.

Architecture Overview

┌──────────────────────────────────────────────────────────────────────┐
│                          EKS Cluster                                 │
│                                                                      │
│  ┌────────────────────────────────────────────────────────────────┐  │
│  │                    monitoring namespace                        │  │
│  │                                                                │  │
│  │  ┌────────────────┐ scrapes  ┌──────────────────────────────┐  │  │
│  │  │   Prometheus   │ ───────► │ ServiceMonitor / PodMonitor  │  │  │
│  │  │   (metrics)    │          │ (per-namespace targets)      │  │  │
│  │  └───────┬────────┘          └──────────────────────────────┘  │  │
│  │          │ evaluates                                           │  │
│  │          ▼                                                     │  │
│  │  ┌────────────────┐  fires   ┌──────────────────────────────┐  │  │
│  │  │ PrometheusRule │ ───────► │ Alertmanager                 │  │  │
│  │  │ (alert rules)  │          │ routes → Slack / PagerDuty   │  │  │
│  │  └────────────────┘          └──────────────────────────────┘  │  │
│  │                                                                │  │
│  │  ┌────────────────┐                                            │  │
│  │  │    Grafana     │ ◄── queries Prometheus datasource          │  │
│  │  │  (dashboards)  │     reads ConfigMaps for dashboard JSON    │  │
│  │  └────────────────┘                                            │  │
│  └────────────────────────────────────────────────────────────────┘  │
│                                                                      │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐                │
│  │  team-alpha  │  │  team-beta   │  │  team-gamma  │                │
│  │  (metrics)   │  │  (metrics)   │  │  (metrics)   │                │
│  └──────────────┘  └──────────────┘  └──────────────┘                │
└──────────────────────────────────────────────────────────────────────┘
                              │
               ┌──────────────┼──────────────┐
               ▼              ▼              ▼
            Slack         PagerDuty      CloudWatch
         (#alerts)        (on-call)      (audit trail)

Features

  • kube-prometheus-stack: battle-tested Helm chart bundling Prometheus Operator, Grafana, Alertmanager, node-exporter, and kube-state-metrics
  • ServiceMonitor/PodMonitor: per-team scrape configs selected by label, with no manual edits to the Prometheus config
  • PrometheusRules: alert rules for node health, pod crash loops, quota exhaustion, and API server latency
  • Grafana dashboards as code: dashboards stored as ConfigMaps and provisioned automatically at startup
  • Multi-tenant alert routing: Alertmanager routes alerts to team-specific Slack channels based on the namespace label
  • Persistent storage: Prometheus data and Grafana config on EBS gp3 volumes
  • IRSA: Prometheus assumes an IAM role via IRSA (IAM Roles for Service Accounts) for CloudWatch remote write, with no static credentials
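To make the per-team scrape configuration concrete, here is a minimal ServiceMonitor sketch for one tenant. It is an illustration, not a file from this repository: the resource name, the `app.kubernetes.io/name` Service label, and the `metrics` port name are all hypothetical. The `release` label matters only if the chart's default selector is in effect, in which case Prometheus picks up only ServiceMonitors labeled with the Helm release name.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: team-alpha-app                  # hypothetical name
  namespace: team-alpha
  labels:
    release: kube-prometheus-stack      # Helm release name from the Quick Start
spec:
  # Select the Services to scrape by label (assumed label, adjust to your app)
  selector:
    matchLabels:
      app.kubernetes.io/name: example-app
  # Restrict target discovery to this team's namespace
  namespaceSelector:
    matchNames:
      - team-alpha
  endpoints:
    - port: metrics                     # name of the Service port (assumed)
      path: /metrics
      interval: 30s
```

Because targets are discovered through the Service's labels, a team can add or remove scrape targets by editing its own manifests, without touching the Prometheus configuration.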

Repository Structure

.
├── prometheus/
│   ├── config/           # kube-prometheus-stack Helm values
│   └── rules/            # PrometheusRule manifests (alert rules)
├── grafana/
│   ├── dashboards/       # Dashboard JSON stored as ConfigMaps
│   └── datasources/      # Grafana datasource config
├── alertmanager/         # Alertmanager routing config
└── manifests/            # ServiceMonitor and PodMonitor examples
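As a rough idea of what the Helm values under prometheus/config/ might contain, the sketch below wires up the gp3 storage and the IRSA annotation listed under Features. The values are assumptions rather than this repository's actual settings: the role ARN is a placeholder, the sizes and retention are arbitrary, and a StorageClass named `gp3` is assumed to exist in the cluster (for example via the EBS CSI driver).

```yaml
prometheus:
  serviceAccount:
    annotations:
      # Placeholder ARN: replace with the IAM role created for Prometheus
      eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/prometheus-remote-write
  prometheusSpec:
    retention: 15d                      # arbitrary example value
    # Let Prometheus select monitors and rules regardless of the release label
    serviceMonitorSelectorNilUsesHelmValues: false
    podMonitorSelectorNilUsesHelmValues: false
    ruleSelectorNilUsesHelmValues: false
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: gp3         # assumed StorageClass name
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 50Gi             # arbitrary example size

grafana:
  persistence:
    enabled: true
    storageClassName: gp3               # assumed StorageClass name
    size: 10Gi                          # arbitrary example size
```

Setting the three `...SelectorNilUsesHelmValues` flags to `false` is one way to let teams ship ServiceMonitors and PrometheusRules from their own namespaces without copying the release label onto each object.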

Quick Start

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

kubectl create namespace monitoring

helm upgrade --install kube-prometheus-stack \
  prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --values prometheus/config/helm-values.yaml \
  --wait

# Apply custom alert rules
kubectl apply -f prometheus/rules/

# Apply dashboards
kubectl apply -f grafana/dashboards/

# Apply ServiceMonitors
kubectl apply -f manifests/
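For orientation, a manifest in prometheus/rules/ could look like the following PrometheusRule. This is a sketch of a crash-loop alert, not a rule copied from the repository: the rule name, threshold, and durations are illustrative assumptions. It relies on `kube_pod_container_status_restarts_total`, which kube-state-metrics exports with `namespace` and `pod` labels.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: pod-crashloop-rules             # hypothetical name
  namespace: monitoring
  labels:
    release: kube-prometheus-stack      # needed only with the chart's default rule selector
spec:
  groups:
    - name: pod-health
      rules:
        - alert: PodCrashLooping
          # More than 3 container restarts within 15 minutes (illustrative threshold)
          expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is restarting frequently"
```

The resulting alert carries the `namespace` label of the affected pod, which Alertmanager can then use to route it to the owning team.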

Access Grafana

kubectl port-forward svc/kube-prometheus-stack-grafana \
  -n monitoring 3000:80

# Grafana is now reachable at http://localhost:3000
# Default username: admin (change the password immediately)
# Retrieve the admin password from a second terminal:
kubectl get secret kube-prometheus-stack-grafana \
  -n monitoring -o jsonpath="{.data.admin-password}" | base64 -d
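One way to avoid relying on the chart-generated password is to point the bundled Grafana chart at a Secret you manage yourself. The fragment below is a sketch for the Helm values file; the Secret name `grafana-admin-credentials` is hypothetical, and the Secret is assumed to already exist in the monitoring namespace with the two keys shown.

```yaml
grafana:
  admin:
    existingSecret: grafana-admin-credentials   # hypothetical Secret name
    userKey: admin-user                         # key holding the username
    passwordKey: admin-password                 # key holding the password
```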

Alert Routing

Alerts are routed to teams based on the namespace label:

Namespace     Slack Channel         PagerDuty
team-alpha    #team-alpha-alerts    team-alpha-pd
team-beta     #team-beta-alerts     team-beta-pd
team-gamma    #team-gamma-alerts    team-gamma-pd
platform/*    #platform-alerts      platform-sre-pd
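The table above could translate into an Alertmanager routing tree along these lines. This is a sketch under assumptions, not the repository's actual config: the receiver names, the Slack webhook file path, and the PagerDuty routing keys are placeholders, and treating the platform receiver as the fallback for everything that is not a team namespace is one interpretation of the `platform/*` row.

```yaml
route:
  receiver: platform                    # fallback for alerts not matched below
  group_by: ["namespace", "alertname"]
  routes:
    - matchers:
        - 'namespace = "team-alpha"'
      receiver: team-alpha
    - matchers:
        - 'namespace = "team-beta"'
      receiver: team-beta
    - matchers:
        - 'namespace = "team-gamma"'
      receiver: team-gamma

receivers:
  - name: team-alpha
    slack_configs:
      - channel: "#team-alpha-alerts"
        api_url_file: /etc/alertmanager/secrets/slack/webhook-url   # hypothetical mount path
    pagerduty_configs:
      - routing_key: "<team-alpha-pd integration key>"              # placeholder
  - name: team-beta
    slack_configs:
      - channel: "#team-beta-alerts"
        api_url_file: /etc/alertmanager/secrets/slack/webhook-url
    pagerduty_configs:
      - routing_key: "<team-beta-pd integration key>"
  - name: team-gamma
    slack_configs:
      - channel: "#team-gamma-alerts"
        api_url_file: /etc/alertmanager/secrets/slack/webhook-url
    pagerduty_configs:
      - routing_key: "<team-gamma-pd integration key>"
  - name: platform
    slack_configs:
      - channel: "#platform-alerts"
        api_url_file: /etc/alertmanager/secrets/slack/webhook-url
    pagerduty_configs:
      - routing_key: "<platform-sre-pd integration key>"
```

Routes are evaluated top to bottom and the first match wins, so an alert with `namespace="team-beta"` goes to the team-beta receiver and never reaches the fallback.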

Key Dashboards

Dashboard                   What It Shows
EKS Cluster Overview        Node CPU/memory, pod count, API server latency
Namespace Resource Usage    Per-team CPU/memory vs quota, pod saturation
Pod Health                  Restart counts, OOMKill events, pending pods
Kubernetes API Server       Request rates, error rates, etcd latency
Node Exporter               Disk I/O, network throughput, filesystem usage
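Dashboards stored as ConfigMaps typically look like the sketch below. It assumes the Grafana dashboard sidecar shipped with kube-prometheus-stack is enabled and watching for its default `grafana_dashboard` label. The ConfigMap name, dashboard `uid`, and the single panel are illustrative, not the repository's actual Pod Health dashboard.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: pod-health-dashboard            # hypothetical name
  namespace: monitoring
  labels:
    grafana_dashboard: "1"              # label the dashboard sidecar looks for by default
data:
  pod-health.json: |
    {
      "title": "Pod Health",
      "uid": "pod-health",
      "panels": [
        {
          "type": "timeseries",
          "title": "Container restarts (last hour)",
          "gridPos": {"h": 8, "w": 24, "x": 0, "y": 0},
          "targets": [
            {
              "expr": "sum by (namespace, pod) (increase(kube_pod_container_status_restarts_total[1h]))"
            }
          ]
        }
      ]
    }
```

Keeping the JSON in Git means dashboard changes go through the same review and rollout path as the rest of the stack.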

Related Repositories

Repo                                   Purpose
aws-eks-platform                       Terraform: VPC, EKS, IAM
gitops-eks-platform                    GitOps: ArgoCD workloads
k8s-security-platform                  Security: Gatekeeper + Falco
k8s-multi-tenancy                      Multi-tenancy: RBAC, Quotas
k8s-observability-stack (this repo)    Observability: Prometheus + Grafana
