GitOps-managed AI Agent Platform deployed on Kubernetes using ArgoCD with the Components + Overlays pattern.
Current Environment: Kind (Local Development) ✅ Planned: OpenShift (Staging & Production) 🚧
Kagenti is a production-ready AI agent orchestration platform built on Kubernetes with:
- GitOps Deployment - All infrastructure defined as code, deployed via ArgoCD
- Service Mesh Security - Istio mTLS STRICT mode for all pod-to-pod communication
- SSO Authentication - Keycloak with OAuth2/OIDC for unified access control
- Distributed Tracing - Dual-backend (Tempo for infrastructure, Phoenix for AI/LLM traces)
- Comprehensive Observability - Grafana, Prometheus, Loki, Kiali for full platform visibility
- Tekton CI/CD - Automated agent build pipelines with operator-based workflows
Key Architecture Documents:
- ARCHITECTURE.md - Detailed platform architecture and deployment layers
- CLAUDE.md - Development workflow, GitOps practices, TDD approach
- TODO_SECURITY.md - Comprehensive security roadmap (Kind → OpenShift)
# Complete cluster redeploy (auto-detects repo/branch)
./scripts/quick-redeploy.sh
What it does:
- ✅ Auto-detects repository and branch (works with PRs/forks in GitHub Actions)
- ✅ Destroys existing Kind cluster
- ✅ Creates new Kind cluster with registry
- ✅ Installs ArgoCD
- ✅ Bootstraps applications (uses detected repo/branch automatically)
- ✅ Prompts for operator images (build/load/skip)
- ✅ Prompts for agent images (build from source/load pre-built/skip)
- ✅ Syncs root application and child apps
- ✅ Shows final status and access URLs
Branch Detection (automatic):
- Local development: Uses current Git branch
- GitHub Actions PR: Automatically uses PR branch and fork repository
- Default: Falls back to the main branch of the upstream repo
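The three detection cases above could be sketched as follows. This is a minimal illustration, not the actual logic of quick-redeploy.sh: the function name `detect_branch` is hypothetical, and it assumes GitHub Actions exposes the PR source branch through `GITHUB_HEAD_REF` (which it does for pull_request events).

```shell
# Sketch only: resolve the branch to deploy, mirroring the three cases above.
# detect_branch is a hypothetical name, not a function from this repository.
detect_branch() {
  if [ -n "${GITHUB_HEAD_REF:-}" ]; then
    # GitHub Actions PR: use the PR's source branch
    printf '%s\n' "$GITHUB_HEAD_REF"
  elif branch=$(git rev-parse --abbrev-ref HEAD 2>/dev/null) && [ "$branch" != "HEAD" ]; then
    # Local development: use the current Git branch
    printf '%s\n' "$branch"
  else
    # No Git checkout or detached HEAD: fall back to main
    printf '%s\n' "main"
  fi
}
```

Calling `detect_branch` inside a local checkout prints the current branch name; detecting the fork repository for PRs would need a similar lookup and is not covered here.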
Interactive Prompts (local only, auto-skipped in CI):
Operator Images (kagenti-operator, kagenti-platform-operator):
- y - Load pre-built images from Docker (30 seconds) ✅ Recommended
- n - Rebuild from source (2-5 minutes)
- skip - Skip loading (operators will show ImagePullBackOff)
Agent Images (research-agent, code-agent, orchestrator-agent):
- y - Build from source (2-5 minutes)
- n - Load pre-built images (30 seconds)
- skip - Skip agents (agents will show ImagePullBackOff) ✅ Default
Environment Variables (optional):
# Custom agent source location
export AGENT_SOURCE_DIR=/path/to/your/agent-repo
./scripts/quick-redeploy.sh
# Skip all prompts (CI mode)
CI=true ./scripts/quick-redeploy.sh
# Specify operator mode
export OPERATOR_IMAGE_MODE=rebuild # rebuild|load|skip
./scripts/quick-redeploy.sh
For more control, run scripts individually:
# 1. Create Kind cluster
./scripts/kind/01-create-cluster.sh
# 2. Install ArgoCD
./scripts/kind/02-install-argocd.sh
# 3. Bootstrap ArgoCD Applications
./scripts/kind/03-bootstrap-apps.sh
# 4. (Optional) Load agent images
./scripts/kind/04-load-agent-images.sh build # or 'load'
# 5. Sync root application (creates all child apps)
argocd app sync kagenti-platform-kind \
--port-forward --port-forward-namespace argocd --grpc-web \
--timeout 600
# 6. Monitor deployment progress
./scripts/monitor-argocd-apps.sh 900 # 15-minute timeout
# 7. Check platform health
./scripts/platform-status.sh
Kagenti follows the GitOps + TDD workflow documented in CLAUDE.md.
ALWAYS test after deployment changes
All platform changes go through:
- ✅ Edit manifests in Git
- ✅ Validate syntax with kustomize build
- ✅ Commit and push to branch
- ✅ Sync via ArgoCD
- ✅ Test with pytest integration tests
- ✅ Merge when tests pass
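The validation step in the list above might look like the sketch below. It assumes that overlay directories such as argocd/applications/kind-local contain a kustomization file, which this README does not state explicitly. `validate_overlay` and the `KUSTOMIZE_BIN` override are hypothetical names introduced for illustration.

```shell
# Sketch only: fail fast if an overlay does not render.
# validate_overlay and KUSTOMIZE_BIN are hypothetical, not part of this repository.
validate_overlay() {
  overlay_dir=$1
  # Render the overlay and discard the output; only the exit status matters here
  if "${KUSTOMIZE_BIN:-kustomize}" build "$overlay_dir" >/dev/null; then
    echo "OK: $overlay_dir"
  else
    echo "FAILED: $overlay_dir" >&2
    return 1
  fi
}
# Example (assumed path): validate_overlay argocd/applications/kind-local
```

Running a check like this before committing catches YAML and patch errors locally, before ArgoCD tries to sync them.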
# Fast validation (critical apps only, ~30s)
pytest tests/validation/test_app_state.py -v --only-critical
# Full platform health check (includes automated tests)
./scripts/platform-status.sh
# Specific component tests
pytest tests/integration/test_observability.py -v
# Full test suite with HTML report
pytest tests/ -v --html=report.html --self-contained-html
Test Coverage (see docs/CI_CD_TESTING.md):
- ✅ ArgoCD application state validation
- ✅ Pod health checks across all namespaces
- ✅ Service accessibility via Gateway
- ✅ OAuth authentication flows
- ✅ Observability stack integration (Grafana, Tempo, Phoenix)
- ✅ Istio mTLS STRICT mode verification
- ✅ Certificate readiness checks
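As an illustration of the mTLS check listed above, the sketch below inspects the `spec.mtls.mode` field of Istio PeerAuthentication resources. It is not the project's pytest implementation. `all_strict` is a hypothetical helper, and a real check would also need to consider how mesh-wide, namespace and workload policies override one another.

```shell
# Sketch only: succeed when every PeerAuthentication mode read from stdin is STRICT.
# all_strict is a hypothetical helper; input is one mode per line.
all_strict() {
  count=0
  while IFS= read -r mode; do
    # Skip blank lines (resources without an explicit mtls mode)
    [ -z "$mode" ] && continue
    [ "$mode" = "STRICT" ] || return 1
    count=$((count + 1))
  done
  # No explicit mode at all is treated as a failure
  [ "$count" -gt 0 ]
}
# Example (requires cluster access):
# kubectl get peerauthentication -A \
#   -o jsonpath='{range .items[*]}{.spec.mtls.mode}{"\n"}{end}' | all_strict
```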
Integration with Observability (see CLAUDE.md):
- Real-time test results in Grafana dashboards
- Alert testing via Grafana API
- Trace validation in Tempo and Phoenix
- Log aggregation in Loki
# Build from local source
export AGENT_SOURCE_DIR=/path/to/agent-examples-local
./scripts/kind/04-load-agent-images.sh build
# Or load pre-built images
./scripts/kind/04-load-agent-images.sh load
Default agent source (can be overridden):
AGENT_SOURCE_DIR=/Users/ladas/Projects/OCTO/research/agent-examples-local
Point to your own agent repo by setting the environment variable before deployment:
# Set custom agent source directory
export AGENT_SOURCE_DIR=/path/to/my-agent-repo
# Run quick-redeploy (will use your repo)
./scripts/quick-redeploy.sh
# Select 'y' when prompted for agent images
Requirements for custom repo:
- Directory structure: a2a/{agent-name}/Dockerfile
- Agents: research-agent, code-agent, orchestrator-agent
- Images will be tagged: localhost:5000/{agent}:v0.0.15
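A quick pre-flight check of a custom repo against the requirements above could look like this. The helper name `check_agent_repo` is hypothetical; the agent names, registry and tag are the ones listed in this README.

```shell
# Sketch only: verify the a2a/{agent-name}/Dockerfile layout and print the expected image tags.
# check_agent_repo is a hypothetical helper, not a script from this repository.
check_agent_repo() {
  src_dir=$1
  status=0
  for agent in research-agent code-agent orchestrator-agent; do
    if [ -f "$src_dir/a2a/$agent/Dockerfile" ]; then
      echo "localhost:5000/$agent:v0.0.15"
    else
      echo "missing: $src_dir/a2a/$agent/Dockerfile" >&2
      status=1
    fi
  done
  return "$status"
}
# Example: check_agent_repo "$AGENT_SOURCE_DIR"
```

The function returns non-zero if any of the three Dockerfiles is missing, so it can gate a build step.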
For production-like builds using Tekton + kagenti-operator:
# 1. Setup agent source repo in Kubernetes
./scripts/dev/setup-agent-source-repo.sh
# 2. Trigger builds via Tekton pipelines
./scripts/dev/trigger-agent-builds.sh
# 3. Monitor build progress
kubectl get pipelineruns -n kagenti-operator -w
See components/03-applications/agents/agent-builds/README.md for details.
🚧 Note: Legacy build script (04-load-agent-images.sh) is for quick local dev only. Use operator-based workflow for production.
All services accessible via localtest.me DNS wildcard (points to localhost):
# Get comprehensive access information
./scripts/show-access-info.sh
| Service | URL | Default Credentials | Purpose |
|---|---|---|---|
| ArgoCD | https://argocd.localtest.me:9443 | admin / (see /tmp/argocd-pass.txt) | GitOps deployment |
| Keycloak | https://keycloak.localtest.me:9443 | admin / admin123 (dev) | SSO identity provider |
| Grafana | https://grafana.localtest.me:9443 | admin / admin123 (dev) | Metrics dashboards |
| Phoenix | https://phoenix.localtest.me:9443 | Keycloak kagenti realm | LLM observability |
| Kiali | https://kiali.localtest.me:9443 | Keycloak kubernetes realm | Service mesh visualization |
| Kagenti UI | https://kagenti.localtest.me:9443 | Keycloak kagenti realm | Platform management |
| Kubernetes Dashboard | https://k8s-dashboard.localtest.me:9443 | Keycloak kubernetes realm | Cluster management |
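The URLs in the table above follow a single naming convention, so a reachability probe can be sketched in a few lines. `service_url` and `probe_service` are hypothetical helpers; `-k` is used only because the Kind environment serves self-signed certificates.

```shell
# Sketch only: build service URLs from the localtest.me convention in the table above.
# service_url and probe_service are hypothetical helpers.
service_url() {
  printf 'https://%s.localtest.me:9443\n' "$1"
}
probe_service() {
  # -k accepts the self-signed certificate; only the HTTP status code is printed
  curl -k -s -o /dev/null -w '%{http_code}\n' --max-time 10 "$(service_url "$1")"
}
# Example:
# for s in argocd keycloak grafana phoenix kiali kagenti k8s-dashboard; do probe_service "$s"; done
```

Services behind SSO will typically answer with a redirect status rather than 200, so treat any HTTP response as "reachable" and leave login checks to the pytest suite.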
🔒 Security:
- ✅ All traffic encrypted (TLS 1.3 at Gateway, mTLS STRICT between services)
- ✅ SSO authentication via Keycloak
- ✅ OAuth2-Proxy for services without native auth
- ✅ Self-signed certificates (Kind only - production uses Let's Encrypt)
See docs/08-security/encryption.md for encryption architecture.
# Comprehensive platform status (includes pytest tests)
./scripts/platform-status.sh
Checks:
- ✅ ArgoCD applications (health & sync status)
- ✅ Platform pods (all namespaces)
- ✅ Gateway & certificates
- ✅ Istio mTLS STRICT verification
- ✅ Service accessibility (via Gateway)
- ✅ OAuth authentication flows
- ✅ Automated pytest integration tests (runs real tests)
# Monitor with 15-minute timeout, formatted tables
./scripts/monitor-argocd-apps.sh 900
Shows:
- Formatted ArgoCD application status table (with colors)
- Formatted pod status by namespace table
- Progress tracking with elapsed time
- Smart failure logic (fails only when a CRITICAL app is degraded)
- Monitors ALL apps (critical + optional observability, Kiali, Ollama)
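The failure rule described above (all apps are watched, but only critical ones can fail the run) can be sketched as a small filter. This is an illustration, not the code in monitor-argocd-apps.sh: `critical_degraded` is a hypothetical helper, and the app names passed to it are placeholders because this README does not list which apps are critical.

```shell
# Sketch only: report whether any app from the critical list is Degraded.
# Input on stdin: "<app-name> <health-status>" per line.
# critical_degraded and the app names used with it are hypothetical.
critical_degraded() {
  critical_apps=" $* "
  failed=0
  while read -r name health; do
    case "$critical_apps" in
      *" $name "*)
        if [ "$health" = "Degraded" ]; then
          echo "critical app degraded: $name" >&2
          failed=1
        fi
        ;;
    esac
  done
  # Exit status 0 means "at least one critical app is degraded"
  [ "$failed" -eq 1 ]
}
# Example (requires cluster access; app names are placeholders):
# kubectl get applications.argoproj.io -n argocd \
#   -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.health.status}{"\n"}{end}' \
#   | critical_degraded istio keycloak && exit 1
```

A degraded optional app such as Kiali is ignored by this filter, which matches the intent that optional observability components should not fail a deployment.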
Built-in dashboards:
- Grafana: Metrics, dashboards, alerts (https://grafana.localtest.me:9443)
- Phoenix: LLM traces, agent observability (https://phoenix.localtest.me:9443)
- Kiali: Service mesh topology (https://kiali.localtest.me:9443)
- Prometheus: Metrics storage (internal only, accessible via Grafana)
See docs/04-observability/ for detailed observability documentation.
┌─────────────────────────────────────────────────────────────────┐
│                          GitOps Layer                           │
│  ArgoCD (App-of-Apps) → Components + Overlays → Kubernetes      │
└─────────────────────────────────────────────────────────────────┘
                                 ↓
┌─────────────────────────────────────────────────────────────────┐
│                         Security Layer                          │
│  TLS 1.3 (Gateway) + Istio mTLS STRICT + Keycloak SSO           │
└─────────────────────────────────────────────────────────────────┘
                                 ↓
┌───────────────┬──────────────┬───────────────┬──────────────────┐
│ Infrastructure│ Platform     │ Observability │ Applications     │
│ (Wave 0-5)    │ (Wave 10-15) │ (Wave 20)     │ (Wave 25-30)     │
├───────────────┼──────────────┼───────────────┼──────────────────┤
│ Gateway API   │ Operators    │ Grafana       │ AI Agents        │
│ cert-manager  │ Kagenti UI   │ Tempo         │ Kiali            │
│ Istio         │ Keycloak     │ Phoenix       │ K8s Dashboard    │
│ Tekton        │ OAuth2-Proxy │ Prometheus    │                  │
│               │              │ Loki          │                  │
└───────────────┴──────────────┴───────────────┴──────────────────┘
See ARCHITECTURE.md for:
- Detailed component diagrams
- Deployment layers and sync waves
- Service mesh architecture
- Secrets management (Kind vs OpenShift)
- Network architecture with mTLS
- Access patterns and authentication flows
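The wave ranges in the layer diagram can be written down as a small lookup, which may help when deciding which wave a new component belongs to. `layer_for_wave` is a hypothetical helper; the ranges are copied from the diagram, and the assumption is that each application carries its wave in the standard ArgoCD `argocd.argoproj.io/sync-wave` annotation.

```shell
# Sketch only: map a sync-wave number to the layer named in the diagram.
# layer_for_wave is a hypothetical helper, not part of this repository.
layer_for_wave() {
  wave=$1
  if [ "$wave" -ge 0 ] && [ "$wave" -le 5 ]; then
    echo "Infrastructure"
  elif [ "$wave" -ge 10 ] && [ "$wave" -le 15 ]; then
    echo "Platform"
  elif [ "$wave" -eq 20 ]; then
    echo "Observability"
  elif [ "$wave" -ge 25 ] && [ "$wave" -le 30 ]; then
    echo "Applications"
  else
    # Waves between the documented ranges are not assigned to any layer
    echo "unassigned"
  fi
}
```

ArgoCD applies lower waves first, so infrastructure is in place before the platform, observability and application layers sync.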
components/                  # Base Kubernetes manifests (reusable)
├── 00-infrastructure/       # Core (Gateway, cert-manager, Istio, Keycloak)
├── 01-platform/             # Platform services (Operators, UI)
├── 02-observability/        # Monitoring (Grafana, Tempo, Phoenix, Prometheus)
└── 03-applications/         # Applications (AI Agents)
argocd/applications/
├── base/                    # Base Application templates
├── kind-local/              # Kind-specific patches (localtest.me, self-signed certs)
└── openshift/               # OpenShift patches (Routes, SCCs) 🚧 PLANNED
Benefits:
- ✅ No code duplication across environments
- ✅ Modular components (enable/disable per environment)
- ✅ Single source of truth
- ✅ Easy to add new environments (e.g., OpenShift staging/prod)
Defense-in-Depth (6 Layers) - see TODO_SECURITY.md:
- Perimeter - Gateway API, TLS 1.3, cert-manager
- Identity - Keycloak SSO, OAuth2-Proxy, SPIRE (planned)
- Network - Istio mTLS STRICT, NetworkPolicies (limited), AuthorizationPolicies
- Application - Pod Security Standards (partial), RBAC (basic)
- Data - Secrets encryption at rest (🚧 planned), Vault (🚧 planned)
- Runtime - Falco (🚧 planned), OPA/Kyverno (🚧 planned)
Current Status:
- ✅ IMPLEMENTED: TLS 1.3, Istio mTLS STRICT, Keycloak SSO, OAuth2-Proxy, basic RBAC
- ⚠️ PARTIAL: NetworkPolicies (only observability namespace), Pod Security Standards
- 🚧 PLANNED: CI/CD security scanning, etcd encryption, Vault, Falco, comprehensive NetworkPolicies
See TODO_SECURITY.md for complete security roadmap.
Next Phase: Deploy to OpenShift (Staging & Production)
| Component | Kind (Local) | OpenShift (Production) |
|---|---|---|
| Ingress | Gateway API + MetalLB | OpenShift Routes |
| Certificates | Self-signed (cert-manager) | Let's Encrypt |
| Secrets | Sealed Secrets (Git) | External Secrets Operator + Vault |
| Pod Security | Baseline mode | Restricted mode (SCC) |
| NetworkPolicies | Permissive (dev-friendly) | Strict (default-deny) |
| Observability | In-cluster (Grafana, Tempo) | Hybrid (in-cluster + Grafana Cloud) |
| Encryption at Rest | None (etcd base64 only) | FIPS-validated cryptography |
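The "etcd base64 only" entry in the table above means that Secret values on Kind are merely encoded, not encrypted. The short demonstration below shows why that offers no confidentiality; the sample value is made up for illustration.

```shell
# Sketch only: base64 is a reversible encoding, so anyone who can read the
# stored value can recover the original. The sample value is illustrative.
secret_value='example-password'
encoded=$(printf '%s' "$secret_value" | base64)
decoded=$(printf '%s' "$encoded" | base64 --decode)
echo "stored form:  $encoded"
echo "recovered as: $decoded"
```

No key is involved in the round trip, which is why the roadmap lists etcd encryption at rest as a separate task.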
See TODO_SECURITY.md for detailed implementation tasks:
Phase 1: Foundation (P0 - Critical) - 3-6 months:
- CI/CD Security Scanning - Multi-layer pipeline (Trivy, Snyk, Checkov, Kubescape, Semgrep)
- Expand NetworkPolicies - Default-deny + allow-specific for all namespaces
- OpenShift SCCs - SecurityContextConstraints for all components
- etcd Encryption - Secrets encrypted at rest
- Sealed Secrets - Encrypted secrets in Git
- Pod Security Standards - Baseline (Kind), Restricted (OpenShift)
- Audit Logging - Kubernetes API audit logs → Loki
Phase 2: Hardening (P1 - High) - 6-12 months:
- External Secrets Operator + Vault - Centralized secret management
- Container Image Scanning - Trivy in Tekton pipelines
- Enhanced RBAC - Least-privilege ServiceAccounts
- Secret Rotation - Automated 90-day rotation
- Istio AuthorizationPolicies - Layer 7 access control
Phase 3: Production Ready (P2 - Medium) - 12-18 months:
- Falco - Runtime threat detection
- OPA/Kyverno - Policy enforcement
- MFA - Keycloak multi-factor authentication
- Rate Limiting - Istio EnvoyFilter
- Compliance - SOC 2, GDPR, HIPAA readiness
Detailed tasks, code examples, validation steps: TODO_SECURITY.md
| Document | Description |
|---|---|
| ARCHITECTURE.md | Detailed platform architecture, deployment layers, diagrams |
| CLAUDE.md | Development workflow, GitOps best practices, TDD approach |
| TODO_SECURITY.md | Comprehensive security roadmap (Kind → OpenShift) |
| docs/README.md | Complete documentation index |
- Architecture Overview - Platform components and design
- Prerequisites - Required tools and setup
- Quick Start - 15-minute deployment walkthrough
- Kubernetes & Kind - Local Kubernetes with Kind
- ArgoCD - GitOps deployment
- Gateway API - HTTPRoute and TLS configuration
- cert-manager - TLS certificate management
- Istio - Service mesh with Gateway API and mTLS
- Traffic Management - Routing and load balancing
- Encryption Architecture - TLS 1.3 + Istio mTLS STRICT
- Network Policies - Network isolation
- Secrets Management - Secret handling
- Security Roadmap - Security maturity model
- Keycloak - SSO and identity management
- OAuth2-Proxy - OAuth2 authentication layer
- Distributed Tracing - Tempo + Phoenix architecture
- Grafana - Metrics dashboards
- Phoenix - LLM observability
- Kiali - Service mesh visualization
- Prometheus - Metrics collection
- Loki - Log aggregation
- GenAI Semantic Conventions - AI agent tracing standards
- Alerting Architecture - Alert configuration and management
- Adding New Alerts - How to add custom alerts
- Alert Runbooks - Troubleshooting guides for all alerts
- GitOps Workflows - ArgoCD and Git workflows
- Tekton - CI/CD pipelines
- CI/CD Testing - Integration test strategy
- Troubleshooting - Common issues and solutions
- Agents - AI agent architecture
- Agent Import - How to import agents via UI
kagenti-demo-deployment/
├── argocd/                          # ArgoCD GitOps definitions
│   ├── bootstrap/kind/              # Root Application (App-of-Apps)
│   └── applications/                # Application definitions
│       ├── base/                    # Base application templates
│       ├── helm/                    # Helm-based applications
│       ├── kind-local/              # Kind-specific patches
│       └── openshift/               # OpenShift patches (🚧 PLANNED)
│
├── components/                      # Reusable Kubernetes manifests
│   ├── 00-infrastructure/           # Foundation (Wave 0-5)
│   │   ├── cert-manager.yaml
│   │   ├── gateway-api-chart/
│   │   ├── istio/
│   │   ├── keycloak/
│   │   ├── oauth2-proxy/
│   │   ├── spire/
│   │   └── tekton/
│   ├── 01-platform/                 # Platform services (Wave 10-15)
│   │   ├── gateway/
│   │   ├── kagenti-ui/
│   │   ├── kagenti-operator/
│   │   └── platform-operator/
│   ├── 02-observability/            # Monitoring (Wave 20)
│   │   ├── grafana/
│   │   ├── tempo/
│   │   ├── phoenix/
│   │   ├── prometheus/
│   │   ├── loki/
│   │   └── kiali/
│   ├── 03-applications/             # Applications (Wave 25-30)
│   │   └── agents/
│   └── 08-security/                 # Security configs (🚧 PLANNED)
│       ├── network-policies/
│       ├── pod-security/
│       ├── rbac/
│       └── sealed-secrets/
│
├── scripts/                         # Automation scripts
│   ├── quick-redeploy.sh            # One-command deployment
│   ├── platform-status.sh           # Platform health check
│   ├── monitor-argocd-apps.sh       # Monitor sync progress
│   ├── show-access-info.sh          # Service URLs and credentials
│   ├── kind/                        # Kind cluster setup
│   │   ├── 01-create-cluster.sh
│   │   ├── 02-install-argocd.sh
│   │   ├── 03-bootstrap-apps.sh
│   │   └── 04-load-agent-images.sh
│   └── dev/                         # Developer tools
│       ├── setup-agent-source-repo.sh
│       └── trigger-agent-builds.sh
│
├── tests/                           # Integration & validation tests
│   ├── integration/                 # Component integration tests
│   └── validation/                  # Platform validation tests
│
├── environments/                    # Environment overlays (🚧 LEGACY)
│   └── openshift-stage/             # OpenShift staging example
│
└── docs/                            # Documentation
    ├── 00-getting-started/
    ├── 01-infrastructure/
    ├── 02-service-mesh/
    ├── 03-authentication/
    ├── 04-observability/
    ├── 05-ci-cd/
    ├── 07-platform/
    ├── 08-security/
    ├── 09-deployment/
    ├── 10-operations/
    └── runbooks/
Minimum:
- CPU: 4 cores
- Memory: 8 GB RAM
- Disk: 20 GB
Recommended:
- CPU: 6 cores
- Memory: 12 GB RAM
- Disk: 40 GB
Total pods: ~35-40 (infrastructure + observability + agents)
Cluster Size (recommended):
- Master nodes: 3x (4 vCPU, 16 GB RAM)
- Worker nodes: 5x (8 vCPU, 32 GB RAM)
- Storage: 500 GB+ (persistent volumes)
See TODO_SECURITY.md for production security requirements.
Issue: Pods stuck in ImagePullBackOff
# Check pod events
kubectl describe pod <pod-name> -n <namespace>
# Verify image in Kind
docker exec kagenti-demo-control-plane crictl images | grep <image>
# Load missing images
./scripts/kind/04-load-agent-images.sh load
Issue: ArgoCD app stuck OutOfSync
# Force sync
argocd app sync <app-name> --force \
--port-forward --port-forward-namespace argocd --grpc-web
Issue: Services not accessible via Gateway
# Check Gateway status
kubectl get gateway -A
kubectl describe gateway external-gateway -n default
# Check HTTPRoutes
kubectl get httproute -A
Full troubleshooting guide: docs/10-operations/troubleshooting.md
Alert troubleshooting: docs/04-observability/ALERT_FIX_SUMMARY.md
See CLAUDE.md for:
- Development workflow and GitOps best practices
- TDD approach with pytest integration
- How to add new components
- Branch-based development strategy
- Testing and validation procedures
Key Principles:
- ✅ All changes via Git (no kubectl apply)
- ✅ Test before merge (pytest integration tests)
- ✅ Sync via ArgoCD (argocd app sync)
- ✅ Validate with ./scripts/platform-status.sh
- argocd_architecture.md - ArgoCD architecture and sync waves
- docs/08-security/encryption.md - Encryption and mTLS details
- OBSERVABILITY_ARCHITECTURE.md - Dual-backend tracing architecture
- HTTPS_ACCESS_GUIDE.md - Service URLs and access details
- HTTPS_ENFORCEMENT_SUMMARY.md - TLS implementation details
Apache 2.0
Kagenti Platform Team
Repository: https://github.com/Ladas/kagenti-demo-deployment
Status Legend:
- ✅ Implemented - Ready to use
- ⚠️ Partial - Working but incomplete
- 🚧 Planned - Documented in roadmap
- 🔴 Critical - High priority