Production-ready distributed LLM inference with intelligent KV-cache-aware routing, GPU acceleration, and 90%+ cache hit rates.
🎯 Proven Results: 92.5% cache hit rate, 77% TTFT improvement, GPU-accelerated inference
⚡ Key Features:
- Cache-Aware Routing: EPP (External Processing Pod) with intelligent request scheduling
- GPU Acceleration: NVIDIA A100 support with vLLM v0.10.0 optimization
- Istio Integration: Gateway API + service mesh for production traffic management
- Out-of-Box Experience: Automated installation with proper dependency ordering
To run LLM-D with GPU acceleration, you'll need to install and configure the NVIDIA GPU Operator on your OpenShift/Kubernetes cluster. This section provides step-by-step instructions based on a successful deployment.
- OpenShift 4.16+ or Kubernetes cluster with GPU nodes (tested with p4d.24xlarge instances)
- Cluster admin privileges
- GPU nodes should be labeled appropriately (e.g., `node.kubernetes.io/instance-type=p4d.24xlarge`)
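A quick sanity check for both prerequisites (adjust the instance-type value to your GPU node type):

```bash
# List GPU nodes by instance type
oc get nodes -l node.kubernetes.io/instance-type=p4d.24xlarge

# Confirm cluster-admin rights (should print "yes")
oc auth can-i '*' '*' --all-namespaces
```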
Install the NVIDIA GPU Operator from OperatorHub:
```bash
# Create the nvidia-gpu-operator namespace
oc create namespace nvidia-gpu-operator

# Install the operator (via OperatorHub or manually)
oc apply -f - <<EOF
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: gpu-operator
  namespace: nvidia-gpu-operator
spec:
  channel: stable
  name: gpu-operator-certified
  source: certified-operators
  sourceNamespace: openshift-marketplace
EOF
```
On RHEL CoreOS (RHCOS), Node Feature Discovery (NFD) may not automatically detect all required labels. Manually add the necessary labels:
```bash
# Get your GPU node names
oc get nodes -l node.kubernetes.io/instance-type | grep -E "(g4dn|g5|p3|p4)"

# For each GPU node, add the required NFD labels
for node in $(oc get nodes -o name -l node.kubernetes.io/instance-type | grep -E "(g4dn|g5|p3|p4)"); do
  # Add NVIDIA PCI device label
  oc label $node feature.node.kubernetes.io/pci-10de.present=true

  # Add kernel version label (check your kernel version first)
  KERNEL_VERSION=$(oc debug $node -- chroot /host uname -r | tail -1)
  oc label $node feature.node.kubernetes.io/kernel-version.full=$KERNEL_VERSION

  # Add RHCOS version label (for RHCOS nodes)
  OSTREE_VERSION=$(oc debug $node -- chroot /host cat /etc/os-release | grep OSTREE_VERSION | cut -d'=' -f2 | tr -d '"' | tail -1)
  oc label $node feature.node.kubernetes.io/system-os_release.OSTREE_VERSION=$OSTREE_VERSION
done
```
Create a cluster policy optimized for RHCOS:
```bash
oc apply -f - <<EOF
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  operator:
    defaultRuntime: crio
    runtimeClass: nvidia
  driver:
    enabled: true
    rdma:
      enabled: false
    useOpenKernelModules: false
  toolkit:
    enabled: true
  devicePlugin:
    enabled: true
  dcgm:
    enabled: true
  dcgmExporter:
    enabled: true
  gfd:
    enabled: true
  nodeStatusExporter:
    enabled: true
  daemonsets:
    tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
    updateStrategy: RollingUpdate
EOF
```
Monitor the installation progress:
```bash
# Check cluster policy status
oc describe clusterpolicy gpu-cluster-policy

# Watch all GPU operator pods
oc get pods -n nvidia-gpu-operator -w

# Verify GPU resources are advertised
oc describe nodes -l node.kubernetes.io/instance-type | grep nvidia.com/gpu
# Expected output should show: nvidia.com/gpu: <number_of_gpus>
```
Successful deployment should show:
- `nvidia-driver-daemonset-*` pods: 2/2 Running
- `nvidia-device-plugin-daemonset-*` pods: 1/1 Running
- `gpu-feature-discovery-*` pods: 1/1 Running
- `nvidia-container-toolkit-daemonset-*` pods: 1/1 Running
- GPU resources advertised on nodes (e.g., `nvidia.com/gpu: 8`)
Validate GPU access with a test pod:
```bash
oc apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
  namespace: default
spec:
  containers:
  - name: gpu-test
    image: nvidia/cuda:12.4.0-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
  restartPolicy: Never
EOF

# Check the output
oc logs gpu-test

# Clean up
oc delete pod gpu-test
```
Driver compilation issues:
- RHCOS doesn't include kernel headers by default
- The OpenShift Driver Toolkit (DTK) automatically handles driver compilation
- Wait for `nvidia-driver-daemonset` pods to show 2/2 Running
Missing NFD labels:
- Manually add the required labels as shown in step 2
- Check the GPU operator logs for the specific missing labels (see the example after this list)
Image pull issues:
- Ensure cluster has internet access to pull NVIDIA container images
- Check for any corporate proxy/firewall restrictions
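For example, a sketch of pulling the operator logs (the operator deployment name can differ between installs, so list deployments first if unsure):

```bash
# Find the operator deployment, then scan its logs for label/reconcile errors
oc get deployments -n nvidia-gpu-operator
oc logs -n nvidia-gpu-operator deployment/gpu-operator | grep -iE "label|error"
```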
For p4d.24xlarge instances:
- GPUs: 8x NVIDIA A100-SXM4-40GB (40GB memory each)
- GPU Memory: 320GB total per node
- CUDA Compute Capability: 8.0
- NVLink: High-speed inter-GPU communication
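To confirm those GPUs are actually advertised to the scheduler, a quick sketch:

```bash
# Print each p4d node with its advertised GPU count (expect 8 per node)
oc get nodes -l node.kubernetes.io/instance-type=p4d.24xlarge \
  -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.status.capacity.nvidia\.com/gpu}{"\n"}{end}'
```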
CRITICAL: Follow this exact order. Do NOT skip steps.
1. Cluster Requirements:
- OpenShift 4.16+ or Kubernetes 1.28+
- GPU nodes with NVIDIA GPU Operator installed (see section above)
- Cluster admin privileges
2. Required Tools:
```bash
# Verify these tools are installed:
kubectl version --client
helm version
oc version   # For OpenShift
tkn version  # Tekton CLI (for testing)
```
3. Environment Setup:
```bash
# Required: Set your Hugging Face token for model access
export HF_TOKEN="hf_your_actual_token_here"

# Optional: Set custom namespace (default: llm-d)
export NS=llm-d
```
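To verify the token before a long install, you can hit Hugging Face's public whoami endpoint (a sketch):

```bash
# Should return your account details as JSON, not an authorization error
curl -s -H "Authorization: Bearer ${HF_TOKEN}" https://huggingface.co/api/whoami-v2
```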
Why This Version: Istio 1.27.0+ includes Gateway API Inference Extension support, which is required for EPP cache-aware routing. Older versions (including OpenShift Service Mesh) will NOT work.
```bash
# Download and install Istio 1.27.0
curl -L https://istio.io/downloadIstio | ISTIO_VERSION=1.27.0 TARGET_ARCH=x86_64 sh -
sudo mv istio-1.27.0/bin/istioctl /usr/local/bin/

# Install Istio with Gateway API support
istioctl install --set values.pilot.env.EXTERNAL_ISTIOD=false -y

# Add anyuid SCC permissions for OpenShift
oc adm policy add-scc-to-user anyuid -z istio-ingressgateway-service-account -n istio-system
oc adm policy add-scc-to-user anyuid -z istiod -n istio-system
```
Verify Istio Installation:
```bash
# Should show istio control plane pods running
kubectl get pods -n istio-system

# Should show "istio" GatewayClass available
kubectl get gatewayclass
```
```bash
# Install standard Gateway API CRDs
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.0.0/standard-install.yaml

# Install Gateway API Inference Extension CRDs
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/latest/download/manifests.yaml
```
Verify Gateway API:
```bash
# Should show gateway, httproute, and inference CRDs
kubectl get crd | grep -E "(gateway|inference)"
```
```bash
# Install base infrastructure (Gateway, HTTPRoute)
make infra NS=llm-d GATEWAY_CLASS=istio

# Verify gateway is programmed
kubectl get gateway -n llm-d
# Should show: PROGRAMMED=True
```
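For orientation, the Gateway that `make infra` creates looks roughly like the sketch below; the actual name and listeners come from the repo's charts, so treat these values as illustrative:

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: llm-d-infra-inference-gateway   # illustrative; confirm with `kubectl get gateway -n llm-d`
  namespace: llm-d
spec:
  gatewayClassName: istio
  listeners:
  - name: http
    port: 80        # assumed listener port; check your deployment
    protocol: HTTP
```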
```bash
# Install EPP, decode services, and complete stack
make llm-d NS=llm-d

# Wait for all components to be ready (~5-10 minutes for GPU model loading)
make status NS=llm-d
```
```bash
# Run end-to-end cache-aware routing test
make test NS=llm-d

# Expected results:
# ✅ Cache Hit Rate: 90%+
# ✅ TTFT improvement: 70%+ for cached requests
# ✅ Gateway routing: HTTP 200 responses
```
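You can also send a request through the gateway by hand. A sketch, assuming an HTTP listener and the vLLM OpenAI-compatible API (replace `<your-model>` with the deployed model name):

```bash
# Resolve the gateway address (hostname or IP, depending on the environment)
GW=$(kubectl get gateway -n llm-d -o jsonpath='{.items[0].status.addresses[0].value}')

# Completion request routed via EPP; repeated identical prompts should hit the cache
curl -s "http://${GW}/v1/completions" \
  -H 'Content-Type: application/json' \
  -d '{"model": "<your-model>", "prompt": "Hello", "max_tokens": 16}'
```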
```bash
# Set required environment
export HF_TOKEN="hf_your_actual_token_here"
export NS=llm-d

# Complete installation (10-15 minutes)
make install-all NS=$NS

# Run validation test
make test NS=$NS
```
| Command | Purpose | When to Use |
|---|---|---|
| `make install-all` | Complete installation (all steps) | First-time setup |
| `make infra` | Infrastructure only | Gateway/routing setup |
| `make llm-d` | LLM-D components only | After infra is ready |
| `make test` | Run cache-hit validation | Verify deployment |
| `make status` | Check component status | Troubleshooting |
| `make clean` | Remove all components | Fresh restart |
Environment Variables:
- `NS`: Namespace (default: llm-d)
- `HF_TOKEN`: Hugging Face token (required)
- `GATEWAY_CLASS`: Gateway class (default: istio)
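These variables compose with any target, e.g.:

```bash
# Install everything into a custom namespace with an explicit gateway class
export HF_TOKEN="hf_your_actual_token_here"
make install-all NS=my-llm-d GATEWAY_CLASS=istio
```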
- ❌ OpenShift Service Mesh: Based on older Istio; lacks the Gateway API Inference Extension
- ❌ LLM-D Operator: Outdated and not maintained
- ❌ Manual Kubernetes Manifests: Configuration loading issues; use the official Helm charts
- ❌ kGateway: This demo is optimized for Istio integration
After successful installation, you should see:
Performance Metrics:
- Cache Hit Rate: 90-95%
- TTFT Improvement: 70-80% for cached requests
- Response Times: ~220ms TTFT for cache hits vs ~970ms for misses
- Throughput: Optimized based on EPP intelligent routing
Infrastructure Status:
```bash
kubectl get pods -n llm-d

# Should show all pods Running:
# - llm-d-gaie-epp-*: 1/1 Running (EPP)
# - ms-llm-d-modelservice-decode-*: 3/3 Running (GPU inference)
# - llm-d-infra-inference-gateway-*: 1/1 Running (Istio gateway)
```
High-level flow
- Client → Istio Gateway → Envoy External Processing (EPP)
- EPP scores endpoints for KV-cache reuse and health → returns routing decision (header hint)
- Gateway forwards to the decode Service/pod honoring EPP's decision
- vLLM pods execute inference with prefix cache enabled (TTFT improves after warm-up)
- Prometheus aggregates metrics; Tekton prints hit-rate and timings
Key components
- EPP (External Processor): cache-aware scoring and decisioning
- Istio Gateway/Envoy: ext_proc integration; EPP uses InferencePool for endpoint discovery and scoring
- vLLM pods: prefix cache enabled, block_size=16, no chunked prefill (illustrative flags shown below)
- Observability: Prometheus (or Thanos) used by the Tekton Task to aggregate pod metrics
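As a sketch of what that vLLM configuration looks like in a decode pod spec (the real Deployment lives in assets/llm-d; the image tag and flag spellings vary across vLLM versions):

```yaml
containers:
- name: vllm
  image: vllm/vllm-openai:v0.10.0      # version named in this repo; confirm against assets/llm-d
  args:
  - --model=<your-model>
  - --enable-prefix-caching            # KV prefix cache reuse
  - --block-size=16                    # KV cache block size described above
  - --enable-chunked-prefill=False     # exact flag form depends on vLLM version
```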
Why it works
- EPP-driven routing concentrates session traffic onto warm pods for maximal KV cache reuse
- Prefix caching reduces TTFT and total latency significantly for repeated prompts
- All policy is centralized in EPP; the data plane remains simple
For a deeper technical outline (design rationale, metrics, demo flow), see the blog posts in blog/ (do not modify them here).
What's deployed (llm-d-monitoring)
- Prometheus: v2.45.0, 7d retention, jobs include:
  - kubernetes-pods (llm-d), vllm-instances (port mapping 8000→8000), llm-d-scheduler, gateway-api inference extension (EPP), Envoy/gateway
- Grafana: latest, anonymous viewer enabled, admin user seeded for demo
- Dashboards: LLM Performance Dashboard provisioned from monitoring/grafana-dashboard-llm-performance.json
Key panels (examples)
- TTFT: histogram_quantile over vllm:time_to_first_token_seconds_bucket
- Inter-token latency: vllm:time_per_output_token_seconds_bucket
- Cache hit rates: sum(vllm:gpu_prefix_cache_hits_total) / sum(vllm:gpu_prefix_cache_queries_total)
- Request queue: vllm:num_requests_running vs vllm:num_requests_waiting
- Throughput: rate(vllm:request_success_total[5m])
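Spelled out, the TTFT panel query looks like this sketch (the provisioned dashboard may use a different quantile or rate window):

```promql
histogram_quantile(0.95, sum by (le) (rate(vllm:time_to_first_token_seconds_bucket[5m])))
```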
Files of record
- Prometheus
  - monitoring/prometheus.yaml (SA/RBAC/Deployment/Service)
  - monitoring/prometheus-config.yaml (scrape configs + alert rules)
- Grafana
  - monitoring/grafana.yaml (SA/Deployment)
  - monitoring/grafana-config.yaml (grafana.ini)
  - monitoring/grafana-datasources.yaml (Prometheus datasource)
  - monitoring/grafana-dashboards-config.yaml (provisioning)
  - monitoring/grafana-dashboard-llm-performance.json (dashboard)
  - monitoring/grafana-service.yaml (Service + OpenShift Route)
- deploy.sh: single command installer and validator for the Istio + EPP demo
- assets/llm-d: decode Service/Deployment, EPP stack, HTTPRoute
- assets/cache-aware/tekton: Tekton cache-hit pipeline definition
- monitoring/: optional monitoring assets (Grafana dashboards, configs)
- llm-d-infra/: upstream infrastructure (optional), not required for this demo path
- Metrics and routes: some names/hosts are environment-specific; update to your cluster
- Secrets/tokens: this repo does not include real secrets. Configure any required tokens (e.g., HF) as Kubernetes Secrets in your cluster (see the sketch below)
- GPU requirement: for real model inference, deploy onto GPU nodes with NVIDIA GPU Operator installed (see "NVIDIA GPU Operator Setup" section above); otherwise, deploy the stack and test the control-plane paths only
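For the HF token, a minimal sketch (the Secret name and key here are hypothetical; match whatever your charts reference):

```bash
kubectl create secret generic hf-token -n llm-d \
  --from-literal=HF_TOKEN="hf_your_actual_token_here"
```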
- Blog: see blog/ for architectural deep dives and demo details
- Troubleshooting: monitoring/README.md for monitoring-specific steps
- Advanced architecture details: assets/cache-aware/docs/ARCHITECTURE.md
- Metrics details: assets/cache-aware/docs/METRICS.md