This repository contains Infrastructure-as-Code and examples to provision an AWS environment for LLM/ML workloads on Kubernetes:
- VPC
- EKS cluster (with GPU nodegroups)
- Argo CD for GitOps continuous delivery
- Example Kubernetes YAML to verify GPU scheduling
## Goals

- Provide repeatable infrastructure to run GPU-accelerated LLM workloads on EKS.
- Demonstrate GitOps deployment with Argo CD.
- Include a minimal GPU test manifest.
## Prerequisites

- AWS account with sufficient quotas (EC2 GPU instance types, EKS).
- AWS CLI configured with credentials and region.
- kubectl installed.
- eksctl or Terraform (depending on your preferred IaC approach).
- Helm (for the Argo CD install).
- Optional: aws-iam-authenticator, jq.
## Repository structure

- `terraform/` or `eksctl/` (expected): IaC definitions for the VPC and EKS cluster
- `argocd/`: manifests or Helm values for the Argo CD installation and apps
- `k8s/`: sample Kubernetes manifests (GPU test pod, sample apps)
- `README.md`: this document

(Adjust paths to match this repo's actual structure.)
## Quickstart

1. Create the VPC and EKS cluster.

   Using eksctl (an example `eks-cluster.yaml` sketch follows this step):

   ```bash
   eksctl create cluster -f eks-cluster.yaml
   eksctl create nodegroup --cluster <cluster-name> --name gpu-nodes \
     --node-type p3.2xlarge --nodes 1 --nodes-min 0 --nodes-max 2 --node-ami auto
   ```

   Or, using Terraform:

   ```bash
   terraform init
   terraform apply -var="aws_region=..."
   ```
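   A minimal `eks-cluster.yaml` sketch with a managed GPU nodegroup (the cluster name and region are placeholders; adjust to your account):

   ```yaml
   apiVersion: eksctl.io/v1alpha5
   kind: ClusterConfig
   metadata:
     name: llm-eks        # placeholder; use your cluster name
     region: us-east-1    # placeholder; use your region
   managedNodeGroups:
     - name: gpu-nodes
       instanceType: p3.2xlarge
       desiredCapacity: 1
       minSize: 0
       maxSize: 2
   ```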
2. Configure kubectl.

   ```bash
   aws eks update-kubeconfig --name <cluster-name> --region <region>
   ```
3. Install the NVIDIA device plugin (enables GPU scheduling).

   ```bash
   kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.11.0/nvidia-device-plugin.yml
   ```
4. Install Argo CD.

   ```bash
   helm repo add argo https://argoproj.github.io/argo-helm
   helm repo update
   helm install argocd argo/argo-cd -n argocd --create-namespace
   kubectl -n argocd port-forward svc/argocd-server 8080:443 &
   ```
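   To log in at https://localhost:8080 as `admin`, retrieve the initial password (the secret name below is the default created by current Argo CD releases):

   ```bash
   kubectl -n argocd get secret argocd-initial-admin-secret \
     -o jsonpath="{.data.password}" | base64 -d
   ```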
5. Register your Git repo or apply Argo CD Application manifests pointing at the `k8s/` manifests, for example the sketch below.
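   A minimal Application sketch (the Application name is hypothetical and the repoURL is a placeholder; adjust the path and destination namespace to your layout):

   ```yaml
   apiVersion: argoproj.io/v1alpha1
   kind: Application
   metadata:
     name: gpu-examples               # hypothetical name
     namespace: argocd
   spec:
     project: default
     source:
       repoURL: https://github.com/<your-org>/<this-repo>.git  # placeholder
       targetRevision: main
       path: k8s
     destination:
       server: https://kubernetes.default.svc
       namespace: default
     syncPolicy:
       automated:
         prune: true
         selfHeal: true
   ```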
## GPU test

Place a test pod in `k8s/gpu-test.yaml` that requests a GPU resource (`resources.limits` with `nvidia.com/gpu: 1`).
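A minimal sketch of such a manifest (the CUDA image tag is an assumption; any image that ships `nvidia-smi` works):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never               # run nvidia-smi once and exit
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.1-base-ubuntu22.04  # assumed tag; swap in any CUDA base image
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1          # forces scheduling onto a GPU node
```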
Example commands:

```bash
kubectl apply -f k8s/gpu-test.yaml
kubectl get pods -o wide
kubectl logs gpu-test   # pod name from the sketch above
```
## Verification

- Ensure the GPU node shows allocatable GPUs:

  ```bash
  kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.allocatable}{"\n"}{end}'
  ```

- Verify the GPU test pod was scheduled onto a GPU node and that the container detected the GPU (`nvidia-smi` output in its logs).
## Cleanup

```bash
kubectl delete -f k8s/gpu-test.yaml
helm uninstall argocd -n argocd
```

Then delete the cluster and nodegroups (`eksctl delete cluster` or `terraform destroy`).
## Notes

- Adjust instance types and autoscaling limits to control cost.
- Ensure IAM roles for service accounts (IRSA) or node IAM policies allow installing GPU drivers, pulling from ECR, and any other required actions.
- This README expects existing IaC artifacts in the repo; update paths and commands to match them.
## License

Apache 2.0