This repository contains Infrastructure-as-Code and examples to provision an AWS environment for LLM/ML workloads on Kubernetes:
- VPC
- EKS cluster (with GPU nodegroups)
- Argo CD for GitOps continuous delivery
- Example Kubernetes YAML to verify GPU scheduling
## Goals

- Provide repeatable infrastructure to run GPU-accelerated LLM workloads on EKS.
- Demonstrate GitOps deployment with Argo CD.
- Include a minimal GPU test manifest.
## Prerequisites

- AWS account with sufficient quotas (EC2 GPU instance types, EKS).
- AWS CLI configured with credentials and region.
- kubectl installed.
- eksctl or Terraform (depending on your preferred IaC approach).
- Helm (for the Argo CD install).
- Optional: aws-iam-authenticator, jq.
## Repository structure

- `terraform/` or `eksctl/` (expected): IaC definitions for the VPC and EKS cluster
- `argocd/`: manifests or Helm values for the Argo CD installation and apps
- `k8s/`: sample Kubernetes manifests (GPU test pod, sample apps)
- `README.md`: this document

(Adjust paths to match this repo's actual structure.)
## Quickstart

1. Create the VPC and EKS cluster.

   Using eksctl (an example `eks-cluster.yaml` sketch follows this step):

   ```bash
   eksctl create cluster -f eks-cluster.yaml
   eksctl create nodegroup --cluster <cluster-name> --name gpu-nodes \
     --node-type p3.2xlarge --nodes 1 --nodes-min 0 --nodes-max 2 --node-ami auto
   ```

   Or, using Terraform:

   ```bash
   terraform init
   terraform apply -var="aws_region=..."
   ```
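   A minimal `eks-cluster.yaml` sketch with a managed GPU nodegroup (the cluster name and region are placeholders; adjust to your account):

   ```yaml
   apiVersion: eksctl.io/v1alpha5
   kind: ClusterConfig
   metadata:
     name: llm-eks        # placeholder; use your cluster name
     region: us-east-1    # placeholder; use your region
   managedNodeGroups:
     - name: gpu-nodes
       instanceType: p3.2xlarge
       desiredCapacity: 1
       minSize: 0
       maxSize: 2
   ```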
2. Configure kubectl.

   ```bash
   aws eks update-kubeconfig --name <cluster-name> --region <region>
   ```
3. Install the NVIDIA device plugin (enables GPU scheduling).

   ```bash
   kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.11.0/nvidia-device-plugin.yml
   ```
4. Install Argo CD.

   ```bash
   helm repo add argo https://argoproj.github.io/argo-helm
   helm repo update
   helm install argocd argo/argo-cd -n argocd --create-namespace
   kubectl -n argocd port-forward svc/argocd-server 8080:443 &
   ```
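   To log in at https://localhost:8080 as `admin`, retrieve the initial password (the secret name below is the default created by current Argo CD releases):

   ```bash
   kubectl -n argocd get secret argocd-initial-admin-secret \
     -o jsonpath="{.data.password}" | base64 -d
   ```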
5. Register your Git repo or apply Argo CD Application manifests pointing at the `k8s/` manifests, for example the sketch below.
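   A minimal Application sketch (the Application name is hypothetical and the repoURL is a placeholder; adjust the path and destination namespace to your layout):

   ```yaml
   apiVersion: argoproj.io/v1alpha1
   kind: Application
   metadata:
     name: gpu-examples               # hypothetical name
     namespace: argocd
   spec:
     project: default
     source:
       repoURL: https://github.com/<your-org>/<this-repo>.git  # placeholder
       targetRevision: main
       path: k8s
     destination:
       server: https://kubernetes.default.svc
       namespace: default
     syncPolicy:
       automated:
         prune: true
         selfHeal: true
   ```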
## GPU test

Place a test pod in `k8s/gpu-test.yaml` that requests a GPU resource (`resources.limits` with `nvidia.com/gpu: 1`).
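A minimal sketch of such a manifest (the CUDA image tag is an assumption; any image that ships `nvidia-smi` works):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never               # run nvidia-smi once and exit
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.1-base-ubuntu22.04  # assumed tag; swap in any CUDA base image
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1          # forces scheduling onto a GPU node
```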
Example commands:

```bash
kubectl apply -f k8s/gpu-test.yaml
kubectl get pods -o wide
kubectl logs gpu-test   # pod name from the sketch above
```
## Verification

- Ensure the GPU node shows allocatable GPUs:

  ```bash
  kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.allocatable}{"\n"}{end}'
  ```

- Verify the GPU test pod was scheduled onto a GPU node and that the container detected the GPU (`nvidia-smi` output in its logs).
## Cleanup

```bash
kubectl delete -f k8s/gpu-test.yaml
helm uninstall argocd -n argocd
```

Then delete the cluster and nodegroups (`eksctl delete cluster` or `terraform destroy`).
## Notes

- Adjust instance types and autoscaling limits to control cost.
- Ensure IAM roles for service accounts (IRSA) or node IAM policies allow installing GPU drivers, pulling from ECR, and any other required actions.
- This README expects existing IaC artifacts in the repo; update paths and commands to match them.
## License

Apache 2.0