k8s-llmops-iac

This repository contains Infrastructure-as-Code and examples to provision an AWS environment for LLM/ML workloads on Kubernetes:

  • VPC
  • EKS cluster (with GPU nodegroups)
  • Argo CD for GitOps continuous delivery
  • Example Kubernetes YAML to verify GPU scheduling

Goals

  • Provide repeatable infra to run GPU-accelerated LLM workloads on EKS.
  • Demonstrate GitOps deployment with Argo CD.
  • Include a minimal GPU test manifest.

Prerequisites

  • AWS account with sufficient quotas (EC2 GPU instance types, EKS).
  • AWS CLI configured with credentials and region.
  • kubectl installed.
  • eksctl or Terraform (depending on your preferred IaC approach).
  • Helm (for Argo CD install).
  • Optional: aws-iam-authenticator, jq

Repo layout

  • terraform/ or eksctl/ (expected): IaC definitions for VPC and EKS
  • argocd/ : manifests or Helm values for Argo CD installation & apps
  • k8s/ : sample k8s manifests (GPU test pod, sample apps)
  • README.md : this document

(Adjust paths to match this repo's actual structure.)

Quickstart (high level)

  1. Create the VPC and EKS cluster
  • If using eksctl (an example eks-cluster.yaml sketch follows this list):
    • eksctl create cluster -f eks-cluster.yaml
    • eksctl create nodegroup --cluster <cluster-name> --name gpu-nodes --node-type p3.2xlarge --nodes 1 --nodes-min 0 --nodes-max 2 --node-ami auto
  • If using Terraform:
    • terraform init
    • terraform apply -var="aws_region=..."
  2. Configure kubectl
  • aws eks update-kubeconfig --name <cluster-name> --region <region>
  3. Install the NVIDIA device plugin so GPUs are schedulable (an install sketch follows this list)
  4. Install Argo CD
  • helm repo add argo https://argoproj.github.io/argo-helm
  • helm repo update
  • helm install argocd argo/argo-cd -n argocd --create-namespace
  • kubectl -n argocd port-forward svc/argocd-server 8080:443 &
  5. Register your Git repo or apply Argo CD Application manifests pointing to the k8s/ manifests (an example Application follows this list)
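Example eks-cluster.yaml (step 1). This is a minimal eksctl ClusterConfig sketch; the cluster name, region, and node group sizes are illustrative placeholders and should be adapted to the IaC actually in this repo.

```yaml
# eks-cluster.yaml -- minimal sketch; names, region, and sizes are placeholders
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: llmops-cluster        # hypothetical cluster name
  region: us-west-2           # choose your region
managedNodeGroups:
  - name: cpu-nodes           # general-purpose nodes for system and non-GPU workloads
    instanceType: m5.large
    desiredCapacity: 2
  - name: gpu-nodes           # GPU nodes for LLM/ML workloads
    instanceType: p3.2xlarge
    desiredCapacity: 1
    minSize: 0
    maxSize: 2
```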
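GPU scheduling (step 3). Two common options are sketched below: the standalone NVIDIA device plugin (sufficient when the node AMI already ships the NVIDIA driver, as the EKS GPU-optimized AMIs do) or the NVIDIA GPU Operator (which also manages the driver and container toolkit). Chart versions are not pinned here; pin them in practice.

```sh
# Option A: NVIDIA device plugin via Helm (assumes the node AMI already includes the NVIDIA driver)
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
helm upgrade --install nvdp nvdp/nvidia-device-plugin --namespace kube-system

# Option B: NVIDIA GPU Operator (installs driver, container toolkit, and device plugin)
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator -n gpu-operator --create-namespace
```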
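Example Argo CD Application (step 5). A sketch that points Argo CD at the k8s/ directory of this repo; the application name, target revision, and destination namespace are assumptions.

```yaml
# argocd/gpu-samples-app.yaml -- sketch; name, revision, and namespace are assumptions
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: gpu-samples             # hypothetical application name
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/paravatha/k8s-llmops-iac.git
    targetRevision: main        # assumed default branch
    path: k8s                   # directory of sample manifests in this repo
  destination:
    server: https://kubernetes.default.svc
    namespace: default          # assumed target namespace
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```

Apply it with kubectl apply -f argocd/gpu-samples-app.yaml, or register the repo through the Argo CD UI/CLI instead.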

GPU test manifest (minimal)

  • Place a test pod in k8s/gpu-test.yaml that requests a GPU resource (resources: limits: nvidia.com/gpu: 1); a full sketch of the manifest follows this list.
  • Example commands:
    • kubectl apply -f k8s/gpu-test.yaml
    • kubectl get pods -o wide
    • kubectl logs <pod-name>
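A minimal gpu-test.yaml sketch; the pod name and CUDA image tag are illustrative assumptions, not artifacts from this repo.

```yaml
# k8s/gpu-test.yaml -- minimal sketch; pod name and image tag are illustrative
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda-smoke-test
      image: nvidia/cuda:12.2.0-base-ubuntu22.04   # assumed CUDA base image tag
      command: ["nvidia-smi"]                       # prints detected GPUs, then exits
      resources:
        limits:
          nvidia.com/gpu: 1                         # forces scheduling onto a GPU node
```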

Validation

  • Ensure GPU node shows allocatable GPUs:
    • kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.allocatable}{"\n"}{end}'
  • Verify the GPU test pod is scheduled onto a GPU node and that the container detects the GPU (nvidia-smi output); see the sketch below.
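A sketch of these checks, assuming the gpu-test pod from the example above (its command is nvidia-smi, so its logs should list the GPU):

```sh
# Show allocatable nvidia.com/gpu per node (single quotes keep the backslash escape intact)
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.allocatable.nvidia\.com/gpu}{"\n"}{end}'

# The sketched test pod runs nvidia-smi as its command, so its logs should show the GPU
kubectl logs gpu-test
```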

Cleanup

  • Delete the cluster and nodegroups with eksctl delete or terraform destroy (see the sketch after this list).
  • helm uninstall argocd -n argocd
  • kubectl delete -f k8s/gpu-test.yaml
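A teardown sketch, assuming the eksctl layout from the Quickstart; cluster and nodegroup names are placeholders.

```sh
# Remove the sample workload and Argo CD first
kubectl delete -f k8s/gpu-test.yaml
helm uninstall argocd -n argocd

# eksctl path (names are placeholders)
eksctl delete nodegroup --cluster <cluster-name> --name gpu-nodes
eksctl delete cluster --name <cluster-name>

# Terraform path
terraform destroy -var="aws_region=..."
```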

Notes

  • Adjust instance types and autoscaling limits to control cost.
  • Ensure IAM roles for service accounts (IRSA) or node IAM policies grant ECR access and any other permissions your workloads and GPU setup require.
  • This README expects existing IaC artifacts in the repo — update paths/commands to match them.

License

Apache 2.0
