Master AI Infrastructure Engineering through hands-on projects and practical learning
Prerequisites • Getting Started • Curriculum • Projects • Resources
This repository contains a complete, production-ready learning path for becoming an AI Infrastructure Engineer. Through comprehensive modules, real-world projects, and production-grade code stubs with educational TODO comments, you'll develop the skills needed to build, deploy, and maintain ML infrastructure at scale.
Repository Status: ✅ 100% COMPLETE - All modules and projects ready for learning!
- ✅ Build ML Infrastructure from scratch (Docker, Kubernetes, cloud platforms)
- ✅ Deploy Production ML Systems with auto-scaling and comprehensive monitoring
- ✅ Implement End-to-End MLOps pipelines (Airflow, MLflow, DVC)
- ✅ Deploy Cutting-Edge LLM Infrastructure (vLLM, RAG, vector databases)
- ✅ Scale Training with distributed systems and GPU clusters
- ✅ Monitor and Troubleshoot complex ML systems in production
- ✅ Optimize Costs across cloud providers (60-80% savings possible)
- Industry-Aligned: Based on actual job requirements from FAANG and top tech companies
- Hands-On: Code stubs with TODO comments guide you through real implementations
- Production-Ready: Learn patterns used at Netflix, Uber, Airbnb, and OpenAI
- Career-Focused: Directly maps to $120k-$180k AI Infrastructure Engineer roles
- Progressive: 10 modules building from basics to advanced LLM infrastructure
- Modern Stack: 2024-2025 technologies (vLLM, RAG, GPU optimization)
Recently Added Content:
- Comprehensive Quizzes for modules 102-110 (265+ questions)
- Module 102: Cloud Computing (mid-module + final, 50 questions)
- Module 103: Containerization (25 questions)
- Module 104: Kubernetes (30 questions)
- Module 105: Data Pipelines (25 questions)
- Module 106: MLOps (30 questions)
- Module 107: GPU Computing (25 questions)
- Module 108: Monitoring (25 questions)
- Module 109: IaC (25 questions)
- Module 110: LLM Infrastructure (30 questions)
- Technology Versions Guide - Complete specifications for 100+ tools
- Curriculum Cross-Reference - Mapping to the Junior track
- Career Progression Guide - Engineer-to-Principal roadmap
| Module | Topic | Hours | Status | Quiz |
|---|---|---|---|---|
| 01 | Foundations | 50h | ✅ Complete (15 files) | ✅ 30Q |
| 02 | Cloud Computing | 50h | ✅ Complete (11 files) | ✨ +50Q |
| 03 | Containerization | 50h | ✅ Complete (14 files) | ✨ +25Q |
| 04 | Kubernetes | 50h | ✅ Complete (13 files) | ✨ +30Q |
| 05 | Data Pipelines | 50h | ✅ Complete (12 files) | ✨ +25Q |
| 06 | MLOps | 50h | ✅ Complete (12 files) | ✨ +30Q |
| 07 | GPU Computing | 50h | ✅ Complete (12 files) | ✨ +25Q |
| 08 | Monitoring & Observability | 50h | ✅ Complete (11 files) | ✨ +25Q |
| 09 | Infrastructure as Code | 50h | ✅ Complete (12 files) | ✨ +25Q |
| 10 | LLM Infrastructure | 50h | ✅ Complete (12 files) | ✨ +30Q |
| Project | Technologies | Duration | Files | Status |
|---|---|---|---|---|
| 01: Basic Model Serving | FastAPI + K8s + Monitoring | 30h | ~30 | ✅ Complete |
| 02: MLOps Pipeline | Airflow + MLflow + DVC | 40h | 30 | ✅ Complete |
| 03: LLM Deployment | vLLM + RAG + Vector DB | 50h | 47 | ✅ Complete |
Total Repository: 207 files | ~95,000+ lines of code | 500+ hours of learning content
If you've completed the Junior AI Infrastructure Engineer curriculum, you have ALL required prerequisites! ✅
The Junior curriculum covers:
- ✅ Python fundamentals & advanced concepts
- ✅ Linux/Unix command line mastery
- ✅ Git & version control workflows
- ✅ ML basics (PyTorch, TensorFlow)
- ✅ Docker & containerization
- ✅ Kubernetes introduction
- ✅ API development & databases
- ✅ Monitoring & cloud platforms
Duration: 440 hours (22 weeks part-time, 11 weeks full-time)
Haven't completed the Junior curriculum? Use our comprehensive Prerequisites Guide to:
- Check your readiness with detailed skill checklists
- Identify knowledge gaps
- Get personalized learning recommendations
- Run automated skill assessment
If self-studying, you must have:
- Python 3.9+ (intermediate level: OOP, async, testing, type hints)
- Linux/Unix CLI (bash scripting, processes, debugging)
- Git fundamentals (branching, merging, collaboration)
- ML basics (PyTorch/TensorFlow, training, inference, evaluation)
- Docker basics (images, containers, Compose)
- Kubernetes intro (pods, deployments, services)
Not sure if you're ready? Read the Prerequisites Guide for a detailed assessment.
```bash
# 1. Clone repository
git clone https://github.com/ai-infra-curriculum/ai-infra-engineer-learning.git
cd ai-infra-engineer-learning

# 2. Create virtual environment
python3.11 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# 3. Install dependencies
pip install -r requirements.txt

# 4. Start with Module 01
cd lessons/mod-101-foundations
cat README.md
```

Suggested learning path:
- Modules 01-02 (Foundations) - Start here if new to ML infrastructure
- Modules 03-04 (Core Infrastructure) - Docker and Kubernetes mastery
- Modules 05-06 (MLOps) - Data pipelines and ML operations
- Modules 07-08 (Advanced) - GPU computing and monitoring
- Modules 09-10 (Modern Stack) - IaC and LLM infrastructure
Detailed guide: GETTING_STARTED.md
Module 01: Foundations | 50 hours | 15 files
Build your foundation in ML infrastructure:
- ML infrastructure landscape and career paths
- Python environment setup and best practices
- ML frameworks (PyTorch, TensorFlow)
- Docker fundamentals and containerization
- REST API development with FastAPI
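To give a flavor of where this module lands, here is a minimal, illustrative FastAPI prediction service; the route names and the stand-in "model" are placeholders, not the module's actual exercises:

```python
# Minimal FastAPI serving sketch (illustrative; the real exercises differ).
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="demo-model-serving")

class PredictRequest(BaseModel):
    features: list[float]

class PredictResponse(BaseModel):
    prediction: float

@app.post("/predict", response_model=PredictResponse)
def predict(req: PredictRequest) -> PredictResponse:
    # Placeholder "model": sum of the inputs. A real service would load a
    # trained PyTorch/TensorFlow model once at startup and call it here.
    return PredictResponse(prediction=sum(req.features))

@app.get("/healthz")
def healthz() -> dict:
    # Liveness endpoint, later reused for container and Kubernetes probes.
    return {"status": "ok"}
```

Run it locally with `uvicorn main:app --reload` and send a `POST /predict` request to try it out.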
Module 02: Cloud Computing | 50 hours | 11 files
Master cloud platforms for ML:
- Cloud architecture for ML workloads
- AWS (EC2, S3, EKS, SageMaker)
- GCP (Compute Engine, GCS, GKE, Vertex AI)
- Azure (VMs, Blob Storage, AKS, Azure ML)
- Multi-cloud strategies and cost optimization (60-80% savings)
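As a small taste of the AWS material, the sketch below pushes a trained model artifact to S3 with boto3; the bucket name and paths are hypothetical placeholders:

```python
# Upload a model artifact to S3 (bucket/key names are made up for illustration).
import boto3

s3 = boto3.client("s3")  # credentials come from the environment or instance profile

def upload_model(local_path: str, bucket: str, key: str) -> str:
    """Upload a local model file and return its s3:// URI."""
    s3.upload_file(local_path, bucket, key)
    return f"s3://{bucket}/{key}"

if __name__ == "__main__":
    uri = upload_model("model.pt", "my-ml-artifacts", "models/resnet50/v1/model.pt")
    print(f"uploaded to {uri}")
```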
Module 03: Containerization | 50 hours | 14 files
Deep dive into containers:
- Docker architecture and best practices
- Multi-stage builds and optimization
- Docker Compose for multi-service applications
- Container registries and image management
- Security and vulnerability scanning
Module 04: Kubernetes | 50 hours | 13 files
Master Kubernetes for ML:
- Kubernetes architecture and components
- Deployments, Services, ConfigMaps, Secrets
- GPU resource management and scheduling
- Autoscaling (HPA, VPA, Cluster Autoscaler)
- Helm charts and GitOps with ArgoCD
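For example, requesting a GPU for an inference Deployment can be expressed in YAML or through the official Kubernetes Python client; here is a rough sketch of the latter (image, names, and namespace are placeholders):

```python
# Create a Deployment that requests one NVIDIA GPU via resources.limits.
from kubernetes import client, config

def create_gpu_deployment(namespace: str = "default") -> None:
    config.load_kube_config()  # uses your local kubeconfig
    container = client.V1Container(
        name="inference",
        image="my-registry/inference:latest",  # placeholder image
        resources=client.V1ResourceRequirements(limits={"nvidia.com/gpu": "1"}),
    )
    template = client.V1PodTemplateSpec(
        metadata=client.V1ObjectMeta(labels={"app": "inference"}),
        spec=client.V1PodSpec(containers=[container]),
    )
    spec = client.V1DeploymentSpec(
        replicas=1,
        selector=client.V1LabelSelector(match_labels={"app": "inference"}),
        template=template,
    )
    deployment = client.V1Deployment(
        metadata=client.V1ObjectMeta(name="inference"),
        spec=spec,
    )
    client.AppsV1Api().create_namespaced_deployment(namespace=namespace, body=deployment)
```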
Module 05: Data Pipelines | 50 hours | 12 files
Build robust data pipelines:
- Apache Airflow for workflow orchestration
- Data processing with Apache Spark
- Streaming data with Apache Kafka
- Data version control with DVC
- Data quality validation and monitoring
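A minimal Airflow DAG of the kind built in this module might look like the sketch below (Airflow 2.x assumed; the task bodies are placeholders):

```python
# Toy daily training DAG: extract -> train.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract() -> None:
    print("pull raw training data")


def train() -> None:
    print("train and register the model")


with DAG(
    dag_id="daily_training",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",  # Airflow 2.4+; older versions use schedule_interval
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    train_task = PythonOperator(task_id="train", python_callable=train)
    extract_task >> train_task  # train runs only after extract succeeds
```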
Module 06: MLOps | 50 hours | 12 files
Implement MLOps best practices:
- Experiment tracking with MLflow
- Model registry and versioning
- Feature stores and engineering
- CI/CD for ML models
- A/B testing and experimentation
- ML governance and best practices
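For instance, MLflow experiment tracking boils down to a few calls like these (the experiment name, parameters, and metric values are purely illustrative):

```python
# Log params, metrics, and an artifact for one training run.
import mlflow

mlflow.set_experiment("demo-experiment")

with mlflow.start_run():
    mlflow.log_param("learning_rate", 1e-3)
    mlflow.log_param("batch_size", 32)
    for epoch, loss in enumerate([0.9, 0.6, 0.4]):  # stand-in training curve
        mlflow.log_metric("train_loss", loss, step=epoch)
    mlflow.log_artifact("model.pt")  # assumes this file exists locally
```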
Module 07: GPU Computing | 50 hours | 12 files
Harness GPU power:
- CUDA programming fundamentals
- PyTorch GPU acceleration
- Distributed training (DDP, FSDP)
- Multi-GPU and multi-node training
- Model and pipeline parallelism
- GPU memory optimization
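A stripped-down single-node DistributedDataParallel script, launched via `torchrun`, illustrates the core pattern covered here (the toy model and loop are placeholders):

```python
# Launch with: torchrun --nproc_per_node=<num_gpus> train_ddp.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main() -> None:
    dist.init_process_group(backend="nccl")      # NCCL for GPU-to-GPU communication
    local_rank = int(os.environ["LOCAL_RANK"])   # set by torchrun per process
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(128, 10).cuda(local_rank)  # toy model
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for _ in range(10):  # toy training loop on random data
        x = torch.randn(32, 128, device=f"cuda:{local_rank}")
        loss = model(x).sum()
        optimizer.zero_grad()
        loss.backward()   # gradients are all-reduced across ranks here
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```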
Module 08: Monitoring & Observability | 50 hours | 11 files
Build comprehensive observability:
- Prometheus and Grafana
- Metrics, logs, and traces (OpenTelemetry)
- Distributed tracing with Jaeger
- Alerting and incident response
- Model performance monitoring
- SLIs, SLOs, and SLAs
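As a preview, exposing custom serving metrics for Prometheus to scrape can be as small as this sketch (the metric names are illustrative, not the module's exact dashboards):

```python
# Expose a request counter and latency histogram on :8000/metrics.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("inference_requests_total", "Total inference requests")
LATENCY = Histogram("inference_latency_seconds", "Inference latency in seconds")

@LATENCY.time()
def predict() -> float:
    REQUESTS.inc()
    time.sleep(random.uniform(0.01, 0.05))  # stand-in for model inference
    return 0.0

if __name__ == "__main__":
    start_http_server(8000)  # Prometheus scrapes http://localhost:8000/metrics
    while True:
        predict()
```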
Module 09: Infrastructure as Code | 50 hours | 12 files
Automate infrastructure:
- Terraform fundamentals and best practices
- Pulumi for multi-language IaC
- CloudFormation for AWS
- State management and modules
- Multi-environment deployments
- GitOps workflows
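Since Pulumi supports Python, a resource declaration from this module might look roughly like the sketch below (the bucket name and tags are made up; run it with `pulumi up` inside a Pulumi project):

```python
# Declare an S3 bucket for ML artifacts and export its name.
import pulumi
import pulumi_aws as aws

artifacts = aws.s3.Bucket(
    "ml-artifacts",
    tags={"team": "ml-platform", "managed-by": "pulumi"},
)

pulumi.export("artifact_bucket_name", artifacts.id)
```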
Module 10: LLM Infrastructure | 50 hours | 12 files
Master cutting-edge LLM infrastructure (2024-2025):
- LLM serving with vLLM and TensorRT-LLM
- RAG (Retrieval-Augmented Generation)
- Vector databases (Pinecone, Weaviate, Milvus)
- Model quantization (FP16, INT8)
- GPU optimization for inference
- Cost tracking and optimization
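For a first look at vLLM, offline batched generation takes only a few lines; the model name below is just an example and can be swapped for whatever the module uses:

```python
# Batched generation with vLLM (downloads model weights on first run).
from vllm import LLM, SamplingParams

llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")  # example model
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Explain what a vector database is."], params)
for out in outputs:
    print(out.outputs[0].text)
```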
Project 01: Basic Model Serving | ⭐ Beginner | 30 hours | ~30 files
Build a complete model serving system:
- FastAPI REST API for image classification
- Docker containerization with optimization
- Kubernetes deployment with monitoring
- Prometheus and Grafana dashboards
- CI/CD pipeline with GitHub Actions
Technologies: FastAPI, Docker, Kubernetes, PyTorch, Prometheus, Grafana
Project 02: MLOps Pipeline | ⭐⭐ Intermediate | 40 hours | 30 files
Create a production MLOps pipeline:
- Apache Airflow DAGs (data, training, deployment)
- MLflow experiment tracking and model registry
- DVC for data versioning
- Automated model deployment to Kubernetes
- Comprehensive monitoring and alerting
- CI/CD with automated testing
Technologies: Airflow, MLflow, DVC, PostgreSQL, Redis, MinIO, Kubernetes
Project 03: LLM Deployment | ⭐⭐⭐ Advanced | 50 hours | 47 files
Deploy cutting-edge LLM infrastructure:
- vLLM/TensorRT-LLM for optimized serving
- RAG system with vector database (Pinecone/ChromaDB/Milvus)
- Document ingestion pipeline (PDF, TXT, web)
- FastAPI with Server-Sent Events streaming
- Kubernetes with GPU support
- Cost tracking and optimization
- Comprehensive monitoring
Technologies: vLLM, LangChain, Vector DBs, FastAPI, Kubernetes + GPU, Transformers
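To make the RAG piece concrete, here is a toy retrieval step with ChromaDB; the documents and query are invented, and the real project layers document ingestion, chunking, and LLM generation on top:

```python
# Minimal similarity search over two toy documents.
import chromadb

client = chromadb.Client()  # in-memory; use PersistentClient(path=...) to persist
collection = client.create_collection("docs")

collection.add(
    ids=["doc1", "doc2"],
    documents=[
        "vLLM uses PagedAttention for efficient LLM serving.",
        "Vector databases store embeddings for similarity search.",
    ],
)

results = collection.query(query_texts=["How are embeddings stored?"], n_results=1)
print(results["documents"][0][0])  # most similar document for the first query
```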
All learning materials can be completed within free tier limits:
- AWS: 750 hours/month of t2.micro under the free tier (promotional credits vary by program)
- GCP: $300 credit (90 days)
- Azure: $200 credit (30 days)
GPU costs (optional, for advanced projects):
- On-demand: $1-3/hour
- Spot instances: $0.30-1/hour (70% savings)
- Estimated total: $50-150 for complete curriculum
Cost-saving tips:
- Use spot instances for training (60-90% savings)
- Leverage free tiers across multiple cloud providers
- Delete resources when not in use
- Use local development where possible
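As a rough, back-of-the-envelope check on those numbers (the rates are illustrative mid-points, not quotes from any provider):

```python
# Estimate GPU spend for the advanced projects at on-demand vs. spot rates.
gpu_hours = 50            # assumed GPU time for the advanced projects
on_demand_rate = 2.00     # $/hour, mid-range of the $1-3 figure above
spot_rate = 0.60          # $/hour, mid-range of the $0.30-1 figure above

print(f"on-demand: ${gpu_hours * on_demand_rate:.0f}")      # ~$100
print(f"spot:      ${gpu_hours * spot_rate:.0f}")           # ~$30
print(f"savings:   {1 - spot_rate / on_demand_rate:.0%}")   # ~70%
```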
Included in this repository:
- Comprehensive lesson materials with examples
- Code stubs with TODO comments for guided implementation
- Complete project specifications with architecture diagrams
- Quizzes and assessments for each module
- Best practices and design patterns
- Reading List: resources/reading-list.md
- Tools Guide: resources/tools.md
- References: resources/references.md
- FAQ: resources/faq.md
- Technology Versions Guide - Recommended versions for all tools and frameworks
- Curriculum Cross-Reference - Mapping between the Junior and Engineer tracks
- Career Progression Guide - Complete career ladder from Junior to Principal
AI Infrastructure Engineer
- Salary: $120,000 - $180,000
- Companies: Tech companies, AI startups, ML-focused organizations
- Demand: Very high (growing 35% year-over-year)
ML Platform Engineer
- Salary: $130,000 - $190,000
- Companies: Large tech firms, enterprises with ML teams
- Demand: High (specialized role)
MLOps Engineer
- Salary: $110,000 - $170,000
- Companies: All organizations doing ML at scale
- Demand: Very high (fastest-growing ML role)
Skills you'll gain:
- ✅ Kubernetes expertise with GPU scheduling
- ✅ End-to-end MLOps pipeline implementation
- ✅ LLM infrastructure and RAG systems
- ✅ Distributed training and GPU optimization
- ✅ Production monitoring and observability
- ✅ Cloud platform mastery (AWS, GCP, Azure)
- ✅ Infrastructure as Code with Terraform
- ✅ Cost optimization strategies
- Total Files: 207
- Estimated Lines: ~95,000+
- Modules: 10 (all complete)
- Projects: 3 (all complete)
- Learning Hours: 500+
- Technologies: 50+
Core Infrastructure: Docker, Kubernetes, Terraform, Helm, ArgoCD
ML & Data: PyTorch, TensorFlow, Apache Airflow, Apache Spark, Kafka, DVC
MLOps: MLflow, Feature Stores, Model Registry, CI/CD
LLM Infrastructure: vLLM, TensorRT-LLM, LangChain, Vector Databases (Pinecone, Milvus, ChromaDB)
Cloud Platforms: AWS (EC2, S3, EKS, SageMaker), GCP (GCE, GCS, GKE, Vertex AI), Azure (VMs, AKS, Azure ML)
Monitoring: Prometheus, Grafana, OpenTelemetry, Jaeger, ELK Stack
GPU Computing: CUDA, NCCL, Multi-GPU training, Distributed training
We welcome contributions! Please see CONTRIBUTING.md for:
- Bug reports and fixes
- Documentation improvements
- New exercises and examples
- Updated best practices
- Documentation: Start with GETTING_STARTED.md
- GitHub Discussions: Ask questions
- Issues: Report bugs
- Contact: ai-infra-curriculum@joshua-ferguson.com
This project is licensed under the MIT License - see LICENSE for details.
Upon completion, you should be able to:
- Deploy ML models to production with confidence
- Build complete MLOps pipelines from scratch
- Implement LLM infrastructure with RAG
- Optimize cloud costs by 60-80%
- Debug complex distributed systems
- Pass technical interviews for AI Infrastructure roles
- Confidently discuss trade-offs in system design
- Lead infrastructure projects at your organization
This curriculum prepares you for AI Infrastructure Engineer roles. For career progression:
1. Gain Experience (1-2 years)
   - Work on production ML systems
   - Handle incidents and on-call rotations
   - Contribute to open-source ML infrastructure projects
2. Advance to Senior Engineer (2-3 years total)
   - Our Senior AI Infrastructure Engineer curriculum (coming soon)
   - Lead larger projects and mentor juniors
   - Design complex systems
3. Become an Architect (4-6 years total)
   - Our AI Infrastructure Architect curriculum (coming soon)
   - Design enterprise ML platforms
   - Strategic technical leadership
Start your journey today!
Get Started | View Full Curriculum | Start Module 01
⭐ Star this repository if you find it valuable!
Share with others learning AI Infrastructure Engineering!
Maintained by the AI Infrastructure Curriculum Project. Contact: ai-infra-curriculum@joshua-ferguson.com
Happy Learning!