AI Infrastructure Engineer Learning Track - Production ML infrastructure curriculum (2-4 years experience)

ai-infra-curriculum/ai-infra-engineer-learning

AI Infrastructure Engineer - Learning Path

Master AI Infrastructure Engineering through hands-on projects and practical learning

Prerequisites • Getting Started • Curriculum • Projects • Resources


🎯 Overview

This repository contains a complete, production-ready learning path for becoming an AI Infrastructure Engineer. Through comprehensive modules, real-world projects, and production-grade code stubs with educational TODO comments, you'll develop the skills needed to build, deploy, and maintain ML infrastructure at scale.

Repository Status: ✅ 100% COMPLETE - All modules and projects ready for learning!

What You'll Master

  • ✅ Build ML Infrastructure from scratch (Docker, Kubernetes, cloud platforms)
  • ✅ Deploy Production ML Systems with auto-scaling and comprehensive monitoring
  • ✅ Implement End-to-End MLOps pipelines (Airflow, MLflow, DVC)
  • ✅ Deploy Cutting-Edge LLM Infrastructure (vLLM, RAG, vector databases)
  • ✅ Scale Training with distributed systems and GPU clusters
  • ✅ Monitor and Troubleshoot complex ML systems in production
  • ✅ Optimize Costs across cloud providers (60-80% savings possible)

Why This Learning Path?

  • 🎓 Industry-Aligned: Based on actual job requirements from FAANG and top tech companies
  • 💻 Hands-On: Code stubs with TODO comments guide you through real implementations
  • 🏗️ Production-Ready: Learn patterns used at Netflix, Uber, Airbnb, OpenAI
  • 📊 Career-Focused: Directly maps to $120k-$180k AI Infrastructure Engineer roles
  • 🚀 Progressive: 10 modules building from basics to advanced LLM infrastructure
  • 🔥 Modern Stack: 2024-2025 technologies (vLLM, RAG, GPU optimization)

✨ What's New

Recently Added Content:

  • 📝 Comprehensive Quizzes for modules 102-110 (265+ questions)
    • Module 102: Cloud Computing (mid-module + final, 50 questions)
    • Module 103: Containerization (25 questions)
    • Module 104: Kubernetes (30 questions)
    • Module 105: Data Pipelines (25 questions)
    • Module 106: MLOps (30 questions)
    • Module 107: GPU Computing (25 questions)
    • Module 108: Monitoring (25 questions)
    • Module 109: IaC (25 questions)
    • Module 110: LLM Infrastructure (30 questions)
  • 📋 Technology Versions Guide - Complete specifications for 100+ tools
  • 🗺️ Curriculum Cross-Reference - Mapping to Junior track
  • 📈 Career Progression Guide - Engineer to Principal roadmap

📊 What's Included

10 Complete Learning Modules (130 Files)

Module | Topic | Hours | Status | Quiz
01 | Foundations | 50h | ✅ Complete (15 files) | ✅ 30Q
02 | Cloud Computing | 50h | ✅ Complete (11 files) | ✨ +50Q
03 | Containerization | 50h | ✅ Complete (14 files) | ✨ +25Q
04 | Kubernetes | 50h | ✅ Complete (13 files) | ✨ +30Q
05 | Data Pipelines | 50h | ✅ Complete (12 files) | ✨ +25Q
06 | MLOps | 50h | ✅ Complete (12 files) | ✨ +30Q
07 | GPU Computing | 50h | ✅ Complete (12 files) | ✨ +25Q
08 | Monitoring & Observability | 50h | ✅ Complete (11 files) | ✨ +25Q
09 | Infrastructure as Code | 50h | ✅ Complete (12 files) | ✨ +25Q
10 | LLM Infrastructure | 50h | ✅ Complete (12 files) | ✨ +30Q

3 Production-Grade Projects (77 Files)

Project | Technologies | Duration | Files | Status
01: Basic Model Serving | FastAPI + K8s + Monitoring | 30h | ~30 | ✅ Complete
02: MLOps Pipeline | Airflow + MLflow + DVC | 40h | 30 | ✅ Complete
03: LLM Deployment | vLLM + RAG + Vector DB | 50h | 47 | ✅ Complete

Total Repository: 207 files | ~95,000+ lines of code | 500+ hours of learning content
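
For planning purposes, the hours listed in the two tables above can be added up directly. The following is a rough sketch, not an official schedule: the weekly paces of 20 and 40 hours are assumptions chosen for illustration, and `total_hours` and `weeks_needed` are hypothetical helper names.

```python
# Hours copied from the module and project tables above.
MODULE_HOURS = [50] * 10        # ten modules at 50 hours each
PROJECT_HOURS = [30, 40, 50]    # Projects 01, 02 and 03


def total_hours(include_projects: bool = True) -> int:
    """Sum the listed hours, optionally leaving the projects out."""
    hours = sum(MODULE_HOURS)
    if include_projects:
        hours += sum(PROJECT_HOURS)
    return hours


def weeks_needed(hours: int, hours_per_week: int) -> int:
    """Whole weeks needed at a given weekly pace, rounded up."""
    if hours_per_week <= 0:
        raise ValueError("hours_per_week must be positive")
    return -(-hours // hours_per_week)  # ceiling division on integers


# Assumed paces: 20 h/week (part-time) and 40 h/week (full-time).
for pace in (20, 40):
    print(f"{pace} h/week: {weeks_needed(total_hours(), pace)} weeks")
```

The modules alone come to 500 hours, and the three projects add another 120.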


🎓 Prerequisites

Option 1: Complete Junior Curriculum (RECOMMENDED)

If you've completed the Junior AI Infrastructure Engineer curriculum, you have ALL required prerequisites! ✅

The Junior curriculum covers:

  • ✅ Python fundamentals & advanced concepts
  • ✅ Linux/Unix command line mastery
  • ✅ Git & version control workflows
  • ✅ ML basics (PyTorch, TensorFlow)
  • ✅ Docker & containerization
  • ✅ Kubernetes introduction
  • ✅ API development & databases
  • ✅ Monitoring & cloud platforms

Duration: 440 hours (22 weeks part-time, 11 weeks full-time)

Option 2: Self-Assessment

Haven't completed the Junior curriculum? Use our comprehensive Prerequisites Guide to:

  • Check your readiness with detailed skill checklists
  • Identify knowledge gaps
  • Get personalized learning recommendations
  • Run an automated skill assessment

Minimum Requirements

If self-studying, you must have:

  • Python 3.9+ (intermediate level: OOP, async, testing, type hints)
  • Linux/Unix CLI (bash scripting, processes, debugging)
  • Git fundamentals (branching, merging, collaboration)
  • ML basics (PyTorch/TensorFlow, training, inference, evaluation)
  • Docker basics (images, containers, Compose)
  • Kubernetes intro (pods, deployments, services)
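
A quick local check of the tooling side of these requirements might look like the sketch below. It is not the repository's own assessment tool. It only tests that Python is at least 3.9 and that the `git`, `docker` and `kubectl` commands are on your PATH, which says nothing about skill level. The tool list and the function names are assumptions made for this example.

```python
import shutil
import sys

MIN_PYTHON = (3, 9)
# Assumed command names; adjust if your system uses different ones.
REQUIRED_TOOLS = ["git", "docker", "kubectl"]


def python_ok(version=sys.version_info, minimum=MIN_PYTHON) -> bool:
    """True if the (major, minor) version meets the minimum."""
    return tuple(version[:2]) >= minimum


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which) -> list:
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    print("Python version OK:", python_ok())
    missing = missing_tools()
    print("Missing tools:", ", ".join(missing) if missing else "none")
```

The `which` parameter exists so the lookup can be swapped out in tests without touching the real PATH.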

👉 Not sure if you're ready? Read the Prerequisites Guide for detailed assessment.


🚀 Getting Started

Quick Start

# 1. Clone repository
git clone https://github.com/ai-infra-curriculum/ai-infra-engineer-learning.git
cd ai-infra-engineer-learning

# 2. Create virtual environment
python3.11 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# 3. Install dependencies
pip install -r requirements.txt

# 4. Start with Module 01
cd lessons/mod-101-foundations
cat README.md

Learning Path

  1. Modules 01-02 (Foundations) - Start here if new to ML infrastructure
  2. Modules 03-04 (Core Infrastructure) - Docker and Kubernetes mastery
  3. Modules 05-06 (MLOps) - Data pipelines and ML operations
  4. Modules 07-08 (Advanced) - GPU computing and monitoring
  5. Modules 09-10 (Modern Stack) - IaC and LLM infrastructure

Detailed guide: GETTING_STARTED.md


📖 Curriculum Overview

Module 01: Foundations ✅

50 hours | 15 files

Build your foundation in ML infrastructure:

  • ML infrastructure landscape and career paths
  • Python environment setup and best practices
  • ML frameworks (PyTorch, TensorFlow)
  • Docker fundamentals and containerization
  • REST API development with FastAPI

View Module 01 →


Module 02: Cloud Computing ✅

50 hours | 11 files

Master cloud platforms for ML:

  • Cloud architecture for ML workloads
  • AWS (EC2, S3, EKS, SageMaker)
  • GCP (Compute Engine, GCS, GKE, Vertex AI)
  • Azure (VMs, Blob Storage, AKS, Azure ML)
  • Multi-cloud strategies and cost optimization (60-80% savings)

View Module 02 →


Module 03: Containerization ✅

50 hours | 14 files

Deep dive into containers:

  • Docker architecture and best practices
  • Multi-stage builds and optimization
  • Docker Compose for multi-service applications
  • Container registries and image management
  • Security and vulnerability scanning

View Module 03 →


Module 04: Kubernetes ✅

50 hours | 13 files

Master Kubernetes for ML:

  • Kubernetes architecture and components
  • Deployments, Services, ConfigMaps, Secrets
  • GPU resource management and scheduling
  • Autoscaling (HPA, VPA, Cluster Autoscaler)
  • Helm charts and GitOps with ArgoCD
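
The HPA item above rests on a simple ratio. As a minimal sketch of the scaling rule described in the Kubernetes documentation: desired replicas = ceil(current replicas × current metric / target metric), with no change when the ratio is within a tolerance (0.1 by default), and the result clamped to the configured minimum and maximum. The function name and default bounds here are illustrative assumptions. The real controller also accounts for pod readiness, missing metrics and stabilization windows, none of which is modelled.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int = 1,
    max_replicas: int = 10,
    tolerance: float = 0.1,
) -> int:
    """Simplified HPA rule: scale in proportion to metric / target."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    ratio = current_metric / target_metric
    # Inside the tolerance band the replica count is left alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, desired))


# Example: 4 pods averaging 90% CPU against a 60% target.
print(desired_replicas(4, current_metric=90, target_metric=60))
```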

View Module 04 →


Module 05: Data Pipelines ✅

50 hours | 12 files

Build robust data pipelines:

  • Apache Airflow for workflow orchestration
  • Data processing with Apache Spark
  • Streaming data with Apache Kafka
  • Data version control with DVC
  • Data quality validation and monitoring

View Module 05 →


Module 06: MLOps ✅

50 hours | 12 files

Implement MLOps best practices:

  • Experiment tracking with MLflow
  • Model registry and versioning
  • Feature stores and engineering
  • CI/CD for ML models
  • A/B testing and experimentation
  • ML governance and best practices

View Module 06 →


Module 07: GPU Computing & Distributed Training ✅

50 hours | 12 files

Harness GPU power:

  • CUDA programming fundamentals
  • PyTorch GPU acceleration
  • Distributed training (DDP, FSDP)
  • Multi-GPU and multi-node training
  • Model and pipeline parallelism
  • GPU memory optimization

View Module 07 →


Module 08: Monitoring & Observability ✅

50 hours | 11 files

Build comprehensive observability:

  • Prometheus and Grafana
  • Metrics, logs, and traces (OpenTelemetry)
  • Distributed tracing with Jaeger
  • Alerting and incident response
  • Model performance monitoring
  • SLIs, SLOs, and SLAs
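
To make the SLO item concrete, here is a small sketch of error-budget arithmetic. It assumes an availability-style SLO expressed as a fraction (for example 0.999) and a 30-day window. Both helper names are hypothetical and are not taken from the module materials.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability for an SLO over a window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return (1.0 - slo_target) * window_days * 24 * 60


def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of a request-based error budget still unspent.

    A negative result means the budget has been overspent.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


print(f"{error_budget_minutes(0.999):.1f} minutes per 30 days")
print(f"{budget_remaining(0.999, 1_000_000, 250):.0%} of the budget left")
```

A 99.9% target over 30 days leaves about 43.2 minutes of budget, which is why each additional "nine" is expensive to operate.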

View Module 08 →


Module 09: Infrastructure as Code ✅

50 hours | 12 files

Automate infrastructure:

  • Terraform fundamentals and best practices
  • Pulumi for multi-language IaC
  • CloudFormation for AWS
  • State management and modules
  • Multi-environment deployments
  • GitOps workflows

View Module 09 →


Module 10: LLM Infrastructure ✅

50 hours | 12 files

Master cutting-edge LLM infrastructure (2024-2025):

  • LLM serving with vLLM and TensorRT-LLM
  • RAG (Retrieval-Augmented Generation)
  • Vector databases (Pinecone, Weaviate, Milvus)
  • Model quantization (FP16, INT8)
  • GPU optimization for inference
  • Cost tracking and optimization
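
The quantization item has a direct effect on GPU sizing. As a back-of-the-envelope sketch, memory for model weights is roughly parameter count × bytes per parameter. The 7-billion-parameter model below is a hypothetical example, and the estimate covers weights only, leaving out the KV cache, activations and framework overhead.

```python
# Bytes used to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_memory_gb(num_params: int, dtype: str) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes)."""
    if dtype not in BYTES_PER_PARAM:
        raise ValueError(f"unknown dtype: {dtype}")
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


HYPOTHETICAL_PARAMS = 7_000_000_000  # an assumed 7B-parameter model
for dtype in ("fp32", "fp16", "int8"):
    print(f"{dtype}: {weight_memory_gb(HYPOTHETICAL_PARAMS, dtype):.0f} GB")
```

Under these assumptions the weights take about 14 GB at FP16 and 7 GB at INT8.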

View Module 10 →


🛠️ Projects

Project 01: Basic Model Serving System ✅

⭐ Beginner | 30 hours | ~30 files

Build a complete model serving system:

  • FastAPI REST API for image classification
  • Docker containerization with optimization
  • Kubernetes deployment with monitoring
  • Prometheus and Grafana dashboards
  • CI/CD pipeline with GitHub Actions

Technologies: FastAPI, Docker, Kubernetes, PyTorch, Prometheus, Grafana

View Project 01 →


Project 02: End-to-End MLOps Pipeline ✅

⭐⭐ Intermediate | 40 hours | 30 files

Create a production MLOps pipeline:

  • Apache Airflow DAGs (data, training, deployment)
  • MLflow experiment tracking and model registry
  • DVC for data versioning
  • Automated model deployment to Kubernetes
  • Comprehensive monitoring and alerting
  • CI/CD with automated testing

Technologies: Airflow, MLflow, DVC, PostgreSQL, Redis, MinIO, Kubernetes

View Project 02 →


Project 03: LLM Deployment Platform ✅

⭐⭐⭐ Advanced | 50 hours | 47 files

Deploy cutting-edge LLM infrastructure:

  • vLLM/TensorRT-LLM for optimized serving
  • RAG system with vector database (Pinecone/ChromaDB/Milvus)
  • Document ingestion pipeline (PDF, TXT, web)
  • FastAPI with Server-Sent Events streaming
  • Kubernetes with GPU support
  • Cost tracking and optimization
  • Comprehensive monitoring
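
The retrieval step of a RAG system can be illustrated without any of the listed services. The toy sketch below is not the project's implementation: the three-dimensional "embeddings" are hand-written stand-ins for what an embedding model would produce, the in-memory list stands in for a vector database, and all function names are hypothetical.

```python
import math


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query_vec, index, k: int = 2) -> list:
    """Return the k texts whose vectors are most similar to the query."""
    ranked = sorted(
        index,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


def build_prompt(question: str, contexts: list) -> str:
    """Place the retrieved passages ahead of the question."""
    context_block = "\n".join(f"- {c}" for c in contexts)
    return f"Answer using only this context:\n{context_block}\n\nQuestion: {question}"


# Toy index of (text, vector) pairs with made-up vectors.
INDEX = [
    ("Passage about LLM serving.", [0.9, 0.1, 0.0]),
    ("Passage about workflow scheduling.", [0.0, 0.2, 0.9]),
    ("Passage about vector search.", [0.3, 0.9, 0.1]),
]

top = retrieve([1.0, 0.0, 0.0], INDEX, k=2)
print(build_prompt("How is the model served?", top))
```

In a real deployment the query vector would come from the same embedding model used at ingestion time, and the sort would be replaced by the vector database's nearest-neighbour search.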

Technologies: vLLM, LangChain, Vector DBs, FastAPI, Kubernetes + GPU, Transformers

View Project 03 →


💰 Cost Considerations

Cloud Costs

All learning materials can be completed within free tier limits:

  • AWS: 750 hours/month t2.micro + $300 credits (varies)
  • GCP: $300 credit (90 days)
  • Azure: $200 credit (30 days)

GPU costs (optional, for advanced projects):

  • On-demand: $1-3/hour
  • Spot instances: $0.30-1/hour (70% savings)
  • Estimated total: $50-150 for complete curriculum

Optimization Tips

  • Use spot instances for training (60-90% savings)
  • Leverage free tiers across multiple cloud providers
  • Delete resources when not in use
  • Use local development where possible
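
The spot-instance savings quoted above follow from the hourly rates. The sketch below works through that arithmetic using the low and high ends of the listed price ranges. The figure of 100 GPU hours is an assumption for illustration, not a measured requirement of the curriculum, and the helper names are hypothetical.

```python
def savings_fraction(on_demand_rate: float, spot_rate: float) -> float:
    """Fraction saved by paying the spot rate instead of on-demand."""
    if on_demand_rate <= 0:
        raise ValueError("on_demand_rate must be positive")
    return 1.0 - spot_rate / on_demand_rate


def gpu_cost(hours: float, hourly_rate: float) -> float:
    """Total cost in dollars for a number of GPU hours."""
    return hours * hourly_rate


GPU_HOURS = 100  # assumed usage, for illustration only
for label, on_demand, spot in [("low end", 1.00, 0.30), ("high end", 3.00, 1.00)]:
    print(
        f"{label}: on-demand ${gpu_cost(GPU_HOURS, on_demand):.0f}, "
        f"spot ${gpu_cost(GPU_HOURS, spot):.0f}, "
        f"saving {savings_fraction(on_demand, spot):.0%}"
    )
```

At $1.00 on-demand versus $0.30 spot the saving is 70%, which matches the figure in the list above. Actual spot prices vary by region, instance type and time.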

📚 Resources

Included Documentation

  • Comprehensive lesson materials with examples
  • Code stubs with TODO comments for guided implementation
  • Complete project specifications with architecture diagrams
  • Quizzes and assessments for each module
  • Best practices and design patterns

External Resources

Curriculum Documentation


🎯 Learning Outcomes & Career Impact

After Completion, You'll Be Qualified For:

AI Infrastructure Engineer

  • 💰 Salary: $120,000 - $180,000
  • 🏢 Companies: Tech companies, AI startups, ML-focused organizations
  • 📈 Demand: Very high (growing 35% year-over-year)

ML Platform Engineer

  • 💰 Salary: $130,000 - $190,000
  • 🏢 Companies: Large tech firms, enterprises with ML teams
  • 📈 Demand: High (specialized role)

MLOps Engineer

  • 💰 Salary: $110,000 - $170,000
  • 🏢 Companies: All organizations doing ML at scale
  • 📈 Demand: Very high (fastest growing ML role)

Skills You'll Demonstrate

  • ✅ Kubernetes expertise with GPU scheduling
  • ✅ End-to-end MLOps pipeline implementation
  • ✅ LLM infrastructure and RAG systems
  • ✅ Distributed training and GPU optimization
  • ✅ Production monitoring and observability
  • ✅ Cloud platform mastery (AWS, GCP, Azure)
  • ✅ Infrastructure as Code with Terraform
  • ✅ Cost optimization strategies


📊 Repository Statistics

  • Total Files: 207
  • Estimated Lines: ~95,000+
  • Modules: 10 (all complete)
  • Projects: 3 (all complete)
  • Learning Hours: 500+
  • Technologies: 50+

Technology Stack Covered

Core Infrastructure: Docker, Kubernetes, Terraform, Helm, ArgoCD

ML & Data: PyTorch, TensorFlow, Apache Airflow, Apache Spark, Kafka, DVC

MLOps: MLflow, Feature Stores, Model Registry, CI/CD

LLM Infrastructure: vLLM, TensorRT-LLM, LangChain, Vector Databases (Pinecone, Milvus, ChromaDB)

Cloud Platforms: AWS (EC2, S3, EKS, SageMaker), GCP (GCE, GCS, GKE, Vertex AI), Azure (VMs, AKS, Azure ML)

Monitoring: Prometheus, Grafana, OpenTelemetry, Jaeger, ELK Stack

GPU Computing: CUDA, NCCL, Multi-GPU training, Distributed training


🤝 Contributing

We welcome contributions! Please see CONTRIBUTING.md for:

  • Bug reports and fixes
  • Documentation improvements
  • New exercises and examples
  • Updated best practices

🆘 Getting Help


📜 License

This project is licensed under the MIT License - see LICENSE for details.


🌟 Success Metrics

Upon completion, you should be able to:

  • Deploy ML models to production with confidence
  • Build complete MLOps pipelines from scratch
  • Implement LLM infrastructure with RAG
  • Optimize cloud costs by 60-80%
  • Debug complex distributed systems
  • Pass technical interviews for AI Infrastructure roles
  • Confidently discuss trade-offs in system design
  • Lead infrastructure projects at your organization

🚀 Next Steps After Completion

This curriculum prepares you for AI Infrastructure Engineer roles. For career progression:

  1. Gain Experience (1-2 years)

    • Work on production ML systems
    • Handle incidents and on-call rotations
    • Contribute to open-source ML infrastructure projects
  2. Advance to Senior Engineer (2-3 years total)

    • Our Senior AI Infrastructure Engineer curriculum (coming soon)
    • Lead larger projects and mentor juniors
    • Design complex systems
  3. Become an Architect (4-6 years total)

    • Our AI Infrastructure Architect curriculum (coming soon)
    • Design enterprise ML platforms
    • Strategic technical leadership

Ready to Master AI Infrastructure Engineering?

Start your journey today!

📘 Get Started | 📚 View Full Curriculum | 🚀 Start Module 01


⭐ Star this repository if you find it valuable!

Share with others learning AI Infrastructure Engineering!


Maintained by the AI Infrastructure Curriculum Project
Contact: ai-infra-curriculum@joshua-ferguson.com

Happy Learning! 🎓🚀