Draft
51 commits
159f678
first CLAUDE.md to get started
wdvr Jun 28, 2025
6bf4d3d
updated CLAUDE.md by gcam updated CLAUDE.md by agent with plan
wdvr Jun 28, 2025
ce26325
first version - note CLAUDE generated, not tested
wdvr Jun 30, 2025
66f3aad
cli tool and cheap version
wdvr Jul 30, 2025
9a65ac5
fixes
wdvr Aug 4, 2025
f399b40
pod creation now working
wdvr Aug 8, 2025
873e143
update reservation and docs
wdvr Aug 8, 2025
e4828cc
allow float hours reservations & bug fixes
wdvr Aug 11, 2025
120f888
lots of fixes
wdvr Aug 12, 2025
97cf705
more
wdvr Aug 12, 2025
80e4cf0
claude code
wdvr Aug 12, 2025
779c02c
working claude/nvcc/env/zsh/more
wdvr Aug 13, 2025
bd70720
bug fixes and more
wdvr Aug 13, 2025
8eddfbd
devservers: Add backend.tf
seemethere Aug 13, 2025
8951532
reuse k8s client in lambdas
wdvr Aug 13, 2025
08256cc
h100 + doc updates
wdvr Aug 15, 2025
9d68fb1
update
wdvr Aug 15, 2025
29306cc
diagram
wdvr Aug 15, 2025
92a4e5c
jupyter and multiple gpu types
wdvr Aug 16, 2025
0157244
jupyter and multiple gpu types
wdvr Aug 16, 2025
46d4728
added one more B200 type, fixed 401 k8s api, upgrade to amazon linux …
wdvr Aug 18, 2025
111f276
bugfixes for amazon linux 2023
wdvr Aug 18, 2025
8c334ac
expiration fixes and more
wdvr Aug 18, 2025
fc63cb6
fix
wdvr Aug 18, 2025
008f234
fix path for archive
seemethere Aug 19, 2025
8fe0986
Switch to capacity-block for p5.48xlarge instances
seemethere Aug 19, 2025
ab0a90f
make placement group optional
seemethere Aug 20, 2025
8afcab2
add h100 instances and format
seemethere Aug 20, 2025
d0f1040
fix h100 node placement, use correct template
seemethere Aug 21, 2025
f76b218
fix format
seemethere Aug 21, 2025
c3b3bc7
note asg was fixed, remove explicit h100 instances
seemethere Aug 21, 2025
a43877b
eks: Be more permissive for claude access for bedrock
seemethere Aug 21, 2025
236bc83
fix alignment of gpu-dev --help
seemethere Aug 21, 2025
dcb7795
fix more example wrapping
seemethere Aug 21, 2025
3dbef69
cli: add ability to default to gpu-dev if profile exists
seemethere Aug 21, 2025
048298a
cli: add vscode remote output
seemethere Aug 21, 2025
bc03b1b
fix formatting
seemethere Aug 21, 2025
aa35d42
add agent forwarding to commands
seemethere Aug 22, 2025
d906272
format files with uv format
seemethere Aug 22, 2025
53e1656
add debugging + cleanup of expired pods
wdvr Aug 26, 2025
140caa8
update h100 LT and add B200
wdvr Aug 26, 2025
053c63b
update
wdvr Aug 26, 2025
906e1ad
persistent disks
wdvr Aug 26, 2025
a04fa1a
presistent disks, migration across AZ, b200
wdvr Aug 27, 2025
6a0488c
initial persistent disk adn docker move
wdvr Aug 29, 2025
c110b93
docker working
wdvr Aug 30, 2025
9784aef
fix
wdvr Aug 30, 2025
5e246d3
interactive cli, multinode torch distributed, efs drive,
wdvr Aug 30, 2025
416d2b8
run format
seemethere Aug 30, 2025
b1b7d4e
cli: Add check for valid ssh keys
seemethere Aug 30, 2025
680f290
interactive cli, multinode torch distributed, efs drive,
wdvr Aug 30, 2025
4 changes: 4 additions & 0 deletions .gitignore
@@ -69,3 +69,7 @@ aws/tools/cleanup-ssm/Cargo.lock

# These are backup files generated by rustfmt
aws/tools/cleanup-ssm/**/*.rs.bk

# Lambda build artifacts
terraform-gpu-devservers/lambda/*/package/
terraform-gpu-devservers/lambda/*.zip
170 changes: 170 additions & 0 deletions CLAUDE.md
@@ -0,0 +1,170 @@
# Agent notes

The first part of this doc is the dev's description of the repo. Everything under the 'AGENT SECTION' is for you, the agent, to record state, tricky things, what we're working on, and more.
This will help both you, the agent, and other agents down the road that share responsibility for managing this repo to navigate it.

## Agent restrictions

- NEVER run `terraform apply` or any destructive terraform commands
- You can run read-only terraform commands like `terraform plan`, `terraform state show`, etc.
- You can run AWS CLI commands for read-only resource fetching and analysis
- User will handle all infrastructure deployments themselves
- Note: We use OpenTofu, so user runs `opentofu apply` or `tf apply` locally (tf is aliased to opentofu)
- We use `k` for kubectl and have kubens configured to the gpu-dev namespace

## Development style

We like compact code, with comments when needed, but only if they add value. For example, a variable called 'number_of_threads' does not need a comment saying that it contains the number of threads.
We like tested code.

For frontend code we use yarn: `yarn format`, `yarn tsc`, and `yarn dev` to run the code, but leave running that last one up to the dev.
For terraform, we use opentofu; don't ever run tf apply directly. You're free to run tf state/plan and other non-breaking commands though.

**Python Code Style:**

- Always put imports at the top of the file, never inside functions or methods
- Group imports in standard order: standard library, third-party, local imports
- Use absolute imports when possible
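A minimal sketch of that layout — standard library imports first, at the top of the file, with third-party and local groups following. The helper below is illustrative, not actual repo code:

```python
# Standard library imports come first.
from datetime import datetime, timezone

# Third-party imports would follow here (e.g. boto3), then local imports.


def format_expiry(epoch: float) -> str:
    """Render an expiration timestamp as YYYY-MM-DD HH:MM:SS (UTC)."""
    return datetime.fromtimestamp(epoch, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
```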

## Content

- torchci - a next.js app containing a PyTorch CI tracker
- aws - a bunch of lambdas & amis that are used in the tf module
- terraform-aws-github-runner - the definition of this repo's tofu modules. These modules are used in another repo to be deployed.
- cli-tools - the home of the gpu-dev cli tool that is used for creating/listing/cancelling reservations

## Current challenge and WIP

Currently we're working on developer servers with GPUs in AWS. This means we'll need:

- a CLI tool for devs to reserve a server [DONE]
- a queue of open requests [DONE]
- a reservation for 2 EC2 H100 servers
- a way for devs to specify if they want 1/2/4/8 GPUs of a server [DONE]
- later, a way for devs to specify 2x8 GPUs, so they want a connected 2 server setup reserved for X hours
- we care about the NIC connection - NVLink, or as fast as possible within one region/subregion.
- a lambda to process items from the queue if servers are available [DONE]
- a managed k8s to reserve, start a pod, interactive, and reserve that one for X hours for the dev (configurable) [DONE]
- auth can be through GitHub public keys; all devs already have those exposed. This should be limited to devs with commit access to pytorch/pytorch who are also part of the metamates group on GitHub. [DONE]

# AGENT SECTION

## Issues I found with the description above

- I am not sure terraform-aws-github-runner is correctly described. Next time I go over this code for maintenance or adding something, I'll inform the user of what I think should change. This is not an active goal though, just a side quest.
- The user asked about NIC connections. I still need to figure out how fast and what's available at AWS. When I do, I'll update the section below:

## NIC explanation in AWS

**EFA (Elastic Fabric Adapter):**

- Low-latency, high-throughput networking for HPC/AI workloads
- 3200 Gbps bandwidth on p5.48xlarge instances
- RDMA support, bypasses kernel for direct hardware access
- Integrates with NVIDIA NCCL for multi-GPU communication
- **Critical limitation**: Cannot cross Availability Zones - all instances must be in same AZ
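As a hedged sketch, the environment variables typically set so NCCL uses EFA look like this — exact names and values depend on the libfabric/NCCL versions deployed, so verify against the AWS EFA documentation:

```python
def efa_nccl_env() -> dict:
    """Environment variables commonly set for NCCL over EFA (illustrative)."""
    return {
        "FI_PROVIDER": "efa",            # select the EFA libfabric provider
        "FI_EFA_USE_DEVICE_RDMA": "1",   # enable GPUDirect RDMA on p4d/p5
        "NCCL_DEBUG": "INFO",            # surface NCCL init/transport logs
    }
```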

**H100 Instance Performance (p5.48xlarge):**

- 8x NVIDIA H100 GPUs (80GB each = 640GB total GPU memory)
- Within an instance: GPUs use NVLink for direct communication
- Between instances: EFA provides fastest networking option
- Single AZ placement group recommended for best performance

**K8s Decision:** EKS with GPU-optimized EC2 node groups (Fargate has no GPU support)

## Implementation Status (Jan 11, 2025)

### ✅ Completed and Working

- **Infrastructure**: Dual-mode EKS with managed vs self-managed node groups for faster development
- **Networking**: Full DNS resolution and internet access for pods (CoreDNS + security groups fixed)
- **SSH Access**: Complete SSH server setup with proper package installation and daemon startup
- **Authentication**: GitHub public key fetching (ALL user keys, not just first one)
- **CLI Features**: Float hours support (e.g., --hours 0.25 for 15 minutes)
- **Reservation Display**: CLI list command shows formatted expiration times (YYYY-MM-DD HH:MM:SS)
- **Security Groups**: Full connectivity - kubelet (10250), control plane (443), DNS (53), NodePort (30000-32767)
- **Python CLI tool**: Commands: reserve, list, config with real-time polling
- **SQS + Lambda**: Async queue processing system with DynamoDB state tracking
- **Kubernetes**: Pod creation with GPU allocation, NodePort services, init containers
- **Expiry System**: Timestamp-based expiration tracking with historical records (TTL disabled)
- **DynamoDB**: Reservations kept as historical records, not auto-deleted
- **SSO role + setup instructions** - Implemented SSO role authentication and provided setup instructions
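The float-hours parsing mentioned above might look like this sketch — the 0.1–24 hour bounds here are assumptions for illustration, not the deployed limits:

```python
def parse_hours(value: str, min_hours: float = 0.1, max_hours: float = 24.0) -> float:
    """Parse the --hours argument, accepting floats like 0.25 (15 minutes).
    The default bounds are assumed for this sketch."""
    try:
        hours = float(value)
    except ValueError:
        raise ValueError(f"--hours must be a number, got {value!r}")
    if not min_hours <= hours <= max_hours:
        raise ValueError(f"--hours must be between {min_hours} and {max_hours}")
    return hours
```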

### 📋 Remaining Tasks

- **Rename G6 to L4** - Update G6 references to L4 (similar to T4 GPU type naming)
- **Make gpu-type case agnostic** - Allow case-insensitive GPU type parameters (e.g., h100, H100, HuNdred should all work)
- **Error on non-existing GPU type** - Error out if people ask for a non-existing GPU type
- **Error on too many GPUs** - Error out if people ask for more GPUs than available in node (8 for H100/B200, 4 for T4, etc.)
- **Increase /dev/shm for NCCL** - Bump /dev/shm space from 64MB for NCCL requirements (https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting.html#docker)
- **Fix GPU SKU validation** - Add proper error handling for non-existing/unavailable GPU types (e.g., user requesting A100 when only T4 available should get immediate error, not pending pod that will never schedule)
- **Set HuggingFace cache location** - Set HF_HOME or XDG_CACHE_HOME to /tmp or /workspace so HuggingFace doesn't fill up user home directories with model downloads
- **Add network drive (EFS)** - Implement 20TB EFS shared storage mounted at /shared with user folders
- **Add verbose CLI output** - More detailed status and progress information for debugging
- **Interactive CLI for cancel/edit** - Make `gpu-dev cancel` and `gpu-dev edit` interactive when no reservation ID specified - show list with up/down arrow selection
- **Default reservation edit/cancel** - Auto-select reservation if user only has one active
- **Custom Docker image scaffold** - Create Dockerfile with pre-installed packages (Jupyter, etc.)
- **Add a `gpu-dev availability` command** - Show how many GPUs of each type are available to reserve at the moment, and if 0, the estimated queue time
- **Production deployment** - Switch to p5.48xlarge instances when ready
- **Investigate NFS** - Research NFS integration for shared storage across pods
- **Persistent disk** - Implement persistent disk storage for user data across sessions
- **Validate CUDA version** - Add CUDA version validation and display in container startup
- **Validate NVIDIA driver version** - Display and validate NVIDIA driver version
- **Test wall messages** - Verify that wall message functionality works correctly
- **Validate if expiration works as expected** - Test and verify pod cleanup and reservation expiry process
- **Simplify code + clean up** - Refactor and clean up codebase for maintainability
- **Add Docker** - Install and configure Docker in development containers - maybe a `--docker` flag at reserve time that enables dind in the container if possible (feasibility to investigate)
- **Add Docker CI image run** - Allow `gpu-dev ci-debug <testurl>`, which downloads that Docker image and runs it
- **Add ghstack** - Install ghstack tool for GitHub stack management
- **Improve debugging and observability** - Add better CLI feedback for pod status, container logs, and error details. Current debugging experience is poor - users need kubectl/aws cli knowledge to debug issues. CLI should show:
- Real-time pod startup logs during `gpu-dev reserve`
- Container error messages when pods fail
- Image pull status and errors
- Resource allocation details
- More detailed error messages with troubleshooting hints
- **Add CloudWatch logs for pods** - Store pod logs in CloudWatch for better debugging and monitoring
- **Add tests for everything** - Implement comprehensive test suite for all components
- **Investigate multi node communication** - Research inter-node networking for multi-GPU setups
- **FQDN for devservers** - Set up proper domain names for development server access
- **Switch between H100/B200 GPU types** - Add `--gpu-type=b200` CLI option with separate queues per GPU type
- **GPU queue status command** - Add status command to show queue length per GPU type (eg, `gpu-dev queue-status`)
- **Jupyter notebook integration** - Add `--jupyter` flag to enable Jupyter notebook and TensorBoard access
- **Add user collaboration feature** - Add `--add-user <github_name>` flag to allow users to add someone to the server
- **Display bug** - CLI shows "G6" instead of "L4" in the availability table; likely resolves on prod release when the Lambda functions are updated with the new GPU type mappings
- **Future features**:
- Multi-server (16 GPU) reservations
- GitHub organization/team verification
- Reservation extensions
- Usage monitoring and quotas
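Several of the validation tasks above (case-insensitive GPU types, erroring on unknown types, erroring on oversized requests) could share one check. A hypothetical sketch — the node-size table is assumed, not taken from the deployed config:

```python
# Assumed GPUs-per-node for each type; the real mapping lives in the infra config.
GPU_NODE_SIZES = {"h100": 8, "b200": 8, "a100": 8, "t4": 4, "l4": 4}


def validate_request(gpu_type: str, gpu_count: int) -> str:
    """Normalize the GPU type to lowercase and reject impossible requests."""
    normalized = gpu_type.lower()
    if normalized not in GPU_NODE_SIZES:
        raise ValueError(
            f"Unknown GPU type {gpu_type!r}; valid types: {sorted(GPU_NODE_SIZES)}"
        )
    max_gpus = GPU_NODE_SIZES[normalized]
    if not 1 <= gpu_count <= max_gpus:
        raise ValueError(f"{normalized} nodes have at most {max_gpus} GPUs")
    return normalized
```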

## Current Working Architecture

**Infrastructure (us-east-2):**

- **Current**: 2x p4d.24xlarge instances (8 A100 GPUs each = 16 total GPUs)
- **Previous testing**: 2x g4dn.12xlarge instances (4 T4 GPUs each = 8 total GPUs)
- **Future**: 2x p5.48xlarge instances (8 H100 GPUs each = 16 total GPUs) when capacity available
- EKS cluster with GPU-optimized node groups
- NVIDIA device plugin for GPU resource exposure
- Single AZ deployment with cluster placement groups

**Reservation System:**

- SQS queue for async reservation requests
- Lambda functions for pod creation and expiry management
- DynamoDB for reservation and server state tracking
- Kubernetes pods with GPU resource allocation (1/2/4 GPUs)
- NodePort services for SSH access to pods
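A rough sketch of the kind of pod manifest the creation Lambda might build — the name, image, and labels here are placeholders, not the deployed values:

```python
def gpu_pod_manifest(reservation_id: str, gpu_count: int) -> dict:
    """Build a minimal GPU pod spec (illustrative shape only)."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": f"gpu-dev-{reservation_id}",
            "namespace": "gpu-dev",
            "labels": {"reservation-id": reservation_id},
        },
        "spec": {
            "containers": [{
                "name": "dev",
                "image": "example/gpu-dev:latest",  # placeholder image
                "ports": [{"containerPort": 22}],   # SSH, exposed via NodePort
                # GPUs are requested through the NVIDIA device plugin resource.
                "resources": {"limits": {"nvidia.com/gpu": str(gpu_count)}},
            }],
        },
    }
```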

**Authentication & Access:**

- GitHub username configuration for SSH key fetching
- Public key injection into pods via init containers
- Copy-pasteable SSH commands with NodePort access
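GitHub serves each user's public keys as plain text at `https://github.com/<user>.keys`. A small sketch of turning that response into an `authorized_keys` payload — the origin tag appended to each key is an assumption, not the repo's actual format:

```python
def github_keys_url(username: str) -> str:
    """GitHub exposes a user's public SSH keys as plain text at this URL."""
    return f"https://github.com/{username}.keys"


def authorized_keys(username: str, raw_keys: str) -> str:
    """Format the plain-text key list into an authorized_keys payload,
    tagging each key with its origin (tag format assumed)."""
    lines = [k.strip() for k in raw_keys.splitlines() if k.strip()]
    return "\n".join(f"{key} github:{username}" for key in lines)
```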

**CLI Tool:**

- Python CLI with config at `~/.gpu-dev-config`
- Commands: `reserve`, `list`, `config`
- Real-time polling until reservation is ready
3 changes: 3 additions & 0 deletions README.md
@@ -20,9 +20,12 @@ $ git clone --recursive https://github.com/pytorch/test-infra
│ └── websites # Several websites supported by TestInfra
│ ├── download.pytorch.org
│ └── metrics.pytorch.org
├── cli-tools # GPU development CLI tools
│ └── gpu-dev-cli # CLI for reserving GPU development servers
├── setup-ssh # SSH access setup to CI workers
├── stats # CI related stats committed automatically by a bot
├── terraform-aws-github-runner # Terraform modules and templates used in CI
├── terraform-gpu-devservers # GPU development server infrastructure
├── tools # Tools and scripts
| ├── clang-tidy-checks
| └── scripts
29 changes: 29 additions & 0 deletions claudehistory.md
@@ -0,0 +1,29 @@
> Read through claude.MD. Let's start by talking this through. Come up with some questions to ask on the project / repo.

⏺ I'll read through the CLAUDE.md file to understand the project and come up with relevant questions.

⏺ Read(CLAUDE.md)
⎿  Read 52 lines (ctrl+r to expand)

⏺ Ahoy matey! After reading through the CLAUDE.md, I can see ye be charting a course for some serious GPU infrastructure. This developer server project has some meaty technical challenges ahead! Let me anchor
down with some key questions to help navigate these waters:

Architecture & Scale Questions:
1. What's the target number of concurrent developers you expect to support? This affects queue sizing and server provisioning.
2. Should we start with a single AWS region (us-east-1) or plan multi-region from the start?
3. For the "2x8 GPU" multi-server setup - are we talking about EFA (Elastic Fabric Adapter) networking between instances?

Technical Implementation:
4. For the CLI tool - any preference between Python (easier integration) vs Rust (performance)?
5. What's your preferred queue technology - SQS, EventBridge, or something else?
6. For K8s - EKS with GPU node groups, or were you thinking something different?

Auth & Security:
7. The "metamates group" - is this a specific GitHub organization/team I should verify against?
8. Should reservation time limits be configurable per user or have fixed tiers?

Infrastructure Questions:
9. Budget constraints for the H100 instances? Those beauties don't come cheap!
10. Any compliance requirements (SOC2, etc.) that might affect the architecture?

Which of these areas would ye like to dive into first, or shall we start mapping out the overall architecture diagram?
142 changes: 142 additions & 0 deletions cli-tools/gpu-dev-cli/README.md
@@ -0,0 +1,142 @@
# GPU Developer CLI

A command-line tool for reserving and managing PyTorch GPU development servers.

## Features

- 🚀 Reserve 1, 2, or 4 GPUs (T4s for testing, H100s for production)
- 📋 List and manage your reservations
- 🔐 GitHub authentication with SSH key injection
- 📊 View cluster status and availability
- ⚡ Built on AWS EKS with Kubernetes pods

## Installation

```bash
# Clone the repository
git clone <repo-url>
cd cli-tools/gpu-dev-cli

# Install with Poetry
poetry install

# Or install with pip
pip install -e .
```

## Configuration

Configure your GitHub username for SSH key fetching:

```bash
# Set your GitHub username (required for SSH access)
gpu-dev config set github_user your-github-username

# View current configuration
gpu-dev config show
```

Configuration is stored at `~/.gpu-dev-config`:

```json
{
"github_user": "your-github-username"
}
```

**AWS Configuration**: The CLI uses your AWS credentials and automatically discovers the infrastructure resources.

## Usage

### Reserve GPUs

```bash
# Reserve 1 GPU for 8 hours (default)
gpu-dev reserve

# Reserve 2 GPUs for 4 hours
gpu-dev reserve --gpus 2 --hours 4

# Reserve 4 GPUs for 12 hours with a name
gpu-dev reserve --gpus 4 --hours 12 --name "multi-gpu-training"
```

### List Reservations

```bash
# List all your reservations
gpu-dev list

# List reservations by user
gpu-dev list --user username

# List only active reservations
gpu-dev list --status active
```

### Connect to Your Server

Once your reservation is active, you'll get an SSH command:

```bash
# Example output from successful reservation:
✅ Reservation complete!
📋 Reservation ID: abc12345-1234-5678-9abc-def012345678
🕐 Valid for: 4 hours
💻 Connect with: ssh -p 30508 [email protected]
```

Just copy and paste the SSH command to connect!


## GPU Options

**Testing Environment (g4dn.12xlarge instances):**
- **1 GPU**: Single T4 for development
- **2 GPUs**: Dual T4 setup
- **4 GPUs**: Full g4dn.12xlarge instance (4x T4)

**Production Environment (planned - p5.48xlarge instances):**
- **8 GPUs**: Full p5.48xlarge instance (8x H100)

## Authentication

The CLI requires:

1. **AWS credentials** configured (via `aws configure` or IAM role)
2. **GitHub username** configured (for SSH key fetching): `gpu-dev config set github_user your-username`
3. **Public SSH key** on your GitHub profile (used for server access)


## Development

```bash
# Install development dependencies
poetry install --with dev

# Run tests
poetry run pytest

# Format code
poetry run black .
poetry run isort .

# Type checking
poetry run mypy .
```

## Architecture

The CLI communicates with:

- **SQS Queue**: For async reservation processing
- **DynamoDB**: For reservation and server state tracking
- **Lambda Functions**: For pod creation and management
- **EKS Cluster**: For GPU pod scheduling
- **GitHub API**: For SSH public key fetching