Draft
51 commits
159f678
first CLAUDE.md to get started
wdvr Jun 28, 2025
6bf4d3d
updated CLAUDE.md by gcam updated CLAUDE.md by agent with plan
wdvr Jun 28, 2025
ce26325
first version - note CLAUDE generated, not tested
wdvr Jun 30, 2025
66f3aad
cli tool and cheap version
wdvr Jul 30, 2025
9a65ac5
fixes
wdvr Aug 4, 2025
f399b40
pod creation now working
wdvr Aug 8, 2025
873e143
update reservation and docs
wdvr Aug 8, 2025
e4828cc
allow float hours reservations & bug fixes
wdvr Aug 11, 2025
120f888
lots of fixes
wdvr Aug 12, 2025
97cf705
more
wdvr Aug 12, 2025
80e4cf0
claude code
wdvr Aug 12, 2025
779c02c
working claude/nvcc/env/zsh/more
wdvr Aug 13, 2025
bd70720
bug fixes and more
wdvr Aug 13, 2025
8eddfbd
devservers: Add backend.tf
seemethere Aug 13, 2025
8951532
reuse k8s client in lambdas
wdvr Aug 13, 2025
08256cc
h100 + doc updates
wdvr Aug 15, 2025
9d68fb1
update
wdvr Aug 15, 2025
29306cc
diagram
wdvr Aug 15, 2025
92a4e5c
jupyter and multiple gpu types
wdvr Aug 16, 2025
0157244
jupyter and multiple gpu types
wdvr Aug 16, 2025
46d4728
added one more B200 type, fixed 401 k8s api, upgrade to amazon linux …
wdvr Aug 18, 2025
111f276
bugfixes for amazon linux 2023
wdvr Aug 18, 2025
8c334ac
expiration fixes and more
wdvr Aug 18, 2025
fc63cb6
fix
wdvr Aug 18, 2025
008f234
fix path for archive
seemethere Aug 19, 2025
8fe0986
Switch to capacity-block for p5.48xlarge instances
seemethere Aug 19, 2025
ab0a90f
make placement group optional
seemethere Aug 20, 2025
8afcab2
add h100 instances and format
seemethere Aug 20, 2025
d0f1040
fix h100 node placement, use correct template
seemethere Aug 21, 2025
f76b218
fix format
seemethere Aug 21, 2025
c3b3bc7
note asg was fixed, remove explicit h100 instances
seemethere Aug 21, 2025
a43877b
eks: Be more permissive for claude access for bedrock
seemethere Aug 21, 2025
236bc83
fix alignment of gpu-dev --help
seemethere Aug 21, 2025
dcb7795
fix more example wrapping
seemethere Aug 21, 2025
3dbef69
cli: add ability to default to gpu-dev if profile exists
seemethere Aug 21, 2025
048298a
cli: add vscode remote output
seemethere Aug 21, 2025
bc03b1b
fix formatting
seemethere Aug 21, 2025
aa35d42
add agent forwarding to commands
seemethere Aug 22, 2025
d906272
format files with uv format
seemethere Aug 22, 2025
53e1656
add debugging + cleanup of expired pods
wdvr Aug 26, 2025
140caa8
update h100 LT and add B200
wdvr Aug 26, 2025
053c63b
update
wdvr Aug 26, 2025
906e1ad
persistent disks
wdvr Aug 26, 2025
a04fa1a
presistent disks, migration across AZ, b200
wdvr Aug 27, 2025
6a0488c
initial persistent disk adn docker move
wdvr Aug 29, 2025
c110b93
docker working
wdvr Aug 30, 2025
9784aef
fix
wdvr Aug 30, 2025
5e246d3
interactive cli, multinode torch distributed, efs drive,
wdvr Aug 30, 2025
416d2b8
run format
seemethere Aug 30, 2025
b1b7d4e
cli: Add check for valid ssh keys
seemethere Aug 30, 2025
680f290
interactive cli, multinode torch distributed, efs drive,
wdvr Aug 30, 2025
4 changes: 4 additions & 0 deletions .gitignore
@@ -69,3 +69,7 @@ aws/tools/cleanup-ssm/Cargo.lock

# These are backup files generated by rustfmt
aws/tools/cleanup-ssm/**/*.rs.bk

# Lambda build artifacts
terraform-gpu-devservers/lambda/*/package/
terraform-gpu-devservers/lambda/*.zip
170 changes: 170 additions & 0 deletions CLAUDE.md
@@ -0,0 +1,170 @@
# Agent notes

The first part of this doc is the dev's description of the repo. Everything under the 'AGENT SECTION' is for you, the agent, to record state, tricky things, what we're working on, and more.
This will help both you, the agent, and other agents down the road that share responsibility for managing this repo to navigate it.

## Agent restrictions

- NEVER run `terraform apply` or any destructive terraform commands
- You can run read-only terraform commands like `terraform plan`, `terraform state show`, etc.
- You can run AWS CLI commands for read-only resource fetching and analysis
- User will handle all infrastructure deployments themselves
- Note: We use OpenTofu, so user runs `opentofu apply` or `tf apply` locally (tf is aliased to opentofu)
- We use `k` for kubectl and have kubens configured to the gpu-dev namespace

## Development style

We like compact code, with comments when needed, but only if they add value. For example, a variable called 'number_of_threads' does not need a comment saying that it contains the number of threads.
We like tested code.

For frontend code we use yarn: `yarn format`, `yarn tsc`, and `yarn dev` to run the code, but leave running that last one up to the dev.
For terraform, we use opentofu; don't ever run tf apply directly. You're free to run tf state/plan and other non-breaking commands though.

**Python Code Style:**

- Always put imports at the top of the file, never inside functions or methods
- Group imports in standard order: standard library, third-party, local imports
- Use absolute imports when possible
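A minimal sketch of that layout — standard library imports first, at the top of the file, with third-party and local groups following. The helper below is illustrative, not actual repo code:

```python
# Standard library imports come first.
from datetime import datetime, timezone

# Third-party imports would follow here (e.g. boto3), then local imports.


def format_expiry(epoch: float) -> str:
    """Render an expiration timestamp as YYYY-MM-DD HH:MM:SS (UTC)."""
    return datetime.fromtimestamp(epoch, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
```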

## Content

- torchci - a next.js app containing a PyTorch CI tracker
- aws - a bunch of lambdas & amis that are used in the tf module
- terraform-aws-github-runner - the definition of this repo's tofu modules. These modules are used in another repo to be deployed.
- cli-tools - the home of the gpu-dev cli tool that is used for creating/listing/cancelling reservations

## Current challenge and WIP

Currently we're working on developer servers with GPUs in AWS. This means we'll need:

- a CLI tool for devs to reserve a server [DONE]
- a queue of open requests [DONE]
- a reservation for 2 EC2 H100 servers
- a way for devs to specify if they want 1/2/4/8 GPUs of a server [DONE]
- later, a way for devs to specify 2x8 GPUs, so they want a connected 2 server setup reserved for X hours
- we care about the NIC connection - NVLink, or as fast as possible within one region/subregion.
- a lambda to process items from the queue if servers are available [DONE]
- a managed k8s to reserve, start a pod, interactive, and reserve that one for X hours for the dev (configurable) [DONE]
- auth can be through GitHub public keys; all devs already have those exposed. This should be limited to devs with commit access to pytorch/pytorch who are also part of the metamates group on GitHub. [DONE]

# AGENT SECTION

## Issues I found with the description above

- I am not sure terraform-aws-github-runner is correctly described. Next time I go over this code for maintenance or adding something, I'll inform the user of what I think should change. This is not an active goal though, just a side quest.
- The user asked about NIC connections. I still need to figure out how fast and what's available at AWS. When I do, I'll update the section below:

## NIC explanation in AWS

**EFA (Elastic Fabric Adapter):**

- Low-latency, high-throughput networking for HPC/AI workloads
- 3200 Gbps bandwidth on p5.48xlarge instances
- RDMA support, bypasses kernel for direct hardware access
- Integrates with NVIDIA NCCL for multi-GPU communication
- **Critical limitation**: Cannot cross Availability Zones - all instances must be in same AZ
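As a hedged sketch, the environment variables typically set so NCCL uses EFA look like this — exact names and values depend on the libfabric/NCCL versions deployed, so verify against the AWS EFA documentation:

```python
def efa_nccl_env() -> dict:
    """Environment variables commonly set for NCCL over EFA (illustrative)."""
    return {
        "FI_PROVIDER": "efa",            # select the EFA libfabric provider
        "FI_EFA_USE_DEVICE_RDMA": "1",   # enable GPUDirect RDMA on p4d/p5
        "NCCL_DEBUG": "INFO",            # surface NCCL init/transport logs
    }
```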

**H100 Instance Performance (p5.48xlarge):**

- 8x NVIDIA H100 GPUs (80GB each = 640GB total GPU memory)
- Within an instance: GPUs use NVLink for direct communication
- Between instances: EFA provides fastest networking option
- Single AZ placement group recommended for best performance

**K8s Decision:** EKS with GPU-optimized EC2 node groups (Fargate has no GPU support)

## Implementation Status (Jan 11, 2025)

### ✅ Completed and Working

- **Infrastructure**: Dual-mode EKS with managed vs self-managed node groups for faster development
- **Networking**: Full DNS resolution and internet access for pods (CoreDNS + security groups fixed)
- **SSH Access**: Complete SSH server setup with proper package installation and daemon startup
- **Authentication**: GitHub public key fetching (ALL user keys, not just first one)
- **CLI Features**: Float hours support (e.g., --hours 0.25 for 15 minutes)
- **Reservation Display**: CLI list command shows formatted expiration times (YYYY-MM-DD HH:MM:SS)
- **Security Groups**: Full connectivity - kubelet (10250), control plane (443), DNS (53), NodePort (30000-32767)
- **Python CLI tool**: Commands: reserve, list, config with real-time polling
- **SQS + Lambda**: Async queue processing system with DynamoDB state tracking
- **Kubernetes**: Pod creation with GPU allocation, NodePort services, init containers
- **Expiry System**: Timestamp-based expiration tracking with historical records (TTL disabled)
- **DynamoDB**: Reservations kept as historical records, not auto-deleted
- **SSO role + setup instructions** - Implemented SSO role authentication and provided setup instructions
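The float-hours parsing mentioned above might look like this sketch — the 0.1–24 hour bounds here are assumptions for illustration, not the deployed limits:

```python
def parse_hours(value: str, min_hours: float = 0.1, max_hours: float = 24.0) -> float:
    """Parse the --hours argument, accepting floats like 0.25 (15 minutes).
    The default bounds are assumed for this sketch."""
    try:
        hours = float(value)
    except ValueError:
        raise ValueError(f"--hours must be a number, got {value!r}")
    if not min_hours <= hours <= max_hours:
        raise ValueError(f"--hours must be between {min_hours} and {max_hours}")
    return hours
```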

### 📋 Remaining Tasks

- **Rename G6 to L4** - Update G6 references to L4 (similar to T4 GPU type naming)
- **Make gpu-type case agnostic** - Allow case-insensitive GPU type parameters (e.g., h100, H100, HuNdred should all work)
- **Error on non-existing GPU type** - Error out if people ask for a non-existing GPU type
- **Error on too many GPUs** - Error out if people ask for more GPUs than available in node (8 for H100/B200, 4 for T4, etc.)
- **Increase /dev/shm for NCCL** - Bump /dev/shm space from 64MB for NCCL requirements (https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting.html#docker)
- **Fix GPU SKU validation** - Add proper error handling for non-existing/unavailable GPU types (e.g., user requesting A100 when only T4 available should get immediate error, not pending pod that will never schedule)
- **Set HuggingFace cache location** - Set HF_HOME or XDG_CACHE_HOME to /tmp or /workspace so HuggingFace doesn't fill up user home directories with model downloads
- **Add network drive (EFS)** - Implement 20TB EFS shared storage mounted at /shared with user folders
- **Add verbose CLI output** - More detailed status and progress information for debugging
- **Interactive CLI for cancel/edit** - Make `gpu-dev cancel` and `gpu-dev edit` interactive when no reservation ID specified - show list with up/down arrow selection
- **Default reservation edit/cancel** - Auto-select reservation if user only has one active
- **Custom Docker image scaffold** - Create Dockerfile with pre-installed packages (Jupyter, etc.)
- **Add a `gpu-dev availability` command** - Show how many GPUs of each type are available to reserve at the moment, and if 0, the estimated queue time
- **Production deployment** - Switch to p5.48xlarge instances when ready
- **Investigate NFS** - Research NFS integration for shared storage across pods
- **Persistent disk** - Implement persistent disk storage for user data across sessions
- **Validate CUDA version** - Add CUDA version validation and display in container startup
- **Validate NVIDIA driver version** - Display and validate NVIDIA driver version
- **Test wall messages** - Verify that wall message functionality works correctly
- **Validate if expiration works as expected** - Test and verify pod cleanup and reservation expiry process
- **Simplify code + clean up** - Refactor and clean up codebase for maintainability
- **Add Docker** - Install and configure Docker in development containers - maybe a `--docker` flag at reserve time that enables dind in the container if possible (feasibility to investigate)
- **Add Docker CI image run** - Allow `gpu-dev ci-debug <testurl>`, which downloads that Docker image and runs it
- **Add ghstack** - Install ghstack tool for GitHub stack management
- **Improve debugging and observability** - Add better CLI feedback for pod status, container logs, and error details. Current debugging experience is poor - users need kubectl/aws cli knowledge to debug issues. CLI should show:
- Real-time pod startup logs during `gpu-dev reserve`
- Container error messages when pods fail
- Image pull status and errors
- Resource allocation details
- More detailed error messages with troubleshooting hints
- **Add CloudWatch logs for pods** - Store pod logs in CloudWatch for better debugging and monitoring
- **Add tests for everything** - Implement comprehensive test suite for all components
- **Investigate multi node communication** - Research inter-node networking for multi-GPU setups
- **FQDN for devservers** - Set up proper domain names for development server access
- **Switch between H100/B200 GPU types** - Add `--gpu-type=b200` CLI option with separate queues per GPU type
- **GPU queue status command** - Add status command to show queue length per GPU type (eg, `gpu-dev queue-status`)
- **Jupyter notebook integration** - Add `--jupyter` flag to enable Jupyter notebook and TensorBoard access
- **Add user collaboration feature** - Add `--add-user <github_name>` flag to allow users to add someone to the server
- **Display bug** - CLI shows "G6" instead of "L4" in the availability table; likely resolves on prod release when the Lambda functions are updated with the new GPU type mappings
- **Future features**:
- Multi-server (16 GPU) reservations
- GitHub organization/team verification
- Reservation extensions
- Usage monitoring and quotas
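Several of the validation tasks above (case-insensitive GPU types, erroring on unknown types, erroring on oversized requests) could share one check. A hypothetical sketch — the node-size table is assumed, not taken from the deployed config:

```python
# Assumed GPUs-per-node for each type; the real mapping lives in the infra config.
GPU_NODE_SIZES = {"h100": 8, "b200": 8, "a100": 8, "t4": 4, "l4": 4}


def validate_request(gpu_type: str, gpu_count: int) -> str:
    """Normalize the GPU type to lowercase and reject impossible requests."""
    normalized = gpu_type.lower()
    if normalized not in GPU_NODE_SIZES:
        raise ValueError(
            f"Unknown GPU type {gpu_type!r}; valid types: {sorted(GPU_NODE_SIZES)}"
        )
    max_gpus = GPU_NODE_SIZES[normalized]
    if not 1 <= gpu_count <= max_gpus:
        raise ValueError(f"{normalized} nodes have at most {max_gpus} GPUs")
    return normalized
```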

## Current Working Architecture

**Infrastructure (us-east-2):**

- **Current**: 2x p4d.24xlarge instances (8 A100 GPUs each = 16 total GPUs)
- **Previous testing**: 2x g4dn.12xlarge instances (4 T4 GPUs each = 8 total GPUs)
- **Future**: 2x p5.48xlarge instances (8 H100 GPUs each = 16 total GPUs) when capacity available
- EKS cluster with GPU-optimized node groups
- NVIDIA device plugin for GPU resource exposure
- Single AZ deployment with cluster placement groups

**Reservation System:**

- SQS queue for async reservation requests
- Lambda functions for pod creation and expiry management
- DynamoDB for reservation and server state tracking
- Kubernetes pods with GPU resource allocation (1/2/4 GPUs)
- NodePort services for SSH access to pods
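A rough sketch of the kind of pod manifest the creation Lambda might build — the name, image, and labels here are placeholders, not the deployed values:

```python
def gpu_pod_manifest(reservation_id: str, gpu_count: int) -> dict:
    """Build a minimal GPU pod spec (illustrative shape only)."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": f"gpu-dev-{reservation_id}",
            "namespace": "gpu-dev",
            "labels": {"reservation-id": reservation_id},
        },
        "spec": {
            "containers": [{
                "name": "dev",
                "image": "example/gpu-dev:latest",  # placeholder image
                "ports": [{"containerPort": 22}],   # SSH, exposed via NodePort
                # GPUs are requested through the NVIDIA device plugin resource.
                "resources": {"limits": {"nvidia.com/gpu": str(gpu_count)}},
            }],
        },
    }
```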

**Authentication & Access:**

- GitHub username configuration for SSH key fetching
- Public key injection into pods via init containers
- Copy-pasteable SSH commands with NodePort access
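GitHub serves each user's public keys as plain text at `https://github.com/<user>.keys`. A small sketch of turning that response into an `authorized_keys` payload — the origin tag appended to each key is an assumption, not the repo's actual format:

```python
def github_keys_url(username: str) -> str:
    """GitHub exposes a user's public SSH keys as plain text at this URL."""
    return f"https://github.com/{username}.keys"


def authorized_keys(username: str, raw_keys: str) -> str:
    """Format the plain-text key list into an authorized_keys payload,
    tagging each key with its origin (tag format assumed)."""
    lines = [k.strip() for k in raw_keys.splitlines() if k.strip()]
    return "\n".join(f"{key} github:{username}" for key in lines)
```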

**CLI Tool:**

- Python CLI with config at `~/.gpu-dev-config`
- Commands: `reserve`, `list`, `config`
- Real-time polling until reservation is ready
3 changes: 3 additions & 0 deletions README.md
@@ -20,9 +20,12 @@ $ git clone --recursive https://github.com/pytorch/test-infra
│ └── websites # Several websites supported by TestInfra
│ ├── download.pytorch.org
│ └── metrics.pytorch.org
├── cli-tools # GPU development CLI tools
│ └── gpu-dev-cli # CLI for reserving GPU development servers
├── setup-ssh # SSH access setup to CI workers
├── stats # CI related stats committed automatically by a bot
├── terraform-aws-github-runner # Terraform modules and templates used in CI
├── terraform-gpu-devservers # GPU development server infrastructure
├── tools # Tools and scripts
| ├── clang-tidy-checks
| └── scripts
29 changes: 29 additions & 0 deletions claudehistory.md
@@ -0,0 +1,29 @@
> Read through claude.MD. Let's start by talking this through. Come up with some questions to ask on the project / repo.

⏺ I'll read through the CLAUDE.md file to understand the project and come up with relevant questions.

⏺ Read(CLAUDE.md)
⎿  Read 52 lines (ctrl+r to expand)

⏺ Ahoy matey! After reading through the CLAUDE.md, I can see ye be charting a course for some serious GPU infrastructure. This developer server project has some meaty technical challenges ahead! Let me anchor
down with some key questions to help navigate these waters:

Architecture & Scale Questions:
1. What's the target number of concurrent developers you expect to support? This affects queue sizing and server provisioning.
2. Should we start with a single AWS region (us-east-1) or plan multi-region from the start?
3. For the "2x8 GPU" multi-server setup - are we talking about EFA (Elastic Fabric Adapter) networking between instances?

Technical Implementation:
4. For the CLI tool - any preference between Python (easier integration) vs Rust (performance)?
5. What's your preferred queue technology - SQS, EventBridge, or something else?
6. For K8s - EKS with GPU node groups, or were you thinking something different?

Auth & Security:
7. The "metamates group" - is this a specific GitHub organization/team I should verify against?
8. Should reservation time limits be configurable per user or have fixed tiers?

Infrastructure Questions:
9. Budget constraints for the H100 instances? Those beauties don't come cheap!
10. Any compliance requirements (SOC2, etc.) that might affect the architecture?

Which of these areas would ye like to dive into first, or shall we start mapping out the overall architecture diagram?
142 changes: 142 additions & 0 deletions cli-tools/gpu-dev-cli/README.md
@@ -0,0 +1,142 @@
# GPU Developer CLI

A command-line tool for reserving and managing PyTorch GPU development servers.

## Features

- 🚀 Reserve 1, 2, or 4 GPUs (T4s for testing, H100s for production)
- 📋 List and manage your reservations
- 🔐 GitHub authentication with SSH key injection
- 📊 View cluster status and availability
- ⚡ Built on AWS EKS with Kubernetes pods

## Installation

```bash
# Clone the repository
git clone <repo-url>
cd cli-tools/gpu-dev-cli

# Install with Poetry
poetry install

# Or install with pip
pip install -e .
```

## Configuration

Configure your GitHub username for SSH key fetching:

```bash
# Set your GitHub username (required for SSH access)
gpu-dev config set github_user your-github-username

# View current configuration
gpu-dev config show
```

Configuration is stored at `~/.gpu-dev-config`:

```json
{
"github_user": "your-github-username"
}
```

**AWS Configuration**: The CLI uses your AWS credentials and automatically discovers the infrastructure resources.

## Usage

### Reserve GPUs

```bash
# Reserve 1 GPU for 8 hours (default)
gpu-dev reserve

# Reserve 2 GPUs for 4 hours
gpu-dev reserve --gpus 2 --hours 4

# Reserve 4 GPUs for 12 hours with a name
gpu-dev reserve --gpus 4 --hours 12 --name "multi-gpu-training"
```

### List Reservations

```bash
# List all your reservations
gpu-dev list

# List reservations by user
gpu-dev list --user username

# List only active reservations
gpu-dev list --status active
```

### Connect to Your Server

Once your reservation is active, you'll get an SSH command:

```bash
# Example output from successful reservation:
✅ Reservation complete!
📋 Reservation ID: abc12345-1234-5678-9abc-def012345678
🕐 Valid for: 4 hours
💻 Connect with: ssh -p 30508 [email protected]
```

Just copy and paste the SSH command to connect!


## GPU Options

**Testing Environment (g4dn.12xlarge instances):**
- **1 GPU**: Single T4 for development
- **2 GPUs**: Dual T4 setup
- **4 GPUs**: Full g4dn.12xlarge instance (4x T4)

**Production Environment (planned - p5.48xlarge instances):**
- **8 GPUs**: Full p5.48xlarge instance (8x H100)

## Authentication

The CLI requires:

1. **AWS credentials** configured (via `aws configure` or IAM role)
2. **GitHub username** configured (for SSH key fetching): `gpu-dev config set github_user your-username`
3. **Public SSH key** on your GitHub profile (used for server access)


## Development

```bash
# Install development dependencies
poetry install --with dev

# Run tests
poetry run pytest

# Format code
poetry run black .
poetry run isort .

# Type checking
poetry run mypy .
```

## Architecture

The CLI communicates with:

- **SQS Queue**: For async reservation processing
- **DynamoDB**: For reservation and server state tracking
- **Lambda Functions**: For pod creation and management
- **EKS Cluster**: For GPU pod scheduling
- **GitHub API**: For SSH public key fetching