feat: Add mock servers for DGX H100, H200, and B200 systems #163
Conversation
Pull Request Overview
This PR adds comprehensive mock implementations for three additional NVIDIA GPU architectures (H100, H200, and B200) to support multi-architecture testing without requiring physical hardware.
- Implements complete mock servers for DGX H100, H200, and B200 systems with 8-GPU configurations
- Provides architecture-specific MIG profiles with realistic memory allocations and compute capabilities
- Adds fabric management integration and enhanced topology features for newer GPU architectures
Reviewed Changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated no comments.
File | Description |
---|---|
pkg/nvml/mock/dgxh100/dgxh100.go | Main mock server implementation for H100 with Hopper architecture (80GB memory, 9.0 compute capability) |
pkg/nvml/mock/dgxh100/mig-profile.go | H100-specific MIG profiles with enhanced memory allocations compared to A100 |
pkg/nvml/mock/dgxh200/dgxh200.go | H200 mock server with enhanced memory (141GB) and improved fabric capabilities |
pkg/nvml/mock/dgxh200/mig-profile.go | H200 MIG profiles with 76% more memory than H100 profiles |
pkg/nvml/mock/dgxb200/dgxb200.go | B200 mock server with Blackwell architecture (192GB memory, 10.0 compute capability) |
pkg/nvml/mock/dgxb200/mig-profile.go | B200 MIG profiles with massive memory allocations for next-generation AI workloads |
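For context, each per-architecture mock described in the table above follows the same 8-GPU server shape. A minimal, self-contained sketch of that shape follows, using simplified stand-in types rather than the actual go-nvml mock API (the `Memory`, `Device`, and `Server` definitions here are illustrative only):

```go
package main

import "fmt"

// Memory mirrors the shape of nvml.Memory in simplified form.
type Memory struct {
	Total uint64
}

// Device is a simplified stand-in for a per-architecture mock device.
type Device struct {
	Name         string
	Architecture string
	Memory       Memory
}

// Server is a simplified stand-in for an 8-GPU DGX mock server.
type Server struct {
	Devices [8]Device
}

// newServer builds an 8-GPU server from a single device template,
// analogous to how each dgx* package instantiates its devices.
func newServer(template Device) *Server {
	s := &Server{}
	for i := range s.Devices {
		s.Devices[i] = template
	}
	return s
}

func main() {
	h100 := Device{
		Name:         "NVIDIA H100 80GB HBM3",
		Architecture: "Hopper",
		Memory:       Memory{Total: 80 * 1024 * 1024 * 1024}, // 80 GiB
	}
	s := newServer(h100)
	fmt.Println(len(s.Devices), s.Devices[0].Memory.Total) // prints: 8 85899345920
}
```

The same template-plus-loop construction covers H200 and B200 by swapping the device values, which is the duplication the later "shared factory" discussion in this thread aims to remove.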
Force-pushed from e892f1c to 9252b27 (compare)
First pass looks good, I'll run some manual tests and provide feedback
Force-pushed from 9252b27 to 697c1d4 (compare)
pkg/nvml/mock/dgxb200/dgxb200.go (outdated)

```go
type Server struct {
	mock.Interface
	mock.ExtendedInterface
	Devices [8]nvml.Device
```
Don't the GB200 nodes have only 4 devices? Or is each one addressable as 2?
"GPU: 8 x NVIDIA B200 GPUs that provide 1,440 GB total GPU memory"
as read in https://docs.nvidia.com/dgx/dgxb200-user-guide/dgxb200-user-guide.pdf
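The quoted system total is consistent with the per-GPU value the mock eventually settles on (180 GB, per a later commit in this PR). A trivial arithmetic check:

```go
package main

import "fmt"

// perGPUMemoryGB derives per-GPU memory from the documented system total.
func perGPUMemoryGB(systemTotalGB, gpuCount int) int {
	return systemTotalGB / gpuCount
}

func main() {
	// 1,440 GB total across 8 GPUs, per the DGX B200 user guide.
	fmt.Println(perGPUMemoryGB(1440, 8)) // prints: 180
}
```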
I think the difference here is that this is a B200 and not a GB200 -- the latter has 4.
Ah right, but this PR is for the B200.
pkg/nvml/mock/dgxb200/dgxb200.go (outdated)

```go
	return device
}

func (d *Device) GetUUID() (string, nvml.Return) {
```
Why do we not set the mock functions as we do for the a100?
Agreed. @fabiendupont, let's do as https://github.com/NVIDIA/go-nvml/blob/main/pkg/nvml/mock/dgxa100/dgxa100.go#L195 and https://github.com/NVIDIA/go-nvml/blob/main/pkg/nvml/mock/dgxa100/dgxa100.go#L136 for the three: B200, H100, and H200.
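For reference, the dgxa100 style being pointed to wires behavior through overridable function fields on the mock device rather than defining methods on a new type. A simplified, self-contained sketch of that style (stand-in types only; the UUID format and `newDevice` helper are illustrative, not the actual package API):

```go
package main

import "fmt"

// Return is a simplified stand-in for nvml.Return.
type Return int

const Success Return = 0

// Device is a simplified stand-in for the generated mock device,
// which exposes overridable function fields such as GetUUIDFunc.
type Device struct {
	GetUUIDFunc func() (string, Return)
}

// GetUUID delegates to the function field, so tests can override it.
func (d *Device) GetUUID() (string, Return) {
	return d.GetUUIDFunc()
}

// newDevice wires the function fields at construction time, in the
// style of the dgxa100 package, instead of defining new methods.
func newDevice(index int) *Device {
	d := &Device{}
	d.GetUUIDFunc = func() (string, Return) {
		// Hypothetical deterministic UUID scheme for illustration.
		return fmt.Sprintf("GPU-00000000-0000-0000-0000-%012d", index), Success
	}
	return d
}

func main() {
	uuid, ret := newDevice(3).GetUUID()
	fmt.Println(uuid, ret) // prints: GPU-00000000-0000-0000-0000-000000000003 0
}
```

The advantage over a hand-written method is that callers can replace `GetUUIDFunc` per test to simulate errors or unusual values without a new type.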
pkg/nvml/mock/dgxb200/dgxb200.go (outdated)

```go
	CudaDriverVersion int
}

type Device struct {
```
Question: why do we need a new type? Are the properties not expected to be the same across different devices?
I think @fabiendupont's intent was to create a folder for each GPU type, given that the current A100 implementation is a folder named a100. If we are going to add more GPU types, then we should rename that folder to a generic name; that way we don't need to duplicate this much code, just a file per GPU type and MIG profile, reusing a core set of defined variables.
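The direction sketched here, a single shared factory parameterized by a per-GPU configuration, might look roughly like the following. All names (`GPUConfig`, `NewServer`) are hypothetical, not the actual refactoring:

```go
package main

import "fmt"

// GPUConfig captures the per-architecture values that differ between
// the A100, H100, H200, and B200 mocks. Field names are illustrative.
type GPUConfig struct {
	Name         string
	Architecture string
	MemoryBytes  uint64
	SMCount      int
}

// Device pairs a shared configuration with a per-device index.
type Device struct {
	Config GPUConfig
	Index  int
}

// Server holds the devices built from one configuration.
type Server struct {
	Devices []Device
}

// NewServer is the hypothetical shared factory: one implementation
// parameterized by configuration, instead of one package per GPU type.
func NewServer(cfg GPUConfig, deviceCount int) *Server {
	s := &Server{Devices: make([]Device, deviceCount)}
	for i := range s.Devices {
		s.Devices[i] = Device{Config: cfg, Index: i}
	}
	return s
}

func main() {
	h200 := GPUConfig{
		Name:         "NVIDIA H200",
		Architecture: "Hopper",
		MemoryBytes:  141 * 1024 * 1024 * 1024,
		SMCount:      132,
	}
	s := NewServer(h200, 8)
	fmt.Println(len(s.Devices), s.Devices[7].Config.Name) // prints: 8 NVIDIA H200
}
```

With this shape, adding a new GPU generation is one new `GPUConfig` value plus its MIG profile table, rather than a copied package.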
Establish baseline tests for the original A100 implementation before refactoring to shared architecture. This enables a proper TDD approach where we can detect regressions during refactoring.

Tests cover:
- Server creation and basic properties
- Device handling and indexing (8 devices)
- Device properties (name, architecture, memory, etc.)
- Device access by UUID and PCI bus ID
- MIG mode operations
- MIG profile configurations and access
- GPU instance lifecycle and placements
- Compute instance lifecycle
- Init/shutdown behavior
- Multiple device uniqueness
- A100-specific characteristics

All tests pass with the existing implementation at fd3e42f.

Signed-off-by: Fabien Dupont <[email protected]>
Pull Request Overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.
pkg/nvml/mock/dgxh200/dgxh200.go (outdated)

```go
MigMode:            nvml.DEVICE_MIG_ENABLE,
GpuInstances:       make(map[*GpuInstance]struct{}),
GpuInstanceCounter: 0,
MemoryInfo:         nvml.Memory{Total: 151397597184}, // 141GB for H200 (vs 80GB for H100)
```
The memory value is a raw literal that is hard to verify at a glance. 151397597184 bytes is exactly 141 * 1024^3 (141 GiB); writing it as a computed expression would make the intent explicit.
```diff
- MemoryInfo: nvml.Memory{Total: 151397597184}, // 141GB for H200 (vs 80GB for H100)
+ MemoryInfo: nvml.Memory{Total: 141 * 1024 * 1024 * 1024}, // 141GB for H200 (vs 80GB for H100)
```
pkg/nvml/mock/dgxb200/dgxb200.go (outdated)

```go
MigMode:            nvml.DEVICE_MIG_ENABLE,
GpuInstances:       make(map[*GpuInstance]struct{}),
GpuInstanceCounter: 0,
MemoryInfo:         nvml.Memory{Total: 206158430208}, // 192GB for B200 (massive memory)
```
206158430208 bytes is exactly 192 * 1024^3 (192 GiB), so the value is correct, but it should be written as a computed expression for consistency with the other mocks.
```diff
- MemoryInfo: nvml.Memory{Total: 206158430208}, // 192GB for B200 (massive memory)
+ MemoryInfo: nvml.Memory{Total: 192 * 1024 * 1024 * 1024}, // 192GB for B200 (massive memory)
```
pkg/nvml/mock/dgxh100/dgxh100.go (outdated)

```go
MigMode:            nvml.DEVICE_MIG_ENABLE,
GpuInstances:       make(map[*GpuInstance]struct{}),
GpuInstanceCounter: 0,
MemoryInfo:         nvml.Memory{Total: 85899345920}, // 80GB for H100
```
85899345920 bytes is exactly 80 * 1024^3 (80 GiB); writing it as a computed expression would make the intent explicit.
```diff
- MemoryInfo: nvml.Memory{Total: 85899345920}, // 80GB for H100
+ MemoryInfo: nvml.Memory{Total: 80 * 1024 * 1024 * 1024}, // 80GB for H100
```
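For the record, all three memory literals flagged in these review comments are exact GiB multiples; a quick check:

```go
package main

import "fmt"

// gib converts a whole number of GiB to bytes.
func gib(n uint64) uint64 {
	return n * 1024 * 1024 * 1024
}

func main() {
	// The literals used in the H100, H200, and B200 mocks, respectively.
	fmt.Println(gib(80) == 85899345920)   // H100: true
	fmt.Println(gib(141) == 151397597184) // H200: true
	fmt.Println(gib(192) == 206158430208) // B200: true
}
```

So the suggested change is purely about readability: the computed expressions produce byte-for-byte identical values.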
- Implement shared factory system in pkg/nvml/mock/shared/ to eliminate code duplication
- Add comprehensive A30 GPU configurations with MIG profiles (56 SMs, 1/2/4-slice support)
- Refactor dgxa100 to use shared factory while maintaining backward compatibility
- Create modular GPU configurations in shared/gpus/ for A100 and A30 families
- Add comprehensive documentation covering architecture and usage examples
- Maintain thread safety and proper NVML return codes
- Support all A100 variants (SXM4 40GB/80GB, PCIe 40GB/80GB) and A30 PCIe 24GB

Signed-off-by: Fabien Dupont <[email protected]>
Implements DGX H100 and H200 GPU mocks following the established shared factory pattern for consistency with existing A100/A30 implementations.

- Add H100 SXM5 80GB configuration with complete MIG profile support
- Add H200 SXM5 141GB configuration with complete MIG profile support
- Implement dgxh100 and dgxh200 packages using shared factory pattern
- Include all 7 MIG profiles (standard, REV1 media, REV2 double memory)
- Maintain full backward compatibility with legacy globals and type aliases
- Use NVIDIA-spec compliant memory allocations and SM distributions

Signed-off-by: Fabien Dupont <[email protected]>
Implements DGX B200 mock following the established shared factory pattern:

- Add B200 SXM5 180GB GPU configuration with Blackwell architecture
- Comprehensive MIG profiles matching NVIDIA specifications:
  * Memory allocations: 23GB, 45GB, 90GB, 180GB per NVIDIA MIG User Guide
  * REV1 (media extensions) and REV2 (expanded memory) profiles
  * Full P2P support in MIG mode (IsP2pSupported: 1)
  * 144 SMs total with 18 SMs per slice
- Complete DGX B200 server implementation with 8 GPUs
- Driver version 560.28.03, NVML 12.560.28.03, CUDA 12060
- Comprehensive test suite covering server, device, and MIG operations
- Backward compatible legacy global variables (MIGProfiles, MIGPlacements)

Memory values corrected from initial 192GB to 180GB based on official NVIDIA MIG User Guide specifications.

Signed-off-by: Fabien Dupont <[email protected]>
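One plausible reading of the listed B200 allocations (23/45/90/180 GB against a 180 GB total) is ceiling-rounded fractions over eight memory slices. This derivation is an assumption for illustration, not how NVIDIA specifies the profiles:

```go
package main

import "fmt"

// sliceMemoryGB returns total*slices/8 rounded up to a whole GB, a
// hypothetical derivation that happens to reproduce the listed values.
func sliceMemoryGB(totalGB, slices int) int {
	return (totalGB*slices + 7) / 8 // ceiling division by 8
}

func main() {
	for _, slices := range []int{1, 2, 4, 8} {
		fmt.Println(slices, sliceMemoryGB(180, slices))
	}
	// prints:
	// 1 23
	// 2 45
	// 4 90
	// 8 180
}
```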
…ntation

Updates the mock framework documentation to include all GPU generations:

- Add H100, H200, and B200 to architecture diagram and file structure
- Document all GPU specifications:
  * H100 SXM5 80GB (Hopper, 132 SMs, CUDA 9.0, P2P MIG support)
  * H200 SXM5 141GB (Hopper, 132 SMs, CUDA 9.0, P2P MIG support)
  * B200 SXM5 180GB (Blackwell, 144 SMs, CUDA 10.0, P2P MIG support)
- Add complete server model documentation with driver versions
- Expand MIG support section with detailed SM allocations and P2P capabilities
- Update usage examples to include all four GPU generations
- Add comprehensive testing instructions for all mock implementations
- Update backward compatibility section to reflect all generations

The documentation now accurately reflects the complete shared factory implementation with comprehensive coverage of Ampere, Hopper, and Blackwell GPU architectures.

Signed-off-by: Fabien Dupont <[email protected]>
Force-pushed from 697c1d4 to 51f3c75 (compare)
Added mock tests to have basic regression testing.
This commit adds comprehensive mock implementations for three additional NVIDIA GPU architectures to support multi-architecture testing:
Each mock server provides:
Key Features:
This enables comprehensive testing across multiple GPU architectures without requiring physical hardware.