Credit to Daniyal Asif from 'AI Discussions' for bringing this idea to light
Summary
Add multi-GPU topology detection and NCCL version compatibility checking. Currently env-doctor detects individual GPUs but doesn't analyze inter-GPU communication topology or validate NCCL against the CUDA toolkit.
Motivation
Multi-GPU training failures due to NCCL/topology misconfigurations are among the hardest issues to debug. Users often get cryptic NCCL errors without understanding whether their GPU interconnect or NCCL version is the
root cause.
Proposed Implementation
New Detector: `NCCLDetector`
- Register via `@DetectorRegistry.register("nccl")`
- Search for `libnccl.so*` in standard paths (`/usr/lib/x86_64-linux-gnu/`, `/usr/local/cuda/lib64/`, conda envs)
- Extract the NCCL version from the library or the `nccl.h` header
- Cross-reference NCCL version against CUDA toolkit version for known incompatibilities
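The detection step above could be sketched as follows. `ncclGetVersion()` is NCCL's real C API, but the search patterns and helper names here are illustrative assumptions, not env-doctor's actual code:

```python
import ctypes
import glob
import os

# Standard install locations for libnccl (illustrative, matching the paths above).
NCCL_SEARCH_PATTERNS = [
    "/usr/lib/x86_64-linux-gnu/libnccl.so*",
    "/usr/local/cuda/lib64/libnccl.so*",
    os.path.join(os.environ.get("CONDA_PREFIX", ""), "lib/libnccl.so*"),
]

def find_libnccl():
    """Return the first libnccl.so* match from the standard paths, or None."""
    for pattern in NCCL_SEARCH_PATTERNS:
        matches = sorted(glob.glob(pattern))
        if matches:
            return matches[0]
    return None

def decode_nccl_version(code):
    """Decode the integer filled in by ncclGetVersion(); for NCCL >= 2.9
    the encoding is major*10000 + minor*100 + patch (e.g. 22005 -> 2.20.5)."""
    return code // 10000, (code % 10000) // 100, code % 100

def nccl_version(lib_path):
    """Ask the library itself via the C API ncclGetVersion(int*)."""
    lib = ctypes.CDLL(lib_path)
    version = ctypes.c_int()
    if lib.ncclGetVersion(ctypes.byref(version)) != 0:  # 0 == ncclSuccess
        return None
    return decode_nccl_version(version.value)
```

Querying the loaded library is more reliable than parsing `nccl.h`, since the header may belong to a different install than the runtime `.so`; the header parse would be the fallback.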
Extend `NvidiaDriverDetector`
- Parse `nvidia-smi topo -m` output to detect GPU interconnect topology (PCIe, NVLink, NVSwitch)
- Add topology metadata: `metadata["topology"]` with the connection type between each GPU pair
- Flag mixed-generation GPU setups (e.g., A100 + V100) that cause peer-to-peer issues
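A best-effort parser for the matrix section of `nvidia-smi topo -m` might look like the sketch below. The exact column layout varies by driver version, so treat this as an assumption-laden illustration rather than a finished parser:

```python
def parse_topo_matrix(output):
    """Parse `nvidia-smi topo -m` output into {(gpu_a, gpu_b): link_type},
    e.g. {("GPU0", "GPU1"): "NV2"}. Only GPU<->GPU cells are kept; the
    CPU/NUMA-affinity columns and the trailing legend are ignored."""
    topology = {}
    lines = [l for l in output.strip().splitlines() if l.strip()]
    header = lines[0].split()
    # Column indices of the GPU columns in the header row.
    gpu_cols = [i for i, name in enumerate(header) if name.startswith("GPU")]
    for line in lines[1:]:
        fields = line.split()
        if not fields or not fields[0].startswith("GPU"):
            continue  # skips legend lines such as "X = Self"
        row_gpu = fields[0]
        for col in gpu_cols:
            link = fields[col + 1]  # +1: the row label occupies field 0
            if link != "X":  # "X" marks the diagonal (GPU vs. itself)
                topology[(row_gpu, header[col])] = link
    return topology
```

The resulting dict maps directly onto the proposed `metadata["topology"]`; link codes like `NV#` (NVLink), `PIX`/`PXB` (PCIe), and `SYS` (cross-socket) come straight from nvidia-smi's own legend.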
New CLI Integration
- Add topology and NCCL results to `env-doctor check` output
- Add an NCCL section to `env-doctor check --json` output
Compatibility Database
- Add `nccl_compatibility.json` mapping NCCL versions to supported CUDA versions
- Include known problematic NCCL + driver combinations
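A minimal sketch of what the lookup against `nccl_compatibility.json` could look like. The schema and the sample entries are a proposal, and the `known_bad` pairing shown is a hypothetical placeholder, not a documented incompatibility:

```python
# Proposed schema: NCCL minor versions mapped to supported CUDA majors,
# plus a list of known-bad NCCL + driver pairings (entries illustrative).
SAMPLE_DB = {
    "entries": [
        {"nccl": "2.18", "cuda": ["11", "12"]},
        {"nccl": "2.4", "cuda": ["10"]},
    ],
    "known_bad": [
        # Hypothetical example of a problematic pairing and its symptom.
        {"nccl": "2.4.8", "driver": "418.40", "issue": "hangs in all-reduce"},
    ],
}

def cuda_supported(db, nccl_minor, cuda_major):
    """True if this CUDA major version is listed for the given NCCL minor."""
    for entry in db["entries"]:
        if entry["nccl"] == nccl_minor:
            return cuda_major in entry["cuda"]
    return False  # unknown pairing: warn rather than silently pass
```

Returning `False` for unknown pairings keeps the detector conservative: env-doctor can surface a warning instead of asserting compatibility it has no data for.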
Acceptance Criteria
- `NCCLDetector` follows the existing detector pattern (`Detector` base class, `DetectionResult`)
- Results appear in `--json` and `--ci` output modes