Credit to Daniyal Asif from 'AI Discussions' for bringing this idea to light
Summary
Add multi-GPU topology detection and NCCL version compatibility checking. Currently env-doctor detects individual GPUs but doesn't analyze inter-GPU communication topology or validate NCCL against the CUDA toolkit.
Motivation
Multi-GPU training failures due to NCCL/topology misconfigurations are among the hardest issues to debug. Users often get cryptic NCCL errors without understanding whether their GPU interconnect or NCCL version is the
root cause.
Proposed Implementation
New Detector: `NCCLDetector`
- Register via `@DetectorRegistry.register("nccl")`
- Search for `libnccl.so*` in standard paths (`/usr/lib/x86_64-linux-gnu/`, `/usr/local/cuda/lib64/`, conda envs)
- Extract the NCCL version from the library or the `nccl.h` header
- Cross-reference NCCL version against CUDA toolkit version for known incompatibilities
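The detection step above could be sketched as follows. `ncclGetVersion()` is NCCL's real C API, but the search patterns and helper names here are illustrative assumptions, not env-doctor's actual code:

```python
import ctypes
import glob
import os

# Standard install locations for libnccl (illustrative, matching the paths above).
NCCL_SEARCH_PATTERNS = [
    "/usr/lib/x86_64-linux-gnu/libnccl.so*",
    "/usr/local/cuda/lib64/libnccl.so*",
    os.path.join(os.environ.get("CONDA_PREFIX", ""), "lib/libnccl.so*"),
]

def find_libnccl():
    """Return the first libnccl.so* match from the standard paths, or None."""
    for pattern in NCCL_SEARCH_PATTERNS:
        matches = sorted(glob.glob(pattern))
        if matches:
            return matches[0]
    return None

def decode_nccl_version(code):
    """Decode the integer filled in by ncclGetVersion(); for NCCL >= 2.9
    the encoding is major*10000 + minor*100 + patch (e.g. 22005 -> 2.20.5)."""
    return code // 10000, (code % 10000) // 100, code % 100

def nccl_version(lib_path):
    """Ask the library itself via the C API ncclGetVersion(int*)."""
    lib = ctypes.CDLL(lib_path)
    version = ctypes.c_int()
    if lib.ncclGetVersion(ctypes.byref(version)) != 0:  # 0 == ncclSuccess
        return None
    return decode_nccl_version(version.value)
```

Querying the loaded library is more reliable than parsing `nccl.h`, since the header may belong to a different install than the runtime `.so`; the header parse would be the fallback.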
Extend `NvidiaDriverDetector`
- Parse `nvidia-smi topo -m` output to detect GPU interconnect topology (PCIe, NVLink, NVSwitch)
- Add topology metadata: `metadata["topology"]` with the connection type between each GPU pair
- Flag mixed-generation GPU setups (e.g., A100 + V100) that cause peer-to-peer issues
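A best-effort parser for the matrix section of `nvidia-smi topo -m` might look like the sketch below. The exact column layout varies by driver version, so treat this as an assumption-laden illustration rather than a finished parser:

```python
def parse_topo_matrix(output):
    """Parse `nvidia-smi topo -m` output into {(gpu_a, gpu_b): link_type},
    e.g. {("GPU0", "GPU1"): "NV2"}. Only GPU<->GPU cells are kept; the
    CPU/NUMA-affinity columns and the trailing legend are ignored."""
    topology = {}
    lines = [l for l in output.strip().splitlines() if l.strip()]
    header = lines[0].split()
    # Column indices of the GPU columns in the header row.
    gpu_cols = [i for i, name in enumerate(header) if name.startswith("GPU")]
    for line in lines[1:]:
        fields = line.split()
        if not fields or not fields[0].startswith("GPU"):
            continue  # skips legend lines such as "X = Self"
        row_gpu = fields[0]
        for col in gpu_cols:
            link = fields[col + 1]  # +1: the row label occupies field 0
            if link != "X":  # "X" marks the diagonal (GPU vs. itself)
                topology[(row_gpu, header[col])] = link
    return topology
```

The resulting dict maps directly onto the proposed `metadata["topology"]`; link codes like `NV#` (NVLink), `PIX`/`PXB` (PCIe), and `SYS` (cross-socket) come straight from nvidia-smi's own legend.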
New CLI Integration
- Add topology and NCCL results to `env-doctor check` output
- Add an NCCL section to `env-doctor check --json` output
Compatibility Database
- Add `nccl_compatibility.json` mapping NCCL versions to supported CUDA versions
- Include known problematic NCCL + driver combinations
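A minimal sketch of what the lookup against `nccl_compatibility.json` could look like. The schema and the sample entries are a proposal, and the `known_bad` pairing shown is a hypothetical placeholder, not a documented incompatibility:

```python
# Proposed schema: NCCL minor versions mapped to supported CUDA majors,
# plus a list of known-bad NCCL + driver pairings (entries illustrative).
SAMPLE_DB = {
    "entries": [
        {"nccl": "2.18", "cuda": ["11", "12"]},
        {"nccl": "2.4", "cuda": ["10"]},
    ],
    "known_bad": [
        # Hypothetical example of a problematic pairing and its symptom.
        {"nccl": "2.4.8", "driver": "418.40", "issue": "hangs in all-reduce"},
    ],
}

def cuda_supported(db, nccl_minor, cuda_major):
    """True if this CUDA major version is listed for the given NCCL minor."""
    for entry in db["entries"]:
        if entry["nccl"] == nccl_minor:
            return cuda_major in entry["cuda"]
    return False  # unknown pairing: warn rather than silently pass
```

Returning `False` for unknown pairings keeps the detector conservative: env-doctor can surface a warning instead of asserting compatibility it has no data for.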
Acceptance Criteria
- `NCCLDetector` follows the existing detector pattern (`Detector` base class, `DetectionResult`)
- Results appear in `--json` and `--ci` output modes