[Feature] Multi-GPU / NCCL Compatibility Checker #92

@mitulgarg

Credit to Daniyal Asif from 'AI Discussions' for bringing this idea to light

Summary

Add multi-GPU topology detection and NCCL version compatibility checking. Currently env-doctor detects individual GPUs but doesn't analyze inter-GPU communication topology or validate NCCL against the CUDA toolkit.

Motivation

Multi-GPU training failures due to NCCL/topology misconfigurations are among the hardest issues to debug. Users often get cryptic NCCL errors without understanding whether their GPU interconnect or NCCL version is the root cause.

Proposed Implementation

New Detector: NCCLDetector

  • Register via @DetectorRegistry.register("nccl")
  • Search for libnccl.so* in standard paths (/usr/lib/x86_64-linux-gnu/, /usr/local/cuda/lib64/, conda envs)
  • Extract NCCL version from library or nccl.h header
  • Cross-reference NCCL version against CUDA toolkit version for known incompatibilities
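
A rough sketch of the version-extraction step, assuming the version can be read either from a fully versioned library filename (e.g. libnccl.so.2.18.3) or from the NCCL_MAJOR / NCCL_MINOR / NCCL_PATCH defines in nccl.h. The function names and DEFAULT_LIB_DIRS are hypothetical. Wiring into DetectorRegistry / DetectionResult is left out because those interfaces are not shown in this issue, and conda environments would need additional search logic.

```python
import re
from pathlib import Path
from typing import Optional, Sequence, Tuple

Version = Tuple[int, int, int]

# Search locations taken from the proposal above (hypothetical constant name).
DEFAULT_LIB_DIRS = ("/usr/lib/x86_64-linux-gnu", "/usr/local/cuda/lib64")

_SO_VERSION = re.compile(r"libnccl\.so\.(\d+)\.(\d+)\.(\d+)$")
_HEADER_DEFINE = re.compile(
    r"^\s*#\s*define\s+NCCL_(MAJOR|MINOR|PATCH)\s+(\d+)", re.MULTILINE
)


def version_from_filename(name: str) -> Optional[Version]:
    """Parse 'libnccl.so.X.Y.Z'; return None for unversioned names."""
    match = _SO_VERSION.search(name)
    if match is None:
        return None
    major, minor, patch = (int(part) for part in match.groups())
    return (major, minor, patch)


def version_from_header(text: str) -> Optional[Version]:
    """Read the NCCL_MAJOR/MINOR/PATCH defines from the contents of nccl.h."""
    parts = {key: int(value) for key, value in _HEADER_DEFINE.findall(text)}
    if {"MAJOR", "MINOR", "PATCH"} <= parts.keys():
        return (parts["MAJOR"], parts["MINOR"], parts["PATCH"])
    return None


def find_nccl_version(lib_dirs: Sequence[str] = DEFAULT_LIB_DIRS) -> Optional[Version]:
    """Return the first NCCL version found in lib_dirs, or None."""
    for lib_dir in lib_dirs:
        directory = Path(lib_dir)
        if not directory.is_dir():
            continue
        for path in sorted(directory.glob("libnccl.so*")):
            # Follow symlinks such as libnccl.so.2 to the fully versioned file.
            version = version_from_filename(path.resolve().name)
            if version is not None:
                return version
    return None
```

Returning None rather than raising when nothing is found would let the detector report "NCCL not installed" as an ordinary result.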

Extend NvidiaDriverDetector

  • Parse nvidia-smi topo -m output to detect GPU interconnect topology (PCIe, NVLink, NVSwitch)
  • Add topology metadata: metadata["topology"] with connection type between each GPU pair
  • Flag mixed-generation GPU setups (e.g., A100 + V100) that can cause peer-to-peer issues
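
One possible way to turn the nvidia-smi topo -m matrix into per-pair metadata is sketched below. It assumes the usual layout: a header row of GPU names, one row per GPU, then a legend. parse_topology and link_kind are hypothetical names. NVSwitch detection is not attempted here, since the matrix reports NVLink connections as NV# codes and may not distinguish a switch on its own.

```python
import re
from typing import Dict, List, Optional, Tuple

_ANSI = re.compile(r"\x1b\[[0-9;]*m")  # the header row may carry terminal styling
_GPU = re.compile(r"^GPU\d+$")


def parse_topology(output: str) -> Dict[Tuple[str, str], str]:
    """Map each unordered GPU pair to its link code (e.g. 'NV2', 'PIX', 'SYS')."""
    header: Optional[List[str]] = None
    links: Dict[Tuple[str, str], str] = {}
    for raw_line in output.splitlines():
        tokens = _ANSI.sub("", raw_line).split()
        if not tokens or not _GPU.match(tokens[0]):
            continue  # blank lines, legend text, NIC rows
        if header is None:
            # First GPU line is the header; keep only the leading GPU columns.
            header = []
            for token in tokens:
                if not _GPU.match(token):
                    break
                header.append(token)
            continue
        row = tokens[0]
        if row not in header:
            continue
        cells = tokens[1:1 + len(header)]
        for col, cell in zip(header, cells):
            # Store each pair once (row before column) and skip the 'X' diagonal.
            if cell != "X" and header.index(row) < header.index(col):
                links[(row, col)] = cell
    return links


def link_kind(code: str) -> str:
    """Collapse a link code into a coarse category."""
    return "NVLink" if re.match(r"^NV\d+$", code) else "PCIe"
```

On a single-GPU system the matrix has no off-diagonal cells, so parse_topology returns an empty dict, which gives a natural place to skip the topology checks.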

New CLI Integration

  • Add topology and NCCL results to env-doctor check output
  • Add NCCL section to env-doctor check --json output

Compatibility Database

  • Add nccl_compatibility.json mapping NCCL versions to supported CUDA versions
  • Include known problematic NCCL + driver combinations
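
To make the database idea concrete, here is a minimal sketch of one possible shape for nccl_compatibility.json and a lookup against it. The schema (nccl_to_cuda, known_issues, driver_below) and the function names are assumptions for illustration, and the version numbers are placeholders, not real compatibility data.

```python
import json
from typing import Any, Dict, List, Optional, Tuple

# Hypothetical schema; every version number below is a placeholder,
# NOT real NCCL/CUDA/driver compatibility information.
EXAMPLE_DB: Dict[str, Any] = json.loads("""
{
  "nccl_to_cuda": {
    "1.2": ["10.0", "10.1"],
    "1.3": ["10.1", "11.0"]
  },
  "known_issues": [
    {"nccl": "1.3", "driver_below": "400.0", "note": "placeholder issue text"}
  ]
}
""")


def major_minor(version: str) -> str:
    """Reduce 'X.Y.Z' to 'X.Y' so patch releases share one entry."""
    return ".".join(version.split(".")[:2])


def as_tuple(version: str) -> Tuple[int, ...]:
    """Turn '450.80.02' into (450, 80, 2) for numeric comparison."""
    return tuple(int(part) for part in version.split("."))


def check_nccl_cuda(db: Dict[str, Any], nccl: str, cuda: str) -> Optional[bool]:
    """True/False if the NCCL version is listed, None if it is unknown."""
    supported = db["nccl_to_cuda"].get(major_minor(nccl))
    if supported is None:
        return None
    return major_minor(cuda) in supported


def known_issues(db: Dict[str, Any], nccl: str, driver: str) -> List[str]:
    """Collect notes for issues matching this NCCL version and an older driver."""
    return [
        issue["note"]
        for issue in db["known_issues"]
        if issue["nccl"] == major_minor(nccl)
        and as_tuple(driver) < as_tuple(issue["driver_below"])
    ]
```

Distinguishing "unknown" (None) from "incompatible" (False) keeps the checker from raising false alarms for NCCL releases the database has not catalogued yet.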

Acceptance Criteria

  • NCCLDetector follows existing detector pattern (Detector base class, DetectionResult)
  • GPU topology detected and displayed for multi-GPU systems
  • NCCL version cross-referenced against CUDA toolkit
  • Known incompatibilities flagged with actionable recommendations
  • Single-GPU systems gracefully skip topology checks
  • Results included in --json and --ci output modes

Labels: enhancement (New feature or request)
