RFC-0044: Make c10d tests backend agnostic #78
# [RFC] Refactor test_c10d_nccl.py to Support Backend-Agnostic Testing for PyTorch

**Authors:**
* @harikodali

## **Summary**
We propose splitting `test_c10d_nccl.py` into two files: `test_c10d_nccl.py`, which retains the NCCL-specific tests, and a new `test_c10d_backend.py` containing backend-agnostic tests that can run on multiple device backends (e.g., NCCL, HCCL, XCCL) by dynamically detecting the device type and corresponding backend at runtime. This will extend test coverage to non-CUDA accelerators such as Intel Gaudi (HPU) and Intel XPU in the official PyTorch repository.

## **Motivation**
The `test_c10d_nccl.py` file contains tests for collectives, process group initialization, Distributed Data Parallel (DDP), and other NCCL-specific functionality. While these tests are critical for validating the CUDA-based NCCL backend, they are not reusable for other backends such as Intel's HCCL for Gaudi devices, Intel's XCCL for XPU, or the backends of other non-CUDA accelerators.
To address this, we propose refactoring `test_c10d_nccl.py` to extract common, backend-agnostic tests into a new `test_c10d_backend.py` file. This file will dynamically detect the device type and backend at runtime, allowing tests to run on any supported accelerator (e.g., CUDA, HPU, XPU) without hardcoding backend-specific logic. This approach aligns with prior efforts to generalize PyTorch tests for non-CUDA devices, enabling broader test coverage and native support for accelerators like HPU/XPU in the official PyTorch repository.

### **Examples**
The following test classes in `test_c10d_nccl.py` can be generalized to run on multiple backends:
* `RendezvousEnvTest`, `TimeoutTest`, `ProcessGroupNCCLInitTest`, `DistributedDataParallelTest`, `WorkHookTest`, `CommTest`, `NcclProcessGroupWithDispatchedCollectivesTests`, `LargeCommTest`, `SparseCollective`: These tests can be refactored to use device-agnostic APIs.

Currently, these tests are hardcoded for NCCL and CUDA, using decorators like `@requires_nccl()` or explicit `"nccl"` backend references, which block execution on non-CUDA devices.
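As an illustration, the hardcoded `"cuda"`/`"nccl"` pair could be replaced by a small resolution step performed at runtime. The sketch below is plain Python; the mapping table and the helper name `get_dist_backend` are hypothetical stand-ins for what `torch.accelerator` and `torch.distributed.get_default_backend_for_device()` would provide in the real refactor:

```python
# Hypothetical sketch of device-to-backend resolution. The mapping and helper
# name are illustrative, not part of the PyTorch API; in the actual refactor
# this role is played by torch.distributed.get_default_backend_for_device().
DEVICE_TO_BACKEND = {
    "cuda": "nccl",  # NVIDIA GPUs
    "hpu": "hccl",   # Intel Gaudi
    "xpu": "xccl",   # Intel XPU
    "cpu": "gloo",   # CPU fallback
}

def get_dist_backend(device_type: str) -> str:
    """Return the default distributed backend for a device type."""
    try:
        return DEVICE_TO_BACKEND[device_type]
    except KeyError:
        raise ValueError(f"no default distributed backend for {device_type!r}")

# A refactored test would then initialize its process group generically, e.g.:
#   backend = get_dist_backend(device_type)
#   dist.init_process_group(backend=backend, ...)
```

With such a helper, a test never mentions `"nccl"` directly; the same test body runs unchanged on any device whose backend the resolution step can supply.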

## **Proposed Implementation**
We propose a staged approach to refactoring `test_c10d_nccl.py` and introducing `test_c10d_backend.py`:

1. **Extract Backend-Agnostic Tests**:
- Move the identified test classes (`RendezvousEnvTest`, `TimeoutTest`, etc.) from `test_c10d_nccl.py` to `test_c10d_backend.py`.
   - Replace NCCL-specific decorators (e.g., `@requires_nccl()`) with a generalized `@requires_accelerator_dist_backend` decorator that checks backend availability dynamically.

2. **Dynamic Device and Backend Detection**:
- Use `torch.accelerator.current_device()` and runtime backend detection (e.g., `torch.distributed.get_default_backend_for_device()`) to substitute hardcoded `"cuda"` and `"nccl"` references with the current device type and corresponding backend.

3. **Generalize Test Logic**:
- Refactor test logic to use abstract backend APIs instead of NCCL-specific calls.
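The generalized decorator from step 1 can be sketched in plain Python. Here `backend_available` and the module-level registry are hypothetical stand-ins for whatever runtime availability check PyTorch exposes (e.g., `torch.distributed.is_nccl_available()` for NCCL); the real decorator would query those APIs instead:

```python
import functools
import unittest

# Hypothetical availability registry for this sketch only; a real
# implementation would call into torch.distributed availability checks
# (e.g., dist.is_nccl_available()) rather than consult a static set.
_AVAILABLE_BACKENDS = {"gloo"}  # pretend only gloo is built in this sketch

def backend_available(backend: str) -> bool:
    """Stand-in for a runtime backend-availability check."""
    return backend in _AVAILABLE_BACKENDS

def requires_accelerator_dist_backend(backend: str):
    """Skip the decorated test unless `backend` is available at runtime.

    Sketch of the proposed replacement for @requires_nccl() and friends.
    """
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if not backend_available(backend):
                raise unittest.SkipTest(f"{backend} backend not available")
            return fn(*args, **kwargs)
        return wrapper
    return decorator

@requires_accelerator_dist_backend("gloo")
def test_collective_runs():
    return "ran"

@requires_accelerator_dist_backend("nccl")
def test_nccl_only():
    return "ran"
```

Because availability is resolved when the test runs rather than when the module is written, the same decorated test skips cleanly on machines without the backend instead of being excluded from the file altogether.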


### **Metrics**
- **Adoption**: Enable other non-CUDA devices to use `test_c10d_backend.py` in their CI pipelines, increasing participation in PyTorch’s test ecosystem.

### **Additional Context**
This proposal builds on prior efforts to generalize PyTorch tests for non-CUDA devices:
- Removed `@onlyNativeDevice` restrictions: [PR #128584](https://github.com/pytorch/pytorch/pull/128584)
- Generalized Dynamo content: [PR #130714](https://github.com/pytorch/pytorch/pull/130714)
- Generalized Distributed content: [PR #131758](https://github.com/pytorch/pytorch/pull/131758)
- Generalized FSDP content: [PR #133209](https://github.com/pytorch/pytorch/pull/133209)

These efforts enabled native test support for Intel Gaudi (HPU) by dynamically substituting device and backend types, allowing tests to run seamlessly on other devices. This RFC extends the same philosophy to distributed backend tests, ensuring broader backend support.