Fix NCCL broadcast error on CPU tensors in distributed inference by Pratham-Nayak1 · Pull Request #257 · mistralai/mistral-inference

Pratham-Nayak1 · 2025-10-01T01:32:29Z

This PR fixes a runtime error in distributed inference with the NCCL backend:
RuntimeError: No backend type associated with device type cpu

Root Cause:
When using NCCL, collective operations require CUDA tensors. The code attempted to run:
dist.broadcast(length_tensor, src=0)
while length_tensor was on CPU. This caused the runtime error on non-zero ranks.

Fix:
Before broadcasting, the small metadata tensor is moved to the local CUDA device if dist.get_backend() == "nccl". After the broadcast, it is moved back to the CPU so the Python integer can be extracted.
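For illustration, the approach described above could look roughly like the sketch below. This is not the exact diff from this PR: the helper names `broadcast_device` and `broadcast_length` are hypothetical, and reading `LOCAL_RANK` from the environment with a default of 0 assumes a torchrun-style launcher.

```python
import os


def broadcast_device(backend: str, local_rank: int) -> str:
    """Return the device string a tensor should be on before a collective call."""
    # NCCL only implements collectives for CUDA tensors; other backends
    # such as Gloo accept CPU tensors, so no move is needed for them.
    return f"cuda:{local_rank}" if backend == "nccl" else "cpu"


def broadcast_length(length: int, src: int = 0) -> int:
    """Broadcast a Python int from rank `src` to all ranks and return it.

    Assumes the default process group has already been initialized.
    """
    # Imported lazily so broadcast_device stays usable without PyTorch.
    import torch
    import torch.distributed as dist

    # Assumption: LOCAL_RANK is set by the launcher (torchrun does this).
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    device = torch.device(broadcast_device(dist.get_backend(), local_rank))

    # On ranks other than `src`, `length` is only a placeholder value.
    length_tensor = torch.tensor([length], dtype=torch.long, device=device)
    dist.broadcast(length_tensor, src=src)

    # .item() copies the value back to the host as a Python number.
    return int(length_tensor.item())
```

Keeping the device choice in a small pure function means that part can be checked without GPUs; the broadcast itself still needs a multi-process run to verify.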

Testing:
I do not have access to a Linux multi-GPU setup, so I could not reproduce the original crash.
Since the issue provides reproduction steps (#252), I’d appreciate it if maintainers or contributors could verify this fix in that environment.

Notes:
Only the small metadata tensor is moved, and only when the backend is NCCL, so NCCL performance is preserved and other backends are unaffected.
Fixes #252.

Pratham-Nayak1 · 2025-10-16T12:51:47Z

@kmk142789 Thanks for the review and approval! Sorry for the late reply. I really appreciate your time and feedback.

fix:use LOCAL_RANK and move broadcast tensor to correct GPU for NCCL

7673025

Pratham-Nayak1 wants to merge 1 commit into mistralai:main from Pratham-Nayak1:fix-NCCL-broadcast
