[CI/Build] Test torchrun with 8 cards #27548
base: main
Conversation
Signed-off-by: 22quinn <[email protected]>
Documentation preview: https://vllm--27548.org.readthedocs.build/en/27548/
Code Review
This pull request adds a new CI test for torchrun with 8 GPUs to improve coverage for distributed inference, which is a great addition. The changes in torchrun_dp_example.py that allow configuration via environment variables are necessary for this. I found a code-quality issue in the example script where newly added module-level variables are shadowed by later assignments, which can be confusing. I've left a couple of suggestions to improve clarity and avoid potential future bugs.
    - VLLM_ALLOW_INSECURE_SERIALIZATION=1 RAY_DEDUP_LOGS=0 python3 rlhf_colocate.py
    - popd

- label: Distributed Tests (8 GPUs) # ?min
do we have the capacity to run this?
Seems not: https://github.com/vllm-project/ci-infra/blob/a64d73f396ba8629e8fb9c5e5b933a5f87d1edc2/buildkite/pipeline_generator/utils.py#L32
Is it possible to add 8-card capacity? @hl475
If we want 8-card capacity, maybe try with H100? Example PR: https://github.com/vllm-project/vllm/pull/27113/files
Purpose
Increase torchrun / external launcher mode CI coverage to prevent issues like #27502. Reverting #27502 would fail this test.
Test Plan
TP_SIZE=2 DP_SIZE=4 ENABLE_EP=1 torchrun --nproc-per-node=8 examples/offline_inference/torchrun_dp_example.py
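As a sanity check on the command above (my own illustration, not part of the PR), the product of the tensor-parallel and data-parallel sizes must equal the number of processes torchrun launches, since each rank in external-launcher mode maps to exactly one slot in the TP x DP grid:

```python
def check_parallel_layout(world_size: int, tp_size: int, dp_size: int) -> bool:
    # Every launched process must correspond to one (tp_rank, dp_rank) pair,
    # so the grid must exactly cover the world size.
    return tp_size * dp_size == world_size


# For the test-plan command: --nproc-per-node=8 with TP_SIZE=2, DP_SIZE=4.
assert check_parallel_layout(8, 2, 4)
```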
Test Result
Pass
Essential Elements of an Effective PR Description Checklist
Update supported_models.md and examples for a new model.