[CI/Build] Test torchrun with 8 cards #27548
base: main
Conversation
Signed-off-by: 22quinn <[email protected]>
Documentation preview: https://vllm--27548.org.readthedocs.build/en/27548/
Code Review
This pull request adds a new CI test for torchrun with 8 GPUs to improve coverage for distributed inference, which is a great addition. The changes in torchrun_dp_example.py that allow configuration via environment variables are necessary for this. I found a code-quality issue in the example script where newly added module-level variables are shadowed by later assignments, which can be confusing. I've left a couple of suggestions to improve clarity and avoid potential future bugs.
    - VLLM_ALLOW_INSECURE_SERIALIZATION=1 RAY_DEDUP_LOGS=0 python3 rlhf_colocate.py
    - popd

- label: Distributed Tests (8 GPUs) # ?min
do we have the capacity to run this?
Seems not: https://github.com/vllm-project/ci-infra/blob/a64d73f396ba8629e8fb9c5e5b933a5f87d1edc2/buildkite/pipeline_generator/utils.py#L32
Is it possible to add 8-card capacity? @hl475
If we want 8-card capacity, maybe try with H100? Example PR: https://github.com/vllm-project/vllm/pull/27113/files
Purpose
Increase torchrun / external launcher mode CI coverage to prevent issues like #27502. Reverting #27502 would fail this test.
Test Plan
TP_SIZE=2 DP_SIZE=4 ENABLE_EP=1 torchrun --nproc-per-node=8 examples/offline_inference/torchrun_dp_example.py
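As a sanity check on the command above (my own illustration, not part of the PR), the product of the tensor-parallel and data-parallel sizes must equal the number of processes torchrun launches, since each rank in external-launcher mode maps to exactly one slot in the TP x DP grid:

```python
def check_parallel_layout(world_size: int, tp_size: int, dp_size: int) -> bool:
    # Every launched process must correspond to one (tp_rank, dp_rank) pair,
    # so the grid must exactly cover the world size.
    return tp_size * dp_size == world_size


# For the test-plan command: --nproc-per-node=8 with TP_SIZE=2, DP_SIZE=4.
assert check_parallel_layout(8, 2, 4)
```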
Test Result
Pass
Essential Elements of an Effective PR Description Checklist
Update supported_models.md and examples for a new model.