Skip to content

Conversation

draftbk
Copy link
Contributor

@draftbk draftbk commented Aug 29, 2025

Purpose

Update pytorch rocm from 6.3 to 6.4

Initial purpose for myself: trying to test TorchAO FP8 on MI300X, and from the doc, it is tested with ROCm 6.4.

Test Plan & Test Result

# MI300X
lm_eval \
  --model vllm \
  --model_args pretrained=/home/lifans/hf_models/meta-llama/Llama-3.1-8B-Instruct,max_model_len=4096 \
  --tasks gsm8k \
  --batch_size auto
# rocm 6.4
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.7862|±  |0.0113|
|     |       |strict-match    |     5|exact_match|↑  |0.7619|±  |0.0117|

# rocm 6.3
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.7824|±  |0.0114|
|     |       |strict-match    |     5|exact_match|↑  |0.7566|±  |0.0118|

@mergify mergify bot added ci/build rocm Related to AMD ROCm labels Aug 29, 2025
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates the PyTorch ROCm dependency from version 6.3 to 6.4. While the change in requirements/rocm-build.txt is correct, the PR is incomplete as it misses a corresponding update to the base Docker image. The build environment defined in docker/Dockerfile.rocm_base still uses a ROCm 6.3.1 image, which will likely cause a version mismatch with the new PyTorch wheel, leading to build or runtime failures. This critical issue needs to be addressed by updating the base image to a ROCm 6.4 compatible version.

@@ -1,7 +1,7 @@
# Common dependencies
-r common.txt

--extra-index-url https://download.pytorch.org/whl/rocm6.3
--extra-index-url https://download.pytorch.org/whl/rocm6.4
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

While this change correctly points to the PyTorch wheel for ROCm 6.4, it introduces a critical inconsistency with the project's build environment. The base Docker image, defined in docker/Dockerfile.rocm_base, is still configured to use ROCm 6.3.1 (rocm/dev-ubuntu-22.04:6.3.1-complete).

This mismatch between the ROCm version expected by the PyTorch wheel and the version provided by the Docker image will likely lead to build failures or runtime errors. To resolve this, you must also update the BASE_IMAGE argument in docker/Dockerfile.rocm_base to a corresponding ROCm 6.4 image.

For example:

-ARG BASE_IMAGE=rocm/dev-ubuntu-22.04:6.3.1-complete
+ARG BASE_IMAGE=rocm/dev-ubuntu-22.04:6.4-complete

(Please verify the correct image tag for ROCm 6.4).

Additionally, you should check if other related dependencies in docker/Dockerfile.rocm_base, such as HIPBLASLT_BRANCH and RCCL_BRANCH, need to be updated for compatibility with ROCm 6.4.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

^see this, i think it's a good catch

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dockerfile.rocm_base and rocm-build.txt are used in different circumstances.
rocm-build.txt is for bare metal build, without the docker.
When torch is installed from whl, as in the case with rocm-build.txt, blas libraries, and pretty much the entire rocm stack are brought together with this whl, that's why it's so bit, and also why in the normal workflow we don't use it.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having said that, the ROCm dockerfiles here were out of sync, so I created a parallel PR: #24279

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gshtras Out of curiosity, what is the reason we want to maintain a ROCm version of vLLM? Are there any cool features in ROCm that have not yet been brought to vLLM?

And thanks for the quick review and fix!

Copy link
Collaborator

@gshtras gshtras Sep 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ROCm fork is mainly deprecated, when we build our nightly dockers, it's from this upstream repo.
The only thing it's used for is the two dockerfiles, mainly due to our CI constrains (and possibly at least partially because we forgot about it)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the information : )

@draftbk
Copy link
Contributor Author

draftbk commented Aug 29, 2025

cc: @huydhn.
Since you just updated PyTorch to 2.8.0 (#20358) and might know of any risks I’m unaware of. : )

@huydhn
Copy link
Contributor

huydhn commented Sep 1, 2025

cc: @huydhn. Since you just updated PyTorch to 2.8.0 (#20358) and might know of any risks I’m unaware of. : )

AFAICT, there is no risk here. Upgrading ROCm from 6.3 to 6.4 will align with PyTorch 2.8 which has 6.4 as the default version of ROCm https://pytorch.org/get-started/locally/

Copy link
Contributor

@huydhn huydhn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@draftbk
Copy link
Contributor Author

draftbk commented Sep 1, 2025

cc: @huydhn. Since you just updated PyTorch to 2.8.0 (#20358) and might know of any risks I’m unaware of. : )

AFAICT, there is no risk here. Upgrading ROCm from 6.3 to 6.4 will align with PyTorch 2.8 which has 6.4 as the default version of ROCm https://pytorch.org/get-started/locally/

Thank you for the review!

cc @youkaichao for review and approval.

@gshtras
Copy link
Collaborator

gshtras commented Sep 4, 2025

cc @SageMoore since this file is used mainly (only) by NM on the bare metal build

@draftbk
Copy link
Contributor Author

draftbk commented Sep 7, 2025

Hi, I wanted to follow up and check if there are any concerns about merging this code. cc: @simon-mo @gshtras

BTW, I also tested torchao quantization with torch-rocm6.4, in case you are interested: #24400

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ci/build rocm Related to AMD ROCm
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants