
Conversation

Alexei-V-Ivanov-AMD (Collaborator) commented Oct 21, 2025

Mirroring changes in test-pipeline.yaml into test-amd.yaml

Signed-off-by: Alexei V. Ivanov [email protected]

mergify bot added the ci/build and rocm (Related to AMD ROCm) labels on Oct 21, 2025
gemini-code-assist bot (Contributor) left a comment


Code Review

This pull request mirrors changes from test-pipeline.yaml into test-amd.yaml. While many changes are valid, several NVIDIA-specific test configurations (for Blackwell and H200 GPUs) have been incorrectly copied into the AMD pipeline file. These steps will fail on AMD hardware and should be removed. There is also a CUDA-specific test dependency that should be removed.

Comment on lines +940 to +941
- label: Blackwell Test # 21 min
  timeout_in_minutes: 30

critical

While this change updates the label and timeout, this entire 'Blackwell Test' step is configured for NVIDIA hardware (gpu: b200 on line 943) and is not suitable for the test-amd.yaml pipeline. This step should be removed entirely from this file.
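For context, the step the reviewer wants removed looks roughly like this (a sketch reconstructed from the lines quoted above and the reviewer's reference to gpu: b200 on line 943; the exact file contents may differ):

```yaml
- label: Blackwell Test # 21 min
  timeout_in_minutes: 30
  gpu: b200  # NVIDIA Blackwell runner tag; has no meaning on AMD hardware
```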

Comment on lines +976 to +995
- label: Blackwell Fusion Tests # 30 min
  timeout_in_minutes: 40
  working_dir: "/vllm-workspace/"
  gpu: b200
  source_file_dependencies:
    - csrc/quantization/fp4/
    - vllm/model_executor/layers/quantization/utils/flashinfer_utils.py
    - vllm/v1/attention/backends/flashinfer.py
    - vllm/compilation/
    # can affect pattern matching
    - vllm/model_executor/layers/layernorm.py
    - vllm/model_executor/layers/activation.py
    - vllm/model_executor/layers/quantization/input_quant_fp8.py
  commands:
    - nvidia-smi
    - pytest -v -s tests/compile/test_fusion_attn.py
    - pytest -v -s tests/compile/test_silu_mul_quant_fusion.py
    # this runner has 2 GPUs available even though num_gpus=2 is not set
    - pytest -v -s tests/compile/test_fusion_all_reduce.py
    - pytest -v -s tests/compile/test_fusions_e2e.py

critical

This newly added 'Blackwell Fusion Tests' step is configured for NVIDIA Blackwell (B200) GPUs and includes the nvidia-smi command. This is incorrect for the test-amd.yaml pipeline, which is intended for AMD hardware. This entire step should be removed.

Comment on lines +1282 to 1293
- label: Distributed Tests (H200) # optional
  gpu: h200
  optional: true
  working_dir: "/vllm-workspace/"
  num_gpus: 2
  commands:
    - pytest -v -s tests/compile/test_async_tp.py
    - pytest -v -s tests/compile/test_sequence_parallelism.py
    - pytest -v -s tests/compile/test_fusion_all_reduce.py
    - pytest -v -s tests/compile/test_fusions_e2e.py::test_tp2_attn_quant_allreduce_rmsnorm
    - pytest -v -s tests/distributed/test_context_parallel.py
    - CUDA_VISIBLE_DEVICES=1,2 VLLM_ALL2ALL_BACKEND=deepep_high_throughput VLLM_USE_DEEP_GEMM=1 VLLM_LOGGING_LEVEL=DEBUG python3 examples/offline_inference/data_parallel.py --model Qwen/Qwen1.5-MoE-A2.7B --tp-size=1 --dp-size=2 --max-model-len 2048

critical

This test step is configured for NVIDIA H200 GPUs (gpu: h200) and is not suitable for the test-amd.yaml pipeline. While the changes mirror another file, this entire step, including the typo fix and newly added tests, should be removed from the AMD-specific configuration.

source_file_dependencies:
  - csrc/
  - tests/kernels/core
  - tests/kernels/test_top_k_per_row.py

high

The test file tests/kernels/test_top_k_per_row.py is CUDA-specific, as indicated by @pytest.mark.skipif(not current_platform.is_cuda(), reason="This test requires CUDA") within the file. Including this as a source dependency in test-amd.yaml is incorrect. This dependency should be removed from the AMD pipeline.
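Applying this suggestion would leave the dependency block quoted above looking roughly like this (a sketch; only the CUDA-gated test file is dropped):

```yaml
source_file_dependencies:
  - csrc/
  - tests/kernels/core
  # tests/kernels/test_top_k_per_row.py dropped: the test is gated by
  # @pytest.mark.skipif(not current_platform.is_cuda(), ...) and never runs on ROCm
```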

- cd .. && VLLM_WORKER_MULTIPROC_METHOD=spawn pytest -v -s tests/models/multimodal/generation/test_whisper.py -m core_model # Otherwise, mp_method="spawn" doesn't work

- label: Multi-Modal Accuracy Eval (Small Models) # 50min
  mirror_hardwares: [amdexperimental]
Collaborator

Why don't we use amd production to enable this in actual CI?

Collaborator

This pipeline, in its current state, is meant for monitoring only, not for gating PRs.


- label: Blackwell Test # 38 min
  timeout_in_minutes: 60
- label: Blackwell Test # 21 min
Collaborator

Do we still need to test Blackwell on this AMD specific pipeline?

Collaborator (Author)

We include that test group for completeness. For the moment it is just a comment in test-amd.yaml, not an instruction for action, but that may change in the future.
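A sketch of what "just a comment" could look like in test-amd.yaml, assuming the group is carried in commented-out form (reconstructed from the lines quoted earlier in this thread, not the literal file contents):

```yaml
# Kept for parity with test-pipeline.yaml; not executed on AMD runners.
# - label: Blackwell Test # 21 min
#   timeout_in_minutes: 30
#   gpu: b200
```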

gshtras (Collaborator) left a comment

Assuming the gpu: h200 steps are skipped in this pipeline and do not put load on the actual H200 nodes.

gshtras added the ready (ONLY add when PR is ready to merge/full CI is needed) label on Oct 21, 2025
gshtras merged commit 49c00fe into vllm-project:main on Oct 22, 2025
17 checks passed