
[RFC][CI/Build] Add Llama4 Maverick FP8 GSM8K + ChartQA Accuracy Tests #21810


Open: zhewenl wants to merge 3 commits into main from add-more-large-model

Conversation

zhewenl (Collaborator) commented Jul 29, 2025

Essential Elements of an Effective PR Description Checklist

  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

Purpose

Add more large models to accuracy testing. Note that they cannot run on A100, so we will add them only after we get H100/MI300X capacity.

This PR also picks up #19959, which added support for multimodal (MM) evals.

Test Plan

pytest -s -v test_lm_eval_correctness.py \
    --config-list-file=configs/models-large-h100.txt \
    --tp-size=8

pytest -s -v test_lm_eval_correctness.py \
    --config-list-file=configs/models-mm-large-h100.txt \
    --tp-size=8
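
Each entry in the config list file names a per-model YAML config consumed by test_lm_eval_correctness.py. A minimal sketch of such a config, assuming the harness's usual layout (model_name and the metric value are illustrative placeholders; the num_fewshot/trust_remote_code/max_model_len/batch_size fields mirror the snippet reviewed below):

# Sketch only: model_name and value are placeholders, not from this PR.
model_name: "meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8"
tasks:
- name: "gsm8k"
  metrics:
  - name: "exact_match,strict-match"
    value: 0.90
num_fewshot: 8
trust_remote_code: True
max_model_len: 1024
batch_size: 1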

Test Result

================================================================= warnings summary ==================================================================
<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: Type google._upb._message.MessageMapContainer uses PyType_Spec with a metaclass that has custom tp_new. This is deprecated and will no longer be allowed in Python 3.14.

<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: Type google._upb._message.ScalarMapContainer uses PyType_Spec with a metaclass that has custom tp_new. This is deprecated and will no longer be allowed in Python 3.14.

.buildkite/lm-eval-harness/test_lm_eval_correctness.py::test_lm_eval_correctness_param[config_filename0]
  /usr/lib64/python3.12/multiprocessing/popen_fork.py:66: DeprecationWarning: This process (pid=2989776) is multi-threaded, use of fork() may lead to deadlocks in the child.
    self.pid = os.fork()

.buildkite/lm-eval-harness/test_lm_eval_correctness.py: 51 warnings
  /home/zhewenli/uv_env/vllm/lib64/python3.12/site-packages/lm_eval/models/vllm_causallms.py:419: DeprecationWarning: The keyword arguments {'prompt_token_ids'} are deprecated and will be removed in a future update. Please use the 'prompts' parameter instead.
    cont = self._model_generate(

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
==================================================== 2 passed, 54 warnings in 865.25s (0:14:25) =====================================================

Full output: gist 26a18e47250c03fee1ba9d8ffa4c431b

(Optional) Documentation Update

Signed-off-by: Ye (Charlotte) Qi <[email protected]>
@zhewenl zhewenl requested review from mgoin and simon-mo as code owners July 29, 2025 07:11

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run fastcheck CI, which runs a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the ci/build label Jul 29, 2025
@zhewenl zhewenl requested a review from houseroad July 29, 2025 07:12
@mergify mergify bot added the deepseek (Related to DeepSeek models) and llama (Related to Llama models) labels Jul 29, 2025
gemini-code-assist bot (Contributor) left a comment


Code Review

This pull request introduces support for two new large models, DeepSeek-V3 and Llama-4-Maverick-FP8, to the accuracy testing suite, targeting H100/MI300X hardware. The changes involve adding new YAML configuration files for these models and updating the test_lm_eval_correctness.py script to allow for configurable gpu_memory_utilization and batch_size. While the changes are well-structured, I've identified a potential issue with the default gpu_memory_utilization value, which could lead to test instability.

num_fewshot: 8
trust_remote_code: True
max_model_len: 1024
batch_size: 1
zhewenl (Collaborator, Author) commented:

I am testing on my local H100 using this config. It's not ideal (batch size = 1 and only a 1K sequence length); perhaps we should test it on MI300X, which has much more GPU memory.
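
For example, with MI300X's 192 GB per GPU, a roomier variant of this config might look like the following (hypothetical values, not validated):

# Hypothetical MI300X settings: more headroom allows a longer context
# and a larger batch than the H100 config above.
max_model_len: 8192
batch_size: 32
gpu_memory_utilization: 0.90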

A collaborator replied:

Noted that publicly available H100s only have 80 GB, so it'll OOM here. I wonder if we have AMD or H200 in CI?

If not, could you validate how much difference there is between full DS3 vs. DS2 with TP/EP and so on?

If we cannot get DS3 added in CI, then let's try to add them to CD, cc @huydhn, if it's not already there.

zhewenl (Collaborator, Author) replied:

I don't think we have any MI300X/H200 in CI; I will follow up with @huydhn on whether we can add it to PyTorch CI (which has H100/MI300X now).

@zhewenl zhewenl force-pushed the add-more-large-model branch from c48e0de to 6496ade Compare July 29, 2025 22:12
@mergify mergify bot added the documentation (Improvements or additions to documentation), frontend, new-model (Requests to new models), and performance (Performance-related issues) labels Jul 29, 2025
@zhewenl zhewenl force-pushed the add-more-large-model branch from 7a7d83e to 9fb7562 Compare July 30, 2025 00:20
@zhewenl zhewenl changed the title from "[RFC][CI/Build] Add Deepseek v3 and Llama4 Maverick FP8" to "[RFC][CI/Build] Add Llama4 Maverick FP8 GSM8K + ChartQA Accuracy Tests" Jul 30, 2025
@zhewenl zhewenl force-pushed the add-more-large-model branch from 9fb7562 to 851ccc9 Compare July 30, 2025 00:28
@@ -0,0 +1,11 @@
# For hf script, without -t option (tensor parallel size).
zhewenl (Collaborator, Author) commented Jul 30, 2025:

We will need to define model name + tasks (text/MM), which is not that scalable. We could refactor the YAML/code to support task_groups, where we define the test suite per model, like this (a fuller sketch follows the snippet):
cc @robertgshaw2-redhat

task_groups:
  mm_tasks:
    name: "chartqa"
    ...
  text_tasks:
    name: "gsm8k"
