
[RFC][CI/Build] Add Llama4 Maverick FP8 GSM8K + ChartQA Accuracy Tests #21810


Open: zhewenl wants to merge 3 commits into main from add-more-large-model

Conversation

zhewenl (Collaborator) commented Jul 29, 2025

Essential Elements of an Effective PR Description Checklist

  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

Purpose

Add more large models to accuracy testing. Note that they cannot run on A100, so we will add them only after we get H100/MI300X capacity.

This PR also picks up #19959, which added support for multimodal (MM) evals.

Test Plan

pytest -s -v test_lm_eval_correctness.py \
    --config-list-file=configs/models-large-h100.txt \
    --tp-size=8

pytest -s -v test_lm_eval_correctness.py \
    --config-list-file=configs/models-mm-large-h100.txt \
    --tp-size=8
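
Each entry in the config list file names a per-model YAML config consumed by test_lm_eval_correctness.py. A minimal sketch of such a config, assuming the harness's usual layout (model_name and the metric value are illustrative placeholders; the num_fewshot/trust_remote_code/max_model_len/batch_size fields mirror the snippet reviewed below):

# Sketch only: model_name and value are placeholders, not from this PR.
model_name: "meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8"
tasks:
- name: "gsm8k"
  metrics:
  - name: "exact_match,strict-match"
    value: 0.90
num_fewshot: 8
trust_remote_code: True
max_model_len: 1024
batch_size: 1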

Test Result

================================================================= warnings summary ==================================================================
<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: Type google._upb._message.MessageMapContainer uses PyType_Spec with a metaclass that has custom tp_new. This is deprecated and will no longer be allowed in Python 3.14.

<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: Type google._upb._message.ScalarMapContainer uses PyType_Spec with a metaclass that has custom tp_new. This is deprecated and will no longer be allowed in Python 3.14.

.buildkite/lm-eval-harness/test_lm_eval_correctness.py::test_lm_eval_correctness_param[config_filename0]
  /usr/lib64/python3.12/multiprocessing/popen_fork.py:66: DeprecationWarning: This process (pid=2989776) is multi-threaded, use of fork() may lead to deadlocks in the child.
    self.pid = os.fork()

.buildkite/lm-eval-harness/test_lm_eval_correctness.py: 51 warnings
  /home/zhewenli/uv_env/vllm/lib64/python3.12/site-packages/lm_eval/models/vllm_causallms.py:419: DeprecationWarning: The keyword arguments {'prompt_token_ids'} are deprecated and will be removed in a future update. Please use the 'prompts' parameter instead.
    cont = self._model_generate(

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
==================================================== 2 passed, 54 warnings in 865.25s (0:14:25) =====================================================

Full output: gist 26a18e47250c03fee1ba9d8ffa4c431b

(Optional) Documentation Update

Signed-off-by: Ye (Charlotte) Qi <[email protected]>
@zhewenl zhewenl requested review from mgoin and simon-mo as code owners July 29, 2025 07:11

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run fastcheck CI, which runs a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the ci/build label Jul 29, 2025
@zhewenl zhewenl requested a review from houseroad July 29, 2025 07:12
@mergify mergify bot added the deepseek (Related to DeepSeek models) and llama (Related to Llama models) labels Jul 29, 2025
gemini-code-assist bot (Contributor) left a comment


Code Review

This pull request introduces support for two new large models, DeepSeek-V3 and Llama-4-Maverick-FP8, to the accuracy testing suite, targeting H100/MI300X hardware. The changes involve adding new YAML configuration files for these models and updating the test_lm_eval_correctness.py script to allow for configurable gpu_memory_utilization and batch_size. While the changes are well-structured, I've identified a potential issue with the default gpu_memory_utilization value, which could lead to test instability.

num_fewshot: 8
trust_remote_code: True
max_model_len: 1024
batch_size: 1
zhewenl (Collaborator, Author) commented:

I am testing on my local H100 using this config. It's not ideal (batch size = 1 and only a 1K sequence length); perhaps we should test it on MI300X, which has much more GPU memory.
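
For example, with MI300X's 192 GB per GPU, a roomier variant of this config might look like the following (hypothetical values, not validated):

# Hypothetical MI300X settings: more headroom allows a longer context
# and a larger batch than the H100 config above.
max_model_len: 8192
batch_size: 32
gpu_memory_utilization: 0.90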

A collaborator replied:

Noted that publicly available H100s only have 80 GB, so it'll OOM here. I wonder if we have AMD or H200 in CI?

If not, could you validate how much difference there is between full DS3 vs. DS2 with TP/EP and so on?

If we cannot get DS3 added in CI, then let's try to add them to CD, cc @huydhn, if it's not already there.

zhewenl (Collaborator, Author) replied:

I don't think we have any MI300X/H200 in CI; I will follow up with @huydhn on whether we can add it to PyTorch CI (which has H100/MI300X now).

@zhewenl zhewenl force-pushed the add-more-large-model branch from c48e0de to 6496ade Compare July 29, 2025 22:12
@mergify mergify bot added the documentation (Improvements or additions to documentation), frontend, new-model (Requests to new models), and performance (Performance-related issues) labels Jul 29, 2025
@zhewenl zhewenl force-pushed the add-more-large-model branch from 7a7d83e to 9fb7562 Compare July 30, 2025 00:20
@zhewenl zhewenl changed the title from "[RFC][CI/Build] Add Deepseek v3 and Llama4 Maverick FP8" to "[RFC][CI/Build] Add Llama4 Maverick FP8 GSM8K + ChartQA Accuracy Tests" Jul 30, 2025
@zhewenl zhewenl force-pushed the add-more-large-model branch from 9fb7562 to 851ccc9 Compare July 30, 2025 00:28
@@ -0,0 +1,11 @@
# For hf script, without -t option (tensor parallel size).
zhewenl (Collaborator, Author) commented Jul 30, 2025:

We will need to define model name + tasks (text/MM), which is not that scalable. We could refactor the YAML/code to support task_groups, where we define the test suite per model, like this (a fuller sketch follows the snippet):
cc @robertgshaw2-redhat

task_groups:
  mm_tasks:
    name: "chartqa"
    ...
  text_tasks:
    name: "gsm8k"
