[RFC][CI/Build] Add Llama4 Maverick FP8 GSM8K + ChartQA Accuracy Tests #21810
Conversation
Signed-off-by: Ye (Charlotte) Qi <[email protected]>
Code Review

This pull request introduces support for two new large models, DeepSeek-V3 and Llama-4-Maverick-FP8, to the accuracy testing suite, targeting H100/MI300X hardware. The changes add new YAML configuration files for these models and update the `test_lm_eval_correctness.py` script to allow for a configurable `gpu_memory_utilization` and `batch_size`. While the changes are well-structured, I've identified a potential issue with the default `gpu_memory_utilization` value, which could lead to test instability.
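To make the configurability point concrete, here is a minimal sketch of how the test script could fall back to defaults when a model YAML omits the new knobs. The function name, key names, and default values are assumptions for illustration, not the PR's actual code:

```python
# Illustrative only: merge a per-model eval config with fallback defaults.
# The default values here are assumptions, not the PR's chosen defaults.
DEFAULTS = {"gpu_memory_utilization": 0.9, "batch_size": "auto"}

def resolve_eval_config(model_cfg: dict) -> dict:
    """Return the model's eval config with defaults filled in for optional knobs."""
    resolved = dict(DEFAULTS)
    resolved.update(model_cfg)  # per-model YAML values override the defaults
    return resolved

cfg = resolve_eval_config({"max_model_len": 1024, "batch_size": 1})
# cfg["batch_size"] comes from the model config; gpu_memory_utilization
# falls back to the default because the config did not set it.
```

A scheme like this keeps existing model YAMLs valid while letting memory-constrained configs pin a smaller `batch_size` or lower `gpu_memory_utilization`.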
```yaml
num_fewshot: 8
trust_remote_code: True
max_model_len: 1024
batch_size: 1
```
I am testing on my local H100 using this config; it's not ideal (batch_size=1 and only a 1k sequence length), and perhaps we should test it on MI300X instead, which has much more GPU memory.
Noted that the publicly available H100 only has 80GB, so it'll OOM here. I wonder if we have AMD or H200 in CI?
If not, could you validate how much difference there is between full DS3 vs DS2 with TP/EP and so on?
If we cannot get DS3 added in CI, then let's try to add it to CD, cc @huydhn, if it's not already there.
I don't think we have any MI300X/H200 in CI; I will follow up with @huydhn on whether we can add these tests to the PyTorch CI (which has H100/MI300X now).
```yaml
# For hf script, without -t option (tensor parallel size).
```
We will currently need to define the model name + tasks (text/MM) per config, which is not that scalable. We could think about refactoring the yaml/code to support `task_groups`, where we define the test suite per model like this (cc @robertgshaw2-redhat):

```yaml
task_groups:
  mm_tasks:
    name: "chartqa"
    ...
  text_tasks:
    name: "gsm8k"
```
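A sketch of how an eval runner could consume such a `task_groups` schema, to show the proposal is straightforward to dispatch on. The schema mirrors the suggestion above; the iteration code itself is illustrative, not existing vLLM code:

```python
# Illustrative only: walk a proposed task_groups config and yield each
# (group, task_name) pair so text and multimodal suites can be dispatched
# separately per model.
def iter_tasks(config: dict):
    for group, spec in config.get("task_groups", {}).items():
        yield group, spec["name"]

proposed = {
    "task_groups": {
        "mm_tasks": {"name": "chartqa"},
        "text_tasks": {"name": "gsm8k"},
    }
}
pairs = sorted(iter_tasks(proposed))
# pairs -> [('mm_tasks', 'chartqa'), ('text_tasks', 'gsm8k')]
```

Each group could then be mapped to its own harness invocation (e.g. a multimodal runner for `mm_tasks`, a text runner for `text_tasks`).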
Essential Elements of an Effective PR Description Checklist
- Update supported_models.md and examples for a new model.

Purpose
Add more large models to accuracy testing. Note that they are unable to run on A100, so we will add them only after we get H100/MI300X capacity.
This PR also picks up #19959, where we added support for MM evals.
Test Plan
Test Result
full: gist:26a18e47250c03fee1ba9d8ffa4c431b
(Optional) Documentation Update