Add accuracy test for model Mistral-7B-Instruct-v0.1 #3742
base: main
Conversation
Code Review
This pull request adds a new accuracy test for the Mistral-7B-Instruct-v0.1 model. My review focuses on the correctness and security of the new test configuration. I've identified a critical security issue with the use of trust_remote_code: True and the use of a mirrored model from a non-official source. I recommend switching to the official mistralai model, which also resolves the security concern. Additionally, I've pointed out that the expected accuracy metrics in the test are significantly lower than published benchmarks, which could weaken the test's ability to catch regressions. I've provided suggestions to address these points.
model_name: "AI-ModelScope/Mistral-7B-Instruct-v0.1"
runner: "linux-aarch64-a2-1"
hardware: "Atlas A2 Series"
tasks:
  - name: "gsm8k"
    metrics:
      - name: "exact_match,strict-match"
        value: 0.35
      - name: "exact_match,flexible-extract"
        value: 0.38
trust_remote_code: True
Using a mirrored model from AI-ModelScope and setting trust_remote_code: True introduces security and correctness risks. trust_remote_code: True allows arbitrary code execution and should be avoided. It's highly recommended to use the official mistralai/Mistral-7B-Instruct-v0.1 model, which is more secure as it does not require this flag, and ensures you are testing against the canonical model version. Please update the model name and remove the trust_remote_code setting.
model_name: "mistralai/Mistral-7B-Instruct-v0.1"
runner: "linux-aarch64-a2-1"
hardware: "Atlas A2 Series"
tasks:
  - name: "gsm8k"
    metrics:
      - name: "exact_match,strict-match"
        value: 0.35
      - name: "exact_match,flexible-extract"
        value: 0.38
The expected accuracy values for gsm8k (35% and 38%) are significantly lower than the published score of 42.7% for Mistral-7B-Instruct-v0.1 on this benchmark (with 8-shot, which appears to be the setting used here). Using such a low expectation for accuracy can mask future performance regressions. For example, if the model performance degrades but remains above this low threshold, the test would still pass. It is recommended to investigate the cause of this discrepancy, which might be related to using a mirrored model. The expected values should be as close as possible to the actual measured performance to make the test meaningful for detecting regressions.
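To illustrate why a loose threshold masks regressions, here is a minimal sketch of an accuracy gate that checks a measured score against an expected value with a relative tolerance. The helper name, tolerance, and scores are hypothetical and not taken from this repository; the 0.427 figure is the published 8-shot gsm8k score mentioned above.

```python
def check_accuracy(measured: float, expected: float, rtol: float = 0.03) -> bool:
    """Pass only if the measured score is within rtol below the expected score.

    With expected=0.427 and rtol=0.03, anything under ~0.414 fails, so a
    drop to 0.35 is caught; a fixed 0.35 threshold would let it pass.
    """
    return measured >= expected * (1 - rtol)

# Anchoring the expectation to the published 8-shot gsm8k score:
print(check_accuracy(0.42, expected=0.427))  # small run-to-run fluctuation passes
print(check_accuracy(0.35, expected=0.427))  # a real regression fails
```

Tying `value` in the test config to the measured baseline (rather than a conservative floor) keeps the gate sensitive to exactly this kind of degradation.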
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
This pull request has conflicts; please resolve them before we can evaluate the pull request.
Signed-off-by: MrZ20 <[email protected]>
c2ff1d2 to 87851ce
What this PR does / why we need it?
Add accuracy test for model Mistral-7B-Instruct-v0.1
Does this PR introduce any user-facing change?
How was this patch tested?