Skip to content

Conversation

@MrZ20
Copy link

@MrZ20 MrZ20 commented Oct 25, 2025

What this PR does / why we need it?

Add accuracy test for model Mistral-7B-Instruct-v0.1

Does this PR introduce any user-facing change?

How was this patch tested?

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request adds a new accuracy test for the Mistral-7B-Instruct-v0.1 model. My review focuses on the correctness and security of the new test configuration. I've identified a critical security issue with the use of trust_remote_code: True and the use of a mirrored model from a non-official source. I recommend switching to the official mistralai model, which also resolves the security concern. Additionally, I've pointed out that the expected accuracy metrics in the test are significantly lower than published benchmarks, which could weaken the test's ability to catch regressions. I've provided suggestions to address these points.

Comment on lines +1 to +11
model_name: "AI-ModelScope/Mistral-7B-Instruct-v0.1"
runner: "linux-aarch64-a2-1"
hardware: "Atlas A2 Series"
tasks:
- name: "gsm8k"
metrics:
- name: "exact_match,strict-match"
value: 0.35
- name: "exact_match,flexible-extract"
value: 0.38
trust_remote_code: True
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

Using a mirrored model from AI-ModelScope and setting trust_remote_code: True introduces security and correctness risks. trust_remote_code: True allows arbitrary code execution and should be avoided. It's highly recommended to use the official mistralai/Mistral-7B-Instruct-v0.1 model, which is more secure as it does not require this flag, and ensures you are testing against the canonical model version. Please update the model name and remove the trust_remote_code setting.

model_name: "mistralai/Mistral-7B-Instruct-v0.1"
runner: "linux-aarch64-a2-1"
hardware: "Atlas A2 Series"
tasks:
- name: "gsm8k"
  metrics:
  - name: "exact_match,strict-match"
    value: 0.35
  - name: "exact_match,flexible-extract"
    value: 0.38

Comment on lines +7 to +10
- name: "exact_match,strict-match"
value: 0.35
- name: "exact_match,flexible-extract"
value: 0.38
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The expected accuracy values for gsm8k (35% and 38%) are significantly lower than the published score of 42.7% for Mistral-7B-Instruct-v0.1 on this benchmark (with 8-shot, which appears to be the setting used here). Using such a low expectation for accuracy can mask future performance regressions. For example, if the model performance degrades but remains above this low threshold, the test would still pass. It is recommended to investigate the cause of this discrepancy, which might be related to using a mirrored model. The expected values should be as close as possible to the actual measured performance to make the test meaningful for detecting regressions.

@github-actions
Copy link

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:‌‌

  • A PR should do only one thing, smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests ‌to ensure it works and is not broken by other future PRs.
  • Write the commit message by fulfilling the PR description to help reviewer and future developers understand.

If CI fails, you can run linting and testing checks locally according Contributing and Testing.

@github-actions
Copy link

This pull request has conflicts, please resolve those before we can evaluate the pull request.

Signed-off-by: MrZ20 <[email protected]>
@MrZ20 MrZ20 force-pushed the Mistral_7B_Instruct_v0.1 branch from c2ff1d2 to 87851ce Compare October 27, 2025 11:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant