2 changes: 2 additions & 0 deletions .github/workflows/accuracy_test.yaml
@@ -59,6 +59,8 @@ jobs:
model_name: DeepSeek-V2-Lite
- runner: a2-4
model_name: Qwen3-Next-80B-A3B-Instruct
- runner: a2-1
model_name: Mistral-7B-Instruct-v0.1
fail-fast: false
# test will be triggered when tag 'accuracy-test' & 'ready-for-test'
if: >-
11 changes: 11 additions & 0 deletions tests/e2e/models/configs/Mistral-7B-Instruct-v0.1.yaml
@@ -0,0 +1,11 @@
model_name: "AI-ModelScope/Mistral-7B-Instruct-v0.1"
runner: "linux-aarch64-a2-1"
hardware: "Atlas A2 Series"
tasks:
- name: "gsm8k"
  metrics:
  - name: "exact_match,strict-match"
    value: 0.35
  - name: "exact_match,flexible-extract"
    value: 0.38
Comment on lines +7 to +10
Contributor

high

The expected accuracy values for gsm8k (35% and 38%) are significantly lower than the published score of 42.7% for Mistral-7B-Instruct-v0.1 on this benchmark (with 8-shot, which appears to be the setting used here). Using such a low expectation for accuracy can mask future performance regressions. For example, if the model performance degrades but remains above this low threshold, the test would still pass. It is recommended to investigate the cause of this discrepancy, which might be related to using a mirrored model. The expected values should be as close as possible to the actual measured performance to make the test meaningful for detecting regressions.
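
To illustrate the concern above, here is a minimal Python sketch. It assumes the test passes whenever the measured score is no more than a fixed tolerance below the expected value. The function name `passes`, the tolerance of 0.03, and the regressed score of 0.37 are hypothetical and are not taken from this repository's test harness; 0.427 is the published score quoted in the comment.

```python
def passes(measured: float, expected: float, tolerance: float = 0.03) -> bool:
    """Return True if the measured accuracy is at most `tolerance` below expected."""
    return measured >= expected - tolerance


published = 0.427   # published gsm8k score cited in the review comment
configured = 0.38   # flexible-extract value from this PR's config
regressed = 0.37    # hypothetical score after a performance regression

# The regression slips through against the low configured value...
print(passes(regressed, configured))  # True
# ...but would be caught if the expectation tracked the published score.
print(passes(regressed, published))   # False
```

Under these assumptions, a drop of several points below the published score still passes, which is the masking effect the comment describes.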

trust_remote_code: True
Comment on lines +1 to +11
Contributor

critical

Using a mirrored model from AI-ModelScope and setting trust_remote_code: True introduces security and correctness risks. trust_remote_code: True allows arbitrary code execution and should be avoided. It's highly recommended to use the official mistralai/Mistral-7B-Instruct-v0.1 model, which is more secure as it does not require this flag, and ensures you are testing against the canonical model version. Please update the model name and remove the trust_remote_code setting.

model_name: "mistralai/Mistral-7B-Instruct-v0.1"
runner: "linux-aarch64-a2-1"
hardware: "Atlas A2 Series"
tasks:
- name: "gsm8k"
  metrics:
  - name: "exact_match,strict-match"
    value: 0.35
  - name: "exact_match,flexible-extract"
    value: 0.38
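
One way to keep this flag from creeping back into other configs is a small pre-merge check. The following is only a sketch: the helper name `enables_remote_code` is hypothetical, and it uses a naive line scan rather than a real YAML parser, so it only handles flat top-level keys like the ones in this file.

```python
def enables_remote_code(config_text: str) -> bool:
    """Return True if a top-level `trust_remote_code` key is set to a truthy value."""
    for raw_line in config_text.splitlines():
        # Drop trailing comments, then split "key: value" on the first colon.
        line = raw_line.split("#", 1)[0].strip()
        key, sep, value = line.partition(":")
        if sep and key.strip() == "trust_remote_code":
            return value.strip().strip("\"'").lower() in {"true", "yes", "on"}
    return False


original = (
    'model_name: "AI-ModelScope/Mistral-7B-Instruct-v0.1"\n'
    "trust_remote_code: True\n"
)
suggested = 'model_name: "mistralai/Mistral-7B-Instruct-v0.1"\n'

print(enables_remote_code(original))   # True
print(enables_remote_code(suggested))  # False
```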

1 change: 1 addition & 0 deletions tests/e2e/models/configs/accuracy.txt
@@ -6,3 +6,4 @@ Qwen2-7B.yaml
Qwen2-VL-7B-Instruct.yaml
Qwen2-Audio-7B-Instruct.yaml
Qwen3-VL-30B-A3B-Instruct.yaml
Mistral-7B-Instruct-v0.1.yaml