[Text]Add accuracy test for model Mistral-7B-Instruct-v0.1 #3742
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| | | @@ -0,0 +1,11 @@ |
| | | model_name: "AI-ModelScope/Mistral-7B-Instruct-v0.1" |
| | | runner: "linux-aarch64-a2-1" |
| | | hardware: "Atlas A2 Series" |
| | | tasks: |
| | | - name: "gsm8k" |
| | | metrics: |
| | | - name: "exact_match,strict-match" |
| | | value: 0.35 |
| | | - name: "exact_match,flexible-extract" |
| | | value: 0.38 |
| | | trust_remote_code: True |
Comment on lines +1 to +11

Using a mirrored model from …

Suggested change:

model_name: "mistralai/Mistral-7B-Instruct-v0.1"
runner: "linux-aarch64-a2-1"
hardware: "Atlas A2 Series"
tasks:
- name: "gsm8k"
  metrics:
  - name: "exact_match,strict-match"
    value: 0.35
  - name: "exact_match,flexible-extract"
    value: 0.38
The expected accuracy values for gsm8k (35% and 38%) are significantly lower than the published score of 42.7% for Mistral-7B-Instruct-v0.1 on this benchmark (with 8-shot, which appears to be the setting used here). Setting the expected accuracy this low can mask future performance regressions: if the model's performance degrades but stays above the low threshold, the test still passes. Please investigate the cause of the discrepancy, which might be related to using a mirrored model. The expected values should be as close as possible to the actual measured performance so that the test can detect regressions.
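
To make the masking concern concrete, here is a minimal sketch. It assumes the accuracy test is a lower-bound check with a relative tolerance; the helper name `passes_lower_bound`, the 5% tolerance, and the degraded score of 0.36 are hypothetical and are not taken from this repository's test harness.

```python
def passes_lower_bound(measured: float, expected: float, rel_tol: float = 0.05) -> bool:
    """Hypothetical check: pass if measured is at most rel_tol below expected."""
    return measured >= expected * (1.0 - rel_tol)


published = 0.427     # score cited in the review comment
expected_low = 0.35   # strict-match value proposed in this PR
regressed = 0.36      # hypothetical measurement after a regression

# With the low expectation, the degraded score still passes.
print(passes_lower_bound(regressed, expected_low))  # True

# With an expectation near the cited score, the same run fails.
print(passes_lower_bound(regressed, published))     # False
```

If the harness instead uses a two-sided closeness check, a low expectation would fail on a healthy run rather than hide a regression, so either way the expected value needs to track what the model actually measures.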