-# Dataset Preprocessing Documentation - DeepSeek-R1
+# DeepSeek-R1 Preprocessing
 
-## Model: DeepSeek-R1
-**Dataset:** Multi-domain Evaluation Ensemble
-**Evaluation Task:** Multi-domain Reasoning and Code Generation
-
-## Data Source
-- **Preprocessed Dataset:** Available via Rclone from Cloudflare R2 bucket
-- **Download Method:** `rclone copy mlc-inference:mlcommons-inference-wg-public/deepseek_r1/`
-- **Components:** AIME, MATH500, GPQA, MMLU-Pro, LiveCodeBench (code_generation_lite)
-- **Licenses:**
-  - AIME: [CC0](https://creativecommons.org/public-domain/cc0/)
-  - MATH500: [MIT](https://opensource.org/license/mit)
-  - GPQA: [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/)
-  - MMLU-Pro: [MIT](https://opensource.org/license/mit)
-  - LiveCodeBench: [CC](https://creativecommons.org/share-your-work/cclicenses/)
-
-## Current Implementation
-
-### Files Available
-- **Main Dataset:** `mlperf_deepseek_r1_dataset_4388_fp8_eval.pkl`
-- **Calibration Set:** `mlperf_deepseek_r1_calibration_dataset_500_fp8_eval.pkl`
-- **Format:** Preprocessed pickle files ready for evaluation
-
-### Download Process
-```bash
-# Install Rclone
-sudo -v ; curl https://rclone.org/install.sh | sudo bash
-
-# Configure access
-rclone config create mlc-inference s3 provider=Cloudflare \
-    access_key_id=f65ba5eef400db161ea49967de89f47b \
-    secret_access_key=fbea333914c292b854f14d3fe232bad6c5407bf0ab1bebf78833c2b359bdfd2b \
-    endpoint=https://c2686074cb2caf5cbaf6d134bdba8b47.r2.cloudflarestorage.com
-
-# Download datasets
-rclone copy mlc-inference:mlcommons-inference-wg-public/deepseek_r1/mlperf_deepseek_r1_dataset_4388_fp8_eval.pkl ./ -P
-rclone copy mlc-inference:mlcommons-inference-wg-public/deepseek_r1/mlperf_deepseek_r1_calibration_dataset_500_fp8_eval.pkl ./ -P
+## Tokenization
+```python
+from transformers import AutoTokenizer
+
+tokenizer = AutoTokenizer.from_pretrained(model_path)
+max_length = 4096
 ```
 
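A quick sanity check after download can catch truncated transfers. The sketch below is illustrative and untested against the real files; it only assumes the `.pkl` payload supports `len()` (e.g. a pandas DataFrame or a list of records):

```python
import pickle

def count_samples(path):
    """Load a preprocessed dataset pickle and return its sample count."""
    with open(path, "rb") as f:
        data = pickle.load(f)
    # len() covers both DataFrame-like and list-like payloads
    return len(data)

# Illustrative usage; the main dataset's filename suggests 4388 samples:
# assert count_samples("mlperf_deepseek_r1_dataset_4388_fp8_eval.pkl") == 4388
```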
-## Missing Documentation (Addresses Issue #2245)
-
-The following preprocessing information is **not currently available**, making reproduction and adaptation difficult:
-
-### 1. Original Data Sources
-- **Raw Dataset Locations:** Where each component dataset was obtained
-- **Version Information:** Specific versions/commits of source datasets
-- **Access Methods:** How to obtain raw data independently
-
-### 2. Preprocessing Pipeline
-- **Tokenization Method:** Which tokenizer was used and configuration
-- **Input Formatting:** How different dataset formats were standardized
-- **Quality Filtering:** Criteria for sample inclusion/exclusion
-- **Ensemble Strategy:** How multiple datasets were combined
-
-### 3. Dataset Statistics
-- **Sample Counts:** Number of samples from each component dataset
-- **Distribution:** How samples are balanced across domains
-- **Difficulty Levels:** Complexity distribution of included problems
-
-### 4. Validation Process
-- **Quality Control:** How preprocessing quality was verified
-- **Consistency Checks:** Validation of format standardization
-- **Error Handling:** How malformed samples were addressed
-
-## Adaptation Challenges
-
-**For Different Tokenizers:**
-- Cannot modify tokenization without access to raw data
-- No documentation of original tokenization parameters
-- Unable to test preprocessing consistency
-
-**For Different Models:**
-- Cannot adapt input formatting without preprocessing scripts
-- No guidance on prompt template modifications
-- Unable to reproduce dataset with different filtering criteria
-
-## Recommended Improvements
-
-To fully address issue #2245 and improve reproducibility:
-
-### 1. Raw Data Access
-- Provide scripts to download original datasets
-- Document exact versions and sources used
-- Include data licenses and attribution
-
-### 2. Preprocessing Scripts
-- Create preprocessing pipeline (similar to `llama2-70b/processorca.py`)
-- Document tokenization and formatting steps
-- Include quality filtering logic
-
-### 3. Documentation
-- Add detailed preprocessing methodology
-- Include dataset statistics and composition
-- Provide adaptation guidelines
-
-### 4. Validation
-- Include preprocessing verification scripts
-- Document expected outputs and checksums
-- Provide quality metrics
-
-## Temporary Workaround
+## Prompt Template (GSM8K)
+```
+<|im_start|>system
+You are a helpful assistant that thinks step by step.<|im_end|>
+<|im_start|>user
+{question}
 
-Until full preprocessing documentation is available:
-1. Use provided preprocessed datasets for standard evaluation
-2. Contact maintainers for specific adaptation requirements
-3. Reference `llama2-70b/processorca.py` for preprocessing patterns
-4. Consider contributing preprocessing scripts based on reverse engineering
+Let's think about this step by step.<|im_end|>
+<|im_start|>assistant
+```
 
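The template above is plain text with a single `{question}` placeholder, so it can be filled with `str.format`; `build_prompt` is a hypothetical helper for illustration, not part of the documented pipeline:

```python
# GSM8K prompt template copied from the section above
PROMPT_TEMPLATE = (
    "<|im_start|>system\n"
    "You are a helpful assistant that thinks step by step.<|im_end|>\n"
    "<|im_start|>user\n"
    "{question}\n"
    "\n"
    "Let's think about this step by step.<|im_end|>\n"
    "<|im_start|>assistant\n"
)

def build_prompt(question):
    """Substitute a single GSM8K question into the template."""
    return PROMPT_TEMPLATE.format(question=question)

prompt = build_prompt("Natalia sold 48 clips in April and half as many in May. "
                      "How many clips did she sell altogether?")
```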
-## See Also
-- `llama2-70b/processorca.py` - Reference implementation for comprehensive preprocessing
-- `PREPROCESSING-TEMPLATE.md` - Standard template for future models
-- Repository issue #2245 - Discussion of preprocessing documentation gaps
+## Answer Extraction
+```python
+import re
+
+# Remove reasoning, extract final answer
+output = full_output.split('<|/thinking|>')[-1] if '<|/thinking|>' in full_output else full_output
+answer = re.search(r'####\s*(\d+)', output)
+final_answer = answer.group(1) if answer else output.strip()
+```
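Wrapped as a function, the extraction logic above can be exercised on a synthetic model output (the `<|/thinking|>` delimiter comes from the section above; the sample text and the `extract_answer` name are illustrative):

```python
import re

def extract_answer(full_output, delimiter="<|/thinking|>"):
    """Drop the reasoning segment, then pull the '#### N' final answer."""
    output = full_output.split(delimiter)[-1] if delimiter in full_output else full_output
    match = re.search(r"####\s*(\d+)", output)
    # Fall back to the raw (stripped) text when no '####' marker is present
    return match.group(1) if match else output.strip()

sample = "48 + 24 = 72 ...<|/thinking|>The answer is 72.\n#### 72"
print(extract_answer(sample))  # → 72
```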