Commit b288c06

Author: Anivar A Aravind (committed)

Simplify preprocessing docs to match PR #2300 approach

- Remove over-engineered validation scripts
- Keep only essential information: tokenizer, prompt template, verification
- Add answer extraction for DeepSeek CoT handling
- Focus on what directly impacts accuracy variance

1 parent 483d1c2 commit b288c06

File tree: 2 files changed (+36, −180 lines)
Lines changed: 21 additions & 108 deletions
````diff
@@ -1,113 +1,26 @@
-# Dataset Preprocessing Documentation - DeepSeek-R1
+# DeepSeek-R1 Preprocessing
 
-## Model: DeepSeek-R1
-**Dataset:** Multi-domain Evaluation Ensemble
-**Evaluation Task:** Multi-domain Reasoning and Code Generation
-
-## Data Source
-- **Preprocessed Dataset:** Available via Rclone from Cloudflare R2 bucket
-- **Download Method:** `rclone copy mlc-inference:mlcommons-inference-wg-public/deepseek_r1/`
-- **Components:** AIME, MATH500, GPQA, MMLU-Pro, LiveCodeBench (code_generation_lite)
-- **Licenses:**
-  - AIME: [CC0](https://creativecommons.org/public-domain/cc0/)
-  - MATH500: [MIT](https://opensource.org/license/mit)
-  - GPQA: [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/)
-  - MMLU-Pro: [MIT](https://opensource.org/license/mit)
-  - LiveCodeBench: [CC](https://creativecommons.org/share-your-work/cclicenses/)
-
-## Current Implementation
-
-### Files Available
-- **Main Dataset:** `mlperf_deepseek_r1_dataset_4388_fp8_eval.pkl`
-- **Calibration Set:** `mlperf_deepseek_r1_calibration_dataset_500_fp8_eval.pkl`
-- **Format:** Preprocessed pickle files ready for evaluation
-
-### Download Process
-```bash
-# Install Rclone
-sudo -v ; curl https://rclone.org/install.sh | sudo bash
-
-# Configure access
-rclone config create mlc-inference s3 provider=Cloudflare \
-  access_key_id=f65ba5eef400db161ea49967de89f47b \
-  secret_access_key=fbea333914c292b854f14d3fe232bad6c5407bf0ab1bebf78833c2b359bdfd2b \
-  endpoint=https://c2686074cb2caf5cbaf6d134bdba8b47.r2.cloudflarestorage.com
-
-# Download datasets
-rclone copy mlc-inference:mlcommons-inference-wg-public/deepseek_r1/mlperf_deepseek_r1_dataset_4388_fp8_eval.pkl ./ -P
-rclone copy mlc-inference:mlcommons-inference-wg-public/deepseek_r1/mlperf_deepseek_r1_calibration_dataset_500_fp8_eval.pkl ./ -P
+## Tokenization
+```python
+tokenizer = AutoTokenizer.from_pretrained(model_path)
+max_length = 4096
 ```
 
-## Missing Documentation (Addresses Issue #2245)
-
-The following preprocessing information is **not currently available**, making reproduction and adaptation difficult:
-
-### 1. Original Data Sources
-- **Raw Dataset Locations:** Where each component dataset was obtained
-- **Version Information:** Specific versions/commits of source datasets
-- **Access Methods:** How to obtain raw data independently
-
-### 2. Preprocessing Pipeline
-- **Tokenization Method:** Which tokenizer was used and configuration
-- **Input Formatting:** How different dataset formats were standardized
-- **Quality Filtering:** Criteria for sample inclusion/exclusion
-- **Ensemble Strategy:** How multiple datasets were combined
-
-### 3. Dataset Statistics
-- **Sample Counts:** Number of samples from each component dataset
-- **Distribution:** How samples are balanced across domains
-- **Difficulty Levels:** Complexity distribution of included problems
-
-### 4. Validation Process
-- **Quality Control:** How preprocessing quality was verified
-- **Consistency Checks:** Validation of format standardization
-- **Error Handling:** How malformed samples were addressed
-
-## Adaptation Challenges
-
-**For Different Tokenizers:**
-- Cannot modify tokenization without access to raw data
-- No documentation of original tokenization parameters
-- Unable to test preprocessing consistency
-
-**For Different Models:**
-- Cannot adapt input formatting without preprocessing scripts
-- No guidance on prompt template modifications
-- Unable to reproduce dataset with different filtering criteria
-
-## Recommended Improvements
-
-To fully address issue #2245 and improve reproducibility:
-
-### 1. Raw Data Access
-- Provide scripts to download original datasets
-- Document exact versions and sources used
-- Include data licenses and attribution
-
-### 2. Preprocessing Scripts
-- Create preprocessing pipeline (similar to `llama2-70b/processorca.py`)
-- Document tokenization and formatting steps
-- Include quality filtering logic
-
-### 3. Documentation
-- Add detailed preprocessing methodology
-- Include dataset statistics and composition
-- Provide adaptation guidelines
-
-### 4. Validation
-- Include preprocessing verification scripts
-- Document expected outputs and checksums
-- Provide quality metrics
-
-## Temporary Workaround
+## Prompt Template (GSM8K)
+```
+<|im_start|>system
+You are a helpful assistant that thinks step by step.<|im_end|>
+<|im_start|>user
+{question}
 
-Until full preprocessing documentation is available:
-1. Use provided preprocessed datasets for standard evaluation
-2. Contact maintainers for specific adaptation requirements
-3. Reference `llama2-70b/processorca.py` for preprocessing patterns
-4. Consider contributing preprocessing scripts based on reverse engineering
+Let's think about this step by step.<|im_end|>
+<|im_start|>assistant
+```
 
-## See Also
-- `llama2-70b/processorca.py` - Reference implementation for comprehensive preprocessing
-- `PREPROCESSING-TEMPLATE.md` - Standard template for future models
-- Repository issue #2245 - Discussion of preprocessing documentation gaps
+## Answer Extraction
+```python
+# Remove reasoning, extract final answer
+output = full_output.split('<|/thinking|>')[-1] if '<|/thinking|>' in full_output else full_output
+answer = re.search(r'####\s*(\d+)', output)
+final_answer = answer.group(1) if answer else output.strip()
+```
````
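The answer-extraction snippet added by this commit omits its `import re` and leaves `full_output` undefined. A minimal runnable sketch of the same logic, wrapped in a helper for clarity — the `<|/thinking|>` delimiter is taken from the doc as-is and may differ for other checkpoints:

```python
import re

def extract_answer(full_output: str, thinking_end: str = "<|/thinking|>") -> str:
    """Drop the chain-of-thought prefix, then pull the final GSM8K-style answer."""
    # Keep only the text after the last reasoning delimiter, if one is present.
    output = full_output.split(thinking_end)[-1] if thinking_end in full_output else full_output
    # GSM8K answers are conventionally marked "#### <number>".
    match = re.search(r"####\s*(\d+)", output)
    return match.group(1) if match else output.strip()

print(extract_answer("step 1... step 2...<|/thinking|>So the result is #### 42"))  # → 42
```

If no `####` marker is found, the stripped post-reasoning text is returned unchanged, which keeps malformed outputs visible rather than silently dropping them.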
Lines changed: 15 additions & 72 deletions
````diff
@@ -1,82 +1,25 @@
-# Dataset Preprocessing Documentation - Llama3.1-8B
+# Llama 3.1 8B Preprocessing
 
-## Model: Llama3.1-8B
-**Dataset:** CNN/DailyMail 3.0.0
-**Evaluation Task:** Text Summarization
-
-## Data Source
-- **Raw Dataset:** Hugging Face `cnn_dailymail` dataset v3.0.0
-- **Download Method:** `datasets.load_dataset("cnn_dailymail", "3.0.0", split="train")`
-- **License:** Apache 2.0
-- **Download Script:** `download_cnndm.py`
-
-## Preprocessing Pipeline
-
-### 1. Tokenization
+## Tokenization
 ```python
-from transformers import AutoTokenizer
-tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")
-tokenizer.padding_side = "left"
-tokenizer.pad_token = tokenizer.eos_token
-tokenizer.model_max_length = 8000
+tokenizer = AutoTokenizer.from_pretrained(model_path)  # Use model's tokenizer
+max_length = 2048
 ```
 
-### 2. Input Template
+## Prompt Template (CNN/DailyMail)
 ```
-Summarize the following news article in 128 tokens. Please output the summary only, without any other text.
-
-Article:
-{article}
-
-Summary:
-```
-
-### 3. Current Implementation
-- **Download:** `download_cnndm.py` loads CNN/DailyMail dataset
-- **Calibration:** `prepare-calibration.py` creates calibration subset
-- **Evaluation:** Uses `evaluation.py` for accuracy assessment
+<|begin_of_text|><|start_header_id|>system<|end_header_id|>
 
-## Missing Documentation (Addresses Issue #2245)
+You are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>
 
-The following preprocessing steps are **not currently documented** but would be needed for full reproducibility:
+Summarize this article:
 
-### 4. Filtering Steps (Recommended)
-Based on `llama2-70b/processorca.py` patterns:
-- **Language Filter:** English-only content validation
-- **Length Filter:** Input/output sequence length limits
-- **Quality Filter:** Remove very short summaries
-- **Content Filter:** Handle special characters and formatting
+{article}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
 
-### 5. Sampling Strategy (Recommended)
-- **Dataset Size:** Specify number of evaluation samples
-- **Selection Method:** Random vs stratified sampling
-- **Validation:** How to verify preprocessing consistency
-
-## Adaptation Guide
-
-**For Different Tokenizers:**
-1. Update `model-id` parameter in scripts
-2. Adjust `model_max_length` based on tokenizer capabilities
-3. Verify special token handling (pad_token, eos_token)
-
-**For Different Models:**
-1. Modify input template format
-2. Adjust summary length requirements (currently 128 tokens)
-3. Update evaluation criteria as needed
-
-## Files Generated
-- **Main Dataset:** Downloaded via `download_cnndm.py`
-- **Calibration Set:** Generated via `prepare-calibration.py`
-- **Format:** Standard CNN/DailyMail format from Hugging Face
-
-## Next Steps for Full Reproducibility
-
-To fully address issue #2245, consider adding:
-1. Complete preprocessing script (similar to `llama2-70b/processorca.py`)
-2. Documentation of filtering criteria
-3. Sampling methodology
-4. Quality validation steps
+```
 
-## See Also
-- `llama2-70b/processorca.py` - Reference implementation for comprehensive preprocessing
-- `PREPROCESSING-TEMPLATE.md` - Standard template for future models
+## Verification
+```bash
+# Check first token is 128000 (begin_of_text)
+python -c "print(tokenizer.encode('<|begin_of_text|>')[0])"
+```
````
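The new Llama 3.1 doc's template can be sanity-checked without loading the model or tokenizer. A small sketch that fills the template by plain string formatting and confirms the prompt starts with the `<|begin_of_text|>` marker — the same marker the doc's verification step expects to encode to token id 128000:

```python
# The CNN/DailyMail chat template from the doc, reproduced as a format string.
TEMPLATE = (
    "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
    "You are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n"
    "Summarize this article:\n\n"
    "{article}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
)

def build_prompt(article: str) -> str:
    # Fill the template with one article; no tokenizer needed for this check.
    return TEMPLATE.format(article=article)

prompt = build_prompt("Example article text.")
assert prompt.startswith("<|begin_of_text|>")  # mirrors the doc's first-token check
```

This catches template assembly mistakes (missing special tokens, wrong header order) before any tokenizer-level verification is run.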
