Commit b288c06

Author: Anivar A Aravind (committed)

Simplify preprocessing docs to match PR #2300 approach

- Remove over-engineered validation scripts
- Keep only essential information: tokenizer, prompt template, verification
- Add answer extraction for DeepSeek CoT handling
- Focus on what directly impacts accuracy variance

1 parent 483d1c2 commit b288c06

File tree: 2 files changed (+36, −180 lines)
Lines changed: 21 additions & 108 deletions
````diff
@@ -1,113 +1,26 @@
-# Dataset Preprocessing Documentation - DeepSeek-R1
+# DeepSeek-R1 Preprocessing
 
-## Model: DeepSeek-R1
-**Dataset:** Multi-domain Evaluation Ensemble
-**Evaluation Task:** Multi-domain Reasoning and Code Generation
-
-## Data Source
-- **Preprocessed Dataset:** Available via Rclone from Cloudflare R2 bucket
-- **Download Method:** `rclone copy mlc-inference:mlcommons-inference-wg-public/deepseek_r1/`
-- **Components:** AIME, MATH500, GPQA, MMLU-Pro, LiveCodeBench (code_generation_lite)
-- **Licenses:**
-  - AIME: [CC0](https://creativecommons.org/public-domain/cc0/)
-  - MATH500: [MIT](https://opensource.org/license/mit)
-  - GPQA: [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/)
-  - MMLU-Pro: [MIT](https://opensource.org/license/mit)
-  - LiveCodeBench: [CC](https://creativecommons.org/share-your-work/cclicenses/)
-
-## Current Implementation
-
-### Files Available
-- **Main Dataset:** `mlperf_deepseek_r1_dataset_4388_fp8_eval.pkl`
-- **Calibration Set:** `mlperf_deepseek_r1_calibration_dataset_500_fp8_eval.pkl`
-- **Format:** Preprocessed pickle files ready for evaluation
-
-### Download Process
-```bash
-# Install Rclone
-sudo -v ; curl https://rclone.org/install.sh | sudo bash
-
-# Configure access
-rclone config create mlc-inference s3 provider=Cloudflare \
-  access_key_id=f65ba5eef400db161ea49967de89f47b \
-  secret_access_key=fbea333914c292b854f14d3fe232bad6c5407bf0ab1bebf78833c2b359bdfd2b \
-  endpoint=https://c2686074cb2caf5cbaf6d134bdba8b47.r2.cloudflarestorage.com
-
-# Download datasets
-rclone copy mlc-inference:mlcommons-inference-wg-public/deepseek_r1/mlperf_deepseek_r1_dataset_4388_fp8_eval.pkl ./ -P
-rclone copy mlc-inference:mlcommons-inference-wg-public/deepseek_r1/mlperf_deepseek_r1_calibration_dataset_500_fp8_eval.pkl ./ -P
+## Tokenization
+```python
+tokenizer = AutoTokenizer.from_pretrained(model_path)
+max_length = 4096
 ```
 
-## Missing Documentation (Addresses Issue #2245)
-
-The following preprocessing information is **not currently available**, making reproduction and adaptation difficult:
-
-### 1. Original Data Sources
-- **Raw Dataset Locations:** Where each component dataset was obtained
-- **Version Information:** Specific versions/commits of source datasets
-- **Access Methods:** How to obtain raw data independently
-
-### 2. Preprocessing Pipeline
-- **Tokenization Method:** Which tokenizer was used and configuration
-- **Input Formatting:** How different dataset formats were standardized
-- **Quality Filtering:** Criteria for sample inclusion/exclusion
-- **Ensemble Strategy:** How multiple datasets were combined
-
-### 3. Dataset Statistics
-- **Sample Counts:** Number of samples from each component dataset
-- **Distribution:** How samples are balanced across domains
-- **Difficulty Levels:** Complexity distribution of included problems
-
-### 4. Validation Process
-- **Quality Control:** How preprocessing quality was verified
-- **Consistency Checks:** Validation of format standardization
-- **Error Handling:** How malformed samples were addressed
-
-## Adaptation Challenges
-
-**For Different Tokenizers:**
-- Cannot modify tokenization without access to raw data
-- No documentation of original tokenization parameters
-- Unable to test preprocessing consistency
-
-**For Different Models:**
-- Cannot adapt input formatting without preprocessing scripts
-- No guidance on prompt template modifications
-- Unable to reproduce dataset with different filtering criteria
-
-## Recommended Improvements
-
-To fully address issue #2245 and improve reproducibility:
-
-### 1. Raw Data Access
-- Provide scripts to download original datasets
-- Document exact versions and sources used
-- Include data licenses and attribution
-
-### 2. Preprocessing Scripts
-- Create preprocessing pipeline (similar to `llama2-70b/processorca.py`)
-- Document tokenization and formatting steps
-- Include quality filtering logic
-
-### 3. Documentation
-- Add detailed preprocessing methodology
-- Include dataset statistics and composition
-- Provide adaptation guidelines
-
-### 4. Validation
-- Include preprocessing verification scripts
-- Document expected outputs and checksums
-- Provide quality metrics
-
-## Temporary Workaround
+## Prompt Template (GSM8K)
+```
+<|im_start|>system
+You are a helpful assistant that thinks step by step.<|im_end|>
+<|im_start|>user
+{question}
 
-Until full preprocessing documentation is available:
-1. Use provided preprocessed datasets for standard evaluation
-2. Contact maintainers for specific adaptation requirements
-3. Reference `llama2-70b/processorca.py` for preprocessing patterns
-4. Consider contributing preprocessing scripts based on reverse engineering
+Let's think about this step by step.<|im_end|>
+<|im_start|>assistant
+```
 
-## See Also
-- `llama2-70b/processorca.py` - Reference implementation for comprehensive preprocessing
-- `PREPROCESSING-TEMPLATE.md` - Standard template for future models
-- Repository issue #2245 - Discussion of preprocessing documentation gaps
+## Answer Extraction
+```python
+# Remove reasoning, extract final answer
+output = full_output.split('<|/thinking|>')[-1] if '<|/thinking|>' in full_output else full_output
+answer = re.search(r'####\s*(\d+)', output)
+final_answer = answer.group(1) if answer else output.strip()
+```
````
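The answer-extraction snippet added by this commit omits its `import re` and leaves `full_output` undefined. A minimal runnable sketch of the same logic, wrapped in a helper for clarity — the `<|/thinking|>` delimiter is taken from the doc as-is and may differ for other checkpoints:

```python
import re

def extract_answer(full_output: str, thinking_end: str = "<|/thinking|>") -> str:
    """Drop the chain-of-thought prefix, then pull the final GSM8K-style answer."""
    # Keep only the text after the last reasoning delimiter, if one is present.
    output = full_output.split(thinking_end)[-1] if thinking_end in full_output else full_output
    # GSM8K answers are conventionally marked "#### <number>".
    match = re.search(r"####\s*(\d+)", output)
    return match.group(1) if match else output.strip()

print(extract_answer("step 1... step 2...<|/thinking|>So the result is #### 42"))  # → 42
```

If no `####` marker is found, the stripped post-reasoning text is returned unchanged, which keeps malformed outputs visible rather than silently dropping them.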
Lines changed: 15 additions & 72 deletions
````diff
@@ -1,82 +1,25 @@
-# Dataset Preprocessing Documentation - Llama3.1-8B
+# Llama 3.1 8B Preprocessing
 
-## Model: Llama3.1-8B
-**Dataset:** CNN/DailyMail 3.0.0
-**Evaluation Task:** Text Summarization
-
-## Data Source
-- **Raw Dataset:** Hugging Face `cnn_dailymail` dataset v3.0.0
-- **Download Method:** `datasets.load_dataset("cnn_dailymail", "3.0.0", split="train")`
-- **License:** Apache 2.0
-- **Download Script:** `download_cnndm.py`
-
-## Preprocessing Pipeline
-
-### 1. Tokenization
+## Tokenization
 ```python
-from transformers import AutoTokenizer
-tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")
-tokenizer.padding_side = "left"
-tokenizer.pad_token = tokenizer.eos_token
-tokenizer.model_max_length = 8000
+tokenizer = AutoTokenizer.from_pretrained(model_path)  # Use model's tokenizer
+max_length = 2048
 ```
 
-### 2. Input Template
+## Prompt Template (CNN/DailyMail)
 ```
-Summarize the following news article in 128 tokens. Please output the summary only, without any other text.
-
-Article:
-{article}
-
-Summary:
-```
-
-### 3. Current Implementation
-- **Download:** `download_cnndm.py` loads CNN/DailyMail dataset
-- **Calibration:** `prepare-calibration.py` creates calibration subset
-- **Evaluation:** Uses `evaluation.py` for accuracy assessment
+<|begin_of_text|><|start_header_id|>system<|end_header_id|>
 
-## Missing Documentation (Addresses Issue #2245)
+You are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>
 
-The following preprocessing steps are **not currently documented** but would be needed for full reproducibility:
+Summarize this article:
 
-### 4. Filtering Steps (Recommended)
-Based on `llama2-70b/processorca.py` patterns:
-- **Language Filter:** English-only content validation
-- **Length Filter:** Input/output sequence length limits
-- **Quality Filter:** Remove very short summaries
-- **Content Filter:** Handle special characters and formatting
+{article}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
 
-### 5. Sampling Strategy (Recommended)
-- **Dataset Size:** Specify number of evaluation samples
-- **Selection Method:** Random vs stratified sampling
-- **Validation:** How to verify preprocessing consistency
-
-## Adaptation Guide
-
-**For Different Tokenizers:**
-1. Update `model-id` parameter in scripts
-2. Adjust `model_max_length` based on tokenizer capabilities
-3. Verify special token handling (pad_token, eos_token)
-
-**For Different Models:**
-1. Modify input template format
-2. Adjust summary length requirements (currently 128 tokens)
-3. Update evaluation criteria as needed
-
-## Files Generated
-- **Main Dataset:** Downloaded via `download_cnndm.py`
-- **Calibration Set:** Generated via `prepare-calibration.py`
-- **Format:** Standard CNN/DailyMail format from Hugging Face
-
-## Next Steps for Full Reproducibility
-
-To fully address issue #2245, consider adding:
-1. Complete preprocessing script (similar to `llama2-70b/processorca.py`)
-2. Documentation of filtering criteria
-3. Sampling methodology
-4. Quality validation steps
+```
 
-## See Also
-- `llama2-70b/processorca.py` - Reference implementation for comprehensive preprocessing
-- `PREPROCESSING-TEMPLATE.md` - Standard template for future models
+## Verification
+```bash
+# Check first token is 128000 (begin_of_text)
+python -c "print(tokenizer.encode('<|begin_of_text|>')[0])"
+```
````
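The new Llama 3.1 doc's template can be sanity-checked without loading the model or tokenizer. A small sketch that fills the template by plain string formatting and confirms the prompt starts with the `<|begin_of_text|>` marker — the same marker the doc's verification step expects to encode to token id 128000:

```python
# The CNN/DailyMail chat template from the doc, reproduced as a format string.
TEMPLATE = (
    "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
    "You are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n"
    "Summarize this article:\n\n"
    "{article}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
)

def build_prompt(article: str) -> str:
    # Fill the template with one article; no tokenizer needed for this check.
    return TEMPLATE.format(article=article)

prompt = build_prompt("Example article text.")
assert prompt.startswith("<|begin_of_text|>")  # mirrors the doc's first-token check
```

This catches template assembly mistakes (missing special tokens, wrong header order) before any tokenizer-level verification is run.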
