Merged
Changes from 4 commits
63 changes: 35 additions & 28 deletions language/deepseek-r1/README.md
@@ -1,4 +1,4 @@
# Mlperf Inference DeepSeek Reference Implementation
# MLPerf Inference DeepSeek Reference Implementation

## Automated command to run the benchmark via MLFlow

@@ -13,6 +13,22 @@ You can also do pip install mlc-scripts and then use `mlcr` commands for downloa
- DeepSeek-R1 model is automatically downloaded as part of setup
- Checkpoint conversion is done transparently when needed.

**Using the MLC R2 Downloader**

Download the model using the MLCommons R2 Downloader:

```bash
bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) \
https://inference.mlcommons-storage.org/metadata/deepseek-r1-0528.uri
```

To specify a custom download directory, use the `-d` flag:
```bash
bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) \
-d /path/to/download/directory \
https://inference.mlcommons-storage.org/metadata/deepseek-r1-0528.uri
```
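
For scripted setups, the two invocations above could be wrapped in a small helper. This is only a sketch and not part of the repository: the function name `mlc_r2_download` and the `DRY_RUN` variable are hypothetical. Only the downloader script URL, the `-d` flag, and the metadata URI are taken from the README.

```bash
# Hypothetical helper, not shipped with mlcommons/inference or r2-downloader.
DOWNLOADER_URL="https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh"

# Usage: mlc_r2_download <metadata-uri> [download-dir]
# With DRY_RUN=1 the command is printed instead of executed.
mlc_r2_download() {
  local uri="$1"
  local dir="${2:-}"
  # Rebuild the positional parameters as the downloader's argument list.
  if [ -n "$dir" ]; then
    set -- -d "$dir" "$uri"
  else
    set -- "$uri"
  fi
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "bash <(curl -s $DOWNLOADER_URL) $*"
  else
    # Run the documented invocation in an explicit bash process.
    bash -c 'bash <(curl -s "$0") "$@"' "$DOWNLOADER_URL" "$@"
  fi
}
```

Setting `DRY_RUN=1` before calling `mlc_r2_download https://inference.mlcommons-storage.org/metadata/deepseek-r1-0528.uri /path/to/download/directory` prints the command so it can be reviewed before anything is fetched.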

## Dataset Download

The dataset is an ensemble of the datasets: AIME, MATH500, gpqa, MMLU-Pro, livecodebench(code_generation_lite). They are covered by the following licenses:
@@ -23,49 +39,40 @@ The dataset is an ensemble of the datasets: AIME, MATH500, gpqa, MMLU-Pro, livec
- MMLU-Pro: [MIT](https://opensource.org/license/mit)
- livecodebench(code_generation_lite): [CC](https://creativecommons.org/share-your-work/cclicenses/)

### Preprocessed

**Using MLCFlow Automation**

```
mlcr get,dataset,whisper,_preprocessed,_mlc,_rclone --outdirname=<path to download> -j
```
### Preprocessed & Calibration

**Using Native method**
**Using the MLC R2 Downloader**

You can use Rclone to download the preprocessed dataset from a Cloudflare R2 bucket.
Download the full preprocessed dataset and calibration dataset using the MLCommons R2 Downloader:

To run Rclone on Windows, you can download the executable [here](https://rclone.org/install/#windows).
To install Rclone on Linux/macOS/BSD systems, run:
```
sudo -v ; curl https://rclone.org/install.sh | sudo bash
```
Once Rclone is installed, run the following command to authenticate with the bucket:
```
rclone config create mlc-inference s3 provider=Cloudflare access_key_id=f65ba5eef400db161ea49967de89f47b secret_access_key=fbea333914c292b854f14d3fe232bad6c5407bf0ab1bebf78833c2b359bdfd2b endpoint=https://c2686074cb2caf5cbaf6d134bdba8b47.r2.cloudflarestorage.com
```bash
bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) \
-d ./ https://inference.mlcommons-storage.org/metadata/deepseek-r1-datasets-fp8-eval.uri
```
You can then navigate in the terminal to your desired download directory and run the following command to download the dataset:

```
rclone copy mlc-inference:mlcommons-inference-wg-public/deepseek_r1/datasets/mlperf_deepseek_r1_dataset_4388_fp8_eval.pkl ./ -P
This will download the full preprocessed dataset file (`mlperf_deepseek_r1_dataset_4388_fp8_eval.pkl`) and the calibration dataset file (`mlperf_deepseek_r1_calibration_dataset_500_fp8_eval.pkl`).

To specify a custom download directory, use the `-d` flag:
```bash
bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) \
-d /path/to/download/directory \
https://inference.mlcommons-storage.org/metadata/deepseek-r1-datasets-fp8-eval.uri
```
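
After the download finishes, a quick check that both files are present can catch an interrupted transfer. The sketch below is hypothetical (the function name `check_deepseek_datasets` is not from the repository). It assumes the downloader places both `.pkl` files directly in the chosen directory, which the README does not state; adjust the path if they land in a subdirectory.

```bash
# Hypothetical post-download check. File names are the two listed in the README.
# Assumption: the files sit directly inside the download directory.
check_deepseek_datasets() {
  local dir="${1:-.}"
  local missing=0
  local f
  for f in \
    mlperf_deepseek_r1_dataset_4388_fp8_eval.pkl \
    mlperf_deepseek_r1_calibration_dataset_500_fp8_eval.pkl
  do
    if [ ! -f "$dir/$f" ]; then
      echo "missing: $dir/$f" >&2
      missing=1
    fi
  done
  # Non-zero exit status if at least one file is absent.
  return "$missing"
}
```

Calling `check_deepseek_datasets /path/to/download/directory` returns a non-zero status and names each missing file on stderr.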

### Calibration
### Preprocessed

**Using MLCFlow Automation**

```
mlcr get,preprocessed,dataset,deepseek-r1,_calibration,_mlc,_rclone --outdirname=<path to download> -j
mlcr get,dataset,whisper,_preprocessed,_mlc,_rclone --outdirname=<path to download> -j
```

**Using Native method**

Download and install Rclone as described in the previous section.
### Calibration

Then navigate in the terminal to your desired download directory and run the following command to download the dataset:
**Using MLCFlow Automation**

```
rclone copy mlc-inference:mlcommons-inference-wg-public/deepseek_r1/datasets/mlperf_deepseek_r1_calibration_dataset_500_fp8_eval.pkl ./ -P
mlcr get,preprocessed,dataset,deepseek-r1,_calibration,_mlc,_rclone --outdirname=<path to download> -j
```

## Docker
40 changes: 18 additions & 22 deletions language/llama3.1-8b/README.md
@@ -137,18 +137,7 @@ Downloading llama3.1-8b model from Hugging Face will require an [**access token*

### Preprocessed

You can use Rclone to download the preprocessed dataset from a Cloudflare R2 bucket.

To run Rclone on Windows, you can download the executable [here](https://rclone.org/install/#windows).
To install Rclone on Linux/macOS/BSD systems, run:
```
sudo -v ; curl https://rclone.org/install.sh | sudo bash
```
Once Rclone is installed, run the following command to authenticate with the bucket:
```
rclone config create mlc-inference s3 provider=Cloudflare access_key_id=f65ba5eef400db161ea49967de89f47b secret_access_key=fbea333914c292b854f14d3fe232bad6c5407bf0ab1bebf78833c2b359bdfd2b endpoint=https://c2686074cb2caf5cbaf6d134bdba8b47.r2.cloudflarestorage.com
```
You can then navigate in the terminal to your desired download directory and run the following command to download the dataset:
Download the preprocessed datasets using the MLCommons downloader:

#### Full dataset (datacenter)

@@ -158,9 +147,11 @@ mlcr get,dataset,cnndm,_validation,_datacenter,_llama3,_mlc,_rclone --outdirname
```

**Native method**
```bash
bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) \
https://inference.mlcommons-storage.org/metadata/llama3-1-8b-cnn-eval.uri
```
rclone copy mlc-inference:mlcommons-inference-wg-public/llama3.1_8b/datasets/cnn_eval.json ./ -P
```
This will download `cnn_eval.json`.

#### 5000 samples (edge)

@@ -170,9 +161,11 @@ mlcr get,dataset,cnndm,_validation,_edge,_llama3,_mlc,_rclone --outdirname=<path
```

**Native method**
```bash
bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) \
https://inference.mlcommons-storage.org/metadata/llama3-1-8b-sample-cnn-eval-5000.uri
```
rclone copy mlc-inference:mlcommons-inference-wg-public/llama3.1_8b/datasets/sample_cnn_eval_5000.json ./ -P
```
This will download `sample_cnn_eval_5000.json`.

#### Calibration

@@ -182,14 +175,17 @@ mlcr get,dataset,cnndm,_calibration,_llama3,_mlc,_rclone --outdirname=<path to d
```

**Native method**
```bash
bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) \
https://inference.mlcommons-storage.org/metadata/llama3-1-8b-cnn-dailymail-calibration.uri
```
rclone copy mlc-inference:mlcommons-inference-wg-public/llama3.1_8b/datasets/cnn_dailymail_calibration.json ./ -P
```
This will download `cnn_dailymail_calibration.json`.

You can also download the calibration dataset from the Cloudflare R2 bucket by running the following command:

```
rclone copy mlc-inference:mlcommons-inference-wg-public/llama3.1_8b/cnn_eval.json ./ -P
To specify a custom download directory for any of these, use the `-d` flag:
```bash
bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) \
-d /path/to/download/directory \
<URI>
```
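
Since the `<URI>` placeholder stands for one of the three metadata URIs above, a small lookup can keep scripts readable. This is an illustrative sketch only: the function name `llama3_1_8b_uri` and the subset labels `datacenter`, `edge`, and `calibration` are hypothetical conveniences. The URIs themselves are copied from the README.

```bash
# Hypothetical lookup: subset label -> metadata URI (URIs as listed in the README).
llama3_1_8b_uri() {
  local base="https://inference.mlcommons-storage.org/metadata"
  case "$1" in
    datacenter)  echo "$base/llama3-1-8b-cnn-eval.uri" ;;
    edge)        echo "$base/llama3-1-8b-sample-cnn-eval-5000.uri" ;;
    calibration) echo "$base/llama3-1-8b-cnn-dailymail-calibration.uri" ;;
    *)           echo "unknown subset: $1" >&2; return 1 ;;
  esac
}

# Example: pass the result in place of <URI> in the command above, e.g.
#   ... mlc-r2-downloader.sh) -d /path/to/download/directory "$(llama3_1_8b_uri edge)"
```

An unknown label returns a non-zero status, so a calling script can stop before invoking the downloader with an empty URI.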


40 changes: 23 additions & 17 deletions speech2text/README.md
@@ -105,23 +105,21 @@ You can download the model automatically via the below command
mlcr get,ml-model,whisper,_rclone,_mlc --outdirname=<path_to_download> -j
```

**Official Model download using native method**
**Official Model download using MLC R2 Downloader**

You can use Rclone to download the preprocessed dataset from a Cloudflare R2 bucket.
Download the Whisper model using the MLCommons downloader:

To run Rclone on Windows, you can download the executable [here](https://rclone.org/install/#windows).
To install Rclone on Linux/macOS/BSD systems, run:
```
sudo -v ; curl https://rclone.org/install.sh | sudo bash
```
Once Rclone is installed, run the following command to authenticate with the bucket:
```
rclone config create mlc-inference s3 provider=Cloudflare access_key_id=f65ba5eef400db161ea49967de89f47b secret_access_key=fbea333914c292b854f14d3fe232bad6c5407bf0ab1bebf78833c2b359bdfd2b endpoint=https://c2686074cb2caf5cbaf6d134bdba8b47.r2.cloudflarestorage.com
```bash
bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) -d whisper/model https://inference.mlcommons-storage.org/metadata/whisper-model.uri
```
You can then navigate in the terminal to your desired download directory and run the following command to download the model:

```
rclone copy mlc-inference:mlcommons-inference-wg-public/Whisper/model/ ./ -P
This will download the Whisper model files.

To specify a custom download directory, use the `-d` flag:
```bash
bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) \
-d /path/to/download/directory \
https://inference.mlcommons-storage.org/metadata/whisper-model.uri
```

### External Download (Not recommended for official submission)
@@ -156,13 +154,21 @@ We use dev-clean and dev-other splits, which are approximately 10 hours.
mlcr get,dataset,whisper,_preprocessed,_mlc,_rclone --outdirname=<path to download> -j
```

**Native method**
**Using MLC R2 Downloader**

Download and install rclone as decribed in the [MLCommons Download section](#mlcommons-download)
Download the preprocessed dataset using the MLCommons R2 Downloader:

You can then navigate in the terminal to your desired download directory and run the following command to download the dataset:
```bash
bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) -d whisper/dataset https://inference.mlcommons-storage.org/metadata/whisper-dataset.uri
```
rclone copy mlc-inference:mlcommons-inference-wg-public/Whisper/dataset/ ./ -P

This will download the LibriSpeech dataset files.

To specify a custom download directory, use the `-d` flag:
```bash
bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) \
-d /path/to/download/directory \
https://inference.mlcommons-storage.org/metadata/whisper-dataset.uri
```

### Unprocessed