anivar
Contributor

@anivar anivar commented Jul 25, 2025

Summary

  • Replaces rclone-based download instructions with the new MLCommons downloader infrastructure
  • Updates documentation for DeepSeek-R1, Llama 3.1 8b, and Whisper benchmarks only (as requested)
  • Maintains both MLCFlow automation commands and native download methods

Changes

  • Removed all rclone installation and configuration instructions
  • Added new download commands using https://inference.mlcommons-storage.org
  • Added documentation for the -d flag for custom download directories

Test plan

  • Verify new download commands work correctly for DeepSeek-R1 dataset and calibration files
  • Verify new download commands work correctly for Llama 3.1 8b datasets (full, edge, calibration)
  • Verify new download commands work correctly for Whisper model and dataset
  • Confirm MLCFlow automation commands remain unchanged and functional

Fixes #2265

@anivar anivar requested a review from a team as a code owner July 25, 2025 04:05
Contributor

github-actions bot commented Jul 25, 2025

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@hanyunfan
Contributor

Shall we keep both methods?

@nathanw-mlc
Member

Shall we keep both methods?

We will hopefully be deprecating the Rclone method eventually, as it requires publicly sharing API keys and is a source of issues: people use different versions of Rclone that behave differently, and some configure Rclone remotes incorrectly. The R2 downloader runs with a single command that handles everything.

Member

@nathanw-mlc nathanw-mlc left a comment

Thanks for your PR @anivar. The download commands in the modified README.md files are incorrect: they point to JSON files instead of the URI files used in the commands on the download site (https://inference.mlcommons-storage.org).

@hanyunfan
Contributor

Shall we keep both methods?

We will hopefully be deprecating the Rclone method eventually, as it requires publicly sharing API keys and is a source of issues: people use different versions of Rclone that behave differently, and some configure Rclone remotes incorrectly. The R2 downloader runs with a single command that handles everything.

My concern is that users who prefer not to use the downloader should have an alternative — such as providing the Hugging Face URL — in case the downloader becomes inaccessible or stops working.

@anivar anivar force-pushed the update-download-docs-remove-rclone branch from be566a4 to c1f697e Compare July 25, 2025 22:18
@anivar anivar requested a review from nathanw-mlc July 25, 2025 22:20
@anivar
Contributor Author

anivar commented Jul 25, 2025

@nathanw-mlc Thank you for the review! I've updated the PR to address your feedback.

The download commands have been corrected to use the proper URI files from the metadata directory instead of JSON files:

  • DeepSeek-R1: Now using deepseek-r1-datasets-fp8-eval.uri and deepseek-r1-0528.uri
  • Llama 3.1 8b: Now using llama3-1-8b-cnn-eval.uri, llama3-1-8b-sample-cnn-eval-5000.uri, and llama3-1-8b-cnn-dailymail-calibration.uri
  • Whisper: Now using whisper-model.uri and whisper-dataset.uri

All URLs now correctly point to https://inference.mlcommons-storage.org/metadata/*.uri as shown on the download site.
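
The naming pattern above can be sketched in Python as follows. The helper names (`metadata_urls`, `find_non_uri`) are hypothetical, and the snippet only builds and checks URL strings from the file names listed in this comment. It does not download anything and does not reproduce the downloader command, for which the download site remains the reference.

```python
# Sketch only: builds metadata URLs from the .uri file names listed above.
# BASE_URL follows the pattern quoted in this comment; the helper names are
# hypothetical and are not part of any MLCommons tool.
BASE_URL = "https://inference.mlcommons-storage.org/metadata"

URI_FILES = {
    "DeepSeek-R1": [
        "deepseek-r1-datasets-fp8-eval.uri",
        "deepseek-r1-0528.uri",
    ],
    "Llama 3.1 8b": [
        "llama3-1-8b-cnn-eval.uri",
        "llama3-1-8b-sample-cnn-eval-5000.uri",
        "llama3-1-8b-cnn-dailymail-calibration.uri",
    ],
    "Whisper": [
        "whisper-model.uri",
        "whisper-dataset.uri",
    ],
}


def metadata_urls(benchmark):
    """Return the full metadata URL for each .uri file of a benchmark."""
    return [f"{BASE_URL}/{name}" for name in URI_FILES[benchmark]]


def find_non_uri(urls):
    """Return URLs that do not end in .uri (e.g. leftover .json links)."""
    return [url for url in urls if not url.endswith(".uri")]


if __name__ == "__main__":
    for benchmark in URI_FILES:
        for url in metadata_urls(benchmark):
            print(f"{benchmark}: {url}")
```

A check like `find_non_uri` could be run over the URLs in a README to catch the `.json` links flagged in the earlier review.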

@nathanw-mlc
Member

My concern is that users who prefer not to use the downloader should have an alternative — such as providing the Hugging Face URL — in case the downloader becomes inaccessible or stops working.

The MLC R2 Downloader downloads from the same location as the Rclone commands, namely the R2 buckets maintained by MLCommons. If the models and datasets are taken from another public location, such as HuggingFace, we often point out where that is so folks can download them there if need be. I see that this has not been done for some of the README files modified by this PR. Folks in the Inference Working Group can make that addition in another PR.

@anivar
Contributor Author

anivar commented Jul 25, 2025

@nathanw-mlc Thanks for pointing that out! You're right - I focused on fixing the download URLs but didn't add the alternative sources like HuggingFace. Since the Inference Working Group can handle adding those alternative download locations in a follow-up PR, should this PR be good to merge as-is?

anivar and others added 2 commits July 26, 2025 04:13
…k-R1, Llama 3.1 8b, and Whisper

- Replace rclone-based download instructions with new MLCommons downloader infrastructure
- Update DeepSeek-R1, Llama 3.1 8b, and Whisper READMEs to use https://inference.mlcommons-storage.org
- Maintain MLCFlow automation commands alongside native download methods
- Add file size information for each download
- Include -d flag documentation for custom download directories

Fixes mlcommons#2265
…ect URIs

- Remove rclone-based download instructions
- Replace .json URLs with correct .uri files from metadata directory
- Update download commands for DeepSeek-R1, Llama 3.1 8b, and Whisper
- Use new MLCommons downloader infrastructure
- Remove file size information from download instructions
@anivar anivar force-pushed the update-download-docs-remove-rclone branch from be09d60 to a001c35 Compare July 25, 2025 22:43
nathanw-mlc previously approved these changes Jul 25, 2025
Member

@nathanw-mlc nathanw-mlc left a comment

I made a few tweaks, but it LGTM.

@anivar
Contributor Author

anivar commented Jul 26, 2025

Thanks @nathanw-mlc for the fixes and improvements - especially catching that MLCFlow typo! 🙏

@anandhu-eng
Contributor

Hi folks, I've updated the MLCFlow commands for all three benchmarks in this PR.

Thanks for the PR, @anivar.

@nathanw-mlc
Member

Hi folks, I've updated the MLCFlow commands for all three benchmarks in this PR.

Thanks @anandhu-eng!

@arjunsuresh
Contributor

This Friday is the submission deadline :)

@anivar
Contributor Author

anivar commented Aug 17, 2025

Hi team,

This has been approved for a while now. Any reason for the delay in merging?

@arjunsuresh
Contributor

The delay was due to the inference submission, during which a code freeze was in place. Even though the changes in this PR are good, there were some breaking changes and the WG did not want to disturb the submissions. Since it was already approved, I'm merging it now.

Contributor

@arjunsuresh arjunsuresh left a comment

Previously approved by WG

@arjunsuresh arjunsuresh merged commit e064826 into mlcommons:master Aug 20, 2025
1 check passed
@github-actions github-actions bot locked and limited conversation to collaborators Aug 20, 2025
Development

Successfully merging this pull request may close these issues.

Deepseek Dataset will not download using rclone
5 participants