fix: check if the voc dataset folder exists before downloading. #9129


Closed
GdoongMathew wants to merge 3 commits

Conversation

GdoongMathew
Contributor

fix #9059


pytorch-bot bot commented Jun 28, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/vision/9129

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures

As of commit 67c37ac with merge base b818d32:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@GdoongMathew
Contributor Author

A friendly ping to @NicolasHug for a code review.

def _download(self, voc_root: str) -> None:
    if self._check_exists(voc_root):
        return
    download_and_extract_archive(self.url, self.root, filename=self.filename, md5=self.md5)
Contributor

I'm actually a little confused, as download_and_extract_archive() calls download_url(), which already does an existence check:

# check if file is already present locally
if check_integrity(fpath, md5):
    return

That is, my code reading makes me think that the current code should already avoid re-downloading the files. If it does not, then we may have a bug where we're not passing the right thing to the utility functions, and we should fix that instead of implementing a new kind of existence check.
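For reference, here's a condensed paraphrase of that call chain in torchvision.datasets.utils (not the verbatim source, and details may vary across versions; the _sketch suffix marks it as illustrative):

import os
from typing import Optional
from torchvision.datasets.utils import check_integrity

def download_url_sketch(url: str, root: str, filename: Optional[str] = None, md5: Optional[str] = None) -> None:
    fpath = os.path.join(root, filename or os.path.basename(url))
    # Early return: the archive is already on disk and its md5 matches,
    # so no network request is ever made.
    if check_integrity(fpath, md5):
        print(f"Using downloaded and verified file: {fpath}")
        return
    # ... only past this point does the actual download happen ...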

The current existence check does perform an md5 sum over all of the files, and that might also be expensive, but I wouldn't expect it to be the 8 minutes the user reported in #9059:

def calculate_md5(fpath: Union[str, pathlib.Path], chunk_size: int = 1024 * 1024) -> str:
    # Setting the `usedforsecurity` flag does not change anything about the functionality, but indicates that we are
    # not using the MD5 checksum for cryptography. This enables its usage in restricted environments like FIPS. Without
    # it torchvision.datasets is unusable in these environments since we perform a MD5 check everywhere.
    md5 = hashlib.md5(usedforsecurity=False)
    with open(fpath, "rb") as f:
        while chunk := f.read(chunk_size):
            md5.update(chunk)
    return md5.hexdigest()
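For completeness, that hashing cost is only paid when a checksum is supplied; check_integrity in the same module is roughly the following (paraphrased, not verbatim):

import os

def check_integrity(fpath, md5=None):
    # No file on disk: the caller needs to download.
    if not os.path.isfile(fpath):
        return False
    # File exists and no checksum requested: accept it as-is.
    if md5 is None:
        return True
    # File exists and a checksum was given: hash the file once and compare.
    return md5 == calculate_md5(fpath)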

Contributor Author

Yes!! I didn't notice the internal downloading mechanism, and your conclusion from the code reading is correct. After testing locally, I can confirm that the check_integrity function does indeed return early before starting a new download session. My bad for being sloppy about re-validating the issue.
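A minimal sketch of the kind of test that confirms this (the root path and dataset year here are arbitrary):

from torchvision.datasets import VOCDetection

# First construction downloads and extracts the archive.
VOCDetection("./data", year="2012", image_set="train", download=True)
# Second construction hits the check_integrity early return and prints
# "Using downloaded and verified file: ..." instead of re-downloading.
VOCDetection("./data", year="2012", image_set="train", download=True)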

Perhaps we could close this PR for now and check if the issue reappears in the future?

Contributor

@GdoongMathew, no worries, I had also assumed there must be a problem! I agree with closing this and seeing if the issue comes up again.

@scotts
Contributor

scotts commented Aug 8, 2025

@GdoongMathew, thanks for the PR! I'm surprised that any changes are necessary - can you take a look at the comment I left on the code?

@scotts
Contributor

scotts commented Aug 12, 2025

Closing because we think the code already accomplishes this behavior.

@scotts scotts closed this Aug 12, 2025

Successfully merging this pull request may close these issues.

download=True argument for VOCDetection will always download and rewrite the dataset again