fix: check if the voc dataset folder exists before downloading. #9129


Closed
GdoongMathew wants to merge 3 commits

Conversation

GdoongMathew
Contributor

fix #9059


pytorch-bot bot commented Jun 28, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/vision/9129

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures

As of commit 67c37ac with merge base b818d32:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@GdoongMathew
Contributor Author

A friendly ping to @NicolasHug for a code review.

def _download(self, voc_root: str) -> None:
    if self._check_exists(voc_root):
        return
    download_and_extract_archive(self.url, self.root, filename=self.filename, md5=self.md5)
Contributor

I'm actually a little confused, as download_and_extract_archive() calls download_url(), which already does an existence check:

# check if file is already present locally
if check_integrity(fpath, md5):
    return

That is, my code reading makes me think that the current code should already avoid re-downloading the files. If it does not, then we may have a bug where we're not passing the right thing to the utility functions, and we should fix that instead of implementing a new kind of existence check.
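For reference, here's a condensed paraphrase of that call chain in torchvision.datasets.utils (not the verbatim source, and details may vary across versions; the _sketch suffix marks it as illustrative):

import os
from typing import Optional
from torchvision.datasets.utils import check_integrity

def download_url_sketch(url: str, root: str, filename: Optional[str] = None, md5: Optional[str] = None) -> None:
    fpath = os.path.join(root, filename or os.path.basename(url))
    # Early return: the archive is already on disk and its md5 matches,
    # so no network request is ever made.
    if check_integrity(fpath, md5):
        print(f"Using downloaded and verified file: {fpath}")
        return
    # ... only past this point does the actual download happen ...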

The current existence check does perform an md5 sum over all of the files, and that might also be expensive, but I wouldn't expect it to be the 8 minutes the user reported in #9059:

def calculate_md5(fpath: Union[str, pathlib.Path], chunk_size: int = 1024 * 1024) -> str:
    # Setting the `usedforsecurity` flag does not change anything about the functionality, but indicates that we are
    # not using the MD5 checksum for cryptography. This enables its usage in restricted environments like FIPS. Without
    # it torchvision.datasets is unusable in these environments since we perform a MD5 check everywhere.
    md5 = hashlib.md5(usedforsecurity=False)
    with open(fpath, "rb") as f:
        while chunk := f.read(chunk_size):
            md5.update(chunk)
    return md5.hexdigest()
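For completeness, that hashing cost is only paid when a checksum is supplied; check_integrity in the same module is roughly the following (paraphrased, not verbatim):

import os

def check_integrity(fpath, md5=None):
    # No file on disk: the caller needs to download.
    if not os.path.isfile(fpath):
        return False
    # File exists and no checksum requested: accept it as-is.
    if md5 is None:
        return True
    # File exists and a checksum was given: hash the file once and compare.
    return md5 == calculate_md5(fpath)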

Contributor Author

Yes!! I didn't notice the internal downloading mechanism, and your conclusion from the code reading is correct. After testing locally, I can confirm that the check_integrity function does indeed return early before starting a new download session. My bad for being sloppy about re-validating the issue.
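A minimal sketch of the kind of test that confirms this (the root path and dataset year here are arbitrary):

from torchvision.datasets import VOCDetection

# First construction downloads and extracts the archive.
VOCDetection("./data", year="2012", image_set="train", download=True)
# Second construction hits the check_integrity early return and prints
# "Using downloaded and verified file: ..." instead of re-downloading.
VOCDetection("./data", year="2012", image_set="train", download=True)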

Perhaps we could close this PR for now and check if the issue reappears in the future?

Contributor

@GdoongMathew, no worries, I had also assumed there must be a problem! I agree with closing this and seeing if the issue comes up again.

@scotts
Contributor

scotts commented Aug 8, 2025

@GdoongMathew, thanks for the PR! I'm surprised that any changes are necessary - can you take a look at the comment I left on the code?

@scotts
Contributor

scotts commented Aug 12, 2025

Closing because we think the code already accomplishes this behavior.

@scotts scotts closed this Aug 12, 2025

Successfully merging this pull request may close these issues.

download=True argument for VOCDetection will always download and rewrite the dataset again