-
Notifications
You must be signed in to change notification settings - Fork 7.1k
fix: check if the voc dataset folder exists before downloading. #9129
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/vision/9129
Note: Links to docs will display an error until the docs builds have been completed. ❌ 2 New FailuresAs of commit 67c37ac with merge base b818d32 ( NEW FAILURES - The following jobs have failed:
This comment was automatically generated by Dr. CI and updates every 15 minutes. |
A friendly ping to @NicolasHug for a code review. |
def _download(self, voc_root: str) -> None: | ||
if self._check_exists(voc_root): | ||
return | ||
download_and_extract_archive(self.url, self.root, filename=self.filename, md5=self.md5) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm actually a little confused, as download_and_extract_archive()
calls download_url()
, which already does an existence check:
vision/torchvision/datasets/utils.py
Lines 110 to 112 in 98f8b37
# check if file is already present locally | |
if check_integrity(fpath, md5): | |
return |
That is, my code reading makes me think that the current code should already avoid re-downloading the files. If it does not, then we may have a bug where we're not passing the right thing to the utility functions, and we should fix that instead of implementing a new kind of existence check.
The current existence check does perform an md5 sum over all of the files, and that might also be expensive, but I wouldn't expect it to be the 8 minutes the user reported in #9059:
vision/torchvision/datasets/utils.py
Lines 35 to 43 in 98f8b37
def calculate_md5(fpath: Union[str, pathlib.Path], chunk_size: int = 1024 * 1024) -> str: | |
# Setting the `usedforsecurity` flag does not change anything about the functionality, but indicates that we are | |
# not using the MD5 checksum for cryptography. This enables its usage in restricted environments like FIPS. Without | |
# it torchvision.datasets is unusable in these environments since we perform a MD5 check everywhere. | |
md5 = hashlib.md5(usedforsecurity=False) | |
with open(fpath, "rb") as f: | |
while chunk := f.read(chunk_size): | |
md5.update(chunk) | |
return md5.hexdigest() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes!! I didn't noticed the internal downloading mechanism... and your conclusion from the code reading is correct. After testing locally, I can confirm that the check_integrity
function should indeed return early before starting a new downloading session. My bad on being so sloppy on re-validating the issue.
Perhaps we could close this PR for now and check if the issue reappears in the future?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@GdoongMathew, no worries, I also had assumed there must be a problem! I agree with closing the issue and seeing if the issue comes up again.
@GdoongMathew, thanks for the PR! I'm surprised that any changes are necessary - can you take a look at the comment I left on the code? |
Closing because we think the code already accomplishes this behavior. |
fix #9059