fix: check if the voc dataset folder exists before downloading. #9129


Closed · wants to merge 3 commits
28 changes: 22 additions & 6 deletions torchvision/datasets/voc.py
@@ -1,7 +1,7 @@
 import collections
 import os
 from pathlib import Path
-from typing import Any, Callable, Optional, Union
+from typing import Any, Callable, Optional, Tuple, Union
 from xml.etree.ElementTree import Element as ET_Element
 
 try:
@@ -64,6 +64,8 @@ class _VOCBase(VisionDataset):
     _SPLITS_DIR: str
     _TARGET_DIR: str
     _TARGET_FILE_EXT: str
+    _IMAGE_SET: str = "ImageSets"
+    _IMAGE_DIR: str = "JPEGImages"
 
     def __init__(
         self,
@@ -95,24 +97,38 @@ def __init__(
         voc_root = os.path.join(self.root, base_dir)
 
         if download:
-            download_and_extract_archive(self.url, self.root, filename=self.filename, md5=self.md5)
+            self._download(voc_root)
 
-        if not os.path.isdir(voc_root):
+        if not self._check_exists(voc_root):
             raise RuntimeError("Dataset not found or corrupted. You can use download=True to download it")
 
-        splits_dir = os.path.join(voc_root, "ImageSets", self._SPLITS_DIR)
+        splits_dir, image_dir, target_dir = self._voc_subfolders(voc_root)
         split_f = os.path.join(splits_dir, image_set.rstrip("\n") + ".txt")
         with open(os.path.join(split_f)) as f:
             file_names = [x.strip() for x in f.readlines()]
 
-        image_dir = os.path.join(voc_root, "JPEGImages")
         self.images = [os.path.join(image_dir, x + ".jpg") for x in file_names]
 
-        target_dir = os.path.join(voc_root, self._TARGET_DIR)
         self.targets = [os.path.join(target_dir, x + self._TARGET_FILE_EXT) for x in file_names]
 
         assert len(self.images) == len(self.targets)
 
+    def _voc_subfolders(self, voc_root) -> Tuple[str, str, str]:
+        """Returns the subfolders for the VOC dataset."""
+        splits_dir = os.path.join(voc_root, self._IMAGE_SET, self._SPLITS_DIR)
+        image_dir = os.path.join(voc_root, self._IMAGE_DIR)
+        target_dir = os.path.join(voc_root, self._TARGET_DIR)
+        return splits_dir, image_dir, target_dir
+
+    def _download(self, voc_root: str) -> None:
+        if self._check_exists(voc_root):
+            return
+        download_and_extract_archive(self.url, self.root, filename=self.filename, md5=self.md5)

Contributor:

I'm actually a little confused, as download_and_extract_archive() calls download_url(), which already does an existence check:

# check if file is already present locally
if check_integrity(fpath, md5):
    return

That is, my reading of the code suggests that the current code should already avoid re-downloading the files. If it does not, then we may have a bug where we're not passing the right arguments to the utility functions, and we should fix that instead of implementing a new kind of existence check.
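
The short-circuit is easy to observe with the utility directly (a sketch; the archive path is hypothetical, and passing md5=None reduces check_integrity to a plain existence test):

from torchvision.datasets.utils import check_integrity

# download_url() runs this same check before opening any connection: a hit on
# an intact local file means no new download is started.
archive = "./data/VOCtrainval_11-May-2012.tar"  # hypothetical local path
if check_integrity(archive, md5=None):
    print("archive already present; download_url() would return early")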

The current existence check does compute an MD5 sum over the whole downloaded file, and that might also be expensive, but I wouldn't expect it to take the 8 minutes the user reported in #9059:

def calculate_md5(fpath: Union[str, pathlib.Path], chunk_size: int = 1024 * 1024) -> str:
    # Setting the `usedforsecurity` flag does not change anything about the functionality, but indicates that we are
    # not using the MD5 checksum for cryptography. This enables its usage in restricted environments like FIPS. Without
    # it torchvision.datasets is unusable in these environments since we perform a MD5 check everywhere.
    md5 = hashlib.md5(usedforsecurity=False)
    with open(fpath, "rb") as f:
        while chunk := f.read(chunk_size):
            md5.update(chunk)
    return md5.hexdigest()
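
If the MD5 pass were the bottleneck, it is easy to time directly (a sketch; the archive path is hypothetical):

import time
from torchvision.datasets.utils import calculate_md5

archive = "./data/VOCtrainval_11-May-2012.tar"  # hypothetical local copy
start = time.monotonic()
digest = calculate_md5(archive)  # streams the file in 1 MiB chunks
print(f"md5 {digest} computed in {time.monotonic() - start:.1f}s")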

Contributor Author:

Yes!! I hadn't noticed the internal download mechanism, and your conclusion from the code reading is correct. After testing locally, I can confirm that the check_integrity function does return early before starting a new download session. My bad for being sloppy about re-validating the issue.

Perhaps we could close this PR for now and check if the issue reappears in the future?
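
For reference, the behavior can be re-validated end to end with something like this (a sketch; the "./data" root is hypothetical):

from torchvision.datasets import VOCSegmentation

# First call downloads and extracts the ~2 GB archive.
ds = VOCSegmentation("./data", year="2012", image_set="train", download=True)
# Second call: download_url() finds the intact archive via check_integrity()
# and returns early, so no new download is started.
ds = VOCSegmentation("./data", year="2012", image_set="train", download=True)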

Contributor:

@GdoongMathew, no worries, I also had assumed there must be a problem! I agree with closing this PR and seeing if the issue comes up again.


+    def _check_exists(self, voc_root: str) -> bool:
+        """Check if the dataset exists."""
+        return all(os.path.isdir(d) and len(os.listdir(d)) for d in self._voc_subfolders(voc_root))
 
     def __len__(self) -> int:
         return len(self.images)
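
For context, the new _check_exists amounts to the following for VOCSegmentation with year="2012" (a sketch; the "data" root is hypothetical and the folder names assume the standard VOCdevkit layout):

import os

# Mirrors _voc_subfolders() / _check_exists() for VOCSegmentation, year="2012":
# each subfolder must exist and be non-empty for the dataset to count as present.
voc_root = os.path.join("data", "VOCdevkit", "VOC2012")
subfolders = (
    os.path.join(voc_root, "ImageSets", "Segmentation"),  # splits_dir
    os.path.join(voc_root, "JPEGImages"),                  # image_dir
    os.path.join(voc_root, "SegmentationClass"),           # target_dir
)
print("dataset present:", all(os.path.isdir(d) and len(os.listdir(d)) for d in subfolders))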
