Add from_path utility #67
base: master
Changes from 7 commits
a0cb523
4ccd8ce
df7c4e9
99d7c8f
a7d80a7
fafa88a
1cfade8
18dc877
baac3cf
ebc1737
64b01ae
edc6807
4b84e9c
be1b066
2d5455c
a7c4154
0ca7c2d
0c8f0a7
6c7e67a
@@ -232,6 +232,93 @@ def from_dataset(
            immutable_warranty=immutable_warranty, name=name)


    def from_path(
            root: Union[str, Path],
            suffix: Union[str, List[str]],
            immutable_warranty: str = 'pickle',
            name: str = None,
            parents: Optional[int] = None,
            sep: str = "_",
    ) -> "DictDataset":
        """Create a new DictDataset from a directory path.

        Scan and include all files in `root` that end with a suffix in `suffix`.
        New examples are created for each unique file stem. The example_id is
        derived from the file path.

        >>> import tempfile
        >>> temp_dir = tempfile.TemporaryDirectory()
Member
Does this leak a folder? Or does the destructor of temp_dir do some cleanup?
Contributor (Author)
The cleanup is not done automatically unless used as a context manager. I added a call to
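The context-manager form mentioned in this thread can be sketched with the standard library alone. The file names mirror the doctest; this is an illustration and not the code that ended up in the PR. When a `with` block does not fit, as in a doctest, `TemporaryDirectory.cleanup()` is the explicit alternative.

```python
import tempfile
from pathlib import Path

# Used as a context manager, TemporaryDirectory removes the directory
# and everything inside it when the block exits.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "test1.txt").touch()
    (root / "test1.wav").touch()
    names = sorted(p.name for p in root.iterdir())

# The directory no longer exists once the block has been left.
exists_after = root.exists()
```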
        >>> fp = Path(temp_dir.name) / "test1.txt"
        >>> fp.touch()
        >>> fp = Path(temp_dir.name) / "test1.wav"
        >>> fp.touch()
        >>> ds = from_path(temp_dir.name, suffix=".txt")
        >>> ds
        DictDataset(len=1)
        MapDataset(_pickle.loads)
        >>> ds[0] # doctest: +ELLIPSIS
        {'example_id': 'test1', 'txt': PosixPath('.../test1.txt')}

        >>> ds = from_path(temp_dir.name, suffix=[".txt", ".wav"])
        >>> ds
        DictDataset(len=1)
        MapDataset(_pickle.loads)
        >>> ds[0] # doctest: +ELLIPSIS
        {'example_id': 'test1', 'txt': PosixPath('.../test1.txt'), 'wav': PosixPath('.../test1.wav')}

        Args:
            root (Union[str, Path]): Root directory to scan for files.
            suffix (Union[str, List[str]]): List of file suffixes to scan for.
                Files with these suffixes will be added to the dataset.
            immutable_warranty (str, optional):
            name (str, optional):
            parents (Optional[int], optional): Level of parent folders to include in
                the example_id. If `None`, only the file stem is used. `parents=1`
                includes the immediate parent folder. Defaults to None.
            sep (str, optional): Separator to use for joining folder names.
                Defaults to "_".

        Returns:
            DictDataset: A dataset containing the scanned files.
        """
        from collections import defaultdict
        import os
        # https://stackoverflow.com/a/59803793/16085876
        def _run_fast_scandir(root: Path, ext: List[str]):
            subfolders, files = [], []

            for f in os.scandir(root):
Member
How about keeping that function simpler?

    def _run_fast_scandir(root: Path, ext: List[str]):
        for f in os.scandir(root):
            if f.is_dir():
                yield from _run_fast_scandir(f, ext)
            if f.is_file():
                if any(f.name.endswith(e) for e in ext):
                    yield Path(f.path)

Contributor (Author)
Done. I remember that I had a very special case where I wanted to match only part of the extension, but I can't reproduce that scenario anymore. I adopted your suggestion.
Member
Ok. If you need that again, we could allow callables for verification.
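The "callables for verification" idea from this thread might look roughly like the sketch below. The name `scan_files` and the `match` parameter are hypothetical and not part of the PR; the sketch only assumes that a matcher is either a list of suffixes or a function taking a file name.

```python
import os
import tempfile
from pathlib import Path
from typing import Callable, Iterator, List, Union

# Hypothetical type alias: a list of suffixes or a predicate on the file name.
Matcher = Union[List[str], Callable[[str], bool]]


def scan_files(root, match: Matcher) -> Iterator[Path]:
    """Recursively yield files under `root` whose name passes `match`."""
    if callable(match):
        accept = match
    else:
        suffixes = tuple(match)

        def accept(name: str) -> bool:
            # str.endswith accepts a tuple of candidate suffixes.
            return name.endswith(suffixes)

    for entry in os.scandir(root):
        if entry.is_dir():
            # Pass the resolved predicate down so it is built only once.
            yield from scan_files(entry.path, accept)
        elif entry.is_file() and accept(entry.name):
            yield Path(entry.path)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    (root / "a.txt").touch()
    (root / "sub" / "b.wav").touch()
    (root / "sub" / "c.log").touch()
    by_suffix = sorted(p.name for p in scan_files(root, [".txt", ".wav"]))
    by_callable = sorted(p.name for p in scan_files(root, lambda n: "c" in n))
```

A predicate would cover the partial-match case the author describes without changing how plain suffix lists behave.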
                if f.is_dir():
                    subfolders.append(f.path)
                if f.is_file():
                    if any(e in f.name.lower() for e in ext):
                        files.append(Path(f.path))

            for folder in list(subfolders):
                sf, f = _run_fast_scandir(folder, ext)
                subfolders.extend(sf)
                files.extend(f)
            return subfolders, files

        def _make_example_id(file_path: Path):
            if parents is None:
                return file_path.stem
            example_id = file_path.stem
            prefix = sep.join(file_path.parts[-(2+parents):-1])

            return sep.join((prefix, example_id))

        if isinstance(suffix, str):
            suffix = [suffix]
        _, files = _run_fast_scandir(Path(root), suffix)
        files = map(Path, files)
        examples = defaultdict(dict)
        for file in files:

            example_id = _make_example_id(file)
            examples[example_id]["example_id"] = example_id
            examples[example_id][file.suffix.lstrip(".")] = file

        return from_dict(examples, immutable_warranty, name)


    def concatenate(*datasets):
        """
        Create a new `Dataset` by concatenation of all passed datasets.

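To see what the slice in `_make_example_id` produces, the sketch below lifts it into a standalone function (the name `make_example_id` and the sample path are illustrative, not from the PR). Read this way, the diff as shown already includes the immediate parent folder at `parents=0` and two folders at `parents=1`, which seems to differ from the docstring's wording. This is only a reading of the code shown here, not of the merged version.

```python
from pathlib import PurePosixPath
from typing import Optional


def make_example_id(file_path, parents: Optional[int] = None, sep: str = "_") -> str:
    # Same slice as in the diff, lifted out of the closure for illustration.
    if parents is None:
        return file_path.stem
    prefix = sep.join(file_path.parts[-(2 + parents):-1])
    return sep.join((prefix, file_path.stem))


# parts == ('/', 'data', 'spk1', 'sess2', 'utt3.wav')
path = PurePosixPath("/data/spk1/sess2/utt3.wav")
ids = {p: make_example_id(path, parents=p) for p in (None, 0, 1)}
# ids[None] == 'utt3'
# ids[0]    == 'sess2_utt3'
# ids[1]    == 'spk1_sess2_utt3'
```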
Can you add a short text giving a motivation, and a warning that this function should usually not be used? I currently lack a motivation for when this function is recommended. (Scanning directories with a large number of files shouldn't be done on demand.)

I often need to evaluate audio files that were generated by TTS systems. I usually store them in a single directory (sometimes with a nested structure), and this is a convenient way to quickly obtain an iterable over these files (e.g., here). I could add a warning that is raised at the beginning to urge caution with this function. Otherwise, I would delegate the responsibility to the user.

Ah, ok.
How about adding something like the following to the docstring (feel free to improve the text):

    Note: This function is not intended to be used for frequently used large datasets,
    since the indexing overhead can get significant.
    For one-time small datasets, it is a convenient way to load them.

Done.
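The use case described in this thread, a small and possibly nested directory of generated audio files, can be sketched without the library itself. `group_by_stem` is a hypothetical helper that mirrors the grouping step in the diff; it matches on the exact suffix and assumes file stems are unique across subfolders. The resulting dictionary has the shape the diff passes to `from_dict`.

```python
import tempfile
from collections import defaultdict
from pathlib import Path


def group_by_stem(root, suffixes):
    """Group files under `root` into one example per file stem."""
    examples = defaultdict(dict)
    # rglob walks the whole tree, so the cost grows with the number of files.
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            example = examples[path.stem]
            example["example_id"] = path.stem
            example[path.suffix.lstrip(".")] = path
    return dict(examples)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "nested").mkdir()
    for name in ("utt1.wav", "utt1.txt", "nested/utt2.wav"):
        (root / name).touch()
    examples = group_by_stem(root, {".wav", ".txt"})
    keys = {k: sorted(v) for k, v in examples.items()}
```

Because every call rescans the tree, this fits the one-time, small-dataset scenario from the proposed note rather than repeated use on large collections.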