Add from_path utility #67
Conversation
Create a lazy_dataset from a folder by globbing all files that end in the specified suffixes.
ubuntu-latest no longer supports py37
def _run_fast_scandir(root: Path, ext: List[str]):
    subfolders, files = [], []

    for f in os.scandir(root):
How about keeping that function simpler?
A recursive implementation is simpler, when you use generator style.
Why do you use in and lower() for an ext check?
def _run_fast_scandir(root: Path, ext: List[str]):
    for f in os.scandir(root):
        if f.is_dir():
            yield from _run_fast_scandir(Path(f.path), ext)
        if f.is_file():
            if any(f.name.endswith(e) for e in ext):
                yield Path(f.path)
Done. I remember that I had a very special case where I wanted to match only part of the extension, but I can't reproduce that scenario anymore. I adopted your suggestion.
Ok. If you need that again, we could allow callables for verification.
lazy_dataset/core.py
Outdated
if parents is None:
    return file_path.stem
example_id = file_path.stem
prefix = sep.join(file_path.parts[-(2+parents):-1])
Shouldn't the option to select the folders that are included in the example ID be interpreted from the left?
e.g. the first example that I had in mind is the following:
ex1/meta.json
ex1/audio/ch1.wav
ex1/audio/ch2.wav
Should the option be allowed to be a callable, to support all kinds of mappings?
Should the option be allowed to be a callable, to support all kinds of mappings?
Done. parents can now also be a callable.
Shouldn't the option to select the folders that are included in the example ID be interpreted from the left? e.g. the first example that I had in mind is the following: ex1/meta.json ex1/audio/ch1.wav ex1/audio/ch2.wav
I had something more homogeneous in mind, like:
speaker1/
    001.wav
    002.wav
speaker2/
    001.wav
    002.wav
Here, I want to add the speaker ID to resolve collisions, and a right-to-left approach seemed more natural to me. What behaviour would you expect in your example? Is it
ex1_meta
ex1_ch1
ex1_ch2
or
ex1_meta
ex1_audio_ch1
ex1_audio_ch2?
The second is doable with the current implementation, the first not (unless you use a callable for parents).
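For completeness, the first variant could be expressed with such a callable. This is a sketch only: the function name is made up, and it assumes a `parents` callable receives `(file_path, sep)` as discussed in this PR.

```python
from pathlib import Path


def top_folder_and_stem(file_path: Path, sep: str = "_") -> str:
    # Hypothetical `parents` callable: join only the top-level folder with
    # the file stem, so 'ex1/meta.json' -> 'ex1_meta' and
    # 'ex1/audio/ch1.wav' -> 'ex1_ch1'.
    return sep.join((file_path.parts[0], file_path.stem))
```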
speaker1/
    001.wav
    002.wav
speaker2/
    001.wav
    002.wav

001/
    speaker1.wav
    speaker2.wav
002/
    speaker1.wav
    speaker2.wav
I think both folder structures are common. In the past I often used the first, while recently I used mainly the second. In another conversation in the PR, I added code to support both.
lazy_dataset/core.py
Outdated
for file in files:
    example_id = _make_example_id(file)
    examples[example_id]["example_id"] = example_id
    examples[example_id][file.suffix.lstrip(".")] = file
How about letting _make_example_id do a split of the file path (excluding the root path)? In that way, there will be no collision. Otherwise, multiple files may overwrite each other.
Done.
How about letting _make_example_id do a split of the file path (excluding the root path)? In that way, there will be no collision. Otherwise, multiple files may overwrite each other.
There were a few typos and my text was a bit too short, which made it difficult to get the idea of my comment.
Suffixes may not be unique. The sort now makes it deterministic, but I think it is still undesired to silently ignore a file.
Suggestion: Add an assert.
Depending on the value of parents, we may want to have another key (e.g., source separation typically yields two wav files, so the suffix is not enough to distinguish them):
example_id, key = _make_example_id(file)
examples[example_id]["example_id"] = example_id
assert key not in examples[example_id], (key, example_id, examples[example_id])
examples[example_id][key] = file

How about slightly changing the definition of parents to make it able to handle more cases?
I considered three cases:
- Flat folder (parents == 0): the suffix is the key, while the stem is the example_id
- A folder per key (parents < 0): the example_id is the stem, but the folder is the key (in your example it is the suffix, but the folder feels more natural to me; or using both?)
- A folder per example (parents > 0)
To make the code more natural, I had to change the sign of parents for your case.
from pathlib import Path


def _make_example_id(file_path: Path, parents, sep: str = "/"):
    """
    parents:
        0: '{example_id}.{key}', all folders are considered as part of example id.
        >0: '{example_id}/{key}', abs(parents) is the number of folders to consider as part of example id.
        <0: '{key}/{example_id}', abs(parents) is the number of path parts to consider as part of example id.

    >>> def test(parents, paths):
    ...     for p in paths:
    ...         example_id, key = _make_example_id(p, parents)
    ...         print(f'ex[{example_id!r}][{key!r}] = root / {str(p)!r}')

    # Assuming all files are in the 'root' directory
    # and the naming is '{example_id}.{key}', where key is the file extension.
    >>> test(0, [Path('ex1.wav'), Path('ex1.txt'), Path('ex2.wav'), Path('ex2.txt')])
    ex['ex1']['wav'] = root / 'ex1.wav'
    ex['ex1']['txt'] = root / 'ex1.txt'
    ex['ex2']['wav'] = root / 'ex2.wav'
    ex['ex2']['txt'] = root / 'ex2.txt'

    # Assuming a folder per example '{example_id}/{key}',
    # where key is the file name with extension.
    >>> test(1, [Path('ex1/audio/1.wav'), Path('ex1/audio/2.txt'), Path('ex1/meta.json')])
    ex['ex1']['audio/1.wav'] = root / 'ex1/audio/1.wav'
    ex['ex1']['audio/2.txt'] = root / 'ex1/audio/2.txt'
    ex['ex1']['meta.json'] = root / 'ex1/meta.json'

    # Assuming the file stem is the example id '{key}/{example_id}.{ext}',
    # where key is the folder path.
    >>> test(-1, [Path('audio/ex1.wav'), Path('meta/txt/ex1.txt')])
    ex['ex1']['audio'] = root / 'audio/ex1.wav'
    ex['ex1']['meta/txt'] = root / 'meta/txt/ex1.txt'
    """
    if parents is None:
        return file_path.stem
    if callable(parents):
        return parents(file_path, sep)
    if parents == 0:
        return sep.join(file_path.with_suffix("").parts[parents:]), file_path.suffix.lstrip('.')
    elif parents > 0:
        return sep.join(file_path.with_suffix("").parts[:parents]), sep.join(file_path.parts[parents:])
    elif parents < 0:
        return sep.join(file_path.with_suffix("").parts[parents:]), sep.join(file_path.parts[:parents])
I see. I agree that it is undesirable to ignore files silently. I also like the new behaviour of parents. This should cover the most common cases. I adopted your suggestions (I moved _make_example_id to the outer scope so that the test cases are correctly executed).
lazy_dataset/core.py
Outdated
_, files = _run_fast_scandir(Path(root), suffix)
files = map(Path, files)
examples = defaultdict(dict)
for file in files:
Can you add a sort? The order of os.scandir depends on the physical location of the file on the HDD/SSD. The order of the examples (and maybe even the content in case of collisions) should be reproducible across filesystems.
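A minimal sketch of what this asks for (the function name is illustrative): sort the collected paths before building the examples dict, so iteration order never depends on the filesystem.

```python
from pathlib import Path


def deterministic_order(paths):
    # os.scandir yields entries in an arbitrary, filesystem-dependent order;
    # sorting by the string form of each path fixes the order of examples
    # (and which file "wins" on a key collision) across machines.
    return sorted(paths, key=str)
```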
Done.
    parents: Optional[int] = None,
    sep: str = "_",
) -> "DictDataset":
    """Create a new DictDataset from a directory path.
Can you add a small text for motivation and a warning that this function should usually not be used? I currently lack a motivation for when this function is recommended. (Scanning directories with a large number of files shouldn't be done on demand.)
I often need to evaluate audio files that were generated by TTS systems. I usually store them in a single directory (sometimes with a nested structure), and this is a convenient way to quickly obtain an iterable over these files (e.g., here). I could add a warning that is raised in the beginning to be cautious with this function. Otherwise, I would delegate the responsibility to the user.
Ah, ok.
How about adding something like the following to the doc string (Feel free to improve the text):
Note: This function is not intended to be used for frequently used large datasets,
since the indexing overhead can get significant.
For one-time, small datasets, it is a convenient way to load them.
Done.
Use a recursive generator style
Allows for more customizable example IDs
Avoids unintended collisions of example IDs
small datasets, it is a convenient way to load them.
>>> import tempfile
>>> temp_dir = tempfile.TemporaryDirectory()
Does this leak a folder? Or does the destructor of temp_dir do some cleanup?
The cleanup is not done automatically unless used as a context manager. I added a call to cleanup at the end of the doctest.
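The two cleanup styles, side by side (a sketch of the stdlib behaviour, not the doctest itself):

```python
import tempfile
from pathlib import Path

# Explicit cleanup, as now done at the end of the doctest:
temp_dir = tempfile.TemporaryDirectory()
root = Path(temp_dir.name)
# ... create files, build the dataset ...
temp_dir.cleanup()  # removes the directory tree
assert not root.exists()

# Equivalent, with automatic cleanup via the context manager protocol:
with tempfile.TemporaryDirectory() as name:
    root = Path(name)
    # ... use root ...
assert not root.exists()  # removed on exiting the 'with' block
```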
boeddeker
left a comment
Last comment, how about changing from_path to from_folder or from_dir?
When I first read that name, I thought the argument would be a file.
Codecov Report

❌ Patch coverage is

Additional details and impacted files

@@            Coverage Diff             @@
##           master      #67      +/-   ##
==========================================
- Coverage   76.63%   76.17%   -0.47%
==========================================
  Files           5        5
  Lines        1725     1792      +67
==========================================
+ Hits         1322     1365      +43
- Misses        403      427      +24

☔ View full report in Codecov by Sentry.
Changed it to
from_path allows for simple creation of a dataset from a directory. Additionally, update the workflow configuration that has become outdated.