Add from_path utility #67

michael-kuhlmann · 2025-11-12T13:40:39Z

from_path allows for simple creation of a dataset from a directory.

Additionally, update the workflow configuration that has become outdated.

Create a lazy_dataset from a folder by globbing all files that end in the specified suffixes.

ubuntu-latest no longer supports py37

boeddeker · 2025-11-13T00:12:45Z

lazy_dataset/core.py

+    def _run_fast_scandir(root: Path, ext: List[str]):
+        subfolders, files = [], []
+
+        for f in os.scandir(root):


How about keeping that function simpler?
A recursive implementation is simpler, when you use generator style.
Why do you use in and lower() for an ext check?

def _run_fast_scandir(root: Path, ext: List[str]):
    for f in os.scandir(root):
        if f.is_dir():
            # Pass ext along so that the recursive call has both arguments
            yield from _run_fast_scandir(f, ext)
        if f.is_file():
            if any(f.name.endswith(e) for e in ext):
                yield Path(f.path)

Done. I remember that I had a very special case where I wanted to match only part of the extension, but I can't reproduce that scenario anymore. I adopted your suggestion.

Ok. If you need that again, we could allow callables for verification.
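The idea of allowing callables for the check could look roughly like the sketch below. It is only an illustration: the function name `scan_files` and the rule that `ext` may be either a list of suffixes or a callable on the file name are assumptions, not the code that was merged.

```python
import os
from pathlib import Path
from typing import Callable, Iterator, Sequence, Union


def scan_files(
    root: Union[str, os.PathLike],
    ext: Union[Sequence[str], Callable[[str], bool]],
) -> Iterator[Path]:
    """Recursively yield all files below root whose name passes the check.

    ext is either a sequence of suffixes or a callable that receives the
    file name and returns True to keep the file (hypothetical extension).
    """
    if callable(ext):
        accept = ext
    else:
        suffixes = tuple(ext)

        def accept(name: str) -> bool:
            # str.endswith accepts a tuple of candidate suffixes
            return name.endswith(suffixes)

    for entry in os.scandir(root):
        if entry.is_dir():
            # Hand the already-built predicate down to the subfolder
            yield from scan_files(entry.path, accept)
        elif entry.is_file() and accept(entry.name):
            yield Path(entry.path)
```

With this shape, a partial-extension match like the one mentioned above would be a one-line lambda instead of special handling inside the scanner.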

boeddeker · 2025-11-13T00:19:45Z

lazy_dataset/core.py

+        if parents is None:
+            return file_path.stem
+        example_id = file_path.stem
+        prefix = sep.join(file_path.parts[-(2+parents):-1])


Shouldn't the option that selects which folders are included in the example ID be interpreted from the left?
For example, the first layout I had in mind is the following:
ex1/meta.json
ex1/audio/ch1.wav
ex1/audio/ch2.wav

Should the option also accept a callable, to support all kinds of mappings?


Done. parents can now also be a callable.


I had something more homogeneous in mind, like:

speaker1/
    001.wav
    002.wav
speaker2/
    001.wav
    002.wav

Here, I want to add the speaker ID to resolve collisions, and a right-to-left approach seemed more natural to me. What behaviour would you expect in your example? Is it

ex1_meta
ex1_ch1
ex1_ch2

or

ex1_meta
ex1_audio_ch1
ex1_audio_ch2?

The second is doable with the current implementation, the first not (unless you use a callable for parents).

speaker1/
    001.wav
    002.wav
speaker2/
    001.wav
    002.wav

001/
    speaker1.wav
    speaker2.wav
002/
    speaker1.wav
    speaker2.wav

I think both folder structures are common. In the past I often used the first, while recently I used mainly the second. In another conversation in the PR, I added code to support both.
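The two readings discussed in this thread can be compared with a small sketch. Both helpers (`id_from_right`, `id_from_left`) are hypothetical names used for illustration only; the second one stands in for the kind of callable that could be passed as `parents`, and it assumes every file sits inside at least one folder.

```python
from pathlib import PurePosixPath


def id_from_right(path: PurePosixPath, n_parents: int, sep: str = "_") -> str:
    """Join the last n_parents folder names with the file stem."""
    folders = path.parts[:-1]
    # Clamp the start index so that asking for too many parents keeps all folders
    kept = folders[max(len(folders) - n_parents, 0):]
    return sep.join((*kept, path.stem))


def id_from_left(path: PurePosixPath, sep: str = "_") -> str:
    """Join the top-level folder with the stem, skipping folders in between."""
    return sep.join((path.parts[0], path.stem))


for name in ["ex1/meta.json", "ex1/audio/ch1.wav", "ex1/audio/ch2.wav"]:
    p = PurePosixPath(name)
    print(id_from_right(p, 2), id_from_left(p))
```

Counting from the right produces the `ex1_audio_ch1` style of IDs, while the left-anchored callable produces `ex1_ch1`, which is why the second form needs a callable under a purely right-to-left integer option.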

boeddeker · 2025-11-13T00:25:33Z

lazy_dataset/core.py

+    for file in files:
+        example_id = _make_example_id(file)
+        examples[example_id]["example_id"] = example_id
+        examples[example_id][file.suffix.lstrip(".")] = file


How about, letting the _make_example_id to a split of the file path (excluding the root path)? In that way, there will be o collision. Otherwise, the multiple files may overwrite each other.

How about, letting the _make_example_id do a split of the file path (excluding the root path)? In that way, there will be no collision. Otherwise, the multiple files may overwrite each other.

My comment had a few typos and was too short, which made the idea hard to follow.
Suffixes may not be unique. The sort now makes the result deterministic, but
I think it is still undesirable to silently ignore a file.
Suggestion: Add an assert.

Depending on the value of parents, we may want to use a different key (e.g., source separation typically yields two wav files, so the suffix is not enough to distinguish them):

example_id, key = _make_example_id(file)
examples[example_id]["example_id"] = example_id
assert key not in examples[example_id], (key, example_id, examples[example_id])
examples[example_id][key] = file

How about slightly changing the definition of parents so that it can handle more cases?
I considered three cases:

Flat folder (parents == 0): the suffix is the key and the stem is the example_id.

Folder per key (parents < 0): the example_id is the stem, but the folder is the key. (In your example the key is the suffix, but the folder feels more natural to me. Or use both?)

Folder per example (parents > 0)

To make the code more natural, I had to flip the sign of parents for your case.

from pathlib import Path


def _make_example_id(file_path: Path, parents, sep: str = "/") -> str:
    """
    parents:
        0: '{example_id}.{key}', all folders are considered as part of example id.
        >0: '{example_id}/{key}', abs(parents) is the number of folders to consider as part of example id.
        <0: '{key}/{example_id}', abs(parents) is the number of path parts to consider as part of example id.

    >>> def test(parents, paths):
    ...     for p in paths:
    ...         example_id, key = _make_example_id(p, parents)
    ...         print(f'ex[{example_id!r}][{key!r}] = root / {str(p)!r}')

    # Assuming all files are in the 'root' directory
    # and the naming is '{example_id}.{key}', where key is the file extension.
    >>> test(0, [Path('ex1.wav'), Path('ex1.txt'), Path('ex2.wav'), Path('ex2.txt')])
    ex['ex1']['wav'] = root / 'ex1.wav'
    ex['ex1']['txt'] = root / 'ex1.txt'
    ex['ex2']['wav'] = root / 'ex2.wav'
    ex['ex2']['txt'] = root / 'ex2.txt'

    # Assuming a folder per example '{example_id}/{key}',
    # where key is the file name with extension.
    >>> test(1, [Path('ex1/audio/1.wav'), Path('ex1/audio/2.txt'), Path('ex1/meta.json')])
    ex['ex1']['audio/1.wav'] = root / 'ex1/audio/1.wav'
    ex['ex1']['audio/2.txt'] = root / 'ex1/audio/2.txt'
    ex['ex1']['meta.json'] = root / 'ex1/meta.json'

    # Assuming the file stem is the example id '{key}/{example_id}.{ext}',
    # where key is the folder path.
    >>> test(-1, [Path('audio/ex1.wav'), Path('meta/txt/ex1.txt')])
    ex['ex1']['audio'] = root / 'audio/ex1.wav'
    ex['ex1']['meta/txt'] = root / 'meta/txt/ex1.txt'
    """
    if parents is None:
        return file_path.stem
    if callable(parents):
        return parents(file_path, sep)
    if parents == 0:
        return sep.join(file_path.with_suffix("").parts[parents:]), file_path.suffix.lstrip('.')
    elif parents > 0:
        return sep.join(file_path.with_suffix("").parts[:parents]), sep.join(file_path.parts[parents:])
    elif parents < 0:
        return sep.join(file_path.with_suffix("").parts[parents:]), sep.join(file_path.parts[:parents])

I see. I agree that it is undesirable to ignore files silently. I also like the new behaviour of parents. This should cover the most common cases. I adopted your suggestions (I moved _make_example_id to the outer scope so that the test cases are correctly executed).
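The collision check agreed on in this thread can be sketched as follows. This is not the merged implementation: `split_flat` and `group_files` are hypothetical names, and the sketch only covers the flat layout where the stem is the example ID and the suffix is the key.

```python
from collections import defaultdict
from pathlib import PurePosixPath
from typing import Callable, Dict, Iterable, Tuple


def split_flat(path: PurePosixPath) -> Tuple[str, str]:
    """Flat layout: the stem is the example ID, the suffix is the key."""
    return path.stem, path.suffix.lstrip(".")


def group_files(
    paths: Iterable[PurePosixPath],
    split: Callable[[PurePosixPath], Tuple[str, str]] = split_flat,
) -> Dict[str, dict]:
    """Group files into examples and fail loudly if two files share a key."""
    examples = defaultdict(dict)
    for path in sorted(paths):
        example_id, key = split(path)
        example = examples[example_id]
        example["example_id"] = example_id
        # Refuse to overwrite an earlier file instead of dropping it silently
        assert key not in example, (key, example_id, example)
        example[key] = path
    return dict(examples)
```

Two files such as `a/ex1.wav` and `b/ex1.wav` map to the same ID and key under `split_flat`, so the assert fires rather than letting the second file replace the first.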

boeddeker · 2025-11-13T00:26:49Z

lazy_dataset/core.py

+    _, files = _run_fast_scandir(Path(root), suffix)
+    files = map(Path, files)
+    examples = defaultdict(dict)
+    for file in files:


Can you add a sort?

The order of os.scandir depends on the physical location of the file on the HDD/SSD.
The order of the examples (and maybe even the content in case of collisions) should
be reproducible across filesystems.
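A minimal sketch of the requested sort, assuming a hypothetical helper `sorted_files` that only looks at the files directly inside one folder:

```python
import os
from pathlib import Path
from typing import List


def sorted_files(root) -> List[Path]:
    """Return the regular files directly inside root in sorted order."""
    # os.scandir yields entries in arbitrary order, so sort before use
    return sorted(Path(entry.path) for entry in os.scandir(root) if entry.is_file())
```

Sorting the paths once, before any example is built, fixes both the order of the examples and which file is seen first if two files compete for the same slot.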

boeddeker · 2025-11-13T00:29:38Z

lazy_dataset/core.py

+    parents: Optional[int] = None,
+    sep: str = "_",
+) -> "DictDataset":
+    """Create a new DictDataset from a directory path.


Can you add a short motivation and a warning that this function should usually not be used? I currently lack a motivation for when this function is recommended. (Scanning directories with a large number of files shouldn't be done on demand.)

I often need to evaluate audio files that were generated by TTS systems. I usually store them in a single directory (sometimes with a nested structure), and this is a convenient way to quickly obtain an iterable over these files (e.g., here). I could add a warning that is raised in the beginning to be cautious with this function. Otherwise, I would delegate the responsibility to the user.

Ah, ok.

How about adding something like the following to the doc string (Feel free to improve the text):

Note: This function is not intended for large datasets that are used frequently,
since the indexing overhead can become significant.
For small, one-off datasets, it is a convenient way to load them.


boeddeker · 2025-11-19T00:26:52Z

lazy_dataset/core.py

+        small datasets, it is a convenient way to load them.
+
+    >>> import tempfile
+    >>> temp_dir = tempfile.TemporaryDirectory()


Does this leak a folder? Or does the destructor of temp_dir do some cleanup?

The cleanup is not done automatically unless used as a context manager. I added a call to cleanup at the end of the doctest.
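The two ways of handling the temporary folder can be compared in a short sketch. The file name `ex1.wav` is only a placeholder. As far as I know, CPython also tries to remove the directory when the `TemporaryDirectory` object is finalized and emits a `ResourceWarning` in that case, so an explicit cleanup is the more dependable choice in a doctest.

```python
import tempfile
from pathlib import Path

# Option 1: context manager, the directory is removed when the block exits
with tempfile.TemporaryDirectory() as tmp:
    managed = Path(tmp)
    (managed / "ex1.wav").touch()
managed_removed = not managed.exists()

# Option 2: keep the object around (as in a doctest) and clean up explicitly
temp_dir = tempfile.TemporaryDirectory()
manual = Path(temp_dir.name)
(manual / "ex1.wav").touch()
temp_dir.cleanup()
manual_removed = not manual.exists()
```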

boeddeker

Last comment, how about changing from_path to from_folder or from_dir?
When I first read that name, I thought the argument would be a file.

codecov-commenter · 2025-11-20T06:42:08Z

⚠️ Please install the Codecov GitHub app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 94.28571% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 76.17%. Comparing base (332d6d5) to head (6c7e67a).
⚠️ Report is 17 commits behind head on master.

Files with missing lines	Patch %	Lines
lazy_dataset/core.py	94.28%	2 Missing ⚠️
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files

@@            Coverage Diff             @@
##           master      #67      +/-   ##
==========================================
- Coverage   76.63%   76.17%   -0.47%     
==========================================
  Files           5        5              
  Lines        1725     1792      +67     
==========================================
+ Hits         1322     1365      +43     
- Misses        403      427      +24


michael-kuhlmann · 2025-11-20T07:01:26Z


Changed it to from_dir.

michael-kuhlmann added 7 commits September 12, 2025 10:52

Add from_path

a0cb523

Create a lazy_dataset from a folder by globbing all files that end in the specified suffixes.

Merge branch 'fgnt:master' into master

4ccd8ce

Include example_id

df7c4e9

Add docstring

99d7c8f

Merge branch 'fgnt:master' into master

a7d80a7

Run py37 tests on Ubuntu 22.04

fafa88a

ubuntu-latest no longer supports py37

Fix doctest

1cfade8

michael-kuhlmann requested review from TCord and alexanderwerning November 12, 2025 13:43

boeddeker reviewed Nov 13, 2025


michael-kuhlmann added 4 commits November 14, 2025 14:22

Simplify _run_fast_scandir

18dc877

Use a recursive generator style

Iterate over sorted files for reproducible results across systems

baac3cf

Allow parents to be a Callable

ebc1737

Allows for more customizable example IDs

Create example IDs from file paths relative to root

64b01ae

Avoids unintended collisions of example IDs

michael-kuhlmann requested a review from boeddeker November 14, 2025 13:56

michael-kuhlmann added 5 commits November 17, 2025 19:03

Better support for different folder structures

edc6807

Raise an AssertionError when keys are colliding

4b84e9c

Add usage note

be1b066

Add doctest to test for key collision

2d5455c

Add a warning if key is an empty string

a7c4154

boeddeker reviewed Nov 19, 2025


boeddeker approved these changes Nov 19, 2025


michael-kuhlmann added 3 commits November 20, 2025 07:33

Clean up temporary doctest files

0ca7c2d

Rename from_path to from_dir

0c8f0a7

Fix doctest after renaming

6c7e67a
