Skip to content

Conversation

@michael-kuhlmann
Copy link
Contributor

from_path allows for simple creation of a dataset from a directory.

Additionally, update the workflow configuration that has become outdated.

def _run_fast_scandir(root: Path, ext: List[str]):
subfolders, files = [], []

for f in os.scandir(root):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about keeping that function simpler?
A recursive implementation is simpler, when you use generator style.
Why do you use in and lower() for an ext check?

    def _run_fast_scandir(root: Path, ext: List[str]):
        for f in os.scandir(root):
            if f.is_dir():
                yield from _run_fast_scandir(f)
            if f.is_file():
                if any(f.name.endswith(e) for e in ext):
                    yield Path(f.path)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. I remember that I had a very special case where I wanted to match only part of the extension, but I can't reproduce that scenario anymore. I adopted your suggestion.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. If you need that again, we could allow callables for verification.

if parents is None:
return file_path.stem
example_id = file_path.stem
prefix = sep.join(file_path.parts[-(2+parents):-1])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't the option to select the folder, that are included in the example id be interpreted from the left?
e.g. the first example that I had in mind is the following:
ex1/meta.json
ex1/audio/ch1.wav
ex1/audio/ch2.wav

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the option be allowed to be a callable? To support all kind of mappings?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the option be allowed to be a callable? To support all kind of mappings?

Done. parents can now also be a callable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't the option to select the folder, that are included in the example id be interpreted from the left? e.g. the first example that I had in mind is the following: ex1/meta.json ex1/audio/ch1.wav ex1/audio/ch2.wav

I had something more homogeneous in mind, like:

speaker1/
  001.wav
  002.wav
speaker2/
  001.wav
  002.wav

Here, I want to add the speaker ID to resolve collisions, and a right-to-left approach seemed more natural to me. What behaviour would you expect in your example? Is it

ex1_meta
ex1_ch1
ex1_ch2

or

ex1_meta
ex1_audio_ch1
ex1_audio_ch2?

The second is doable with the current implementation, the first not (unless you use a callable for parents).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

speaker1/
  001.wav
  002.wav
speaker2/
  001.wav
  002.wav

001/
  speaker1.wav
  speaker2.wav
002/
  speaker1.wav
  speaker2.wav

I think both folder structures are common. In the past I often used the first, while recently I used mainly the second. In another conversation in the PR, I added code to support both.

for file in files:
example_id = _make_example_id(file)
examples[example_id]["example_id"] = example_id
examples[example_id][file.suffix.lstrip(".")] = file
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about, letting the _make_example_id to a split of the file path (excluding the root path)? In that way, there will be o collision. Otherwise, the multiple files may overwrite each other.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about, letting the _make_example_id do a split of the file path (excluding the root path)? In that way, there will be no collision. Otherwise, the multiple files may overwrite each other.

There were a few typos and my text was a bit too short, that made it difficult to get the idea of my comment.
Suffixes may not be unique. The sort makes it now deterministic, but
I think it is still undesired to silently ignore a file.
Suggestion: Add an assert.

Depending on the value of parent, we may want to have another key (e.g., source separation typically yields two wav files, so the suffix is not enough to distingush them):

example_id, key = _make_example_id(file_path: Path)
examples[example_id]["example_id"] = example_id
assert key not in examples[example_id], (key, example_id, examples[example_id])
examples[example_id][key] = file

How about slightly change the definition of parents to make it able to handle more cases?
I considered three cases:

  • Flat folder (parents == 0): suffix is the key, while stem is the example_id
  • folder per key (parents < ): The example_id is the stem, but the folder is the key (In your example it is the suffix, but the folder feels more natural to me. Or using both?)
  • A folder per example (parents > 0)

To have the code more natural, I had to change the sign of parents for your case.

from pathlib import Path


def _make_example_id(file_path: Path, parents, sep: str = "/") -> str:
    """

    parents:
        0: '{example_id}.{key}', all folders are considered as part of example id.
        >0: '{example_id}/{key}', abs(parents) is the number of folders to consider as part of example id.
        <0: '{key}/{example_id}', abs(parents) is the number of path parts to consider as part of example id.

    >>> def test(parents, paths):
    ...     for p in paths:
    ...         example_id, key = _make_example_id(p, parents)
    ...         print(f'ex[{example_id!r}][{key!r}] = root / {str(p)!r}')

    # Assuming, all files are in the 'root' directory
    # and the naming is '{example_id}.{key}', where key is the file extension.
    >>> test(0, [Path('ex1.wav'), Path('ex1.txt'), Path('ex2.wav'), Path('ex2.txt')])
    ex['ex1']['wav'] = root / 'ex1.wav'
    ex['ex1']['txt'] = root / 'ex1.txt'
    ex['ex2']['wav'] = root / 'ex2.wav'
    ex['ex2']['txt'] = root / 'ex2.txt'

    # Assuming an folder per example '{example_id}/{key}',
    # where key is the file name with extension.
    >>> test(1, [Path('ex1/audio/1.wav'), Path('ex1/audio/2.txt'), Path('ex1/meta.json')])
    ex['ex1']['audio/1.wav'] = root / 'ex1/audio/1.wav'
    ex['ex1']['audio/2.txt'] = root / 'ex1/audio/2.txt'
    ex['ex1']['meta.json'] = root / 'ex1/meta.json'

    # Assuming file stem is example id '{key}/{example_id}.{ext}',
    # where key is the file name with extension.
    >>> test(-1, [Path('audio/ex1.wav'), Path('meta/txt/ex1.txt')])
    ex['ex1']['audio'] = root / 'audio/ex1.wav'
    ex['ex1']['meta/txt'] = root / 'meta/txt/ex1.txt'

    """
    if parents is None:
        return file_path.stem
    if callable(parents):
        return parents(file_path, sep)
    
    if parents == 0:
        return sep.join(file_path.with_suffix("").parts[parents:]), file_path.suffix.lstrip('.')
    elif parents > 0:
        return sep.join(file_path.with_suffix("").parts[:parents]), sep.join(file_path.parts[parents:])
    elif parents < 0:
        return sep.join(file_path.with_suffix("").parts[parents:]), sep.join(file_path.parts[:parents])

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. I agree that it is undesirable to ignore files silently. I also like the new behaviour of parents. This should cover the most common cases. I adopted your suggestions (I moved _make_example_id to the outer scope so that the test cases are correctly executed).

_, files = _run_fast_scandir(Path(root), suffix)
files = map(Path, files)
examples = defaultdict(dict)
for file in files:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a sort?

The order of os.scandir depends on the physical location of the file on the HDD/SSD.
The order of the examples (and maybe even the content in case of collisions) should
be reproducible accross filesystems.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

parents: Optional[int] = None,
sep: str = "_",
) -> "DictDataset":
"""Create a new DictDataset from a directory path.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a small text for a motivation and a warning that this function should usually not be used? I lack currently an motivation, when this function is recommented. (Scanning directories with large number of files shouldn't e done on demand.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I often need to evaluate audio files that were generated by TTS systems. I usually store them in a single directory (sometimes with a nested structure), and this is a convenient way to quickly obtain an iterable over these files (e.g., here). I could add a warning that is raised in the beginning to be cautious with this function. Otherwise, I would delegate the responsibility to the user.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, ok.

How about adding something like the following to the doc string (Feel free to improve the text):

Note: This function is not intended to be used for frequently used large datasets,
since the indexing overhead can get significant
For one time small datasets, it is a convenient way to load them.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

small datasets, it is a convenient way to load them.
>>> import tempfile
>>> temp_dir = tempfile.TemporaryDirectory()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this leak a folder? Or does the destructor of temp_dir do some cleanup?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cleanup is not done automatically unless used as a context manager. I added a call to cleanup at the end of the doctest.

Copy link
Member

@boeddeker boeddeker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Last comment, how about changing from_path to from_folder or from_dir?
When I first read that name, I thought the argument would be a file.

@codecov-commenter
Copy link

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 94.28571% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 76.17%. Comparing base (332d6d5) to head (6c7e67a).
⚠️ Report is 17 commits behind head on master.

Files with missing lines Patch % Lines
lazy_dataset/core.py 94.28% 2 Missing ⚠️
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.
Additional details and impacted files
@@            Coverage Diff             @@
##           master      #67      +/-   ##
==========================================
- Coverage   76.63%   76.17%   -0.47%     
==========================================
  Files           5        5              
  Lines        1725     1792      +67     
==========================================
+ Hits         1322     1365      +43     
- Misses        403      427      +24     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@michael-kuhlmann
Copy link
Contributor Author

Last comment, how about changing from_path to from_folder or from_dir? When I first read that name, I thought the argument would be a file.

Changed it to from_dir.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants