8 changes: 6 additions & 2 deletions .github/workflows/run_python_tests.yml
@@ -12,10 +12,14 @@ on:
jobs:
build:

runs-on: ubuntu-latest
runs-on: ${{ matrix.os }}
strategy:
matrix:
python-version: ["3.7", "3.8", "3.9", "3.10", "3.11", "3.12"]
python-version: ["3.8", "3.9", "3.10", "3.11", "3.12"]
os: [ubuntu-latest]
include:
- os: ubuntu-22.04
python-version: "3.7"

steps:
- uses: actions/checkout@v2
87 changes: 87 additions & 0 deletions lazy_dataset/core.py
@@ -232,6 +232,93 @@ def from_dataset(
immutable_warranty=immutable_warranty, name=name)


def from_path(
root: Union[str, Path],
suffix: Union[str, List[str]],
immutable_warranty: str = 'pickle',
name: str = None,
parents: Optional[int] = None,
sep: str = "_",
) -> "DictDataset":
"""Create a new DictDataset from a directory path.
Member:

Can you add a short text with a motivation and a warning that this function should usually not be used? I currently lack a motivation for when this function is recommended. (Scanning directories with a large number of files shouldn't be done on demand.)

Contributor Author:

I often need to evaluate audio files that were generated by TTS systems. I usually store them in a single directory (sometimes with a nested structure), and this is a convenient way to quickly obtain an iterable over these files (e.g., here). I could add a warning at the beginning that advises caution when using this function. Otherwise, I would delegate the responsibility to the user.
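For context, a minimal usage sketch of that workflow (the directory layout below is hypothetical; from_path is imported from lazy_dataset.core, where this PR defines it):

    from lazy_dataset.core import from_path

    # Hypothetical flat directory of TTS outputs: /data/tts_eval/utt_0001.wav, ...
    ds = from_path("/data/tts_eval", suffix=".wav")
    for example in ds:
        # Each example maps the stripped suffix to the file path.
        print(example["example_id"], example["wav"])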

Member:

Ah, ok.

How about adding something like the following to the docstring (feel free to improve the text):

Note: This function is not intended to be used for frequently used large datasets,
since the indexing overhead can get significant.
For one-time, small datasets, it is a convenient way to load them.

Contributor Author:

Done.


Scan and include all files in `root` that end with a suffix in `suffix`.
New examples are created for each unique file stem. The example_id is
derived from the file path.

>>> import tempfile
>>> temp_dir = tempfile.TemporaryDirectory()
Member:

Does this leak a folder? Or does the destructor of temp_dir do some cleanup?

Contributor Author:

The cleanup is not done automatically unless used as a context manager. I added a call to cleanup at the end of the doctest.
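For reference, the two cleanup styles look like this (standard-library behaviour, not specific to this PR):

    import tempfile
    from pathlib import Path

    # Explicit cleanup, as now done at the end of the doctest:
    temp_dir = tempfile.TemporaryDirectory()
    (Path(temp_dir.name) / "example.txt").touch()
    temp_dir.cleanup()  # removes the directory and its contents

    # Context manager: cleanup happens automatically on exit.
    with tempfile.TemporaryDirectory() as name:
        (Path(name) / "example.txt").touch()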

>>> fp = Path(temp_dir.name) / "test1.txt"
>>> fp.touch()
>>> fp = Path(temp_dir.name) / "test1.wav"
>>> fp.touch()
>>> ds = from_path(temp_dir.name, suffix=".txt")
>>> ds
DictDataset(len=1)
MapDataset(_pickle.loads)
>>> ds[0] # doctest: +ELLIPSIS
{'example_id': 'test1', 'txt': PosixPath('.../test1.txt')}

>>> ds = from_path(temp_dir.name, suffix=[".txt", ".wav"])
>>> ds
DictDataset(len=1)
MapDataset(_pickle.loads)
>>> ds[0] # doctest: +ELLIPSIS
{'example_id': 'test1', 'txt': PosixPath('.../test1.txt'), 'wav': PosixPath('.../test1.wav')}

Args:
root (Union[str, Path]): Root directory to scan for files.
suffix (Union[str, List[str]]): List of file suffixes to scan for.
Files with these suffixes will be added to the dataset.
immutable_warranty (str, optional):
name (str, optional):
parents (Optional[int], optional): Level of parent folders to include in
the example_id. If `None`, only the file stem is used. `parents=1`
includes the immediate parent folder. Defaults to None.
sep (str, optional): Separator to use for joining folder names.
Defaults to "_".

Returns:
DictDataset: A dataset containing the scanned files.
"""
from collections import defaultdict
import os
# https://stackoverflow.com/a/59803793/16085876
def _run_fast_scandir(root: Path, ext: List[str]):
subfolders, files = [], []

for f in os.scandir(root):
Member:

How about keeping that function simpler?
A recursive implementation is simpler when you use a generator style.
Why do you use in and lower() for the extension check?

    def _run_fast_scandir(root: Path, ext: List[str]):
        for f in os.scandir(root):
            if f.is_dir():
                yield from _run_fast_scandir(f, ext)
            if f.is_file():
                if any(f.name.endswith(e) for e in ext):
                    yield Path(f.path)

Contributor Author:

Done. I remember that I had a very special case where I wanted to match only part of the extension, but I can't reproduce that scenario anymore. I adopted your suggestion.

Member:

Ok. If you need that again, we could allow callables for verification.
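If that comes up again, such a callable-based check could look roughly like this (only a sketch, not part of this PR; the parameter name match is made up):

    import os
    from pathlib import Path
    from typing import Callable, Iterator

    def _scandir_matching(root, match: Callable[[os.DirEntry], bool]) -> Iterator[Path]:
        # Same recursive generator as above, but with a user-supplied predicate.
        for f in os.scandir(root):
            if f.is_dir():
                yield from _scandir_matching(f.path, match)
            elif f.is_file() and match(f):
                yield Path(f.path)

    # e.g. wav_files = list(_scandir_matching(root, lambda f: f.name.endswith(".wav")))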

if f.is_dir():
subfolders.append(f.path)
if f.is_file():
if any(e in f.name.lower() for e in ext):
files.append(Path(f.path))

for folder in list(subfolders):
sf, f = _run_fast_scandir(folder, ext)
subfolders.extend(sf)
files.extend(f)
return subfolders, files

def _make_example_id(file_path: Path):
if parents is None:
return file_path.stem
example_id = file_path.stem
prefix = sep.join(file_path.parts[-(2+parents):-1])
Member:

Shouldn't the option to select the folders that are included in the example_id be interpreted from the left?
E.g., the first example that I had in mind is the following:
ex1/meta.json
ex1/audio/ch1.wav
ex1/audio/ch2.wav

Member:

Should the option be allowed to be a callable? To support all kinds of mappings?

Contributor Author:

Should the option be allowed to be a callable? To support all kinds of mappings?

Done. parents can now also be a callable.

Contributor Author:

Shouldn't the option to select the folders that are included in the example_id be interpreted from the left? E.g., the first example that I had in mind is the following: ex1/meta.json ex1/audio/ch1.wav ex1/audio/ch2.wav

I had something more homogeneous in mind, like:

speaker1/
  001.wav
  002.wav
speaker2/
  001.wav
  002.wav

Here, I want to add the speaker ID to resolve collisions, and a right-to-left approach seemed more natural to me. What behaviour would you expect in your example? Is it

ex1_meta
ex1_ch1
ex1_ch2

or

ex1_meta
ex1_audio_ch1
ex1_audio_ch2?

The second is doable with the current implementation, the first is not (unless you use a callable for parents).

Member:

speaker1/
  001.wav
  002.wav
speaker2/
  001.wav
  002.wav

001/
  speaker1.wav
  speaker2.wav
002/
  speaker1.wav
  speaker2.wav

I think both folder structures are common. In the past I often used the first, while recently I used mainly the second. In another conversation in the PR, I added code to support both.

return sep.join((prefix, example_id))

if isinstance(suffix, str):
suffix = [suffix]
_, files = _run_fast_scandir(Path(root), suffix)
files = map(Path, files)
examples = defaultdict(dict)
for file in files:
Member:

Can you add a sort?

The order of os.scandir depends on the physical location of the files on the HDD/SSD.
The order of the examples (and maybe even the content in case of collisions) should
be reproducible across filesystems.

Contributor Author:

Done.
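For illustration, a deterministic iteration can be as simple as the following (a sketch; the actual change lives in the updated PR code):

    # Sort by the string form of the paths so the example order does not
    # depend on the filesystem-specific order returned by os.scandir.
    for file in sorted(files, key=str):
        example_id = _make_example_id(file)
        ...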

example_id = _make_example_id(file)
examples[example_id]["example_id"] = example_id
examples[example_id][file.suffix.lstrip(".")] = file
Member:

How about, letting the _make_example_id to a split of the file path (excluding the root path)? In that way, there will be o collision. Otherwise, the multiple files may overwrite each other.

Contributor Author:

Done.

Member:

How about letting _make_example_id do a split of the file path (excluding the root path)? In that way, there will be no collision. Otherwise, multiple files may overwrite each other.

There were a few typos and my text was a bit too short, which made it difficult to get the idea of my comment.
Suffixes may not be unique. The sort now makes it deterministic, but
I think it is still undesirable to silently ignore a file.
Suggestion: add an assert.

Depending on the value of parents, we may want to have another key (e.g., source separation typically yields two wav files, so the suffix is not enough to distinguish them):

example_id, key = _make_example_id(file)
examples[example_id]["example_id"] = example_id
assert key not in examples[example_id], (key, example_id, examples[example_id])
examples[example_id][key] = file

How about slightly changing the definition of parents to make it able to handle more cases?
I considered three cases:

  • Flat folder (parents == 0): the suffix is the key, while the stem is the example_id
  • A folder per key (parents < 0): the example_id is the stem, but the folder is the key (in your example it is the suffix, but the folder feels more natural to me. Or use both?)
  • A folder per example (parents > 0)

To make the code more natural, I had to change the sign of parents for your case.

from pathlib import Path


def _make_example_id(file_path: Path, parents, sep: str = "/"):
    """

    parents:
        0: '{example_id}.{key}', all folders are considered as part of example id.
        >0: '{example_id}/{key}', abs(parents) is the number of folders to consider as part of example id.
        <0: '{key}/{example_id}', abs(parents) is the number of path parts to consider as part of example id.

    >>> def test(parents, paths):
    ...     for p in paths:
    ...         example_id, key = _make_example_id(p, parents)
    ...         print(f'ex[{example_id!r}][{key!r}] = root / {str(p)!r}')

    # Assuming all files are in the 'root' directory
    # and the naming is '{example_id}.{key}', where key is the file extension.
    >>> test(0, [Path('ex1.wav'), Path('ex1.txt'), Path('ex2.wav'), Path('ex2.txt')])
    ex['ex1']['wav'] = root / 'ex1.wav'
    ex['ex1']['txt'] = root / 'ex1.txt'
    ex['ex2']['wav'] = root / 'ex2.wav'
    ex['ex2']['txt'] = root / 'ex2.txt'

    # Assuming a folder per example '{example_id}/{key}',
    # where key is the remaining relative path (file name, possibly with subfolders).
    >>> test(1, [Path('ex1/audio/1.wav'), Path('ex1/audio/2.txt'), Path('ex1/meta.json')])
    ex['ex1']['audio/1.wav'] = root / 'ex1/audio/1.wav'
    ex['ex1']['audio/2.txt'] = root / 'ex1/audio/2.txt'
    ex['ex1']['meta.json'] = root / 'ex1/meta.json'

    # Assuming the file stem is the example id '{key}/{example_id}.{ext}',
    # where key is the parent folder path.
    >>> test(-1, [Path('audio/ex1.wav'), Path('meta/txt/ex1.txt')])
    ex['ex1']['audio'] = root / 'audio/ex1.wav'
    ex['ex1']['meta/txt'] = root / 'meta/txt/ex1.txt'

    """
    if parents is None:
        return file_path.stem
    if callable(parents):
        return parents(file_path, sep)
    
    if parents == 0:
        return sep.join(file_path.with_suffix("").parts[parents:]), file_path.suffix.lstrip('.')
    elif parents > 0:
        return sep.join(file_path.with_suffix("").parts[:parents]), sep.join(file_path.parts[parents:])
    elif parents < 0:
        return sep.join(file_path.with_suffix("").parts[parents:]), sep.join(file_path.parts[:parents])

Contributor Author:

I see. I agree that it is undesirable to ignore files silently. I also like the new behaviour of parents. This should cover the most common cases. I adopted your suggestions (I moved _make_example_id to the outer scope so that the test cases are correctly executed).
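Applying the adopted definition to the earlier speaker-folder example (a quick sketch; it assumes the paths passed to _make_example_id are relative to root):

    # speaker1/001.wav with parents=1  -> ('speaker1', '001.wav')
    # speaker1/001.wav with parents=-1 -> ('001', 'speaker1')
    print(_make_example_id(Path("speaker1/001.wav"), parents=1))
    print(_make_example_id(Path("speaker1/001.wav"), parents=-1))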

return from_dict(examples, immutable_warranty, name)


def concatenate(*datasets):
"""
Create a new `Dataset` by concatenation of all passed datasets.