File patterns in read_storage: wildcard, globstar & braces #1309

dmpetrov · 2025-08-26T22:30:02Z

Closes #1283
Mostly AI-generated.

Result: users can now do this:

dc.read_storage("s3://mybkt/dir1/dir2/**/*.{png,jpg}")

instead of this:

( 
    dc.read_storage("s3://mybkt/dir1/dir2/")
    .filter(dc.C("file.path").glob("*.jpg") | dc.C("file.path").glob("*.png"))
)

Patterns also apply to directories, and the forms can be combined, e.g. dc.read_storage("s3://mybkt/**/{march,april,may}/**/*.{png,jpg}")

One side effect: dir1/* now includes only files directly in dir1 (like dir1/file.txt) and no longer includes nested files (like dir1/dir2/song.mp4), as the old wildcard did. This aligns better with the Unix way of expanding file patterns, and it would be hard to support both behaviors.
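To make the side effect concrete, here is a minimal sketch of Unix-like matching in which `*` stays within one path segment and `**` crosses directories. The `glob_to_regex` helper is hypothetical and written only for illustration; it is not the PR's implementation, which converts patterns for SQLite-compatible filtering.

```python
import re


def glob_to_regex(pattern: str) -> re.Pattern:
    """Hypothetical helper: translate a glob into a regex where `*` never crosses '/'."""
    out = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            out.append("(?:[^/]+/)*")  # zero or more whole directories
            i += 3
        elif pattern.startswith("**", i):
            out.append(".*")  # bare globstar: anything, including '/'
            i += 2
        elif pattern[i] == "*":
            out.append("[^/]*")  # single-level wildcard
            i += 1
        elif pattern[i] == "?":
            out.append("[^/]")  # exactly one non-separator character
            i += 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(out) + r"\Z")


paths = ["dir1/file.txt", "dir1/dir2/song.mp4", "dir1/dir2/deep/cover.png"]


def match(pattern: str) -> list:
    """Return the sample paths that match the given glob."""
    rx = glob_to_regex(pattern)
    return [p for p in paths if rx.match(p)]


print(match("dir1/*"))         # only the file directly in dir1
print(match("dir1/**/*"))      # everything under dir1
print(match("dir1/**/*.png"))  # nested PNG files only
```

Under these assumptions, `dir1/*` selects just `dir1/file.txt`, while `dir1/**/*` restores the old "everything below" behavior.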

Summary by Sourcery

Enable full glob pattern support in read_storage by adding utilities for expanding braces, detecting and splitting patterns, converting globstars to SQL-compatible filters, and integrating pattern-based filtering into the listing process.

New Features:

Support shell-like glob patterns in read_storage, including wildcards (*, ?), recursive globstars (**), and brace expansions

Enhancements:

Introduce storage_pattern module with URI pattern splitting, brace expansion, recursion detection, and SQLite-compatible pattern conversion
Extend read_storage to pre-expand braces, split URIs into base and pattern, and apply filtering via a new _apply_glob_filter helper

Tests:

Add comprehensive unit and functional tests for wildcard, globstar, brace expansion, question mark, mixed patterns, and edge cases
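The brace expansion mentioned in the summary can be sketched as follows. This is an illustrative, self-contained version (the function name `expand_braces` is an assumption, and the PR's `expand_brace_pattern` may differ in details such as error handling): it expands the first brace group and recurses, so several groups and nested groups are both handled.

```python
def expand_braces(pattern: str) -> list:
    """Illustrative sketch: expand `{a,b}` groups into a list of plain patterns."""
    start = pattern.find("{")
    if start == -1:
        return [pattern]

    # Find the brace that closes the first group, tracking nesting depth.
    depth = 0
    end = -1
    for i in range(start, len(pattern)):
        if pattern[i] == "{":
            depth += 1
        elif pattern[i] == "}":
            depth -= 1
            if depth == 0:
                end = i
                break
    if end == -1:
        return [pattern]  # unbalanced braces: treat the pattern literally

    # Split the group body on commas that are not inside a nested group.
    options, current, depth = [], "", 0
    for ch in pattern[start + 1:end]:
        if ch == "," and depth == 0:
            options.append(current)
            current = ""
            continue
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
        current += ch
    options.append(current)

    prefix, suffix = pattern[:start], pattern[end + 1:]
    results = []
    for option in options:
        # Recurse so later groups and nested groups are expanded as well.
        results.extend(expand_braces(prefix + option + suffix))
    return results


print(expand_braces("s3://mybkt/dir1/dir2/**/*.{png,jpg}"))
print(expand_braces("{march,april}/*.{png,jpg}"))
```

Each expanded string is an ordinary glob with no braces left, which is why the expansion can run before any listing or filtering step.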

sourcery-ai · 2025-08-26T22:30:08Z

Reviewer's Guide

This pull request enhances read_storage to support glob patterns (wildcard, globstar, question mark) and brace expansion by introducing URI parsing and filtering utilities, and updating the listing logic to split URIs into base paths and patterns, expand braces, and apply post-listing glob filters within DataChain pipelines.
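As a rough illustration of the "split URIs into base paths and patterns" step, the sketch below cuts a URI at the first path segment that contains a glob character. The name `split_uri_pattern_sketch` and the exact splitting rule are assumptions made for this example; only the `("s3://bucket/dir", None)` no-pattern case and the set of glob characters are taken from the diff context quoted later in this thread.

```python
from typing import Optional

# Glob characters checked in the diff context quoted in this review.
GLOB_CHARS = set("*?[{}")


def split_uri_pattern_sketch(uri: str) -> tuple:
    """Hypothetical sketch: return (base_uri, glob_pattern or None)."""
    if not any(ch in GLOB_CHARS for ch in uri):
        return uri, None

    scheme, sep, rest = uri.partition("://")
    if not sep:  # local path without a scheme
        scheme, rest = "", uri

    segments = rest.split("/")
    # Index of the first segment that contains a glob character.
    first = next(i for i, seg in enumerate(segments) if GLOB_CHARS & set(seg))
    base = "/".join(segments[:first])
    pattern: Optional[str] = "/".join(segments[first:])
    return scheme + sep + base, pattern


print(split_uri_pattern_sketch("s3://mybkt/dir1/dir2/**/*.png"))
print(split_uri_pattern_sketch("s3://bucket/dir"))
```

The base URI is what gets listed; the pattern is what the later filter step applies to the listed paths.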

Sequence diagram for read_storage with glob and brace pattern expansion

sequenceDiagram
    participant User
    participant Storage
    participant StoragePatternUtils
    participant DataChain

    User->>Storage: read_storage(uri)
    Storage->>StoragePatternUtils: expand_uri_braces(uri)
    StoragePatternUtils-->>Storage: expanded_uris
    loop for each expanded_uri
        Storage->>StoragePatternUtils: split_uri_pattern(expanded_uri)
        StoragePatternUtils-->>Storage: base_uri, glob_pattern
        Storage->>DataChain: ls(base_uri)
        alt glob_pattern exists
            Storage->>StoragePatternUtils: should_use_recursion(glob_pattern, recursive)
            StoragePatternUtils-->>Storage: use_recursive
            Storage->>StoragePatternUtils: convert_globstar_to_sqlite(glob_pattern)
            StoragePatternUtils-->>Storage: sqlite_pattern
            Storage->>DataChain: filter(sqlite_pattern)
        else no pattern
            Storage->>DataChain: ls(base_uri)
        end
    end
    Storage->>User: return storage_chain

Class diagram for new and updated storage pattern utilities

classDiagram
    class StoragePatternUtils {
        +split_uri_pattern(uri: str) tuple[str, str|None]
        +should_use_recursion(pattern: str, user_recursive: bool) bool
        +expand_brace_pattern(pattern: str) list[str]
        +expand_uri_braces(uri: str) list[str]
        +convert_globstar_to_sqlite(filter_pattern: str) str
    }

    class DataChain {
        +filter()
        +union()
    }

    class Storage {
        +read_storage(uri, ...)
        +_apply_glob_filter(dc, patterns, list_path, use_recursive, column)
    }

    StoragePatternUtils <.. Storage : uses
    Storage o-- DataChain : pipeline

Class diagram for _apply_glob_filter function

classDiagram
    class Storage {
        +_apply_glob_filter(dc: DataChain, patterns: list[str], list_path: str, use_recursive: bool, column: str): DataChain
    }
    class DataChain {
        +filter()
        +ls()
    }
    Storage o-- DataChain : applies filter

Class diagram for brace and glob pattern expansion functions

classDiagram
    class StoragePatternUtils {
        +expand_brace_pattern(pattern: str): list[str]
        +expand_uri_braces(uri: str): list[str]
        +_expand_single_braces(pattern: str): list[str]
    }
    StoragePatternUtils <.. StoragePatternUtils : uses internally

File-Level Changes

Change	Details	Files
Add pattern matching utilities for URI parsing and expansion	Implemented split_uri_pattern for extracting base URI and glob patterns; created expand_brace_pattern with recursive brace expansion support; added should_use_recursion to decide listing recursion based on patterns; developed convert_globstar_to_sqlite for SQLite-compatible globstar handling	`src/datachain/lib/dc/storage_pattern.py` `tests/unit/lib/test_storage_pattern.py`
Extend read_storage to support glob and brace patterns	Expanded URIs with braces via expand_uri_braces before listing; used split_uri_pattern to separate base paths and glob patterns; introduced _apply_glob_filter utility to list then apply SQLite-compatible glob filters; adjusted read_storage to branch between normal listing and pattern-filtered chains and respect recursion settings	`src/datachain/lib/dc/storage.py`
Add comprehensive tests for pattern handling in read_storage and utilities	Added unit tests for split_uri_pattern edge cases and expand_brace_pattern; added functional tests for read_storage covering wildcard, globstar, question mark, brace expansion, mixed patterns, and no-pattern scenarios; updated existing integration tests in test_datachain, test_datachain_merge, test_datasets to remove explicit '*' pattern suffixes and align with new behavior	`tests/func/test_datachain.py` `tests/func/test_datachain_merge.py` `tests/func/test_datasets.py` `tests/func/test_storage_pattern.py`

Assessment against linked issues

Issue	Objective	Addressed
#1283	Enable filename filtering in read_storage() via wildcards in the URI, such as '*.mp3' and '?'.	✅
#1283	Enable recursive globstar-based filtering in read_storage() via patterns like '**/*.mp3'.	✅
#1283	Enable brace expansion in read_storage() URIs, such as '**/*.{mp3,wav}', to match multiple extensions.	✅

Possibly linked issues

Filename filter shortcut in read_storage() #1283: The PR adds glob, globstar, question mark, and brace expansion support to read_storage URIs, directly implementing the requested file filtering shortcuts.

sourcery-ai

Hey there - I've reviewed your changes and they look great!

Prompt for AI Agents

Please address the comments from this code review:
## Individual Comments

### Comment 1
<location> `src/datachain/lib/dc/storage.py:310` </location>
<code_context>
+        # If a glob pattern was detected, use it for filtering
</code_context>

<issue_to_address>
Filtering is applied after listing all files, which may be inefficient for large datasets.

Listing all files before filtering can cause excessive data transfer and slow performance on remote storage. If supported, pass the glob pattern directly to the listing function to filter files at the source.

Suggested implementation:

```python
        # If a glob pattern was detected, pass it to get_listing for source-side filtering
        if glob_pattern:
            # Handle brace expansion patterns
            patterns = expand_brace_pattern(glob_pattern)
            # If only one pattern, pass it directly
            if len(patterns) == 1:
                list_ds_name, list_uri, list_path, list_ds_exists = get_listing(
                    list_uri_to_use, session, update=update, glob_pattern=patterns[0]
                )
            else:
                # For multiple patterns, aggregate results from multiple listings
                all_listings = []
                for pat in patterns:
                    ds_name, uri, path, ds_exists = get_listing(
                        list_uri_to_use, session, update=update, glob_pattern=pat
                    )
                    if path:
                        all_listings.extend(path)
                list_ds_name, list_uri, list_path, list_ds_exists = (
                    list_ds_name, list_uri, all_listings, list_ds_exists
                )
        else:
            list_ds_name, list_uri, list_path, list_ds_exists = get_listing(
                list_uri_to_use, session, update=update
            )

        # list_ds_name is None if object is a file, we don't want to use cache
                lambda ds_name=list_ds_name, lst_uri=list_uri: lst_fn(ds_name, lst_uri)
            )

        # Filtering is now done at the source if glob_pattern is provided
        # If further filtering is needed (e.g., for complex patterns not supported by the source), apply here
        # Otherwise, use the original list_path from get_listing
        from datachain.query.schema import Column
        chain = dc

```

- Ensure that the `get_listing` function supports a `glob_pattern` argument and applies filtering at the source. You may need to update its implementation.
- Remove or refactor any redundant post-listing filtering logic that is now handled by the listing function.
- If the listing backend does not support glob patterns, fallback to post-listing filtering as before.
</issue_to_address>

### Comment 2
<location> `src/datachain/lib/dc/storage.py:329` </location>
<code_context>
+                filter_expr = None
+                for pattern in patterns:
+                    pattern_filter = Column(f"{column}.path").glob(pattern)
+                    filter_expr = pattern_filter if filter_expr is None else filter_expr | pattern_filter
+                chain = chain.filter(filter_expr)
+            chains.append(chain)
</code_context>

<issue_to_address>
Operator precedence in filter expression construction may be ambiguous.

Verify that the filter objects handle the bitwise OR operator as expected, and use parentheses if needed to clarify precedence.
</issue_to_address>

<suggested_fix>
<<<<<<< SEARCH
                for pattern in patterns:
                    pattern_filter = Column(f"{column}.path").glob(pattern)
                    filter_expr = pattern_filter if filter_expr is None else filter_expr | pattern_filter
=======
                for pattern in patterns:
                    pattern_filter = Column(f"{column}.path").glob(pattern)
                    filter_expr = pattern_filter if filter_expr is None else (filter_expr | pattern_filter)
>>>>>>> REPLACE

</suggested_fix>

### Comment 3
<location> `tests/unit/lib/test_read_storage_glob.py:209` </location>
<code_context>
+    def test_multiple_patterns(self, mock_read_dataset, mock_ls, mock_get_listing, mock_session, mock_listing):
</code_context>

<issue_to_address>
Test for empty URI list and non-string URI types.

Add tests for an empty URI list and for cases where URIs include os.PathLike objects or a mix of string and PathLike, to verify read_storage handles these inputs correctly.
</issue_to_address>

### Comment 4
<location> `tests/unit/lib/test_read_storage_glob.py:109` </location>
<code_context>
+    def test_wildcard_pattern(self, mock_read_dataset, mock_ls, mock_get_listing, mock_session, mock_listing):
</code_context>

<issue_to_address>
Consider verifying the actual filter arguments for correctness.

Please add an assertion to check that the filter was called with the correct glob pattern argument.

Suggested implementation:

```python
    def test_wildcard_pattern(self, mock_read_dataset, mock_ls, mock_get_listing, mock_session, mock_listing):
        """Test that wildcard patterns are automatically filtered"""
        tmp_dir, files = mock_listing

        # Setup mocks
        mock_get_listing.return_value = ("test_dataset", str(tmp_dir), "audio", True)
        mock_chain = MagicMock()
        mock_query = MagicMock()
        mock_chain._query = mock_query
        mock_chain.signals_schema = MagicMock()
        mock_chain.signals_schema.mutate = MagicMock(return_value=mock_chain.signals_schema)

        # Call the function under test (assuming it's called here, e.g. read_storage)
        # Example: read_storage("test_dataset/*.wav", ...)

        # Assert that the filter was called with the correct glob pattern
        expected_pattern = "test_dataset/*.wav"
        mock_ls.assert_any_call(expected_pattern)

```

- If the actual glob pattern or the function under test differs, adjust `expected_pattern` and the assertion accordingly.
- If the filter is applied via a different mock (not `mock_ls`), replace `mock_ls` with the correct mock object.
- Ensure the function under test is actually called in the test (e.g., `read_storage(...)`), so the assertion is meaningful.
</issue_to_address>

sourcery-ai · 2025-08-26T22:31:03Z

src/datachain/lib/dc/storage.py

+        # If a glob pattern was detected, use it for filtering
+        # Otherwise, use the original list_path from get_listing
+        if glob_pattern:
+            # Handle brace expansion patterns
+            patterns = expand_brace_pattern(glob_pattern)
+
+            # Apply glob filter(s)
+            from datachain.query.schema import Column
+            chain = dc
+            if len(patterns) == 1:


suggestion (performance): Filtering is applied after listing all files, which may be inefficient for large datasets.

Listing all files before filtering can cause excessive data transfer and slow performance on remote storage. If supported, pass the glob pattern directly to the listing function to filter files at the source.

Suggested implementation:

        # If a glob pattern was detected, pass it to get_listing for source-side filtering
        if glob_pattern:
            # Handle brace expansion patterns
            patterns = expand_brace_pattern(glob_pattern)
            # If only one pattern, pass it directly
            if len(patterns) == 1:
                list_ds_name, list_uri, list_path, list_ds_exists = get_listing(
                    list_uri_to_use, session, update=update, glob_pattern=patterns[0]
                )
            else:
                # For multiple patterns, aggregate results from multiple listings
                all_listings = []
                for pat in patterns:
                    ds_name, uri, path, ds_exists = get_listing(
                        list_uri_to_use, session, update=update, glob_pattern=pat
                    )
                    if path:
                        all_listings.extend(path)
                list_ds_name, list_uri, list_path, list_ds_exists = (
                    list_ds_name, list_uri, all_listings, list_ds_exists
                )
        else:
            list_ds_name, list_uri, list_path, list_ds_exists = get_listing(
                list_uri_to_use, session, update=update
            )

        # list_ds_name is None if object is a file, we don't want to use cache
                lambda ds_name=list_ds_name, lst_uri=list_uri: lst_fn(ds_name, lst_uri)
            )

        # Filtering is now done at the source if glob_pattern is provided
        # If further filtering is needed (e.g., for complex patterns not supported by the source), apply here
        # Otherwise, use the original list_path from get_listing
        from datachain.query.schema import Column
        chain = dc

Ensure that the get_listing function supports a glob_pattern argument and applies filtering at the source. You may need to update its implementation.

Remove or refactor any redundant post-listing filtering logic that is now handled by the listing function.

If the listing backend does not support glob patterns, fallback to post-listing filtering as before.

sourcery-ai · 2025-08-26T22:31:04Z

src/datachain/lib/dc/storage.py

+                for pattern in patterns:
+                    pattern_filter = Column(f"{column}.path").glob(pattern)
+                    filter_expr = pattern_filter if filter_expr is None else filter_expr | pattern_filter


suggestion: Operator precedence in filter expression construction may be ambiguous.

Verify that the filter objects handle the bitwise OR operator as expected, and use parentheses if needed to clarify precedence.

Suggested change

-                for pattern in patterns:
-                    pattern_filter = Column(f"{column}.path").glob(pattern)
-                    filter_expr = pattern_filter if filter_expr is None else filter_expr | pattern_filter
+                for pattern in patterns:
+                    pattern_filter = Column(f"{column}.path").glob(pattern)
+                    filter_expr = pattern_filter if filter_expr is None else (filter_expr | pattern_filter)
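For context on the precedence question: in Python, a conditional expression binds more loosely than `|`, so the two spellings parse to the same tree and the parentheses are purely cosmetic. The short check below demonstrates this with the standard `ast` module; the `combine` helper is a hypothetical alternative that avoids the `None` sentinel, shown with plain sets standing in for filter objects.

```python
import ast
from functools import reduce
from operator import or_

# Both spellings of the accumulation step from the review comment.
unparenthesized = ast.parse("p if acc is None else acc | p", mode="eval")
parenthesized = ast.parse("p if acc is None else (acc | p)", mode="eval")

# ast.dump omits source positions by default, so equal dumps mean equal trees.
same_tree = ast.dump(unparenthesized) == ast.dump(parenthesized)
print(same_tree)


def combine(filters):
    """Hypothetical helper: OR together a non-empty sequence of filter-like objects."""
    return reduce(or_, filters)


# Sets support `|`, so they work as stand-ins for real filter expressions here.
combined = combine([{"*.png"}, {"*.jpg"}])
```

Note that `reduce` without an initial value raises `TypeError` on an empty sequence, so a real caller would still need to handle the no-pattern case separately.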

sourcery-ai · 2025-08-26T22:31:04Z

tests/unit/lib/test_read_storage_glob.py

+    def test_multiple_patterns(self, mock_read_dataset, mock_ls, mock_get_listing, mock_session, mock_listing):
+        """Test multiple URIs with different patterns"""
+        tmp_dir, files = mock_listing
+
+        # Setup mocks for multiple URIs
+        mock_get_listing.side_effect = [
+            ("test_dataset1", str(tmp_dir), "audio/*.mp3", True),
+            ("test_dataset2", str(tmp_dir), "docs/*.json", True),
+        ]
+


suggestion (testing): Test for empty URI list and non-string URI types.

Add tests for an empty URI list and for cases where URIs include os.PathLike objects or a mix of string and PathLike, to verify read_storage handles these inputs correctly.
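A minimal sketch of the input normalization such tests would exercise, assuming that a single `str`/`os.PathLike` or an iterable of them is accepted and that an empty list is rejected. Both the helper name `normalize_uris` and the choice to raise `ValueError` on empty input are assumptions for illustration; the actual behavior of read_storage is not stated in this thread.

```python
import os
from pathlib import PurePosixPath


def normalize_uris(uri) -> list:
    """Hypothetical helper: coerce str / PathLike / iterable input to a list of str."""
    if isinstance(uri, (str, os.PathLike)):
        items = [uri]
    else:
        items = list(uri)
    if not items:
        # Assumption: empty input is treated as an error in this sketch.
        raise ValueError("No URIs provided")
    # os.fspath returns str unchanged and calls __fspath__ on PathLike objects.
    return [os.fspath(item) for item in items]


print(normalize_uris("s3://mybkt/dir1/*.png"))
print(normalize_uris([PurePosixPath("data/images"), "s3://mybkt/**/*.jpg"]))
```

Normalizing first means the later pattern-splitting code only ever sees plain strings, whichever input type the caller used.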

sourcery-ai · 2025-08-26T22:31:04Z

tests/unit/lib/test_read_storage_glob.py

+    def test_wildcard_pattern(self, mock_read_dataset, mock_ls, mock_get_listing, mock_session, mock_listing):
+        """Test that wildcard patterns are automatically filtered"""
+        tmp_dir, files = mock_listing
+
+        # Setup mocks
+        mock_get_listing.return_value = ("test_dataset", str(tmp_dir), "audio", True)
+        mock_chain = MagicMock()
+        mock_query = MagicMock()
+        mock_chain._query = mock_query
+        mock_chain.signals_schema = MagicMock()


suggestion (testing): Consider verifying the actual filter arguments for correctness.

Please add an assertion to check that the filter was called with the correct glob pattern argument.

Suggested implementation:

    def test_wildcard_pattern(self, mock_read_dataset, mock_ls, mock_get_listing, mock_session, mock_listing):
        """Test that wildcard patterns are automatically filtered"""
        tmp_dir, files = mock_listing

        # Setup mocks
        mock_get_listing.return_value = ("test_dataset", str(tmp_dir), "audio", True)
        mock_chain = MagicMock()
        mock_query = MagicMock()
        mock_chain._query = mock_query
        mock_chain.signals_schema = MagicMock()
        mock_chain.signals_schema.mutate = MagicMock(return_value=mock_chain.signals_schema)

        # Call the function under test (assuming it's called here, e.g. read_storage)
        # Example: read_storage("test_dataset/*.wav", ...)

        # Assert that the filter was called with the correct glob pattern
        expected_pattern = "test_dataset/*.wav"
        mock_ls.assert_any_call(expected_pattern)

If the actual glob pattern or the function under test differs, adjust expected_pattern and the assertion accordingly.

If the filter is applied via a different mock (not mock_ls), replace mock_ls with the correct mock object.

Ensure the function under test is actually called in the test (e.g., read_storage(...)), so the assertion is meaningful.

sourcery-ai · 2025-08-26T22:31:04Z

src/datachain/lib/dc/storage.py

+        "s3://bucket/dir" -> ("s3://bucket/dir", None)
+    """
+    # Check if URI contains any glob patterns
+    if not any(char in uri for char in ['*', '?', '[', '{', '}']):


issue (code-quality): We've found these issues:

Invert any/all to simplify comparisons (invert-any-all)

Hoist repeated code outside conditional statement [×2] (hoist-statement-from-if)
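The invert-any-all suggestion rests on De Morgan's law: "no glob character is present" can be written either as `not any(...)` or as `all(... not in ...)`. A small check using the character list from the diff above (the two function names are made up for this example):

```python
GLOB_CHARS = ["*", "?", "[", "{", "}"]


def has_no_glob_any(uri: str) -> bool:
    """Original spelling: negate an `any` over membership tests."""
    return not any(char in uri for char in GLOB_CHARS)


def has_no_glob_all(uri: str) -> bool:
    """Suggested spelling: `all` over the negated membership tests."""
    return all(char not in uri for char in GLOB_CHARS)


samples = ["s3://bucket/dir", "s3://bucket/*.png", "data/{a,b}/file", "plain.txt"]
agree = all(has_no_glob_any(s) == has_no_glob_all(s) for s in samples)
print(agree)
```

The two forms are logically equivalent, so the choice between them is about readability only.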

sourcery-ai · 2025-08-26T22:31:04Z

src/datachain/lib/dc/storage.py

+
+
+def expand_brace_pattern(pattern: str) -> list[str]:
+    """


issue (code-quality): We've found these issues:

Convert for loop into list comprehension (list-comprehension)

Inline variable that is immediately returned (inline-immediately-returned-variable)

cloudflare-workers-and-pages · 2025-08-26T22:31:06Z

Deploying datachain-documentation with Cloudflare Pages

Latest commit:	`962b92e`
Status:	✅ Deploy successful!
Preview URL:	https://79cda28e.datachain-documentation.pages.dev
Branch Preview URL:	https://globstar.datachain-documentation.pages.dev

codecov · 2025-08-27T06:47:01Z

Codecov Report

❌ Patch coverage is 85.51724% with 21 lines in your changes missing coverage. Please review.
✅ Project coverage is 88.78%. Comparing base (91617c0) to head (962b92e).
⚠️ Report is 4 commits behind head on main.

Files with missing lines	Patch %	Lines
src/datachain/lib/dc/storage_pattern.py	83.33%	7 Missing and 13 partials ⚠️
src/datachain/client/fsspec.py	83.33%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1309      +/-   ##
==========================================
- Coverage   88.84%   88.78%   -0.07%     
==========================================
  Files         155      156       +1     
  Lines       14240    14383     +143     
  Branches     2025     2062      +37     
==========================================
+ Hits        12652    12770     +118     
- Misses       1124     1134      +10     
- Partials      464      479      +15

Flag	Coverage Δ
datachain	`88.72% <85.51%> (-0.07%)`	⬇️

Files with missing lines	Coverage Δ
src/datachain/lib/dc/storage.py	`100.00% <100.00%> (ø)`
src/datachain/client/fsspec.py	`92.53% <83.33%> (-0.22%)`	⬇️
src/datachain/lib/dc/storage_pattern.py	`83.33% <83.33%> (ø)`

... and 1 file with indirect coverage changes

dreadatour

I haven't checked storage_pattern.py yet (will check it later), but I have some comments that I believe we should address before merging this PR.

Also a question: is it possible to use a glob pattern in the bucket name? What would be the result of dc.read_storage(s3://*/**/*)?

dreadatour · 2025-08-29T03:11:20Z

src/datachain/lib/dc/storage.py

+        # Check if URI contains glob patterns and split them
+        base_uri, glob_pattern = split_uri_pattern(single_uri)
+
+        # If a pattern is found, use the base_uri for listing
+        # The pattern will be used for filtering later
+        list_uri_to_use = base_uri if glob_pattern else single_uri
+
        list_ds_name, list_uri, list_path, list_ds_exists = get_listing(
-            single_uri, session, update=update
+            list_uri_to_use, session, update=update
        )


Optional: same, we already have checks in split_uri_pattern, so we can simplify this here:

base_uri, glob_pattern = split_uri_pattern(single_uri)
list_ds_name, list_uri, list_path, list_ds_exists = get_listing(
    base_uri, session, update=update
)

dreadatour · 2025-08-29T03:30:16Z

src/datachain/lib/dc/storage.py

-    for single_uri in uris:
+    for single_uri in expanded_uris:
+        # Check if URI contains glob patterns and split them
+        base_uri, glob_pattern = split_uri_pattern(single_uri)


If I am not mistaken, glob_pattern will be overwritten here when there are several expanded_uris. Say the first one has globs and the second one does not: the resulting glob_pattern will be None, the later check (line 220) will not work, and the globs will not be processed.

We should define glob_pattern with a default of None outside the for loop and set it only if the glob_pattern from split_uri_pattern is not None.

It might also be the case that there are multiple globs, but we store only one (from the last URI) and ignore the rest.

Also, since we store the glob pattern and apply it to all the results, it will affect other URIs when, say, the first URI is defined without a glob pattern and the second URI has one. The results will be unpredictable.

Could you please elaborate?

My understanding - glob_pattern is defined for each uri independently and not affecting each other. Could you please explain the cases when it can affect something.

The only thing I was able to find that changes across loop iterations is the bucket update: the same bucket could be updated multiple times, which we have to avoid. A change is coming.

Hm, you're absolutely right! Sorry for the confusing message :(
I misread the indentation level 😢

dmpetrov

minor improvements

dmpetrov · 2025-08-30T21:54:40Z

src/datachain/lib/dc/storage.py

-    for single_uri in uris:
+    for single_uri in expanded_uris:
+        # Check if URI contains glob patterns and split them
+        base_uri, glob_pattern = split_uri_pattern(single_uri)


Could you please elaborate?

My understanding - glob_pattern is defined for each uri independently and not affecting each other. Could you please explain the cases when it can affect something.

The only thing I was able to find that changes across loop iterations is the bucket update: the same bucket could be updated multiple times, which we have to avoid. A change is coming.
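One way to picture the repeated-update concern: brace expansion can turn a single URI such as s3://mybkt/dir/*.{png,jpg} into several URIs that share the same base, and each would otherwise trigger its own refresh. The sketch below is only an assumption about how that could be avoided (the helper `plan_listings` is hypothetical and is not the change that landed in the PR): it remembers which bases were already refreshed and passes the update flag only once per base.

```python
def plan_listings(base_uris: list, update: bool) -> list:
    """Hypothetical helper: decide, per base URI, whether a refresh is still needed."""
    seen = set()
    plan = []
    for base in base_uris:
        # Refresh only on the first occurrence of each base URI.
        refresh = update and base not in seen
        seen.add(base)
        plan.append((base, refresh))
    return plan


# Two expanded URIs share the base s3://mybkt/dir, so it is refreshed once.
print(plan_listings(["s3://mybkt/dir", "s3://mybkt/dir", "s3://mybkt/other"], update=True))
```

The glob pattern itself stays per-URI, as discussed above; only the refresh decision is shared across iterations.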

dmpetrov · 2025-08-30T22:57:19Z

Also a question: is it possible to use glob pattern in bucket name? What will be a result of: dc.read_storage(s3://*/**/*)?

Good catch - obviously it's not supported. Added the validation.

dreadatour

Looks good to me! Great improvement 👍

Wild, random, controversial idea: implement an additional filter_file chain method that filters a chain by the same globs, using the same functions you've added, without the need for C("file").glob(...) (whose syntax I never remember, so I have to go to the docs every time). The signature could look something like this: .filter_file(glob: str, file_signal: str = "file").

dreadatour · 2025-09-08T15:36:21Z

src/datachain/lib/dc/storage.py

-    for single_uri in uris:
+    for single_uri in expanded_uris:
+        # Check if URI contains glob patterns and split them
+        base_uri, glob_pattern = split_uri_pattern(single_uri)


Hm, you're absolutely right! Sorry for the confusing message :(
I misread the indentation level 😢

shcheklein · 2025-09-08T16:43:13Z

tests/unit/lib/test_storage_pattern.py

+@patch("datachain.lib.dc.storage.ls")
+@patch.object(sys.modules["datachain.lib.dc.datasets"], "read_dataset")
+def test_read_storage_brace_expansion_pattern(
+    mock_read_dataset, mock_ls, mock_get_listing, mock_session, mock_listing


Tests like this look very complicated, but do they really test only that a function was called? mock_chain.filter.called: is that the whole purpose? To my mind we don't test much; just testing that a function is called is not very interesting.

There will also be quite a heavy maintenance cost for such tests (all these mocks expose internals and will require updates on each refactoring change, plus a good understanding of how the internals work).

Yeah, removed. Just a small drop in test coverage since it's covered by func tests pretty well.

shcheklein

Mock tests seem very complicated and not very meaningful - we'll have to maintain them. I would consider improvements there.

shcheklein · 2025-09-08T16:47:59Z

src/datachain/lib/dc/storage_pattern.py

+    Raises:
+        ValueError: If a cloud storage bucket name contains glob patterns
+    """
+    if not any(uri.startswith(scheme) for scheme in ["s3://", "gs://", "az://"]):


No hf? Do we really want to hard code the list? Is there a better way to handle this?

good catch.

I added hf with test.

I was not able to find a single check for this in fsspec/client - I extracted this to fsspec.py

shcheklein · 2025-09-08T16:48:49Z

src/datachain/lib/dc/storage_pattern.py

+    if not any(uri.startswith(scheme) for scheme in ["s3://", "gs://", "az://"]):
+        return
+
+    # Extract bucket name (everything between :// and first /)


Don't we have some helpers for this in Client / fsspec?

nope. But I created one.
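The bucket-name validation discussed in this thread can be sketched roughly as below. The scheme list (s3, gs, az, plus hf as added after review) and the "everything between :// and the first /" rule come from the quoted diff and comments; the function name, the constant names, and the error message are assumptions, and the real helper was moved to fsspec.py, so its shape may differ.

```python
# Schemes named in the review thread; hf:// was added after the review comment.
CLOUD_SCHEMES = ("s3://", "gs://", "az://", "hf://")
GLOB_CHARS = "*?[{}"


def validate_bucket_has_no_glob(uri: str) -> None:
    """Hypothetical sketch: reject glob characters in a cloud bucket name."""
    scheme = next((s for s in CLOUD_SCHEMES if uri.startswith(s)), None)
    if scheme is None:
        return  # local paths and other schemes are not checked in this sketch

    # Bucket name: everything between "://" and the first "/".
    bucket = uri[len(scheme):].split("/", 1)[0]
    if any(ch in bucket for ch in GLOB_CHARS):
        raise ValueError(
            f"Glob patterns are not supported in bucket names: {bucket!r}"
        )


validate_bucket_has_no_glob("s3://mybkt/**/*.png")  # passes silently
try:
    validate_bucket_has_no_glob("s3://*/**/*")
except ValueError as exc:
    print(exc)
```

Failing early like this gives the s3://*/**/* case from the earlier question a clear error instead of an attempt to list every bucket.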

shcheklein

Quite a lot of logic in storage pattern ....it would be great to check if some libraries exist to support all of this. It is quite a lot of low level code for the change to maintain later

dmpetrov · 2025-09-09T02:51:33Z

it would be great to check if some libraries exist to support all of this. It is quite a lot of low level code for the change to maintain later

Yeah, I also was hoping that it will be much easier - that's why I took it. However, it's much more involved when it comes down to the details.

read_storage: globstar support

c9db62a

[pre-commit.ci] auto fixes from pre-commit.com hooks

beca90b

for more information, see https://pre-commit.ci

sourcery-ai bot reviewed Aug 26, 2025

dmpetrov marked this pull request as draft August 26, 2025 22:55

dmpetrov and others added 9 commits August 26, 2025 16:28

linter

bc3011f

linter2

68cdca7

fix test

5c99e45

another test fix

e16bc34

[pre-commit.ci] auto fixes from pre-commit.com hooks

288a9c7

tests

5612c0b

tests2

42cd2ba

test3

230adea

tests: dogs dir

74d694c

dmpetrov and others added 14 commits August 27, 2025 00:13

test py3.9

2f78ccd

[pre-commit.ci] auto fixes from pre-commit.com hooks

5a93f51

test windows

9fcc468

Merge branch 'globstar' of github.com:iterative/datachain into globstar

de46273

fix deep dir

6b8986c

[pre-commit.ci] auto fixes from pre-commit.com hooks

a9f64d8

linter: function split

1ce2651

reformat test

38c827a

[pre-commit.ci] auto fixes from pre-commit.com hooks

06e79c4

revert som etest changes back

94bedbd

Revert "revert som etest changes back"

9767a87

This reverts commit 94bedbd.

Merge branch 'main' into globstar

96ef405

initial func test for globstar

a7c1c19

fix braces issue

4f29c13

dmpetrov added 6 commits August 28, 2025 13:11

fixes

c14cdcb

linter

e134635

Merge branch 'main' into globstar

e736843

uodate docstring

d09dabd

refactoring: extract everything from storage.py

c94b934

small refactoring

464cee9

dreadatour requested changes Aug 29, 2025

dmpetrov and others added 4 commits August 30, 2025 13:33

Merge branch 'main' into globstar

77dc48e

Apply suggestions from code review

1752144

Co-authored-by: Vladimir Rudnykh <[email protected]>

avoid double listings

14f4a4b

remove unnecessary code

b9b02c4

dmpetrov commented Aug 30, 2025

check glob in buckets

3fb6fd3

dmpetrov added 2 commits August 30, 2025 16:36

improve test coverage

6338000

test coverage

7043721

dreadatour approved these changes Sep 8, 2025

shcheklein reviewed Sep 8, 2025

dmpetrov added 5 commits September 8, 2025 16:26

Merge branch 'main' into globstar

8cb9947

fix: docs formatting

c5f48a0

remove complecated unit tests with mocks

d6d13d3

extract uri cloud check to fsspec

00fe28b

fix: hf prefix and test

962b92e

dmpetrov merged commit 885163e into main Sep 10, 2025
37 of 38 checks passed

dmpetrov deleted the globstar branch September 10, 2025 03:27
