Add VirtualObjectStore to support routing paths to multiple ObjectStores #17084

Open. kosiew wants to merge 24 commits into base: main
Conversation

kosiew (Contributor) commented Aug 8, 2025

Which issue does this PR close?

Closes #16991.

Rationale for this change

DataFusion currently lacks a mechanism to unify access to multiple object stores under a single abstraction. This PR introduces a VirtualObjectStore that routes read operations to specific object stores based on the first segment of the object path (i.e., a virtual prefix). This enables more flexible configurations where datasets might be spread across S3, local disk, memory, etc., but can be accessed transparently.
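The prefix-based routing described above can be illustrated with a small, standard-library-only sketch. It is not the PR's implementation: `StoreHandle` and `PrefixRouter` are hypothetical names, the real code works with `Arc<dyn ObjectStore>` and object paths rather than strings, and whether the prefix is stripped before the request is forwarded is an assumption made here for illustration.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a real object store; it only carries a label.
#[derive(Debug, PartialEq)]
pub struct StoreHandle(pub &'static str);

/// Hypothetical router that maps a virtual prefix (the first path segment)
/// to a backing store.
pub struct PrefixRouter {
    stores: HashMap<String, StoreHandle>,
}

impl PrefixRouter {
    pub fn new() -> Self {
        Self { stores: HashMap::new() }
    }

    /// Registers `store` under the virtual prefix `prefix`.
    pub fn register(&mut self, prefix: &str, store: StoreHandle) {
        self.stores.insert(prefix.to_string(), store);
    }

    /// Splits `path` into its first segment and the remainder, then looks up
    /// the store registered for that segment. Returns `None` when no store is
    /// registered for the prefix.
    /// Assumption: the prefix is stripped before forwarding to the store.
    pub fn resolve<'a>(&self, path: &'a str) -> Option<(&StoreHandle, &'a str)> {
        let trimmed = path.trim_start_matches('/');
        let (prefix, rest) = match trimmed.split_once('/') {
            Some((prefix, rest)) => (prefix, rest),
            None => (trimmed, ""),
        };
        self.stores.get(prefix).map(|store| (store, rest))
    }
}
```

With stores registered under `s3` and `mem`, resolving `s3/data/a.parquet` would select the `s3` store with the remaining path `data/a.parquet`, while a path whose first segment is not registered resolves to `None`.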

What changes are included in this PR?

  • Adds a new VirtualObjectStore implementation that supports routing of read operations (e.g., get, list, list_with_delimiter) to configured stores based on path prefix.
  • Integrates virtual_store: Option<Arc<dyn ObjectStore>> into FileScanConfig and FileScanConfigBuilder, enabling its use in scans.
  • Updates DataSource::execute to use the virtual store if configured, falling back to resolving the store via object_store_url.
  • Adds necessary dependencies (async-trait, tokio) for async trait implementation and testing.
  • Adds comprehensive unit tests covering key scenarios, including list resolution and path routing.
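The store selection described in the list above (use the virtual store if one is configured, otherwise resolve by `object_store_url`) could look roughly like the following sketch. All types here (`Store`, `NamedStore`, `ScanConfig`, `Registry`, `select_store`) are simplified, hypothetical stand-ins and not DataFusion's actual `FileScanConfig`, registry, or `ObjectStore` trait; only the `Option<Arc<dyn ...>>` field shape is taken from the description.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Hypothetical, minimal stand-in for the object store trait.
pub trait Store {
    fn name(&self) -> String;
}

/// Hypothetical store that only remembers its name.
pub struct NamedStore(pub String);

impl Store for NamedStore {
    fn name(&self) -> String {
        self.0.clone()
    }
}

/// Simplified stand-in for the scan configuration.
pub struct ScanConfig {
    pub object_store_url: String,
    pub virtual_store: Option<Arc<dyn Store>>,
}

/// Simplified stand-in for a registry that resolves stores by URL.
pub struct Registry {
    stores: HashMap<String, Arc<dyn Store>>,
}

impl Registry {
    pub fn new() -> Self {
        Self { stores: HashMap::new() }
    }

    pub fn register(&mut self, url: &str, store: Arc<dyn Store>) {
        self.stores.insert(url.to_string(), store);
    }

    pub fn get(&self, url: &str) -> Result<Arc<dyn Store>, String> {
        self.stores
            .get(url)
            .cloned()
            .ok_or_else(|| format!("no object store registered for {}", url))
    }
}

/// Prefers the configured virtual store; otherwise falls back to the
/// registry lookup keyed by `object_store_url`.
pub fn select_store(config: &ScanConfig, registry: &Registry) -> Result<Arc<dyn Store>, String> {
    match &config.virtual_store {
        Some(store) => Ok(Arc::clone(store)),
        None => registry.get(&config.object_store_url),
    }
}
```

Keeping the field optional means existing configurations that never set a virtual store continue to follow the URL-based lookup unchanged.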

Are these changes tested?

Yes. The PR includes extensive unit tests in virtual_object_store.rs covering:

  • Prefix-based routing
  • Nested path handling
  • Nonexistent store prefixes
  • list and list_with_delimiter sorting and results

Are there any user-facing changes?

Yes:

  • Developers can now optionally pass a virtual_store to FileScanConfig, enabling custom routing behavior.
  • This introduces a more flexible way to access multi-store environments through DataFusion.

No breaking changes are introduced. VirtualObjectStore does not yet support write operations (put, delete, etc.); they return NotSupported errors.
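As a rough illustration of the read-only behaviour, the sketch below rejects writes with a `NotSupported` error while serving reads normally. `StoreError` and `ReadOnlyStore` are hypothetical types defined here for the example; the PR itself would use the error type of the object store library, and the real methods are async.

```rust
use std::collections::HashMap;

/// Hypothetical error type for this sketch (not the library's error enum).
#[derive(Debug, PartialEq)]
pub enum StoreError {
    NotFound(String),
    NotSupported(String),
}

/// Hypothetical in-memory store that allows reads but rejects writes.
pub struct ReadOnlyStore {
    objects: HashMap<String, Vec<u8>>,
}

impl ReadOnlyStore {
    /// Builds a store pre-populated with the given (path, bytes) pairs.
    pub fn from_objects(objects: Vec<(&str, Vec<u8>)>) -> Self {
        Self {
            objects: objects
                .into_iter()
                .map(|(path, bytes)| (path.to_string(), bytes))
                .collect(),
        }
    }

    /// Read path: returns the stored bytes or `NotFound`.
    pub fn get(&self, path: &str) -> Result<&[u8], StoreError> {
        self.objects
            .get(path)
            .map(|bytes| bytes.as_slice())
            .ok_or_else(|| StoreError::NotFound(path.to_string()))
    }

    /// Write path: always rejected, mirroring the read-only behaviour.
    pub fn put(&self, path: &str, _bytes: Vec<u8>) -> Result<(), StoreError> {
        Err(StoreError::NotSupported(format!("put is not supported for {}", path)))
    }

    /// Delete is rejected in the same way as put.
    pub fn delete(&self, path: &str) -> Result<(), StoreError> {
        Err(StoreError::NotSupported(format!("delete is not supported for {}", path)))
    }
}
```

Returning an explicit error rather than silently ignoring writes lets callers detect the limitation and route writes to a concrete store instead.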

github-actions bot added the execution (Related to the execution crate) and datasource (Changes to the datasource crate) labels Aug 8, 2025
…icate that write operations are not supported
alamb (Contributor) commented Aug 8, 2025

FYI @EmilyMatt -- I am not sure if you have seen what @kosiew is working on

kosiew added 4 commits August 12, 2025 10:43
- Add chrono to [dependencies] in datafusion/execution/Cargo.toml
- Remove chrono from [dev-dependencies]
- Update tokio dev-dependency to enable features ["macros", "rt", "sync"]
kosiew force-pushed the virtual-object-store-16991 branch from 789630a to 5075467 August 12, 2025 04:12
kosiew marked this pull request as ready for review August 12, 2025 04:12
Development

Successfully merging this pull request may close these issues.

request: Connect file groups in datasource to their object store