@japrescott japrescott commented Nov 17, 2025

Description

https://www.loom.com/share/ce23eb8e863d43a78036b8f7608aab0b

closes jeremyprescott/ss-160-batch-processing-cloud-storage-improvements
This pull request simplifies the creation of the reference file used by batch processing; it allows users to use their system credentials directly, without needing to create a custom script or figure out the file format. This should cover 90% of users' use cases.

Screen.Recording.2025-11-17.at.13.59.35.mov

List any dependencies that are required for this change.

  • This feature relies on fsspec, which abstracts S3, GCS and Azure access. It is installed optionally.
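For context, fsspec itself is protocol-agnostic; each cloud backend ships as a separate package that must be installed alongside it (the exact extras exposed by `inference` may differ — this is the generic form):

```shell
# fsspec dispatches on the URL scheme; backends are separate packages:
pip install fsspec gcsfs s3fs adlfs   # gs://, s3://, az:// respectively
```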

example usage

GCS

inference rf-cloud data-staging create-batch-of-images \
  --batch-id=test-gcs-lenny-$(date +%Y%m%d-%H%M%S) \
  --data-source=cloud-storage \
  --bucket-path=gs://roboflow-jeremy-test/

GCS credential overrides

  • GOOGLE_APPLICATION_CREDENTIALS

S3 staging

inference rf-cloud data-staging create-batch-of-images \
  --batch-id=test-s3-lenny-$(date +%Y%m%d-%H%M%S) \
  --data-source=cloud-storage \
  --bucket-path=s3://roboflow-jeremy-test/lenny

S3 credential overrides

  • AWS_PROFILE
  • AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY

S3 Env

  • AWS_ENDPOINT_URL
  • AWS_REGION

AZURE staging

Azure does not support using system credentials; they need to be specified explicitly.

AZURE_STORAGE_ACCOUNT_NAME=roboflowjeremy AZURE_STORAGE_ACCOUNT_KEY={KEY}  inference rf-cloud data-staging create-batch-of-images \
  --batch-id=test-azure-lenny-$(date +%Y%m%d-%H%M%S) \
  --data-source=cloud-storage \
  --bucket-path=az://rf-test/

Azure credential overrides

  • AZURE_STORAGE_SAS_TOKEN
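The credential resolution above happens entirely inside the fsspec backends. A minimal sketch of what listing a bucket path looks like through that API — `resolve_image_paths` is a hypothetical helper for illustration, not the actual CLI internals, and the extension list is illustrative:

```python
import fsspec

# Illustrative extension filter; the real CLI may accept a different set.
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".bmp", ".webp")

def resolve_image_paths(bucket_path: str):
    """List image files under a gs://, s3:// or az:// path.

    Credentials are picked up from the environment by the matching fsspec
    backend (gcsfs, s3fs, adlfs), e.g. GOOGLE_APPLICATION_CREDENTIALS or
    AWS_ACCESS_KEY_ID/AWS_SECRET_ACCESS_KEY, as described above.
    """
    fs, root = fsspec.url_to_fs(bucket_path)  # backend chosen by URL scheme
    return [p for p in fs.find(root)          # recursive listing
            if p.lower().endswith(IMAGE_EXTENSIONS)]
```

The same function works unchanged against any backend, including the in-memory filesystem (`memory://`) for tests.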

Type of change


  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • This change requires a documentation update

How has this change been tested, please provide a testcase or example of how you tested the change?


Any specific deployment considerations


Docs

  • Docs updated? What were the changes:

@japrescott japrescott marked this pull request as ready for review November 17, 2025 14:27
- `az://container/images/*.png` - PNG files in images folder

!!! hint "Cloud Storage Examples"

Collaborator

should highlight here that the credentials are used only locally by the CLI tool to generate signed URLs
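To make that point concrete: signing happens client-side, so only time-limited signed URLs ever leave the machine. A toy, stdlib-only sketch of the principle — this is generic HMAC signing for illustration, not the real AWS SigV4 / GCS V4 algorithms the backends use, and `presign` is a hypothetical helper:

```python
import hashlib
import hmac
import time
from urllib.parse import urlencode

def presign(url: str, secret_key: bytes, expires_in: int = 3600, now=None) -> str:
    """Derive a time-limited signature locally; the secret key itself
    never leaves the machine -- only the signed URL is shared."""
    expires = int(time.time() if now is None else now) + expires_in
    payload = f"GET {url} {expires}".encode()
    signature = hmac.new(secret_key, payload, hashlib.sha256).hexdigest()
    return f"{url}?{urlencode({'expires': expires, 'signature': signature})}"
```

The server side can recompute the same HMAC from its copy of the key and reject expired or tampered URLs; the cloud providers' signed-URL schemes follow the same shape with more elaborate canonicalization.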

@PawelPeczek-Roboflow
Collaborator

When it comes to implementation and functionality - looks ok
My main concern is the implementation of the remote storage ops:

  • I haven't seen fsspec previously; it looks like a decent lib with a community, but the way it is implemented (a giant abstraction over all file systems) makes it hard to read (literally a few minutes spent across 2 repos to find out how the GCS walk is implemented, plus not being able to find how they go from async I/O implementations to sync functions)
  • the whole implementation is slow when it comes to listing operations - in your video you reach a pace of 2k URLs per second, so signing 1M takes 8.5 min - and that's assuming dense matches in terms of path pattern - and this is because the best the lib can do is provide a generic implementation of the walk operation - which in general can be much faster assuming some structure of the keys. Not sure if that turns out to be a problem, but almost 10 mins to sign 1M images seems high

@japrescott
Contributor Author

thanks @PawelPeczek-Roboflow

literally a few minutes spent across 2 repos to find out how the GCS walk is implemented

fsspec does a good job of abstracting the different cloud storages. I chose it over cloudpathlib because it is used by dask, dvc and HF, assuming it's well tested. But I agree, finding the actual documentation/implementation details for each cloud storage provider (i.e. how to pass credentials) also felt more challenging than I would have liked.

which in general can be much faster assuming some structure of the keys

I agree. We could divide and conquer the namespace and probe for the existence of keys to optimally fan out the listing operation. I think that is something we could add or integrate later if this indeed becomes the limiting factor.
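The fan-out idea could be sketched like this — `list_prefix` is a hypothetical callable standing in for a backend listing call, and the shard prefixes assume the keys are roughly uniformly distributed (e.g. hex-named folders):

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List

def fanout_list(list_prefix: Callable[[str], Iterable[str]],
                root: str,
                shards: Iterable[str],
                max_workers: int = 16) -> List[str]:
    """Partition the key space by prefix and list the shards concurrently,
    instead of performing one sequential walk. Only helps if the keys are
    spread across the shard prefixes (e.g. shards="0123456789abcdef")."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        chunks = pool.map(lambda s: list(list_prefix(root + s)), shards)
    return [key for chunk in chunks for key in chunk]
```

Each shard turns into an independent listing request, so latency is bounded by the largest shard rather than the full walk.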

Not sure if that turns out to be a problem, but almost 10 mins to sign 1M images seems high

The primary goal of this PR was ease-of-use and better UX with immediate feedback to the user that something is happening. IMHO for payloads of 100k+ the user anyway should be hooking up their bucket directly to our services and let us manage this.

@PawelPeczek-Roboflow PawelPeczek-Roboflow left a comment


ok, fine for me

@japrescott japrescott merged commit ee847e6 into main Nov 21, 2025
41 checks passed
@japrescott japrescott deleted the japrescott/batch-processing-with-cloud-storage branch November 21, 2025 16:41