Commit ee847e6

Merge pull request #1726 from roboflow/japrescott/batch-processing-with-cloud-storage
closes jeremyprescott/ss-160-batch-processing-cloud-storage-improvements. This pull request simplifies the creation of the reference file used by batch processing; it allows users to use their system credentials directly without needing to create a custom script or figure out the format. This should cover 90% of users' use cases.
2 parents 59e39ab + bf9528d commit ee847e6

File tree

8 files changed

+1345
-159
lines changed


docs/workflows/batch_processing/about.md

Lines changed: 101 additions & 1 deletion
@@ -87,13 +87,113 @@ inference rf-cloud data-staging create-batch-of-images --images-dir <your-images-

for videos:

```bash
-inference rf-cloud data-staging create-batch-of-videos --videos-dir <your-images-dir-path> --batch-id <your-batch-id>
+inference rf-cloud data-staging create-batch-of-videos --videos-dir <your-videos-dir-path> --batch-id <your-batch-id>
```

!!! hint "Format of `<your-batch-id>`"

    Batch ID must be a lower-cased string without special characters; letters and digits are allowed.
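The naming rule above can be captured in a quick check. This is a hypothetical helper, not part of `inference-cli`; the dash is accepted here only because the example batch IDs later in this guide (such as `my-s3-batch`) contain one:

```python
import re

# Hypothetical validator for the documented batch-id rule: lower-case
# letters and digits. Dashes are also accepted because the example IDs
# in this guide (e.g. my-s3-batch) contain them; confirm against the
# service's actual constraint.
BATCH_ID_PATTERN = re.compile(r"^[a-z0-9][a-z0-9-]*$")

def is_valid_batch_id(batch_id: str) -> bool:
    return bool(BATCH_ID_PATTERN.match(batch_id))
```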
#### Cloud Storage Integration

If your data is already stored in cloud storage (S3, Google Cloud Storage, or Azure), you can process it directly without downloading files locally. This feature generates presigned URLs for your cloud files, making it efficient for large datasets.

!!! info "Installing Cloud Storage Support"

    Cloud storage integration requires additional dependencies. Install them with:

    ```bash
    pip install 'inference-cli[cloud-storage]'
    ```
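A pre-flight check for the optional dependencies can look like the sketch below. Which packages the `cloud-storage` extra actually installs is an assumption here (`boto3`, `google-cloud-storage`, and `azure-storage-blob` are the usual clients for these providers):

```python
import importlib.util

# Assumed provider-to-package mapping; verify against the extra's
# actual dependency list before relying on it.
PROVIDER_PACKAGES = {
    "s3": "boto3",
    "gs": "google.cloud.storage",
    "az": "azure.storage.blob",
}

def missing_cloud_packages() -> list:
    # find_spec on the top-level package returns None when it is absent.
    return [
        pkg for pkg in PROVIDER_PACKAGES.values()
        if importlib.util.find_spec(pkg.partition(".")[0]) is None
    ]
```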
**For images stored in cloud storage:**

```bash
inference rf-cloud data-staging create-batch-of-images \
    --data-source cloud-storage \
    --bucket-path <cloud-path> \
    --batch-id <your-batch-id>
```

**For videos stored in cloud storage:**

```bash
inference rf-cloud data-staging create-batch-of-videos \
    --data-source cloud-storage \
    --bucket-path <cloud-path> \
    --batch-id <your-batch-id>
```

The `--bucket-path` parameter supports:

- **S3**: `s3://bucket-name/path/`
- **Google Cloud Storage**: `gs://bucket-name/path/`
- **Azure Blob Storage**: `az://container-name/path/`

You can optionally include glob patterns to filter files:

- `s3://my-bucket/training-data/**/*.jpg` - All JPG files recursively
- `gs://my-bucket/videos/2024-*/*.mp4` - MP4 files in `2024-*` folders
- `az://container/images/*.png` - PNG files in the `images` folder
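To make the pattern semantics concrete, here is a rough sketch of how a `--bucket-path` value can be split into provider, bucket, and key pattern, and how a `**` glob can be translated for matching object keys. This is an illustration only, not the CLI's actual implementation; its exact matching rules may differ:

```python
import re

def parse_bucket_path(path: str):
    # "s3://my-bucket/training-data/**/*.jpg"
    # -> ("s3", "my-bucket", "training-data/**/*.jpg")
    scheme, _, rest = path.partition("://")
    bucket, _, key_pattern = rest.partition("/")
    return scheme, bucket, key_pattern

def glob_to_regex(pattern: str):
    # Simplified translation: '**' crosses '/' boundaries, '*' stays
    # within a single path segment. Real glob engines handle more
    # edge cases than this sketch.
    out, i = [], 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            out.append(".*")
            i += 2
        elif pattern[i] == "*":
            out.append("[^/]*")
            i += 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(out) + r"$")
```

With this translation, `training-data/**/*.jpg` matches `.jpg` keys nested under `training-data/` at any depth.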
!!! tip "Credentials Usage"

    Your cloud storage credentials are used **only locally** by the CLI tool to generate presigned URLs. They are **never uploaded** to Roboflow servers. The presigned URLs allow our batch processing service to access your files directly from your cloud storage without requiring access to your credentials.
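The mechanism that makes this possible can be sketched in a few lines. This is a conceptual illustration only; real providers use richer signing schemes (for example AWS Signature V4), but the core idea is the same: the URL embeds an expiry timestamp plus an HMAC signature, so whoever holds the URL can fetch the object without holding any credentials:

```python
import hashlib
import hmac
import time

def presign(url: str, secret_key: str, expires_in: int = 24 * 3600, now=None) -> str:
    # Append an expiry timestamp, then sign the whole URL with the
    # owner's secret key. Only someone holding the key could have
    # produced a valid signature.
    issued_at = int(now if now is not None else time.time())
    payload = f"{url}?expires={issued_at + expires_in}"
    signature = hmac.new(secret_key.encode(), payload.encode(), hashlib.sha256).hexdigest()
    return f"{payload}&signature={signature}"

def verify(signed_url: str, secret_key: str, now=None) -> bool:
    # The storage service recomputes the signature and rejects the
    # request if it does not match or the URL has expired.
    base, _, signature = signed_url.rpartition("&signature=")
    expected = hmac.new(secret_key.encode(), base.encode(), hashlib.sha256).hexdigest()
    expires_at = int(base.rpartition("=")[2])
    current = now if now is not None else time.time()
    return hmac.compare_digest(signature, expected) and current < expires_at
```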
!!! hint "Cloud Storage Examples"

    **AWS S3:**
    ```bash
    export AWS_PROFILE=my-profile  # Optional, uses credentials from ~/.aws/credentials

    inference rf-cloud data-staging create-batch-of-images \
        --data-source cloud-storage \
        --bucket-path "s3://my-bucket/training-data/**/*.jpg" \
        --batch-id my-s3-batch
    ```
    For more information, see [AWS S3 configuration](./integration.md#aws-s3-and-s3-compatible-storage).

    **Google Cloud Storage:**
    ```bash
    export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account.json

    inference rf-cloud data-staging create-batch-of-videos \
        --data-source cloud-storage \
        --bucket-path "gs://my-gcs-bucket/videos/**/*.mp4" \
        --batch-id my-gcs-batch
    ```
    For more information, see [Google Cloud Storage configuration](./integration.md#google-cloud-storage).

    **Azure Blob Storage:**
    ```bash
    export AZURE_STORAGE_ACCOUNT_NAME=myaccount
    export AZURE_STORAGE_SAS_TOKEN="sv=2021-06-08&ss=b&srt=sco&sp=rl"

    inference rf-cloud data-staging create-batch-of-images \
        --data-source cloud-storage \
        --bucket-path "az://my-container/images/*.png" \
        --batch-id my-azure-batch
    ```
    For more information, see [Azure Blob Storage configuration](./integration.md#azure-blob-storage).
!!! tip "Cloud Storage Configuration"

    For detailed authentication options, credential management, and advanced configuration, see the [Cloud Storage Integration guide](./integration.md#cloud-storage-authentication).
!!! info "Large Dataset Handling"

    The system automatically handles large datasets:

    - **Images**: Automatically split into chunks of 20,000 files each for efficient processing
    - **Videos**: Best results with batches under 1,000 videos
    - **Progress tracking**: You'll see real-time progress as files are listed and presigned URLs are generated

    When processing over 20,000 images, you'll see a message indicating how many chunks will be created.
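The chunking behaviour described above amounts to the following. This is an illustration of the documented 20,000-file split, not the actual `inference-cli` implementation:

```python
CHUNK_SIZE = 20_000  # documented per-chunk image limit

def chunk_files(files: list, chunk_size: int = CHUNK_SIZE) -> list:
    # Slice the file list into consecutive chunks; the last chunk
    # holds whatever remains.
    return [files[i:i + chunk_size] for i in range(0, len(files), chunk_size)]
```

For example, 45,000 images would yield three chunks: two of 20,000 files and one of 5,000.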
!!! warning "Presigned URL Expiration"

    Generated presigned URLs are valid for 24 hours. Ensure your batch processing job completes within this timeframe.

Then, you can inspect the details of the staged batch of data:
