Merge pull request #1726 from roboflow/japrescott/batch-processing-with-cloud-storage
closes jeremyprescott/ss-160-batch-processing-cloud-storage-improvements
This pull request simplifies the creation of the reference file used by batch processing; it lets users rely on their system credentials directly, without writing a custom script or figuring out the file format. This should cover 90% of users' use cases.
Batch ID must be a lower-cased string without special characters; only letters and digits are allowed.
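The rule above can be expressed as a simple validation check. A minimal sketch; the regex is an assumption derived from the description, not the service's exact validator:

```python
import re

# Assumed rule from the description above: lower-case letters and digits only.
BATCH_ID_RE = re.compile(r"[a-z0-9]+")

def is_valid_batch_id(batch_id: str) -> bool:
    # fullmatch ensures the whole string conforms, not just a prefix.
    return BATCH_ID_RE.fullmatch(batch_id) is not None
```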
#### Cloud Storage Integration
If your data is already stored in cloud storage (S3, Google Cloud Storage, or Azure), you can process it directly without downloading files locally. This feature generates presigned URLs for your cloud files, making it efficient for large datasets.
!!! info "Installing Cloud Storage Support"
    Cloud storage integration requires additional dependencies. Install them with:
You can optionally include glob patterns to filter files:
- `s3://my-bucket/training-data/**/*.jpg` - all JPG files, recursively
- `gs://my-bucket/videos/2024-*/*.mp4` - MP4 files in folders matching `2024-*`
- `az://container/images/*.png` - PNG files in the `images` folder
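Illustratively, this kind of pattern filtering can be reproduced over a listing of object keys with Python's standard `fnmatch`. The keys below are hypothetical, and this is a conceptual sketch, not the CLI's implementation:

```python
from fnmatch import fnmatch

# Hypothetical object keys as listed from a bucket.
keys = [
    "training-data/cats/img001.jpg",
    "training-data/dogs/img002.jpg",
    "training-data/readme.txt",
]

# fnmatch is not path-aware, so "*" also crosses "/" separators; that makes
# a single "*" pattern behave like the recursive "**/*.jpg" form shown above.
matched = [k for k in keys if fnmatch(k, "training-data/*.jpg")]
```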
!!! tip "Credentials Usage"
    Your cloud storage credentials are used **only locally** by the CLI tool to generate presigned URLs. They are **never uploaded** to Roboflow servers. The presigned URLs allow our batch processing service to access your files directly from your cloud storage without requiring access to your credentials.
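As a rough illustration of why presigning keeps credentials local, here is a generic signed-URL sketch using a plain HMAC. Real providers use their own schemes (for example, AWS Signature Version 4), so treat this as conceptual only:

```python
import hashlib
import hmac
import time
from urllib.parse import urlencode

def presign(url: str, secret: bytes, ttl_seconds: int = 86_400) -> str:
    # Attach an expiry timestamp plus an HMAC over (url, expiry), so the
    # storage service can verify the link later; the long-term secret never
    # leaves the machine that generated the URL.
    expires = int(time.time()) + ttl_seconds
    signature = hmac.new(
        secret, f"{url}?Expires={expires}".encode(), hashlib.sha256
    ).hexdigest()
    return f"{url}?{urlencode({'Expires': expires, 'Signature': signature})}"

# Hypothetical object URL and key, for illustration only.
signed = presign("https://storage.example.com/my-bucket/img001.jpg", b"secret-key")
```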
!!! hint "Cloud Storage Examples"
    **AWS S3:**
    ```bash
    export AWS_PROFILE=my-profile  # Optional, uses credentials from ~/.aws/credentials
    ```
    For more information, see [Azure Blob Storage configuration](./integration.md#azure-blob-storage).
!!! tip "Cloud Storage Configuration"
    For detailed authentication options, credential management, and advanced configuration, see the [Cloud Storage Integration guide](./integration.md#cloud-storage-authentication).
!!! info "Large Dataset Handling"
    The system automatically handles large datasets:

    - **Images**: automatically split into chunks of 20,000 files each for efficient processing
    - **Videos**: best results with batches under 1,000 videos
    - **Progress tracking**: you'll see real-time progress as files are listed and presigned URLs are generated

    When processing over 20,000 images, you'll see a message indicating how many chunks will be created.
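The chunking behavior described above can be sketched as a simple list split. Illustrative only; the chunk size matches the documented 20,000-file limit:

```python
def chunk_files(files: list[str], chunk_size: int = 20_000) -> list[list[str]]:
    # Split the file list into fixed-size chunks; the last chunk holds the
    # remainder, so every file lands in exactly one chunk.
    return [files[i:i + chunk_size] for i in range(0, len(files), chunk_size)]

parts = chunk_files([f"img_{i:05d}.jpg" for i in range(45_000)])
# 45,000 images -> 3 chunks of 20,000, 20,000, and 5,000 files
```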
!!! warning "Presigned URL Expiration"
    Generated presigned URLs are valid for 24 hours. Ensure your batch processing job completes within this timeframe.
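A small sketch of the resulting time budget, assuming URLs are valid for exactly 24 hours from the moment they are generated:

```python
from datetime import datetime, timedelta, timezone

URL_TTL = timedelta(hours=24)  # validity window stated above

def fits_in_url_window(estimated_job_duration: timedelta) -> bool:
    # A job must finish before its presigned URLs expire.
    return estimated_job_duration <= URL_TTL

# Example: URLs generated at 08:00 UTC expire at 08:00 UTC the next day.
issued_at = datetime(2024, 1, 1, 8, 0, tzinfo=timezone.utc)
expires_at = issued_at + URL_TTL
```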
Then, you can inspect the details of a staged batch of data: