Commit e3088d4

feat: improve operator tools for cloud storage management
1 parent 97534d9 commit e3088d4

4 files changed

+1285
-3
lines changed

operator-tools/README.md

Lines changed: 36 additions & 0 deletions
@@ -76,9 +76,19 @@ This makes the webhook endpoint available at `http://localhost:12000/samples`.
7676
# View item details with S3 stats
7777
uv run operator-tools/manage_item.py info sentinel-2-l2a-staging ITEM_ID --s3-stats
7878

79+
# View storage tier statistics from STAC metadata
80+
uv run operator-tools/manage_item.py info sentinel-2-l2a-staging ITEM_ID --s3-stac-info
81+
82+
# Combine both statistics
83+
uv run operator-tools/manage_item.py info sentinel-2-l2a-staging ITEM_ID --s3-stats --s3-stac-info
84+
7985
# Debug S3 URL extraction
8086
uv run operator-tools/manage_item.py info sentinel-2-l2a-staging ITEM_ID --s3-stats --debug
8187

88+
# Sync storage tiers for a single item (dry run)
89+
uv run operator-tools/manage_item.py sync-storage-tiers sentinel-2-l2a-staging ITEM_ID \
90+
--s3-endpoint https://s3.de.io.cloud.ovh.net --dry-run
91+
8292
# Delete single item with S3 cleanup (dry run)
8393
uv run operator-tools/manage_item.py delete sentinel-2-l2a-staging ITEM_ID --clean-s3 --dry-run
8494

@@ -88,6 +98,8 @@ uv run operator-tools/manage_item.py delete sentinel-2-l2a-staging ITEM_ID --cle
8898

8999
**Key Features:**
90100
- Detailed item inspection with S3 statistics
101+
- Storage tier statistics from STAC metadata
102+
- Sync storage tiers with S3 (single item)
91103
- Debug mode for S3 URL extraction troubleshooting
92104
- Delete with automatic S3 validation
93105
- Dry-run mode for safe testing
@@ -104,6 +116,7 @@ Comprehensive tool for managing STAC collections using the Transaction API, **no
104116
- Clean collections (remove all items)
105117
- Clean collections with validated S3 data deletion (removes items AND all S3 objects)
106118
- View comprehensive S3 storage statistics (works with any S3 asset structure)
119+
- View storage tier statistics from STAC metadata (all items processed)
107120
- Automatic validation ensures S3 cleanup succeeds before removing STAC items
108121
- Create/update collections from templates
109122
- Batch operations on multiple collections
@@ -127,9 +140,19 @@ uv run operator-tools/manage_collections.py info sentinel-2-l2a-staging
127140
# View collection with S3 storage statistics
128141
uv run operator-tools/manage_collections.py info sentinel-2-l2a-staging --s3-stats
129142

143+
# View storage tier statistics from STAC metadata
144+
uv run operator-tools/manage_collections.py info sentinel-2-l2a-staging --s3-stac-info
145+
146+
# Combine both statistics
147+
uv run operator-tools/manage_collections.py info sentinel-2-l2a-staging --s3-stats --s3-stac-info
148+
130149
# Debug S3 URL extraction
131150
uv run operator-tools/manage_collections.py info sentinel-2-l2a-staging --s3-stats --debug
132151

152+
# Sync storage tiers for entire collection (dry run)
153+
uv run operator-tools/manage_collections.py sync-storage-tiers sentinel-2-l2a-staging \
154+
--s3-endpoint https://s3.de.io.cloud.ovh.net --dry-run
155+
133156
# Clean a collection (dry run first!)
134157
uv run operator-tools/manage_collections.py clean sentinel-2-l2a-staging --dry-run
135158
uv run operator-tools/manage_collections.py clean sentinel-2-l2a-staging
@@ -149,6 +172,7 @@ uv run operator-tools/manage_collections.py batch-create stac/
149172
**Key Features:**
150173
- **Validated S3 cleanup** - Verifies all S3 objects deleted before removing STAC items
151174
- **Comprehensive S3 support** - Handles individual files, directories, and Zarr stores
175+
- **Sync storage tiers** - Keep STAC metadata in sync with S3 storage classes
152176
- **Debug mode** - Detailed S3 URL extraction and validation info
153177
- **Safety first** - STAC items preserved if S3 cleanup fails
154178

@@ -277,6 +301,12 @@ Check how much S3 storage a collection is using:
277301
# View collection info with S3 statistics
278302
uv run operator-tools/manage_collections.py info sentinel-2-l2a-staging --s3-stats
279303

304+
# View storage tier statistics from STAC metadata
305+
uv run operator-tools/manage_collections.py info sentinel-2-l2a-staging --s3-stac-info
306+
307+
# Combine both statistics
308+
uv run operator-tools/manage_collections.py info sentinel-2-l2a-staging --s3-stats --s3-stac-info
309+
280310
# With debug output (shows detailed URL extraction)
281311
uv run operator-tools/manage_collections.py info sentinel-2-l2a-staging --s3-stats --debug
282312
```
@@ -287,6 +317,12 @@ uv run operator-tools/manage_collections.py info sentinel-2-l2a-staging --s3-sta
287317
- Estimated total storage across all items
288318
- Works with **any S3 asset structure** (individual files, Zarr stores, directories)
289319

320+
**Storage Tier Statistics (`--s3-stac-info`):**
321+
- Processes all items in the collection
322+
- Shows distribution of storage tiers (STANDARD, STANDARD_IA, EXPRESS_ONEZONE, MIXED)
323+
- Detailed breakdowns for mixed storage tiers
324+
- Reads from STAC metadata (no S3 queries required)
325+
290326
**Example:**
291327
```
292328
S3 Storage Statistics:

operator-tools/README_collections.md

Lines changed: 230 additions & 2 deletions
@@ -325,7 +325,7 @@ uv run operator-tools/manage_collections.py delete sentinel-2-l2a-staging --clea
325325

326326
#### 5. `info` - Show Collection Information
327327

328-
Display detailed information about a collection, including item count. **NEW**: Optionally include comprehensive S3 storage statistics.
328+
Display detailed information about a collection, including item count. **NEW**: Optionally include comprehensive S3 storage statistics and storage tier statistics from STAC metadata.
329329

330330
```bash
331331
# Basic collection info
@@ -334,6 +334,12 @@ uv run operator-tools/manage_collections.py info sentinel-2-l2a-staging
334334
# Include S3 storage statistics (samples first 5 items)
335335
uv run operator-tools/manage_collections.py info sentinel-2-l2a-staging --s3-stats
336336

337+
# Include storage tier statistics from STAC metadata (all items)
338+
uv run operator-tools/manage_collections.py info sentinel-2-l2a-staging --s3-stac-info
339+
340+
# Combine both statistics
341+
uv run operator-tools/manage_collections.py info sentinel-2-l2a-staging --s3-stats --s3-stac-info
342+
337343
# With debug output (shows detailed URL extraction)
338344
uv run operator-tools/manage_collections.py info sentinel-2-l2a-staging --s3-stats --debug
339345

@@ -344,6 +350,7 @@ uv run operator-tools/manage_collections.py info sentinel-2-l2a-staging --s3-sta
344350

345351
**Options:**
346352
- `--s3-stats`: **[NEW]** Include S3 storage statistics (object count, total size)
353+
- `--s3-stac-info`: **[NEW]** Query STAC API and compute storage tier statistics for all assets of all items
347354
- `--debug`: **[NEW]** Show detailed debug information about S3 URL extraction
348355
- `--s3-endpoint`: S3 endpoint URL (optional, uses `AWS_ENDPOINT_URL` env var if not specified)
349356

@@ -358,6 +365,11 @@ uv run operator-tools/manage_collections.py info sentinel-2-l2a-staging --s3-sta
358365
- Object count and total size for sampled items
359366
- Estimated total storage for all items in collection
360367
- Works with **any S3 asset structure** (individual files, Zarr stores, directories)
368+
- **[NEW]** Storage tier statistics (when `--s3-stac-info` is used):
369+
- Items/assets with tier info vs without tier info
370+
- Distribution of storage tiers (STANDARD, STANDARD_IA, EXPRESS_ONEZONE, MIXED)
371+
- Detailed breakdowns for mixed storage tiers
372+
- Reads from STAC metadata (no S3 queries required)
361373

362374
**S3 Statistics Behavior:**
363375
- Samples the first 5 items to avoid long wait times on large collections
@@ -367,6 +379,16 @@ uv run operator-tools/manage_collections.py info sentinel-2-l2a-staging --s3-sta
367379
- Provides estimated total based on sample average
368380
- Requires AWS credentials to access S3
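
The sampling estimate described above is plain arithmetic: average the sizes of the sampled items and multiply by the item count. A minimal sketch follows, assuming sizes are already known in bytes; the function name `estimate_collection_size` is hypothetical and not taken from `manage_collections.py`.

```python
from typing import Sequence


def estimate_collection_size(sample_sizes: Sequence[float], total_items: int) -> float:
    """Extrapolate total storage from a sample (hypothetical helper).

    sample_sizes holds the measured size of each sampled item, in any unit;
    the result is expressed in the same unit.
    """
    if not sample_sizes:
        return 0.0
    average = sum(sample_sizes) / len(sample_sizes)
    return average * total_items


# Five sampled items, 43 items in the collection.
print(estimate_collection_size([100, 200, 300, 400, 500], 43))
```

The estimate is only as good as the sample: if the first five items are unusually small or large, the extrapolated total will be skewed accordingly.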
369381

382+
**Storage Tier Statistics Behavior (`--s3-stac-info`):**
383+
- Processes **all items** in the collection (with progress bar)
384+
- Reads storage tier information from STAC metadata (`assets[*].alternate.s3.storage:scheme.tier`)
385+
- No S3 queries required - reads directly from STAC item metadata
386+
- Aggregates statistics across all items:
387+
- Total asset counts per tier
388+
- Combined tier distributions for mixed storage
389+
- Summary statistics (items/assets with/without tier info)
390+
- Shows distribution breakdowns for mixed storage tiers
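
The aggregation described above can be sketched roughly as follows. This is an illustration, not the tool's code: the helper names (`asset_tier`, `tally_tiers`, `tier_percentages`) are hypothetical, items are assumed to be plain STAC item dictionaries, and an item is assumed to count as "with tier info" when at least one of its assets carries a tier.

```python
from collections import Counter
from typing import Any, Dict, Iterable, Optional


def asset_tier(asset: Dict[str, Any]) -> Optional[str]:
    """Read assets[*].alternate.s3.storage:scheme.tier, or None if absent."""
    alternate = asset.get("alternate") or {}
    scheme = (alternate.get("s3") or {}).get("storage:scheme") or {}
    return scheme.get("tier")


def tally_tiers(items: Iterable[Dict[str, Any]]) -> Dict[str, Any]:
    """Count assets per tier and items/assets with or without tier info."""
    tiers: Counter = Counter()
    items_with = items_without = assets_without = 0
    for item in items:
        item_has_tier = False
        for asset in (item.get("assets") or {}).values():
            tier = asset_tier(asset)
            if tier is None:
                assets_without += 1
            else:
                tiers[tier] += 1
                item_has_tier = True
        # Assumption: one asset with a tier is enough to count the item.
        if item_has_tier:
            items_with += 1
        else:
            items_without += 1
    return {
        "tiers": dict(tiers),
        "items_with_tier": items_with,
        "items_without_tier": items_without,
        "assets_with_tier": sum(tiers.values()),
        "assets_without_tier": assets_without,
    }


def tier_percentages(tiers: Dict[str, int]) -> Dict[str, float]:
    """Share of assets per tier, rounded to one decimal place."""
    total = sum(tiers.values())
    if total == 0:
        return {}
    return {tier: round(100 * count / total, 1) for tier, count in tiers.items()}
```

Because everything is read from item metadata, a tally like this needs no S3 credentials, which matches the "no S3 queries required" behavior listed above.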
391+
370392
**Example Output:**
371393

372394
```bash
@@ -415,6 +437,164 @@ Shows detailed per-item information:
415437
Objects: 1,247, Size: 2.34 GB (cumulative)
416438
```
417439

440+
**Storage Tier Statistics Output (`--s3-stac-info`):**
441+
442+
```bash
443+
uv run operator-tools/manage_collections.py info sentinel-2-l2a-staging --s3-stac-info
444+
```
445+
446+
```
447+
============================================================
448+
Collection: sentinel-2-l2a-staging
449+
Title: Sentinel-2 Level-2A [V1 staging]
450+
...
451+
Items: 43
452+
────────────────────────────────────────────────────────────
453+
Storage Tier Statistics (from STAC metadata):
454+
Processing 43 items...
455+
Analyzing storage tiers [####################################] 43/43
456+
457+
Summary:
458+
Items with tier info: 43
459+
Items without tier info: 0
460+
Total assets: 645
461+
Assets with tier info: 645
462+
Assets without tier info: 0
463+
464+
Storage Tier Distribution (by asset count):
465+
STANDARD_IA: 430 assets (66.7%)
466+
STANDARD: 215 assets (33.3%)
467+
Distribution:
468+
STANDARD: 215 objects (100.0%)
469+
```
470+
471+
#### 6. `sync-storage-tiers` - Sync Storage Tier Metadata for Collection
472+
473+
Synchronize storage tier metadata for all items in a collection with S3. This command queries S3 for the current storage class of each **object** and updates the STAC item metadata to match. It compares object-level distributions (not just asset-level tiers) and prints a detailed summary of the mismatches found and the corrections made.
474+
475+
```bash
476+
# Dry run (preview changes)
477+
uv run operator-tools/manage_collections.py sync-storage-tiers sentinel-2-l2a-staging \
478+
--s3-endpoint https://s3.de.io.cloud.ovh.net --dry-run
479+
480+
# Actually sync (with confirmation)
481+
uv run operator-tools/manage_collections.py sync-storage-tiers sentinel-2-l2a-staging \
482+
--s3-endpoint https://s3.de.io.cloud.ovh.net
483+
484+
# Add missing alternate.s3 for legacy items
485+
uv run operator-tools/manage_collections.py sync-storage-tiers sentinel-2-l2a-staging \
486+
--s3-endpoint https://s3.de.io.cloud.ovh.net --add-missing
487+
488+
# Skip confirmation prompt
489+
uv run operator-tools/manage_collections.py sync-storage-tiers sentinel-2-l2a-staging \
490+
--s3-endpoint https://s3.de.io.cloud.ovh.net -y
491+
```
492+
493+
**Options:**
494+
- `--s3-endpoint`: S3 endpoint URL (required, or set `AWS_ENDPOINT_URL` env var)
495+
- `--add-missing`: Add `alternate.s3` to assets that don't have it (for legacy items)
496+
- `--dry-run`: Show what would be updated without actually updating
497+
- `--yes`, `-y`: Skip confirmation prompt
498+
499+
**Output includes:**
500+
- Progress bar showing sync progress
501+
- Summary statistics:
502+
- Items processed, updated, unchanged, failed
503+
- Assets updated, added, failed
504+
- **Object-level statistics**: Shows object counts per tier from both S3 and STAC
505+
- **Problems section**: Lists items/assets with mismatches showing object-level differences
506+
- **Corrections section**: Shows what was fixed (first 10 items, then summary)
507+
508+
**How it works:**
509+
1. Fetches all items from the collection
510+
2. For each item and asset:
511+
- Queries S3 to get **object-level distribution** (counts per tier)
512+
- Reads STAC metadata to get **object-level distribution** from `tier_distribution`
513+
- Compares object counts per tier (not just the tier name)
514+
- Identifies mismatches at the object level
515+
3. Updates STAC metadata to match S3 object-level distribution
516+
4. Optionally adds `alternate.s3` for legacy items (if `--add-missing`)
517+
5. Updates STAC items via Transaction API (DELETE + POST)
518+
6. Reports summary with object-level statistics, problems, and corrections
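
The comparison in step 2 can be illustrated with a small sketch. It assumes both sides are available as `{tier: object_count}` dictionaries; the function names are hypothetical, and deriving a `MIXED` asset tier from a distribution with more than one class is an assumption about how the tool summarizes mixed storage.

```python
from typing import Dict, Optional, Tuple


def diff_distributions(
    s3: Dict[str, int], stac: Dict[str, int]
) -> Dict[str, Tuple[int, int]]:
    """Return {tier: (s3_count, stac_count)} for every tier whose counts differ."""
    mismatches = {}
    for tier in sorted(set(s3) | set(stac)):
        s3_count, stac_count = s3.get(tier, 0), stac.get(tier, 0)
        if s3_count != stac_count:
            mismatches[tier] = (s3_count, stac_count)
    return mismatches


def overall_tier(distribution: Dict[str, int]) -> Optional[str]:
    """Collapse a distribution to one tier name, or MIXED (assumed convention)."""
    present = [tier for tier, count in distribution.items() if count > 0]
    if not present:
        return None
    return present[0] if len(present) == 1 else "MIXED"


# Counts taken from the example output in this section.
s3_counts = {"STANDARD": 450, "STANDARD_IA": 608}
stac_counts = {"STANDARD": 500, "STANDARD_IA": 558}
print(diff_distributions(s3_counts, stac_counts))
print(overall_tier(s3_counts))
```

Comparing counts rather than tier names is what catches the first mismatch in the example output: both sides would summarize as mixed storage, yet 50 objects have moved between classes.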
519+
520+
**Example Output:**
521+
522+
```bash
523+
uv run operator-tools/manage_collections.py sync-storage-tiers sentinel-2-l2a-staging \
524+
--s3-endpoint https://s3.de.io.cloud.ovh.net --dry-run
525+
```
526+
527+
```
528+
DRY RUN: Syncing storage tiers for collection: sentinel-2-l2a-staging
529+
Processing 43 items...
530+
Syncing storage tiers [####################################] 43/43
531+
532+
============================================================
533+
SYNC SUMMARY
534+
============================================================
535+
Items processed: 43
536+
✅ Items updated: 5
537+
✓ Items with no changes: 38
538+
❌ Items failed: 0
539+
540+
Assets:
541+
Updated: 12
542+
Added (alternate.s3): 0
543+
⚠️ Failed to query S3: 0
544+
545+
────────────────────────────────────────────────────────────
546+
OBJECT-LEVEL STATISTICS
547+
────────────────────────────────────────────────────────────
548+
549+
S3 (current storage):
550+
Total objects: 12,450
551+
STANDARD: 5,245 objects (42.1%)
552+
STANDARD_IA: 7,205 objects (57.9%)
553+
554+
STAC (metadata):
555+
Total objects: 12,450
556+
STANDARD: 5,500 objects (44.2%)
557+
STANDARD_IA: 6,950 objects (55.8%)
558+
559+
────────────────────────────────────────────────────────────
560+
🔍 MISMATCHES FOUND: 3 item(s)
561+
────────────────────────────────────────────────────────────
562+
563+
Item: S2A_MSIL2A_20250831T103701_N0511_R008_T31TFL_20250831T145420
564+
Asset: reflectance
565+
S3 objects: STANDARD: 450, STANDARD_IA: 608
566+
STAC objects: STANDARD: 500, STANDARD_IA: 558
567+
568+
Item: S2A_MSIL2A_20251008T100041_N0511_R122_T32TQM_20251008T122613
569+
Asset: reflectance
570+
S3 objects: STANDARD: 1
571+
STAC objects: STANDARD_IA: 1
572+
573+
────────────────────────────────────────────────────────────
574+
✅ CORRECTIONS MADE: 5 item(s) updated
575+
────────────────────────────────────────────────────────────
576+
S2A_MSIL2A_20250831T103701_N0511_R008_T31TFL_20250831T145420: 2 asset(s) updated
577+
S2A_MSIL2A_20251008T100041_N0511_R122_T32TQM_20251008T122613: 1 asset(s) updated
578+
... and 3 more item(s)
579+
580+
────────────────────────────────────────────────────────────
581+
DRY RUN - No changes were made
582+
────────────────────────────────────────────────────────────
583+
============================================================
584+
```
585+
586+
**Use cases:**
587+
- Keeping STAC metadata in sync with actual S3 storage classes
588+
- Finding and fixing storage tier mismatches across collections
589+
- Adding storage tier metadata to legacy items
590+
- Auditing storage tier accuracy before reporting
591+
592+
**Best practices:**
593+
- Always use `--dry-run` first to preview changes
594+
- Review the problems section to understand mismatches
595+
- Use `--add-missing` for legacy items that don't have `alternate.s3`
596+
- Test on a single item with `manage_item.py sync-storage-tiers` before running on entire collection
597+
418598
### Global Options
419599

420600
#### `--api-url`
@@ -433,7 +613,7 @@ The `manage_item.py` tool provides commands for working with individual STAC ite
433613

434614
### `info` - Show Item Information
435615

436-
Display detailed information about a specific STAC item, including optional S3 statistics.
616+
Display detailed information about a specific STAC item, including optional S3 statistics and storage tier statistics.
437617

438618
```bash
439619
# Basic item info
@@ -442,6 +622,12 @@ uv run operator-tools/manage_item.py info sentinel-2-l2a-staging ITEM_ID
442622
# Include S3 storage statistics
443623
uv run operator-tools/manage_item.py info sentinel-2-l2a-staging ITEM_ID --s3-stats
444624

625+
# Include storage tier statistics from STAC metadata
626+
uv run operator-tools/manage_item.py info sentinel-2-l2a-staging ITEM_ID --s3-stac-info
627+
628+
# Combine both statistics
629+
uv run operator-tools/manage_item.py info sentinel-2-l2a-staging ITEM_ID --s3-stats --s3-stac-info
630+
445631
# With debug output (shows detailed URL extraction)
446632
uv run operator-tools/manage_item.py info sentinel-2-l2a-staging ITEM_ID --s3-stats --debug
447633
```
@@ -454,6 +640,11 @@ uv run operator-tools/manage_item.py info sentinel-2-l2a-staging ITEM_ID --s3-st
454640
- S3 URLs extracted from assets
455641
- Object count
456642
- Total size in GB
643+
- **With `--s3-stac-info`:**
644+
- Total assets and tier coverage statistics
645+
- Storage tier distribution by asset count
646+
- Distribution breakdowns for mixed storage tiers
647+
- Reads from STAC metadata (no S3 queries required)
457648
- **With `--debug`:**
458649
- Exact S3 URLs found in each asset
459650
- Which fields contain S3 URLs (`alternate.s3.href` vs main `href`)
@@ -463,6 +654,7 @@ uv run operator-tools/manage_item.py info sentinel-2-l2a-staging ITEM_ID --s3-st
463654
- Debugging why an item's S3 data isn't being found
464655
- Verifying S3 URLs are correctly formatted
465656
- Understanding how much S3 storage an item uses
657+
- Checking storage tier distribution for an item
466658
- Investigating issues before batch operations
467659

468660
### `delete` - Delete a Single Item
@@ -512,6 +704,42 @@ DELETION SUMMARY:
512704
- Verifying S3 cleanup works before scaling to collection
513705
- Debugging deletion issues
514706

707+
### `sync-storage-tiers` - Sync Storage Tier Metadata for a Single Item
708+
709+
Synchronize storage tier metadata for a single STAC item with S3. This command queries S3 for the current storage class of each **object** and updates the STAC item metadata to match. It compares object-level distributions (not just asset-level tiers) and shows detailed mismatches.
710+
711+
```bash
712+
# Dry run (preview changes)
713+
uv run operator-tools/manage_item.py sync-storage-tiers sentinel-2-l2a-staging ITEM_ID \
714+
--s3-endpoint https://s3.de.io.cloud.ovh.net --dry-run
715+
716+
# Actually sync (with confirmation)
717+
uv run operator-tools/manage_item.py sync-storage-tiers sentinel-2-l2a-staging ITEM_ID \
718+
--s3-endpoint https://s3.de.io.cloud.ovh.net
719+
720+
# Add missing alternate.s3 for legacy items
721+
uv run operator-tools/manage_item.py sync-storage-tiers sentinel-2-l2a-staging ITEM_ID \
722+
--s3-endpoint https://s3.de.io.cloud.ovh.net --add-missing
723+
```
724+
725+
**Options:**
726+
- `--s3-endpoint`: S3 endpoint URL (required, or set `AWS_ENDPOINT_URL` env var)
727+
- `--add-missing`: Add `alternate.s3` to assets that don't have it (for legacy items)
728+
- `--dry-run`: Show what would be updated without actually updating
729+
730+
**Output includes:**
731+
- Summary of assets with alternate.s3, tier info, and updates
732+
- **Object-level statistics**: Shows object counts per tier from both S3 and STAC
733+
- **Problems section**: Lists mismatches showing object-level differences (S3 objects vs STAC objects)
734+
- **Corrections section**: Shows what was fixed
735+
- Confirmation of STAC item update (if not dry-run)
736+
737+
**Use cases:**
738+
- Testing sync on a single item before running on entire collection
739+
- Fixing storage tier mismatches for specific items
740+
- Adding missing storage tier metadata to legacy items
741+
- Debugging storage tier sync issues
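
For readers curious how an object-level query like the one this command performs might look, here is a minimal sketch. It is not the tool's implementation: the helper names are hypothetical, `client` is assumed to be a boto3 S3 client (or anything exposing the same `get_paginator("list_objects_v2")` interface), and treating a missing `StorageClass` as `STANDARD` is an assumption.

```python
from collections import Counter
from typing import Any, Dict, Tuple
from urllib.parse import urlparse


def split_s3_url(url: str) -> Tuple[str, str]:
    """Split s3://bucket/key/prefix into (bucket, prefix)."""
    parsed = urlparse(url)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URL: {url!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def storage_class_counts(client: Any, s3_url: str) -> Dict[str, int]:
    """Count objects per storage class under an S3 prefix."""
    bucket, prefix = split_s3_url(s3_url)
    counts: Counter = Counter()
    paginator = client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # Pages for an empty prefix carry no "Contents" key.
        for obj in page.get("Contents", []):
            # Assumption: objects without a reported class are STANDARD.
            counts[obj.get("StorageClass", "STANDARD")] += 1
    return dict(counts)


# Usage with a real client (requires boto3 and credentials):
#   import boto3
#   client = boto3.client("s3", endpoint_url="https://s3.de.io.cloud.ovh.net")
#   storage_class_counts(client, "s3://bucket/path/to/item.zarr")
```

Listing by prefix is what makes this work for Zarr stores and directories, where a single asset maps to many objects that can sit in different storage classes.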
742+
515743
## Common Workflows
516744

517745
### Create a New Collection
