Overview
As part of #248 I mentioned a possible refactor of how we resume runs.
UPDATE: This issue is now part of the 5-PR checkpoint architecture refactor. It implements the Checkpoint Acquisition Layer that handles fetching checkpoints from various sources.
Parent Issue
Part of #248 - Checkpoint System Refactor
Related Architecture Components
- Foundation: Checkpoint Pipeline Infrastructure (Phase 1) #493 (Pipeline Infrastructure)
- Works with: Model Transformation Layer - Post-loading modifications (freezing, transfer learning, adapters) #410 (Model Modifiers), Checkpoint Loading Orchestration (Phase 2) #494 (Loading Orchestration)
- Integration: Checkpoint System Integration and Migration (Phase 3) #495 (System Integration)
Original Context
Thanks Jesper, I think this refactoring makes a lot of sense. Would it make sense to have another modifier "ResumeRun..." to which we could bring all the logic from run_id, fork_run_id, load_only_weights and warm_start?
This could be a good opportunity to make the code more intuitive and reduce the direct dependency on MLFlow run_ids.
New Architecture Role
This issue implements the Checkpoint Acquisition Layer - the first layer in our pipeline architecture:
┌─────────────────────────────────────────────────┐
│ Model Transformation Layer │
│ (Post-loading modifications) │
├─────────────────────────────────────────────────┤
│ Loading Orchestration Layer │
│ (Strategies for applying checkpoints) │
├─────────────────────────────────────────────────┤
│ Checkpoint Acquisition Layer (THIS ISSUE) │
│ (Obtaining checkpoint from sources) │
└─────────────────────────────────────────────────┘
Components to Implement
1. CheckpointSource Interface (training/src/anemoi/training/utils/checkpoint_loaders.py)
    from abc import ABC, abstractmethod
    from pathlib import Path

    class CheckpointSource(ABC):
        """Base class for checkpoint sources"""

        @abstractmethod
        def fetch(self) -> Path:
            """Download/access checkpoint, return local path"""
            pass

2. Source Implementations
- LocalSource: Access local checkpoint files
- HTTPSource: Download from HTTP/HTTPS URLs
- S3Source: Download from S3 buckets
- GCSSource: Google Cloud Storage
- AzureSource: Azure Blob Storage
- MLFlowSource: Load from MLFlow runs (preserves existing functionality)
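As a rough illustration of the simplest implementation in this list, a local source only needs to validate the path and hand it back. This is a minimal sketch, not the code from the PR: the constructor argument `path` is taken from the local configuration example in this issue, while the `expanduser` handling and the `FileNotFoundError` behaviour are assumptions.

```python
from abc import ABC, abstractmethod
from pathlib import Path


class CheckpointSource(ABC):
    """Base class for checkpoint sources (mirrors the interface above)."""

    @abstractmethod
    def fetch(self) -> Path:
        """Download/access checkpoint, return local path."""


class LocalSource(CheckpointSource):
    """Hypothetical local-file source: nothing to download, only validate."""

    def __init__(self, path: str):
        # expanduser() lets a config use "~"; this is an assumed convenience.
        self.path = Path(path).expanduser()

    def fetch(self) -> Path:
        # Fail early with a clear message instead of at model-loading time.
        if not self.path.is_file():
            raise FileNotFoundError(f"Checkpoint not found: {self.path}")
        return self.path.resolve()
```

Remote sources would follow the same contract: whatever happens inside `fetch()`, the caller always gets back a local `Path`.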
3. Registry Pattern
    from typing import Optional

    # `registry` is assumed to be the module-level source registry
    @registry.register_source('s3')
    class S3Source(CheckpointSource):
        def __init__(self, bucket: str, key: str, cache_dir: Optional[str] = None):
            self.bucket = bucket
            self.key = key
            self.cache_dir = cache_dir

        def fetch(self) -> Path:
            # Download from S3, cache locally
            pass

Configuration Examples
Local Checkpoint
    checkpoint:
      source:
        type: local
        path: /path/to/checkpoint.ckpt

S3 Checkpoint
    checkpoint:
      source:
        type: s3
        bucket: ecmwf-models
        key: anemoi/pretrained.ckpt
        cache_dir: /tmp/checkpoints

HTTP Checkpoint
    checkpoint:
      source:
        type: http
        url: https://models.ecmwf.int/anemoi/latest.ckpt
        cache_dir: ~/.cache/anemoi

MLFlow (Backwards Compatibility)
    checkpoint:
      source:
        type: mlflow
        run_id: abc123
        artifact_path: model/checkpoint

Key Features
- Multi-source support: Local, cloud, HTTP, MLFlow
- Caching: Optional local caching of remote checkpoints
- Retry logic: Exponential backoff for network failures
- Async downloads: Non-blocking I/O operations
- Progress tracking: Download progress for large files
- Extensible: Easy to add new source types
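To make the retry feature concrete, here is one way exponential backoff could wrap any fetch callable. It is a sketch under assumptions: `fetch_with_retry`, `max_attempts`, `base_delay` and the injectable `sleep` are hypothetical names, and treating `OSError` (which includes `ConnectionError` and `TimeoutError`) as the retryable class is a choice made for illustration.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def fetch_with_retry(
    fetch: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch(), retrying on OSError after base_delay * 2**attempt seconds."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except OSError:
            # After the final attempt, surface the original error unchanged.
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2**attempt)
    # Only reachable when the loop body never ran.
    raise ValueError("max_attempts must be at least 1")
```

Passing `sleep` as a parameter keeps the retry tests fast, since a test can record the requested delays instead of actually waiting.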
Refactoring Benefits
- Flexibility: Load checkpoints from anywhere, not just MLFlow
- Intuitive: Clear configuration for different sources
- Maintainable: Each source type is independent
- Testable: Mock sources for testing
- Backwards compatible: MLFlowSource preserves existing functionality
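The registry pattern and the `checkpoint.source` configuration shown earlier could fit together roughly as follows. This is a sketch, not the registry from the PR: `SourceRegistry` and its `create` method are hypothetical, and only `register_source` and the `type` key come from this issue.

```python
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Any, Callable, Dict, Type


class CheckpointSource(ABC):
    @abstractmethod
    def fetch(self) -> Path: ...


class SourceRegistry:
    """Hypothetical registry mapping a config `type` string to a source class."""

    def __init__(self) -> None:
        self._sources: Dict[str, Type[CheckpointSource]] = {}

    def register_source(
        self, name: str
    ) -> Callable[[Type[CheckpointSource]], Type[CheckpointSource]]:
        def decorator(cls: Type[CheckpointSource]) -> Type[CheckpointSource]:
            if name in self._sources:
                raise ValueError(f"Source type {name!r} is already registered")
            self._sources[name] = cls
            return cls

        return decorator

    def create(self, config: Dict[str, Any]) -> CheckpointSource:
        params = dict(config)  # copy so the caller's config is not mutated
        source_type = params.pop("type")
        try:
            cls = self._sources[source_type]
        except KeyError:
            raise ValueError(f"Unknown checkpoint source type: {source_type!r}") from None
        # Remaining keys are passed straight to the source constructor.
        return cls(**params)


registry = SourceRegistry()


@registry.register_source("local")
class LocalSource(CheckpointSource):
    def __init__(self, path: str):
        self.path = Path(path)

    def fetch(self) -> Path:
        return self.path


# Equivalent of the "Local Checkpoint" config once the YAML has been parsed.
source = registry.create({"type": "local", "path": "/path/to/checkpoint.ckpt"})
```

With this shape, adding a new source type touches only the new class and its decorator, which is what keeps each source independent.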
Implementation Status (PR #464)
- Base CheckpointSource interface
- LocalSource
- HTTPSource
- S3Source
- GCSSource
- AzureSource
- MLFlowSource (migration)
- Registry pattern
- Caching mechanism
- Tests
- Documentation
Migration Strategy
- Keep existing MLFlow functionality working
- Add new source types alongside
- Gradually migrate configs to new format
- Deprecate old parameters in v2.0
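One possible shape for the gradual migration is a small shim that translates a legacy `run_id` setting into the new `checkpoint.source` form and emits a deprecation warning. Everything here is an assumption for illustration: the helper name `source_config_from_legacy`, the location of `run_id` at the top level of the config, and the rule that a new-style `source` entry wins. The other legacy parameters (`fork_run_id`, `load_only_weights`, `warm_start`) are deliberately left out because their mapping is not specified in this issue.

```python
import warnings
from typing import Any, Dict, Optional


def source_config_from_legacy(
    training_config: Dict[str, Any],
) -> Optional[Dict[str, Any]]:
    """Translate a legacy `run_id` setting into the new `checkpoint.source` form.

    Hypothetical helper: the real key names and precedence rules may differ.
    """
    checkpoint = training_config.get("checkpoint", {})
    if "source" in checkpoint:
        # New-style config takes precedence; nothing to translate.
        return dict(checkpoint["source"])

    run_id = training_config.get("run_id")
    if run_id is None:
        return None  # no checkpoint requested at all

    warnings.warn(
        "`run_id` is deprecated; use `checkpoint.source` with `type: mlflow`.",
        DeprecationWarning,
        stacklevel=2,
    )
    return {"type": "mlflow", "run_id": run_id}
```

Old configs keep working through the shim, while the warning gives users a clear pointer to the new format before the old parameters are removed.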
Testing
- Unit tests for each source type
- Mock tests for network sources
- Integration tests with real checkpoints
- Cache functionality tests
- Retry logic tests
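A mock-based cache test from the list above might look like the sketch below. Both classes are hypothetical: `FakeRemoteSource` stands in for a network source and counts calls, and `CachedSource` is one assumed way to layer caching on top of any object with a `fetch()` method.

```python
import shutil
import tempfile
from pathlib import Path


class FakeRemoteSource:
    """Test double standing in for a network source; counts fetch() calls."""

    def __init__(self, payload: bytes, workdir: Path):
        self.payload = payload
        self.workdir = workdir
        self.calls = 0

    def fetch(self) -> Path:
        self.calls += 1
        target = self.workdir / "download.ckpt"
        target.write_bytes(self.payload)
        return target


class CachedSource:
    """Hypothetical wrapper: reuse a cached copy instead of fetching again."""

    def __init__(self, inner, cache_dir: Path, name: str):
        self.inner = inner
        self.cached_path = cache_dir / name

    def fetch(self) -> Path:
        if not self.cached_path.exists():
            self.cached_path.parent.mkdir(parents=True, exist_ok=True)
            shutil.copyfile(self.inner.fetch(), self.cached_path)
        return self.cached_path


def test_second_fetch_hits_cache() -> None:
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        fake = FakeRemoteSource(b"weights", root)
        cached = CachedSource(fake, root / "cache", "model.ckpt")

        first = cached.fetch()
        second = cached.fetch()

        assert first == second
        assert first.read_bytes() == b"weights"
        assert fake.calls == 1  # the second call never reached the fake source


test_second_fetch_hits_cache()
```

Because the fake never touches the network, the same pattern can cover HTTP, S3, GCS and Azure sources without credentials or connectivity in CI.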
Success Criteria
- All existing restart/resume scenarios work
- New source types implemented
- Caching working correctly
- Retry logic robust
- Full test coverage
- Documentation complete
- Backwards compatibility maintained