Checkpoint Acquisition Layer - Multi-source checkpoint loading (S3, HTTP, local, MLFlow) #458

@JesperDramsch

Overview

As part of #248 I mentioned a possible refactor of how we resume runs.

UPDATE: This issue is now part of the 5-PR checkpoint architecture refactor. It implements the Checkpoint Acquisition Layer that handles fetching checkpoints from various sources.

Parent Issue

Part of #248 - Checkpoint System Refactor

Related Architecture Components

Original Context

On #442 @JPXKQX mentioned:

Thanks Jesper, I think this refactoring makes a lot of sense. Would it make sense to have another modifier "ResumeRun..." to which we could bring all the logic from run_id, fork_run_id, load_only_weights and warm_start?

This could be a good opportunity to make the code more intuitive and reduce the direct dependency on MLFlow run_ids.

New Architecture Role

This issue implements the Checkpoint Acquisition Layer - the first (bottom) layer in our pipeline architecture:

┌─────────────────────────────────────────────────┐
│        Model Transformation Layer               │
│         (Post-loading modifications)            │
├─────────────────────────────────────────────────┤
│         Loading Orchestration Layer             │
│    (Strategies for applying checkpoints)        │
├─────────────────────────────────────────────────┤
│    Checkpoint Acquisition Layer (THIS ISSUE)    │
│      (Obtaining checkpoint from sources)        │
└─────────────────────────────────────────────────┘

Components to Implement

1. CheckpointSource Interface (training/src/anemoi/training/utils/checkpoint_loaders.py)

from abc import ABC, abstractmethod
from pathlib import Path


class CheckpointSource(ABC):
    """Base class for checkpoint sources."""

    @abstractmethod
    def fetch(self) -> Path:
        """Download or access the checkpoint and return its local path."""
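As an illustration of the interface, here is a minimal sketch of what the simplest implementation, LocalSource, could look like. The constructor argument `path` is taken from the local configuration example in this issue; the validation behaviour (raising FileNotFoundError) is an assumption, not a decision made here.

```python
from abc import ABC, abstractmethod
from pathlib import Path


class CheckpointSource(ABC):
    """Base class for checkpoint sources (mirrors the interface above)."""

    @abstractmethod
    def fetch(self) -> Path:
        """Download or access the checkpoint and return its local path."""


class LocalSource(CheckpointSource):
    """Hypothetical local implementation: validate the path and return it."""

    def __init__(self, path: str) -> None:
        # Expand "~" so user-relative paths behave as expected.
        self.path = Path(path).expanduser()

    def fetch(self) -> Path:
        # Nothing to download; only check that the file really exists.
        if not self.path.is_file():
            raise FileNotFoundError(f"Checkpoint not found: {self.path}")
        return self.path
```

Because `fetch()` always returns a local `Path`, the layers above never need to know where the checkpoint originally came from.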

2. Source Implementations

  • LocalSource: Access local checkpoint files
  • HTTPSource: Download from HTTP/HTTPS URLs
  • S3Source: Download from S3 buckets
  • GCSSource: Google Cloud Storage
  • AzureSource: Azure Blob Storage
  • MLFlowSource: Load from MLFlow runs (preserves existing functionality)

3. Registry Pattern

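from pathlib import Path
from typing import Optional

# `registry` is the source registry proposed in this issue;
# CheckpointSource is the interface from section 1.


@registry.register_source('s3')
class S3Source(CheckpointSource):
    def __init__(self, bucket: str, key: str, cache_dir: Optional[str] = None):
        self.bucket = bucket
        self.key = key
        self.cache_dir = cache_dir  # optional local cache location

    def fetch(self) -> Path:
        # Download from S3, cache locally
        raise NotImplementedError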
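The registry itself is not spelled out in this issue. One possible shape, as a sketch only: a small class whose `register_source` decorator stores each source class under its type name. The names `SourceRegistry` and `create` are hypothetical; only `registry.register_source` appears in the snippet above.

```python
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Any, Callable, Dict, Type


class CheckpointSource(ABC):
    @abstractmethod
    def fetch(self) -> Path:
        """Download or access the checkpoint and return its local path."""


SourceClass = Type[CheckpointSource]


class SourceRegistry:
    """Maps a source type name (e.g. 's3') to a CheckpointSource subclass."""

    def __init__(self) -> None:
        self._sources: Dict[str, SourceClass] = {}

    def register_source(self, name: str) -> Callable[[SourceClass], SourceClass]:
        def decorator(cls: SourceClass) -> SourceClass:
            if name in self._sources:
                raise ValueError(f"Source type already registered: {name!r}")
            self._sources[name] = cls
            return cls  # hand the class back unchanged

        return decorator

    def create(self, name: str, **kwargs: Any) -> CheckpointSource:
        """Instantiate the source registered under `name` (hypothetical helper)."""
        if name not in self._sources:
            known = ", ".join(sorted(self._sources)) or "none"
            raise ValueError(f"Unknown source type {name!r} (registered: {known})")
        return self._sources[name](**kwargs)


registry = SourceRegistry()


@registry.register_source("local")
class LocalSource(CheckpointSource):
    def __init__(self, path: str) -> None:
        self.path = Path(path)

    def fetch(self) -> Path:
        return self.path
```

With this shape, adding a new source type only requires defining a class and decorating it; no central dispatch code has to change.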

Configuration Examples

Local Checkpoint

checkpoint:
  source:
    type: local
    path: /path/to/checkpoint.ckpt

S3 Checkpoint

checkpoint:
  source:
    type: s3
    bucket: ecmwf-models
    key: anemoi/pretrained.ckpt
    cache_dir: /tmp/checkpoints

HTTP Checkpoint

checkpoint:
  source:
    type: http
    url: https://models.ecmwf.int/anemoi/latest.ckpt
    cache_dir: ~/.cache/anemoi

MLFlow (Backwards Compatibility)

checkpoint:
  source:
    type: mlflow
    run_id: abc123
    artifact_path: model/checkpoint
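To make the configuration format concrete, the sketch below validates a `checkpoint.source` mapping (as it would look after the YAML is loaded into a dict) and splits it into a type name plus constructor arguments. The function name `parse_source_config` is hypothetical, and treating `cache_dir` and `artifact_path` as optional is an assumption read off the examples above.

```python
from typing import Any, Dict, Mapping, Set

# Required keys per source type, taken from the configuration examples.
# Assumption: `cache_dir` and `artifact_path` are optional extras.
REQUIRED_KEYS: Dict[str, Set[str]] = {
    "local": {"path"},
    "s3": {"bucket", "key"},
    "http": {"url"},
    "mlflow": {"run_id"},
}


def parse_source_config(config: Mapping[str, Any]) -> Dict[str, Any]:
    """Validate `checkpoint.source` and return its type and kwargs."""
    source = dict(config["checkpoint"]["source"])  # copy; do not mutate input
    source_type = source.pop("type", None)
    if source_type not in REQUIRED_KEYS:
        raise ValueError(f"Unsupported source type: {source_type!r}")
    missing = REQUIRED_KEYS[source_type] - set(source)
    if missing:
        raise ValueError(
            f"Source {source_type!r} is missing keys: {sorted(missing)}"
        )
    return {"type": source_type, "kwargs": source}


if __name__ == "__main__":
    config = {
        "checkpoint": {
            "source": {
                "type": "s3",
                "bucket": "ecmwf-models",
                "key": "anemoi/pretrained.ckpt",
                "cache_dir": "/tmp/checkpoints",
            }
        }
    }
    print(parse_source_config(config))
```

Failing early on a missing key gives a clearer error than a `TypeError` raised later from a source constructor.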

Key Features

  • Multi-source support: Local, cloud, HTTP, MLFlow
  • Caching: Optional local caching of remote checkpoints
  • Retry logic: Exponential backoff for network failures
  • Async downloads: Non-blocking I/O operations
  • Progress tracking: Download progress for large files
  • Extensible: Easy to add new source types
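The retry feature listed above could be sketched as a small helper that wraps any download callable. The name `retry_with_backoff` and all parameter defaults are assumptions for illustration. Retrying on `OSError` by default covers `ConnectionError`, `TimeoutError`, and `urllib.error.URLError`, which are all subclasses of it.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    retry_on: Tuple[Type[BaseException], ...] = (OSError,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation`, waiting base_delay * 2**n between failed attempts."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2**attempt)  # delay doubles after each failure
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter lets tests substitute a recorder for `time.sleep`, so retry logic tests run instantly. A production version would likely also add jitter and a maximum delay.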

Refactoring Benefits

  • Flexibility: Load checkpoints from anywhere, not just MLFlow
  • Intuitive: Clear configuration for different sources
  • Maintainable: Each source type is independent
  • Testable: Mock sources for testing
  • Backwards compatible: MLFlowSource preserves existing functionality

Implementation Status (PR #464)

  • Base CheckpointSource interface
  • LocalSource
  • HTTPSource
  • S3Source
  • GCSSource
  • AzureSource
  • MLFlowSource (migration)
  • Registry pattern
  • Caching mechanism
  • Tests
  • Documentation

Migration Strategy

  1. Keep existing MLFlow functionality working
  2. Add new source types alongside
  3. Gradually migrate configs to new format
  4. Deprecate old parameters in v2.0

Testing

  • Unit tests for each source type
  • Mock tests for network sources
  • Integration tests with real checkpoints
  • Cache functionality tests
  • Retry logic tests
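A cache functionality test with a mock source might look like the sketch below. Both `FakeRemoteSource` and `CachedSource` are hypothetical names invented for this example; the cache layout (a hash of a cache key inside `cache_dir`) is an assumption, not the design chosen in the PR.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


class FakeRemoteSource:
    """Test double standing in for a network source; counts fetch() calls."""

    def __init__(self, payload: bytes, workdir: Path) -> None:
        self.payload = payload
        self.workdir = workdir
        self.calls = 0

    def fetch(self) -> Path:
        self.calls += 1
        target = self.workdir / "download.ckpt"
        target.write_bytes(self.payload)
        return target


class CachedSource:
    """Hypothetical wrapper: fetch once, then serve the copy in cache_dir."""

    def __init__(self, source, cache_key: str, cache_dir: Path) -> None:
        self.source = source
        digest = hashlib.sha256(cache_key.encode("utf-8")).hexdigest()[:16]
        self.cached_path = cache_dir / f"{digest}.ckpt"

    def fetch(self) -> Path:
        if not self.cached_path.exists():
            self.cached_path.parent.mkdir(parents=True, exist_ok=True)
            shutil.copyfile(self.source.fetch(), self.cached_path)
        return self.cached_path


def test_second_fetch_hits_cache() -> None:
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        remote = FakeRemoteSource(b"weights", root)
        cached = CachedSource(remote, "s3://bucket/key", root / "cache")
        first = cached.fetch()
        second = cached.fetch()
        assert first == second
        assert first.read_bytes() == b"weights"
        assert remote.calls == 1  # second call never reached the fake remote
```

The test is written in pytest style but only needs the standard library, so it can also be called directly.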

Success Criteria

  • All existing restart/resume scenarios work
  • New source types implemented
  • Caching working correctly
  • Retry logic robust
  • Full test coverage
  • Documentation complete
  • Backwards compatibility maintained
