This repository was archived by the owner on Jul 14, 2025. It is now read-only.

Commit 38205e7

Merge pull request #19 from stac-extensions/migration_path
draft migration path doc from ML Model to MLM Extension
2 parents 6b8dadd + 1ec663e commit 38205e7

2 files changed: +106 −7 lines changed

MIGRATION_TO_MLM.md

Lines changed: 98 additions & 0 deletions
# Migration Guide: ML Model Extension to MLM Extension

## Context

The ML Model Extension was started at Radiant Earth on October 4th, 2021, and was possibly the first STAC extension dedicated to describing machine learning models. The extension incorporated input from 9 different organizations and was used to describe models in Radiant Earth's MLHub API. The announcement of this extension and its use in Radiant Earth's MLHub is described [here](https://medium.com/radiant-earth-insights/geospatial-models-now-available-in-radiant-mlhub-a41eb795d7d7). Radiant Earth's MLHub API and Python SDK are now [deprecated](https://mlhub.earth/). To support current users of the ML Model Extension, this document lays out a migration path for converting metadata to the Machine Learning Model Extension (MLM).

## Shared Goals

Both the ML Model Extension and the Machine Learning Model (MLM) Extension aim to provide a standard way to catalog machine learning (ML) models that work with, but are not limited to, Earth observation (EO) data. Their main goals are:

1. **Search and Discovery**: Helping users find and use ML models.
2. **Describing Inference and Training Requirements**: Making it easier to run these models by describing input requirements and outputs.
3. **Reproducibility**: Providing runtime information and links to assets so that model inference is reproducible.
## Schema Changes

### ML Model Extension

- **Scope**: Item, Collection
- **Field Name Prefix**: `ml-model`
- **Key Sections**:
  - Item Properties
  - Asset Objects
  - Inference/Training Runtimes
  - Relation Types
  - Interpretation of STAC Fields

### MLM Extension

- **Scope**: Collection, Item, Asset, Links
- **Field Name Prefix**: `mlm`
- **Key Sections**:
  - Item Properties and Collection Fields
  - Asset Objects
  - Relation Types
  - Model Input/Output Objects
  - Best Practices
Notable differences:

- The MLM Extension covers more details at both the Item and Asset levels, making it easier to describe and use model metadata.
- The MLM Extension covers runtime requirements within the [Container Asset](https://github.com/stac-extensions/mlm?tab=readme-ov-file#container-asset), while the ML Model Extension records [similar information](./README.md#inferencetraining-runtimes) in the `ml-model:inference-runtime` or `ml-model:training-runtime` asset roles.
- The MLM Extension has a corresponding Python library, [`stac-model`](https://pypi.org/project/stac-model/), which can be used to create and validate MLM metadata. An example of the library in action is [here](https://github.com/stac-extensions/mlm/blob/main/stac_model/examples.py#L14). The ML Model Extension offers no equivalent library and requires the JSON to be written manually by interpreting the JSON Schema or existing examples.
- The MLM is easier to maintain and enhance in a fast-moving ML ecosystem thanks to its use of Pydantic models, while still being compatible with PySTAC for extension and STAC core validation.
## Changes in Field Names

### Item Properties

| ML Model Extension | MLM Extension | Notes |
| ---------------------------------- | ------------------ | ----- |
| `ml-model:type` | N/A | No direct equivalent; it is implied by the `mlm` prefix in MLM fields and directly specified by the schema identifier. |
| `ml-model:learning_approach` | `mlm:tasks` | Removed in favor of specifying specific `mlm:tasks`. |
| `ml-model:prediction_type` | `mlm:tasks` | `mlm:tasks` provides a more comprehensive enum of prediction types. |
| `ml-model:architecture` | `mlm:architecture` | The MLM provides specific guidance on using *Papers With Code - Computer Vision* identifiers for model architectures. No guidance is provided in ML Model. |
| `ml-model:training-processor-type` | `mlm:accelerator` | MLM defines more accelerator choices in an enum and specifies that this is the accelerator for inference. ML Model only accepts `cpu` or `gpu`, which isn't sufficient today, when models are optimized for different CPU architectures, CUDA GPUs, Intel GPUs, AMD GPUs, Apple Silicon, and TPUs. |
| `ml-model:training-os` | N/A | This field is no longer recommended in the MLM for training or inference; instead, users can specify an optional `mlm:training-runtime` asset. |
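The mapping above can be applied to an Item's properties almost mechanically. The sketch below is a hypothetical helper, not part of either extension's tooling; in particular, the `"gpu"` → `"cuda"` translation is an assumption that should be checked against the actual `mlm:accelerator` enum for each model.

```python
# Hypothetical migration helper for the Item Properties mapping above.
# This is a sketch, not official tooling: the "gpu" -> "cuda" translation
# is an assumption and should be adjusted to the real mlm:accelerator enum.

def migrate_item_properties(old: dict) -> dict:
    """Convert ml-model:* Item properties to their mlm:* equivalents."""
    # Drop every ml-model:* field; keep everything else (datetime, etc.).
    new = {k: v for k, v in old.items() if not k.startswith("ml-model:")}

    # ml-model:type has no equivalent; it is implied by the mlm prefix.
    # ml-model:learning_approach is dropped, and ml-model:prediction_type
    # becomes an entry in mlm:tasks.
    if "ml-model:prediction_type" in old:
        new["mlm:tasks"] = [old["ml-model:prediction_type"]]

    if "ml-model:architecture" in old:
        new["mlm:architecture"] = old["ml-model:architecture"]

    # ml-model:training-processor-type maps onto the richer mlm:accelerator
    # enum; "gpu" is assumed here to mean a CUDA-capable device.
    accel = old.get("ml-model:training-processor-type")
    if accel is not None:
        new["mlm:accelerator"] = {"cpu": "cpu", "gpu": "cuda"}.get(accel, accel)

    # ml-model:training-os is simply dropped; an optional
    # mlm:training-runtime asset replaces it at the Asset level.
    return new
```

For example, `{"ml-model:prediction_type": "segmentation", "ml-model:training-processor-type": "gpu"}` would come out as `{"mlm:tasks": ["segmentation"], "mlm:accelerator": "cuda"}`.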
### New Fields in MLM

| Field Name | Description |
| -------------------------------- | ----------- |
| **`mlm:name`** | A required name for the model. |
| **`mlm:framework`** | The framework used to train the model. |
| **`mlm:framework_version`** | The version of the framework. Useful in case a container runtime asset is not specified, or if the consumer of the MLM wants to run the model outside of a container. |
| **`mlm:memory_size`** | The in-memory size of the model. |
| **`mlm:total_parameters`** | Total number of model parameters. |
| **`mlm:pretrained`** | Indicates if the model is derived from a pretrained model. |
| **`mlm:pretrained_source`** | Source of the pretrained model by name, or URL if it is less well known. |
| **`mlm:batch_size_suggestion`** | Suggested batch size for the given accelerator. |
| **`mlm:accelerator`** | Indicates the specific accelerator recommended for the model. |
| **`mlm:accelerator_constrained`**| Indicates if the model requires a specific accelerator. |
| **`mlm:accelerator_summary`** | Description of the accelerator. This might contain details on the exact accelerator version (TPUv4 vs. TPUv5) and its configuration. |
| **`mlm:accelerator_count`** | Minimum number of accelerator instances required. |
| **`mlm:input`** | Describes the model's input shape, dtype, and normalization and resize transformations. |
| **`mlm:output`** | Describes the model's output shape and dtype. |
| **`mlm:hyperparameters`** | Additional hyperparameters relevant to the model. |
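To give a feel for how these new fields fit together, here is a minimal set of MLM Item properties assembled as plain JSON-style data. Every value is a hypothetical placeholder for a fictional segmentation model, and the `mlm:input`/`mlm:output` entries are simplified; the full Model Input/Output Object schemas in the MLM specification define additional required structure.

```python
# Hypothetical, simplified MLM Item properties for a fictional model.
# All values are placeholders, and mlm:input / mlm:output are abbreviated
# relative to the full Model Input/Output Object schemas.

minimal_mlm_properties = {
    "mlm:name": "example-landcover-unet",   # required model name
    "mlm:architecture": "U-Net",            # Papers With Code identifier
    "mlm:framework": "pytorch",
    "mlm:framework_version": "2.3.0",
    "mlm:total_parameters": 31_000_000,
    "mlm:pretrained": True,
    "mlm:pretrained_source": "ImageNet",
    "mlm:batch_size_suggestion": 16,
    "mlm:accelerator": "cuda",
    "mlm:accelerator_constrained": False,
    # Simplified stand-ins for the real Model Input/Output Objects:
    "mlm:input": [{"name": "RGB imagery", "bands": ["red", "green", "blue"]}],
    "mlm:output": [{"name": "landcover classes", "tasks": ["segmentation"]}],
}
```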
### Asset Objects

| ML Model Extension Role | MLM Extension Role | Notes |
| ---------------------------- | ----------------------- | ----- |
| `ml-model:inference-runtime` | `mlm:inference-runtime` | Direct conversion; same role and function. |
| `ml-model:training-runtime` | `mlm:training-runtime` | Direct conversion; same role and function. |
| `ml-model:checkpoint` | `mlm:checkpoint` | Direct conversion; same role and function. |
| N/A | `mlm:model` | New required role for model assets in MLM. This represents the asset that is the source of model weights and definition. |
| N/A | `mlm:source_code` | Recommended for providing source code details. |
| N/A | `mlm:container` | Recommended for containerized environments. |
| N/A | `mlm:training` | Recommended for training pipeline assets. |
| N/A | `mlm:inference` | Recommended for inference pipeline assets. |
The MLM provides a recommended `mlm:training-runtime` asset role and an `mlm:training` asset, which can point to a container URL that has the training runtime requirements. The ML Model Extension specifies a field for `ml-model:training-runtime`, and like `mlm:training`, it only contains the default STAC Asset fields and a few additional fields specified by the Container Asset. Training requirements typically differ from inference requirements, which is why there are two separate Container assets in both extensions.
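Since the runtime roles carry over one-to-one, asset-role migration can likewise be sketched as a mechanical rename. The helper below is illustrative only (the asset keys and URLs are made up); roles with no ML Model equivalent, such as the required `mlm:model`, must still be added by hand.

```python
# Illustrative sketch: rename ml-model:* asset roles to their mlm:*
# equivalents per the Asset Objects table above. Roles without a direct
# equivalent (mlm:model, mlm:source_code, mlm:container, mlm:training,
# mlm:inference) are not created here and must be added manually.

ROLE_MAP = {
    "ml-model:inference-runtime": "mlm:inference-runtime",
    "ml-model:training-runtime": "mlm:training-runtime",
    "ml-model:checkpoint": "mlm:checkpoint",
}

def migrate_asset_roles(assets: dict) -> dict:
    """Return a copy of a STAC Item's assets with migrated role names."""
    migrated = {}
    for key, asset in assets.items():
        asset = dict(asset)  # shallow copy so the input stays untouched
        asset["roles"] = [ROLE_MAP.get(r, r) for r in asset.get("roles", [])]
        migrated[key] = asset
    return migrated
```

An asset with `"roles": ["ml-model:checkpoint"]` would come back with `"roles": ["mlm:checkpoint"]`, while unrecognized roles pass through unchanged.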
## Getting Help

If you have any questions about a migration, feel free to contact the maintainers by opening a discussion or issue on the [MLM repository](https://github.com/stac-extensions/mlm).

If you see a feature missing in the MLM, feel free to open an issue describing your feature request.

README.md

Lines changed: 8 additions & 7 deletions

The substantive README change adds a pointer to the new migration guide:

```diff
@@ -5,6 +5,7 @@
 > [https://github.com/stac-extensions/mlm](https://github.com/stac-extensions/mlm). <br>
 > The corresponding schemas are made available on
 > [https://stac-extensions.github.io/mlm/](https://stac-extensions.github.io/mlm/).
+> Documentation on migrating from the Ml Model Extension to the Machine Learning Model Extension (MLM) is [here](./MIGRATION_TO_MLM.md).
 >
 > It is **STRONGLY** recommended to migrate `ml-model` definitions to the `mlm` extension.
 > The `mlm` extension improves the model metadata definition and properties with added support for use cases not directly supported by `ml-model`.
```

The remaining hunks (at README.md lines 20, 61, 78, 88, 132, 220, and 225) only strip trailing whitespace from otherwise unchanged lines.