Merged
Changes from 3 commits
6 changes: 3 additions & 3 deletions README.md
@@ -4,17 +4,17 @@

A memory-first, lightning-fast, ready-to-use ML checkpointing library.

PyTorch DCP, Megatron-LM and NeMo 2.0 adapters are readily available for seamless integration, on top of our general core checkpointing library that can be used for custom integrations.
Adapters for PyTorch DCP, Megatron-LM and NeMo 2.0 are readily available for seamless integration, on top of our core checkpointing APIs that can also be used directly for custom integrations.

If interested in direct integration support with another framework, please let us know by creating a [feature request](https://github.com/google/ml-flashpoint/issues/new?template=feature_request.md) or upvoting an existing one!
If interested in a native integration with another framework, please let us know by creating a [feature request](https://github.com/google/ml-flashpoint/issues/new?template=feature_request.md) or upvoting an existing one!

For learning more about using the library and its performance, check out the [user documentation](https://google.github.io/ml-flashpoint).
Below you will find development instructions for contributors.

## Installation

This library defines core dependencies, as well as additional optional dependencies for specific adapters, to avoid polluting consumers with unnecessary dependencies.
See the adapters installation commands for examples of the available adapters.
See the adapters installation commands below for examples of the available options, and the [`pyproject.toml`](./pyproject.toml) as the source of truth for all available adapters.

### Core Library
```bash
…
```
15 changes: 8 additions & 7 deletions docs/README.md
@@ -18,25 +18,26 @@ The goal is to ultimately improve your ML runtime (total time and goodput), by a
1. Free up your long-term storage bandwidth for other use cases.

ML Flashpoint saves checkpoints to shared memory, to be able to recover when the node is not lost, and automatically replicates them asynchronously to peer(s) in the training cluster, to improve resilience during node losses.
Replication has not been observed to have any negative impact on ongoing training or overall job time.
Replication has not been observed to have any meaningful negative impact on ongoing training or overall job time.
See the [overview](overview.md) for more detail.

### Performance

We performed some tests on a [Vertex AI Training Cluster](https://docs.cloud.google.com/vertex-ai/docs/training/training-clusters/overview) with 4 [A3-Mega](https://docs.cloud.google.com/compute/docs/accelerator-optimized-machines#a3-mega-vms) nodes for Gemma 27B and Llama 70B pre-training over just 300 steps and observed the improvements listed below.
We observe meaningful improvements even in small-scale tests spanning just 300 training steps with 4 [A3-Mega](https://docs.cloud.google.com/compute/docs/accelerator-optimized-machines#a3-mega-vms) nodes for Gemma 27B and Llama 70B pre-training.
We executed such tests on a [Vertex AI Training Cluster](https://docs.cloud.google.com/vertex-ai/docs/training/training-clusters/overview) and obtained the speedups listed below.
These tests were conducted using ML Flashpoint _alongside_ NeMo's recommended checkpointing (as you would in production), where NeMo's default checkpointing used a 7-10 TB [Filestore](https://cloud.google.com/filestore) instance.

Observations when comparing the hybrid of ML Flashpoint (every 5 steps) and NeMo checkpointing (every 50 steps) to just NeMo's regular checkpointing (every 10 steps):
When comparing the hybrid of ML Flashpoint (every 5 steps) and NeMo checkpointing (every 50 steps) to NeMo's regular checkpointing (every 10 steps - so half as often), the results were:

* Data write times that are up to 20-30x faster, with little to no optimization.
This is expected to further improve with additional optimizations.
* Total checkpoint recovery times that are ~7-10x faster (includes the time it takes to do checkpoint detection, cross-node coordination, replication, read into model state and be ready to resume training).
* For _async_ checkpointing: improvements averaging **3-6%** for _overall job time_, with peaks of **5-10%** improvements.
These improvements only account for checkpoint save efficiency, representing a "worst case" in the sense that checkpointing purely adds overhead and isn't actually used.
Any job interruptions will also benefit from the improved checkpoint recovery times.
* For _async_ checkpointing: improvements averaging **3-6%** for _overall job time_, with peaks of **5-10%** improvements if only doing NeMo checkpointing at the end (300th step) as opposed to every 50 steps.
These improvements only account for checkpoint save efficiency, representing a "lower bound" speedup as it doesn't account for the speedups in recovery time.
Any job interruptions would also benefit from recovery performance gains.
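
To make the checkpoint cadence behind this comparison concrete, the sketch below tallies how many saves each setup performs over the 300-step test. It assumes a checkpoint is taken at every exact multiple of the interval, including the final step. The helper name `checkpoint_steps` is hypothetical and not part of ML Flashpoint or NeMo.

```python
def checkpoint_steps(total_steps: int, interval: int) -> list[int]:
    """Steps at which a save happens, assuming one at every multiple of `interval`."""
    return list(range(interval, total_steps + 1, interval))


TOTAL_STEPS = 300  # length of the tests described above

# Hybrid setup: ML Flashpoint every 5 steps plus NeMo every 50 steps.
flashpoint_saves = checkpoint_steps(TOTAL_STEPS, 5)
nemo_hybrid_saves = checkpoint_steps(TOTAL_STEPS, 50)

# Baseline: NeMo's regular checkpointing every 10 steps.
nemo_baseline_saves = checkpoint_steps(TOTAL_STEPS, 10)

hybrid_total = len(flashpoint_saves) + len(nemo_hybrid_saves)
baseline_total = len(nemo_baseline_saves)

print(f"hybrid:   {len(flashpoint_saves)} ML Flashpoint + {len(nemo_hybrid_saves)} NeMo = {hybrid_total} saves")
print(f"baseline: {baseline_total} NeMo saves")
```

Under these assumptions the hybrid performs 66 saves in total (60 to memory, 6 to long-term storage) against 30 long-term-storage saves for the baseline, so the job-time improvements above are obtained while checkpointing roughly twice as often.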

While [ML runtime goodput](https://cloud.google.com/blog/products/ai-machine-learning/goodput-metric-as-measure-of-ml-productivity) is important, we focus on overall job time as an end-to-end metric, as it is simpler and allows for straightforward _total_ cost comparisons.
Runtime goodput alone can be misleading if improvements to unproductive time actually worsen productive (active training) time, and the change in total evaluation period (job time) is not taken into account.
Runtime goodput alone can be misleading if improvements to unproductive (non-training) time actually worsen productive (active training) time, and the change in total evaluation period (job time) is not taken into account.
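
As a minimal illustration of that pitfall, the sketch below compares two runs using made-up numbers. It assumes a simplified definition of runtime goodput as productive time divided by total job time; both the timings and the `goodput` helper are hypothetical and are not measurements from the tests above.

```python
def goodput(productive_s: float, unproductive_s: float) -> float:
    """Simplified runtime goodput: fraction of total job time spent training."""
    return productive_s / (productive_s + unproductive_s)


# Hypothetical timings in seconds, purely illustrative.
baseline = {"productive_s": 1000.0, "unproductive_s": 200.0}
# The change cuts unproductive time sharply but slows active training.
candidate = {"productive_s": 1180.0, "unproductive_s": 50.0}

baseline_goodput = goodput(**baseline)
candidate_goodput = goodput(**candidate)
baseline_job_time = sum(baseline.values())
candidate_job_time = sum(candidate.values())

print(f"baseline:  goodput={baseline_goodput:.3f}, job time={baseline_job_time:.0f}s")
print(f"candidate: goodput={candidate_goodput:.3f}, job time={candidate_job_time:.0f}s")
```

Here goodput rises from about 0.833 to about 0.959, yet the job takes 1230s instead of 1200s, which is why overall job time is used as the end-to-end metric.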

## Design Philosophy
