Add Trainer Protocol #533
Conversation
joecummings left a comment:
Overall very happy with this, but I need to see more details in the docstrings because it's not entirely clear what's being passed into each method.
I like protocol more than ABC.

> - Neither: default_dir/step-{current_step}

> Returns:
>     dict containing:
What's the key of dict[str, Any]?
It's a placeholder for now, returning the path it was written to and the step. We might not need anything, though I'm not sure, but I want to return at least a placeholder in case we end up in "hosted service" territory.
If we know what the dict returns, let's make it a dataclass?
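If the return value really is just the path and the step, as the reply above suggests, the dataclass being proposed might look roughly like this. `CheckpointResult` and its field names are assumptions for illustration, not code from the PR:

```python
from dataclasses import dataclass


@dataclass
class CheckpointResult:
    """Sketch of a typed replacement for the dict[str, Any] return value."""

    path: str  # where the checkpoint was written
    step: int  # training step at which it was saved
```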
> dict containing:
>     - step: int - Training step from the loaded checkpoint
>     - learning_rate: float - Learning rate from the loaded checkpoint
Why would we need these two?
| """ | ||
|
|
||
| loss: float | ||
| metrics: dict[str, float] |
Maybe it should be List[Metric], but I am not sure how exactly this will be used. I do think that [str, float] is too constraining.
> class Metric:
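A rough sketch of the alternative being floated, assuming a `Metric` class like the one quoted above; the field names and the `TrainStepResult` container are illustrative assumptions, not the PR's actual definitions:

```python
from dataclasses import dataclass, field


@dataclass
class Metric:
    # Illustrative fields only; the real Metric class may differ.
    key: str
    value: float


@dataclass
class TrainStepResult:
    # Hypothetical container using list[Metric] instead of dict[str, float].
    loss: float
    metrics: list[Metric] = field(default_factory=list)
```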
> config: Model configuration. Common keys include:
>     - vocab_size: int - Size of the vocabulary
>     - hidden_size: int - Hidden dimension size
>     - num_layers: int - Number of transformer layers
>     - num_attention_heads: int - Number of attention heads
>     - max_seq_len: int - Maximum sequence length
> parallelism: Parallelism configuration. Common keys include:
These two should probably be dataclasses.
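A minimal sketch of what those two dataclasses could look like. The model fields come from the keys listed in the quoted docstring; the class names and the parallelism fields are assumptions, since the parallelism keys are not shown in the quote:

```python
from dataclasses import dataclass


@dataclass
class ModelConfig:
    # Fields taken from the keys listed in the quoted docstring.
    vocab_size: int
    hidden_size: int
    num_layers: int
    num_attention_heads: int
    max_seq_len: int


@dataclass
class ParallelismConfig:
    # Purely illustrative; the actual parallelism keys are not quoted above.
    dp_degree: int = 1
    tp_degree: int = 1
```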
This PR introduces the `Trainer` Protocol in `src/forge/api/trainer.py`, establishing a unified training interface that all trainer implementations will conform to.

Motivation
Currently, Forge applications directly use Monarch actors (e.g., `RLTrainer.options(...).as_actor(...)`), which exposes implementation details like `.route()` and `.fanout()` to application code. This creates tight coupling and makes it difficult to:
Protocol + Wrappers
Note that we're using Python's Protocol, and not ABC! In case you weren't aware, there is a big philosophical debate about ABC vs Protocol that Claude has introduced me to. I'm primarily choosing Protocol because it's lighter weight (and let me know if you disagree strongly).
Why Protocol and not ABC?
We want a public, lightweight interface that anyone can satisfy without importing our base class. Protocol gives us structural typing: if it quacks, it flies. An ABC would force nominal inheritance and encourage a hierarchy we don’t actually need.
TL;DR:
`@runtime_checkable` enables `isinstance(x, Trainer)` as a light guard.

What it looks like in practice:
With ABC:
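A minimal sketch of the ABC approach; the class and method names here are illustrative, not the PR's actual code:

```python
from abc import ABC, abstractmethod
from typing import Any


class Trainer(ABC):
    @abstractmethod
    def train_step(self, batch: Any) -> dict[str, Any]: ...


class TitanTrainer(Trainer):  # nominal typing: must inherit from Trainer
    def train_step(self, batch: Any) -> dict[str, Any]:
        return {"loss": 0.0}
```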
With Protocol:
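And the same interface as a Protocol, again as an illustrative sketch rather than the PR's actual code:

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class Trainer(Protocol):
    def train_step(self, batch: Any) -> dict[str, Any]: ...


class TitanTrainer:  # structural typing: no base class required
    def train_step(self, batch: Any) -> dict[str, Any]:
        return {"loss": 0.0}


assert isinstance(TitanTrainer(), Trainer)  # works because of @runtime_checkable
```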
Why this matters:
- `HFTrainer`, `TitanTrainer`, etc. can be simple adapters over the Monarch actor.
- A `BaseTrainer(ABC)` with helpers/metrics can still exist internally, without changing the public `Trainer` protocol.

Other planned changes:
The Protocol is step 1 of a multi-PR refactor. Other planned changes: