
@allenwang28 (Contributor)

This PR introduces the Trainer Protocol in src/forge/api/trainer.py, establishing a unified training interface that all trainer implementations will conform to.

Motivation

Currently, Forge applications directly use Monarch actors (e.g., RLTrainer.options(...).as_actor(...)), which exposes implementation details like .route() and .fanout() to application code.

This creates tight coupling and makes it difficult to:

  • Switch between different trainer backends (TorchTitan, HuggingFace, etc.)
  • Write portable application code that doesn't depend on Monarch specifics

Protocol + Wrappers

Note that we're using Python's Protocol, and not ABC! In case you weren't aware, there is a big philosophical debate about ABC vs Protocol that Claude has introduced me to. I'm primarily choosing Protocol because it's lighter weight (and let me know if you disagree strongly).

Why Protocol and not ABC?

We want a public, lightweight interface that anyone can satisfy without importing our base class. Protocol gives us structural typing: if it quacks, it's a duck. An ABC would force nominal inheritance and encourage a hierarchy we don't actually need.

TL;DR:

  • Looser coupling: Call sites accept “anything with the right methods,” not “anything that inherits our base.”
  • Frictionless third-party impls: External teams can implement the interface without depending on our internals.
  • Small, composable capabilities: Easy to define narrow traits and mix them.
  • Optional runtime checks: If desired, @runtime_checkable enables isinstance(x, Trainer) as a light guard (see the sketch after this list).
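
For a concrete picture, here's a minimal sketch of what such a protocol could look like. train_step is the one method this PR actually implements; the batch type and the TrainResult fields follow the snippets quoted in the review below, and the exact signatures are assumptions rather than the final API:

  from dataclasses import dataclass
  from typing import Any, Protocol, runtime_checkable

  @dataclass
  class TrainResult:
      loss: float
      metrics: dict[str, float]

  @runtime_checkable
  class Trainer(Protocol):
      """Structural interface: anything with a matching train_step satisfies it."""

      async def train_step(self, batch: dict[str, Any]) -> TrainResult:
          ...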

What it looks like in practice:

With ABC:

  # Would force inheritance
  class TitanTrainer(Trainer):  # Must inherit from ABC
      def __init__(self, actor_handle):
          super().__init__()  # ABC initialization overhead
          self._actor = actor_handle

With Protocol:

  # No inheritance required
  class TitanTrainer:  # Just a plain class
      def __init__(self, actor_handle):
          self._actor = actor_handle  # That's it
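
Call sites then type against the behavior, not a concrete class. A hypothetical example (run_training and its arguments are made up, reusing the Trainer sketch above):

  async def run_training(trainer: Trainer, batches: list[dict]) -> None:
      # Accepts TitanTrainer, HFTrainer, or anything else with a train_step
      for batch in batches:
          result = await trainer.train_step(batch)
          print(f"loss: {result.loss}")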

Why this matters:

  • Simple/thin wrappers: HFTrainer, TitanTrainer, etc. can be simple adapters over the Monarch actor
  • Fungibility by default: Third parties can drop in their own trainer without subclassing anything.
  • Stability for callers: Callers type against the behavior (the protocol), so internal refactors don’t cascade.
  • Escape hatch: If we later need shared behavior, we can add an optional BaseTrainer(ABC) with helpers/metrics without changing the public Trainer protocol (sketched after this list).
  • Ultimately, this all lets us keep looser coupling between the protocol definition and its implementations.
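
A rough sketch of that escape hatch (BaseTrainer and log_metrics are hypothetical; nothing in this PR defines them):

  from abc import ABC

  class BaseTrainer(ABC):
      """Optional internal base with shared helpers. Implementations may
      inherit it, but callers keep typing against the Trainer protocol."""

      def log_metrics(self, metrics: dict[str, float]) -> None:
          for name, value in metrics.items():
              print(f"{name}: {value}")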

Other planned changes

This Protocol is step 1 of a multi-PR refactor. Still to come:

  • Restructure actor/trainer.py into trainer/titan.py and rename RLTrainer to TitanTrainerActor; add a TitanTrainer wrapper class that hides the Monarch adverbs
  • Implement the rest of the API for the Titan trainer (only train_step is implemented right now)
  • App migration, possibly after the other API changes have landed

The meta-cla bot added the CLA Signed label on Nov 6, 2025.
@joecummings (Member) left a comment:

Overall very happy with this, but I need to see more details in the docstrings because it's not entirely clear what's being passed into each method.

@casteryh (Contributor) commented Nov 6, 2025:

I like protocol more than ABC

- Neither: default_dir/step-{current_step}

Returns:
dict containing:
A contributor commented:

what's the key of dict[str, Any]?

@allenwang28 (author) replied:

It's a placeholder for now, returning the path it was written to and the step. We might not need anything, I'm not sure, but I want to return a placeholder at least in case we get into 'hosted service' territory.

A contributor commented:

if we know what the dict returns, let's make it a dataclass? (a possible shape is sketched below)
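
A minimal sketch of what that dataclass could look like, assuming the dict holds the checkpoint path and step as described in the reply above (the name CheckpointResult is made up):

  from dataclasses import dataclass

  @dataclass
  class CheckpointResult:
      path: str  # where the checkpoint was written
      step: int  # training step at save time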

Comment on lines +220 to +222
dict containing:
- step: int - Training step from the loaded checkpoint
- learning_rate: float - Learning rate from the loaded checkpoint
A contributor commented:

why would we need these 2?

"""

loss: float
metrics: dict[str, float]
@felipemello1 (Contributor) commented Nov 8, 2025:

Maybe it should be List[Metric], but I'm not sure how exactly this will be used. I do think that dict[str, float] is too constraining.
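
For context, a Metric-based shape might look like this (the Metric class is hypothetical, just illustrating why a bare dict[str, float] is constraining):

  from dataclasses import dataclass, field
  from typing import Any

  @dataclass
  class Metric:
      name: str
      value: float
      metadata: dict[str, Any] = field(default_factory=dict)  # room for context a bare float can't carry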



@dataclass
class OptimStepResult:
@felipemello1 (Contributor) commented Nov 8, 2025:

I don't fully get this:

step -> should be a global variable
learning_rate -> makes sense because we run scheduler.step(), but we can always do scheduler.get_lr()
accumulated_steps -> unclear to me how it would know this info. Isn't this the number of microbatches?
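
For reference, the fields under discussion, reconstructed from the comment above (the types are assumptions):

  from dataclasses import dataclass

  @dataclass
  class OptimStepResult:
      step: int               # questioned above: could be a global instead
      learning_rate: float    # comes from the scheduler after .step()
      accumulated_steps: int  # questioned above: microbatch count?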

Comment on lines +125 to +131
config: Model configuration. Common keys include:
- vocab_size: int - Size of the vocabulary
- hidden_size: int - Hidden dimension size
- num_layers: int - Number of transformer layers
- num_attention_heads: int - Number of attention heads
- max_seq_len: int - Maximum sequence length
parallelism: Parallelism configuration. Common keys include:
A contributor commented:

these two should probably be dataclasses (a possible shape is sketched below)
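
For illustration, a dataclass built from the documented model-config keys (the name ModelConfig is hypothetical; the parallelism keys are truncated in the quote above, so they're left out):

  from dataclasses import dataclass

  @dataclass
  class ModelConfig:
      vocab_size: int           # size of the vocabulary
      hidden_size: int          # hidden dimension size
      num_layers: int           # number of transformer layers
      num_attention_heads: int  # number of attention heads
      max_seq_len: int          # maximum sequence length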
