
[trainer] fix: update TorchTitanEngine for latest torchtitan API #6231

Open
acisseJZhong wants to merge 1 commit into verl-project:main from acisseJZhong:update_titan_api

Conversation


acisseJZhong (Collaborator) commented May 1, 2026

Summary

  • Update TorchTitanEngine to align with torchtitan HEAD API changes: remove the deleted expert_tensor_parallel_degree/etp/maybe_enable_amp, and add a concrete CrossEntropyLoss.Config for the now-abstract BaseLoss
  • Fix attn_type being silently ignored (#6182): pass attn_backend= to model_registry() and fix the wrong attribute path in prepare_model_inputs
  • Fix derive_torchtitan_name_and_flavor to handle config factories (callables) and to fall back to len(config.layers) when matching layer counts
  • Fix the test script: use NUM_GPUS for the rollout TP size instead of a hardcoded 8, and fix the misleading experiment name

Fixes #6182

Test plan

  • Verify derive_torchtitan_name_and_flavor correctly resolves Qwen3-0.6B flavor
  • Run tests/special_e2e/run_ppo_trainer_torchtitan.sh with torchtitan HEAD
  • Verify model builds with attn_type=flex (FlexAttention) as configured
BACKEND=torchtitan FSDP_SIZE=1 NUM_GPUS=1 MODEL_ID=Qwen/Qwen3-0.6B \
  bash tests/special_e2e/sft/run_sft_engine.sh



gemini-code-assist (Bot) left a comment


Code Review

This pull request updates the torchtitan engine implementation and its associated E2E test script. Key changes include refactoring model specification and attention backend handling, removing expert tensor parallelism configurations, and adding a placeholder loss function for Trainer initialization. The model flavor derivation logic was also enhanced to support callable configurations and different layer attribute names. I have no feedback to provide as there were no review comments to assess.

Commit message

Align verl's TorchTitanEngine with torchtitan HEAD, fixing several
breaking API changes and the attn_type bug reported in verl-project#6182.

torchtitan API updates:
- Remove `expert_tensor_parallel_degree` from ParallelismConfig (removed upstream)
- Remove `etp` from ParallelDims constructor (removed upstream)
- Remove `maybe_enable_amp` context manager (removed upstream; `train_context()` handles mixed precision)
- Add `loss=CrossEntropyLoss.Config()` to Trainer.Config (BaseLoss is now abstract)
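A minimal sketch of the last item; the commit message confirms only the `loss=CrossEntropyLoss.Config()` field and that `BaseLoss` became abstract, so the import paths below are assumptions, not verified against torchtitan HEAD:

```python
# Sketch only: import paths are assumptions, not verified upstream.
from torchtitan.components.loss import CrossEntropyLoss  # assumed path
from torchtitan.train import Trainer  # assumed path

# BaseLoss is now abstract, so Trainer.Config must receive a concrete
# loss config rather than relying on a default:
config = Trainer.Config(loss=CrossEntropyLoss.Config())
```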

attn_type fixes (verl-project#6182):
- Pass `attn_backend=` to `model_registry()` instead of dead post-hoc override
- Fix `attn_type` lookup to use `self.engine_config.attn_type` instead of wrong path
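A hedged sketch of the fixed call site; `model_registry` and the `engine_config.attn_type` path come from this PR's description, while the function wrapper and other names are illustrative:

```python
def build_model(engine_config, model_name: str, flavor: str):
    # Read attn_type from engine_config directly; the old code used a
    # wrong attribute path, so the configured value never arrived here.
    attn_type = engine_config.attn_type
    # Pass the backend at registration time. The previous post-hoc
    # override ran after the model spec was already resolved, so
    # attn_type="flex" was silently dropped (#6182).
    return model_registry(model_name, flavor, attn_backend=attn_type)
```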

derive_torchtitan_name_and_flavor fixes:
- Handle config factories (callables) in addition to config objects
- Fall back to `len(config.layers)` when `n_layers` attr doesn't exist
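A sketch of the resolution logic using a hypothetical helper name; only the callable handling and the `len(config.layers)` fallback are taken from the commit message:

```python
def layer_count(config_or_factory):
    # Registry entries may now be config factories rather than
    # config objects, so call them first:
    config = (
        config_or_factory() if callable(config_or_factory)
        else config_or_factory
    )
    # Some configs expose n_layers; others only carry a layers
    # container, so fall back to its length:
    if hasattr(config, "n_layers"):
        return config.n_layers
    return len(config.layers)
```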

test script fixes:
- Use `NUM_GPUS` for rollout TP size instead of hardcoded 8
- Fix misleading default experiment name


Development

Successfully merging this pull request may close these issues.

[trainer] bug: TorchtitanEngine silently ignores attn_type="flex" — no clear BKM for which torchtitan version to use (#6182)
