[trainer] fix: update TorchTitanEngine for latest torchtitan API #6231
acisseJZhong wants to merge 1 commit into verl-project:main
Conversation
Code Review
This pull request updates the torchtitan engine implementation and its associated E2E test script. Key changes include refactoring model specification and attention backend handling, removing expert tensor parallelism configurations, and adding a placeholder loss function for Trainer initialization. The model flavor derivation logic was also enhanced to support callable configurations and different layer attribute names. I have no feedback to provide as there were no review comments to assess.
acisseJZhong force-pushed from ba5beaf to 451c9dc
Align verl's TorchTitanEngine with torchtitan HEAD, fixing several breaking API changes and the attn_type bug reported in verl-project#6182.

torchtitan API updates:
- Remove `expert_tensor_parallel_degree` from ParallelismConfig (removed upstream)
- Remove `etp` from ParallelDims constructor (removed upstream)
- Remove `maybe_enable_amp` context manager (removed upstream; `train_context()` handles mixed precision)
- Add `loss=CrossEntropyLoss.Config()` to Trainer.Config (BaseLoss is now abstract; see the sketch after this list)

attn_type fixes (verl-project#6182):
- Pass `attn_backend=` to `model_registry()` instead of the dead post-hoc override
- Fix the `attn_type` lookup to use `self.engine_config.attn_type` instead of the wrong path

derive_torchtitan_name_and_flavor fixes:
- Handle config factories (callables) in addition to config objects
- Fall back to `len(config.layers)` when the `n_layers` attribute doesn't exist

test script fixes:
- Use `NUM_GPUS` for the rollout TP size instead of a hardcoded 8
- Fix the misleading default experiment name
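For concreteness, a minimal sketch of how the loss placeholder and the `attn_backend=` pass-through might look in the engine. The import paths, the `build_model_spec` helper, and any arguments of `model_registry()` beyond `attn_backend=` are assumptions based on this description, not lines from the diff:

```python
# Sketch only: import paths are assumptions and may not match torchtitan HEAD.
from torchtitan.components.loss import CrossEntropyLoss       # assumed path
from torchtitan.protocols.train_spec import model_registry    # assumed path
from torchtitan.train import Trainer                          # assumed path

# BaseLoss is now abstract upstream, so Trainer.Config requires a concrete
# loss config. verl computes its own RL losses, so this is only a placeholder
# that lets Trainer initialization succeed.
trainer_config = Trainer.Config(
    loss=CrossEntropyLoss.Config(),
    # other fields (parallelism, checkpointing, model) omitted in this sketch
)


def build_model_spec(model_name, engine_config):
    # Hypothetical helper. attn_type must be threaded through at registry
    # time; overriding the backend on the already-built model was dead code
    # (verl-project#6182).
    return model_registry(
        model_name,                            # e.g. "qwen3"
        attn_backend=engine_config.attn_type,  # e.g. "flex" for FlexAttention
    )
```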
acisseJZhong force-pushed from 451c9dc to 48abba3
Summary
- torchtitan API: remove `expert_tensor_parallel_degree`/`etp`/`maybe_enable_amp`; add a concrete `CrossEntropyLoss.Config` for the now-abstract `BaseLoss`
- Fix `attn_type` being silently ignored ([trainer] bug: TorchtitanEngine silently ignores attn_type="flex" — no clear BKM for which torchtitan version to use #6182): pass `attn_backend=` to `model_registry()` and fix the wrong attribute path in `prepare_model_inputs`
- Update `derive_torchtitan_name_and_flavor` to handle config factories (callables) and fall back to `len(layers)` for layer-count matching (see the sketch after this list)
- Test script: use `NUM_GPUS` for the rollout TP size instead of a hardcoded 8; fix the misleading experiment name

Fixes #6182
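A sketch of the layer-count matching described in the `derive_torchtitan_name_and_flavor` bullet; the real function's signature isn't shown in this PR, so `match_flavor`, the shape of the `flavors` table, and the HF attribute used are illustrative assumptions:

```python
def match_flavor(hf_config, flavors):
    """Illustrative core of the flavor-matching step.

    `flavors` is assumed to map flavor names to torchtitan model configs,
    which upstream may register either as config objects or as config
    factories (callables), per this PR.
    """
    for flavor_name, config in flavors.items():
        if callable(config):
            # Config factories must be instantiated before inspection.
            config = config()
        # Some configs expose `n_layers`; others only a `layers` container,
        # hence the len(config.layers) fallback from this PR.
        n_layers = getattr(config, "n_layers", None)
        if n_layers is None:
            n_layers = len(config.layers)
        if n_layers == hf_config.num_hidden_layers:
            return flavor_name
    raise ValueError("no torchtitan flavor matches the HF layer count")
```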
Test plan
- Verified `derive_torchtitan_name_and_flavor` correctly resolves the Qwen3-0.6B flavor (sketched as a test below)
- Ran `tests/special_e2e/run_ppo_trainer_torchtitan.sh` with torchtitan HEAD
- Confirmed `attn_type=flex` (FlexAttention) is used as configured
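A hedged sketch of the first test-plan item as a pytest-style check; the helper's module path, signature, return shape, and the exact Qwen3-0.6B flavor label are all assumptions:

```python
from transformers import AutoConfig

# Assumed module path; the PR does not show where the helper lives in verl.
from verl.workers.engine.torchtitan import derive_torchtitan_name_and_flavor


def test_derive_qwen3_flavor():
    hf_config = AutoConfig.from_pretrained("Qwen/Qwen3-0.6B")
    # Assumed return shape: (torchtitan model name, flavor string).
    name, flavor = derive_torchtitan_name_and_flavor(hf_config)
    assert name == "qwen3"
    assert "0.6B" in flavor  # exact flavor label is an assumption
```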