Centralize Async TP Enablement with maybe_enable_async_tp API #1619

fegin · 2025-08-21T21:14:56Z

This PR addresses duplicated code related to enabling async TP across different parts of the codebase. It introduces a new API, maybe_enable_async_tp(), which centralizes the enablement logic and is reused consistently in all models.

Note that while this PR fixes one async TP bug in TorchTitan, it does not fully resolve #1613, as there appear to be additional bugs in PyTorch's async TP implementation.
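For readers unfamiliar with the pattern, here is a minimal sketch of what a centralized "maybe enable" helper can look like. It is not the code from this PR: the config classes are stand-ins, the field names `enable_async_tensor_parallel` and `compile.enable` are assumptions, the compile requirement is an assumed precondition, and the actual PyTorch enablement step is replaced by a caller-supplied `enable_fn`. The signature is also simplified to take a group name rather than a `DeviceMesh`, so the sketch runs without a distributed setup.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ParallelismConfig:
    # Hypothetical flag name; the real config field may differ.
    enable_async_tensor_parallel: bool = False


@dataclass
class CompileConfig:
    # Hypothetical flag name; the real config field may differ.
    enable: bool = False


@dataclass
class JobConfig:
    # Stand-in for the real JobConfig, reduced to the two fields used here.
    parallelism: ParallelismConfig = field(default_factory=ParallelismConfig)
    compile: CompileConfig = field(default_factory=CompileConfig)


def maybe_enable_async_tp(
    job_config: JobConfig,
    tp_group_name: str,
    enable_fn: Callable[[str], None],
) -> bool:
    """Enable async TP once, in one place, if the config asks for it.

    Returns True if async TP was enabled, False if it was not requested.
    """
    if not job_config.parallelism.enable_async_tensor_parallel:
        return False

    # Assumed precondition for this sketch: async TP needs compilation.
    if not job_config.compile.enable:
        raise RuntimeError("Async TP requires compilation to be enabled")

    # The real enablement call is injected, so every model shares the
    # same checks and none of them repeats this logic.
    enable_fn(tp_group_name)
    return True
```

Each model's parallelize function would then call this helper instead of carrying its own copy of the checks, which is the duplication this PR aims to remove.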


tianyu-l

LGTM, had a suggestion.

tianyu-l · 2025-08-21T21:55:14Z

torchtitan/models/llama3/infra/parallelize.py

@@ -139,12 +133,26 @@ def parallelize_llama(
    return model


+def maybe_enable_async_tp(job_config: JobConfig, tp_mesh: DeviceMesh):


I'd suggest we take this chance to move it into a model-agnostic file. Specifically, I'm thinking of torchtitan/distributed/tensor_parallel.py, where we could also move NoParallel (https://github.com/pytorch/torchtitan/blob/main/torchtitan/distributed/expert_parallel.py#L116).

I'm also thinking we may want to move most of the apply_ac (and maybe apply_compile) logic into that folder, since it is pretty much the same across all models.

fegin requested review from tianyu-l, wwwjn and wconstab as code owners August 21, 2025 21:14

meta-cla bot added the "CLA Signed" label Aug 21, 2025

update

653cece

tianyu-l reviewed Aug 21, 2025
