Fix DTensor/torch.Tensor compatibility in LinearCrossEntropyLoss #2898
Conversation
Fixes pytorch#2856
When using distributed LoRA fine-tuning with custom_sharded_layers, some tensors become DTensors while others remain regular tensors. This caused a RuntimeError when computing cross-entropy loss.
The fix adds compatibility handling in LinearCrossEntropyLoss.compute_cross_entropy by checking tensor types before the linear projection. When there's a type mismatch:
- If weight is a DTensor and hidden is not: convert hidden to a DTensor
- If hidden is a DTensor and weight is not: convert hidden to a local tensor
This ensures compatibility for distributed training while maintaining normal operation for non-distributed cases. A rough sketch of this handling appears after the change list below.
Changes:
- Add DTensor type checking before the self.linear_projection call
- Handle tensor type conversion when a mismatch is detected
- Add a regression test for the issue
- No impact on non-distributed training
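For illustration, a rough standalone sketch of the handling described above (not the exact diff from this PR; the replicated placement used when lifting hidden onto the weight's mesh is an assumption made here for the example):

```python
import torch
from torch.distributed._tensor import DTensor, Replicate


def reconcile_dtensor_types(linear_projection: torch.nn.Linear, hidden_chunk: torch.Tensor):
    """Make hidden_chunk compatible with linear_projection.weight before the matmul."""
    weight = linear_projection.weight
    weight_is_dtensor = isinstance(weight, DTensor)
    hidden_is_dtensor = isinstance(hidden_chunk, DTensor)

    if weight_is_dtensor and not hidden_is_dtensor:
        # Weight is a DTensor but hidden is not: lift hidden onto the same mesh.
        # (Replicate placement is an illustrative assumption.)
        hidden_chunk = DTensor.from_local(
            hidden_chunk, weight.device_mesh, [Replicate()] * weight.device_mesh.ndim
        )
    elif hidden_is_dtensor and not weight_is_dtensor:
        # Hidden is a DTensor but weight is not: fall back to the local shard.
        hidden_chunk = hidden_chunk.to_local()
    return hidden_chunk
```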
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/2898
Note: Links to docs will display an error until the docs builds have been completed.
❌ 3 New Failures, 4 Cancelled Jobs
As of commit 2527975 with merge base b22a3ae:
NEW FAILURES - The following jobs have failed:
CANCELLED JOBS - The following jobs were cancelled. Please retry:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
I've been working on some similar issues in Hugging Face PEFT and llama-cookbook, so I thought I'd jump in and see if I could resolve this. Totally open to any changes or feedback. cc @krammnic
I will do a few sanity checks and then we can proceed on this.
# This case is less likely but handle it
hidden_chunk = hidden_chunk.to_local()
except ImportError:
# DTensor not available in this PyTorch version
Do we have to worry about this? torchtune I believe only needs to support latest stable and prerelease, so DTensor should always be importable.
# When using FSDP with custom_sharded_layers, some tensors might be DTensors
# while others are regular tensors, causing compatibility issues
if hasattr(torch.distributed, '_tensor') and torch.distributed.is_initialized():
try:
Ideally blocks like this actually sit outside compute_cross_entropy (perhaps in forward), because compute_cross_entropy gets compiled, and type branching doesn't appear to play nicely with compile. Compile support here is already muddy, but calling this part outside compute_cross_entropy can't hurt.
from torch.distributed._tensor import DTensor

# For linear_projection modules, we need to check the weight parameter
if hasattr(self.linear_projection, 'weight'):
Is it possible for this to be false?
)
elif hidden_is_dtensor and not weight_is_dtensor:
# This case is less likely but handle it
hidden_chunk = hidden_chunk.to_local()
Is this correct? If hidden_chunk is a DTensor, according to the forward logic, it should be sharded on the feature dimension (bs*seq_len, feature_dim / tp_dim). Then since the weight isn't a DTensor in this branch, presumably it has the original feature dimension and shape (feature_dim, vocab_size), so the matmul shapes don't match.
I may be missing something, feel free to correct me! I don't have access to a machine to test right now.
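To make the shape concern concrete, here is a minimal single-process sketch (all dimensions are hypothetical, with tp standing for the tensor-parallel degree):

```python
import torch

bs_seq, feature_dim, vocab_size, tp = 8, 16, 32, 4

# If hidden_chunk were a DTensor sharded on the feature dimension, .to_local()
# would return only this rank's shard: (bs*seq_len, feature_dim / tp).
hidden_local = torch.randn(bs_seq, feature_dim // tp)

# A non-DTensor projection weight still carries the full feature dimension.
weight = torch.randn(vocab_size, feature_dim)

# The projection no longer lines up: inner dimensions are 4 vs. 16.
try:
    torch.nn.functional.linear(hidden_local, weight)
except RuntimeError as e:
    print(f"shape mismatch, as the reviewer suspected: {e}")
```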
Apologies if I jumped on reviewing this too soon. Let me know if anything doesn't make sense. I'll take a closer look when you're happy to proceed :)
Hi @nathan-az, thank you for reviewing this! You're totally right about the dimension mismatch; trying to convert tensors was the wrong approach. I need to find the root FSDP problem that's causing this.
Next steps for revising this PR: go to where custom_sharded_layers is used in the FSDP configuration and check whether the linear_projection layer is being excluded from wrapping. If so, update the config to make sure linear_projection is wrapped with the rest of the model.
Closing this PR in favor of #2900, which takes a better approach by validating the configuration rather than trying to convert tensor types at runtime. The validation approach addresses the root cause without the dimension mismatch and compilation issues identified in the review.
Summary
Fixes #2856 - DTensor/torch.Tensor mixed type error in Llama4 LoRA fine-tuning
Problem
When running distributed LoRA fine-tuning with custom_sharded_layers, the training fails with:
RuntimeError: aten.mm.default: got mixed torch.Tensor and DTensor, need to convert all torch.Tensor to DTensor before calling distributed operators!
This occurs because FSDP wraps some tensors as DTensors while others remain regular tensors, causing a type mismatch in the loss computation.
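For context, this kind of mixed-type failure can be reproduced outside torchtune with a minimal single-process sketch; the gloo backend, one-device mesh, and tensor shapes below are illustrative assumptions, not part of the PR:

```python
import os
import torch
import torch.distributed as dist
from torch.distributed._tensor import DTensor, Replicate
from torch.distributed.device_mesh import init_device_mesh

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")
dist.init_process_group("gloo", rank=0, world_size=1)
mesh = init_device_mesh("cpu", (1,))

# Projection weight wrapped as a DTensor, hidden states left as a plain tensor.
weight = DTensor.from_local(torch.randn(32, 16), mesh, [Replicate()])
hidden = torch.randn(8, 16)

try:
    torch.matmul(hidden, weight.t())
except RuntimeError as e:
    print(e)  # expected to be the "got mixed torch.Tensor and DTensor" error

dist.destroy_process_group()
```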
Solution
Added DTensor compatibility handling in LinearCrossEntropyLoss.compute_cross_entropy() by checking tensor types before the linear projection: if the weight is a DTensor and hidden is not, hidden is converted to a DTensor; if hidden is a DTensor and the weight is not, hidden is converted to a local tensor.
Testing
Added a regression test for the issue in test_dtensor_cross_entropy.py.
Test Plan
To verify the fix:
cc @krammnic