
Conversation

Contributor

@KimbingNg KimbingNg commented Nov 26, 2025

What does this PR do?

Fixes #12673 and Tencent-Hunyuan/HunyuanVideo-1.5#10. I identified the root cause: record_stream is not functioning as expected, for reasons that are still unclear. As a result, the offload_to_memory function updates param.data without waiting for the default stream (current_stream) to finish the forward pass.
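For readers unfamiliar with the failure mode, here is a minimal, torch-free sketch (all names are hypothetical, and this is not diffusers code) of why the offload path must synchronize with the compute "stream" before reusing a parameter's memory:

```python
import threading

# Hypothetical sketch: a worker thread plays the role of the compute stream
# running the forward pass, while the main thread plays offload_to_memory.
# If the offload does not wait for the compute side, it overwrites the
# buffer the forward pass is still reading.

def run_forward(param, forward_done, results):
    results.append(sum(param))  # the "forward pass" reads the parameter
    forward_done.set()          # analogous to the default stream finishing

def offload_to_memory(param, forward_done):
    # The guarantee record_stream was supposed to provide: wait until the
    # compute side is done with this memory before recycling it.
    forward_done.wait()
    param[:] = [0] * len(param)  # now safe to reuse the buffer

param = [1, 2, 3]
done = threading.Event()
results = []
worker = threading.Thread(target=run_forward, args=(param, done, results))
worker.start()
offload_to_memory(param, done)
worker.join()
assert results == [6]  # the forward pass saw the real values
```

When the wait is skipped, as in the reported bug, the forward pass can observe a partially overwritten buffer instead.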

Before submitting

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

cc @asomoza @sayakpaul @a-r-r-o-w @DN6

@KimbingNg
Contributor Author

KimbingNg commented Nov 26, 2025

As for why record_stream fails, I suspect it is because calling record_stream on Tensor.data is unsafe (Tensor.data is somewhat deprecated), though I still need confirmation on this.
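One easily verifiable piece of this suspicion (assuming PyTorch is available; whether this actually breaks record_stream's bookkeeping is exactly the open question above): every access to Tensor.data returns a fresh detached Tensor object, so the object you record on is not the object you started from.

```python
import torch

t = torch.ones(2)
# Each .data access builds a new Python Tensor object wrapping the same
# storage, so identity-based reasoning about "the tensor we recorded a
# stream on" is fragile when done through .data.
assert t.data is not t
assert t.data is not t.data  # two accesses, two distinct objects
```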

@KimbingNg
Contributor Author

Also relevant to Tencent-Hunyuan/HunyuanVideo-1.5#10

Member

@sayakpaul sayakpaul left a comment


Thanks for your PR!

Could we also run the test_group_offloading tests?

pytest tests/models -k "test_group_offloading"


for group_module in self.modules:
    for param in group_module.parameters():
        if self.record_stream and param.device.type == 'cuda':
Member


Why do we have to condition on param.device.type == 'cuda'?

Contributor Author

@KimbingNg KimbingNg Nov 26, 2025


Because tensors on CPU cannot call record_stream.

Member


We can do something like if self.record_stream and param.device.type != "cpu":, right?

Contributor Author


Yes, I updated the code as you recommended in the latest commit.
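The agreed-upon guard can be sketched as follows (a torch-free illustration with stand-in objects; should_record_stream is a hypothetical name, not the diffusers API):

```python
from types import SimpleNamespace

def should_record_stream(record_stream_enabled, param):
    # Exclude CPU rather than allowlisting CUDA, so that non-CUDA
    # stream-ordered accelerator backends remain eligible for record_stream.
    return record_stream_enabled and param.device.type != "cpu"

# Stand-in objects mimicking param.device.type:
cpu_param = SimpleNamespace(device=SimpleNamespace(type="cpu"))
cuda_param = SimpleNamespace(device=SimpleNamespace(type="cuda"))

assert not should_record_stream(True, cpu_param)    # CPU tensors cannot record
assert should_record_stream(True, cuda_param)
assert not should_record_stream(False, cuda_param)  # feature disabled
```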

@KimbingNg
Contributor Author

pytest tests/models -k "test_group_offloading"

Running pytest:

=============================================== 365 passed, 37 skipped, 3604 deselected, 757 warnings in 100.34s (0:01:40) ================================================

@KimbingNg KimbingNg changed the title Fixes #12673. record_stream is not working properly Fixes #12673. record_stream in group offloading is not working properly Nov 26, 2025
@sayakpaul sayakpaul requested a review from DN6 November 26, 2025 12:30
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.



Development

Successfully merging this pull request may close these issues.

QwenImagepipeline imcompatable with group offloading with record_stream=True
