Fix `Qwen2AudioForConditionalGeneration.forward()` and `test_flash_attn_kernels_inference_equivalence` #39503
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Thank you for opening the PR 🙏

We'll need to pass `cache_position` to `self.language_model` as well. The non-legacy processing branch in `Qwen2AudioForConditionalGeneration.forward` only prepares `inputs_embeds` (and not position-based tensors, like `position_ids` or `attention_mask`), so further processing of `cache_position` probably isn't needed. I'm assuming that before this fix, we were relying on the `language_model`'s default values for `cache_position`, so in practice this change won't have an impact on the outputs from interfaces like `generate`.
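The pattern described above can be sketched in plain Python. The class and method names below are hypothetical stand-ins for illustration, not the real transformers code:

```python
# Hypothetical sketch: a conditional-generation wrapper should forward
# cache_position to its inner language model rather than dropping it.

class ToyLanguageModel:
    def forward(self, inputs_embeds, cache_position=None):
        # Default behavior the model fell back to before the fix:
        # positions are assumed to start at 0.
        if cache_position is None:
            cache_position = list(range(len(inputs_embeds)))
        return {"cache_position": cache_position}


class ToyConditionalGeneration:
    def __init__(self):
        self.language_model = ToyLanguageModel()

    def forward(self, inputs_embeds, cache_position=None):
        # The fix: forward cache_position to the inner language model
        # unchanged; since this branch only prepares inputs_embeds, no
        # further processing of cache_position is needed here.
        return self.language_model.forward(
            inputs_embeds, cache_position=cache_position
        )


model = ToyConditionalGeneration()
out = model.forward([0.1, 0.2, 0.3], cache_position=[5, 6, 7])
print(out["cache_position"])  # → [5, 6, 7]
```

When the caller omits `cache_position`, the inner model's default (positions starting at 0) takes over, which is why the behavior happened to match before the fix during a single full forward pass.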
Let's make sure the slow tests are passing after the changes!
@ebezzam after your changes, remove this line as well. If this was the only cache-related issue, then CI will be happy :D (the line I linked prevents
This is such an important patch. Thank you so much, I was having so many issues just with inference.
Thank you @gante for the pointers! I can confirm that slow tests are passing:

```
# RUN_SLOW=1 pytest tests/models/qwen2_audio/test_modeling_qwen2_audio.py
...
============= 15 passed, 33 skipped in 45.03s =============
```

UPDATE after seeing the failing CI: I tried different combinations of the following (command: `RUN_SLOW=1 pytest tests/models/qwen2_audio`).

Without `all_generative_model_classes = ()` and without `GenerationTesterMixin` (similar to https://app.circleci.com/pipelines/github/huggingface/transformers/138686/workflows/39334258-3551-4727-9629-adcf04e95a67/jobs/1838253):

```
FAILED tests/models/qwen2_audio/test_modeling_qwen2_audio.py::Qwen2AudioForConditionalGenerationModelTest::test_generation_tester_mixin_inheritance - AssertionError: False is not true : This model can call `generate` from `GenerationMixin`, so one of two things must happen: 1) the tester must inherit from `GenerationTesterMixi...
==== 1 failed, 93 passed, 77 skipped, 4 warnings in 164.27s (0:02:44) =========
```

Without `all_generative_model_classes = ()` and with `GenerationTesterMixin`:

```
FAILED tests/models/qwen2_audio/test_modeling_qwen2_audio.py::Qwen2AudioForConditionalGenerationModelTest::test_assisted_decoding_matches_greedy_search_0_random - RuntimeError: The size of tensor a (2) must match the size of tensor b (26) at non-singleton dimension 1
FAILED tests/models/qwen2_audio/test_modeling_qwen2_audio.py::Qwen2AudioForConditionalGenerationModelTest::test_assisted_decoding_matches_greedy_search_1_same - RuntimeError: The size of tensor a (2) must match the size of tensor b (27) at non-singleton dimension 1
FAILED tests/models/qwen2_audio/test_modeling_qwen2_audio.py::Qwen2AudioForConditionalGenerationModelTest::test_assisted_decoding_sample - RuntimeError: The size of tensor a (2) must match the size of tensor b (27) at non-singleton dimension 1
==== 3 failed, 124 passed, 87 skipped, 4 warnings in 179.77s (0:02:59) ======
```

With `all_generative_model_classes = ()` and without `GenerationTesterMixin`:

```
==== 95 passed, 76 skipped, 4 warnings in 165.05s (0:02:45) =======
```
The remaining failing tests should be fixed by adding this method, to ensure we don't trigger merging of input features when not in the prefill stage.
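A hypothetical sketch of such a check (not the actual method added in the PR): during prefill, `cache_position` starts at 0; in later decoding steps it starts past the cached prompt, so feature merging can be skipped.

```python
# Toy illustration of gating feature merging on the prefill stage.
# Names and logic here are assumptions for illustration only.

def is_prefill(cache_position):
    """True when the first position being processed is 0 (prompt stage)."""
    return len(cache_position) > 0 and cache_position[0] == 0

def should_merge_input_features(input_features, cache_position):
    # Merge audio features into the embeddings only on the prefill pass;
    # later decode steps operate on cached keys/values and new tokens only.
    return input_features is not None and is_prefill(cache_position)

print(should_merge_input_features("feats", [0, 1, 2]))  # → True  (prefill)
print(should_merge_input_features("feats", [3]))        # → False (decoding)
```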
run-slow: qwen2_audio
This comment contains run-slow, running the specified jobs: models: ['models/qwen2_audio']
Thank you for iterating, LGTM 🙏

(having a look at the CI failures, which seem unrelated to these changes)
The root issue in CI is solved in timm, and already released. However, that release happened after we last built our CI images. I've manually triggered a rebuild of our CI images (here); it should be a matter of waiting for that job, then rerunning this PR's CI.
ok, needs more work: our setup file has
run-slow: qwen2_audio

This comment contains run-slow, running the specified jobs: models: ['models/qwen2_audio']
the CI job failed before running the slow tests, but I'm getting the following failures locally:
cc @ebezzam
Thanks for the update! I'll take a look at that.
[For maintainers] Suggested jobs to run (before merge):

run-slow: qwen2_audio
tests/test_modeling_common.py (Outdated)

```diff
@@ -3484,6 +3484,7 @@ def flash_attn_inference_equivalence(self, attn_implementation: str, padding_sid
     model = model_class(config)

     model.to(torch_device)
+    model.half()
```
Resolves the test that is expecting half-precision inputs:

```
# RUN_SLOW=1 pytest tests/models/qwen2_audio/test_modeling_qwen2_audio.py::Qwen2AudioForConditionalGenerationModelTest::test_flash_attn_kernels_inference_equivalence
# -- previously
FAILED tests/models/qwen2_audio/test_modeling_qwen2_audio.py::Qwen2AudioForConditionalGenerationModelTest::test_flash_attn_kernels_inference_equivalence - RuntimeError: FlashAttention only supports fp16, bf16, and fp8_e4m3 data type
```

Also fixes the similar test for Qwen2_VL (below), and maybe for other models!

```
RUN_SLOW=1 pytest tests/models/qwen2_vl/test_modeling_qwen2_vl.py::Qwen2VLModelTest::test_flash_attn_kernels_inference_equivalence
```
Yes, this was broken recently (~1 week ago; we could see the models being loaded in BF16).

Can we set the original dtype, i.e. `model.to(torch.bfloat16)`, instead?
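A pure-Python sketch (no torch) of why the dtype matters here; the toy names below are assumptions for illustration. Flash-attention kernels reject full precision, which is the `RuntimeError` seen in the failing test; casting the model to a supported dtype first avoids it.

```python
# Toy dtype gate mimicking the flash-attention precision constraint.
SUPPORTED_DTYPES = {"float16", "bfloat16", "fp8_e4m3"}

class ToyModel:
    def __init__(self, dtype="float32"):
        self.dtype = dtype

    def to(self, dtype):
        # Mirrors the chaining style of torch's Module.to
        self.dtype = dtype
        return self

def run_flash_attention(model):
    if model.dtype not in SUPPORTED_DTYPES:
        raise RuntimeError(
            "FlashAttention only supports fp16, bf16, and fp8_e4m3 data type"
        )
    return "ok"

model = ToyModel()  # loaded in float32: kernel call fails
try:
    run_flash_attention(model)
except RuntimeError as err:
    print(err)

# Casting first (analogous to model.to(torch.bfloat16)) succeeds.
print(run_flash_attention(model.to("bfloat16")))  # → ok
```

Restoring the model's original half-precision dtype, rather than an unconditional cast to fp16, also keeps the test closer to how the checkpoint was actually loaded.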
run-slow: qwen2_audio

This comment contains run-slow, running the specified jobs: models: ['models/qwen2_audio']
What does this PR do?
Fixes breaking Qwen2Audio tests: https://github.com/huggingface/transformers/actions/runs/16361063842/job/46229139255
Errors don't display in the Model CI tests above, but this is what I got:
In short, `cache_position` was missing in `forward` of `Qwen2AudioForConditionalGeneration`, and adding it resolves the tests 👍

However, it probably needs to be processed in some way? Like in `Qwen2VLForConditionalGeneration`?

If it's similar to what's done in `Qwen2VLForConditionalGeneration`, happy to do it!

cc @eustlb, @gante, as it seems you were aware of cache-related issues here