[FA] Remaining Cleanup #40424

Conversation
tensor_kwargs = {"dtype": torch.int32, "device": position_ids.device}
if not is_packed_sequence:
This path was meant for generation and is no longer valid; it's a dead path that is no longer used as of #40161
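For context on why this branch is dead: in the packed-sequence case, the cumulative sequence lengths can be recovered from the points where position_ids reset to zero. A minimal sketch of that idea (my own illustration, not the transformers implementation):

```python
import torch

def cu_seqlens_from_position_ids(position_ids: torch.Tensor) -> torch.Tensor:
    # For a packed batch, e.g. position_ids = [0, 1, 2, 0, 1, 0, 1, 2, 3],
    # every reset to 0 marks the start of a new sub-sequence.
    flat = position_ids.view(-1)
    starts = torch.nonzero(flat == 0).flatten()
    total = torch.tensor([flat.numel()], device=flat.device)
    # Cumulative sequence lengths in the format the varlen flash attention
    # kernels expect: [0, 3, 5, 9] for the example above.
    return torch.cat([starts, total]).to(torch.int32)

print(cu_seqlens_from_position_ids(torch.tensor([0, 1, 2, 0, 1, 0, 1, 2, 3])))
# tensor([0, 3, 5, 9], dtype=torch.int32)
```

During single-token decoding with a cache, position_ids look like `[[past_length]]` instead, so this assumption no longer holds; that non-packed case is roughly what the deleted branch was there to handle.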
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Can you check if this is working for you, as in #39814? @maxjeblick @alessiodevoto I ran the reproducer and can no longer get the error, so I assume it's not an issue even with these changes here.
The deleted path was meant to help NVIDIA KVPress or users that perform a manual generation loop to use FA2 (#39814). Otherwise the cu_seq_lens always assume that we have a packed sequence as input. We need to keep it somewhere and allow users to do custom generation with FA2.
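For context, the kind of custom generation loop being referred to looks roughly like this (model name and decoding details are illustrative, not taken from the issue):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.2-1B"  # any FA2-capable causal LM
model = AutoModelForCausalLM.from_pretrained(
    model_id, attn_implementation="flash_attention_2", torch_dtype=torch.bfloat16
).cuda()
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("The quick brown fox", return_tensors="pt").to("cuda")

# Prefill, then decode token by token by calling forward directly instead of generate().
out = model(**inputs, use_cache=True)
past_key_values = out.past_key_values
next_token = out.logits[:, -1].argmax(dim=-1, keepdim=True)

for _ in range(20):
    # No attention mask is passed here, so each single-token step should hit the
    # plain flash attention path rather than the varlen one.
    out = model(input_ids=next_token, past_key_values=past_key_values, use_cache=True)
    past_key_values = out.past_key_values
    next_token = out.logits[:, -1].argmax(dim=-1, keepdim=True)
```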
Trying to give a history of what happened and why we ultimately don't need that path anymore:
@zucchini-nlp Tried to explain things above ^ tl;dr: We no longer enter varlen during generate (except when we use an attention mask, which hasn't changed) - this was inefficient and broke more things; this cleans up the rest connected to #40161
Sorry, I am a bit lazy to read through all PRs 🙃 Just to make sure: if users have a custom generation loop where they call forward directly, will it still work? If the linked issue isn't reproducible anymore, it should be fine. But I'd like us to have a test to avoid regression.
No worries 😆
Depends on the input: either a) no padding, then we enter the basic fa function path (last in the if/else), or b) the varlen path with attention mask where we manually do things (hasn't changed).
We circumvent that issue entirely by not going down the varlen path here. It wasn't necessary to enter varlen here, as we have no padding, and we made our lives significantly harder with it 😓 When we entered the varlen path (for input with no padding), we needed that workaround that you made, but with the removal of the prep during generate, we no longer need it.
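Roughly, the resulting dispatch looks like this (a simplified sketch; the function name, signature, and the elided unpad/re-pad helpers are illustrative, not the exact transformers code):

```python
import torch
from flash_attn import flash_attn_func, flash_attn_varlen_func

def _flash_attention_dispatch(q, k, v, attention_mask=None, cu_seq_lens=None, max_seq_lens=None, causal=True):
    if attention_mask is not None:
        # b) Padded batch: unpad q/k/v, build cu_seqlens from the mask, run the
        # varlen kernel, re-pad the output. This branch is unchanged by the PR.
        raise NotImplementedError("unpad / re-pad elided in this sketch")
    elif cu_seq_lens is not None:
        # Natively packed input: the caller provides cumulative sequence lengths
        # (e.g. derived from position_ids), so the varlen kernel is used directly.
        cu_q, cu_k = cu_seq_lens
        max_q, max_k = max_seq_lens
        return flash_attn_varlen_func(q, k, v, cu_q, cu_k, max_q, max_k, causal=causal)
    else:
        # a) No padding (including single-token decoding inside generate):
        # plain flash_attn_func, no varlen bookkeeping needed.
        return flash_attn_func(q, k, v, causal=causal)
```

The point of the cleanup is that a single new token during generate now falls into the last branch instead of building cu_seqlens for the varlen one.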
I can revert the removal of the test - it should catch that error in case we decide to change things up there.
LGTM but:
- let's make sure we don't break ulysses as well
- let's check our run-slow on this PR, the docker should have fa2 now
- happy to have a small TLDR because I did not know either that no padding -> flash_fn now (did not know it made a difference)
run-slow: llama,mistral,bart
This comment contains run-slow, running the specified jobs: models: ['models/bart', 'models/llama', 'models/mistral']
Can you check if this PR works for you @ETOgaosion @kisseternity? (ulysses-sp)
run-slow: llama,mistral,bart
This comment contains run-slow, running the specified jobs: models: ['models/bart', 'models/llama', 'models/mistral']
Identical failures to main (+ the dola tests, which are known). Waiting for feedback on ulysses and kvpress, then I'd merge.
Thanks for the heads up, I'll report later today.
I think it works for verl's ulysses patch - luckily, we use a special handling method for the current 4.55 query length API.
Gotcha @ETOgaosion, I'd be interested whether this works without the patch for 4.55? I assume this PR is safe on your side then.
Thanks a lot for the PR; from kvpress side, there are no issues!
Indeed, I'm using https://github.com/huggingface/transformers/pull/40412/files as a quick fix and it looks good so far; when the training is done I'll give it a try.
@kisseternity The problem is that #40412 will use logic that is no longer valid for us, so this PR will supersede #40412. Just want to make sure that I don't break things here instead as well 👀
Thanks for checking @maxjeblick 🤗
Thanks for reverting the remaining dead code @vasqu! Indeed, we should NEVER take the varlen path when we don't have an attention mask or native packed format! It was a mistake that this was ever added.
Parts of the flash attention setup were moved to the generate preparation, and this was reverted in #40161.
The prepare in generation had more ripple effects though:
- query_length API can break ulysses-sp patch (#40399)

This PR reverts these changes completely to be aligned with the changes in #40161:
Additional context
The long version explanation: [FA] Remaining Cleanup #40424 (comment)
The short version explanation:
- flash_fn when we have no padding, making anything related to varlen combined with generate obsolete

Fixes #40399
Closes #40412
cc @ArthurZucker @zucchini-nlp @Cyrilvallez