[Bugfix] Make Gelu Activations consistent across frameworks #753
What does this PR do?
This PR fixes a consistency issue in how TEI handles the GeLU activation compared to the `transformers` library and the `candle` library.

It seems that the value `gelu` is meant to serialize to an old, incorrect version of the GeLU activation (based on the comment given here), as implemented in this code snippet in `transformers`. This means that any config that uses the value `gelu` for `hidden_activation` should end up using the `GeluActivation` function, which relies on `torch.erf`. The newer GeLU activation is instead referenced as `new_gelu` or `gelu_pytorch_tanh`. This behavior is also what the huggingface/candle repository follows here (`gelu` corresponds to `xs.gelu_erf()` and not `xs.gelu()`).

This PR brings the TEI implementation in line with how `transformers` parses the `config.json` values and how `candle` resolves activations.

I came across this inconsistency while reviewing some of the code changes I had in #746, but thought it should be opened as a separate PR, given that it will slightly change (read: correct) existing model behavior. (h/t to @bbaldino for pointing this out to me)
Please do let me know if I'm missing something obvious here as to why TEI is not in sync with how the activation functions are defined elsewhere. My understanding is that this is just a bug that was carried over from legacy code introduced in #41.
Before submitting
insta snapshots?

Who can review?
@Narsil OR @alvarobartt OR @kozistr