Conversation

@vrdn-23 vrdn-23 commented Nov 6, 2025

What does this PR do?

This PR fixes a consistency issue with how TEI handles GeLU activation compared to the transformers library and the candle library.

It seems that the value gelu is meant to serialize to an old, incorrect version of the GeLU activation (based on the comment given here), as can be seen from this code snippet in transformers:

ACT2CLS = {
    "gelu": GELUActivation,
    "gelu_10": (ClippedGELUActivation, {"min": -10, "max": 10}),
    "gelu_fast": FastGELUActivation,
    "gelu_new": NewGELUActivation,
    "gelu_python": (GELUActivation, {"use_gelu_python": True}),
    "gelu_pytorch_tanh": GELUTanh,
    "gelu_python_tanh": (GELUTanh, {"use_gelu_tanh_python": True}),
    "gelu_accurate": AccurateGELUActivation,
    "laplace": LaplaceActivation,
    "leaky_relu": nn.LeakyReLU,
    "linear": LinearActivation,
    "mish": MishActivation,
    "quick_gelu": QuickGELUActivation,
    "relu": nn.ReLU,
    "relu2": ReLUSquaredActivation,
    "relu6": nn.ReLU6,
...

This means that any config that uses the value gelu for hidden_activation ends up using the GELUActivation function, which is based on torch.erf. The tanh-approximated GeLU is referenced via gelu_new or gelu_pytorch_tanh instead.
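
To make the difference concrete, here is a minimal PyTorch sketch (not TEI or candle code) comparing the erf-based GeLU that transformers resolves gelu to against the tanh approximation behind gelu_new / gelu_pytorch_tanh; the input values are arbitrary:

import math

import torch

x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])

# Exact GeLU: x * 0.5 * (1 + erf(x / sqrt(2))) -- what "gelu" resolves to.
gelu_erf = x * 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))

# Tanh approximation -- what "gelu_new" / "gelu_pytorch_tanh" resolve to.
gelu_tanh = torch.nn.functional.gelu(x, approximate="tanh")

print(gelu_erf)
print(gelu_tanh)
print((gelu_erf - gelu_tanh).abs().max())  # small but nonzero difference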

This is also the behavior followed by the huggingface/candle repository here (gelu corresponds to xs.gelu_erf(), not xs.gelu()).

This PR brings the TEI implementation in line with how transformers parses the config.json values and how candle resolves activations.

I came across this inconsistency while reviewing some of the code changes I had in #746, but thought it should be opened as a separate PR, given that it will slightly change (read: correct) existing model behavior. (h/t to @bbaldino for pointing this out to me)

Please do let me know if I'm missing something obvious here as to why TEI is not in sync with how the activation functions are defined. My understanding is that this is just a bug that got carried over from legacy code introduced in #41.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the documentation guidelines.
  • Did you write any new necessary tests? If applicable, did you include or update the insta snapshots?

Who can review?

@Narsil OR @alvarobartt OR @kozistr


vrdn-23 commented Nov 6, 2025

Okay, so after some more digging, it seems one of the main reasons not to change this would be the speed of gelu_erf() compared to gelu(). I was digging through the candle repository and saw some relevant issues (linked below). I can run some benchmarks later this week to see if there is still a performance loss.

huggingface/candle#1062
huggingface/candle#2418
huggingface/candle#1926 (comment)
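
For reference, a rough micro-benchmark along these lines could look like the sketch below; it uses PyTorch as a stand-in rather than candle's gelu()/gelu_erf(), so the numbers would only be indicative, and the tensor shape and iteration count are arbitrary:

import timeit

import torch

x = torch.randn(512, 1024)

# Exact (erf-based) GeLU vs. the tanh approximation, 1000 calls each.
erf_s = timeit.timeit(lambda: torch.nn.functional.gelu(x), number=1000)
tanh_s = timeit.timeit(
    lambda: torch.nn.functional.gelu(x, approximate="tanh"), number=1000
)

print(f"exact (erf) gelu: {erf_s:.3f}s")
print(f"tanh-approx gelu: {tanh_s:.3f}s")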
