
Conversation

deep1401

Why are these changes needed?

Added support for the Gemma 3 text version. This enables text-only inference for all Gemma 3 models (base as well as instruct).

Related issue number (if applicable)

Closes #3697

if device_map == "sequential":
device_map = "auto"
# print("From pretrained kwargs", from_pretrained_kwargs)
tokenizer = AutoTokenizer.from_pretrained(model_path, revision=revision)
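For context, the hunk above remaps device_map="sequential" to "auto" before loading the tokenizer. Below is a minimal standalone sketch of that loading path; it is not the PR's actual code, and the function name, AutoModelForCausalLM (appropriate for the text-only checkpoints), and the dtype are assumptions.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def load_gemma3_text(model_path, revision="main", device_map="sequential"):
    # Mirror the hunk above: Gemma 3 is loaded with device_map="auto"
    # even when "sequential" was requested.
    if device_map == "sequential":
        device_map = "auto"
    tokenizer = AutoTokenizer.from_pretrained(model_path, revision=revision)
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        revision=revision,
        device_map=device_map,
        torch_dtype=torch.bfloat16,  # assumed; not specified in the hunk
    )
    return model, tokenizer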


Hi,

I have a small suggestion:

Suggested change
- tokenizer = AutoTokenizer.from_pretrained(model_path, revision=revision)
+ tokenizer = AutoTokenizer.from_pretrained(model_path, revision=revision, pad_to_multiple_of=8)

See this similar issue in huggingface/transformers: huggingface/transformers#36815

Some prompts may trigger an error similar to the following:

ERROR | stderr | Exception in thread Thread-5 (<lambda>):
ERROR | stderr | Traceback (most recent call last):
ERROR | stderr |   File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
ERROR | stderr |     self.run()
ERROR | stderr |   File "/usr/lib/python3.10/threading.py", line 953, in run
ERROR | stderr |     self._target(*self._args, **self._kwargs)
ERROR | stderr |   File "/home/example/projects/FastChat/fastchat/model/model_gemma3.py", line 81, in <lambda>
ERROR | stderr |     target=lambda: model.generate(input_ids=input_ids, **generate_kwargs)
ERROR | stderr |   File "/home/example/projects/fastchat-venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
ERROR | stderr |     return func(*args, **kwargs)
ERROR | stderr |   File "/home/example/projects/fastchat-venv/lib/python3.10/site-packages/transformers/generation/utils.py", line 2465, in generate
ERROR | stderr |     result = self._sample(
ERROR | stderr |   File "/home/example/projects/fastchat-venv/lib/python3.10/site-packages/transformers/generation/utils.py", line 3434, in _sample
ERROR | stderr |     outputs = model_forward(**model_inputs, return_dict=True)
ERROR | stderr |   File "/home/example/projects/fastchat-venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
ERROR | stderr |     return self._call_impl(*args, **kwargs)
ERROR | stderr |   File "/home/example/projects/fastchat-venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
ERROR | stderr |     return forward_call(*args, **kwargs)
ERROR | stderr |   File "/home/example/projects/fastchat-venv/lib/python3.10/site-packages/transformers/utils/generic.py", line 965, in wrapper
ERROR | stderr |     output = func(self, *args, **kwargs)
ERROR | stderr |   File "/home/example/projects/fastchat-venv/lib/python3.10/site-packages/transformers/utils/deprecation.py", line 172, in wrapped_func
ERROR | stderr |     return func(*args, **kwargs)
ERROR | stderr |   File "/home/example/projects/fastchat-venv/lib/python3.10/site-packages/transformers/models/gemma3/modeling_gemma3.py", line 942, in forward
ERROR | stderr |     outputs: BaseModelOutputWithPast = self.model(
ERROR | stderr |   File "/home/example/projects/fastchat-venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
ERROR | stderr |     return self._call_impl(*args, **kwargs)
ERROR | stderr |   File "/home/example/projects/fastchat-venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
ERROR | stderr |     return forward_call(*args, **kwargs)
ERROR | stderr |   File "/home/example/projects/fastchat-venv/lib/python3.10/site-packages/transformers/utils/generic.py", line 965, in wrapper
ERROR | stderr |     output = func(self, *args, **kwargs)
ERROR | stderr |   File "/home/example/projects/fastchat-venv/lib/python3.10/site-packages/transformers/models/gemma3/modeling_gemma3.py", line 722, in forward
ERROR | stderr |     layer_outputs = decoder_layer(
ERROR | stderr |   File "/home/example/projects/fastchat-venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
ERROR | stderr |     return self._call_impl(*args, **kwargs)
ERROR | stderr |   File "/home/example/projects/fastchat-venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
ERROR | stderr |     return forward_call(*args, **kwargs)
ERROR | stderr |   File "/home/example/projects/fastchat-venv/lib/python3.10/site-packages/transformers/models/gemma3/modeling_gemma3.py", line 420, in forward
ERROR | stderr |     hidden_states, self_attn_weights = self.self_attn(
ERROR | stderr |   File "/home/example/projects/fastchat-venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
ERROR | stderr |     return self._call_impl(*args, **kwargs)
ERROR | stderr |   File "/home/example/projects/fastchat-venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
ERROR | stderr |     return forward_call(*args, **kwargs)
ERROR | stderr |   File "/home/example/projects/fastchat-venv/lib/python3.10/site-packages/transformers/models/gemma3/modeling_gemma3.py", line 342, in forward
ERROR | stderr |     attn_output, attn_weights = attention_interface(
ERROR | stderr |   File "/home/example/projects/fastchat-venv/lib/python3.10/site-packages/transformers/integrations/sdpa_attention.py", line 54, in sdpa_attention_forward
ERROR | stderr |     attn_output = torch.nn.functional.scaled_dot_product_attention(
ERROR | stderr | RuntimeError: p.attn_bias_ptr is not correctly aligned
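For illustration, here is a hedged sketch of what pad_to_multiple_of=8 does at tokenization time: the padded prompt length is rounded up to a multiple of 8, which keeps the SDPA attention-bias tensor aligned. The checkpoint name is an assumption, not taken from the PR.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-1b-it")  # assumed checkpoint
batch = tokenizer(
    ["Why is the sky blue?"],
    padding=True,
    pad_to_multiple_of=8,  # round the padded length up to the next multiple of 8
    return_tensors="pt",
)
print(batch["input_ids"].shape[-1] % 8)  # prints 0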

deep1401 (Author)

Hi,
Thanks for this. We actually ended up creating https://www.github.com/transformerlab/transformerlab-inference and use that instead, since FastChat hasn't been merging PRs and has stopped new development.
This model is added there and works without flash attention, which was causing your original issue. Please let me know if the error also occurs for you without flash attention.
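If you want to check this on your side, one way to bypass the fused SDPA/flash-attention kernels is to force the "eager" attention backend when loading the model. A hedged sketch follows; the checkpoint name and dtype are assumptions.

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-3-1b-it",       # assumed checkpoint
    torch_dtype=torch.bfloat16,   # assumed dtype
    attn_implementation="eager",  # bypass the fused SDPA/flash-attention kernels
    device_map="auto",
)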


fuckgitb commented Jul 21, 2025 via email
