Improve Mistral models integration with llama.cpp #14737
Conversation
Thanks for the contribution. From a developer perspective, it looks like a good approach to avoid any potential tokenization / formatting problems. In general, for all models, using a reference tokenizer instead of relying on chat templates seems preferable. My understanding is that most chat template problems occur during the early days of a model release and tend to get polished and fixed with time. So this approach would be a stable alternative during such periods of instability.
IIRC Mistral's architecture also makes use of sliding-window attention (SWA), defaulting to a window size of 4096 tokens, though I don't know all the details (like which layers, if any, are full layers). It would be great if the window size could be stored in the GGUF file as well (e.g. as a dedicated metadata key).
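For reference, a minimal sketch of writing such a value during conversion with gguf-py; `add_sliding_window` is an existing `GGUFWriter` helper, while the file name, architecture, and the 4096 value are placeholders taken from this comment rather than from any model config:

```python
# Hedged sketch: store a sliding-window size as GGUF metadata with gguf-py.
# File name, arch, and the 4096 default are placeholders for illustration.
import gguf

writer = gguf.GGUFWriter("mistral-model.gguf", arch="llama")
writer.add_sliding_window(4096)  # stored as "<arch>.attention.sliding_window"
```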
(force-pushed from b809a96 to 2865a25)
Hey guys, many apologies for the delayed answer, and thanks a lot for your feedback.
Exactly. What's cool with llama.cpp is that you support passing Jinja templates when serving, so people can use them once they are correct and drop the mistral-common server if they want. Very nice feature.
This is actually only for super old (by deep-learning standards ^^) models, so we didn't add support for that. Could it be a subsequent PR? Regarding the PR:
Happy to answer more questions :)
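For reference, serving with a custom template looks roughly like this; `--jinja` and `--chat-template-file` are existing llama-server flags, while the model and template file names are placeholders:

```bash
# Serve a converted Mistral GGUF with a custom Jinja chat template
# (file names are placeholders).
llama-server -m mistral-model-f16.gguf \
  --jinja \
  --chat-template-file ./mistral-chat-template.jinja \
  --port 8080
```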
Partially, there's also a
@juliendenize Please undo all the formatting/style changes, they are not relevant and add too much noise to the PR, will review afterwards. :)
Arf, would it be OK to bump the Pydantic requirement on your side, or is that a no? Was there a particular reason to stay at 2.6?
Done, sorry about that, my own formatter was on. Is there a formatter or linter available for Python? I didn't find one in the contributing guidelines.
Yes, I think it's ok, it's probably just the version that was available at the time.
We don't use a Python formatter, only
Pillow conflict, should be fine to update:
Right, now we are getting somewhere. :) Edit: the unbound errors are clearly handled at init and can be silenced by
@juliendenize do you also plan to make changes to add Voxtral support?
Tried to make things cleaner, sorry for the back and forth.
That would be cool indeed. I didn't personally work on Voxtral (the model side), so I might need some assistance as I lack experience with audio models. Is Voxtral already supported by llama.cpp? I assumed not, for now.
Yeah, not for now, but I was trying to add support and ran into issues converting to GGUF. That should be easy to add after this PR is merged, so don't worry about it for now :)
OK, so this: https://github.com/ggml-org/llama.cpp/actions/runs/16500995835/job/46660394829?pr=14737 is actually expected, because we didn't merge the corresponding PR in mistral-common yet. We're in the process of merging it. I'm just adding a final feature, which is being able to call
Ok, ping me when you're ready.
I've just had a deeper look into this PR. One concern though: most of the code inside convert_mistral_to_gguf.py duplicates what is already in convert_hf_to_gguf.py. Just thinking, maybe it's better to bring it right into convert_hf_to_gguf.py. Btw, I'm also working on converting Voxtral to GGUF. I thought that would be simple, but I'm currently stuck at the tokenizer. Trying a quick hack to copy some code from this PR.. will see if it works.
Ok, so as demonstrated in #14862, I think it might be better to merge everything into convert_hf_to_gguf.py.
Sounds good to me.
Hi @ngxson, thanks for the review.
The reason I split the two files was to avoid confusion about what is happening, because here we don't convert HF models. It is indeed a lot of copy-paste from convert_hf_to_gguf.py, but with lots of overriding. We could probably use subclasses, but we'd end up overriding whole methods and using few super() calls. Though I'm not entirely sure about that, as I decided to decouple things really early and didn't keep track of it; I can probably achieve something better. Maybe the first step would be to import from convert_hf_to_gguf.py. Then, if you have a strong opinion about merging the two, it could be done more easily.
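To make the trade-off concrete, a rough sketch of what the subclassing route could look like; the class and method names below are illustrative assumptions, not the actual code in either script:

```python
# Rough sketch of the subclassing route; "TextModel" and the overridden
# methods are assumptions for illustration, not the actual code.
import gguf
from convert_hf_to_gguf import TextModel


class MistralModel(TextModel):
    model_arch = gguf.MODEL_ARCH.LLAMA  # Mistral reuses the llama graph

    def set_vocab(self):
        # Build the vocab from mistral-common's tokenizer files instead of
        # the Hugging Face tokenizer, rather than calling super().set_vocab().
        ...

    def set_gguf_parameters(self):
        super().set_gguf_parameters()
        # Override only the metadata that differs from the HF-style config.
        ...
```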
Hmm, FYI we can also add an additional flag like --mistral-format to select the Mistral weight/tokenizer format. From the perspective of maintenance, keeping a single conversion entry point is easier.
So overall, I still think merging everything into convert_hf_to_gguf.py is the better option.
Thanks @ngxson, I started the refactoring process. I have a bug that I need to fix (probably on Monday), which is why I didn't push, but I prefer notifying you guys as I don't want to go silent again. You were right, there are very few changes to make! BTW @CISC, we merged the PR and made the release of mistral-common. FYI, we also released Magistral GGUF (https://huggingface.co/mistralai/Magistral-Small-2507-GGUF); thanks to this PR it seems to work very smoothly with llama.cpp.
(force-pushed from 10ef34f to 9b5d9a8)
Done :)
```python
valid_prefixes = (
    "multi_modal_projector.",
    "vision_tower.",
    "vision_encoder.",
    "vision_language_adapter.",
    "patch_merger.",
    "pre_mm_projector_norm",
)
```
A bit out of scope, but this list could be extracted into a static constant, e.g. MmprojModel.TENSOR_PREFIXES.
I will do that in another PR, just writing a note here so I won't forget it.
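A minimal sketch of that refactor for reference; the surrounding class body is assumed, and only the prefix list comes from the diff above:

```python
class MmprojModel:
    # Prefixes identifying vision / multimodal-projector tensors, shared by
    # all mmproj conversions instead of a per-call-site local tuple.
    TENSOR_PREFIXES: tuple[str, ...] = (
        "multi_modal_projector.",
        "vision_tower.",
        "vision_encoder.",
        "vision_language_adapter.",
        "patch_merger.",
        "pre_mm_projector_norm",
    )


# str.startswith accepts a tuple, so call sites can simply do:
# if name.startswith(MmprojModel.TENSOR_PREFIXES): ...
```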
Looks good overall, nice contribution!
We can merge once the 2 pending comments are all resolved.
I think I answered your comments.
Regarding the remote option, I think it would have been nice to allow remote download for the Mistral format, but I reverted it as requested.
Regarding chat templates, I created a method for it that should be easily extendable.
```diff
 remote_tensors = gguf.utility.SafetensorRemote.get_list_tensors_hf_model(remote_hf_model_id)
 self.tensor_names = set(name for name in remote_tensors.keys())
-for name, remote_tensor in gguf.utility.SafetensorRemote.get_list_tensors_hf_model(remote_hf_model_id).items():
+for name, remote_tensor in remote_tensors.items():
```
I left the change here (while removing the Mistral format) since remote_tensors was not used in the for loop but was evaluated again.
By removing the Mistral-format case, Mistral-format models can no longer be downloaded from HF, though.
```python
@staticmethod
def get_community_chat_template(vocab: MistralVocab, templates_dir: Path):
    assert TokenizerVersion is not None, "mistral_common is not installed"
    assert isinstance(vocab.tokenizer, (Tekkenizer, SentencePieceTokenizer)), (
        f"Expected Tekkenizer or SentencePieceTokenizer, got {type(vocab.tokenizer)}"
    )

    if vocab.tokenizer.version == TokenizerVersion.v1:
        return "mistral-v1"
    elif vocab.tokenizer.version == TokenizerVersion.v3 and vocab.tokenizer_type == MistralTokenizerType.spm:
        return "mistral-v3"
    elif vocab.tokenizer.version == TokenizerVersion.v3 and vocab.tokenizer_type == MistralTokenizerType.tekken:
        return "mistral-v3-tekken"
    elif vocab.tokenizer.version == TokenizerVersion.v7 and vocab.tokenizer_type == MistralTokenizerType.spm:
        return "mistral-v7"
    elif vocab.tokenizer.version == TokenizerVersion.v7 and vocab.tokenizer_type == MistralTokenizerType.tekken:
        return "mistral-v7-tekken"
    elif vocab.tokenizer.version == TokenizerVersion.v11:
        template_file = "Mistral-Small-3.2-24B-Instruct-2506.jinja"
    elif vocab.tokenizer.version == TokenizerVersion.v13:
        template_file = "unsloth-mistral-Devstral-Small-2507.jinja"
    else:
        raise ValueError(f"Unknown tokenizer type: {vocab.tokenizer_type} and version {vocab.tokenizer.version}")

    template_path = templates_dir / template_file
    if not template_path.exists():
        raise FileNotFoundError(f"Template file not found: {template_path}")

    with open(template_path, "r", encoding="utf-8") as f:
        template = f.read()

    return template
```
This should handle the chat template defaults.
Thanks for this contribution! A little remark: please also check the current "good enough" Tekken pre-tokenizer for Unicode characters (like Finnish, Thai, Hungarian, etc.).
Hey @juliendenize, I'll do a final review a bit later this week. Since we already have 2 approvals on this PR, I think it's pretty much ready to be merged.
@juliendenize Sorry, I've been pretty busy; @ngxson can merge when ready.
@juliendenize What do you think about these differences in the Tekken tokenizer that affect Unicode characters?
@broadbit-hu If the request is to improve it, that is out of scope right now (at least to me); I'm not sure there were requests for improvements or performance issues raised.
@juliendenize I apologize, I'll be a bit more specific. llama-vocab.cpp handles the Tekken pre-tokenizer here: https://github.com/ggml-org/llama.cpp/blob/master/src/llama-vocab.cpp#L381
The original regex in Mistral's tokenizer.json:
The code in llama.cpp that handles regular expressions has limitations with Unicode characters, so the original regex was replaced with this:
Could you check the differences between your own environment (maybe vLLM) and the llama.cpp pre-tokenizer to see if they cause significant differences in inference (for example with Mistral NeMo, Mistral Small, etc.)?
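Not a full answer, but a small sketch of how such a comparison could be run against a running llama-server instance using its /tokenize endpoint; the reference token IDs (from mistral-common or vLLM) would need to be produced separately, and the server URL and sample strings are assumptions:

```python
# Hedged sketch: inspect llama.cpp's Tekken pre-tokenization of Unicode-heavy
# text, to be diffed against a reference tokenizer (mistral-common / vLLM).
import requests

LLAMA_SERVER = "http://localhost:8080"  # assumption: local llama-server
samples = [
    "Hyvää huomenta, kuinka voit tänään?",  # Finnish
    "สวัสดีครับ วันนี้อากาศดีมาก",          # Thai
    "Jó reggelt, hogy vagy ma?",            # Hungarian
]

for text in samples:
    resp = requests.post(f"{LLAMA_SERVER}/tokenize", json={"content": text})
    resp.raise_for_status()
    llama_tokens = resp.json()["tokens"]
    # Compare against the token IDs the reference tokenizer produces for the
    # same string; any mismatch points at the pre-tokenizer regex.
    print(text, len(llama_tokens), llama_tokens)
```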
@CISC please merge when you're ready
Description
This PR aims to enhance the integration of Mistral models with llama.cpp by addressing several key issues and introducing new features. Here are the details:
Context
Using mistral-common with llama.cpp
We recommend that users only use the `llama-server` tool with the `/completions` route of the server for now, as it is the only one that supports token input. We also advise users to set `return_tokens=True` in their requests to let `mistral-common` handle detokenization.
Added features
- We have added a script to convert Mistral models to GGUF directly from Hugging Face. This script is located at `convert_mistral_to_gguf.py` and can be used to convert Mistral models to GGUF format.
- We registered the Mistral architecture in llama.cpp to support Mistral models natively. This allows users to use Mistral models with llama.cpp without having to convert them to the Hugging Face format first.
Known Limitations:

- Our approach does not support multimodality.
- This approach requires users to only use the llama.cpp server with the `/completions` route.

Example Code
To get started, install mistral-common using the following command:
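Assuming the released package from PyPI is sufficient, the install is along these lines:

```bash
# Install mistral-common from PyPI; depending on the model,
# extra dependencies (e.g. for image inputs) may be required.
pip install mistral-common
```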
(Optional) Convert the model
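A sketch of the conversion step using the script added in this PR; the paths are placeholders and the flag names are assumed to mirror convert_hf_to_gguf.py:

```bash
# Convert a local Mistral-format checkpoint to GGUF
# (paths and --outtype are placeholders).
python convert_mistral_to_gguf.py /path/to/mistral-model \
  --outfile mistral-model-f16.gguf \
  --outtype f16
```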
Launch the mistral-common and llama.cpp servers
Launch the mistral-common server:
Launch the llama.cpp server:
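For the llama.cpp side, something along these lines (model path and port are placeholders):

```bash
# Start llama-server on the converted GGUF; with mistral-common handling
# tokenization, only the /completions route of this server is used.
llama-server -m mistral-model-f16.gguf --port 8080
```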
Use the servers
Here is a code snippet demonstrating how to use the new features:
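A hedged sketch of that flow, tokenizing locally with mistral-common and sending raw tokens to llama-server's /completions route; the tokenizer file path, server URL, and response field names are assumptions:

```python
import requests
from mistral_common.protocol.instruct.messages import UserMessage
from mistral_common.protocol.instruct.request import ChatCompletionRequest
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer

# Load the tokenizer shipped with the model (path is a placeholder).
tokenizer = MistralTokenizer.from_file("/path/to/mistral-model/tekken.json")

# Tokenize the chat request with mistral-common instead of a chat template.
request = ChatCompletionRequest(messages=[UserMessage(content="Who is Albert Einstein?")])
tokens = tokenizer.encode_chat_completion(request).tokens

# Send raw tokens to llama-server's /completions route and ask for tokens back.
response = requests.post(
    "http://localhost:8080/completions",
    json={"prompt": tokens, "n_predict": 256, "return_tokens": True},
)
response.raise_for_status()
generated = response.json()["tokens"]

# Detokenize with mistral-common as well.
print(tokenizer.decode(generated))
```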
Feedback and Contributions
We believe these changes will significantly improve the integration of Mistral models with llama.cpp and provide a better experience for our users. We welcome any feedback or suggestions to further enhance this integration. Also, as we have little experience with the llama.cpp codebase, we welcome any help to improve the integration and make sure we respect the codebase and the community.