Improve Mistral models integration with llama.cpp #14737
Conversation
Thanks for the contribution. From a developer perspective, it looks like a good approach to avoid any potential tokenization / formatting problems. In general, for all models, using a reference tokenizer instead of relying on the built-in chat templates seems like a good idea. My understanding is that most chat template problems occur during the early days of a model release and tend to get polished and fixed over time. So this approach would be a stable alternative during such periods of instability.
IIRC Mistral's architecture also makes use of sliding window attention (SWA), defaulting to a window size of 4096 tokens, though I don't know all the details (like which layers, if any, are full layers). It would be great if the window size could be stored in the GGUF file as well.
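For illustration only, a minimal sketch of how a window size could be recorded at conversion time, assuming gguf-py's GGUFWriter.add_sliding_window helper and a hypothetical sliding_window entry read from params.json:
import gguf

params = {"sliding_window": 4096}  # stand-in for values read from params.json
writer = gguf.GGUFWriter("model.gguf", "llama")
writer.add_sliding_window(params["sliding_window"])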
Force-pushed from b809a96 to 2865a25
Hey guys, many apologies for the delayed answer, and thanks a lot for your feedback.
Exactly: what's cool with llama.cpp is that you support passing jinja templates when serving, so people can use them once they are correct, if they want, and drop the mistral-common server! Very nice feature.
This is actually for super old (for Deep Learning ^^) models, so we didn't add support for that. Could it be a subsequent PR? Regarding the PR:
Happy to answer more questions :)
Partially, there's also a
@juliendenize Please undo all the formatting/style changes, they are not relevant and add too much noise to the PR, will review afterwards. :)
Ah, would it be OK to bump the Pydantic requirement on your side, or is that a no? Was there a particular reason to stay at 2.6?
Done, sorry about that, my own formatter was on. Is there a formatter or linter available for Python? I didn't find one in the contributing guidelines.
Yes, I think it's ok, it's probably just the version that was available at the time.
We don't use a Python formatter, only
Pillow conflict, should be fine to update:
Right, now we are getting somewhere. :) Edit: The unbound errors are clearly handled at init and can be silenced by
@juliendenize do you also plan to make changes to
Tried to make things cleaner, sorry for the back and forth.
That would be cool indeed. I didn't personally work on Voxtral (the model), so I might need some assistance as I lack experience with audio models. Is Voxtral already supported by llama.cpp? I assumed not, for now.
Yeah, not for now, but I was trying to add support and ran into issues converting to GGUF. That should be easy to add after this PR is merged, so don't worry about it for now :)
Ok, so this: https://github.com/ggml-org/llama.cpp/actions/runs/16500995835/job/46660394829?pr=14737 is actually expected, because we didn't merge the corresponding PR in mistral-common yet. We're in the process of merging it. I'm just adding a final feature, which is being able to call
Ok, ping me when you're ready.
I've just had a deeper look into this PR. One concern though: most of the code inside convert_mistral_to_gguf.py duplicates convert_hf_to_gguf.py. Just thinking, maybe it's better to bring the changes right into convert_hf_to_gguf.py. Btw, I'm also working on converting Voxtral to GGUF. I thought that would be simple, but I'm currently stuck at the tokenizer. Trying a quick hack to copy some code from this PR... will see if it works.
Ok, so as demonstrated in #14862, I think it might be better to merge everything into convert_hf_to_gguf.py.
Sounds good to me.
Hi @ngxson, thanks for the review.
The reason I split the two files was to avoid confusion about what is happening, because here we don't convert HF models. It is indeed a lot of copy-paste from convert_hf_to_gguf.py, but with lots of overriding. We could probably use subclasses, but we would end up overriding whole methods and using few super() calls. Though I'm not entirely sure about that, as I decided to decouple things really early and didn't keep track of it. I can probably achieve something better. Maybe the first step would be to import from convert_hf_to_gguf.py. Then, if you have a strong opinion about merging the two, it could be done more easily.
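For what it's worth, a rough sketch of what the subclassing route could look like (class names, the _set_vocab_mistral helper, and the overrides are illustrative assumptions, not code from this PR; it also assumes convert_hf_to_gguf is importable from the repo root):
from convert_hf_to_gguf import LlamaModel

class MistralModelSketch(LlamaModel):
    model_name = "Mistral"

    def set_vocab(self):
        # Build the vocab from mistral-common's reference tokenizer instead of
        # the Hugging Face tokenizer files (hypothetical helper).
        self._set_vocab_mistral()

    def set_gguf_parameters(self):
        # params.json uses different key names than config.json, so most of the
        # hyperparameter handling would still need to be overridden here.
        super().set_gguf_parameters()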
Hmm, FYI, we can also add an additional flag like
From the perspective of
So overall I still think merging everything into convert_hf_to_gguf.py is the way to go.
Thanks @ngxson, I started the refactoring process. I have a bug that I need to fix (probably on Monday), which is why I didn't push, but I prefer notifying you guys as I don't want to go silent again. You were right, there are very few changes to make!
BTW @CISC we merged the PR and made the release of mistral-common.
FYI we also released Magistral GGUF https://huggingface.co/mistralai/Magistral-Small-2507-GGUF thanks to this PR; it seems to work very smoothly with
Hey @CISC, I think it should be good now; I refactored what was needed. To use the Mistral format in the conversion script, I added the same flag we did for vLLM with
We understand the difficulty; however, we don't plan to release official ones. We have several advantages doing it via
This is why we're trying to find ways to ease the usage of
Please note that when I merge #14862, it will create some conflicts in this PR. I will make a PR to your repo to resolve the conflicts.
In the meantime, I think it's important to test the output from your script. As mentioned in some comments, this is probably not yet working.
convert_hf_to_gguf.py
Outdated
help="Whether the model is stored following the Mistral format.", | ||
) | ||
parser.add_argument( | ||
"--n-ctx", |
Could we maybe have a predefined mapping model --> n_ctx instead? Having this as a required param is quite confusing tbh. Ideally it should be inside params.json, otherwise we should make it such that users are not required to enter it manually. (See the sketch below.)
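For illustration, such a fallback could look something like this (model names and context sizes are examples only, not an authoritative list):
DEFAULT_N_CTX = {
    "mistral-small": 131072,
    "devstral-small": 131072,
}

def resolve_n_ctx(model_name: str, params: dict, cli_n_ctx: int | None) -> int:
    if cli_n_ctx is not None:                 # an explicit --n-ctx still wins
        return cli_n_ctx
    if "max_position_embeddings" in params:   # preferred: read it from params.json
        return params["max_position_embeddings"]
    return DEFAULT_N_CTX[model_name.lower()]  # last resort: predefined mapping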
I added the "max_position_embeddings" field in params.json to a bunch of repos on the Hub. I don't think a mapping would be easy to maintain, so we will make the extra effort on our side.
Thanks for the review, I updated the PR. Tested on Mistral-Small-3.2 with mmproj. I used the jinja template from 3.1.
from datetime import datetime, timedelta
from openai import OpenAI
from huggingface_hub import hf_hub_download

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8080/v1"

TEMP = 0.15
MAX_TOK = 131072

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id


def load_system_prompt(repo_id: str, filename: str) -> str:
    file_path = hf_hub_download(repo_id=repo_id, filename=filename)
    with open(file_path, "r") as file:
        system_prompt = file.read()
    today = datetime.today().strftime("%Y-%m-%d")
    yesterday = (datetime.today() - timedelta(days=1)).strftime("%Y-%m-%d")
    model_name = repo_id.split("/")[-1]
    return system_prompt.format(name=model_name, today=today, yesterday=yesterday)


model_id = "mistralai/Mistral-Small-3.2-24B-Instruct-2506"
SYSTEM_PROMPT = load_system_prompt(model_id, "SYSTEM_PROMPT.txt")

image_url = "https://cdn.lospec.com/thumbnails/gallery/gabigaie/a-little-pikachu-default.png"

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "What is it and which side is the tail ?",
            },
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    },
]

response = client.chat.completions.create(
    model=model,
    messages=messages,
    temperature=TEMP,
    max_tokens=MAX_TOK,
)
print(response.choices[0].message.content)
Based on your second paragraph I didn't rebase, but lmk if you want me to do it.
Thanks, it looks better now.
Yes, please do a rebase. I thought about doing it myself, but it turns out I still haven't had time.
Force-pushed from 10ef34f to 9b5d9a8
Done :)
convert_hf_to_gguf.py
Outdated
if TYPE_CHECKING:
    from torch import Tensor
Move this back to where it was.
convert_hf_to_gguf.py
Outdated
self.gguf_writer.add_add_bos_token(True)
self.gguf_writer.add_add_eos_token(False)
self._set_vocab_mistral()

script_dir = Path(__file__).parent
Remove this whole method and move this code up to set_vocab, where it is called.
Ok, now I got it; ignore the previous comment.
convert_hf_to_gguf.py
Outdated
# layer_norm_eps is not in config.json, it is hard-coded in modeling_pixtral.py
self.hparams["layer_norm_eps"] = self.hparams.get("layer_norm_eps", 1e-5)
self.img_break_tok_id = self.get_token_id("[IMG_BREAK]")
logger.info(f"Image break token id: {self.img_break_tok_id}")
elif self.is_mistral_format:
    self.hparams["layer_norm_eps"] = self.hparams.get("norm_eps", 1e-5)
Unless there are Mistral models that are missing norm_eps, this is now handled in the base set_gguf_parameters and is not needed here.
convert_hf_to_gguf.py
Outdated
model_name = "Mistral"
hf_arch = ""
is_mistral_format = True
undo_permute = True
I thought they weren't permuted?
Changed this; weirdly enough, it was working before and still is now.
But yeah, I checked the mistral -> hf conversion script: we do the permutation there, so we shouldn't have to undo it here.
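For context, a sketch of the Q/K permutation that convert_hf_to_gguf.py applies to HF checkpoints (it mirrors the LlamaModel.permute helper, so treat the details as a sketch rather than authoritative); weights in the native Mistral format never went through the HF layout, so there is nothing to undo:
from torch import Tensor

def permute_qk(weights: Tensor, n_head: int, n_head_kv: int | None) -> Tensor:
    # Reorder each head's rows from the HF half-split rotary layout back to the
    # interleaved pair layout of the original checkpoints.
    if n_head_kv is not None and n_head != n_head_kv:
        n_head = n_head_kv
    return (weights.reshape(n_head, 2, weights.shape[0] // n_head // 2, *weights.shape[1:])
                   .swapaxes(1, 2)
                   .reshape(weights.shape))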
valid_prefixes = (
    "multi_modal_projector.",
    "vision_tower.",
    "vision_encoder.",
    "vision_language_adapter.",
    "patch_merger.",
    "pre_mm_projector_norm",
)
A bit out of scope, but this list can be extracted into a static const inside MmprojModel.TENSOR_PREFIXES
I will do that in another PR, just writing a note here so I won't forget it
convert_hf_to_gguf.py
Outdated
def set_vocab_tekken(self):
    vocab = gguf.vocab.MistralVocab(self.dir_model)
    self.gguf_writer.add_tokenizer_model(vocab.gguf_tokenizer_model)

    tokens = []
    scores = []
    toktypes = []

    for text, score, toktype in vocab.all_tokens():
        tokens.append(text)
        scores.append(score)
        toktypes.append(toktype)

    assert len(tokens) == vocab.vocab_size, (
        f"token count ({len(tokens)}) != vocab size ({vocab.vocab_size})"
    )

    if vocab.tokenizer_type == gguf.vocab.MistralTokenizerType.tekken:
        self.gguf_writer.add_tokenizer_pre("tekken")
        self.gguf_writer.add_token_merges(
            vocab.extract_vocab_merges_from_model()
        )

    logger.info(
        f"Setting bos, eos, unk and pad token IDs to {vocab.bos_id}, {vocab.eos_id}, {vocab.unk_id}, {vocab.pad_id}."
    )

    self.gguf_writer.add_bos_token_id(vocab.bos_id)
    self.gguf_writer.add_eos_token_id(vocab.eos_id)
    self.gguf_writer.add_unk_token_id(vocab.unk_id)
    self.gguf_writer.add_pad_token_id(vocab.pad_id)

    self.gguf_writer.add_token_list(tokens)
    self.gguf_writer.add_token_scores(scores)
    self.gguf_writer.add_token_types(toktypes)
    self.gguf_writer.add_vocab_size(vocab.vocab_size)

    self.gguf_writer.add_add_bos_token(True)
    self.gguf_writer.add_add_eos_token(False)
    self._set_vocab_mistral()
We can simply replace all calls to set_vocab_tekken with self._set_vocab_mistral().
Not sure; in set_vocab_tekken you also have the following:
script_dir = Path(__file__).parent
template_path = script_dir / "models/templates/unsloth-mistral-Devstral-Small-2507.jinja"
with open(template_path, "r", encoding="utf-8") as f:
    template = f.read()
self.gguf_writer.add_chat_template(template)
LMK if you still want to discard this.
Edit: after this comment I understood what you meant; I removed the method and copied the template_path part into the set_vocab method.
convert_hf_to_gguf.py
Outdated
script_dir = Path(__file__).parent
template_path = script_dir / "models/templates/unsloth-mistral-Devstral-Small-2507.jinja"
with open(template_path, "r", encoding="utf-8") as f:
    template = f.read()
self.gguf_writer.add_chat_template(template)
I'm wondering if we can also move this to _set_vocab_mistral(), as for now GGUF files converted via --mistral-format won't have any template built in, so they won't be usable out of the box with tools based on llama.cpp. IMO, forcing people to use both Python (to format the chat) and llama.cpp at the same time may not be a good user experience, so having a built-in jinja template should still be a requirement.
Also, self-note: one thing missing from this code, we should make sure we are using the correct tekken version for the given model, maybe as an assert tekken_json["config"]["version"] == "v7". But I can add it later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good overall, nice contribution!
We can merge once the 2 pending comments are all resolved.
convert_hf_to_gguf.py
Outdated
if not self.is_mistral_format:
    remote_tensors = gguf.utility.SafetensorRemote.get_list_tensors_hf_model(remote_hf_model_id)
else:
    url = f"{gguf.utility.SafetensorRemote.BASE_DOMAIN}/{remote_hf_model_id}/resolve/main/consolidated.safetensors"
    remote_tensors = gguf.utility.SafetensorRemote.get_list_tensors(url)

self.tensor_names = set(name for name in remote_tensors.keys())
for name, remote_tensor in gguf.utility.SafetensorRemote.get_list_tensors_hf_model(remote_hf_model_id).items():
for name, remote_tensor in remote_tensors.items():
    yield (name, LazyTorchTensor.from_remote_tensor(remote_tensor))

self.get_tensors = get_remote_tensors
As discussed about remote_safetensors, I think these changes should also be reverted
I think I answered your comments.
Regarding the remote, I think it would have been nice to allow remote download for the Mistral format, but I reverted as requested.
Regarding chat templates, I created a method for it that should be easily expandable.
remote_tensors = gguf.utility.SafetensorRemote.get_list_tensors_hf_model(remote_hf_model_id)
self.tensor_names = set(name for name in remote_tensors.keys())
for name, remote_tensor in gguf.utility.SafetensorRemote.get_list_tensors_hf_model(remote_hf_model_id).items():
for name, remote_tensor in remote_tensors.items():
I left the change here (while removing the mistral-format case), as remote_tensors was not used in the for loop but was evaluated again.
By removing the mistral-format case, though, Mistral-format models can no longer be downloaded from HF.
@staticmethod
def get_community_chat_template(vocab: MistralVocab, templates_dir: Path):
    assert TokenizerVersion is not None, "mistral_common is not installed"
    assert isinstance(vocab.tokenizer, (Tekkenizer, SentencePieceTokenizer)), (
        f"Expected Tekkenizer or SentencePieceTokenizer, got {type(vocab.tokenizer)}"
    )

    if vocab.tokenizer.version == TokenizerVersion.v1:
        return "mistral-v1"
    elif vocab.tokenizer.version == TokenizerVersion.v3 and vocab.tokenizer_type == MistralTokenizerType.spm:
        return "mistral-v3"
    elif vocab.tokenizer.version == TokenizerVersion.v3 and vocab.tokenizer_type == MistralTokenizerType.tekken:
        return "mistral-v3-tekken"
    elif vocab.tokenizer.version == TokenizerVersion.v7 and vocab.tokenizer_type == MistralTokenizerType.spm:
        return "mistral-v7"
    elif vocab.tokenizer.version == TokenizerVersion.v7 and vocab.tokenizer_type == MistralTokenizerType.tekken:
        return "mistral-v7-tekken"
    elif vocab.tokenizer.version == TokenizerVersion.v11:
        template_file = "Mistral-Small-3.2-24B-Instruct-2506.jinja"
    elif vocab.tokenizer.version == TokenizerVersion.v13:
        template_file = "unsloth-mistral-Devstral-Small-2507.jinja"
    else:
        raise ValueError(f"Unknown tokenizer type: {vocab.tokenizer_type} and version {vocab.tokenizer.version}")

    template_path = templates_dir / template_file
    if not template_path.exists():
        raise FileNotFoundError(f"Template file not found: {template_path}")

    with open(template_path, "r", encoding="utf-8") as f:
        template = f.read()

    return template
This should handle the chat template defaults.
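A hypothetical call site, just to show how the result might be consumed (the converter class name and directory layout are assumptions, not code from this PR):
templates_dir = Path(__file__).parent / "models" / "templates"
chat_template = MistralModel.get_community_chat_template(vocab, templates_dir)
# Short values such as "mistral-v7-tekken" name one of llama.cpp's built-in
# chat templates; longer values are raw jinja read from models/templates/.
self.gguf_writer.add_chat_template(chat_template)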
Description
This PR aims to enhance the integration of Mistral models with llama.cpp by addressing several key issues and introducing new features. Here are the details:
Context
Using mistral-common with llama.cpp
We recommend that users only use the llama-server tool with the /completions route of the server for now, as it is the only one that supports tokens input. We also advise users to set return_tokens=True in their requests to let mistral-common handle detokenization.
Added features
We have added a script to convert Mistral models to GGUF directly from Hugging Face. This script is located at convert_mistral_to_gguf.py and can be used to convert Mistral models to GGUF format.
We registered the Mistral architecture in llama.cpp to support Mistral models natively. This allows users to use Mistral models with llama.cpp without having to convert them to Hugging Face first.
Known Limitations:
Our approach does not support multimodality:
Also, this approach requires users to only use the llama.cpp server with the /completions route.
Example Code
To get started, install mistral-common using the following command:
(Optional) Convert the model
Launch the mistral-common and llama.cpp servers
Launch the mistral-common server:
Launch the llama.cpp server:
Use the servers
Here is a code snippet demonstrating how to use the new features:
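A minimal sketch of this flow (tokenize with mistral-common, send token ids to llama-server's /completions route, detokenize the reply); it uses the mistral-common Python library directly rather than its server, and the request/response fields return_tokens and "tokens" are assumptions based on the description above:
import requests
from huggingface_hub import hf_hub_download
from mistral_common.protocol.instruct.messages import UserMessage
from mistral_common.protocol.instruct.request import ChatCompletionRequest
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer

# Assumes the model repo ships a tekken.json and llama-server listens on localhost:8080.
tokenizer_path = hf_hub_download("mistralai/Mistral-Small-3.2-24B-Instruct-2506", "tekken.json")
tokenizer = MistralTokenizer.from_file(tokenizer_path)

# Let mistral-common build the prompt tokens instead of a jinja chat template.
tokenized = tokenizer.encode_chat_completion(
    ChatCompletionRequest(messages=[UserMessage(content="Who is Albert Einstein?")])
)

resp = requests.post(
    "http://localhost:8080/completions",
    json={"prompt": tokenized.tokens, "n_predict": 256, "return_tokens": True},
)
resp.raise_for_status()

# Detokenize with mistral-common instead of relying on the server's text output.
print(tokenizer.decode(resp.json()["tokens"]))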
Feedback and Contributions
We believe these changes will significantly improve the integration of Mistral models with llama.cpp and provide a better experience for our users. We welcome any feedback or suggestions to further enhance this integration. Also, as we have little experience with the llama.cpp codebase, we welcome any help to improve the integration and make sure we respect the codebase and the community.