
Commit a913fd3

models(gallery): add hermes-3-llama-3.1(8B,70B,405B) with vLLM (#3360)
models(gallery): add hermes-3-llama-3.1 with vLLM

It adds the 8B, 70B, and 405B variants to the gallery.

Signed-off-by: Ettore Di Giacinto <[email protected]>
1 parent fbaae85 commit a913fd3

File tree: 3 files changed (+152, -0)

gallery/hermes-vllm.yaml

Lines changed: 91 additions & 0 deletions
@@ -0,0 +1,91 @@

```yaml
---
name: "hermes-vllm"

config_file: |
  backend: vllm
  context_size: 8192
  stopwords:
  - "<|im_end|>"
  - "<dummy32000>"
  - "<|eot_id|>"
  - "<|end_of_text|>"
  function:
    disable_no_action: true
    grammar:
      # Uncomment the line below to enable grammar matching for JSON results if the model is breaking
      # the output. This will make the model more accurate and won't break the JSON output.
      # This however, will make parallel_calls not functional (it is a known bug)
      # mixed_mode: true
      disable: true
      parallel_calls: true
      expect_strings_after_json: true
    json_regex_match:
    - "(?s)<tool_call>(.*?)</tool_call>"
    - "(?s)<tool_call>(.*)"
    capture_llm_results:
    - (?s)<scratchpad>(.*?)</scratchpad>
    replace_llm_results:
    - key: (?s)<scratchpad>(.*?)</scratchpad>
      value: ""

  template:
    use_tokenizer_template: true
    chat: |
      {{.Input -}}
      <|im_start|>assistant
    chat_message: |
      <|im_start|>{{if eq .RoleName "assistant"}}assistant{{else if eq .RoleName "system"}}system{{else if eq .RoleName "tool"}}tool{{else if eq .RoleName "user"}}user{{end}}
      {{- if .FunctionCall }}
      <tool_call>
      {{- else if eq .RoleName "tool" }}
      <tool_response>
      {{- end }}
      {{- if .Content}}
      {{.Content }}
      {{- end }}
      {{- if .FunctionCall}}
      {{toJson .FunctionCall}}
      {{- end }}
      {{- if .FunctionCall }}
      </tool_call>
      {{- else if eq .RoleName "tool" }}
      </tool_response>
      {{- end }}<|im_end|>
    completion: |
      {{.Input}}
    function: |
      <|im_start|>system
      You are a function calling AI model.
      Here are the available tools:
      <tools>
      {{range .Functions}}
      {'type': 'function', 'function': {'name': '{{.Name}}', 'description': '{{.Description}}', 'parameters': {{toJson .Parameters}} }}
      {{end}}
      </tools>
      You should call the tools provided to you sequentially
      Please use <scratchpad> XML tags to record your reasoning and planning before you call the functions as follows:
      <scratchpad>
      {step-by-step reasoning and plan in bullet points}
      </scratchpad>
      For each function call return a json object with function name and arguments within <tool_call> XML tags as follows:
      <tool_call>
      {"arguments": <args-dict>, "name": <function-name>}
      </tool_call><|im_end|>
      {{.Input -}}
      <|im_start|>assistant
  # Uncomment to specify a quantization method (optional)
  # quantization: "awq"
  # Uncomment to limit the GPU memory utilization (vLLM default is 0.9 for 90%)
  # gpu_memory_utilization: 0.5
  # Uncomment to trust remote code from huggingface
  # trust_remote_code: true
  # Uncomment to enable eager execution
  # enforce_eager: true
  # Uncomment to specify the size of the CPU swap space per GPU (in GiB)
  # swap_space: 2
  # Uncomment to specify the maximum length of a sequence (including prompt and output)
  # max_model_len: 32768
  # Uncomment and specify the number of Tensor divisions.
  # Allows you to partition and run large models. Performance gains are limited.
  # https://github.com/vllm-project/vllm/issues/1435
  # tensor_parallel_size: 2
```
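The function-calling section of this config leans on plain regular expressions rather than grammar constraints: `json_regex_match` pulls the JSON payload out of the model's `<tool_call>` tags, and `capture_llm_results`/`replace_llm_results` capture and then strip the `<scratchpad>` planning block from the final answer. A small Go sketch of how those two patterns behave (the sample model output is invented for illustration, and this is not LocalAI's actual extraction code):

```go
package main

import (
	"fmt"
	"regexp"
)

// Hypothetical model output: planning inside <scratchpad> tags followed
// by a function call wrapped in <tool_call> tags, as the function
// template above instructs.
const llmOutput = `<scratchpad>
- look up the weather for the user
</scratchpad>
<tool_call>
{"arguments": {"city": "Rome"}, "name": "get_weather"}
</tool_call>`

func main() {
	// Same pattern as json_regex_match: (?s) makes "." match newlines,
	// so the JSON payload can span multiple lines.
	toolCall := regexp.MustCompile(`(?s)<tool_call>(.*?)</tool_call>`)
	if m := toolCall.FindStringSubmatch(llmOutput); m != nil {
		fmt.Println("extracted tool call:", m[1])
	}

	// replace_llm_results maps the scratchpad block to "", i.e. the
	// planning text is removed from the result returned to the caller.
	scratchpad := regexp.MustCompile(`(?s)<scratchpad>(.*?)</scratchpad>`)
	fmt.Println("cleaned output:", scratchpad.ReplaceAllString(llmOutput, ""))
}
```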

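The `chat` and `chat_message` templates are Go text/template snippets that assemble a ChatML prompt. Below is a minimal, self-contained sketch of rendering a simplified `chat_message` (the tool/tool_response branches are dropped for brevity, and `toJson` is re-implemented here just for the demo; in LocalAI it is provided by the template engine):

```go
package main

import (
	"encoding/json"
	"os"
	"text/template"
)

// Simplified version of the chat_message template above.
const chatMessage = `<|im_start|>{{if eq .RoleName "assistant"}}assistant{{else if eq .RoleName "system"}}system{{else if eq .RoleName "user"}}user{{end}}
{{- if .FunctionCall }}
<tool_call>
{{- end }}
{{- if .Content}}
{{.Content }}
{{- end }}
{{- if .FunctionCall}}
{{toJson .FunctionCall}}
{{- end }}
{{- if .FunctionCall }}
</tool_call>
{{- end }}<|im_end|>`

// Field names mirror the template: .RoleName, .Content, .FunctionCall.
type message struct {
	RoleName     string
	Content      string
	FunctionCall map[string]any
}

func main() {
	tmpl := template.Must(template.New("msg").Funcs(template.FuncMap{
		"toJson": func(v any) string { b, _ := json.Marshal(v); return string(b) },
	}).Parse(chatMessage))

	// A plain user message renders as a ChatML block:
	//   <|im_start|>user
	//   Hello!<|im_end|>
	_ = tmpl.Execute(os.Stdout, message{RoleName: "user", Content: "Hello!"})
}
```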
gallery/index.yaml

Lines changed: 32 additions & 0 deletions
```diff
@@ -4752,6 +4752,38 @@
     - filename: Hermes-3-Llama-3.1-70B.Q4_K_M.gguf
       sha256: 955c2f42caade4278f3c9dbffa32bb74572652b20e49e5340e782de3585bbe3f
       uri: huggingface://NousResearch/Hermes-3-Llama-3.1-70B-GGUF/Hermes-3-Llama-3.1-70B.Q4_K_M.gguf
+- &hermes-vllm
+  url: "github:mudler/LocalAI/gallery/hermes-vllm.yaml@master"
+  name: "hermes-3-llama-3.1-8b:vllm"
+  icon: https://cdn-uploads.huggingface.co/production/uploads/6317aade83d8d2fd903192d9/vG6j5WxHX09yj32vgjJlI.jpeg
+  tags:
+  - llm
+  - vllm
+  - gpu
+  - function-calling
+  license: llama-3
+  urls:
+  - https://huggingface.co/NousResearch/Hermes-3-Llama-3.1-8B
+  description: |
+    Hermes 3 is a generalist language model with many improvements over Hermes 2, including advanced agentic capabilities, much better roleplaying, reasoning, multi-turn conversation, long context coherence, and improvements across the board. It is designed to focus on aligning LLMs to the user, with powerful steering capabilities and control given to the end user. The model uses ChatML as the prompt format, opening up a much more structured system for engaging the LLM in multi-turn chat dialogue. It also supports function calling and structured output capabilities, generalist assistant capabilities, and improved code generation skills.
+  overrides:
+    parameters:
+      model: NousResearch/Hermes-3-Llama-3.1-8B
+- !!merge <<: *hermes-vllm
+  name: "hermes-3-llama-3.1-70b:vllm"
+  urls:
+  - https://huggingface.co/NousResearch/Hermes-3-Llama-3.1-70B
+  overrides:
+    parameters:
+      model: NousResearch/Hermes-3-Llama-3.1-70B
+- !!merge <<: *hermes-vllm
+  name: "hermes-3-llama-3.1-405b:vllm"
+  icon: https://cdn-uploads.huggingface.co/production/uploads/6317aade83d8d2fd903192d9/-kj_KflXsdpcZoTQsvx7W.jpeg
+  urls:
+  - https://huggingface.co/NousResearch/Hermes-3-Llama-3.1-405B
+  overrides:
+    parameters:
+      model: NousResearch/Hermes-3-Llama-3.1-405B
 - !!merge <<: *hermes-2-pro-mistral
   name: "biomistral-7b"
   description: |
```

gallery/vllm.yaml

Lines changed: 29 additions & 0 deletions
@@ -0,0 +1,29 @@

```yaml
---
name: "vllm"

config_file: |
  backend: vllm
  function:
    disable_no_action: true
    grammar:
      disable: true
      parallel_calls: true
      expect_strings_after_json: true
  template:
    use_tokenizer_template: true
  # Uncomment to specify a quantization method (optional)
  # quantization: "awq"
  # Uncomment to limit the GPU memory utilization (vLLM default is 0.9 for 90%)
  # gpu_memory_utilization: 0.5
  # Uncomment to trust remote code from huggingface
  # trust_remote_code: true
  # Uncomment to enable eager execution
  # enforce_eager: true
  # Uncomment to specify the size of the CPU swap space per GPU (in GiB)
  # swap_space: 2
  # Uncomment to specify the maximum length of a sequence (including prompt and output)
  # max_model_len: 32768
  # Uncomment and specify the number of Tensor divisions.
  # Allows you to partition and run large models. Performance gains are limited.
  # https://github.com/vllm-project/vllm/issues/1435
  # tensor_parallel_size: 2
```
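With the gallery entries in place, any of the three models can be installed and then queried through LocalAI's OpenAI-compatible API. A minimal sketch of a chat request in Go, assuming a local instance on the default port 8080 with hermes-3-llama-3.1-8b:vllm already installed (host, port, and prompt are placeholders):

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
)

func main() {
	// OpenAI-style chat completion request against a local LocalAI
	// instance; the model name matches the gallery entry added above.
	body := []byte(`{
		"model": "hermes-3-llama-3.1-8b:vllm",
		"messages": [{"role": "user", "content": "Hello!"}]
	}`)

	resp, err := http.Post("http://localhost:8080/v1/chat/completions",
		"application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// The response follows the OpenAI chat completion schema.
	out, _ := io.ReadAll(resp.Body)
	fmt.Println(string(out))
}
```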
