|
1 | | -# llama-cpp-rs |
2 | | -A reimplementation of the parts of Microsoft's [guidance](https://github.com/guidance-ai/guidance) that don't slow things down. Based on [llama.cpp](https://github.com/ggerganov/llama.cpp) with bindings in rust.
| 1 | +# llama-cpp-rs-2 |
3 | 2 |
|
4 | | -## Features |
| 3 | +A wrapper around the [llama-cpp](https://github.com/ggerganov/llama.cpp/) library for rust. |
5 | 4 |
|
6 | | -✅ Guaranteed LLM output formatting (see [formatting](#formatting))
| 5 | +# Goals |
7 | 6 |
|
8 | | -✅ Dynamic prompt templates |
| 7 | +- Safe |
| 8 | +- Up to date (llama-cpp-rs is out of date) |
| 9 | +- Abort free (llama.cpp will abort if you violate its invariants. This library will attempt to prevent that by either
| 10 | +  ensuring the invariants are upheld statically or by checking them ourselves and returning an error; see the sketch below)
| 11 | +- Performant (no meaningful overhead over using llama-cpp-sys-2) |
| 12 | +- Well documented |
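
As a rough sketch of the abort-free goal (illustrative only; the names below are hypothetical and not this crate's actual API), the idea is to check llama.cpp's preconditions on the Rust side and return a `Result` instead of letting the C++ code `abort()` the whole process:

```rust
/// Hypothetical error type, for illustration only.
#[derive(Debug)]
pub enum BatchError {
    /// More tokens were supplied than the batch was created to hold.
    TooManyTokens { capacity: usize, requested: usize },
}

/// Hypothetical wrapper around an FFI call that would otherwise abort on
/// oversized input: the invariant is checked in Rust and surfaced as an error.
pub fn add_tokens(capacity: usize, tokens: &[i32]) -> Result<(), BatchError> {
    if tokens.len() > capacity {
        return Err(BatchError::TooManyTokens {
            capacity,
            requested: tokens.len(),
        });
    }
    // Only now would it be safe to forward to the unsafe llama-cpp-sys-2 call.
    Ok(())
}
```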
9 | 13 |
|
10 | | -✅ Model Quantization |
| 14 | +# Non-goals |
11 | 15 |
|
12 | | -✅ Fast (see [performance](#performance))
| 16 | +- Idiomatic rust (I will prioritize a more direct translation of the C++ API over a more idiomatic rust API due to
| 17 | +  the maintenance burden)
13 | 18 |
|
14 | | -## Prompt storage
| 19 | +# Contributing |
15 | 20 |
|
16 | | -You can store context on the filesystem if it will be reused, or keep the GRPC connection open to keep it in memory. |
17 | | - |
18 | | -## Formatting |
19 | | - |
20 | | -For a very simple example, assume you pass an LLM a transcript: you just sent the user a verification code, but you don't know if they've received it yet, or whether they can even access the 2FA device. You ask the user for the code; they respond and you prompt the LLM.
21 | | - |
22 | | -```` |
23 | | -<transcript> |
24 | | -What is the user's verification code?
25 | | -```yaml |
26 | | -verification code: ' |
27 | | -```` |
28 | | - |
29 | | -A traditional solution (and the only one offered by OpenAI) is to give a stop condition of `'` and hope the LLM fills in a string and stops when it is done. You get *no control* over how it will respond. Without spending extra compute on a longer prompt, you cannot specify that the code is 6 digits or what to output if it does not exist, and even with the longer prompt there is no guarantee it will be followed.
30 | | - |
31 | | -We do things differently by adding the ability to force an LLM's output to follow a regex and allowing bidirectional streaming.
32 | | - |
33 | | -- Given the regex `(true)|(false)` you can force an LLM to only respond with true or false.
34 | | -- Given `([0-9]+)|(null)` you can extract a verification code that a user has given. |
35 | | - |
36 | | -Combining the two leads to something like |
37 | | - |
38 | | -````{ prompt: "<rest>verification code: '" }```` |
39 | | - |
40 | | -````{ generate: "(([0-9]+)|(null))'" }```` |
41 | | - |
42 | | -Which will always output the user's verification code or `null`.
43 | | - |
44 | | -When combined with bidirectional streaming we can do neat things: for example, if the LLM yields a null `verification code`, we can send a second message asking for a `reason` (with the regex `(not arrived)|(unknown)|(device inaccessible)`).
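
As a rough sketch of that exchange (the message types below are assumptions for illustration, not a real `llama-cpp-rpc` API; only the prompt text and regexes come from the example above), the whole interaction is just a stream of `prompt` and `generate` messages:

```rust
/// Hypothetical client-side message types for the streaming exchange.
enum ClientMsg {
    /// Append raw text to the model's context.
    Prompt(String),
    /// Generate text constrained to match this regex.
    Generate { regex: String },
}

/// First round: ask for the verification code, constrained to digits or `null`.
fn verification_code_exchange() -> Vec<ClientMsg> {
    vec![
        ClientMsg::Prompt("<rest>verification code: '".to_string()),
        ClientMsg::Generate {
            regex: "(([0-9]+)|(null))'".to_string(),
        },
    ]
}

/// Second round, sent only if the first answer was `null`: ask for a reason.
fn follow_up_reason() -> ClientMsg {
    ClientMsg::Generate {
        regex: "(not arrived)|(unknown)|(device inaccessible)".to_string(),
    }
}
```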
45 | | - |
46 | | -### Comparisons |
47 | | - |
48 | | -Guidance uses a complex templating syntax. Dynamism is achieved through function calling and conditional statements in a Handlebars-like DSL. The function calling is a security nightmare (especially in a language as dynamic as Python) and conditional templating does not scale.
49 | | - |
50 | | -[lmql](https://lmql.ai/) uses a similar approach in that control flow stays in the "host" language, but it is a superset of Python supported via decorators. Performance is difficult to control, and it is nearly impossible to use in a concurrent setting such as a web server.
51 | | - |
52 | | -We instead put the LLM on a GPU (or several, if more resources are required) and call it using GRPC.
53 | | - |
54 | | -Dynamism is achieved in the client code (where it belongs) by streaming messages back and forth between the client and `llama-cpp-rpc` with minimal overhead.
55 | | - |
56 | | -## Performance
57 | | - |
58 | | -Numbers were measured on a 3090 running a fine-tuned 7B Mistral model (unquantized). With quantization we can run state-of-the-art 70B models on consumer hardware.
59 | | - |
60 | | -||Remote hosting|FS context storage|Concurrency|Raw tok/s|Guided tok/s|
61 | | -|----|----|----|----|----|----|
62 | | -|Llama-cpp-rpc|✅|✅|✅|65|56|
63 | | -|Guidance|❌|❌|❌|30|5|
64 | | -|LMQL|❌|❌|❌|30|10|
65 | | - |
66 | | -## Dependencies |
67 | | - |
68 | | -### Ubuntu |
69 | | - |
70 | | -```bash |
71 | | -sudo apt install -y curl libssl-dev libclang-dev pkg-config cmake git protobuf-compiler |
72 | | -``` |
| 21 | +Contributions are welcome. Please open an issue before starting work on a PR. |