# 🦙 [llama-cpp-rs][readme]   [![Docs]][docs.rs] [![Latest Version]][crates.io] [![License]][crates.io]

[Docs]: https://img.shields.io/docsrs/llama-cpp-2.svg

[Latest Version]: https://img.shields.io/crates/v/llama-cpp-2.svg

[crates.io]: https://crates.io/crates/llama-cpp-2

[docs.rs]: https://docs.rs/llama-cpp-2

[License]: https://img.shields.io/crates/l/llama-cpp-2.svg

[llama-cpp-sys]: https://crates.io/crates/llama-cpp-sys-2

[utilityai]: https://utilityai.ca

[readme]: https://github.com/utilityai/llama-cpp-rs/tree/main/llama-cpp-2

This is the home for [llama-cpp-2][crates.io]. It also contains the [llama-cpp-sys] bindings, which are updated regularly and kept in sync with [llama-cpp-2][crates.io].

This project was created with the explicit goal of staying as up to date as possible with llama.cpp; as a result it is dead simple, very close to raw bindings, and does not follow semver meaningfully.

Check out [docs.rs] for crate documentation or the [readme] for high-level information about the project.
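
To pull the crate into your own project rather than just running the bundled examples, add it like any other dependency. This is the standard cargo workflow, not a project-specific step; enable feature flags such as `cublas` as needed (see the GPU example below):

```bash
cargo add llama-cpp-2
```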

## Try it out!

Clone the repo:

```bash
git clone --recursive https://github.com/utilityai/llama-cpp-rs
```

Enter the directory:

```bash
cd llama-cpp-rs
```

Run the simple example:

```bash
cargo run --release --bin simple "The way to kill a linux process is" hf-model TheBloke/Llama-2-7B-GGUF llama-2-7b.Q4_K_M.gguf
```

Or, if you have a GPU and want to use it:

```bash
cargo run --features cublas --release --bin simple "The way to kill a linux process is" hf-model TheBloke/Llama-2-7B-GGUF llama-2-7b.Q4_K_M.gguf
```

<details>
<summary>Output</summary>
<pre>
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
llama_model_params { n_gpu_layers: 1000, split_mode: 1, main_gpu: 0, tensor_split: 0x0, progress_callback: None, progress_callback_user_data: 0x0, kv_overrides: 0x0, vocab_only: false, use_mmap: true, use_mlock: false }
llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from /home/marcus/.cache/huggingface/hub/models--TheBloke--Llama-2-7B-GGUF/snapshots/b4e04e128f421c93a5f1e34ac4d7ca9b0af47b80/llama-2-7b.Q4_K_M.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = LLaMA v2
llama_model_loader: - kv 2: llama.context_length u32 = 4096
llama_model_loader: - kv 3: llama.embedding_length u32 = 4096
llama_model_loader: - kv 4: llama.block_count u32 = 32
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 11008
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 32
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: general.file_type u32 = 15
llama_model_loader: - kv 11: tokenizer.ggml.model str = llama
llama_model_loader: - kv 12: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 13: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 14: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 15: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 16: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 17: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 18: general.quantization_version u32 = 2
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q4_K: 193 tensors
llama_model_loader: - type q6_K: 33 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format = GGUF V2
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 4096
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 32
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 4096
llm_load_print_meta: n_embd_v_gqa = 4096
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 11008
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 4096
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 6.74 B
llm_load_print_meta: model size = 3.80 GiB (4.84 BPW)
llm_load_print_meta: general.name = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.22 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: CUDA0 buffer size = 3820.94 MiB
llm_load_tensors: CPU buffer size = 70.31 MiB
..................................................................................................
Loaded "/home/marcus/.cache/huggingface/hub/models--TheBloke--Llama-2-7B-GGUF/snapshots/b4e04e128f421c93a5f1e34ac4d7ca9b0af47b80/llama-2-7b.Q4_K_M.gguf"
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 1024.00 MiB
llama_new_context_with_model: KV self size = 1024.00 MiB, K (f16): 512.00 MiB, V (f16): 512.00 MiB
llama_new_context_with_model: CUDA_Host input buffer size = 13.02 MiB
ggml_gallocr_reserve_n: reallocating CUDA0 buffer from size 0.00 MiB to 164.01 MiB
ggml_gallocr_reserve_n: reallocating CUDA_Host buffer from size 0.00 MiB to 8.00 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 164.01 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 8.00 MiB
llama_new_context_with_model: graph splits (measure): 3
n_len = 32, n_ctx = 2048, k_kv_req = 32

The way to kill a linux process is to send it a SIGKILL signal.
The way to kill a windows process is to send it a S

decoded 24 tokens in 0.23 s, speed 105.65 t/s

load time = 727.50 ms
sample time = 0.46 ms / 24 runs (0.02 ms per token, 51835.85 tokens per second)
prompt eval time = 68.52 ms / 9 tokens (7.61 ms per token, 131.35 tokens per second)
eval time = 225.70 ms / 24 runs (9.40 ms per token, 106.34 tokens per second)
total time = 954.18 ms
</pre>
</details>
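
The `simple` binary above exercises the crate end to end. For orientation, here is a rough, untested sketch of what the first few library calls look like when you drive the crate directly. The module paths and method names follow the crate's own `simple` example at the time of writing and should be treated as assumptions that may shift between versions; check [docs.rs] for the release you actually depend on.

```rust
// Hedged sketch: names mirror the crate's `simple` example and may differ
// between versions of llama-cpp-2; the model path and prompt are placeholders.
use llama_cpp_2::context::params::LlamaContextParams;
use llama_cpp_2::llama_backend::LlamaBackend;
use llama_cpp_2::model::params::LlamaModelParams;
use llama_cpp_2::model::{AddBos, LlamaModel};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Initialise the llama.cpp backend (a thin wrapper over llama_backend_init).
    let backend = LlamaBackend::init()?;

    // Load a local GGUF model with default parameters.
    let model_params = LlamaModelParams::default();
    let model = LlamaModel::load_from_file(&backend, "llama-2-7b.Q4_K_M.gguf", &model_params)?;

    // Create an inference context; the decode/sample loop itself is spelled
    // out in the repo's `simple` example.
    let _ctx = model.new_context(&backend, LlamaContextParams::default())?;

    // Tokenize a prompt the same way the simple example does.
    let tokens = model.str_to_token("The way to kill a linux process is", AddBos::Always)?;
    println!("prompt tokenized into {} tokens", tokens.len());

    Ok(())
}
```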

## Hacking

Ensure that when you clone this project you also clone the submodules. This can be done with the following command:
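
```bash
git clone --recursive https://github.com/utilityai/llama-cpp-rs
```

If you have already cloned without `--recursive`, the standard git way to fetch the submodules afterwards is:

```bash
git submodule update --init --recursive
```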