
Commit 678f132

Merge pull request #2234 from madeline-underwood/distrib_int
Distrib int_PV to review
2 parents 32460b6 + 2396b66 commit 678f132

4 files changed (+140, −75 lines)


content/learning-paths/servers-and-cloud-computing/distributed-inference-with-llama-cpp/_index.md

Lines changed: 8 additions & 12 deletions
@@ -1,23 +1,19 @@
11
---
22
title: Distributed inference using llama.cpp
33

4-
draft: true
5-
cascade:
6-
draft: true
7-
84
minutes_to_complete: 30
95

10-
who_is_this_for: This learning path is for developers with some experience using llama.cpp who want to learn about distributed inference.
6+
who_is_this_for: This introductory topic is for developers with some experience using llama.cpp who want to learn distributed inference.
117

128
learning_objectives:
13-
- Set up the main host and worker nodes using llama.cpp
14-
- Run a large quantized model (e.g., Llama 3.1 405B) on CPUs in a distributed manner on Arm machines
9+
- Set up a main host and worker nodes with llama.cpp
10+
- Run a large quantized model (for example, Llama 3.1 405B) with distributed CPU inference on Arm machines
1511

1612
prerequisites:
17-
- Three AWS c8g.16xlarge instances with at least 2TB EBS space.
18-
- Python installed on the AWS instances.
19-
- Access to Metas gated repository for the Llama 3.1 model family, with a Hugging Face token generated for downloading the models.
20-
- Familiarity with -> [Deploy a Large Language Model (LLM) chatbot with llama.cpp using KleidiAI on Arm servers](/learning-paths/servers-and-cloud-computing/llama-cpu)
13+
- Three AWS c8g.16xlarge instances with at least 2 TB of EBS storage
14+
- Python 3 installed on each instance
15+
- Access to Meta's gated repository for the Llama 3.1 model family and a Hugging Face token to download models
16+
- Familiarity with the Learning Path [Deploy a Large Language Model (LLM) chatbot with llama.cpp using KleidiAI on Arm servers](/learning-paths/servers-and-cloud-computing/llama-cpu)
2117
- Familiarity with AWS
2218

2319
author: Aryan Bhusari
@@ -38,7 +34,7 @@ operatingsystems:
3834

3935
further_reading:
4036
- resource:
41-
title: Llama.cpp rpc-server code
37+
title: llama.cpp RPC server code
4238
link: https://github.com/ggml-org/llama.cpp/tree/master/tools/rpc
4339
type: Code
4440

content/learning-paths/servers-and-cloud-computing/distributed-inference-with-llama-cpp/how-to-1.md

Lines changed: 66 additions & 26 deletions
@@ -1,39 +1,46 @@
11
---
2-
title: Convert model to gguf and quantize
2+
title: Convert model to GGUF and quantize
33
weight: 2
44

55
### FIXED, DO NOT MODIFY
66
layout: learningpathall
77
---
8+
89
## Overview
9-
This example will run on three AWS Graviton4 c8g.16xlarge instances with 64 cores and 128GB of RAM. The instances should have 2TB disk storage, to store downloaded and quantized model weights.
1010

11-
You will perform these steps in this Learning Path:
11+
This example runs on three AWS Graviton4 `c8g.16xlarge` instances. Each instance has 64 cores, 128 GB of RAM, and 2 TB of disk storage to store the downloaded and quantized model weights.
12+
13+
In this Learning Path, you will:
14+
15+
- Download Meta's [Llama 3.1 405B parameter model](https://huggingface.co/meta-llama/Llama-3.1-405B).
16+
- Download and build `llama.cpp`, a C++ library for efficient inference of LLaMA and similar large language models on CPUs, optimized for local and embedded environments.
17+
- Convert Meta's `safetensors` files to a single GGUF file.
18+
- Quantize the 16-bit GGUF weights file to 4-bit weights.
19+
- Load and run the model.
1220

13-
1. Download Meta's [405B parameter llama 3.1 model](https://huggingface.co/meta-llama/Llama-3.1-405B).
14-
2. Download and build llama.cpp, a C++ library that enables efficient inference of LLaMA and similar large language models on CPUs, optimized for local and embedded environments.
15-
3. Convert Meta's safetensors files to a single gguf file.
16-
4. Quantize the 16 bit gguf weights file to 4 bit weights.
17-
5. Load and run the model.
21+
{{% notice Note %}}
22+
The **Reading time** shown on the **Introduction** page does not include downloading, converting, and quantizing the model. These steps can take more than six hours. If you already have a quantized GGUF file, you can skip the download and quantization.
23+
{{% /notice %}}
1824

19-
{{% notice Note %}}The "reading time" mentioned on the Introduction page doesn't include downloading, converting, and requantizing the model. The process mentioned on this page will take 6+ hours. You may skip the model download and quantization if you have a quantized gguf file ready to use.{{% /notice %}}
25+
## Set up dependencies
2026

21-
## Procedure
22-
First, ensure you have permissions to access to Meta's [405B parameter llama 3.1 model](https://huggingface.co/meta-llama/Llama-3.1-405B).
27+
Before you start, make sure you have permission to access Meta's [Llama 3.1 405B parameter model](https://huggingface.co/meta-llama/Llama-3.1-405B).
2328

2429
{{% notice Note %}}
25-
Remember that you will need to replicate the install steps below on each device. Do NOT replicate the download and quantization step, llama.cpp will send the tensors to the cache.
30+
You must repeat the install steps on each device. However, run the download and quantization steps only once; `llama.cpp` caches the tensors for reuse across devices.
2631
{{% /notice %}}
2732

28-
##### 1. Generate a virtual environment
33+
## Create a virtual environment
2934

3035
```bash
3136
apt update
3237
apt install python3.12-venv
3338
python3 -m venv myenv
3439
source myenv/bin/activate
3540
```
36-
##### 2. Clone the llama.cpp repo and build dependencies
41+
42+
## Clone the llama.cpp repo and build dependencies
43+
3744
```bash
3845
git clone https://github.com/ggerganov/llama.cpp
3946
apt install -y cmake build-essential
@@ -45,54 +52,87 @@ cd build-rpc
4552
cmake .. -DGGML_RPC=ON -DLLAMA_BUILD_SERVER=ON
4653
cmake --build . --config Release
4754
```
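With the default Makefiles generator, `cmake --build` runs a single job. As an optional variant (same command, run from the same `build-rpc` directory), you can use all cores of the `c8g.16xlarge` instance to speed up the build:

```bash
# Optional: build in parallel across all available cores
cmake --build . --config Release -j "$(nproc)"
```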
48-
`llama.cpp` is now built in the `build-rpc/bin` directory.
49-
Check that `llama.cpp` has built correctly by running the help command:
55+
56+
The build output is placed in the `build-rpc/bin` directory.
57+
58+
Verify that the build succeeded by running the help command:
59+
5060
```bash
5161
cd build-rpc
5262
bin/llama-cli -h
5363
```
5464

55-
##### 3. Download the model (on a single instance)
56-
Install Huggingface Hub in the virtual environment:
65+
## Download the model (single instance)
66+
67+
Install Hugging Face Hub in your virtual environment:
68+
5769
```bash
5870
pip3 install huggingface_hub
59-
6071
```
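The download script in the next step passes your Hugging Face token directly as a string. As an optional alternative (a sketch; `huggingface_hub` also reads the `HF_TOKEN` environment variable), you can authenticate once from the shell and omit the hard-coded token:

```bash
# Optional: store the token once instead of embedding it in download.py
export HF_TOKEN="your_hf_token"      # replace with your Hugging Face token
huggingface-cli login --token "$HF_TOKEN"
```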
61-
Make a python file and name it download.py:
72+
73+
Create a new Python file named `download.py`:
74+
6275
```bash
6376
vi download.py
6477
```
65-
Write the following code to it:
78+
79+
Add the following code:
80+
6681
```python
6782
import os
6883
from huggingface_hub import snapshot_download
84+
6985
model_id = "meta-llama/Llama-3.1-405B"
7086
local_dir = "llama-hf"
87+
7188
# Create the directory if it doesn't exist
7289
os.makedirs(local_dir, exist_ok=True)
90+
7391
# Download the model snapshot
7492
snapshot_download( repo_id=model_id, local_dir=local_dir,
7593
revision="main",
7694
token="your_hf_token",
7795
allow_patterns=["*.md", "*.json", "*.safetensors"]
7896
)
7997
```
80-
Execute the file:
98+
99+
Run the script:
100+
81101
```bash
82102
python3 download.py
83103
```
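The 405B model download is several hundred gigabytes, so it is worth confirming it completed before converting. A quick optional check, assuming the `llama-hf` directory used in the script above:

```bash
# Confirm the safetensors shards and config files are present, and check the total size
ls llama-hf | head
du -sh llama-hf
```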
84-
##### 4. Convert the model from .safetensors to gguf and quantize (on a single instance)
85-
Following lines installs the files important for conversion to .gguf format.
104+
105+
## Convert and quantize the model (single instance)
106+
107+
Install the conversion dependencies:
108+
86109
```bash
87110
pip3 install -r llama.cpp/requirements.txt
111+
```
112+
113+
Convert the model:
114+
115+
```bash
88116
python3 llama.cpp/convert_hf_to_gguf.py llama-hf
117+
```
118+
119+
Quantize the model to 4-bit weights:
120+
121+
```bash
89122
cd llama.cpp/build-rpc
90-
bin/llama-quantize ../../llama-hf/llama-3.1-405B-F16.gguf Q4_0
123+
bin/llama-quantize ../../llama-hf/llama-3.1-405B-F16.gguf Q4_0
91124
```
92-
You may rename the resultant file to model.gguf and use it. There are different quantization options as well, as shown below:
125+
126+
You can rename the output file to `model.gguf` for easier use, as shown in the sketch below.
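A minimal sketch of the rename, run from `llama.cpp/build-rpc`. The source filename here is an assumption (the default output name depends on your llama.cpp version), so substitute the path that `llama-quantize` actually printed:

```bash
# Hypothetical default output name: replace it with the quantized file produced in the previous step
mv ../../llama-hf/ggml-model-Q4_0.gguf ../../model.gguf
```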
127+
128+
Check available quantization options:
129+
93130
```bash
94131
bin/llama-quantize -h
95132
```
133+
134+
This command lists supported quantization formats and options. For example:
135+
96136
```output
97137
usage: bin/llama-quantize [--help] [--allow-requantize] [--leave-output-tensor] [--pure] [--imatrix] [--include-weights] [--exclude-weights] [--output-tensor-type]
98138
[--token-embedding-type] [--tensor-type] [--keep-split] [--override-kv] model-f32.gguf [model-quant.gguf] type [nthreads]
Lines changed: 31 additions & 15 deletions
@@ -1,31 +1,46 @@
11
---
2-
title: Worker Node Configuration
2+
title: Configure the worker nodes
33
weight: 3
44

55
### FIXED, DO NOT MODIFY
66
layout: learningpathall
77
---
8-
## Cluster overview
9-
llama.cpp is a C++ library that enables efficient inference of LLaMA and similar large language models on CPUs, optimized for local and embedded environments. Just over a year ago from the publication date of this article, rgerganov’s RPC code was merged into llama.cpp, enabling distributed inference of large LLMs across multiple CPU-based machines—even when the models don’t fit into the memory of a single machine. In this learning path, we’ll explore how to run a 405B parameter model on Arm-based CPUs.
108

11-
For the purposes of this demonstration, the following experimental setup will be used:
12-
- Total number of instances: 3
13-
- Instance type: c8g.16xlarge
14-
- Model: model.gguf (Llama-3.1-405B_Q4_0)
9+
## Overview of the cluster
1510

16-
One of the three nodes will serve as the master node, which physically hosts the model file. The other two nodes will act as worker nodes. In llama.cpp, remote procedure calls (RPC) are used to offload both the model and the computation over TCP connections between nodes. The master node forwards inference requests to the worker nodes, where all the actual computation is performed.
11+
`llama.cpp` is a C++ library that enables efficient inference of LLaMA and similar large language models on CPUs, optimized for local and embedded environments.
1712

18-
## Cluster setup
13+
Just over a year before this Learning Path was published, Radoslav Gerganov's (rgerganov) RPC code was merged into `llama.cpp`. This feature enables distributed inference of large LLMs across multiple CPU-based machines, even when the models don’t fit into the memory of a single machine.
1914

20-
Choose two of the three devices to act as backend workers. If the devices had varying compute capacities, the ones with the highest compute should be selected—especially for a 405B model. However, since all three devices have identical compute capabilities in this case, you can select any two to serve as backend workers.
15+
In this Learning Path, you’ll explore how to run a 405B parameter model on Arm-based CPUs.
16+
17+
For this demonstration, the experimental setup includes:
18+
19+
- Number of instances: 3
20+
- Instance type: `c8g.16xlarge`
21+
- Model: `model.gguf` (Llama-3.1-405B_Q4_0)
22+
23+
One of the three nodes serves as the master node, which physically hosts the model file. The other two nodes act as worker nodes. In `llama.cpp`, remote procedure calls (RPC) offload both the model and the computation over TCP connections between nodes. The master node forwards inference requests to the worker nodes, where computation is performed.
24+
25+
## Set up the worker nodes
26+
27+
Choose two of the three devices to act as backend workers. If the devices have varying compute capacities, select the ones with the highest compute, especially for a 405B model. Because all three devices in this setup are identical, you can select any two to serve as backend workers.
28+
29+
Communication between the master node and the worker nodes occurs through a socket created on each worker. This socket listens for incoming data from the master, such as model parameters, tokens, hidden states, and other inference-related information.
30+
31+
{{% notice Note %}}
32+
The RPC feature in `llama.cpp` is not secure by default, so you should never expose it to the open internet. To reduce this risk, ensure that the security groups for all your EC2 instances are configured to restrict access to trusted IPs or internal VPC traffic only. This prevents unauthorized access to the RPC endpoints.
33+
{{% /notice %}}
34+
35+
Start the worker nodes with the following command:
2136

22-
Communication between the master node and the worker nodes occurs through a socket created on each worker. This socket listens for incoming data from the master—such as model parameters, tokens, hidden states, and other inference-related information.
23-
{{% notice Note %}}The RPC feature in llama.cpp is not secure by default, so you should never expose it to the open internet. To mitigate this risk, ensure that the security groups for all your EC2 instances are properly configured—restricting access to only trusted IPs or internal VPC traffic. This helps prevent unauthorized access to the RPC endpoints.{{% /notice %}}
24-
Use the following command to start the listening on the worker nodes:
2537
```bash
2638
bin/rpc-server -c -p 50052 -H 0.0.0.0 -t 64
2739
```
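To confirm each worker is actually listening (an optional check; `ss` ships with Ubuntu), run the following on the worker:

```bash
# The RPC server should show a LISTEN entry on TCP port 50052
ss -tlnp | grep 50052
```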
28-
Below are the available flag options that can be used with the rpc-server functionality:
40+
41+
## Review RPC server options
42+
43+
The following flags are available with the `rpc-server` command:
2944

3045
```output
3146
-h, --help show this help message and exit
@@ -36,4 +51,5 @@ Below are the available flag options that can be used with the rpc-server functi
3651
-m MEM, --mem MEM backend memory size (in MB)
3752
-c, --cache enable local file cache
3853
```
39-
Setting the host to 0.0.0.0 might seem counterintuitive given the earlier security warning, but it’s acceptable in this case because the security groups have been properly configured to block any unintended or unauthorized access.
54+
55+
Although setting the host to `0.0.0.0` might seem counterintuitive given the earlier security warning, it is acceptable here because the EC2 security groups are configured to block unintended or unauthorized access.
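As an illustration only (the security group ID and CIDR below are hypothetical, and you can configure the same rule in the AWS console), restricting the RPC port to VPC-internal traffic with the AWS CLI looks like this:

```bash
# Hypothetical IDs and CIDR: allow the RPC port only from inside your VPC
aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol tcp \
  --port 50052 \
  --cidr 172.31.0.0/16
```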

content/learning-paths/servers-and-cloud-computing/distributed-inference-with-llama-cpp/how-to-3.md

Lines changed: 35 additions & 22 deletions
@@ -1,45 +1,54 @@
11
---
2-
title: Configuring Master Node
2+
title: Configure the master node
33
weight: 4
44

55
### FIXED, DO NOT MODIFY
66
layout: learningpathall
77
---
8-
## Master node setup
9-
In this learning path, we will use the following two IP addresses for the worker nodes. Replace these with your own node IPs.
8+
9+
## Set up the master node
10+
11+
In this section, you configure the master node and verify communication with worker nodes before running distributed inference.
12+
13+
Export the worker node IP addresses, replacing the example values with the IPs of your own nodes:
1014

1115
```bash
12-
export worker_ips = "172.31.110.11:50052,172.31.110.12:50052"
16+
export worker_ips="172.31.110.11:50052,172.31.110.12:50052"
1317
```
18+
1419
You can find the IP addresses of your AWS instances in the AWS console.
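If you prefer the command line, here is a sketch using the AWS CLI; it assumes your credentials are configured and that the three `c8g.16xlarge` nodes are the only instances of that type running in the region:

```bash
# List the private IP addresses of the running c8g.16xlarge instances
aws ec2 describe-instances \
  --filters "Name=instance-type,Values=c8g.16xlarge" "Name=instance-state-name,Values=running" \
  --query "Reservations[].Instances[].PrivateIpAddress" \
  --output text
```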
1520

16-
You can verify communication with the worker nodes using the following command on master node:
21+
Verify communication with a worker node by running the following command on the master node:
22+
1723
```bash
1824
telnet 172.31.110.11 50052
1925
```
20-
If the backend server is set up correctly, the output of the `telnet` command should look like the following:
21-
```bash
26+
If the backend server is set up correctly, the output should look like:
27+
28+
```output
2229
Trying 172.31.110.11...
2330
Connected to 172.31.110.11.
2431
Escape character is '^]'.
2532
```
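If you want to check both workers in one pass, a small sketch that reuses the `worker_ips` variable exported earlier (it relies on Bash's built-in `/dev/tcp` redirection):

```bash
# Probe each worker host:port pair listed in worker_ips
for hp in ${worker_ips//,/ }; do
  host=${hp%%:*}
  port=${hp##*:}
  if timeout 3 bash -c "echo > /dev/tcp/$host/$port"; then
    echo "$host:$port reachable"
  else
    echo "$host:$port unreachable"
  fi
done
```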
26-
Finally, you can execute the following command, to execute distributed inference:
33+
Run distributed inference using `llama-cli`:
34+
2735
```bash
2836
bin/llama-cli -m ../../model.gguf -p "Tell me a joke" -n 128 --rpc "$worker_ips" -ngl 999
2937
```
3038

3139
{{% notice Note %}}
32-
It will take a significant amount of time (~30 minutes) to load the tensors on the worker nodes. Pre-loaded tensors are a current development request for llama.cpp.
40+
Loading tensors on the worker nodes can take up to 30 minutes. Pre-loaded tensors are a requested enhancement for llama.cpp.
3341
{{% /notice %}}
42+
## Understand the command flags
3443

35-
Here are short definitions of the flags used in above command:
36-
-n => Number of maximum output tokens
37-
--rpc => list of backend workers
38-
-ngl => Number of layers to be placed on backend workers (999 means offload all layers on workers)
44+
- `-n`: maximum number of output tokens
45+
- `--rpc`: list of backend workers
46+
- `-ngl`: number of layers to offload to backend workers (`999` offloads all layers)
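For example, a hypothetical variation that offloads only some layers to the workers and keeps the remainder on the master node (whether a given split fits in memory depends on your setup):

```bash
# Offload 100 layers to the RPC workers; the remaining layers run on the master node
bin/llama-cli -m ../../model.gguf -p "Tell me a joke" -n 128 --rpc "$worker_ips" -ngl 100
```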
3947

4048
{{% notice Note %}}At the time of publication, llama.cpp only supports up to 16 backend workers.{{% /notice %}}
4149

42-
The output:
50+
## Review example output
51+
4352
```output
4453
build: 5935 (2adf8d83) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for aarch64-linux-gnu
4554
main: llama backend init
@@ -195,18 +204,22 @@ llama_perf_context_print: eval time = 77429.95 ms / 127 runs ( 609
195204
llama_perf_context_print: total time = 79394.06 ms / 132 tokens
196205
llama_perf_context_print: graphs reused = 0
197206
```
198-
That's it! You have successfully run the llama-3.1-8B model on CPUs with the power of llama.cpp RPC functionality. The following table provides brief description of the metrics from `llama_perf`:
207+
That's it! You have successfully run the Llama 3.1 405B model on CPUs with the power of llama.cpp RPC functionality.
208+
209+
The following table provides a brief description of the metrics from `llama_perf`:
199210

200211

201-
| Log Line | Description |
212+
| Log line | Description |
202213
|-------------------|-----------------------------------------------------------------------------|
203-
| sampling time | Time spent choosing next tokens using sampling strategy (e.g., top-k, top-p). |
204-
| load time | Time to load the model into memory and initialize weights/buffers. |
205-
| prompt eval time | Time to process the input prompt tokens before generation (fills KV cache). |
206-
| eval time | Time to generate output tokens by forward-passing through the model. |
207-
| total time | Total time for both prompt processing and token generation (excludes model load). |
214+
| sampling time | Time spent choosing next tokens using the sampling strategy (for example, top-k, top-p) |
215+
| load time | Time required to load the model into memory and initialize weights and buffers |
216+
| prompt eval time | Time to process the input prompt tokens before generation (fills KV cache) |
217+
| eval time | Time to generate output tokens by forward-passing through the model |
218+
| total time | Total time for both prompt processing and token generation (excludes model load) |
219+
220+
## Run distributed inference with llama-server
208221

209-
Lastly to set up OpenAI compatible API, you can use the `llama-server` functionality. The process of implementing this is described [here](/learning-paths/servers-and-cloud-computing/llama-cpu) under the "Access the chatbot using the OpenAI-compatible API" section. Here is a snippet, for how to set up llama-server for distributed inference:
222+
Lastly, to set up an OpenAI-compatible API, you can use the `llama-server` functionality. The process is described in [Deploy a Large Language Model (LLM) chatbot with llama.cpp using KleidiAI on Arm servers](/learning-paths/servers-and-cloud-computing/llama-cpu) under the "Access the chatbot using the OpenAI-compatible API" section. The following snippet shows how to start `llama-server` for distributed inference:
210223
```bash
211224
bin/llama-server -m ../../model.gguf --port 8080 --rpc "$worker_ips" -ngl 999
212225
```
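Once `llama-server` is running on the master node, you can send a request to its OpenAI-compatible endpoint. A sketch, assuming the port from the command above; the `model` field is effectively a placeholder because the server serves the single loaded model:

```bash
# Query the OpenAI-compatible chat completions endpoint on the master node
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "model.gguf",
        "messages": [{"role": "user", "content": "Tell me a joke"}],
        "max_tokens": 128
      }'
```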
