---
title: AFM-4.5B deployment on Google Cloud Axion with Llama.cpp
weight: 2
### FIXED, DO NOT MODIFY
layout: learningpathall
---
## AFM-4.5B model and deployment workflow
[AFM-4.5B](https://huggingface.co/arcee-ai/AFM-4.5B) is a 4.5-billion-parameter foundation model designed to balance accuracy, efficiency, and broad language coverage. Trained on nearly 8 trillion tokens of carefully filtered data, it performs well across a wide range of languages, including Arabic, English, French, German, Hindi, Italian, Korean, Mandarin, Portuguese, Russian, and Spanish.
In this Learning Path, you’ll deploy [AFM-4.5B](https://huggingface.co/arcee-ai/AFM-4.5B) using [Llama.cpp](https://github.com/ggerganov/llama.cpp) on a Google Cloud Axion Arm64 instance. You’ll walk through the full workflow, from setting up your environment and compiling the runtime, to downloading, quantizing, and running inference on the model. You’ll also evaluate model quality using perplexity, a standard metric for how well a language model predicts text.
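For reference, perplexity is the exponentiated average negative log-likelihood a model assigns to each token of a held-out text, so lower values mean better predictions:

```latex
\mathrm{PPL} = \exp\left(-\frac{1}{N}\sum_{i=1}^{N}\log p\left(x_i \mid x_{<i}\right)\right)
```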
This hands-on guide helps developers build cost-efficient, high-performance LLM applications on modern Arm server infrastructure using open-source tools and real-world deployment practices.
### Deployment workflow for AFM-4.5B on Google Cloud Axion
- **Provision compute**: launch a Google Cloud instance using an Axion-based instance type (for example, `c4a-standard-16`)
- **Set up your environment**: install build tools and dependencies (CMake, Python, Git)
- **Build the inference engine**: clone the [Llama.cpp](https://github.com/ggerganov/llama.cpp) repository and compile the project for your Arm-based environment
- **Prepare the model**: download the AFM-4.5B model files from Hugging Face and use Llama.cpp's quantization tools to reduce model size and optimize performance
- **Run inference**: load the quantized model and run sample prompts using Llama.cpp
- **Evaluate model quality**: calculate perplexity or use other metrics to assess performance
{{< notice Note >}}
You can reuse this deployment flow with other models supported by Llama.cpp by swapping out the model file and adjusting quantization settings.
{{< /notice >}}
---
title: Provision a Google Cloud Axion Arm64 environment
weight: 3
### FIXED, DO NOT MODIFY
layout: learningpathall
---
## Requirements
Before you begin, make sure you meet the following requirements:
- A Google Cloud account
- Permission to launch a Google Cloud Axion instance of type `c4a-standard-16` (or larger)
- At least 128 GB of available storage
If you're new to Google Cloud, see the Learning Path [Getting started with Google Cloud](/learning-paths/servers-and-cloud-computing/csp/google/).
## Requirements for Google Cloud Axion
Confirm that your account has sufficient quota for Axion instances and enough storage capacity to host the AFM-4.5B model and dependencies.
## Launch and configure a Google Cloud Axion VM
In the left sidebar of the [Compute Engine dashboard](https://console.cloud.google.com/compute), select **VM instances**, and then **Create instance**.
Use the following settings:
- **Name**: `arcee-axion-instance`
- **Region** and **Zone**: the region and zone where you have access to `c4a` instances
- **Machine family**: select **General purpose**, then **C4A**
- **Machine type**: `c4a-standard-16` or larger
## Configure operating system and storage
In the left sidebar, select **OS and storage**.
- Under **Operating system and storage**, click **Change**
- Select **Ubuntu 24.04 LTS Minimal** as the OS
- Set the disk size to **128 GB**
- Click **Select**
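If you prefer the command line, a roughly equivalent VM can be created with the gcloud CLI. Treat this as a sketch: the zone is a placeholder, and the Ubuntu 24.04 LTS Minimal Arm64 image family name is an assumption you should verify with `gcloud compute images list`:

```bash
# Create an Axion (c4a) VM with a 128 GB boot disk; adjust zone and image family as needed
gcloud compute instances create arcee-axion-instance \
  --zone=us-central1-a \
  --machine-type=c4a-standard-16 \
  --image-family=ubuntu-minimal-2404-lts-arm64 \
  --image-project=ubuntu-os-cloud \
  --boot-disk-size=128GB
```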
## Review and create your Axion instance
Leave the other settings as they are.
When you’re ready, click **Create** to launch your Compute Engine instance.
## Verify instance launch
After a few seconds, you should see your instance listed as **Running**.
If the launch fails, double-check your settings and permissions, and try again.
## Connect to your Google Cloud Axion VM
Open the **SSH** dropdown list, and select **Open in browser window**.
Your browser may ask you to authenticate. Once you’ve done that, a terminal window will open.
You are now connected to your Ubuntu instance running on Google Cloud Axion.
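As an alternative to the browser-based terminal, you can connect from your local machine with the gcloud CLI (assuming it is installed and authenticated; replace the zone with the one you chose):

```bash
# Open an SSH session to the instance
gcloud compute ssh arcee-axion-instance --zone=us-central1-a
```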
{{% notice Note %}}
- **Region**: make sure you're launching in your preferred Google Cloud region.
- **Storage**: 128 GB is sufficient for the AFM-4.5B model and dependencies.
{{% /notice %}}

---
title: Configure your Google Cloud Axion Arm64 environment
weight: 4
### FIXED, DO NOT MODIFY
layout: learningpathall
---
In this step, you’ll configure your Google Cloud Axion Arm64 instance with the system packages and Python environment required to build and run the Arcee Foundation Model using Llama.cpp.
## Update package lists
Run the following command to update your local APT package index:
```bash
sudo apt-get update
```
This ensures you have the most recent metadata about available packages, versions, and dependencies, helping to prevent conflicts when installing new software.
## Install build tools and Python dependencies
Install the required build tools and Python environment:
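On Ubuntu 24.04 Minimal, a package set along these lines covers the CMake, Python, and Git dependencies used in the following steps (the exact package list is an assumption):

```bash
# Build toolchain, CMake, Git, and Python tooling (package selection is an assumption)
sudo apt-get install -y build-essential cmake git python3 python3-pip python3-venv
```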

---
title: Build Llama.cpp on Google Cloud Axion Arm64
weight: 5
### FIXED, DO NOT MODIFY
layout: learningpathall
---
## Build the Llama.cpp inference engine on Google Cloud Axion
In this step, you’ll build Llama.cpp from source. Llama.cpp is a high-performance C++ implementation of the LLaMA model, optimized for inference on multiple hardware platforms, including Arm64 processors such as Google Cloud Axion.
Although AFM-4.5B uses a custom architecture, you can use the standard Llama.cpp repository. Arcee AI has contributed the required modeling code upstream.
## Clone the Llama.cpp repository
```bash
git clone https://github.com/ggerganov/llama.cpp
```
This command clones the Llama.cpp repository from GitHub. The repository includes source code, build scripts, and documentation.
## Navigate to the Llama.cpp directory
```bash
cd llama.cpp
```
Move into the `llama.cpp` directory to run the build process. This directory contains the `CMakeLists.txt` file and all source code.
## Configure the build with CMake for Arm64
```bash
cmake -B .
```
This configures the build system using CMake:
- `-B .` generates build files in the current directory
- CMake detects the system compiler, libraries, and hardware capabilities
- It produces Makefiles (Linux) or platform-specific scripts for compilation
On Google Cloud Axion, the output should show hardware-specific optimizations for the Neoverse V2 architecture:
```output
-- ARM feature DOTPROD enabled
-- Adding CPU backend variant ggml-cpu: -mcpu=neoverse-v2+crc+sve2-aes+sve2-sha3+dotprod+i8mm+sve
```
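To confirm these features are available on your VM, you can inspect the CPU flags reported by the kernel (a quick check; `lscpu` ships with Ubuntu's util-linux):

```bash
# Look for sve, sve2, i8mm, and asimddp (the DOTPROD feature) in the Flags line
lscpu | grep -i flags
```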
These optimizations enable advanced Arm64 CPU instructions:

- **SVE (Scalable Vector Extension)**: variable-length vector processing, with vectors up to 2048 bits, that accelerates matrix operations
- **MATMUL_INT8**: integer matrix multiplication units optimized for transformers
- **FMA**: fused multiply-add operations that speed up floating-point math
- **FP16 vector arithmetic**: 16-bit floating-point vector operations that reduce memory use without compromising precision

## Build the project
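Run the build. Based on the flags described in the list that follows, the command looks like this:

```bash
# Compile Llama.cpp with release optimizations, using 16 parallel jobs
cmake --build . --config Release -j16
```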
- `--build .` builds the project in the current directory
- `--config Release` enables optimizations and strips debug symbols
- `-j16` runs 16 parallel jobs for faster compilation on multi-core Axion systems
The build produces Arm64-optimized binaries in under a minute.
## Key Llama.cpp binaries after compilation
After compilation, you’ll find key tools in the `bin` directory:
- `llama-cli`: main inference executable
- `llama-server`: HTTP server for model inference
- `llama-quantize`: tool for quantization to reduce memory usage
- Additional utilities for model conversion and optimization
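Once you have a GGUF model file, a short generation is a quick way to confirm the build works. The model path below is a placeholder; downloading and quantizing AFM-4.5B is covered in the next steps:

```bash
# Generate 32 tokens from a test prompt; replace the path with your GGUF model
bin/llama-cli -m /path/to/model.gguf -p "Hello from Axion" -n 32
```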
See the [Llama.cpp GitHub repository](https://github.com/ggml-org/llama.cpp/tree/master/tools) for details.
These binaries are optimized for Arm64 and provide excellent performance on Google Cloud Axion.