---
title: Cactus Graph API Documentation
description: Computational graph framework for building and executing tensor operations on mobile devices. Supports matrix multiplication, attention, normalization, and INT4/INT8/FP16 precision.
keywords:
---
The Cactus Graph API provides a computational graph framework for building and executing tensor operations. It supports multiple precision types, broadcasting, and optimized execution for neural network inference.
- Setup
- Core Concepts
- Getting Started
- Tensor Operations
- Advanced Features
- Complete Examples
- Best Practices
Before using the Cactus Graph API, set up your development environment:
# Setup the environment and install dependencies
source ./setup
# Build the Cactus library
cactus build
# Run tests to verify everything works
cactus test

The framework supports four precision types for tensors:
enum class Precision {
INT4,
INT8,
FP16,
FP32
};

Note: INT4 tensors use packed storage (2 values per byte) and automatically unpack to INT8 for computation.
The CactusGraph class manages the computational graph:
CactusGraph graph;
size_t input = graph.input({2, 3}, Precision::INT8);
size_t result = graph.add(input, another_input);
graph.execute();
void* output = graph.get_output(result);

For testing, use the provided fixtures that handle memory management:
TestUtils::Int8TestFixture fixture("My Test");
TestUtils::FP16TestFixture fixture("Float Test");

#include "cactus/graph/graph.h"
CactusGraph graph;
size_t a = graph.input({4}, Precision::INT8);
size_t b = graph.input({4}, Precision::INT8);
size_t sum = graph.add(a, b);
std::vector<int8_t> data_a = {1, 2, 3, 4};
std::vector<int8_t> data_b = {5, 6, 7, 8};
graph.set_input(a, data_a.data(), Precision::INT8);
graph.set_input(b, data_b.data(), Precision::INT8);
graph.execute();
int8_t* result = static_cast<int8_t*>(graph.get_output(sum)); // [6, 8, 10, 12]

size_t add_result = graph.add(a, b); // a + b
size_t sub_result = graph.subtract(a, b); // a - b
size_t mul_result = graph.multiply(a, b); // a * b
size_t div_result = graph.divide(a, b); // a / b

size_t scalar_add = graph.scalar_add(input, 5.0f); // input + 5
size_t scalar_sub = graph.scalar_subtract(input, 2.0f); // input - 2
size_t scalar_mul = graph.scalar_multiply(input, 3.0f); // input * 3
size_t scalar_div = graph.scalar_divide(input, 2.0f); // input / 2

size_t exp_result = graph.scalar_exp(input); // e^input
size_t sqrt_result = graph.scalar_sqrt(input); // √input
size_t cos_result = graph.scalar_cos(input); // cos(input)
size_t sin_result = graph.scalar_sin(input); // sin(input)
size_t log_result = graph.scalar_log(input); // ln(input)

// Standard matmul: (2,3) x (3,4) = (2,4)
size_t a = graph.input({2, 3}, Precision::FP16);
size_t b = graph.input({3, 4}, Precision::FP16);
size_t result = graph.matmul(a, b);
// With pre-transposed right-hand side
size_t result = graph.matmul(a, b, true);

size_t transposed = graph.transpose(input); // (2,3) -> (3,2)

size_t reshaped = graph.reshape(input, {6, 1}); // (2,3) -> (6,1)

size_t sum_all = graph.sum(input, -1); // -1 for all elements
size_t sum_axis0 = graph.sum(input, 0);
size_t mean_all = graph.mean(input, -1);
size_t var = graph.variance(input, axis);
size_t min_val = graph.min(input, axis);
size_t max_val = graph.max(input, axis);

size_t weight = graph.input({hidden_size}, Precision::FP16);
size_t bias = graph.input({hidden_size}, Precision::FP16);
size_t normalized = graph.layernorm(input, weight, bias, 1e-5f);

size_t weight = graph.input({hidden_size}, Precision::FP16);
size_t normalized = graph.rms_norm(input, weight, 1e-5f);

size_t softmax_result = graph.softmax(input, -1);

size_t attention_out = graph.attention(query, key, value, scale);
size_t attention_out = graph.attention(query, key, value, scale, position_offset);
size_t attention_out = graph.attention(query, key, value, scale, position_offset, window_size);

size_t rope_output = graph.rope(input, theta, position_offset);

size_t silu_out = graph.silu(input);
size_t gelu_out = graph.gelu(input);
size_t gelu_erf_out = graph.gelu_erf(input); // GeLU with erf approximation
size_t sigmoid_out = graph.sigmoid(input);
size_t tanh_out = graph.tanh(input);
size_t relu_out = graph.relu(input);
size_t glu_out = graph.glu(input, axis); // Gated Linear Unit

// 1D convolutions
size_t conv1d_out = graph.conv1d(input, weight, has_bias, bias, stride);
size_t conv1d_k3_out = graph.conv1d_k3(input, weight, stride);
size_t conv1d_causal = graph.conv1d_causal(input, weight, kernel_size, dilation);
size_t conv1d_pointwise = graph.conv1d_pointwise(input, weight, has_bias, bias);
size_t conv1d_depthwise = graph.conv1d_same_depthwise_k9(input, weight, has_bias, bias);
// 2D convolutions
size_t conv2d_out = graph.conv2d_k3s2p1(input, weight, has_bias, bias);
size_t conv2d_dw = graph.conv2d_depthwise_k3s2p1(input, weight, has_bias, bias);
size_t conv2d_pw = graph.conv2d_pointwise_1x1(input, weight, has_bias, bias);

size_t groupnorm_out = graph.groupnorm(input, weight, bias, num_groups, epsilon);
size_t batchnorm_out = graph.batchnorm(input, weight, bias, running_mean, running_var, axis, epsilon);

size_t lstm_out = graph.lstm_cell(input, h_prev, c_prev, weight_ih, weight_hh, bias_ih, bias_hh);
size_t deltanet_out = graph.gated_deltanet_decode(query, key, value, gate_log, beta, initial_state, scale);
size_t deltanet_prefill = graph.gated_deltanet_prefill(query, key, value, gate_log, beta, initial_state, chunk_size, scale);

size_t moe_gated = graph.moe_layer_gated(hidden, routing_probs, topk_indices,
w1_weights, w3_weights, w2_weights,
num_experts, num_experts_per_tok, normalize_routing, epsilon, routed_scaling_factor);
size_t moe_ungated = graph.moe_layer_ungated(hidden, routing_probs, topk_indices,
w1_weights, w2_weights,
num_experts, num_experts_per_tok, normalize_routing, epsilon, routed_scaling_factor, activation);

size_t stft_out = graph.stft(input, weight, stride, num_fft_bins);

size_t embeddings = graph.input({vocab_size, embed_dim}, Precision::FP16);
size_t indices = graph.input({batch_size, seq_len}, Precision::INT8);
size_t gathered = graph.gather(embeddings, indices);

size_t embedded = graph.embedding(embedding_tensor, indices);
size_t embedded = graph.embedding("embeddings.bin", indices); // memory-mapped

size_t mmap_embed = graph.mmap_embeddings("embeddings.bin");
size_t weights = graph.mmap_weights("model_weights.bin");

size_t concatenated = graph.concat(tensor1, tensor2, axis);
size_t multi_cat = graph.cat({tensor1, tensor2, tensor3}, axis); // cat multiple tensors

size_t sliced = graph.slice(input, axis, start, length);

size_t indexed = graph.index(input, index_value, dimension);

size_t topk_values = graph.topk(input, k);

size_t persistent = graph.persistent(source_node); // cache result across executions

size_t prediction = graph.altup_predict(coefs, streams, num_streams);
size_t correction = graph.altup_correct(coefs, innovation, predictions, num_predictions);

size_t interpolated = graph.bilinear_interpolation(pos_embeds, dst_height, dst_width);

size_t sampled = graph.sample(logits, temperature, top_p, top_k);

The framework automatically handles broadcasting for compatible shapes:
size_t tensor = graph.input({2, 3}, Precision::INT8);
size_t scalar = graph.input({1}, Precision::INT8);
size_t result = graph.add(tensor, scalar); // {1} -> {2,3}

size_t a = graph.input({2, 3}, Precision::INT8);
size_t b = graph.input({2, 1}, Precision::INT8);
size_t result = graph.add(a, b); // {2,1} -> {2,3}

size_t a = graph.input({2, 2, 3}, Precision::INT8);
size_t b = graph.input({2, 3}, Precision::INT8);
size_t result = graph.add(a, b); // {2,3} -> {2,2,3}

size_t int8_tensor = graph.input({4}, Precision::INT8);
size_t fp16_tensor = graph.precision_cast(int8_tensor, Precision::FP16);
graph.set_quantization_scale(node_id, scale);

const std::string filename = "test_graph_save_load.cg";
CactusGraph graph;
size_t input_a = graph.input({2, 3}, Precision::FP16);
size_t input_b = graph.input({2, 3}, Precision::FP16);
size_t sum_id = graph.add(input_a, input_b);
graph.save(filename);
CactusGraph loaded = CactusGraph::load(filename);
std::vector<__fp16> data_a = {1, 2, 3, 4, 5, 6};
std::vector<__fp16> data_b = {10, 20, 30, 40, 50, 60};
loaded.set_input(0, data_a.data(), Precision::FP16);
loaded.set_input(1, data_b.data(), Precision::FP16);
loaded.execute();

GraphFile::save_node(graph, node_id, "output.bin");

CactusGraph new_graph;
auto loaded = GraphFile::load_into_graph(new_graph, "output.bin");
size_t node_id = loaded.node_id;
std::vector<size_t> shape = loaded.shape;
Precision precision = loaded.precision;

graph.execute();
graph.execute("profile_output.json"); // with profiling

graph.hard_reset(); // clear all nodes and buffers
graph.soft_reset(); // clear only buffers, keep graph structure

CactusGraph graph;
size_t input = graph.input({2, 4}, Precision::FP16);
size_t weight = graph.input({4, 8}, Precision::FP16);
size_t bias = graph.input({8}, Precision::FP16);
size_t linear = graph.matmul(input, weight);
size_t with_bias = graph.add(linear, bias);
size_t activated = graph.gelu(with_bias);
size_t ln_weight = graph.input({8}, Precision::FP16);
size_t ln_bias = graph.input({8}, Precision::FP16);
size_t output = graph.layernorm(activated, ln_weight, ln_bias);

CactusGraph graph;
size_t hidden_dim = 512;
size_t num_heads = 8;
size_t head_dim = hidden_dim / num_heads;
size_t seq_len = 32;
size_t input = graph.input({1, seq_len, hidden_dim}, Precision::FP16);
size_t q_weight = graph.input({hidden_dim, hidden_dim}, Precision::FP16);
size_t k_weight = graph.input({hidden_dim, hidden_dim}, Precision::FP16);
size_t v_weight = graph.input({hidden_dim, hidden_dim}, Precision::FP16);
size_t query = graph.matmul(input, q_weight);
size_t key = graph.matmul(input, k_weight);
size_t value = graph.matmul(input, v_weight);
query = graph.reshape(query, {1, seq_len, num_heads, head_dim});
key = graph.reshape(key, {1, seq_len, num_heads, head_dim});
value = graph.reshape(value, {1, seq_len, num_heads, head_dim});
float scale = 1.0f / sqrt(head_dim);
size_t attention_out = graph.attention(query, key, value, scale);

CactusGraph graph;
size_t vocab_size = 50000;
size_t embed_dim = 768;
size_t tokens = graph.input({2, 10}, Precision::INT8);
size_t embed_table = graph.input({vocab_size, embed_dim}, Precision::FP16);
size_t embeddings = graph.gather(embed_table, tokens);
// or memory-mapped for large models
size_t mmap_table = graph.mmap_embeddings("vocab_embeddings.bin");
size_t embeddings = graph.gather(mmap_table, tokens);
size_t pos_embed = graph.input({1, 10, embed_dim}, Precision::FP16);
size_t final_embed = graph.add(embeddings, pos_embed);

TestUtils::FP16TestFixture fixture("Similarity");
size_t text1 = fixture.create_input({1, 768}, Precision::FP16);
size_t text2 = fixture.create_input({1, 768}, Precision::FP16);
// L2 norms
size_t norm1 = fixture.graph().scalar_sqrt(
fixture.graph().sum(fixture.graph().multiply(text1, text1), -1));
size_t norm2 = fixture.graph().scalar_sqrt(
fixture.graph().sum(fixture.graph().multiply(text2, text2), -1));
// cosine similarity = dot(a,b) / (norm(a) * norm(b))
size_t dot_product = fixture.graph().sum(fixture.graph().multiply(text1, text2), -1);
size_t similarity = fixture.graph().divide(dot_product, fixture.graph().multiply(norm1, norm2));

- Use appropriate precision: INT4/INT8 for memory efficiency, FP16 for accuracy
- Memory-map large tensors: Use `mmap_embeddings()` for vocabulary tables
- Reset graphs: Call `hard_reset()` when switching between different models
- External buffers: Use `set_external_input()` to avoid copying large inputs
- Batch operations: Process multiple samples together
- Pre-transpose weights: Use `pretransposed_rhs=true` for matmul when possible
- Fused operations: The framework automatically fuses compatible operations
- Backend selection: Use the NPU backend for supported operations: `size_t result = graph.matmul(a, b, false, ComputeBackend::NPU);`
- Build once, execute many: Construct the graph once, run with different inputs
- Validate shapes: Ensure tensor shapes are compatible before operations
- Handle broadcasts: Be aware of automatic broadcasting rules
- Profile execution: Use `execute("profile.json")` to identify bottlenecks
- Use test fixtures: Leverage provided fixtures for automatic cleanup
- Verify outputs: Use `verify_output()` methods for tolerance-based comparison
- Test edge cases: Include tests for broadcasting, empty tensors, and large inputs
- Check precision: Test operations with different precision types
1. **Define the op in core graph types**: Add the new `OpType` in `cactus/graph/graph.h`. If the op needs additional parameters, add the fields to `OpParams` in the same file.
2. **Add a graph builder API**: Add a builder method in `cactus/graph/graph_builder.cpp` and its declaration in `cactus/graph/graph.h`. Follow the pattern of existing builder methods, e.g. for a new "relu" op:

   size_t CactusGraph::relu(size_t input) {
       OpParams params;
       return add_node(OpType::RELU, {input}, params);
   }

3. **Implement the op in the execution engine**: Implement the kernel or graph op code in the relevant file, usually in `cactus/kernel/`. Register the new op in the dispatch table in `cactus/graph/graph_execute.cpp` for the supported backends (CPU, NPU).
4. **Export the op in FFI bindings**:
   - header: `cactus/ffi/cactus_ffi.h`
   - implementation: `cactus/ffi/cactus_ffi.cpp`
5. **Add a Python ctypes declaration**: Add `_lib.cactus_graph_my_new_op.argtypes/restype` in `python/src/cactus.py`.
6. **Add a Python graph wrapper**: Add `Graph.my_new_op(...)` in `python/src/graph.py`, and optionally a Tensor convenience method.
7. **Add a serialization schema entry if needed**: If your op has extra parameters that need to be saved/loaded and are not in the default node, add new `ParamField` enum values. If the op has any graph-persistent params:
   - add any new `ParamField` enum values in `cactus/graph/graph_param_io.cpp`
   - add read/write logic for those fields in `write_field(...)` / `read_field(...)`
   - add the op's schema entry in `op_schema(...)`

   If the op has no params, you may not need to touch the schema beyond adding an empty entry. The syntax pattern there is:

   {OpType::MY_NEW_OP, {
       {ParamField::Alpha, FieldPersistence::Persistent},
       {ParamField::Mode, FieldPersistence::Persistent},
   }},

   If a field is runtime-only, mark it `RuntimeOnly` instead of `Persistent`.
8. **Add test coverage**: Add unit tests to `tests/test_graph.cpp` covering the native graph function, and add Python tests in `python/tests/test_graph.py` covering the Python API and end-to-end execution.
try {
CactusGraph graph;
// ... build and execute graph
} catch (const std::exception& e) {
std::cerr << "Graph error: " << e.what() << std::endl;
}

CactusGraph graph;
size_t x = graph.input({batch, dim}, Precision::FP16);
x = graph.linear(x, weight1, bias1);
x = graph.gelu(x);
x = graph.layernorm(x, ln_weight1, ln_bias1);
x = graph.linear(x, weight2, bias2);

size_t input = graph.input({batch, dim}, Precision::FP16);
size_t processed = graph.matmul(input, weight);
processed = graph.gelu(processed);
size_t output = graph.add(input, processed);

size_t input = graph.input({batch, dim}, Precision::FP16);
size_t path1 = graph.matmul(input, weight1);
path1 = graph.silu(path1);
size_t path2 = graph.matmul(input, weight2);
size_t output = graph.multiply(path1, path2);

- Cactus Engine API — High-level inference API built on top of Cactus Graph
- Cactus Index API — On-device vector database for embedding storage and search
- Runtime Compatibility — Weight versioning across releases