Pull request overview
Adds end-to-end support for a new 2-bit quantized tensor format (Precision::TQ2) focused on embedding weights, including mmap loading, descriptor wiring, and runtime dequantization.
Changes:
- Introduces the `CactusTQ2` descriptor plus loader/dequantizer (`cactus_tq2_load`, `cactus_tq2_dequant_row`).
- Extends graph precision handling to include `Precision::TQ2` and routes embedding computation through the TQ2 dequant path.
- Updates mmap tensor parsing to construct and retain a `CactusTQ2` descriptor for TQ2 files.
Reviewed changes
Copilot reviewed 5 out of 6 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| cactus/kernel/kernel_tq2.cpp | Implements TQ2 blob parsing and per-row dequantization (NEON + FWHT). |
| cactus/kernel/kernel.h | Exposes CactusTQ2 and the TQ2 load/dequant APIs. |
| cactus/graph/graph_ops_tensor.cpp | Adds Precision::TQ2 embedding execution path. |
| cactus/graph/graph_io.cpp | Allows TQ2 precision in graphs; builds CactusTQ2 during mmap and wires it into buffers. |
| cactus/graph/graph.h | Adds Precision::TQ2 and updates PrecisionTraits/buffer descriptor to carry a TQ2 descriptor pointer. |
| .gitignore | Normalizes .githooks/ ignore and adds examples/ to ignores. |
```cpp
static constexpr size_t packed_size_of(Precision prec, size_t count) {
    switch (prec) {
        case Precision::INT4: return (count + 1) / 2;
        case Precision::TQ2: return (count + 3) / 4;
        default: return count * size_of(prec);
```
PrecisionTraits::packed_size_of now treats Precision::TQ2 as 2-bit packed, but PrecisionTraits::byte_offset_of has no TQ2 case and will fall through to element_offset * size_of(TQ2) (currently 1). This inconsistency will compute incorrect byte offsets if any generic tensor ops use byte_offset_of with TQ2. Add a TQ2 case to byte_offset_of (and any related alignment assertions) or make TQ2 opt out of these generic helpers.
```cpp
        case Precision::INT4: return (count + 1) / 2;
        case Precision::TQ2: return (count + 3) / 4;
        default: return count * size_of(prec);
    }
```
Generic tensor IO uses PrecisionTraits::packed_size_of(precision, total_elements) (e.g., GraphFile::save_node) to decide how many bytes to write. For TQ2 tensors, the on-disk representation is a full self-describing blob (header + codebook/scales/signs/permutation + packed indices), so (count+3)/4 does not represent the serialized size and can lead to truncated/corrupt saves or incorrect buffer sizing. Consider explicitly rejecting Precision::TQ2 in the generic save/size paths, or adding TQ2-specific serialization/size APIs that operate on the blob format.
```cpp
uint64_t off_cb = rd_u64(blob + 80);
uint64_t off_is = rd_u64(blob + 88);
uint64_t off_rot = rd_u64(blob + 96);
uint64_t off_sc = rd_u64(blob + 104);
uint64_t off_ix = rd_u64(blob + 112);
uint64_t total = rd_u64(blob + 120);
if (total != blob_size) return 0;

out->vocab = (uint32_t)dim0;
out->group_size = group_size;
out->num_groups = num_groups;
out->per_group_bytes = group_size / 4;

out->codebook = reinterpret_cast<const float*>(blob + off_cb);
out->input_scale = reinterpret_cast<const __fp16*>(blob + off_is);
out->left_signs = reinterpret_cast<const int8_t*>(blob + off_rot);
out->right_signs = reinterpret_cast<const int8_t*>(blob + off_rot + group_size);
out->permutation = reinterpret_cast<const uint32_t*>(blob + off_rot + 2u * group_size);
out->scales = reinterpret_cast<const __fp16*>(blob + off_sc);
out->packed = reinterpret_cast<const uint8_t*>(blob + off_ix);
```
cactus_tq2_load sets internal pointers from file-provided offsets (off_cb/off_is/off_rot/off_sc/off_ix) without validating that these offsets and the expected ranges (codebook, input_scale, signs+permutation, scales, packed indices) lie within blob_size. A malformed blob could cause out-of-bounds reads later during dequantization. Add explicit bounds checks for each section (including overflow-safe off + size <= blob_size) and reject overlapping/unsorted regions if the format requires it.
```cpp
out->vocab = (uint32_t)dim0;
out->group_size = group_size;
```
out->vocab is assigned via a narrowing cast from the 64-bit dim0 without checking range. If a corrupted blob encodes dim0 > UINT32_MAX, this will wrap and can lead to incorrect bounds checks and OOB access. Validate dim0 <= std::numeric_limits<uint32_t>::max() before casting.
```cpp
for (uint32_t i = 0; i < group_size; ++i) {
    uint32_t p = out->permutation[i];
    if (p >= group_size) return 0;
    out->inv_permutation[p] = i;
}
```
The permutation inversion build does not verify that permutation is a bijection. If the blob contains duplicates, some entries of inv_permutation[0..group_size) remain uninitialized, and dequant_group will read them (inv_permutation[k]) leading to undefined behavior. Initialize inv_permutation to a sentinel value and reject duplicates / missing entries.
Suggested change:

```cpp
for (uint32_t i = 0; i < group_size; ++i) {
    out->inv_permutation[i] = UINT32_MAX;
}
for (uint32_t i = 0; i < group_size; ++i) {
    uint32_t p = out->permutation[i];
    if (p >= group_size) return 0;
    if (out->inv_permutation[p] != UINT32_MAX) return 0;
    out->inv_permutation[p] = i;
}
for (uint32_t i = 0; i < group_size; ++i) {
    if (out->inv_permutation[i] == UINT32_MAX) return 0;
}
```
```cpp
void cactus_tq2_dequant_row(const CactusTQ2* tq2, uint32_t token_id, __fp16* out) {
    const uint32_t gs = tq2->group_size;
    for (uint32_t g = 0; g < tq2->num_groups; ++g) {
        dequant_group(tq2, token_id, g, out + g * gs);
    }
}
```
cactus_tq2_dequant_row (and dequant_group) assume token_id < tq2->vocab and will index into tq2->scales / tq2->packed based on token_id without any internal bounds check. Since this is a public kernel API, add a fast guard (return/abort) or document/enforce the precondition to prevent OOB reads when called from other sites.
```cpp
if (emb_prec == Precision::TQ2) {
    const CactusTQ2* tq2 = embeddings_buffer.tq2;
    if (tq2 == nullptr) {
        throw std::runtime_error("TQ2 embedding: descriptor missing");
    }
    for (size_t i = 0; i < num_indices; i++) {
        uint32_t idx = static_cast<uint32_t>(indices_ptr[i]);
        if (idx >= tq2->vocab) idx = 0;
        cactus_tq2_dequant_row(tq2, idx, output + i * hidden_dim);
    }
}
```
TQ2 embedding index handling currently clamps out-of-range indices to 0, while the INT4/INT8 and FP16 embedding paths throw on out-of-bounds indices. This inconsistency can hide data/graph bugs and makes behavior depend on embedding precision. Consider matching the existing behavior (throw with a clear error) or make the clamping behavior consistent across precisions.
No description provided.