Karen/tq #603

Open
kar-m wants to merge 3 commits into main from karen/tq
Conversation

@kar-m
Collaborator

kar-m commented Apr 21, 2026

No description provided.

Copilot AI review requested due to automatic review settings April 21, 2026 20:12
Contributor

Copilot AI left a comment

Pull request overview

Adds end-to-end support for a new 2-bit quantized tensor format (Precision::TQ2) focused on embedding weights, including mmap loading, descriptor wiring, and runtime dequantization.

Changes:

  • Introduces CactusTQ2 descriptor plus loader/dequantizer (cactus_tq2_load, cactus_tq2_dequant_row).
  • Extends graph precision handling to include Precision::TQ2 and routes embedding computation through the TQ2 dequant path.
  • Updates mmap tensor parsing to construct and retain a CactusTQ2 descriptor for TQ2 files.

Reviewed changes

Copilot reviewed 5 out of 6 changed files in this pull request and generated 7 comments.

File                               Description
cactus/kernel/kernel_tq2.cpp       Implements TQ2 blob parsing and per-row dequantization (NEON + FWHT).
cactus/kernel/kernel.h             Exposes CactusTQ2 and the TQ2 load/dequant APIs.
cactus/graph/graph_ops_tensor.cpp  Adds the Precision::TQ2 embedding execution path.
cactus/graph/graph_io.cpp          Allows TQ2 precision in graphs; builds CactusTQ2 during mmap and wires it into buffers.
cactus/graph/graph.h               Adds Precision::TQ2 and updates PrecisionTraits/buffer descriptor to carry a TQ2 descriptor pointer.
.gitignore                         Normalizes the .githooks/ ignore and adds examples/ to ignores.


Comment thread cactus/graph/graph.h
Comment on lines 170 to 174
static constexpr size_t packed_size_of(Precision prec, size_t count) {
    switch (prec) {
        case Precision::INT4: return (count + 1) / 2;
        case Precision::TQ2: return (count + 3) / 4;
        default: return count * size_of(prec);
Copilot AI Apr 21, 2026

PrecisionTraits::packed_size_of now treats Precision::TQ2 as 2-bit packed, but PrecisionTraits::byte_offset_of has no TQ2 case and will fall through to element_offset * size_of(TQ2) (currently 1). This inconsistency will compute incorrect byte offsets if any generic tensor ops use byte_offset_of with TQ2. Add a TQ2 case to byte_offset_of (and any related alignment assertions) or make TQ2 opt out of these generic helpers.
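A minimal sketch of the consistency being asked for, using hypothetical stand-ins for the project's Precision enum and size_of() helper (the real headers carry more variants):

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins, not the project's actual definitions.
enum class Precision { FP16, INT8, INT4, TQ2 };

constexpr size_t size_of(Precision prec) {
    return prec == Precision::FP16 ? 2 : 1;  // placeholder element sizes
}

// byte_offset_of with an explicit TQ2 case (4 elements per byte),
// kept term-for-term consistent with packed_size_of's (count + 3) / 4.
constexpr size_t byte_offset_of(Precision prec, size_t element_offset) {
    switch (prec) {
        case Precision::INT4: return element_offset / 2;  // 2 elements per byte
        case Precision::TQ2:  return element_offset / 4;  // 4 elements per byte
        default:              return element_offset * size_of(prec);
    }
}
```

Without the TQ2 case, the default branch would report 8 bytes for element offset 8 instead of 2.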

Comment thread cactus/graph/graph.h
Comment on lines 172 to 175
        case Precision::INT4: return (count + 1) / 2;
        case Precision::TQ2: return (count + 3) / 4;
        default: return count * size_of(prec);
    }
Copilot AI Apr 21, 2026

Generic tensor IO uses PrecisionTraits::packed_size_of(precision, total_elements) (e.g., GraphFile::save_node) to decide how many bytes to write. For TQ2 tensors, the on-disk representation is a full self-describing blob (header + codebook/scales/signs/permutation + packed indices), so (count+3)/4 does not represent the serialized size and can lead to truncated/corrupt saves or incorrect buffer sizing. Consider explicitly rejecting Precision::TQ2 in the generic save/size paths, or adding TQ2-specific serialization/size APIs that operate on the blob format.
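One hedged way to implement the explicit rejection, sketched with a hypothetical enum and placeholder element sizes rather than the project's real headers:

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>

enum class Precision { FP16, INT4, TQ2 };  // hypothetical stand-in

// Sketch: make the generic size helper refuse TQ2 outright, forcing
// callers (e.g. a save path) onto a blob-aware API instead of silently
// under-sizing the tensor as (count + 3) / 4 bytes.
size_t packed_size_of_checked(Precision prec, size_t count) {
    switch (prec) {
        case Precision::INT4: return (count + 1) / 2;
        case Precision::TQ2:
            throw std::invalid_argument(
                "TQ2 tensors are serialized as self-describing blobs; "
                "use a TQ2-specific size API");
        default: return count * 2;  // FP16 placeholder size
    }
}
```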

Comment on lines +154 to +173
uint64_t off_cb = rd_u64(blob + 80);
uint64_t off_is = rd_u64(blob + 88);
uint64_t off_rot = rd_u64(blob + 96);
uint64_t off_sc = rd_u64(blob + 104);
uint64_t off_ix = rd_u64(blob + 112);
uint64_t total = rd_u64(blob + 120);
if (total != blob_size) return 0;

out->vocab = (uint32_t)dim0;
out->group_size = group_size;
out->num_groups = num_groups;
out->per_group_bytes = group_size / 4;

out->codebook = reinterpret_cast<const float*>(blob + off_cb);
out->input_scale = reinterpret_cast<const __fp16*>(blob + off_is);
out->left_signs = reinterpret_cast<const int8_t*>(blob + off_rot);
out->right_signs = reinterpret_cast<const int8_t*>(blob + off_rot + group_size);
out->permutation = reinterpret_cast<const uint32_t*>(blob + off_rot + 2u * group_size);
out->scales = reinterpret_cast<const __fp16*>(blob + off_sc);
out->packed = reinterpret_cast<const uint8_t*>(blob + off_ix);
Copilot AI Apr 21, 2026

cactus_tq2_load sets internal pointers from file-provided offsets (off_cb/off_is/off_rot/off_sc/off_ix) without validating that these offsets and the expected ranges (codebook, input_scale, signs+permutation, scales, packed indices) lie within blob_size. A malformed blob could cause out-of-bounds reads later during dequantization. Add explicit bounds checks for each section (including overflow-safe off + size <= blob_size) and reject overlapping/unsorted regions if the format requires it.
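An overflow-safe section check of the suggested kind might look like this (section_in_blob is a hypothetical helper name, shown in isolation):

```cpp
#include <cassert>
#include <cstdint>

// [off, off + size) must lie inside the blob. Written with subtraction
// so that a hostile off near UINT64_MAX cannot wrap off + size around.
bool section_in_blob(uint64_t off, uint64_t size, uint64_t blob_size) {
    return off <= blob_size && size <= blob_size - off;
}
```

A loader could then validate each region before forming pointers, e.g. reject the blob unless section_in_blob(off_sc, scale_bytes, blob_size) holds for the scales section, and likewise for the codebook, signs/permutation, and packed indices.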

Comment on lines +162 to +163
out->vocab = (uint32_t)dim0;
out->group_size = group_size;
Copilot AI Apr 21, 2026

out->vocab is assigned via a narrowing cast from the 64-bit dim0 without checking range. If a corrupted blob encodes dim0 > UINT32_MAX, this will wrap and can lead to incorrect bounds checks and OOB access. Validate dim0 <= std::numeric_limits<uint32_t>::max() before casting.
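A sketch of the range check before the narrowing cast (fits_u32 is a hypothetical helper name):

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Reject 64-bit dimensions that would wrap when narrowed to uint32_t.
bool fits_u32(uint64_t v) {
    return v <= std::numeric_limits<uint32_t>::max();
}
```

The loader would return 0 when !fits_u32(dim0) instead of letting a dim0 such as 0x100000000 silently become vocab == 0.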

Comment on lines +178 to +182
for (uint32_t i = 0; i < group_size; ++i) {
    uint32_t p = out->permutation[i];
    if (p >= group_size) return 0;
    out->inv_permutation[p] = i;
}
Copilot AI Apr 21, 2026

The permutation inversion build does not verify that permutation is a bijection. If the blob contains duplicates, some entries of inv_permutation[0..group_size) remain uninitialized, and dequant_group will read them (inv_permutation[k]) leading to undefined behavior. Initialize inv_permutation to a sentinel value and reject duplicates / missing entries.

Suggested change
for (uint32_t i = 0; i < group_size; ++i) {
    uint32_t p = out->permutation[i];
    if (p >= group_size) return 0;
    out->inv_permutation[p] = i;
}

for (uint32_t i = 0; i < group_size; ++i) {
    out->inv_permutation[i] = UINT32_MAX;
}
for (uint32_t i = 0; i < group_size; ++i) {
    uint32_t p = out->permutation[i];
    if (p >= group_size) return 0;
    if (out->inv_permutation[p] != UINT32_MAX) return 0;
    out->inv_permutation[p] = i;
}
for (uint32_t i = 0; i < group_size; ++i) {
    if (out->inv_permutation[i] == UINT32_MAX) return 0;
}

Comment on lines +186 to +190
void cactus_tq2_dequant_row(const CactusTQ2* tq2, uint32_t token_id, __fp16* out) {
    const uint32_t gs = tq2->group_size;
    for (uint32_t g = 0; g < tq2->num_groups; ++g) {
        dequant_group(tq2, token_id, g, out + g * gs);
    }
Copilot AI Apr 21, 2026

cactus_tq2_dequant_row (and dequant_group) assume token_id < tq2->vocab and will index into tq2->scales / tq2->packed based on token_id without any internal bounds check. Since this is a public kernel API, add a fast guard (return/abort) or document/enforce the precondition to prevent OOB reads when called from other sites.
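A sketch of the fast precondition guard, using a hypothetical slimmed-down descriptor (the real CactusTQ2 carries more fields than shown):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical minimal descriptor for illustration only.
struct CactusTQ2 { uint32_t vocab; uint32_t group_size; uint32_t num_groups; };

// Reject token_id >= vocab (and a null descriptor) before any
// indexing into scales / packed storage can happen.
bool tq2_row_in_range(const CactusTQ2* tq2, uint32_t token_id) {
    return tq2 != nullptr && token_id < tq2->vocab;
}
```

The public entry point would then return early (or abort, if the precondition is meant to be hard) when the guard fails.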

Comment on lines +228 to +237
if (emb_prec == Precision::TQ2) {
    const CactusTQ2* tq2 = embeddings_buffer.tq2;
    if (tq2 == nullptr) {
        throw std::runtime_error("TQ2 embedding: descriptor missing");
    }
    for (size_t i = 0; i < num_indices; i++) {
        uint32_t idx = static_cast<uint32_t>(indices_ptr[i]);
        if (idx >= tq2->vocab) idx = 0;
        cactus_tq2_dequant_row(tq2, idx, output + i * hidden_dim);
    }
Copilot AI Apr 21, 2026

TQ2 embedding index handling currently clamps out-of-range indices to 0, while the INT4/INT8 and FP16 embedding paths throw on out-of-bounds indices. This inconsistency can hide data/graph bugs and makes behavior depend on embedding precision. Consider matching the existing behavior (throw with a clear error) or make the clamping behavior consistent across precisions.
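A sketch of the throwing variant, assuming the other precision paths raise on out-of-range indices as described (checked_embedding_index is a hypothetical helper, with vocab standing in for tq2->vocab):

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <string>

// Validate an embedding index, throwing like the INT4/INT8/FP16 paths
// instead of silently clamping out-of-range ids to token 0.
uint32_t checked_embedding_index(int64_t raw, uint32_t vocab) {
    if (raw < 0 || static_cast<uint64_t>(raw) >= vocab) {
        throw std::out_of_range("TQ2 embedding index " + std::to_string(raw) +
                                " out of range [0, " + std::to_string(vocab) + ")");
    }
    return static_cast<uint32_t>(raw);
}
```

Throwing surfaces tokenizer/graph bugs immediately, whereas clamping silently substitutes token 0's embedding and makes behavior depend on the embedding precision.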
