Pull request overview
Adds end-to-end support for a new 2-bit quantized tensor format (Precision::TQ2) focused on embedding weights, including mmap loading, descriptor wiring, and runtime dequantization.
Changes:
- Introduces the `CactusTQ2` descriptor plus loader/dequantizer (`cactus_tq2_load`, `cactus_tq2_dequant_row`).
- Extends graph precision handling to include `Precision::TQ2` and routes embedding computation through the TQ2 dequant path.
- Updates mmap tensor parsing to construct and retain a `CactusTQ2` descriptor for TQ2 files.
Reviewed changes
Copilot reviewed 5 out of 6 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| cactus/kernel/kernel_tq2.cpp | Implements TQ2 blob parsing and per-row dequantization (NEON + FWHT). |
| cactus/kernel/kernel.h | Exposes CactusTQ2 and the TQ2 load/dequant APIs. |
| cactus/graph/graph_ops_tensor.cpp | Adds Precision::TQ2 embedding execution path. |
| cactus/graph/graph_io.cpp | Allows TQ2 precision in graphs; builds CactusTQ2 during mmap and wires it into buffers. |
| cactus/graph/graph.h | Adds Precision::TQ2 and updates PrecisionTraits/buffer descriptor to carry a TQ2 descriptor pointer. |
| .gitignore | Normalizes .githooks/ ignore and adds examples/ to ignores. |
```cpp
static constexpr size_t packed_size_of(Precision prec, size_t count) {
    switch (prec) {
        case Precision::INT4: return (count + 1) / 2;
        case Precision::TQ2: return (count + 3) / 4;
        default: return count * size_of(prec);
```
PrecisionTraits::packed_size_of now treats Precision::TQ2 as 2-bit packed, but PrecisionTraits::byte_offset_of has no TQ2 case and will fall through to element_offset * size_of(TQ2) (currently 1). This inconsistency will compute incorrect byte offsets if any generic tensor ops use byte_offset_of with TQ2. Add a TQ2 case to byte_offset_of (and any related alignment assertions) or make TQ2 opt out of these generic helpers.
```cpp
        case Precision::INT4: return (count + 1) / 2;
        case Precision::TQ2: return (count + 3) / 4;
        default: return count * size_of(prec);
    }
```
Generic tensor IO uses PrecisionTraits::packed_size_of(precision, total_elements) (e.g., GraphFile::save_node) to decide how many bytes to write. For TQ2 tensors, the on-disk representation is a full self-describing blob (header + codebook/scales/signs/permutation + packed indices), so (count+3)/4 does not represent the serialized size and can lead to truncated/corrupt saves or incorrect buffer sizing. Consider explicitly rejecting Precision::TQ2 in the generic save/size paths, or adding TQ2-specific serialization/size APIs that operate on the blob format.
```cpp
uint64_t off_cb = rd_u64(blob + 80);
uint64_t off_is = rd_u64(blob + 88);
uint64_t off_rot = rd_u64(blob + 96);
uint64_t off_sc = rd_u64(blob + 104);
uint64_t off_ix = rd_u64(blob + 112);
uint64_t total = rd_u64(blob + 120);
if (total != blob_size) return 0;

out->vocab = (uint32_t)dim0;
out->group_size = group_size;
out->num_groups = num_groups;
out->per_group_bytes = group_size / 4;

out->codebook = reinterpret_cast<const float*>(blob + off_cb);
out->input_scale = reinterpret_cast<const __fp16*>(blob + off_is);
out->left_signs = reinterpret_cast<const int8_t*>(blob + off_rot);
out->right_signs = reinterpret_cast<const int8_t*>(blob + off_rot + group_size);
out->permutation = reinterpret_cast<const uint32_t*>(blob + off_rot + 2u * group_size);
out->scales = reinterpret_cast<const __fp16*>(blob + off_sc);
out->packed = reinterpret_cast<const uint8_t*>(blob + off_ix);
```
cactus_tq2_load sets internal pointers from file-provided offsets (off_cb/off_is/off_rot/off_sc/off_ix) without validating that these offsets and the expected ranges (codebook, input_scale, signs+permutation, scales, packed indices) lie within blob_size. A malformed blob could cause out-of-bounds reads later during dequantization. Add explicit bounds checks for each section (including overflow-safe off + size <= blob_size) and reject overlapping/unsorted regions if the format requires it.
```cpp
out->vocab = (uint32_t)dim0;
out->group_size = group_size;
```
out->vocab is assigned via a narrowing cast from the 64-bit dim0 without checking range. If a corrupted blob encodes dim0 > UINT32_MAX, this will wrap and can lead to incorrect bounds checks and OOB access. Validate dim0 <= std::numeric_limits<uint32_t>::max() before casting.
```cpp
for (uint32_t i = 0; i < group_size; ++i) {
    uint32_t p = out->permutation[i];
    if (p >= group_size) return 0;
    out->inv_permutation[p] = i;
}
```
The permutation inversion build does not verify that permutation is a bijection. If the blob contains duplicates, some entries of inv_permutation[0..group_size) remain uninitialized, and dequant_group will read them (inv_permutation[k]) leading to undefined behavior. Initialize inv_permutation to a sentinel value and reject duplicates / missing entries.
Suggested change:

```cpp
for (uint32_t i = 0; i < group_size; ++i) {
    out->inv_permutation[i] = UINT32_MAX;
}
for (uint32_t i = 0; i < group_size; ++i) {
    uint32_t p = out->permutation[i];
    if (p >= group_size) return 0;
    if (out->inv_permutation[p] != UINT32_MAX) return 0;
    out->inv_permutation[p] = i;
}
for (uint32_t i = 0; i < group_size; ++i) {
    if (out->inv_permutation[i] == UINT32_MAX) return 0;
}
```
```cpp
void cactus_tq2_dequant_row(const CactusTQ2* tq2, uint32_t token_id, __fp16* out) {
    const uint32_t gs = tq2->group_size;
    for (uint32_t g = 0; g < tq2->num_groups; ++g) {
        dequant_group(tq2, token_id, g, out + g * gs);
    }
}
```
cactus_tq2_dequant_row (and dequant_group) assume token_id < tq2->vocab and will index into tq2->scales / tq2->packed based on token_id without any internal bounds check. Since this is a public kernel API, add a fast guard (return/abort) or document/enforce the precondition to prevent OOB reads when called from other sites.
```cpp
if (emb_prec == Precision::TQ2) {
    const CactusTQ2* tq2 = embeddings_buffer.tq2;
    if (tq2 == nullptr) {
        throw std::runtime_error("TQ2 embedding: descriptor missing");
    }
    for (size_t i = 0; i < num_indices; i++) {
        uint32_t idx = static_cast<uint32_t>(indices_ptr[i]);
        if (idx >= tq2->vocab) idx = 0;
        cactus_tq2_dequant_row(tq2, idx, output + i * hidden_dim);
    }
}
```
TQ2 embedding index handling currently clamps out-of-range indices to 0, while the INT4/INT8 and FP16 embedding paths throw on out-of-bounds indices. This inconsistency can hide data/graph bugs and makes behavior depend on embedding precision. Consider matching the existing behavior (throw with a clear error) or make the clamping behavior consistent across precisions.
No description provided.