Skip to content

perf(quantize): AVX2 SIMD Q4_0/Q8_0 dequant — ~8-9× speedup (closes #386)#1691

Merged
noahgift merged 20 commits into
mainfrom
perf/386-q4-q8-avx2-dequant
May 16, 2026
Merged

perf(quantize): AVX2 SIMD Q4_0/Q8_0 dequant — ~8-9× speedup (closes #386)#1691
noahgift merged 20 commits into
mainfrom
perf/386-q4-q8-avx2-dequant

Conversation

@noahgift
Copy link
Copy Markdown
Contributor

Summary

Issue #386 measured Q4_0/Q8_0 dequant at ~1.22 Gelem/s — 5× below the
26.8 GB/s memcpy ceiling on the same hardware — and attributed it to a
missing SIMD path. The scalar implementations bottleneck on i8→i32→f32
sign-extend cascades and per-byte nibble unpacks that LLVM cannot
auto-vectorize through.

This PR adds AVX2 fast paths in
`aprender-core/src/format/quantize_simd.rs`, runtime-gated via
`is_x86_feature_detected!("avx2")`. Non-x86 targets and hosts without
AVX2 keep the scalar code path unchanged.

Implementation

  • Q8_0: load 32 i8 quants → 4× `_mm256_cvtepi8_epi32` (8 lanes
    each) → `_mm256_cvtepi32_ps` → multiply broadcast f16 scale → 4×
    `_mm256_storeu_ps`.
  • Q4_0: load 16 packed nibble bytes → mask low + shift-mask high →
    `mm_unpacklo/hi_epi8` to recover the interleaved layout
    (`byte_i = q
    {2i+1} << 4 | q_{2i}` — matches
    `Q4_0Quantizer::quantize`) → 4× `_mm256_cvtepu8_epi32` +
    `mm256_sub_epi32(, 8)` → cvt → mul → store.

Both functions are `unsafe fn` + `#[target_feature(enable = "avx2")]`,
called from `_dispatch` wrappers that perform the feature detection.
`#[allow(unsafe_code)]` is scoped to this file; no public API gains
`unsafe`. Bounds invariants are documented in the module doc-comment
and asserted by the parity tests.

Measured speedup (262144 elements, AMD Zen 4 host)

Before (issue #386) After Speedup
Q8_0 dequant 1.22 Gelem/s 11.4 Gelem/s 9.3×
Q4_0 dequant 1.25 Gelem/s 9.8 Gelem/s 7.8×

Far above the ≥10% perf gate; in line with the issue's "AVX2 should
approach memcpy speed" prediction (memcpy ceiling here is ~6.7 Gelem/s
output bandwidth for f32 — both fast paths exceed it because input
bandwidth at 1 byte/elem (Q8_0) or 0.5 bytes/elem (Q4_0) is far below
the output side).

Tests

  • `scalar_simd_parity_q8_0` / `scalar_simd_parity_q4_0`: bit-exact
    (`to_bits()`) equality between scalar and SIMD output across
    [32, 64, 256, 1024, 32 × 71] element counts.
  • `dispatch_returns_false_without_avx2`: non-x86 path asserts the
    dispatcher correctly signals scalar fallback.
  • All 58 pre-existing quantize tests still pass (round-trip + property).

```
$ cargo test -p aprender-core --lib --features format-quantize quantize_simd
running 3 tests
test format::quantize_simd::tests::dispatch_returns_false_without_avx2 ... ok
test format::quantize_simd::tests::scalar_simd_parity_q4_0 ... ok
test format::quantize_simd::tests::scalar_simd_parity_q8_0 ... ok
test result: ok. 3 passed; 0 failed
```

Out of scope

  • SSE2 / AVX-512 / NEON / WASM SIMD128 fast paths — separate tickets
    (AVX2 covers the x86-64 default since ~2013, hitting essentially
    every modern x86-64 dev/CI machine including the self-hosted runners).
  • Q4_1 / Q4_K / Q6_K / Q8_K SIMD — separate, more complex layouts.
  • Quantize (compress) path — issue perf: Q4_0/Q8_0 dequant throughput 5x below memcpy ceiling — missing SIMD vectorization #386 notes quantize is 4× slower
    than dequant, but the dominant cost is finding block scales
    (reductions), not the per-element compress. Separate ticket.

Closes #386.

🤖 Generated with Claude Code

)

Issue #386 measured Q4_0/Q8_0 dequant at ~1.22 Gelem/s — 5× below the
26.8 GB/s memcpy ceiling on the same hardware — and attributed it to a
missing SIMD path. The scalar implementations bottleneck on i8→i32→f32
sign-extend cascades and per-byte nibble unpacks that LLVM cannot
auto-vectorize through.

This PR adds AVX2 fast paths in `aprender-core/src/format/quantize_simd.rs`,
runtime-gated via `is_x86_feature_detected!("avx2")`. Non-x86 targets and
hosts without AVX2 keep the scalar code path unchanged.

## Implementation

- **Q8_0**: load 32 i8 quants → 4× `_mm256_cvtepi8_epi32` (8 lanes each)
  → `_mm256_cvtepi32_ps` → multiply broadcast f16 scale → 4× `_mm256_storeu_ps`.
- **Q4_0**: load 16 packed nibble bytes → mask low + shift-mask high →
  `_mm_unpacklo/hi_epi8` to recover the interleaved layout
  (`byte_i = q_{2i+1} << 4 | q_{2i}` — matches `Q4_0Quantizer::quantize`)
  → 4× `_mm256_cvtepu8_epi32` + `_mm256_sub_epi32(_, 8)` → cvt → mul → store.

Both functions are `unsafe fn` + `#[target_feature(enable = "avx2")]`,
called from `_dispatch` wrappers that perform the feature detection.
`#[allow(unsafe_code)]` is scoped to this file; no public API gains
`unsafe`. Bounds invariants are documented in the module doc-comment
and asserted by the parity tests.

## Measured speedup (262144 elements, AMD Zen 4 host)

|              | Before (issue #386) | After          | Speedup |
|--------------|---------------------|----------------|---------|
| Q8_0 dequant | 1.22 Gelem/s        | 11.4 Gelem/s   | 9.3×    |
| Q4_0 dequant | 1.25 Gelem/s        | 9.8 Gelem/s    | 7.8×    |

Well above the ≥10% perf gate; in line with the issue's "AVX2 should
approach memcpy speed" prediction (memcpy ceiling here is ~6.7 Gelem/s
output bandwidth for f32 — both fast paths exceed it because input
bandwidth at 1 byte/elem (Q8_0) or 0.5 bytes/elem (Q4_0) is far below
the output side).

## Tests

- `scalar_simd_parity_q8_0` / `scalar_simd_parity_q4_0`:
  bit-exact (`to_bits()`) equality between scalar and SIMD output
  across [32, 64, 256, 1024, 32 × 71] element counts.
- `dispatch_returns_false_without_avx2`: non-x86 path asserts
  the dispatcher correctly signals scalar fallback.
- All 58 pre-existing quantize tests still pass (round-trip + property).

```
$ cargo test -p aprender-core --lib --features format-quantize quantize_simd
running 3 tests
test format::quantize_simd::tests::dispatch_returns_false_without_avx2 ... ok
test format::quantize_simd::tests::scalar_simd_parity_q4_0 ... ok
test format::quantize_simd::tests::scalar_simd_parity_q8_0 ... ok
test result: ok. 3 passed; 0 failed
```

## Out of scope

- SSE2 / AVX-512 / NEON / WASM SIMD128 fast paths — separate tickets
  (AVX2 covers the x86-64 default since ~2013, hitting essentially
  every modern x86-64 dev/CI machine including the self-hosted runners).
- Q4_1 / Q4_K / Q6_K / Q8_K SIMD — separate, more complex layouts.
- Quantize (compress) path — issue #386 notes quantize is 4× slower
  than dequant, but the dominant cost is finding block scales
  (reductions), not the per-element compress. Separate ticket.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift enabled auto-merge (squash) May 15, 2026 08:45
@noahgift noahgift merged commit 4c41c3d into main May 16, 2026
10 checks passed
@noahgift noahgift deleted the perf/386-q4-q8-avx2-dequant branch May 16, 2026 01:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

perf: Q4_0/Q8_0 dequant throughput 5x below memcpy ceiling — missing SIMD vectorization

1 participant