perf(quantize): AVX2 SIMD Q4_0/Q8_0 dequant — ~8-9× speedup (closes #386) by noahgift · Pull Request #1691 · paiml/aprender

noahgift · 2026-05-15T08:45:43Z

Summary

Issue #386 measured Q4_0/Q8_0 dequant at ~1.22 Gelem/s — 5× below the
26.8 GB/s memcpy ceiling on the same hardware — and attributed it to a
missing SIMD path. The scalar implementations bottleneck on i8→i32→f32
sign-extend cascades and per-byte nibble unpacks that LLVM cannot
auto-vectorize through.

This PR adds AVX2 fast paths in
`aprender-core/src/format/quantize_simd.rs`, runtime-gated via
`is_x86_feature_detected!("avx2")`. Non-x86 targets and hosts without
AVX2 keep the scalar code path unchanged.

Implementation

Q8_0: load 32 i8 quants → 4× `_mm256_cvtepi8_epi32` (8 lanes
each) → `_mm256_cvtepi32_ps` → multiply broadcast f16 scale → 4×
`_mm256_storeu_ps`.
Q4_0: load 16 packed nibble bytes → mask low + shift-mask high →
`mm_unpacklo/hi_epi8` to recover the interleaved layout
(`byte_i = q{2i+1} << 4 | q_{2i}` — matches
`Q4_0Quantizer::quantize`) → 4× `_mm256_cvtepu8_epi32` +
`mm256_sub_epi32(, 8)` → cvt → mul → store.

Both functions are `unsafe fn` + `#[target_feature(enable = "avx2")]`,
called from `_dispatch` wrappers that perform the feature detection.
`#[allow(unsafe_code)]` is scoped to this file; no public API gains
`unsafe`. Bounds invariants are documented in the module doc-comment
and asserted by the parity tests.

Measured speedup (262144 elements, AMD Zen 4 host)

	Before (issue #386)	After	Speedup
Q8_0 dequant	1.22 Gelem/s	11.4 Gelem/s	9.3×
Q4_0 dequant	1.25 Gelem/s	9.8 Gelem/s	7.8×

Far above the ≥10% perf gate; in line with the issue's "AVX2 should
approach memcpy speed" prediction (memcpy ceiling here is ~6.7 Gelem/s
output bandwidth for f32 — both fast paths exceed it because input
bandwidth at 1 byte/elem (Q8_0) or 0.5 bytes/elem (Q4_0) is far below
the output side).

Tests

`scalar_simd_parity_q8_0` / `scalar_simd_parity_q4_0`: bit-exact
(`to_bits()`) equality between scalar and SIMD output across
[32, 64, 256, 1024, 32 × 71] element counts.
`dispatch_returns_false_without_avx2`: non-x86 path asserts the
dispatcher correctly signals scalar fallback.
All 58 pre-existing quantize tests still pass (round-trip + property).

```
$ cargo test -p aprender-core --lib --features format-quantize quantize_simd
running 3 tests
test format::quantize_simd::tests::dispatch_returns_false_without_avx2 ... ok
test format::quantize_simd::tests::scalar_simd_parity_q4_0 ... ok
test format::quantize_simd::tests::scalar_simd_parity_q8_0 ... ok
test result: ok. 3 passed; 0 failed
```

Out of scope

SSE2 / AVX-512 / NEON / WASM SIMD128 fast paths — separate tickets
(AVX2 covers the x86-64 default since ~2013, hitting essentially
every modern x86-64 dev/CI machine including the self-hosted runners).
Q4_1 / Q4_K / Q6_K / Q8_K SIMD — separate, more complex layouts.
Quantize (compress) path — issue perf: Q4_0/Q8_0 dequant throughput 5x below memcpy ceiling — missing SIMD vectorization #386 notes quantize is 4× slower
than dequant, but the dominant cost is finding block scales
(reductions), not the per-element compress. Separate ticket.

Closes #386.

🤖 Generated with Claude Code

) Issue #386 measured Q4_0/Q8_0 dequant at ~1.22 Gelem/s — 5× below the 26.8 GB/s memcpy ceiling on the same hardware — and attributed it to a missing SIMD path. The scalar implementations bottleneck on i8→i32→f32 sign-extend cascades and per-byte nibble unpacks that LLVM cannot auto-vectorize through. This PR adds AVX2 fast paths in `aprender-core/src/format/quantize_simd.rs`, runtime-gated via `is_x86_feature_detected!("avx2")`. Non-x86 targets and hosts without AVX2 keep the scalar code path unchanged. ## Implementation - **Q8_0**: load 32 i8 quants → 4× `_mm256_cvtepi8_epi32` (8 lanes each) → `_mm256_cvtepi32_ps` → multiply broadcast f16 scale → 4× `_mm256_storeu_ps`. - **Q4_0**: load 16 packed nibble bytes → mask low + shift-mask high → `_mm_unpacklo/hi_epi8` to recover the interleaved layout (`byte_i = q_{2i+1} << 4 | q_{2i}` — matches `Q4_0Quantizer::quantize`) → 4× `_mm256_cvtepu8_epi32` + `_mm256_sub_epi32(_, 8)` → cvt → mul → store. Both functions are `unsafe fn` + `#[target_feature(enable = "avx2")]`, called from `_dispatch` wrappers that perform the feature detection. `#[allow(unsafe_code)]` is scoped to this file; no public API gains `unsafe`. Bounds invariants are documented in the module doc-comment and asserted by the parity tests. ## Measured speedup (262144 elements, AMD Zen 4 host) | | Before (issue #386) | After | Speedup | |--------------|---------------------|----------------|---------| | Q8_0 dequant | 1.22 Gelem/s | 11.4 Gelem/s | 9.3× | | Q4_0 dequant | 1.25 Gelem/s | 9.8 Gelem/s | 7.8× | Well above the ≥10% perf gate; in line with the issue's "AVX2 should approach memcpy speed" prediction (memcpy ceiling here is ~6.7 Gelem/s output bandwidth for f32 — both fast paths exceed it because input bandwidth at 1 byte/elem (Q8_0) or 0.5 bytes/elem (Q4_0) is far below the output side). ## Tests - `scalar_simd_parity_q8_0` / `scalar_simd_parity_q4_0`: bit-exact (`to_bits()`) equality between scalar and SIMD output across [32, 64, 256, 1024, 32 × 71] element counts. - `dispatch_returns_false_without_avx2`: non-x86 path asserts the dispatcher correctly signals scalar fallback. - All 58 pre-existing quantize tests still pass (round-trip + property). ``` $ cargo test -p aprender-core --lib --features format-quantize quantize_simd running 3 tests test format::quantize_simd::tests::dispatch_returns_false_without_avx2 ... ok test format::quantize_simd::tests::scalar_simd_parity_q4_0 ... ok test format::quantize_simd::tests::scalar_simd_parity_q8_0 ... ok test result: ok. 3 passed; 0 failed ``` ## Out of scope - SSE2 / AVX-512 / NEON / WASM SIMD128 fast paths — separate tickets (AVX2 covers the x86-64 default since ~2013, hitting essentially every modern x86-64 dev/CI machine including the self-hosted runners). - Q4_1 / Q4_K / Q6_K / Q8_K SIMD — separate, more complex layouts. - Quantize (compress) path — issue #386 notes quantize is 4× slower than dequant, but the dominant cost is finding block scales (reductions), not the per-element compress. Separate ticket. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

noahgift enabled auto-merge (squash) May 15, 2026 08:45

noahgift added 19 commits May 15, 2026 11:04

Merge branch 'main' into perf/386-q4-q8-avx2-dequant

7241ac3

Merge branch 'main' into perf/386-q4-q8-avx2-dequant

3e5c533

Merge branch 'main' into perf/386-q4-q8-avx2-dequant

0c4016f

Merge branch 'main' into perf/386-q4-q8-avx2-dequant

431093e

Merge branch 'main' into perf/386-q4-q8-avx2-dequant

59a9f38

Merge branch 'main' into perf/386-q4-q8-avx2-dequant

23e1065

Merge branch 'main' into perf/386-q4-q8-avx2-dequant

f010be4

Merge branch 'main' into perf/386-q4-q8-avx2-dequant

64a7362

Merge branch 'main' into perf/386-q4-q8-avx2-dequant

358154e

Merge branch 'main' into perf/386-q4-q8-avx2-dequant

fe7fbcd

Merge branch 'main' into perf/386-q4-q8-avx2-dequant

cfef281

Merge branch 'main' into perf/386-q4-q8-avx2-dequant

d4f5f67

Merge branch 'main' into perf/386-q4-q8-avx2-dequant

867690d

Merge branch 'main' into perf/386-q4-q8-avx2-dequant

e491d97

Merge branch 'main' into perf/386-q4-q8-avx2-dequant

8994d06

Merge branch 'main' into perf/386-q4-q8-avx2-dequant

b17d8c6

Merge branch 'main' into perf/386-q4-q8-avx2-dequant

b683932

Merge branch 'main' into perf/386-q4-q8-avx2-dequant

33c2008

Merge branch 'main' into perf/386-q4-q8-avx2-dequant

010f330

noahgift merged commit 4c41c3d into main May 16, 2026
10 checks passed

noahgift deleted the perf/386-q4-q8-avx2-dequant branch May 16, 2026 01:29

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf(quantize): AVX2 SIMD Q4_0/Q8_0 dequant — ~8-9× speedup (closes #386)#1691

perf(quantize): AVX2 SIMD Q4_0/Q8_0 dequant — ~8-9× speedup (closes #386)#1691
noahgift merged 20 commits into
mainfrom
perf/386-q4-q8-avx2-dequant

noahgift commented May 15, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

noahgift commented May 15, 2026

Summary

Implementation

Measured speedup (262144 elements, AMD Zen 4 host)

Tests

Out of scope

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant