WIP: llama: Vulkan: Fix Adreno Q8_0 issues. #11

infinitalo · 2025-08-29T13:37:13Z

This MR is a work-in-progress.

The current commits are able to get inference working for Q8_0 on Adreno 830 (Samsung S25), but finetuning still crashes.

We're currently working on a fix for lora-finetuning on Adreno A830, but you can use this for testing in the meanwhile.

…a is provided

Signed-off-by: vineet <[email protected]>

…lation Signed-off-by: vineet <[email protected]>

This fixes the vkDeviceLostError on Mali

This makes MUL_MAT tests pass for Q8_0 when n=9 failed.

infinitalo · 2025-09-01T15:12:45Z

Steps to run the backend-ops test suite:

Set up your Android environment for testing llama.cpp. You can use this comment as a reference if you haven't built it already: Add initial LoRA finetuning support; vulkan OUT_PROD; vulkan cross-entropy-backward #5 (comment)
Configure your build with: cmake -B build -DGGML_VULKAN=1 -DCMAKE_BUILD_TYPE=Debug -DBUILD_TESTING=ON
Build llama.cpp: cmake --build build --config Debug -j2
Run the backend-ops tests: ./build/bin/test-backend-ops
You can also run tests for specific operators with the -o option, for example: ./build/bin/test-backend-ops -o MUL_MAT

This PR has a commit disabling several tests for quantized datatypes that are not currently working properly on Adreno 830.

If you run the test suite as described above with this branch, it should say 2/2 backends passing at the end, with no failing tests on A830, as the attached file shows.

test_adreno_q8_inf2.txt

makaveli10 and others added 20 commits August 19, 2025 10:07

Add lora finetuning from adapter

f7b0025

Add: create new lora adapter for target modules to finetune if no lor…

116f3dd

…a is provided

Fix identical loss over epochs; fix garbage lora initization

9e6d8ce

Signed-off-by: vineet <[email protected]>

Remove lora training from finetune.cpp

8bb11c0

Signed-off-by: vineet <[email protected]>

Add adapter saving & other lora target modules

486ebc1

Signed-off-by: vineet <[email protected]>

Add finetune-lora for lora finetuning in examples

c23ada9

Signed-off-by: vineet <[email protected]>

Add dequantization to out_prod cuda kernel

3f295e1

Signed-off-by: vineet <[email protected]>

Update README with finetune-lora

0c1ffd1

Signed-off-by: vineet <[email protected]>

Vulkan: add support for fp32 OUT_PROD op

e9f5d88

CPU: add support for fp16_fp32 OUT_PROD op

fb0e501

Vulkan: add support for f16_f32 OUT_PROD op

2b0c835

Vulkan: Add Q4_0/Q8_0 OUT_PROD Vulkan support

0aef6c8

vulkan: Add initial cross entropy loss backward shader

25c5316

Signed-off-by: vineet <[email protected]>

vulkan: Fix cross-entropy-loss-back dispatch size and wg denominator

0721550

Signed-off-by: vineet <[email protected]>

vulkan: Change uint32 cast to int32 for outprod; allows android compi…

bc7dd9f

…lation Signed-off-by: vineet <[email protected]>

vulkan: Deallocate memory after destroying buffer

c36aeee

vulkan: Set specialization constants to { 0 } for out_prod

1709861

This fixes the vkDeviceLostError on Mali

vulkan: Set out_prod pipeline disable_robustness to true

b0c5b5b

Fix out_prod; vulkan ci issues

075d1cb

Add GEGLU backward (Vulkan) to enable Gemma training.

191dd7e

github-actions bot added Nvidia GPU Vulkan examples ggml testing labels Aug 29, 2025

Italo Nicola added 5 commits September 1, 2025 10:57

(wip) Vulkan: remove packed16 optimization for Q8_0 dequant4

3ccd40a

(wip) Vulkan: disable packed16 optimizations for Q8_0 src0

089377d

(wip) Vulkan: disable mulmat device->integer_dot_product optimization

d127984

This makes MUL_MAT tests pass for Q8_0 when n=9 failed.

(wip) Vulkan: remove [[unroll]] and dot() calls in mul_mat_vec shader

0547341

(wip) Vulkan: stop using data_b_v4 in mul_mat_vec shader for Q8_0

4bf9f07

Italo Nicola added 5 commits September 1, 2025 10:57

(wip) Vulkan: severely reduce threshold needed for submitting mul_mat

9c20a4c

(wip) Vulkan: Disable device->subgroup_size_control

642ea3b

(wip) Vulkan: disable device->integer_dot_product

405c90c

(wip) Vulkan: disable COOPMAT support

58d3c68

(wip) Tests: disable non-q8 quant tests

208747f

infinitalo force-pushed the italo/tether/adreno_q8_inference branch 2 times, most recently from cbea88f to 208747f Compare September 1, 2025 14:13

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

WIP: llama: Vulkan: Fix Adreno Q8_0 issues. #11

WIP: llama: Vulkan: Fix Adreno Q8_0 issues. #11

infinitalo commented Aug 29, 2025

Uh oh!

infinitalo commented Sep 1, 2025 •

edited

Loading

Uh oh!

Uh oh!

WIP: llama: Vulkan: Fix Adreno Q8_0 issues. #11

Are you sure you want to change the base?

WIP: llama: Vulkan: Fix Adreno Q8_0 issues. #11

Conversation

infinitalo commented Aug 29, 2025

Uh oh!

infinitalo commented Sep 1, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

infinitalo commented Sep 1, 2025 •

edited

Loading