vulkan : add fp16 support for the conv_2d kernel #14872


Merged: 4 commits from vk_conv2d_fp16_knl into ggml-org:master on Jul 27, 2025

Conversation

Green-Sky
Collaborator

@Green-Sky Green-Sky commented Jul 25, 2025

This enables you to run an fp16 sd1.x model with sd.cpp.

This is my first time touching the vulkan code, feedback appreciated.

Related discussions: leejet/stable-diffusion.cpp#739

@github-actions github-actions bot added Vulkan Issues specific to the Vulkan backend ggml changes relating to the ggml tensor library for machine learning labels Jul 25, 2025
@Green-Sky Green-Sky force-pushed the vk_conv2d_fp16_knl branch from 2c62fd5 to f0f7b73 Compare July 25, 2025 09:52
Collaborator

@jeffbolznv jeffbolznv left a comment


LGTM. I haven't tested it locally, though.

@Green-Sky Green-Sky marked this pull request as ready for review July 25, 2025 13:46
@Green-Sky Green-Sky requested a review from 0cc4m as a code owner July 25, 2025 13:46
@github-actions github-actions bot added the testing Everything test related label Jul 25, 2025
Comment on lines 5143 to 5151
     for (auto act_case : cases) {
         test_cases.emplace_back(new test_conv_2d(
             { act_case[iwh_idx], act_case[iwh_idx], act_case[Cin_idx], act_case[B_idx] },
-            { act_case[kwh_idx], act_case[kwh_idx], act_case[Cin_idx], act_case[Cout_idx] }, 1, 1, 0, 0, 1, 1, false));
+            { act_case[kwh_idx], act_case[kwh_idx], act_case[Cin_idx], act_case[Cout_idx] },
+            GGML_TYPE_F32, 1, 1, 0, 0, 1, 1, false));
+        test_cases.emplace_back(new test_conv_2d(
+            { act_case[iwh_idx], act_case[iwh_idx], act_case[Cin_idx], act_case[B_idx] },
+            { act_case[kwh_idx], act_case[kwh_idx], act_case[Cin_idx], act_case[Cout_idx] },
+            GGML_TYPE_F16, 1, 1, 0, 0, 1, 1, false));
Collaborator


If we repeat the same test for different formats we typically loop through a ggml_type instead.

// glu ops
for (ggml_type type : {GGML_TYPE_F16, GGML_TYPE_F32}) {
    for (int v : {0, 1}) {
        for (int op = 0; op < GGML_GLU_OP_COUNT; op++) {
            for (bool swapped : {false, true}) {
                test_cases.emplace_back(new test_glu((ggml_glu_op) op, type, { 128, 2, 2, 2 }, v, swapped));
                test_cases.emplace_back(new test_glu((ggml_glu_op) op, type, { 5, 7, 11, 13 }, v, swapped));
            }
        }
    }
}

Collaborator Author


Done. Also added the f16 tests to the normal tests, since they seem to run fast.

@Green-Sky Green-Sky force-pushed the vk_conv2d_fp16_knl branch from 62cbfe3 to 4fa0331 Compare July 25, 2025 18:45
@Green-Sky
Collaborator Author

Looks like the error in a few cases is just slightly over the threshold.

$ bin/test-backend-ops test -o CONV_2D
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 2070 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 0 | matrix cores: KHR_coopmat
Testing 2 devices

Backend 1/2: Vulkan0
  Device description: NVIDIA GeForce RTX 2070
  Device memory: 8192 MB (8192 MB free)

[CONV_2D] NMSE = 0.000000127 > 0.000000100   CONV_2D(ne_input=[1,1,1,2],ne_kernel=[1,1,1,12],type_kernel=f16,stride0=1,stride1=5,padding0=5,padding1=5,dilation0=2,dilation1=4,cwhn=0): FAIL
[CONV_2D] NMSE = 0.000000115 > 0.000000100   CONV_2D(ne_input=[1,1,1,2],ne_kernel=[2,1,1,12],type_kernel=f16,stride0=1,stride1=5,padding0=5,padding1=5,dilation0=2,dilation1=4,cwhn=0): FAIL
[CONV_2D] NMSE = 0.000000243 > 0.000000100   CONV_2D(ne_input=[1,1,25,2],ne_kernel=[1,2,25,1],type_kernel=f16,stride0=1,stride1=5,padding0=5,padding1=5,dilation0=2,dilation1=4,cwhn=0): FAIL

...

  8129/8132 tests passed
  Backend Vulkan0: FAIL
Backend 2/2: CPU
  Skipping CPU backend
1/2 backends passed
FAIL

and here in CI:

 ggml_vulkan: Found 1 Vulkan devices:
 ggml_vulkan: 0 = llvmpipe (LLVM 15.0.7, 256 bits) (llvmpipe) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 8 | shared memory: 32768 | int dot: 0 | matrix cores: none
 ggml_vulkan: Warning: Device type is CPU. This is probably not the device you want.
 Testing 2 devices
 
 Backend 1/2: Vulkan0
   Device description: llvmpipe (LLVM 15.0.7, 256 bits)
   Device memory: 15995 MB (15995 MB free)

[CONV_2D] NMSE = 0.000000193 > 0.000000100   CONV_2D(ne_input=[1,1,25,2],ne_kernel=[2,1,25,1],type_kernel=f16,stride0=1,stride1=5,padding0=5,padding1=5,dilation0=2,dilation1=4,cwhn=0): FAIL
[CONV_2D] NMSE = 0.000000294 > 0.000000100   CONV_2D(ne_input=[1,1,25,2],ne_kernel=[1,2,25,1],type_kernel=f16,stride0=1,stride1=5,padding0=5,padding1=5,dilation0=2,dilation1=4,cwhn=0): FAIL
[CONV_2D] NMSE = 0.000001574 > 0.000000100   CONV_2D(ne_input=[1,1,25,2],ne_kernel=[3,1,25,1],type_kernel=f16,stride0=3,stride1=5,padding0=5,padding1=5,dilation0=2,dilation1=4,cwhn=0): FAIL
[CONV_2D] NMSE = 0.000000101 > 0.000000100   CONV_2D(ne_input=[1,1,25,2],ne_kernel=[3,1,25,12],type_kernel=f16,stride0=3,stride1=5,padding0=5,padding1=5,dilation0=2,dilation1=4,cwhn=0): FAIL

   8128/8132 tests passed
   Backend Vulkan0: FAIL
 Backend 2/2: CPU
   Skipping CPU backend
 1/2 backends passed
 FAIL

@Green-Sky
Collaborator Author

The error does not seem to be deterministic, is that expected?

(Just had only 2 cases surpass the error threshold)

@0cc4m
Collaborator

0cc4m commented Jul 25, 2025

Maybe another case of RTE?

Edit: No, just tried it, that does not resolve it.

@jeffbolznv
Collaborator

The error does not seem to be deterministic, is that expected?

Yes, the test values are randomly generated.

IIUC the kernel just promotes the fp16 values to fp32; nothing is done in fp16 math, so there ought not to be any precision issues (or at least, none worse than with fp32).

@etasnadi
Contributor

The error does not seem to be deterministic, is that expected?

(Just had only 2 cases surpass the error threshold)

I did not look into the cpu code but their f16 impl might use f16 for intermediate values - that could explain the divergence.

@jeffbolznv
Collaborator

I think you're right. I see this:

                        if (kernel_type == GGML_TYPE_F32) {
                            *(float *) element_ptr = src_val;
                        } else if (kernel_type == GGML_TYPE_F16) {
                            *(ggml_fp16_t *) element_ptr = GGML_CPU_FP32_TO_FP16(src_val);
                        }

If we eventually want to accelerate these operations using tensor cores then having the sources both in fp16 is what we'll want. So I think we should change the shader to convert the source values to fp16.

@Green-Sky
Collaborator Author

Green-Sky commented Jul 26, 2025

I hacked in the cast to the kernel type, but it still errors. The error is smaller now, though.
edit: CI agrees

eg

[CONV_2D] NMSE = 0.000000130 > 0.000000100   CONV_2D(ne_input=[1,1,1,2],ne_kernel=[1,1,1,1],type_kernel=f16,stride0=1,stride1=5,padding0=5,padding1=5,dilation0=2,dilation1=4,cwhn=0): FAIL
[CONV_2D] NMSE = 0.000000105 > 0.000000100   CONV_2D(ne_input=[1,1,25,2],ne_kernel=[2,2,25,1],type_kernel=f16,stride0=1,stride1=5,padding0=5,padding1=5,dilation0=2,dilation1=4,cwhn=0): FAIL

@0cc4m
Collaborator

0cc4m commented Jul 26, 2025

That change means the shader no longer works without fp16 compute support.

If we eventually want to accelerate these operations using tensor cores then having the sources both in fp16 is what we'll want. So I think we should change the shader to convert the source values to fp16.

The usual assumption was that the CPU backend would do at least 32-bit precision, while GPU backends sacrifice precision for performance. This doesn't seem to be true here. I don't really see why better precision should cause failed tests, maybe the threshold should be increased slightly. We definitely need a 32-bit shader version just to support old devices.

@netrunnereve
Collaborator

If you make the CPU implementation use FP32 only, do the errors go away?

@etasnadi
Contributor

I hacked in the cast to the kernel type, but it still errors. The error is smaller now, though. edit: CI agrees

eg

[CONV_2D] NMSE = 0.000000130 > 0.000000100   CONV_2D(ne_input=[1,1,1,2],ne_kernel=[1,1,1,1],type_kernel=f16,stride0=1,stride1=5,padding0=5,padding1=5,dilation0=2,dilation1=4,cwhn=0): FAIL
[CONV_2D] NMSE = 0.000000105 > 0.000000100   CONV_2D(ne_input=[1,1,25,2],ne_kernel=[2,2,25,1],type_kernel=f16,stride0=1,stride1=5,padding0=5,padding1=5,dilation0=2,dilation1=4,cwhn=0): FAIL

Maybe the NMSE threshold is too strict? Fp16 has only ~3–4 significant decimal digits, and I assume the CPU and Vulkan code do not execute the same calculations in the same order. So it might be useful to define a threshold that respects the number format. E.g. the closest fp16 number to 1/3 is 0.33325195 according to Wikipedia. If the numbers only differ from the fifth significant digit onwards, the test should be accepted.

@0cc4m
Collaborator

0cc4m commented Jul 27, 2025

@ggerganov @slaren Do you have an opinion on the test threshold? I don't want to reduce backend precision just to follow the CPU implementation.

@ggerganov
Member

I guess for test_conv_2d we can set the same max NMSE as for test_mul_mat: 5e-4.

@Green-Sky Green-Sky force-pushed the vk_conv2d_fp16_knl branch from a7da6ac to d6c0382 Compare July 27, 2025 08:42
Collaborator

@0cc4m 0cc4m left a comment


LGTM

@Green-Sky Green-Sky merged commit 89d1029 into ggml-org:master Jul 27, 2025
47 checks passed