Collaborator

@CISC CISC commented Jul 18, 2025

Implemented missing BF16 CPY ops and enabled CONT op for BF16.

Tests before
  CONT(type=bf16,ne=[2,1,1,1]): not supported [CUDA0] 
  CONT(type=bf16,ne=[2,1,3,5]): not supported [CUDA0] 
  CONT(type=bf16,ne=[2,3,5,7]): not supported [CUDA0] 
[...]
  CPY(type_src=bf16,type_dst=bf16,ne=[1,2,3,4],permute_src=[0,0,0,0],permute_dst=[0,0,0,0]): OK
  CPY(type_src=bf16,type_dst=bf16,ne=[1,2,3,4],permute_src=[0,2,1,3],permute_dst=[0,0,0,0]): not supported [CUDA0] 
  CPY(type_src=bf16,type_dst=bf16,ne=[1,2,3,4],permute_src=[0,3,1,2],permute_dst=[0,2,1,3]): not supported [CUDA0] 
  CPY(type_src=bf16,type_dst=bf16,ne=[2,2,3,4],permute_src=[0,0,0,0],permute_dst=[0,0,0,0]): OK
  CPY(type_src=bf16,type_dst=bf16,ne=[2,2,3,4],permute_src=[0,2,1,3],permute_dst=[0,0,0,0]): not supported [CUDA0] 
  CPY(type_src=bf16,type_dst=bf16,ne=[2,2,3,4],permute_src=[0,3,1,2],permute_dst=[0,2,1,3]): not supported [CUDA0] 
  CPY(type_src=bf16,type_dst=bf16,ne=[3,2,3,4],permute_src=[0,0,0,0],permute_dst=[0,0,0,0]): OK
  CPY(type_src=bf16,type_dst=bf16,ne=[3,2,3,4],permute_src=[0,2,1,3],permute_dst=[0,0,0,0]): not supported [CUDA0] 
  CPY(type_src=bf16,type_dst=bf16,ne=[3,2,3,4],permute_src=[0,3,1,2],permute_dst=[0,2,1,3]): not supported [CUDA0] 
[...]
  CPY(type_src=f16,type_dst=bf16,ne=[256,4,4,4],permute_src=[0,0,0,0],permute_dst=[0,0,0,0]): not supported [CUDA0] 
  CPY(type_src=f16,type_dst=bf16,ne=[256,2,3,4],permute_src=[0,2,1,3],permute_dst=[0,0,0,0]): not supported [CUDA0] 
[...]
  CPY(type_src=bf16,type_dst=f32,ne=[256,4,4,4],permute_src=[0,0,0,0],permute_dst=[0,0,0,0]): not supported [CUDA0] 
  CPY(type_src=bf16,type_dst=f32,ne=[256,2,3,4],permute_src=[0,2,1,3],permute_dst=[0,0,0,0]): not supported [CUDA0] 
  CPY(type_src=bf16,type_dst=f16,ne=[256,4,4,4],permute_src=[0,0,0,0],permute_dst=[0,0,0,0]): not supported [CUDA0] 
  CPY(type_src=bf16,type_dst=f16,ne=[256,2,3,4],permute_src=[0,2,1,3],permute_dst=[0,0,0,0]): not supported [CUDA0] 
  CPY(type_src=bf16,type_dst=bf16,ne=[256,4,4,4],permute_src=[0,0,0,0],permute_dst=[0,0,0,0]): OK
  CPY(type_src=bf16,type_dst=bf16,ne=[256,2,3,4],permute_src=[0,2,1,3],permute_dst=[0,0,0,0]): not supported [CUDA0] 
[...]
  CPY(type_src=bf16,type_dst=f32,ne=[256,4,4,4],permute_src=[0,0,0,0],permute_dst=[0,0,0,0]): not supported [CUDA0] 
  CPY(type_src=bf16,type_dst=f32,ne=[256,2,3,4],permute_src=[0,2,1,3],permute_dst=[0,0,0,0]): not supported [CUDA0] 
Tests after
  CONT(type=bf16,ne=[2,1,1,1]): OK
  CONT(type=bf16,ne=[2,1,3,5]): OK
  CONT(type=bf16,ne=[2,3,5,7]): OK
[...]
  CPY(type_src=bf16,type_dst=bf16,ne=[1,2,3,4],permute_src=[0,0,0,0],permute_dst=[0,0,0,0]): OK
  CPY(type_src=bf16,type_dst=bf16,ne=[1,2,3,4],permute_src=[0,2,1,3],permute_dst=[0,0,0,0]): OK
  CPY(type_src=bf16,type_dst=bf16,ne=[1,2,3,4],permute_src=[0,3,1,2],permute_dst=[0,2,1,3]): OK
  CPY(type_src=bf16,type_dst=bf16,ne=[2,2,3,4],permute_src=[0,0,0,0],permute_dst=[0,0,0,0]): OK
  CPY(type_src=bf16,type_dst=bf16,ne=[2,2,3,4],permute_src=[0,2,1,3],permute_dst=[0,0,0,0]): OK
  CPY(type_src=bf16,type_dst=bf16,ne=[2,2,3,4],permute_src=[0,3,1,2],permute_dst=[0,2,1,3]): OK
  CPY(type_src=bf16,type_dst=bf16,ne=[3,2,3,4],permute_src=[0,0,0,0],permute_dst=[0,0,0,0]): OK
  CPY(type_src=bf16,type_dst=bf16,ne=[3,2,3,4],permute_src=[0,2,1,3],permute_dst=[0,0,0,0]): OK
  CPY(type_src=bf16,type_dst=bf16,ne=[3,2,3,4],permute_src=[0,3,1,2],permute_dst=[0,2,1,3]): OK
[...]
  CPY(type_src=f16,type_dst=bf16,ne=[256,4,4,4],permute_src=[0,0,0,0],permute_dst=[0,0,0,0]): OK
  CPY(type_src=f16,type_dst=bf16,ne=[256,2,3,4],permute_src=[0,2,1,3],permute_dst=[0,0,0,0]): OK
[...]
  CPY(type_src=bf16,type_dst=f32,ne=[256,4,4,4],permute_src=[0,0,0,0],permute_dst=[0,0,0,0]): OK
  CPY(type_src=bf16,type_dst=f32,ne=[256,2,3,4],permute_src=[0,2,1,3],permute_dst=[0,0,0,0]): OK
  CPY(type_src=bf16,type_dst=f16,ne=[256,4,4,4],permute_src=[0,0,0,0],permute_dst=[0,0,0,0]): OK
  CPY(type_src=bf16,type_dst=f16,ne=[256,2,3,4],permute_src=[0,2,1,3],permute_dst=[0,0,0,0]): OK
  CPY(type_src=bf16,type_dst=bf16,ne=[256,4,4,4],permute_src=[0,0,0,0],permute_dst=[0,0,0,0]): OK
  CPY(type_src=bf16,type_dst=bf16,ne=[256,2,3,4],permute_src=[0,2,1,3],permute_dst=[0,0,0,0]): OK
[...]
  CPY(type_src=bf16,type_dst=f32,ne=[256,4,4,4],permute_src=[0,0,0,0],permute_dst=[0,0,0,0]): OK
  CPY(type_src=bf16,type_dst=f32,ne=[256,2,3,4],permute_src=[0,2,1,3],permute_dst=[0,0,0,0]): OK

Also fixed a copy-and-paste error in the F16->F16 case of ggml_cuda_cpy_fn and deduplicated all copy functions.
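For readers less familiar with these ops, the permuted CPY/CONT cases in the test output amount to a strided gather into a densely packed destination. The following is a minimal host-side C++ sketch of that idea, not the CUDA kernel from this PR: `view4d` and `make_contiguous` are hypothetical names, and strides are counted in elements here, whereas ggml tensors carry byte strides in `nb`.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical 4-D view (illustration only, not a ggml type):
// ne = number of elements per dimension,
// ns = stride per dimension, in elements (ggml stores byte strides in nb).
struct view4d {
    int64_t ne[4];
    int64_t ns[4];
};

// Walk the view in logical order (dimension 0 fastest) and pack the
// elements densely, which is what a CONT of a permuted tensor produces.
template <typename T>
static std::vector<T> make_contiguous(const T * src, const view4d & v) {
    std::vector<T> dst;
    dst.reserve(static_cast<size_t>(v.ne[0] * v.ne[1] * v.ne[2] * v.ne[3]));
    for (int64_t i3 = 0; i3 < v.ne[3]; ++i3) {
        for (int64_t i2 = 0; i2 < v.ne[2]; ++i2) {
            for (int64_t i1 = 0; i1 < v.ne[1]; ++i1) {
                for (int64_t i0 = 0; i0 < v.ne[0]; ++i0) {
                    const int64_t off = i0 * v.ns[0] + i1 * v.ns[1]
                                      + i2 * v.ns[2] + i3 * v.ns[3];
                    dst.push_back(src[off]);
                }
            }
        }
    }
    return dst;
}
```

With a transposed view such as `ne = {2,3,1,1}`, `ns = {3,1,6,6}` over a 6-element buffer, the source is read with a non-unit stride in dimension 0 and written out densely; the element type `T` plays no role in the indexing, which is why one templated routine can cover every type pair once the per-element conversion is factored out.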

@CISC CISC requested a review from JohannesGaessler July 18, 2025 20:55
@github-actions github-actions bot added the "Nvidia GPU" (Issues specific to Nvidia GPUs) and "ggml" (changes relating to the ggml tensor library for machine learning) labels Jul 18, 2025
Collaborator

@JohannesGaessler JohannesGaessler left a comment

Generally speaking I am not a fan of how the float conversions are being done currently. I think the code could be deduplicated significantly by unconditionally casting half, nv_bfloat16, and float to float and then simply using that float value to set the destination. I would appreciate it if you were to do this in this PR, otherwise I'll keep it as one of the tasks to hand out when people ask me for a good first issue to work on.
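The suggestion could look roughly like the sketch below, written as host-side C++ so it runs without a GPU. It is an illustration under stated assumptions, not the code merged in this PR: `bf16_t`, `to_float`, `set_from_float`, and `convert_one` are hypothetical names, `bf16_t` stands in for `nv_bfloat16`, and `half` is left out because standard C++ has no such type. In device code the helpers would be `__device__` functions using the CUDA conversion intrinsics.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for nv_bfloat16: the upper 16 bits of an IEEE-754 float.
struct bf16_t { uint16_t bits; };

// Every source type gets one overload that widens it to float.
static float to_float(float v) { return v; }
static float to_float(bf16_t v) {
    const uint32_t u = static_cast<uint32_t>(v.bits) << 16;
    float f;
    std::memcpy(&f, &u, sizeof f);
    return f;
}

// Every destination type gets one overload that is set from a float.
static void set_from_float(float & dst, float v) { dst = v; }
static void set_from_float(bf16_t & dst, float v) {
    uint32_t u;
    std::memcpy(&u, &v, sizeof u);
    u += 0x7FFFu + ((u >> 16) & 1u);  // round to nearest even; NaN handling omitted
    dst.bits = static_cast<uint16_t>(u >> 16);
}

// A single generic element copy replaces one hand-written function per type pair.
template <typename src_t, typename dst_t>
static void convert_one(const src_t & src, dst_t & dst) {
    set_from_float(dst, to_float(src));
}
```

Adding a type then costs two small overloads rather than one function per source/destination combination. The trade-off is that same-type copies also round-trip through float, which is lossless for these types but is extra work a dedicated path could skip.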

@CISC CISC requested a review from JohannesGaessler July 21, 2025 14:49
@CISC CISC requested a review from JohannesGaessler July 21, 2025 16:02
@CISC CISC requested a review from JohannesGaessler July 21, 2025 21:07
@CISC CISC merged commit e28c0b8 into master Jul 22, 2025
47 checks passed
@CISC CISC deleted the cisc/cuda-bf16-cpy-cont branch July 22, 2025 10:33
gabe-l-hart added a commit to gabe-l-hart/llama.cpp that referenced this pull request Jul 23, 2025
* origin/master: (49 commits)
ci : correct label refactor->refactoring (ggml-org#14832)
CUDA: fix quantized KV cache + multiple sequences (ggml-org#14822)
tests : add non-cont K,V FA tests
memory : handle saving/loading null layers in recurrent memory (ggml-org#14675)
ggml: fix loongarch quantize_row_q8_1 error (ggml-org#14827)
CANN: weight format to NZ for Ascend310P3 (ggml-org#14407)
CUDA: add fused rms norm (ggml-org#14800)
ggml : model card yaml tab->2xspace (ggml-org#14819)
vulkan: fix rms_norm_mul to handle broadcasting dim0 (ggml-org#14817)
llama : add model type detection for rwkv7 7B&14B (ggml-org#14816)
imatrix: add option to display importance score statistics for a given imatrix file (ggml-org#12718)
Mtmd: add a way to select device for vision encoder (ggml-org#14236)
cuda : implement bf16 cpy ops and enable bf16 cont (ggml-org#14763)
opencl: remove unreachable `return` (ggml-org#14806)
server : allow setting `--reverse-prompt` arg (ggml-org#14799)
cuda: remove linking to cublasLt (ggml-org#14790)
opencl: fix `im2col` when `KW!=KH` (ggml-org#14803)
opencl: add conv2d kernel (ggml-org#14403)
sycl: Fix im2col (ggml-org#14797)
kleidiai: add support for get_rows (ggml-org#14676)
...
taronaeo pushed a commit to taronaeo/llama.cpp-s390x that referenced this pull request Jul 25, 2025
* implement bf16 cpy ops and enable bf16 cont

* deduplicate copy functions

* deduplicate checks