
【启航计划】PaddlePaddle PHI Operator Library CUDA Kernel Standardization #75226

@YqGe585

Description

Background

In version 3.1, PaddlePaddle introduced the CUDA-like hardware integration scheme. It builds on the Custom Device hardware integration scheme, and its defining feature is the ability to reuse the large body of CUDA kernels in PaddlePaddle's PHI operator library. MetaX (metax_gpu) and Iluvatar CoreX (iluvatar_gpu) have already been integrated successfully through this scheme.

However, some CUDA kernels in the PHI operator library were not designed to be reused by other modules, which leads to the following problem: some kernels lack function declarations, so CUDA-like hardware backends have no choice but to `#include` the .cu source files directly, which violates the code style guidelines.

This activity therefore aims to standardize the CUDA kernels in the PHI operator library:

  • In the Paddle repository, add the corresponding declaration files (.h) for kernels that lack header files;
  • In the PaddleCustomDevice repository, fix the incorrect `#include *.cu` usages by `#include`-ing the correct header files instead.
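As a minimal sketch of the Paddle-side fix (the kernel name and file paths here are hypothetical, chosen only to illustrate the PHI declaration pattern), the new header exposes a declaration for a kernel whose definition previously lived only in a .cu file:

```cpp
// paddle/phi/kernels/example_kernel.h  (hypothetical path and kernel name)
#pragma once

#include "paddle/phi/core/dense_tensor.h"

namespace phi {

// Declaration that was previously missing; the definition stays in
// paddle/phi/kernels/gpu/example_kernel.cu and is unchanged.
template <typename T, typename Context>
void ExampleKernel(const Context& dev_ctx,
                   const DenseTensor& x,
                   DenseTensor* out);

}  // namespace phi
```

With this header in place, consumers include the declaration instead of the CUDA source file.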

Scope

  • Repositories involved

    1. Paddle
    2. PaddleCustomDevice
  • Affected files
    In the PaddleCustomDevice repository: all operator kernel .cu source files that are #included into registration files, 136 in total.
    The full file list is in the table below:

Tasks

Fix Goals

  1. In the PaddlePaddle repository, add header files for kernels that lack declarations.
  2. In the PaddleCustomDevice repository, replace the incorrect `#include *.cu` usages with `#include` of the newly added headers, and add the kernel implementation files to the corresponding CMakeLists.txt build lists. The code changes are limited to the backends/metax_gpu and backends/iluvatar_gpu directories.
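The two goals above can be sketched from the consumer side (illustrative only; the file names and paths are hypothetical) as the change made in a PaddleCustomDevice registration file:

```cpp
// backends/metax_gpu/kernels/example_register.cc  (hypothetical file)

// Before: including the CUDA source directly, which violates the code style:
// #include "paddle/phi/kernels/gpu/example_kernel.cu"

// After: include the newly added declaration header instead...
#include "paddle/phi/kernels/example_kernel.h"

// ...and make sure example_kernel.cu itself is added to the kernel source
// list in the backend's CMakeLists.txt, so the definition that used to be
// pulled in via the .cu include is still compiled and linked.
```

Without the CMakeLists.txt change, dropping the `.cu` include would leave the kernel definition out of the build and cause link errors.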
No. File Name Assignee / Status / PR
1 paddle/phi/kernels/fusion/gpu/distributed_fused_lamb_init_kernel.cu @Le-soleile #2004
@YqGe585
2 paddle/phi/kernels/fusion/gpu/fused_bias_act_kernel.cu @Le-soleile #75506 #2004
3 paddle/phi/kernels/fusion/gpu/fused_bias_dropout_residual_layer_norm_grad_kernel.cu @wanglezz #75601
4 paddle/phi/kernels/fusion/gpu/fused_bias_dropout_residual_layer_norm_kernel.cu @wanglezz #75625
5 paddle/phi/kernels/fusion/gpu/fused_embedding_eltwise_layernorm_kernel.cu @wanglezz #75626
6 paddle/phi/kernels/fusion/gpu/fused_layernorm_kernel.cu @WanRui37 #75532
7 paddle/phi/kernels/fusion/gpu/fused_seqpool_cvm_grad_kernel.cu @SpongeBob0318 #75531 #75536 #2007 #2008
8 paddle/phi/kernels/fusion/gpu/fused_seqpool_cvm_kernel.cu @SpongeBob0318 #75537 #2009
9 paddle/phi/kernels/fusion/gpu/fused_softmax_mask_grad_kernel.cu @SpongeBob0318 #75538 #2010
10 paddle/phi/kernels/fusion/gpu/fused_softmax_mask_kernel.cu @youge325 #75655
11 paddle/phi/kernels/fusion/gpu/fused_softmax_mask_upper_triangle_kernel.cu
12 paddle/phi/kernels/fusion/gpu/fused_stack_transpose_quant_kernel.cu @youge325 #75658 #2045
13 paddle/phi/kernels/fusion/gpu/fused_transpose_split_quant_kernel.cu @SpongeBob0318 #75539 #2011
14 paddle/phi/kernels/fusion/gpu/fused_transpose_wlch_split_quant_kernel.cu @SpongeBob0318 #75540 #2012
15 paddle/phi/kernels/fusion/gpu/fusion_group_kernel.cu @SpongeBob0318 #75541 #2013
16 paddle/phi/kernels/fusion/gpu/masked_multihead_attention_kernel.cu @Le-soleile #75706
17 paddle/phi/kernels/fusion/gpu/qkv_unpack_mha_kernel.cu @Le-soleile #75707
18 paddle/phi/kernels/fusion/gpu/skip_layernorm_kernel.cu @SpongeBob0318 #75542 #2014
19 paddle/phi/kernels/gpu/affine_channel_grad_kernel.cu @SpongeBob0318 #75543 #2015 #2025 #2029
20 paddle/phi/kernels/gpu/affine_channel_kernel.cu @SpongeBob0318 #75545 #2016
21 paddle/phi/kernels/gpu/ap_facade_kernel.cu @youge325 #75659 #2046
@Echo-Nie #75657 #2043
22 paddle/phi/kernels/gpu/ap_trivial_fusion_begin_kernel.cu @youge325 #75660
23 paddle/phi/kernels/gpu/ap_trivial_fusion_end_kernel.cu @youge325 #75661
24 paddle/phi/kernels/gpu/ap_variadic_kernel.cu @youge325 #75662
25 paddle/phi/kernels/gpu/argsort_grad_kernel.cu
26 paddle/phi/kernels/gpu/barrier_kernel.cu @youge325 #75663
27 paddle/phi/kernels/gpu/bce_loss_grad_kernel.cu @Luxorion-12
28 paddle/phi/kernels/gpu/bce_loss_kernel.cu @tjujingzong
29 paddle/phi/kernels/gpu/binomial_kernel.cu @tjujingzong
30 paddle/phi/kernels/gpu/bmm_grad_kernel.cu @tjujingzong
31 paddle/phi/kernels/gpu/bmm_kernel.cu @tjujingzong
32 paddle/phi/kernels/gpu/box_clip_kernel.cu @algorithm1832 #75592 #2021
33 paddle/phi/kernels/gpu/c_concat_kernel.cu @algorithm1832 #75648 #2052
34 paddle/phi/kernels/gpu/c_embedding_grad_kernel.cu @algorithm1832 #2036
35 paddle/phi/kernels/gpu/c_scatter_kernel.cu @algorithm1832 #75653 #2059
36 paddle/phi/kernels/gpu/c_softmax_with_cross_entropy_grad_kernel.cu @youge325 #75664
37 paddle/phi/kernels/gpu/cast_kernel.cu
38 paddle/phi/kernels/gpu/class_center_sample_kernel.cu
39 paddle/phi/kernels/gpu/collect_fpn_proposals_kernel.cu @youge325 #75665
40 paddle/phi/kernels/gpu/comm_init_all_kernel.cu @youge325 #75666
41 paddle/phi/kernels/gpu/complex_kernel.cu
42 paddle/phi/kernels/gpu/correlation_grad_kernel.cu @tjujingzong #75633 #2047
43 paddle/phi/kernels/gpu/correlation_kernel.cu @youge325 #75667
44 paddle/phi/kernels/gpu/ctc_align_kernel.cu
45 paddle/phi/kernels/gpu/cvm_grad_kernel.cu @Le-soleile #75704
46 paddle/phi/kernels/gpu/cvm_kernel.cu @Le-soleile #75703
47 paddle/phi/kernels/gpu/deformable_conv_grad_kernel.cu @SqZhang666
48 paddle/phi/kernels/gpu/deformable_conv_kernel.cu @SqZhang666
49 paddle/phi/kernels/gpu/elementwise_grad_kernel.cu
50 paddle/phi/kernels/gpu/embedding_with_scaled_gradient_grad_kernel.cu
51 paddle/phi/kernels/gpu/exponential_kernel.cu
52 paddle/phi/kernels/gpu/flip_kernel.cu
53 paddle/phi/kernels/gpu/fused_token_prune_kernel.cu @Le-soleile #75701
54 paddle/phi/kernels/gpu/gather_grad_kernel.cu
55 paddle/phi/kernels/gpu/gelu_grad_kernel.cu
56 paddle/phi/kernels/gpu/global_gather_kernel.cu @Le-soleile #75700
57 paddle/phi/kernels/gpu/global_scatter_kernel.cu @Le-soleile #75699
58 paddle/phi/kernels/gpu/group_norm_grad_kernel.cu @chenjin060204
59 paddle/phi/kernels/gpu/group_norm_kernel.cu @chenjin060204
60 paddle/phi/kernels/gpu/gru_kernel.cu @algorithm1832 #75845
61 paddle/phi/kernels/gpu/index_add_grad_kernel.cu @algorithm1832
62 paddle/phi/kernels/gpu/interpolate_grad_kernel.cu @algorithm1832
63 paddle/phi/kernels/gpu/interpolate_kernel.cu @algorithm1832
64 paddle/phi/kernels/gpu/kldiv_loss_grad_kernel.cu @algorithm1832
65 paddle/phi/kernels/gpu/kldiv_loss_kernel.cu
66 paddle/phi/kernels/gpu/l1_norm_grad_kernel.cu @Le-soleile #75647
67 paddle/phi/kernels/gpu/l1_norm_kernel.cu
68 paddle/phi/kernels/gpu/label_smooth_grad_kernel.cu
69 paddle/phi/kernels/gpu/label_smooth_kernel.cu
70 paddle/phi/kernels/gpu/lamb_kernel.cu @dh-Unicorn
71 paddle/phi/kernels/gpu/lgamma_kernel.cu @dh-Unicorn
72 paddle/phi/kernels/gpu/log_softmax_grad_kernel.cu @dh-Unicorn
73 paddle/phi/kernels/gpu/logsumexp_kernel.cu
74 paddle/phi/kernels/gpu/lookup_table_grad_kernel.cu @Le-soleile #75645
75 paddle/phi/kernels/gpu/lookup_table_kernel.cu @Le-soleile #75645
76 paddle/phi/kernels/gpu/lu_solve_kernel.cu
77 paddle/phi/kernels/gpu/margin_cross_entropy_kernel.cu
78 paddle/phi/kernels/gpu/matrix_power_grad_kernel.cu
79 paddle/phi/kernels/gpu/matrix_power_kernel.cu
80 paddle/phi/kernels/gpu/mean_all_grad_kernel.cu
81 paddle/phi/kernels/gpu/moe_unpermute_kernel.cu @Le-soleile #75644
82 paddle/phi/kernels/gpu/momentum_kernel.cu
83 paddle/phi/kernels/gpu/mp_allreduce_sum_kernel.cu
84 paddle/phi/kernels/gpu/multiclass_nms3_kernel.cu
85 paddle/phi/kernels/gpu/multiplex_grad_kernel.cu
86 paddle/phi/kernels/gpu/nonzero_kernel.cu
87 paddle/phi/kernels/gpu/pad3d_kernel.cu
88 paddle/phi/kernels/gpu/partial_allgather_kernel.cu @Le-soleile #75643
89 paddle/phi/kernels/gpu/partial_concat_grad_kernel.cu @Le-soleile #75642
90 paddle/phi/kernels/gpu/partial_concat_kernel.cu
91 paddle/phi/kernels/gpu/partial_recv_kernel.cu @Le-soleile #75641
92 paddle/phi/kernels/gpu/partial_send_kernel.cu @Le-soleile #75640
93 paddle/phi/kernels/gpu/psroi_pool_grad_kernel.cu @xxiu1
94 paddle/phi/kernels/gpu/quantize_linear_kernel.cu
95 paddle/phi/kernels/gpu/reduce_kernel.cu
96 paddle/phi/kernels/gpu/repeat_interleave_grad_kernel.cu @SqZhang666
97 paddle/phi/kernels/gpu/repeat_interleave_kernel.cu @SqZhang666
98 paddle/phi/kernels/gpu/rmsprop_kernel.cu
99 paddle/phi/kernels/gpu/roi_align_grad_kernel.cu
100 paddle/phi/kernels/gpu/roi_align_kernel.cu @Le-soleile #2005
101 paddle/phi/kernels/gpu/row_conv_grad_kernel.cu @Le-soleile #75554
102 paddle/phi/kernels/gpu/row_conv_kernel.cu @Le-soleile #75562
103 paddle/phi/kernels/gpu/seed_kernel.cu @Le-soleile #75577
104 paddle/phi/kernels/gpu/sequence_expand_kernel.cu @Le-soleile #75578
105 paddle/phi/kernels/gpu/set_value_kernel.cu @Le-soleile #2018
106 paddle/phi/kernels/gpu/shuffle_channel_grad_kernel.cu @Le-soleile #75580
107 paddle/phi/kernels/gpu/shuffle_channel_kernel.cu @Le-soleile #2020 #75608
108 paddle/phi/kernels/gpu/soft_relu_grad_kernel.cu @Le-soleile #75581
109 paddle/phi/kernels/gpu/spectral_norm_grad_kernel.cu @Le-soleile #2027
110 paddle/phi/kernels/gpu/spectral_norm_kernel.cu @Le-soleile #2028
111 paddle/phi/kernels/gpu/stack_grad_kernel.cu
112 paddle/phi/kernels/gpu/stft_grad_kernel.cu @Le-soleile #75614
113 paddle/phi/kernels/gpu/sync_batch_norm_grad_kernel.cu
114 paddle/phi/kernels/gpu/top_k_kernel.cu
115 paddle/phi/kernels/gpu/uniform_random_batch_size_like_kernel.cu @Le-soleile #75615
116 paddle/phi/kernels/gpu/weighted_sample_neighbors_kernel.cu
117 paddle/phi/kernels/gpu/yolo_box_head_kernel.cu @Le-soleile #75616
118 paddle/phi/kernels/gpu/yolo_box_post_kernel.cu @Le-soleile #75636
119 paddle/phi/kernels/kps/elementwise_kernel.cu
120 paddle/phi/kernels/legacy/gpu/cal_aux_loss_grad_kernel.cu @Le-soleile #75637
121 paddle/phi/kernels/legacy/gpu/cal_aux_loss_kernel.cu @Le-soleile #75639
122 paddle/phi/kernels/legacy/gpu/expand_modality_expert_id_kernel.cu @Le-soleile #75708
123 paddle/phi/kernels/legacy/gpu/ext_build_src_rank_and_local_expert_id_kernel.cu @Le-soleile #75709
124 paddle/phi/kernels/legacy/gpu/fp8_quant_blockwise_kernel.cu @Le-soleile #75710
125 paddle/phi/kernels/legacy/gpu/int_bincount.cu @junhaoguo809-crypto
126 paddle/phi/kernels/legacy/gpu/layer_norm_cuda_kernel.cu @junhaoguo809-crypto
127 paddle/phi/kernels/legacy/gpu/moe_combine_grad_kernel.cu @junhaoguo809-crypto
128 paddle/phi/kernels/legacy/gpu/moe_combine_kernel.cu @junhaoguo809-crypto
129 paddle/phi/kernels/legacy/gpu/moe_combine_no_weight_kernel.cu @junhaoguo809-crypto
130 paddle/phi/kernels/legacy/gpu/moe_gate_dispatch_grad_kernel.cu @junhaoguo809-crypto
131 paddle/phi/kernels/legacy/gpu/moe_gate_dispatch_kernel.cu
132 paddle/phi/kernels/legacy/gpu/moe_gate_dispatch_permute_grad_kernel.cu @Le-soleile #75711
133 paddle/phi/kernels/legacy/gpu/moe_gate_dispatch_permute_kernel.cu @Le-soleile #75713
134 paddle/phi/kernels/legacy/gpu/moe_ops_partial_nosoftmaxtopk_grad_kernel.cu @Le-soleile #75714
135 paddle/phi/kernels/legacy/gpu/moe_ops_partial_nosoftmaxtopk_kernel.cu @Le-soleile #75715
136 paddle/phi/kernels/legacy/kps/compare_kernel.cu

Example Fix & Code Submission

Please refer to #75226 (comment)

How to Claim Tasks

Please claim tasks by posting a comment, for example:

【报名】:1、3、2-3
  • Separate multiple task numbers with the Chinese enumeration comma (、); use a hyphen to claim a range of consecutive tasks, e.g. 1-2
  • PR submission format:
    • Submit separate PRs to the two repositories; submit the PaddleCustomDevice PR only after the Paddle PR has been merged
    • Both PR titles must start with 【CUDA Kernel No.xxx】, indicating the task number
    • The Paddle repository PR title must end with -part

Board Information

Direction | Total Tasks | Submitted / Claimed | Submission Rate | Completed | Completion Rate
CUDA Kernel Standardization | 136 | 71 / 96 | 52.21% | 3 | 2.21%

Statistics

In no particular order: @SpongeBob0318 (2) @Le-soleile (1)
