Personal CUDA operator learning lab with teaching-oriented kernels, lightweight engineering, and public dev notes.
- English
- Dev Log
- LayerNorm API
- LayerNorm Kernel
- LayerNorm Test
- LayerNorm Benchmark
- LayerNorm Example
- LayerNorm Experiment Driver
- RMSNorm API
- RMSNorm Kernel
- RMSNorm Test
- RMSNorm Benchmark
- RMSNorm Example
- RMSNorm Experiment Driver
- CI Workflow
English
cuda-oplib is my personal CUDA operator learning project.
The goal is not just to produce a few working kernels. I want this repository to preserve the full path from demo -> benchmark -> bug fixing -> formal integration, and gradually shape that process into a project that is both educational and lightly engineered.
This repository is intentionally built around four parallel goals:
- Learning-oriented: systematic CUDA operator practice
- Notebook-oriented: version history, experiment results, and design tradeoffs
- Lightly engineered: operators wired into API, implementation, tests, benchmarks, and examples
- Public-facing: a place to share what I am learning and building
> [!NOTE]
> This is not meant to be a black-box repository that only shows final performance numbers. The development process, reasoning, bugs, and teaching-oriented code are part of the product.
| Operator | Status | What It Is | Entry |
|---|---|---|---|
| `vector_add` | Integrated | Minimal float32 add operator used to validate the project scaffold | `src/kernel/vector_add.cu` |
| `layernorm_half` | Integrated | Half-precision LayerNorm centered around a half2 execution path with float accumulation | `src/kernel/layernorm_half2.cu` |
| `rmsnorm_half` | Integrated | Half-precision RMSNorm centered around a half2 execution path with float accumulation | `src/kernel/rmsnorm_half2.cu` |
The current layernorm_half path already includes:
- API: `include/cuda_oplib/layernorm.h`
- Kernel: `src/kernel/layernorm_half2.cu`
- Test: `tests/cpp/test_layernorm.cu`
- Benchmark: `benchmarks/bench_layernorm.cu`
- Example: `examples/cpp/layernorm_example.cu`
- Dev Log: `devLog.md`
At a high level, it currently emphasizes:
- row-wise LayerNorm in CUDA
- Welford-style mean/variance reduction
- half storage with float accumulation
- `half2` vectorized execution when alignment permits
- scalar fallback for odd tails or unaligned cases
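To make the Welford-style statistics above concrete, here is a minimal CPU reference in plain Python. It is only a sketch of the math for one row, not the kernel in `src/kernel/layernorm_half2.cu`; the function names, the `eps` default, and the use of population variance are assumptions made for illustration.

```python
import math


def welford_mean_var(row):
    """Single-pass mean and population variance using the Welford update."""
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for count, x in enumerate(row, start=1):
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    return mean, m2 / len(row)


def layernorm_row(row, gamma, beta, eps=1e-5):
    """Normalize one row, then apply the per-column scale (gamma) and shift (beta)."""
    mean, var = welford_mean_var(row)
    inv_std = 1.0 / math.sqrt(var + eps)
    return [(x - mean) * inv_std * g + b for x, g, b in zip(row, gamma, beta)]


row = [1.0, 2.0, 3.0, 4.0]
mean, var = welford_mean_var(row)
out = layernorm_row(row, gamma=[1.0] * 4, beta=[0.0] * 4)
print(mean, var, out)
```

A reference like this is handy for checking a kernel's output on small rows, because the single-pass update avoids the cancellation that a naive "sum of squares minus square of sum" formula can suffer from.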
Besides the formal operator path, LayerNorm now also has a development-grade experiment layer used for prototype comparison, version tracking, and teaching-oriented benchmarking:
- Experiment Driver: `python/experiments/layernorm/compare.py`
- Registry: `python/experiments/layernorm/registry.py`
- Cases: `python/experiments/layernorm/cases.py`
- Reports: `python/experiments/layernorm/report.py`
- References: `python/experiments/layernorm/refs.py`
This experiment layer currently supports registering torch_official, torch_python, warp, reduction, welford, and half2, then comparing them under unified cases for correctness and latency.
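The register-then-compare idea can be sketched in a few lines of plain Python. Everything below is hypothetical: `REGISTRY`, `register`, `compare`, and the two toy variants are illustration names, not the contents of `registry.py` or `compare.py`, and the toy operation is mean-centering rather than a real LayerNorm kernel.

```python
import time

REGISTRY = {}  # hypothetical registry: variant name -> callable taking one row


def register(name):
    """Decorator that records a variant under a name."""
    def decorator(fn):
        REGISTRY[name] = fn
        return fn
    return decorator


@register("two_pass")
def center_two_pass(row):
    mean = sum(row) / len(row)
    return [x - mean for x in row]


@register("running_mean")
def center_running_mean(row):
    mean = 0.0
    for count, x in enumerate(row, start=1):
        mean += (x - mean) / count
    return [x - mean for x in row]


def compare(rows, reference="two_pass", iters=10, atol=1e-9):
    """Run every registered variant on the same rows; report error and latency."""
    expected = [REGISTRY[reference](row) for row in rows]
    report = []
    for name, fn in REGISTRY.items():
        start = time.perf_counter()
        for _ in range(iters):
            got = [fn(row) for row in rows]
        avg_ms = (time.perf_counter() - start) * 1e3 / iters
        max_err = max(
            abs(g - e)
            for got_row, exp_row in zip(got, expected)
            for g, e in zip(got_row, exp_row)
        )
        report.append({"variant": name, "max_abs_err": max_err,
                       "ok": max_err <= atol, "avg_ms": avg_ms})
    return report


case_rows = [[1.0, 2.0, 3.0, 4.0], [0.5, -0.5, 2.0, 8.0]]
results = compare(case_rows)
for entry in results:
    print(entry)
```

Keeping correctness and timing in one loop means every variant sees exactly the same inputs, which is the property a unified compare flow is meant to guarantee.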
The current rmsnorm_half path already includes:
- API: `include/cuda_oplib/rmsnorm.h`
- Kernel: `src/kernel/rmsnorm_half2.cu`
- Test: `tests/cpp/test_rmsnorm.cu`
- Benchmark: `benchmarks/bench_rmsnorm.cu`
- Example: `examples/cpp/rmsnorm_example.cu`
- Dev Log: `devLog.md`
At a high level, it currently emphasizes:
- row-wise RMSNorm in CUDA
- half storage with float accumulation for the sum of squares
- `half2` vectorized execution when alignment permits
- scalar fallback for odd tails or unaligned cases
- a complete promotion path from development-grade experiments into the engineering-grade operator layer
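The pair-wise path with a scalar tail can be illustrated on the CPU. The sketch below is an assumption-laden model, not the code in `src/kernel/rmsnorm_half2.cu`: `to_half` simulates half storage with the `struct` module's `e` format, the accumulator is a Python float (a double, so wider than the kernel's float), pointer alignment is not modeled, and `eps` is an illustrative default.

```python
import math
import struct


def to_half(x):
    """Round a value to IEEE half precision and back (simulates half storage)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def rmsnorm_row(row, gamma, eps=1e-6):
    """RMSNorm for one row: pair-wise main loop plus a scalar tail."""
    n = len(row)
    paired = n - (n % 2)
    sum_sq = 0.0  # accumulator kept wider than the half-precision inputs
    # Main loop: two elements per step, mirroring one half2 load.
    for i in range(0, paired, 2):
        a, b = row[i], row[i + 1]
        sum_sq += a * a + b * b
    # Scalar tail: handles the last element when the row length is odd.
    for i in range(paired, n):
        sum_sq += row[i] * row[i]
    inv_rms = 1.0 / math.sqrt(sum_sq / n + eps)
    # Results are rounded back to half, as they would be when stored.
    return [to_half(x * inv_rms * g) for x, g in zip(row, gamma)]


row = [to_half(v) for v in (3.0, 4.0, 12.0)]  # odd length exercises the tail
out = rmsnorm_row(row, gamma=[1.0, 1.0, 1.0])
print(out)
```

After normalization with unit `gamma`, the mean of squares of the output should be close to 1, up to half-precision rounding, which makes a quick sanity check for any variant.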
RMSNorm also now has a development-grade experiment layer used for prototype comparison, version tracking, and teaching-oriented benchmarking:
- Experiment Driver: `python/experiments/RMSnorm/compare.py`
- Registry: `python/experiments/RMSnorm/registry.py`
- Cases: `python/experiments/RMSnorm/cases.py`
- Reports: `python/experiments/RMSnorm/report.py`
- References: `python/experiments/RMSnorm/refs.py`
This experiment layer currently supports registering torch_official, torch_python, f32_warp, and half2_warp, then comparing them under unified cases for correctness and latency.
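As a rough picture of how shared cases and a Markdown report might fit together, consider the sketch below. It is hypothetical throughout: `CASES` and `render_markdown` are illustration names, and only the case names `main_fp32` and `main_fp16` come from this README; the shapes and dtypes attached to them are placeholders, not the real contents of `cases.py` or `report.py`.

```python
# Hypothetical case table. Only the case names come from the README commands;
# the shapes and dtypes here are placeholders for illustration.
CASES = {
    "main_fp32": {"rows": 512, "cols": 768, "dtype": "float32"},
    "main_fp16": {"rows": 512, "cols": 768, "dtype": "float16"},
}


def render_markdown(case_name, results):
    """Render (variant, average latency in ms) pairs as a Markdown table."""
    case = CASES[case_name]
    shape = f"{case['rows']} x {case['cols']}"
    lines = [
        "| Variant | Shape | Dtype | Avg Latency |",
        "|---|---|---|---|",
    ]
    for variant, avg_ms in results:
        lines.append(f"| {variant} | {shape} | {case['dtype']} | {avg_ms:.3f} ms |")
    return "\n".join(lines)


# Latencies copied from the LayerNorm prototype-stage table in this README.
table = render_markdown("main_fp32", [("warp kernel", 0.014), ("welford kernel", 0.015)])
print(table)
```

Separating the case definition from the rendering step keeps every variant's row in the same format, so tables from different runs can be pasted side by side.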
The table below shows representative prototype-stage results. It is intended as a development snapshot, not a final performance claim.
| Variant | Shape | Dtype | Avg Latency |
|---|---|---|---|
| `torch.nn.LayerNorm` | 512 x 768 | float32 | 0.026 ms |
| warp kernel | 512 x 768 | float32 | 0.014 ms |
| reduction kernel | 512 x 768 | float32 | 0.037 ms |
| welford kernel | 512 x 768 | float32 | 0.015 ms |
The table below comes from the project-native engineering benchmark targets and reflects the current formal operator implementations.
| Operator | Shape | Iters | Avg Latency | Approx Throughput |
|---|---|---|---|---|
| `layernorm_half` | 4096 x 768 | 200 | 0.208 ms | 90.735 GB/s |
| `rmsnorm_half` | 4096 x 768 | 200 | 0.114 ms | 165.056 GB/s |
At the moment, rmsnorm_half is roughly 1.8x faster than layernorm_half in the formal benchmark path (0.114 ms vs 0.208 ms at 4096 x 768). That is consistent with the structural difference between the operators, since RMSNorm skips mean-centering and the beta affine path.
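For readers who want to relate latency to an approximate throughput figure, the arithmetic is bytes moved divided by time. The helper below is a sketch, not the formula used in `benchmarks/`: the function name is hypothetical, 1 GB is taken as 1e9 bytes, and `passes=3` (three tensor-sized transfers per call) is my own assumption, chosen only because it lands close to the reported figures when the rounded latencies are plugged in.

```python
def approx_throughput_gb_s(rows, cols, bytes_per_elem, passes, latency_ms):
    """Bytes moved divided by elapsed time, in GB/s (1 GB = 1e9 bytes)."""
    total_bytes = rows * cols * bytes_per_elem * passes
    return total_bytes / (latency_ms * 1e-3) / 1e9


# half precision = 2 bytes per element; passes=3 is an assumption (see above).
layernorm = approx_throughput_gb_s(4096, 768, 2, passes=3, latency_ms=0.208)
rmsnorm = approx_throughput_gb_s(4096, 768, 2, passes=3, latency_ms=0.114)
speedup = 0.208 / 0.114

print(f"layernorm_half ~ {layernorm:.2f} GB/s")
print(f"rmsnorm_half   ~ {rmsnorm:.2f} GB/s")
print(f"latency ratio  ~ {speedup:.2f}x")
```

The small gap between these values and the table is expected, since the table's latencies are rounded to three decimals and the benchmark may count bytes differently.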
For the full iteration history, notes, and debugging context, see devLog.md.

    cuda-oplib
    ├── include/             public APIs
    ├── src/kernel/          formal CUDA operator implementations
    ├── src/pydemo/          older experiments and prototype-stage scripts
    ├── python/experiments/  development-grade experiment modules and compare drivers
    ├── tests/               correctness tests
    ├── benchmarks/          performance benchmarks
    ├── examples/            minimal usage examples
    ├── bindings/            future framework bindings
    └── docs/                notes, architecture, and planning

One of the main ideas behind this repository is to keep the path visible: validate quickly in the development-grade experiment layer, then promote stable kernels into the formal project layer.

    development-grade experiments
      -> registry / compare / report
      -> stable kernel selection
      -> formal operator integration
      -> test / benchmark / example

In practice:
- `python/experiments/` handles fast comparison, shared cases, and unified experiment reports
- `src/kernel/` + `include/` + `tests/` + `benchmarks/` + `examples/` handle formal project integration
This project tries to balance teaching readability with practical engineering.
My usual workflow is:
- build a version that is easy to explain
- optimize it step by step
- document why each optimization exists
- promote the stable path into the formal project structure
That is why the repository intentionally keeps:
- baseline and optimized versions
- a development-grade unified experiment entrypoint
- bug-fix context
- traces of how a prototype evolves into an operator
Requirements:
- CUDA Toolkit 12.x or newer
- CMake 3.24 or newer
- A C++17-capable host compiler

Build:

    ./scripts/build.sh

Run tests:

    ./scripts/run_tests.sh

Or directly with CMake:

    cmake -S . -B build
    cmake --build build -j
    ctest --test-dir build

Current development-grade experiment entrypoints include:

    python3 python/experiments/layernorm/compare.py --case main_fp32
    python3 python/experiments/layernorm/compare.py --case main_fp16 --markdown
    python3 python/experiments/RMSnorm/compare.py --case main_fp32
    python3 python/experiments/RMSnorm/compare.py --case main_fp16 --markdown

- `vector_add` scaffold
- `layernorm_half` half2 operator integration
- LayerNorm correctness test
- LayerNorm benchmark
- LayerNorm example
- LayerNorm development-grade experiment framework
- `rmsnorm_half` half2 operator integration
- RMSNorm correctness test
- RMSNorm benchmark
- RMSNorm example
- RMSNorm development-grade experiment framework
- float LayerNorm path
- PyTorch binding
- Softmax
- More public dev notes and teaching-grade kernels
Apache-2.0