Skip to content

ProIg-Chaa/cuda-oplib

Repository files navigation

cuda-oplib

Personal CUDA operator learning lab with teaching-oriented kernels, lightweight engineering, and public dev notes.

CUDA C++ CI License

Quick Links

中文

项目定位

cuda-oplib 是我的个人 CUDA 算子开发学习项目。

它的核心目标不是只做出几个能跑的 kernel,而是把 CUDA 算子从 demo -> benchmark -> 修 bug -> 正式接入 的全过程尽量保留下来,逐步整理成一个同时具备学习价值和轻度工程化结构的个人项目。

这个仓库会持续朝四个方向推进:

  • 学习化:系统练习 CUDA 算子开发
  • 笔记化:记录版本演进、实验结果和设计取舍
  • 工程化:把算子逐步接入统一 API、实现、测试、benchmark、example
  • 输出化:作为我公开分享 CUDA 学习过程和实践结果的载体

[!NOTE] 这不是一个只追求最终性能数字的黑盒仓库。我更在意把开发过程、思考路径、踩坑记录和教学级代码一起留下来。

当前内容

Operator Snapshot

Operator Status What It Is Entry
vector_add Integrated 最小 float32 元素加法算子,用来验证项目骨架 src/kernel/vector_add.cu
layernorm_half Integrated 基于 half2 路径的 half 精度 LayerNorm,统计使用 float 累加 src/kernel/layernorm_half2.cu
rmsnorm_half Integrated 基于 half2 路径的 half 精度 RMSNorm,平方和统计使用 float 累加 src/kernel/rmsnorm_half2.cu

LayerNorm Overview

layernorm_half 目前已经具备完整的正式接入链路:

这个算子目前重点体现的是:

  • 行级 LayerNorm 的 CUDA 实现
  • Welford 风格的均值/方差统计
  • half 存储、float 累加
  • 对齐时走 half2 向量化路径
  • odd tail 或未对齐时自动退回标量处理

除了正式接入链路,LayerNorm 现在还有一套开发级实验层,用来统一做原型验证、版本对比和教学记录:

这套实验层当前支持把 torch_officialtorch_pythonwarpreductionwelfordhalf2 注册进统一对比流程,在同一组 case 下跑 correctness 和 benchmark。

RMSNorm Overview

rmsnorm_half 现在也已经具备完整的正式接入链路:

这个算子目前重点体现的是:

  • 行级 RMSNorm 的 CUDA 实现
  • half 存储、float 累加平方和
  • 对齐时走 half2 向量化路径
  • odd tail 或未对齐时自动退回标量处理
  • 从开发级实验层晋升到工程级正式算子的完整接入过程

RMSNorm 也已经有对应的开发级实验层:

这套实验层当前支持把 torch_officialtorch_pythonf32_warphalf2_warp 注册进统一对比流程,在同一组 case 下跑 correctness 和 benchmark。

LayerNorm Benchmark Snapshot

下表是当前阶段具有代表性的实验结果,用来展示内核演进方向,而不是作为最终性能结论。

Variant Shape Dtype Avg Latency
torch.nn.LayerNorm 512 x 768 float32 0.026 ms
warp kernel 512 x 768 float32 0.014 ms
reduction kernel 512 x 768 float32 0.037 ms
welford kernel 512 x 768 float32 0.015 ms

更完整的开发过程、版本差异和实验背景记录在 devLog.md

Engineering Benchmark Snapshot

下面这组数据来自工程级正式 benchmark,可直接反映当前项目内正式算子的吞吐水平。

Operator Shape Iters Avg Latency Approx Throughput
layernorm_half 4096 x 768 200 0.208 ms 90.735 GB/s
rmsnorm_half 4096 x 768 200 0.114 ms 165.056 GB/s

这组结果说明,在当前正式实现下,rmsnorm_half 明显快于 layernorm_half。这符合两个算子的计算结构差异,因为 RMSNorm 少了均值中心化和 beta 仿射路径。

仓库结构

cuda-oplib
├── include/                public APIs
├── src/kernel/             formal CUDA operator implementations
├── src/pydemo/             older experiments and prototype-stage scripts
├── python/experiments/     development-grade experiment modules and compare drivers
├── tests/                  correctness tests
├── benchmarks/             performance benchmarks
├── examples/               minimal usage examples
├── bindings/               future framework bindings
└── docs/                   notes, architecture, and planning

这也是这个仓库想强调的一个方向:先在开发级实验层里快速验证,再把稳定版本推进到工程级正式层。

development-grade experiments
-> registry / compare / report
-> stable kernel selection
-> formal operator integration
-> test / benchmark / example

其中:

  • python/experiments/ 负责开发级实验、统一 case、统一注册和统一报告输出
  • src/kernel/ + include/ + tests/ + benchmarks/ + examples/ 负责工程级正式算子能力

开发风格

这个项目尽量保持“教学可读 + 工程可落地”的平衡。

我的推进方式通常是:

  1. 先实现一个容易解释的版本
  2. 再逐步做性能优化
  3. 在每一轮优化中记录为什么这么改
  4. 最后把稳定路径接进正式项目结构

因此你会在仓库里看到:

  • baseline 和优化版本并存
  • 开发级统一实验入口
  • bug 修复的上下文
  • 从实验原型走向正式接入的完整痕迹

Build

要求:

  • CUDA Toolkit 12.x 或更新
  • CMake 3.24 或更新
  • 支持 C++17 的主机编译器
./scripts/build.sh

运行测试:

./scripts/run_tests.sh

或者直接使用 CMake:

cmake -S . -B build
cmake --build build -j
ctest --test-dir build

开发级实验层当前的统一入口示例:

python3 python/experiments/layernorm/compare.py --case main_fp32
python3 python/experiments/layernorm/compare.py --case main_fp16 --markdown
python3 python/experiments/RMSnorm/compare.py --case main_fp32
python3 python/experiments/RMSnorm/compare.py --case main_fp16 --markdown

Roadmap

  • vector_add scaffold
  • layernorm_half half2 operator integration
  • LayerNorm correctness test
  • LayerNorm benchmark
  • LayerNorm example
  • LayerNorm development-grade experiment framework
  • rmsnorm_half half2 operator integration
  • RMSNorm correctness test
  • RMSNorm benchmark
  • RMSNorm example
  • RMSNorm development-grade experiment framework
  • float LayerNorm path
  • PyTorch binding
  • Softmax
  • More public dev notes and teaching-grade kernels

License

Apache-2.0

English

Project Purpose

cuda-oplib is my personal CUDA operator learning project.

The goal is not just to produce a few working kernels. I want this repository to preserve the full path from demo -> benchmark -> bug fixing -> formal integration, and gradually shape that process into a project that is both educational and lightly engineered.

This repository is intentionally built around four parallel goals:

  • Learning-oriented: systematic CUDA operator practice
  • Notebook-oriented: version history, experiment results, and design tradeoffs
  • Lightly engineered: operators wired into API, implementation, tests, benchmarks, and examples
  • Public-facing: a place to share what I am learning and building

[!NOTE] This is not meant to be a black-box repository that only shows final performance numbers. The development process, reasoning, bugs, and teaching-oriented code are part of the product.

Current Status

Operator Snapshot

Operator Status What It Is Entry
vector_add Integrated Minimal float32 add operator used to validate the project scaffold src/kernel/vector_add.cu
layernorm_half Integrated Half-precision LayerNorm centered around a half2 execution path with float accumulation src/kernel/layernorm_half2.cu
rmsnorm_half Integrated Half-precision RMSNorm centered around a half2 execution path with float accumulation src/kernel/rmsnorm_half2.cu

LayerNorm Overview

The current layernorm_half path already includes:

At a high level, it currently emphasizes:

  • row-wise LayerNorm in CUDA
  • Welford-style mean/variance reduction
  • half storage with float accumulation
  • half2 vectorized execution when alignment permits
  • scalar fallback for odd tails or unaligned cases

Besides the formal operator path, LayerNorm now also has a development-grade experiment layer used for prototype comparison, version tracking, and teaching-oriented benchmarking:

This experiment layer currently supports registering torch_official, torch_python, warp, reduction, welford, and half2, then comparing them under unified cases for correctness and latency.

RMSNorm Overview

The current rmsnorm_half path already includes:

At a high level, it currently emphasizes:

  • row-wise RMSNorm in CUDA
  • half storage with float accumulation for the sum of squares
  • half2 vectorized execution when alignment permits
  • scalar fallback for odd tails or unaligned cases
  • a complete promotion path from development-grade experiments into the engineering-grade operator layer

RMSNorm also now has a development-grade experiment layer used for prototype comparison, version tracking, and teaching-oriented benchmarking:

This experiment layer currently supports registering torch_official, torch_python, f32_warp, and half2_warp, then comparing them under unified cases for correctness and latency.

LayerNorm Benchmark Snapshot

The table below shows representative prototype-stage results. It is intended as a development snapshot, not a final performance claim.

Variant Shape Dtype Avg Latency
torch.nn.LayerNorm 512 x 768 float32 0.026 ms
warp kernel 512 x 768 float32 0.014 ms
reduction kernel 512 x 768 float32 0.037 ms
welford kernel 512 x 768 float32 0.015 ms

Engineering Benchmark Snapshot

The table below comes from the project-native engineering benchmark targets and reflects the current formal operator implementations.

Operator Shape Iters Avg Latency Approx Throughput
layernorm_half 4096 x 768 200 0.208 ms 90.735 GB/s
rmsnorm_half 4096 x 768 200 0.114 ms 165.056 GB/s

At the moment, rmsnorm_half is clearly faster than layernorm_half in the formal benchmark path. That matches the structural difference between the operators, since RMSNorm avoids mean-centering and the beta affine path.

For the full iteration history, notes, and debugging context, see devLog.md.

Repository Layout

cuda-oplib
├── include/                public APIs
├── src/kernel/             formal CUDA operator implementations
├── src/pydemo/             older experiments and prototype-stage scripts
├── python/experiments/     development-grade experiment modules and compare drivers
├── tests/                  correctness tests
├── benchmarks/             performance benchmarks
├── examples/               minimal usage examples
├── bindings/               future framework bindings
└── docs/                   notes, architecture, and planning

One of the main ideas behind this repository is to keep the path visible: validate quickly in the development-grade experiment layer, then promote stable kernels into the formal project layer.

development-grade experiments
-> registry / compare / report
-> stable kernel selection
-> formal operator integration
-> test / benchmark / example

In practice:

  • python/experiments/ handles fast comparison, shared cases, and unified experiment reports
  • src/kernel/ + include/ + tests/ + benchmarks/ + examples/ handle formal project integration

Development Style

This project tries to balance teaching readability with practical engineering.

My usual workflow is:

  1. build a version that is easy to explain
  2. optimize it step by step
  3. document why each optimization exists
  4. promote the stable path into the formal project structure

That is why the repository intentionally keeps:

  • baseline and optimized versions
  • a development-grade unified experiment entrypoint
  • bug-fix context
  • traces of how a prototype evolves into an operator

Build

Requirements:

  • CUDA Toolkit 12.x or newer
  • CMake 3.24 or newer
  • A C++17-capable host compiler
./scripts/build.sh

Run tests:

./scripts/run_tests.sh

Or directly with CMake:

cmake -S . -B build
cmake --build build -j
ctest --test-dir build

Current development-grade experiment entrypoints include:

python3 python/experiments/layernorm/compare.py --case main_fp32
python3 python/experiments/layernorm/compare.py --case main_fp16 --markdown
python3 python/experiments/RMSnorm/compare.py --case main_fp32
python3 python/experiments/RMSnorm/compare.py --case main_fp16 --markdown

Roadmap

  • vector_add scaffold
  • layernorm_half half2 operator integration
  • LayerNorm correctness test
  • LayerNorm benchmark
  • LayerNorm example
  • LayerNorm development-grade experiment framework
  • rmsnorm_half half2 operator integration
  • RMSNorm correctness test
  • RMSNorm benchmark
  • RMSNorm example
  • RMSNorm development-grade experiment framework
  • float LayerNorm path
  • PyTorch binding
  • Softmax
  • More public dev notes and teaching-grade kernels

License

Apache-2.0

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors