Batchnorm (BN) is very popular in CV: almost every conv op is followed by a BN. I see that the layernorm implementation in Triton achieves the best HBM bandwidth, so I'm curious about implementing batchnorm in Triton.
My questions:
- Is there any chance that a Triton implementation of BN would be faster than PyTorch's?
- Could you give more details on the Triton layernorm implementation, in particular why it achieves such high bandwidth?
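
To make the question concrete, here is a minimal pure-Python sketch of the forward computation I have in mind. It is only a reference for the math, not a Triton kernel: it assumes training-mode statistics (biased variance), leaves out the affine scale and shift, and uses a small `[N][C][L]` nested list in place of a real tensor. The name `batchnorm_reference` is just illustrative. The point is that each channel's statistics are reduced over the batch and spatial positions, which do not form one contiguous block in an NCHW layout, whereas layernorm reduces over the contiguous last dimension of each row.

```python
import math


def batchnorm_reference(x, eps=1e-5):
    """Normalize a nested list shaped [N][C][L] per channel.

    Statistics for channel `ch` are taken over every sample (N) and every
    position (L), which is the reduction a batchnorm kernel has to perform.
    """
    n, c, l = len(x), len(x[0]), len(x[0][0])
    count = n * l
    out = [[[0.0] * l for _ in range(c)] for _ in range(n)]
    for ch in range(c):
        # Gather every element that belongs to this channel.
        values = [x[i][ch][j] for i in range(n) for j in range(l)]
        mean = sum(values) / count
        var = sum((v - mean) ** 2 for v in values) / count  # biased variance
        inv_std = 1.0 / math.sqrt(var + eps)
        for i in range(n):
            for j in range(l):
                out[i][ch][j] = (x[i][ch][j] - mean) * inv_std
    return out


# Two samples, two channels, two positions per channel.
x = [[[1.0, 2.0], [10.0, 20.0]],
     [[3.0, 4.0], [30.0, 40.0]]]
y = batchnorm_reference(x)
```

After normalization each channel of `y` should have a mean close to zero and a variance close to one, regardless of the very different scales of the two input channels.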