Description
Which component has the problem?
CuTe DSL
Bug Report
Describe the bug
Even in the newest environment (CTK 13.1, driver 590), CuTe DSL still compiles with ptxas 12.9, which leads to suboptimal code.
This matters because CTK 12.9 and 13.0 generate suboptimal SASS for MMA, while CTK 13.1 might generate better SASS (#2408 (comment)).
Is there a way for users to force compilation with a newer version of ptxas?
My guess is that ptxas is currently embedded in _cutlass_ir.cpython-312-x86_64-linux-gnu.so. Is there a way to pass the path to the system's ptxas instead?
I compiled the Blackwell GEMM example (https://github.com/NVIDIA/cutlass/blob/main/examples/python/CuTeDSL/blackwell/dense_gemm_persistent.py) with CUTE_DSL_KEEP_CUBIN=1 and looked at the SASS. The .note.nv.tkinfo section shows:
//--------------------- .note.nv.tkinfo --------------------------
.section .note.nv.tkinfo,"",@"SHT_NOTE"
.sectionflags @"SHF_NOTE_NV_TKINFO"
.tkinfo
/*0018*/ .word 0x00000081
/*0030*/ .string ""
/*0030*/ .string "ptxas"
/*0030*/ .string "Cuda compilation tools, release 12.9, V12.9.83"
/*0030*/ .string "Build system must define TOOLS_VERSION_EXTENDED"
/*0030*/ .string "-O 3 -arch sm_100a "
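For anyone who wants to check this without reading the dump by hand, here is a minimal sketch that pulls the ptxas release out of the text nvdisasm prints. It assumes the tkinfo note always carries a "Cuda compilation tools, release X.Y, VX.Y.Z" string as in the output above; the function name ptxas_release is hypothetical and not part of CuTe DSL.

```python
import re

# Matches the toolkit banner that appears in the .note.nv.tkinfo section,
# e.g. 'Cuda compilation tools, release 12.9, V12.9.83'.
_RELEASE = re.compile(r"Cuda compilation tools, release (\d+\.\d+), V([\d.]+)")


def ptxas_release(sass_text):
    """Return (release, full_version) found in nvdisasm text, or None."""
    match = _RELEASE.search(sass_text)
    if match is None:
        return None
    return match.group(1), match.group(2)


# Usage, assuming the dump was saved as in the repro steps:
#   with open("sm100_dense_gemm.sass") as f:
#       print(ptxas_release(f.read()))
```

On the dump above this should report release 12.9 rather than the 13.1 toolkit installed on the system.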
Steps/Code to reproduce bug
CUTE_DSL_KEEP_CUBIN=1 python dense_gemm_persistent.py --mma_tiler_mn 256,256 --cluster_shape_mn 2,1 --mnkl 8192,8192,8192,1 --use_tma_store --use_2cta_instrs --benchmark --warmup_iterations=1 --iterations=30 --ab_dtype BFloat16
nvdisasm cutlass_bmm___main__PersistentDenseGemmKernelobjectat_Tensorgmemoi64i641_Tensorgmemoi641i64_Tensorgmemoi64i641_74_CUstream0x0_functionrunlocalslambdaat.sm_100a.cubin > sm100_dense_gemm.sass
vim sm100_dense_gemm.sass
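The manual steps above can also be scripted. The sketch below compares the ptxas version recorded in each kept cubin against the ptxas on PATH. It assumes the cubins are written to the current directory (which is how the repro command appears to behave), that nvdisasm and ptxas are both on PATH, and that both print the usual "Cuda compilation tools, release ..." banner. The helper names (tool_output, release_of, compare_ptxas) are hypothetical.

```python
import re
import subprocess
from pathlib import Path

_RELEASE = re.compile(r"Cuda compilation tools, release (\d+\.\d+), V([\d.]+)")


def tool_output(cmd):
    """Run a CUDA command-line tool and return its stdout as text."""
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def release_of(text):
    """Extract the full version (e.g. '12.9.83') from a toolkit banner, or None."""
    match = _RELEASE.search(text)
    return match.group(2) if match else None


def compare_ptxas(cubin_dir=".", run=tool_output):
    """Map each cubin name to (embedded_version, system_version, versions_match)."""
    system = release_of(run(["ptxas", "--version"]))
    results = {}
    for cubin in sorted(Path(cubin_dir).glob("*.cubin")):
        # nvdisasm prints the .note.nv.tkinfo section along with the SASS.
        embedded = release_of(run(["nvdisasm", str(cubin)]))
        results[cubin.name] = (embedded, system, embedded == system)
    return results


# Usage, after running the example with CUTE_DSL_KEEP_CUBIN=1:
#   for name, (embedded, system, same) in compare_ptxas().items():
#       print(name, "embedded:", embedded, "system:", system, "match:", same)
```

The run parameter exists only so the tool invocations can be swapped out, for example to replay saved output on a machine without the CUDA toolkit.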
Environment details (please complete the following information):
Driver Version: 590.48.01
CUDA Version: 13.1
nvidia-cutlass-dsl: 4.3.5
nvcc: V13.1.80