Add nvfp4 group gemm example. #77
Open
Test results on a B200 GPU with Python 3.13.8 and CuTe DSL 4.3.0.dev0:
(env13_8) nvfp4_group_gemm$ python3 eval.py test task.yml
main
test-count: 10
test.0.spec: m: [128, 128]; n: [128, 256]; k: [128, 512]; g: 2; seed: 1111
test.0.status: pass
test.1.spec: m: [256, 128]; n: [512, 384]; k: [256, 256]; g: 2; seed: 1111
test.1.status: pass
test.2.spec: m: [128, 128]; n: [128, 256]; k: [128, 512]; g: 2; seed: 1111
test.2.status: pass
test.3.spec: m: [256, 128, 256]; n: [384, 256, 128]; k: [256, 512, 128]; g: 3; seed: 1111
test.3.status: pass
test.4.spec: m: [512, 256, 128]; n: [768, 128, 256]; k: [512, 512, 128]; g: 3; seed: 1111
test.4.status: pass
test.5.spec: m: [128, 768, 512]; n: [128, 384, 512]; k: [384, 512, 128]; g: 3; seed: 1111
test.5.status: pass
test.6.spec: m: [512, 768, 384]; n: [256, 512, 512]; k: [768, 128, 768]; g: 3; seed: 1111
test.6.status: pass
test.7.spec: m: [128, 128, 128, 128]; n: [128, 128, 128, 128]; k: [128, 128, 128, 128]; g: 4; seed: 1111
test.7.status: pass
test.8.spec: m: [256, 128, 384, 512]; n: [512, 384, 256, 128]; k: [256, 256, 256, 256]; g: 4; seed: 1111
test.8.status: pass
test.9.spec: m: [512, 384, 256, 128]; n: [256, 256, 256, 256]; k: [512, 128, 512, 128]; g: 4; seed: 1111
test.9.status: pass
check: pass
main end
(env13_8) nvfp4_group_gemm$ python3 eval.py benchmark task.yml
main
benchmark-count: 4
benchmark.0.spec: m: [128, 128, 128, 128, 128, 128, 128, 128]; n: [2048, 6144, 2048, 5120, 2048, 7168, 3072, 5120]; k: [7168, 7168, 7168, 7168, 7168, 7168, 7168, 7168]; g: 8; seed: 1111
benchmark.0.runs: 100
benchmark.0.mean: 357707.5186371803
benchmark.0.std: 71426.6155640933
benchmark.0.err: 7142.66155640933
benchmark.0.best: 328736.00721359253
benchmark.0.worst: 769056.0221672058
benchmark.1.spec: m: [128, 128, 128, 128, 128, 128, 128, 128]; n: [6144, 8192, 5120, 8192, 7168, 7168, 8192, 7168]; k: [2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048]; g: 8; seed: 1111
benchmark.1.runs: 100
benchmark.1.mean: 321697.92115688324
benchmark.1.std: 68806.66814911975
benchmark.1.err: 6880.666814911975
benchmark.1.best: 295967.9961204529
benchmark.1.worst: 698400.0205993652
benchmark.2.spec: m: [256, 256]; n: [3072, 3072]; k: [4096, 4096]; g: 2; seed: 1111
benchmark.2.runs: 100
benchmark.2.mean: 165156.15984797478
benchmark.2.std: 39791.514431117816
benchmark.2.err: 3979.1514431117816
benchmark.2.best: 151552.0066022873
benchmark.2.worst: 546688.0202293396
benchmark.3.spec: m: [128, 384]; n: [4096, 4096]; k: [1536, 1536]; g: 2; seed: 1111
benchmark.3.runs: 100
benchmark.3.mean: 158621.1197078228
benchmark.3.std: 45832.550645002164
benchmark.3.err: 4583.255064500217
benchmark.3.best: 143071.9941854477
benchmark.3.worst: 489407.9864025116
check: pass
main end
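To put the benchmark shapes in perspective, the spec lines above can be turned into a floating-point operation count. This is a minimal sketch, not part of the PR: `parse_spec` and `group_gemm_flops` are hypothetical helper names, and it assumes each group is an independent (m x k) by (k x n) product costing 2*m*n*k FLOPs. It does not convert to throughput, because the log does not state the time unit of the reported values.

```python
def parse_spec(spec: str) -> dict:
    """Parse 'm: [..]; n: [..]; k: [..]; g: 2; seed: 1111' into a dict.

    Hypothetical helper; the field layout is taken from the log lines above.
    """
    out = {}
    for field in spec.split(";"):
        key, value = field.split(":", 1)
        key, value = key.strip(), value.strip()
        if value.startswith("["):
            # List-valued field such as m, n, or k (one entry per group).
            out[key] = [int(x) for x in value.strip("[]").split(",")]
        else:
            # Scalar field such as g or seed.
            out[key] = int(value)
    return out


def group_gemm_flops(spec: dict) -> int:
    """Total FLOPs, assuming 2*m*n*k per group (one multiply plus one add)."""
    return sum(2 * m * n * k for m, n, k in zip(spec["m"], spec["n"], spec["k"]))


if __name__ == "__main__":
    # Spec string copied from benchmark.2 above.
    line = "m: [256, 256]; n: [3072, 3072]; k: [4096, 4096]; g: 2; seed: 1111"
    spec = parse_spec(line)
    print(spec["g"], group_gemm_flops(spec))
```

Dividing this count by a measured time would give throughput once the unit reported by `eval.py` is confirmed.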
(env13_8) nvfp4_group_gemm$ python3 eval.py leaderboard task.yml
main
benchmark-count: 4
benchmark.0.spec: m: [128, 128, 128, 128, 128, 128, 128, 128]; n: [2048, 6144, 2048, 5120, 2048, 7168, 3072, 5120]; k: [7168, 7168, 7168, 7168, 7168, 7168, 7168, 7168]; g: 8; seed: 1111
benchmark.0.runs: 100
benchmark.0.mean: 557592.6411151886
benchmark.0.std: 306415.5682646463
benchmark.0.err: 30641.556826464628
benchmark.0.best: 482336.01450920105
benchmark.0.worst: 3452960.0143432617
benchmark.1.spec: m: [128, 128, 128, 128, 128, 128, 128, 128]; n: [6144, 8192, 5120, 8192, 7168, 7168, 8192, 7168]; k: [2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048]; g: 8; seed: 1111
benchmark.1.runs: 100
benchmark.1.mean: 458914.5600795746
benchmark.1.std: 105335.3649110644
benchmark.1.err: 10533.53649110644
benchmark.1.best: 416832.00001716614
benchmark.1.worst: 878624.0220069885
benchmark.2.spec: m: [256, 256]; n: [3072, 3072]; k: [4096, 4096]; g: 2; seed: 1111
benchmark.2.runs: 100
benchmark.2.mean: 214728.64016890526
benchmark.2.std: 51155.40134188374
benchmark.2.err: 5115.540134188374
benchmark.2.best: 198592.00716018677
benchmark.2.worst: 717055.9763908386
benchmark.3.spec: m: [128, 384]; n: [4096, 4096]; k: [1536, 1536]; g: 2; seed: 1111
benchmark.3.runs: 100
benchmark.3.mean: 213030.39968013763
benchmark.3.std: 63644.099319036424
benchmark.3.err: 6364.409931903642
benchmark.3.best: 195904.00159358978
benchmark.3.worst: 662688.0168914795
check: pass
main end
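In every entry above, the reported `err` matches `std / sqrt(runs)`, i.e. the standard error of the mean over 100 runs. That relation is inferred from the numbers, not confirmed from the `eval.py` source. A minimal sketch that parses the `key: value` output and checks it follows; `parse_stats` is a hypothetical helper, not part of the PR.

```python
import math


def parse_stats(log: str) -> dict:
    """Collect numeric 'benchmark.<i>.<field>: <value>' lines by benchmark index.

    Hypothetical helper; lines such as 'main', 'benchmark-count', and the
    non-numeric '.spec' entries are skipped.
    """
    stats = {}
    for line in log.splitlines():
        key, sep, value = line.partition(": ")
        parts = key.split(".")
        if not sep or len(parts) != 3 or parts[0] != "benchmark" or parts[2] == "spec":
            continue
        stats.setdefault(int(parts[1]), {})[parts[2]] = float(value)
    return stats


if __name__ == "__main__":
    # Values copied from benchmark.0 of the `benchmark` run above.
    log = """main
benchmark-count: 4
benchmark.0.runs: 100
benchmark.0.mean: 357707.5186371803
benchmark.0.std: 71426.6155640933
benchmark.0.err: 7142.66155640933
"""
    for idx, s in parse_stats(log).items():
        # Assumed relation: err is the standard error of the mean.
        expected = s["std"] / math.sqrt(s["runs"])
        print(idx, math.isclose(s["err"], expected, rel_tol=1e-9))
```

The same parser can be pointed at the `leaderboard` output to compare its means and spreads against the `benchmark` run for identical specs.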