Add cutlass python dsl executor for `quack-kernels` #2719

crcrpar · 2025-11-05T17:46:17Z

What does this PR do?

As the title says, this adds a CUTLASS Python DSL executor.
This PR registers the kernels defined in https://github.com/Dao-AILab/quack, except matmul. Backward is not integrated yet.

Copilot

Pull Request Overview

This PR adds support for the CUTLASS DSL executor (cutlass_dsl_ex) to Thunder, integrating the quack library for optimized operations like softmax, cross_entropy, layer_norm, and RMS norm on NVIDIA SM9.0/10.0 GPUs.

Introduces a new cutlass_dsl_ex executor with quack operation implementations
Adds comprehensive test coverage for quack operations
Adds benchmark suites for performance comparison against nvfuser and torch_compile
Registers the new executor in Thunder's executor registry

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 6 comments.

File	Description
thunder/executors/cutlass_dsl_ex.py	New file implementing the cutlass_dsl executor with quack operations for softmax, cross_entropy, layer_norm, and RMS norm
thunder/extend/__init__.py	Registers cutlass_dsl_ex in the get_all_executors function
thunder/tests/test_extend.py	Updates test to include cutlass_dsl executor in the expected executors list
thunder/tests/test_cutlass_dsl_ex.py	New test file with comprehensive tests for quack operations
thunder/benchmarks/targets.py	Adds benchmark classes and test functions for quack operations

Copilot · 2025-11-06T08:28:35Z

thunder/executors/cutlass_dsl_ex.py

+        if requires_reshpae := a.ndim > 2:
+            a = a.view(-1, original_shape[-1])
+        ret = softmax_fwd(a)
+        if requires_reshpae:


Corrected spelling of 'requires_reshpae' to 'requires_reshape'.

Suggested change

-        if requires_reshpae := a.ndim > 2:
-            a = a.view(-1, original_shape[-1])
-        ret = softmax_fwd(a)
-        if requires_reshpae:
+        if requires_reshape := a.ndim > 2:
+            a = a.view(-1, original_shape[-1])
+        ret = softmax_fwd(a)
+        if requires_reshape:
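
The hunk above flattens an input with more than two dimensions into a 2D view before calling `softmax_fwd`; the result then presumably gets viewed back to the original shape. A minimal sketch of that shape bookkeeping in plain Python follows. `flatten_to_2d_shape` is a hypothetical helper for illustration, not part of Thunder or quack, and it only computes shapes rather than touching tensors.

```python
import math


def flatten_to_2d_shape(shape: tuple[int, ...]) -> tuple[int, int]:
    """Return the (rows, cols) shape that `view(-1, shape[-1])` would produce.

    Hypothetical helper for illustration only.
    """
    if len(shape) < 1:
        raise ValueError("expected at least one dimension")
    cols = shape[-1]
    rows = math.prod(shape[:-1])  # product of all leading dimensions (1 if there are none)
    return rows, cols


original_shape = (4, 8, 16)
requires_reshape = len(original_shape) > 2
flat_shape = flatten_to_2d_shape(original_shape) if requires_reshape else original_shape
# A 2D-only kernel would run on `flat_shape`; its output is then viewed back as `original_shape`.
```

Keeping `original_shape` around before flattening is what makes the final restore possible, since the 2D view no longer carries the leading dimensions.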

Copilot · 2025-11-06T08:28:36Z

thunder/executors/cutlass_dsl_ex.py

+        if requires_reshpae := a.ndim > 2:
+            a = a.view(-1, original_shape[-1])
+        ret = softmax_fwd(a)
+        if requires_reshpae:


Corrected spelling of 'requires_reshpae' to 'requires_reshape'.

Suggested change

-        if requires_reshpae := a.ndim > 2:
-            a = a.view(-1, original_shape[-1])
-        ret = softmax_fwd(a)
-        if requires_reshpae:
+        if requires_reshape := a.ndim > 2:
+            a = a.view(-1, original_shape[-1])
+        ret = softmax_fwd(a)
+        if requires_reshape:

Copilot · 2025-11-06T08:28:36Z

thunder/executors/cutlass_dsl_ex.py

+            a.ndim != 2
+            or a.dtype not in {dtypes.float16, dtypes.bfloat16, dtypes.float32}
+            and target.ndim == 1
+            and target.dytpe in {dtypes.int32, dtypes.int64}


Corrected spelling of 'dytpe' to 'dtype'.

Suggested change

-            and target.dytpe in {dtypes.int32, dtypes.int64}
+            and target.dtype in {dtypes.int32, dtypes.int64}
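
Besides the typo, the quoted condition mixes `or` and `and` without parentheses. Python binds `and` more tightly than `or`, so `x or y and z` parses as `x or (y and z)`. The sketch below shows one way to write such a checker with every constraint stated positively. It assumes the intended constraints are a 2D floating-point input and a 1D integer target; `FakeTensor` and the string dtypes are stand-ins for illustration, not Thunder's `TensorProxy` or `dtypes`.

```python
from dataclasses import dataclass

FLOAT_DTYPES = {"float16", "bfloat16", "float32"}
INT_DTYPES = {"int32", "int64"}


@dataclass
class FakeTensor:
    """Stand-in for a tensor proxy; carries only what the checker inspects."""

    ndim: int
    dtype: str


def cross_entropy_checker(a: FakeTensor, target: FakeTensor) -> bool:
    """Return True only when every assumed constraint holds."""
    return (
        a.ndim == 2
        and a.dtype in FLOAT_DTYPES
        and target.ndim == 1
        and target.dtype in INT_DTYPES
    )


# `and` binds tighter than `or`, so these two expressions differ.
implicit_grouping = True or False and False    # parsed as True or (False and False)
explicit_grouping = (True or False) and False
```

Writing the accepted case as a single chain of `and` clauses avoids having to reason about how a mixed `or`/`and` rejection condition groups.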

Copilot · 2025-11-06T08:28:36Z

thunder/executors/cutlass_dsl_ex.py

+    def quack_softmax_backward_meta(g: TensorProxy, a: TensorProxy) -> TensorProxy:
+        return TensorProxy(like=g)
+
+    quack_softmax_backward = cutlass_dsl_ex.register_operator(


The global variable 'quack_softmax_backward' is not used.

Suggested change

-    quack_softmax_backward = cutlass_dsl_ex.register_operator(
+    cutlass_dsl_ex.register_operator(

Copilot · 2025-11-06T08:28:37Z

thunder/benchmarks/targets.py

+    return thunder.jit(fn, executors=[nvfuserex])
+
+
+class BaseBenchmarkForQuack(Benchmark, metaclass=UserFacingBenchmarkMeta):


This class does not call Benchmark.__init__ during initialization. (BaseBenchmarkForQuack.__init__ may be missing a call to a base class __init__)
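
The usual remedy for this kind of warning is to call the base initializer from the subclass. A minimal sketch follows; `BaseBench` and `QuackLikeBench` are hypothetical classes, since the actual signature of Thunder's `Benchmark.__init__` is not shown in this thread.

```python
class BaseBench:
    """Hypothetical base class standing in for a benchmark base."""

    def __init__(self) -> None:
        self.devices: list[str] = []


class QuackLikeBench(BaseBench):
    """Hypothetical subclass; names and attributes are illustrative only."""

    def __init__(self, shape: tuple[int, ...], device: str = "cuda") -> None:
        super().__init__()  # run the base initializer before adding subclass state
        self.shape = shape
        self.devices.append(device)  # relies on state created by the base initializer


bench = QuackLikeBench((8, 16))
```

Without the `super().__init__()` call, `self.devices` would not exist and the `append` would raise `AttributeError`.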

riccardofelluga · 2025-11-06T13:15:32Z

thunder/executors/cutlass_dsl_ex.py

+        weight: TensorProxy | None = None,
+        bias: TensorProxy | None = None,
+        eps: Number = 1e-5,
+    ) -> bool:
+        if (
+            a.dtype not in {dtypes.float16, dtypes.bfloat16, dtypes.float32}
+            or weight.ndim != 1
+            or a.shape[-1] != weight.shape[0]
+            or weight.dtype not in {dtypes.float32}


Can weight be None? If so, this check needs to handle that case before accessing .ndim.

crcrpar · 2025-11-10

Good catch, I will check it.
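
One way to address the question is to reject a missing weight before reading any of its attributes, which is consistent with the later commit titled "mandate weight in layer|rms norm". The sketch below assumes that approach; `FakeTensor`, `norm_checker`, and the string dtypes are illustrative stand-ins, not the code in this PR.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeTensor:
    """Stand-in for a tensor proxy with just a shape and a dtype name."""

    shape: tuple[int, ...]
    dtype: str

    @property
    def ndim(self) -> int:
        return len(self.shape)


def norm_checker(a: FakeTensor, weight: Optional[FakeTensor] = None) -> bool:
    """Return True only if the (assumed) kernel constraints are met."""
    # Reject a missing weight first so that `.ndim` is never read from None.
    if weight is None:
        return False
    if a.ndim < 1 or a.dtype not in {"float16", "bfloat16", "float32"}:
        return False
    if weight.ndim != 1 or a.shape[-1] != weight.shape[0]:
        return False
    return weight.dtype == "float32"
```

Returning `False` from a checker lets another executor claim the operation instead of failing with `AttributeError` inside the check.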

kiya00 · 2025-11-10T13:26:42Z

thunder/executors/cutlass_dsl_ex.py

+
+quack_version: LooseVersion
+try:
+    import quack


Do we need to add this to the requirements so that it gets installed?

crcrpar · 2025-11-10

I don't think we should. pip install quack-kernels seems to install CUDA Python packages such as nvidia-cutlass-dsl, and I don't know how to make requirements.txt install CUDA Python packages in a way that respects users' local environments.
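
Since the package stays out of the requirements, the executor has to detect it at import time. A minimal sketch of such an availability check using only the standard library follows. `optional_package_version` is a hypothetical helper, and the PR itself uses a `try: import quack` block rather than this function.

```python
import importlib.util
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def optional_package_version(module_name: str, dist_name: str) -> Optional[str]:
    """Return the installed version string, or None if the package is missing."""
    if importlib.util.find_spec(module_name) is None:
        return None
    try:
        return version(dist_name)
    except PackageNotFoundError:
        # Importable module without distribution metadata (e.g. a source checkout).
        return None


# The module is imported as `quack`, while the distribution is named `quack-kernels`.
quack_version = optional_package_version("quack", "quack-kernels")
QUACK_AVAILABLE = quack_version is not None
```

Code that registers operators can then be guarded by `QUACK_AVAILABLE`, so that environments without the package still import cleanly.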

kiya00 · 2025-11-10T13:30:38Z

thunder/tests/test_cutlass_dsl_ex.py

+
+    expected = F.cross_entropy(ref_x, targets, reduction="none")
+    actual = jitted(x, targets, reduction="none")
+    torch.testing.assert_close(expected, actual)


It seems the backward pass is not tested.

crcrpar · 2025-11-10

I have not managed to get backward working.

crcrpar requested review from KaelanDt, lantiga, mruberry and t-vi as code owners November 5, 2025 17:46

crcrpar requested review from Copilot, kiya00 and riccardofelluga November 6, 2025 08:25

Copilot AI reviewed Nov 6, 2025

riccardofelluga reviewed Nov 6, 2025

kiya00 reviewed Nov 10, 2025

crcrpar added 19 commits November 11, 2025 10:48

Add cutlass-python-dsl executor.

7fab90c

Starting with quack's softmax Signed-off-by: Masaki Kozuki <[email protected]>

[no ci] add crossentropy

2f42fa2

Signed-off-by: Masaki Kozuki <[email protected]>

[no ci] add layer norm forward

b95c9e4

Signed-off-by: Masaki Kozuki <[email protected]>

[no ci] add rmsnorm

a9eaf4d

Signed-off-by: Masaki Kozuki <[email protected]>

fix

0b51a12

Signed-off-by: Masaki Kozuki <[email protected]>

fix backward of crossentropy

d769096

Signed-off-by: Masaki Kozuki <[email protected]>

fix checkers

4a0ced4

Signed-off-by: Masaki Kozuki <[email protected]>

[no ci] add test

1a2a868

Signed-off-by: Masaki Kozuki <[email protected]>

DRY: dtypes & their ids

d6efb9a

Signed-off-by: Masaki Kozuki <[email protected]>

comment out backward for now

c7bbf34

Signed-off-by: Masaki Kozuki <[email protected]>

upcast inputs to fp32 for reference

4daff6e

it seems that quack's cross-entropy function upcasts inputs to fp32, thus updating test and meta function Signed-off-by: Masaki Kozuki <[email protected]>

fix how softmax is called

3711497

Signed-off-by: Masaki Kozuki <[email protected]>

upcast and downcast for reference layernorm

85a8e65

Signed-off-by: Masaki Kozuki <[email protected]>

fix typo of rmsnorm

1b633d5

Signed-off-by: Masaki Kozuki <[email protected]>

fix meta

27ff159

Signed-off-by: Masaki Kozuki <[email protected]>

add cutlass_dsl_ex to all_executors

0958dba

Signed-off-by: Masaki Kozuki <[email protected]>

Only forward, no backward support for now

6680878

Signed-off-by: Masaki Kozuki <[email protected]>

call non-augmented forward in execution transform

b9c876d

Signed-off-by: Masaki Kozuki <[email protected]>

quack bench

340f14e

Signed-off-by: Masaki Kozuki <[email protected]>

pre-commit-ci bot and others added 3 commits November 11, 2025 10:49

[pre-commit.ci] auto fixes from pre-commit.com hooks

968420c

for more information, see https://pre-commit.ci

fix quack availability check

b4408d1

Signed-off-by: Masaki Kozuki <[email protected]>

mandate weight in layer|rms norm

1e5f3b2

Signed-off-by: Masaki Kozuki <[email protected]>

crcrpar force-pushed the crpa/quack branch from 98a1b7d to 1e5f3b2 Compare November 11, 2025 01:49


3 participants
