Conversation


@19970126ljl 19970126ljl commented Oct 1, 2025

Description

Fixes a typo in places.cuh line 1665 where l2_size was incorrectly reading its width from level 1 instead of level 2.

closes #6090

auto spec = par(hw_scope::device, 1,
                par<4320>(hw_scope::block,
                    con<256>(hw_scope::thread)));

Expected: 1 device × 4320 blocks × 256 threads
Actual error: "Maximum block size 640 threads, requested 4320 (level 2)"

The fix ensures l2_size correctly reads from level 2 (p.get_width(2)),
which represents threads per block, rather than level 1, which represents the number of blocks (see places.cuh).

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

      Fixed a typo in places.cuh line 1665 where l2_size was incorrectly
      getting width from level 1 instead of level 2, causing Unsatisfiable
@19970126ljl 19970126ljl requested a review from a team as a code owner October 1, 2025 06:45
@github-project-automation github-project-automation bot moved this to Todo in CCCL Oct 1, 2025
Contributor

copy-pr-bot bot commented Oct 1, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@cccl-authenticator-app cccl-authenticator-app bot moved this from Todo to In Review in CCCL Oct 1, 2025
@caugonnet
Contributor

@19970126ljl great catch, thanks a lot for all of your efforts!

I'm hijacking this PR to tell you that you are welcome to send any feature requests or feedback on these launch constructs ([email protected]). The launch API is something we believe is useful, and the multi-level synchronization itself should be OK even if we can still reduce the launch overhead and make it more efficient with CUDA graphs, but mechanisms like apply_partition to iterate over data subsets are not yet optimized enough to be applicable.

@caugonnet
Contributor

/ok to test a03c77a

@19970126ljl
Author

Thank you for sharing such valuable insights!

I'm currently learning and experimenting with cudaSTF to write some parallel programs, and I find it an excellent framework as well as a great learning resource.

If I come across any new findings, ideas, or potential improvements while using it, I'll be sure to reach out and discuss them with you.

@caugonnet caugonnet added the stf Sequential Task Flow programming model label Oct 1, 2025
@caugonnet
Contributor

I've added an additional unit test as a PR against your branch, which we could fold into your fix.


…ndex

Add a new test to ensure proper CUDA configurations
@19970126ljl 19970126ljl requested a review from a team as a code owner October 1, 2025 14:55
@caugonnet
Contributor

pre-commit.ci autofix

@caugonnet
Contributor

/ok to test d5350b6


@caugonnet
Contributor

I can't push to your branch, but I'll add an additional test here:

//===----------------------------------------------------------------------===//
//
// Part of CUDASTF in CUDA C++ Core Libraries,
// under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
// SPDX-FileCopyrightText: Copyright (c) 2022-2025 NVIDIA CORPORATION & AFFILIATES.
//
//===----------------------------------------------------------------------===//

//! \file
//! \brief Check that multi-level launch specifications are fulfilled

#include <cuda/experimental/stf.cuh>

#include <cassert>
#include <iostream>

using namespace cuda::experimental::stf;

int main()
{
  stream_ctx ctx;
  
  // Create a 3-level thread hierarchy specification that would expose the bug:
  // Level 0: only 1 device to run on CI
  // Level 1: 4 blocks per device (width 4)
  // Level 2: 64 threads per block (width 64)
  auto spec = par(hw_scope::device, 1,
    con<4>(hw_scope::block,
    con<64>(hw_scope::thread)));
  
  int test_result = 0;
  auto l_test_result = ctx.logical_data(make_slice(&test_result, 1));
  
  ctx.launch(spec, exec_place::current_device(), l_test_result.rw())->*[] __device__(auto th, auto result) {
    if (th.rank() == 0)
    {
      bool level0_correct = (th.size(0) == 1);  // device level
      bool level1_correct = (th.size(1) == 1*4) && (gridDim.x == 4);  // blocks per device  
      bool level2_correct = (th.size(2) == 1*4*64) && (blockDim.x == 64); // threads per block

      // Set test result based on whether all levels are correct
      result[0] = level0_correct && level1_correct && level2_correct ? 1 : 0;
    }
  };
  
  ctx.finalize();

  _CCCL_ASSERT(test_result == 1, "Hierarchy dimensions are incorrect!");

  return 0;
}

@caugonnet caugonnet enabled auto-merge (squash) October 2, 2025 05:29
Contributor

github-actions bot commented Oct 2, 2025

🥳 CI Workflow Results

🟩 Finished in 14h 30m: Pass: 100%/53 | Total: 10h 24m | Max: 47m 08s | Hits: 81%/22571

See results here.

Labels
stf Sequential Task Flow programming model
Projects
Status: In Review
Development

Successfully merging this pull request may close these issues.

[BUG]: incorrect level index in 3-depth execution policy
2 participants