Conversation


@19970126ljl 19970126ljl commented Oct 1, 2025

Description

Fixes a typo in places.cuh line 1665 where l2_size was incorrectly reading its width from level 1 instead of level 2.

closes #6090

auto spec = par(hw_scope::device, 1,
                par<4320>(hw_scope::block,
                    con<256>(hw_scope::thread)));

Expected: 1 device × 4320 blocks × 256 threads
Actual error: "Maximum block size 640 threads, requested 4320 (level 2)"

The fix ensures l2_size correctly reads from level 2 (p.get_width(2)),
which represents threads per block, rather than level 1, which represents the number of blocks (see places.cuh).

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

      Fixed a typo in places.cuh line 1665 where l2_size was incorrectly
      getting width from level 1 instead of level 2, causing Unsatisfiable
@19970126ljl 19970126ljl requested a review from a team as a code owner October 1, 2025 06:45
@github-project-automation github-project-automation bot moved this to Todo in CCCL Oct 1, 2025
Contributor

copy-pr-bot bot commented Oct 1, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@cccl-authenticator-app cccl-authenticator-app bot moved this from Todo to In Review in CCCL Oct 1, 2025
@caugonnet
Contributor

@19970126ljl great catch, thanks a lot for all of your efforts!

I'm hijacking this PR to tell you that you are welcome to send any feature requests or feedback on these launch constructs ([email protected]). The launch API is something we believe is useful, and the multi-level synchronization itself should be OK even if we can still reduce the launch overhead and make it more efficient with CUDA graphs, but mechanisms like apply_partition to iterate over data subsets are not yet optimized enough to be applicable.

@caugonnet
Contributor

/ok to test a03c77a

@19970126ljl
Author

Thank you for sharing such valuable insights!

I'm currently learning and experimenting with cudaSTF to write some parallel programs, and I find it an excellent framework as well as a great learning resource.

If I come across any new findings, ideas, or potential improvements while using it, I'll be sure to reach out and discuss them with you.

@caugonnet caugonnet added the stf Sequential Task Flow programming model label Oct 1, 2025
@caugonnet
Contributor

I've added an additional unit test as a PR against your branch, which we could fold into your fix.


…ndex

Add a new test to ensure proper CUDA configurations
@19970126ljl 19970126ljl requested a review from a team as a code owner October 1, 2025 14:55
@caugonnet
Contributor

pre-commit.ci autofix

@caugonnet
Contributor

/ok to test d5350b6


@caugonnet
Contributor

I can't push to your branch, but I'll add an additional test here:

//===----------------------------------------------------------------------===//
//
// Part of CUDASTF in CUDA C++ Core Libraries,
// under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
// SPDX-FileCopyrightText: Copyright (c) 2022-2025 NVIDIA CORPORATION & AFFILIATES.
//
//===----------------------------------------------------------------------===//

//! \file
//! \brief Check that multi-level launch specifications are fulfilled

#include <cuda/experimental/stf.cuh>

#include <cassert>
#include <iostream>

using namespace cuda::experimental::stf;

int main()
{
  stream_ctx ctx;
  
  // Create a 3-level thread hierarchy specification that would expose the bug:
  // Level 0: only 1 device to run on CI
  // Level 1: 4 blocks per device (width 4)
  // Level 2: 64 threads per block (width 64)
  auto spec = par(hw_scope::device, 1,
    con<4>(hw_scope::block,
    con<64>(hw_scope::thread)));
  
  int test_result = 0;
  auto l_test_result = ctx.logical_data(make_slice(&test_result, 1));
  
  ctx.launch(spec, exec_place::current_device(), l_test_result.rw())->*[] __device__(auto th, auto result) {
    if (th.rank() == 0)
    {
      bool level0_correct = (th.size(0) == 1);  // device level
      bool level1_correct = (th.size(1) == 1*4) && (gridDim.x == 4);  // blocks per device  
      bool level2_correct = (th.size(2) == 1*4*64) && (blockDim.x == 64); // threads per block

      // Set test result based on whether all levels are correct
      result[0] = level0_correct && level1_correct && level2_correct ? 1 : 0;
    }
  };
  
  ctx.finalize();

  _CCCL_ASSERT(test_result == 1, "Hierarchy dimensions are incorrect!");

  return 0;
}

@caugonnet caugonnet enabled auto-merge (squash) October 2, 2025 05:29
Contributor

github-actions bot commented Oct 2, 2025

🥳 CI Workflow Results

🟩 Finished in 14h 30m: Pass: 100%/53 | Total: 10h 24m | Max: 47m 08s | Hits: 81%/22571

See results here.

Labels
stf Sequential Task Flow programming model
Projects
Status: In Review
Development

Successfully merging this pull request may close these issues.

[BUG]: incorrect level index in 3-depth execution policy
2 participants