[FEA]: Load-balanced segmented reduce #6171

### Is this a duplicate?

- [x] I confirmed there appear to be no [duplicate issues](https://github.com/NVIDIA/cccl/issues) for this request and that I agree to the [Code of Conduct](CODE_OF_CONDUCT.md)

### Area

CUB

### Is your feature request related to a problem? Please describe.

I would like better performance from `cub::DeviceSegmentedReduce` for small segments, i.e. segment sizes on the order of 1 to 10 elements.

The current implementation uses a simple mapping of 1 CTA per segment, which is inefficient for segments this small because most threads in the CTA have nothing to reduce.
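To put a rough number on the inefficiency, here is a minimal sketch of the arithmetic. The helper `active_thread_fraction` and the CTA size of 256 threads used in the note below are assumptions for illustration only, not values taken from CUB's tuning policies.

```cpp
#include <cstddef>

// Hypothetical helper (not part of CUB): fraction of threads in one CTA that
// have at least one input item when the whole CTA is assigned to one segment.
// block_threads is an assumed launch parameter.
double active_thread_fraction(std::size_t segment_size, std::size_t block_threads)
{
  if (block_threads == 0)
  {
    return 0.0;
  }
  // At most one item per thread is needed to keep a thread busy.
  const std::size_t active = segment_size < block_threads ? segment_size : block_threads;
  return static_cast<double>(active) / static_cast<double>(block_threads);
}
```

Under these assumptions, `active_thread_fraction(8, 256)` is 8/256, so roughly 3% of the threads in the CTA contribute to an 8-element segment.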

### Describe the solution you'd like

I would like `cub::DeviceSegmentedReduce` to take advantage of the same "load balancing" approach that @gevtushenko developed for `cub::DeviceSegmentedSort`, which bins segments by size and dispatches each bin to a different level of parallelism depending on how big its segments are.

@gevtushenko already had a PoC implementation in the old repo that showed dramatic performance improvements: https://github.com/NVIDIA/cub/pull/578
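As a rough host-side illustration of the binning idea (not the code from the PoC or from `cub::DeviceSegmentedSort`), segments could be partitioned by size as sketched below. All names here (`BinLimits`, `BinnedSegments`, `bin_segments`) and the thresholds 32 and 1024 are hypothetical. The sketch assumes a non-decreasing offsets array with `num_segments + 1` entries.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical size thresholds; a real implementation would tune these.
struct BinLimits
{
  std::size_t small_max  = 32;   // segments up to this size go to the small bin
  std::size_t medium_max = 1024; // segments up to this size go to the medium bin
};

// Segment indices grouped by size class.
struct BinnedSegments
{
  std::vector<std::size_t> small_ids;
  std::vector<std::size_t> medium_ids;
  std::vector<std::size_t> large_ids;
};

// offsets[i] and offsets[i + 1] delimit segment i (assumed non-decreasing).
BinnedSegments bin_segments(const std::vector<std::size_t>& offsets, BinLimits limits = {})
{
  BinnedSegments bins;
  for (std::size_t seg = 0; seg + 1 < offsets.size(); ++seg)
  {
    const std::size_t size = offsets[seg + 1] - offsets[seg];
    if (size <= limits.small_max)
    {
      bins.small_ids.push_back(seg);
    }
    else if (size <= limits.medium_max)
    {
      bins.medium_ids.push_back(seg);
    }
    else
    {
      bins.large_ids.push_back(seg);
    }
  }
  return bins;
}
```

Each bin could then be reduced with a different granularity, for example several small segments per CTA and one CTA per large segment. That mapping is an assumption about how the dispatch might look, not a description of the existing PoC.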

<img width="1858" height="1962" alt="Image" src="https://github.com/user-attachments/assets/b7856e7e-553c-44af-947d-1e2bb6df95d0" />

This issue can be closed by picking up where https://github.com/NVIDIA/cub/pull/578 left off and refactoring `cub::DeviceSegmentedReduce` to use the load-balanced approach, improving performance for small segment sizes.

### Describe alternatives you've considered

_No response_

### Additional context

_No response_
