[FEA]: Load-balanced segmented reduce #6171

@jrhemstad

Area

CUB

Is your feature request related to a problem? Please describe.

I would like better performance from cub::DeviceSegmentedReduce for "small" segments, i.e. segment sizes on the order of 1 to 10 elements (O(1)/O(10)).

The current implementation uses a simple mapping of one CTA (thread block) per segment, which is inefficient when segments are this small because most threads in each CTA have no elements to reduce.
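
To put a rough number on that inefficiency, here is a minimal host-side sketch. The helper `active_thread_fraction` and the CTA size passed to it are hypothetical and chosen only for illustration; they are not taken from CUB's tuning policies.

```cpp
#include <cassert>
#include <cstddef>

// Fraction of a CTA's threads that have at least one input element to load
// when one whole CTA is assigned to a single segment.
// `block_threads` is an assumed CTA size, not a value read from CUB.
double active_thread_fraction(std::size_t segment_size, std::size_t block_threads)
{
  assert(block_threads > 0);
  // At most `block_threads` threads can be busy, however large the segment is.
  const std::size_t active = segment_size < block_threads ? segment_size : block_threads;
  return static_cast<double>(active) / static_cast<double>(block_threads);
}
```

With an assumed 256-thread CTA, `active_thread_fraction(8, 256)` is 8/256, or about 3% of the threads doing useful loads for an 8-element segment.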

Describe the solution you'd like

I would like cub::DeviceSegmentedReduce to take advantage of the same "load balancing" approach that @gevtushenko developed for cub::DeviceSegmentedSort, which involves binning segments based on size and dispatching each bin to a different level of parallelism depending on how big its segments are.
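
The binning step could look roughly like the host-side sketch below. It is only an illustration of the idea, not the code in cub::DeviceSegmentedSort or in the PoC: the `SegmentBins` type, the `bin_segments` function, the three-way split, and both thresholds are hypothetical. It assumes the common layout where a single array of `num_segments + 1` offsets describes the segments.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Segment indices grouped by size class (hypothetical three-way split).
struct SegmentBins
{
  std::vector<std::size_t> small;  // e.g. candidates for a thread or sub-warp each
  std::vector<std::size_t> medium; // e.g. candidates for one warp each
  std::vector<std::size_t> large;  // e.g. candidates for one CTA each
};

// Bins segment i, which spans [offsets[i], offsets[i + 1]), by its size.
// `small_max` and `medium_max` are assumed inclusive thresholds.
SegmentBins bin_segments(const std::vector<std::size_t>& offsets,
                         std::size_t small_max,
                         std::size_t medium_max)
{
  assert(!offsets.empty() && small_max <= medium_max);
  SegmentBins bins;
  for (std::size_t i = 0; i + 1 < offsets.size(); ++i)
  {
    assert(offsets[i] <= offsets[i + 1]); // offsets must be non-decreasing
    const std::size_t size = offsets[i + 1] - offsets[i];
    if (size <= small_max)
    {
      bins.small.push_back(i);
    }
    else if (size <= medium_max)
    {
      bins.medium.push_back(i);
    }
    else
    {
      bins.large.push_back(i);
    }
  }
  return bins;
}
```

Each bin would then be handed to a kernel whose parallelism matches its segment sizes. A real implementation would presumably do this partitioning on the device rather than on the host.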

@gevtushenko already had a PoC implementation in the old repo that showed dramatic performance improvements: NVIDIA/cub#578

This issue can be closed by picking up where NVIDIA/cub#578 left off and refactoring cub::DeviceSegmentedReduce to use the load-balanced approach, improving performance for small segment sizes.

Describe alternatives you've considered

No response

Additional context

No response
