Description
Is this a duplicate?
- I confirmed there appear to be no duplicate issues for this request and that I agree to the Code of Conduct
Area
CUB
Is your feature request related to a problem? Please describe.
I would like better performance from cub::DeviceSegmentedReduce for small segments, on the order of 1 to 10 elements each.
The current implementation maps one CTA (thread block) to each segment, which is inefficient when segments are this small.
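To put a rough number on that inefficiency, here is a minimal sketch that estimates what fraction of a block's threads have any input to load when one block handles one segment. The function name and the block size of 256 are assumptions for illustration, not values taken from CUB's tuning policies.

```cpp
#include <algorithm>
#include <cstddef>

// Fraction of the threads in one block that receive at least one input item
// when an entire block is assigned to a single segment.
// `block_threads` is a hypothetical block size, not CUB's actual policy.
double active_thread_fraction(std::size_t segment_size, std::size_t block_threads)
{
  const std::size_t active = std::min(segment_size, block_threads);
  return static_cast<double>(active) / static_cast<double>(block_threads);
}
```

Under these assumptions, an 8-element segment on a 256-thread block gives 8 / 256, so roughly 3% of the threads have data to reduce.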
Describe the solution you'd like
I would like cub::DeviceSegmentedReduce to take advantage of the same "load balancing" approach that @gevtushenko developed for cub::DeviceSegmentedSort, which bins segments by size and dispatches each bin to a different level of parallelism depending on how big its segments are.
@gevtushenko already had a PoC implementation in the old repo that showed dramatic performance improvements: NVIDIA/cub#578
This issue can be closed by picking up where NVIDIA/cub#578 left off and refactoring cub::DeviceSegmentedReduce to use the load-balanced approach, with improved performance for small segment sizes.
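For illustration, the binning step could look roughly like the host-side sketch below. This is not the code from NVIDIA/cub#578 or from cub::DeviceSegmentedSort; the `SegmentBins` type, the `bin_segments` function, the thresholds of 16 and 512, and the mapping of each bin to a thread, warp, or block are all assumptions chosen to show the idea. A real implementation would do the partitioning on the device.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical size classes. Each bin holds indices of segments that would be
// dispatched to a different level of parallelism.
struct SegmentBins
{
  std::vector<std::size_t> small;  // assumed: a thread or sub-warp group per segment
  std::vector<std::size_t> medium; // assumed: one warp per segment
  std::vector<std::size_t> large;  // assumed: one block per segment
};

// Bins segments by size. `offsets` has num_segments + 1 entries, where
// segment i covers [offsets[i], offsets[i + 1]).
// The default thresholds are illustrative only.
SegmentBins bin_segments(const std::vector<std::size_t>& offsets,
                         std::size_t small_max  = 16,
                         std::size_t medium_max = 512)
{
  assert(!offsets.empty());
  SegmentBins bins;
  for (std::size_t i = 0; i + 1 < offsets.size(); ++i)
  {
    assert(offsets[i] <= offsets[i + 1]); // offsets must be non-decreasing
    const std::size_t size = offsets[i + 1] - offsets[i];
    if (size <= small_max)
    {
      bins.small.push_back(i);
    }
    else if (size <= medium_max)
    {
      bins.medium.push_back(i);
    }
    else
    {
      bins.large.push_back(i);
    }
  }
  return bins;
}
```

Each bin could then be reduced by a separate kernel launch (or a separate code path within one launch), so that small segments no longer each occupy a full block.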
Describe alternatives you've considered
No response
Additional context
No response