
Conversation

@PointKernel
Member

This PR adds a best practices guide for NVBench, providing code examples to help users quickly get started and conduct effective performance comparisons in real-world scenarios.

@oleksandr-pavlyk
Collaborator

@PointKernel I like the document as a friendly "getting started guide".

I would expect the best practices guide to provide answers to:

  • nvbench vs. NCU
  • should I or should I not lock GPU frequency
  • should I use a single kernel in the launchable lambda per benchmark, or can I use multiple kernels
  • how to do performance tuning, with a kernel example and reasoning for arriving at an optimal choice of parameters

So perhaps the document could be renamed; otherwise it looks good to me.

@PointKernel
Member Author

PointKernel commented Oct 2, 2025

Regarding the comparison between nvbench and NCU, could you elaborate a bit on what kind of information would be most useful for users? From my perspective, nvbench primarily gathers runtime data and basic metrics, which feels more similar to NSYS, whereas NCU provides deeper kernel-level profiling with detailed hardware utilization insights.

Yes, I should have been clearer. What I had in mind was the difference between the kernel runtime estimates reported by NCU and by nvbench. I was recently bitten by a discrepancy in timings caused by NCU locking the GPU frequency while timing, which nvbench does not do. I needed to pass NCU the --clock-control none option to get its timings to agree with those reported by nvbench.

While NCU may have other reasons to lock the frequency (for example, to make sampling-based estimates such as stall rates more accurate), the GPU Mode talk https://www.youtube.com/watch?v=CtrqBmYtSEk by @gevtushenko makes the point that timings obtained with a locked frequency may not be representative of real-world kernel performance.
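For reference, a minimal sketch of the two NCU invocations discussed above; ./my_benchmark is a placeholder name for any NVBench executable, not a file from this PR:

```shell
# Default behavior: NCU locks GPU clocks to the base frequency while
# profiling, so its kernel timings can disagree with nvbench's.
ncu ./my_benchmark

# Disable NCU's clock control so its timings are taken at whatever
# frequency the GPU actually runs at, matching what nvbench reports
# (nvbench does not lock clocks by default):
ncu --clock-control none ./my_benchmark
```

This is a CLI fragment rather than a runnable script; the key point is simply that the clock-locking policies of the two tools must agree before their timings can be compared.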
