Feature/dcgm gpu backend #1391
- Introduce a new DCGM-backed NVIDIA GPU collector on Linux that populates the existing Gpu::gpu_info structures using dcgmUpdateAllFields and dcgmGetLatestValuesForFields.
- Prefer DCGM over NVML when built with -DBTOP_DCGM=ON and libdcgm is available, while keeping NVML as a transparent fallback on systems without DCGM.
- Track a unified nvidia_device_count so AMD (ROCm SMI) and Intel GPU backends stack correctly after whichever NVIDIA backend is active.
- Expose a new CMake option BTOP_DCGM and link libdcgm when enabled, keeping GPU runtime behaviour controlled via existing shown_gpus and show_gpu_info config options.
- Update README GPU compatibility and CMake documentation to describe the DCGM backend, including usage on DGX Spark / data center GB-series systems and how to enable it.

Tests:
- Built with -DBTOP_GPU=ON -DBTOP_DCGM=ON on Linux and verified that btop runs with DCGM present (DGX-style node) and falls back to NVML or no NVIDIA GPUs when DCGM is unavailable.
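The backend preference and the unified device count described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `ProbeResult`, `NvidiaSelection`, `select_nvidia_backend` and `first_non_nvidia_index` are hypothetical names, and the probe results are passed in as plain values instead of coming from real DCGM or NVML initialization.

```cpp
#include <cassert>

// Hypothetical result of trying to initialize one NVIDIA backend.
// In the PR this information would come from DCGM / NVML init calls,
// which this sketch deliberately does not invoke.
struct ProbeResult {
    bool available;    // library was found and initialized
    int device_count;  // number of GPUs it reported
};

enum class NvidiaBackend { None, Dcgm, Nvml };

struct NvidiaSelection {
    NvidiaBackend backend;
    int nvidia_device_count;  // unified count, regardless of backend
};

// Prefer DCGM when it was compiled in (assumed to mirror -DBTOP_DCGM=ON)
// and usable at runtime; otherwise fall back to NVML; otherwise no NVIDIA GPUs.
inline NvidiaSelection select_nvidia_backend(bool dcgm_built,
                                             ProbeResult dcgm,
                                             ProbeResult nvml) {
    assert(dcgm.device_count >= 0 && nvml.device_count >= 0);
    if (dcgm_built && dcgm.available && dcgm.device_count > 0)
        return {NvidiaBackend::Dcgm, dcgm.device_count};
    if (nvml.available && nvml.device_count > 0)
        return {NvidiaBackend::Nvml, nvml.device_count};
    return {NvidiaBackend::None, 0};
}

// AMD and Intel devices are stacked after the NVIDIA ones, so their first
// slot is the unified NVIDIA count, whichever backend produced it.
inline int first_non_nvidia_index(const NvidiaSelection& sel) {
    return sel.nvidia_device_count;
}
```

Keeping a single count means the later backends do not need to know which NVIDIA backend ended up active.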
- GPU name retrieval (lines 1290-1305): Added dcgmGetDeviceAttributes() to get proper GPU names, with the same cleanup logic as NVML.
- supported_functions initialization (lines 1482-1496): During collect<1>, supported_functions is now set based on which fields returned valid data, with unsupported features (pwr_state, pcie_txrx, encoder/decoder) explicitly set to false.
- Empty deque fallback (lines 1499-1509): All deques now guaranteed to have at least one value (0) to prevent .back() crashes.
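A rough sketch of the last two points, under stated assumptions: `GpuSeries`, `record_sample`, `ensure_non_empty` and `latest` are hypothetical names, and a missing field value is modelled with `std::optional` rather than with whatever sentinel DCGM actually returns.

```cpp
#include <cassert>
#include <deque>
#include <optional>

// Hypothetical stand-in for one metric's history plus its "supported" flag.
struct GpuSeries {
    std::deque<long long> values;
    bool supported = false;
};

// Mark the metric as supported only once a field has returned valid data.
// An empty optional models "the backend had no valid value for this field".
inline void record_sample(GpuSeries& s, std::optional<long long> sample) {
    if (sample) {
        s.values.push_back(*sample);
        s.supported = true;
    }
}

// Pad an empty history with a single 0 so that .back() is always defined.
inline void ensure_non_empty(GpuSeries& s) {
    if (s.values.empty()) s.values.push_back(0);
}

// Read the most recent value without risking .back() on an empty deque.
inline long long latest(GpuSeries& s) {
    ensure_non_empty(s);
    assert(!s.values.empty());
    return s.values.back();
}
```

Calling `.back()` on an empty `std::deque` is undefined behaviour, which is why the padding happens before any read.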
deckstose
left a comment
This lacks Makefile support.
In my opinion the changes can be split out of btop_collect into their own file.
What's up with the changes to the utility functions? Please undo them.
@deckstose I'm thinking about adding a rule that PRs that are obviously vibe coded should be dismissed unless the author can show that they actually understand the code in the PR (for example, that they have other C++ repositories that aren't vibe coded).
I totally agree.
@deckstose
Added support for NVIDIA GB (Grace Blackwell) GPUs, which are based on a unified memory architecture, by using DCGM. Tested on an NVIDIA DGX Spark (GB10).