remapped_rows.histogram.* queries cause kernel warnings in vGPU guests #384

@akay

Describe the bug
When running nvidia_gpu_exporter inside a VM with a vGPU (L40-16Q), the kernel journal gets repeated warnings like:

NVRM: serverControl_ValidateCookie: Unsupported ROUTE_TO_PHYSICAL control 0x20801347 was called on vGPU guest

The warnings stop when the exporter container is stopped, which suggested that the exporter's nvidia-smi queries were triggering an unsupported physical-GPU control path from inside the vGPU guest. I traced this to the exporter's default (AUTO) nvidia-smi --query-gpu=... field list: the warning is triggered specifically by querying the remapped_rows.histogram.* fields inside a vGPU guest:

remapped_rows.histogram.max
remapped_rows.histogram.high
remapped_rows.histogram.partial
remapped_rows.histogram.low
remapped_rows.histogram.none

The exporter still works, but it generates kernel log noise for as long as it is scraping.

To Reproduce
Steps to reproduce the behavior:

  1. Use a VM with a vGPU attached (L40-16Q in my case).
  2. Install NVIDIA drivers (580.126.09) in the guest.
  3. Run nvidia_gpu_exporter via Docker Compose.
  4. Observe kernel logs with journalctl -b -k -f
  5. After the exporter starts scraping, this warning appears and is repeated every second:
    NVRM: serverControl_ValidateCookie: Unsupported ROUTE_TO_PHYSICAL control 0x20801347 was called on vGPU guest
  6. Stop the exporter container.
  7. The warnings stop.
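
To measure how often the warning from step 5 fires, the kernel log can be run through a small counter. This is a minimal Python sketch, not part of the exporter: the regular expression is modeled only on the single warning line quoted in this issue (other driver versions may word it differently), and `count_unsupported_controls` is a hypothetical helper name.

```python
import re
from collections import Counter

# Pattern based on the one warning line quoted in this issue (assumption:
# the wording is stable for this driver version).
NVRM_WARNING = re.compile(
    r"NVRM: serverControl_ValidateCookie: Unsupported ROUTE_TO_PHYSICAL "
    r"control (0x[0-9a-fA-F]+) was called on vGPU guest"
)


def count_unsupported_controls(lines):
    """Return a Counter mapping control code -> number of matching log lines."""
    counts = Counter()
    for line in lines:
        match = NVRM_WARNING.search(line)
        if match:
            # Normalize the hex code so 0xABC and 0xabc count together.
            counts[match.group(1).lower()] += 1
    return counts
```

For example, the output of journalctl -b -k can be piped into a script that calls `count_unsupported_controls(sys.stdin)` and prints the result.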

Minimal reproducer example:

nvidia-smi --query-gpu=remapped_rows.histogram.max --format=csv,noheader,nounits
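
The same one-field query can be scripted for every histogram field. A minimal Python sketch, assuming nvidia-smi is on PATH; `build_query` and `run_query` are hypothetical helper names, and the command line mirrors the reproducer above.

```python
import subprocess

# The five fields reported in this issue as triggering the warning.
HISTOGRAM_FIELDS = [
    "remapped_rows.histogram.max",
    "remapped_rows.histogram.high",
    "remapped_rows.histogram.partial",
    "remapped_rows.histogram.low",
    "remapped_rows.histogram.none",
]


def build_query(field):
    """Build the argv for a single-field nvidia-smi query."""
    return [
        "nvidia-smi",
        f"--query-gpu={field}",
        "--format=csv,noheader,nounits",
    ]


def run_query(field):
    """Run the query; requires nvidia-smi, so only call this on a GPU host."""
    return subprocess.run(
        build_query(field), capture_output=True, text=True, check=False
    )
```

Calling `run_query(field)` for each entry in `HISTOGRAM_FIELDS` while watching journalctl -b -k -f should show which queries produce the warning.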

Expected behavior
I expected the exporter to avoid querying fields that trigger unsupported physical-GPU control paths in a vGPU guest.

At minimum, it would be helpful if:

  • the exporter filtered these fields out automatically in vGPU guests, or
  • the exporter had the ability to exclude fields manually via --query-field-names-exclude or something similar, or
  • the README/documentation mentioned that remapped_rows.histogram.* is unsafe in vGPU guests and should be excluded via --query-field-names.
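
The first two options could look roughly like the following. This is a Python sketch of the filtering logic only, not the exporter's actual code: a --query-field-names-exclude flag does not exist today, the function names are hypothetical, and `is_vgpu_guest` is passed in by the caller because how to detect a vGPU guest is left open here.

```python
from fnmatch import fnmatchcase

# Patterns known from this issue to trigger the warning in a vGPU guest.
VGPU_UNSAFE_PATTERNS = ["remapped_rows.histogram.*"]


def exclude_fields(fields, exclude_patterns):
    """Drop every field that matches any shell-style exclude pattern."""
    return [
        field
        for field in fields
        if not any(fnmatchcase(field, pat) for pat in exclude_patterns)
    ]


def fields_for_guest(fields, is_vgpu_guest):
    """Apply the vGPU exclusions only when running inside a vGPU guest."""
    if is_vgpu_guest:
        return exclude_fields(fields, VGPU_UNSAFE_PATTERNS)
    return list(fields)
```

Glob patterns keep a manual exclude list short, since one entry covers all five histogram fields while leaving the other remapped_rows.* fields in place.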

Model and Version

  • GPU Model: L40-16Q
  • App version and architecture: utkuozdemir/nvidia_gpu_exporter:1.4.1
  • Installation method: docker compose
  • Operating System: Ubuntu 24.04.4
  • Nvidia GPU driver version: 580.126.09

Additional context
After testing each field on its own, all of these reproduce the warning individually:

remapped_rows.histogram.max
remapped_rows.histogram.high
remapped_rows.histogram.partial
remapped_rows.histogram.low
remapped_rows.histogram.none

and these do not:

remapped_rows.correctable
remapped_rows.uncorrectable
remapped_rows.pending
remapped_rows.failure

So it looks like the histogram breakdown queries are hitting a physical-device-only path that is not supported in a vGPU guest.
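
The split between the two groups can be reproduced mechanically by checking whether the kernel warning count grows after each single-field query. A Python sketch with injected callables, so the logic is testable without a GPU; `classify_fields`, `run_field`, and `count_warnings` are hypothetical names, and in practice `run_field` would invoke nvidia-smi for one field while `count_warnings` would count matching NVRM lines in the kernel log.

```python
def classify_fields(fields, run_field, count_warnings):
    """Split fields into (noisy, quiet) lists.

    run_field(field) queries a single field; count_warnings() returns the
    number of warnings logged so far. A field is noisy if the count rises
    after querying it.
    """
    noisy, quiet = [], []
    for field in fields:
        before = count_warnings()
        run_field(field)
        after = count_warnings()
        if after > before:
            noisy.append(field)
        else:
            quiet.append(field)
    return noisy, quiet
```

This assumes nothing else on the guest is querying the GPU during the run (for example, the exporter is stopped), otherwise unrelated warnings would be attributed to the field under test.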
