nvidia-validator failed in Talos #1687

@hzhangxyz

Describe the bug

The nvidia-operator-validator-xxxxx pod stays in Init status because its driver-validation init container loops forever while running the nvidia-validator command:

INFO[0000] version: 616690d8-amd64, commit: 616690d
INFO[0000] Attempting to validate a pre-installed driver on the host
INFO[0000] Attempting to validate a driver container installation
WARN[0000] failed to validate the driver, retrying after 5 seconds
INFO[0005] Attempting to validate a driver container installation
WARN[0005] failed to validate the driver, retrying after 5 seconds

To Reproduce

  1. Install Talos with factory.talos.dev/metal-installer/cecf9ff004dbd06f21360a148bb5b4e231ca87f18d55eafc1191038ad50b4255:v1.11.0
  2. Install the GPU Operator

Expected behavior

nvidia-validator should return successfully, but it keeps looping.

Environment (please provide the following information):

  • GPU Operator Version: v25.3.2
  • OS: factory.talos.dev/metal-installer/cecf9ff004dbd06f21360a148bb5b4e231ca87f18d55eafc1191038ad50b4255:v1.11.0
  • Kernel Version: 6.12.45-talos
  • Container Runtime Version: containerd://2.1.4
  • Kubernetes Distro and Version: v1.33.0

Information to attach

# kubectl --namespace gpu-operator logs pods/nvidia-operator-validator-7hcvp --container driver-validation | head
time="2025-09-10T03:06:07Z" level=info msg="version: 616690d8-amd64, commit: 616690d"
time="2025-09-10T03:06:07Z" level=info msg="Attempting to validate a pre-installed driver on the host"
time="2025-09-10T03:06:07Z" level=info msg="Attempting to validate a driver container installation"
time="2025-09-10T03:06:07Z" level=warning msg="failed to validate the driver, retrying after 5 seconds\n"
time="2025-09-10T03:06:12Z" level=info msg="Attempting to validate a driver container installation"
time="2025-09-10T03:06:12Z" level=warning msg="failed to validate the driver, retrying after 5 seconds\n"
time="2025-09-10T03:06:17Z" level=info msg="Attempting to validate a driver container installation"
time="2025-09-10T03:06:17Z" level=warning msg="failed to validate the driver, retrying after 5 seconds\n"
time="2025-09-10T03:06:22Z" level=info msg="Attempting to validate a driver container installation"
time="2025-09-10T03:06:22Z" level=warning msg="failed to validate the driver, retrying after 5 seconds\n"

I also ran strace on it inside the pod. The key output of strace -o log -ff nvidia-validator is:

write(2, "\33[36mINFO\33[0m[0000] Attempting t"..., 79) = 79
newfstatat(AT_FDCWD, "/host/usr/lib/wsl/lib/nvidia-smi", 0xc00001a858, AT_SYMLINK_NOFOLLOW) = -1 ENOENT (No such file or directory)
newfstatat(AT_FDCWD, "/host/usr/bin/nvidia-smi", 0xc00001a928, AT_SYMLINK_NOFOLLOW) = -1 ENOENT (No such file or directory)

and

write(2, "\33[36mINFO\33[0m[0000] Attempting t"..., 76) = 76
newfstatat(AT_FDCWD, "/usr", {st_mode=S_IFDIR|0755, st_size=41, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/usr/local", {st_mode=S_IFDIR|0755, st_size=28, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/usr/local/libnvidia-ml.so.1", 0xc00001b218, AT_SYMLINK_NOFOLLOW) = -1 ENOENT (No such file or directory)
newfstatat(AT_FDCWD, "/usr", {st_mode=S_IFDIR|0755, st_size=41, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/usr/local", {st_mode=S_IFDIR|0755, st_size=28, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/usr/local/usr", 0xc00001b488, AT_SYMLINK_NOFOLLOW) = -1 ENOENT (No such file or directory)
newfstatat(AT_FDCWD, "/usr", {st_mode=S_IFDIR|0755, st_size=41, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/usr/local", {st_mode=S_IFDIR|0755, st_size=28, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/usr/local/usr", 0xc00001b6f8, AT_SYMLINK_NOFOLLOW) = -1 ENOENT (No such file or directory)
newfstatat(AT_FDCWD, "/usr", {st_mode=S_IFDIR|0755, st_size=41, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/usr/local", {st_mode=S_IFDIR|0755, st_size=28, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/usr/local/usr", 0xc00001b968, AT_SYMLINK_NOFOLLOW) = -1 ENOENT (No such file or directory)
newfstatat(AT_FDCWD, "/usr", {st_mode=S_IFDIR|0755, st_size=41, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/usr/local", {st_mode=S_IFDIR|0755, st_size=28, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/usr/local/lib64", 0xc00001bbd8, AT_SYMLINK_NOFOLLOW) = -1 ENOENT (No such file or directory)
newfstatat(AT_FDCWD, "/usr", {st_mode=S_IFDIR|0755, st_size=41, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/usr/local", {st_mode=S_IFDIR|0755, st_size=28, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/usr/local/lib", {st_mode=S_IFDIR|0755, st_size=71, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/usr/local/lib/x86_64-linux-gnu", 0xc000240378, AT_SYMLINK_NOFOLLOW) = -1 ENOENT (No such file or directory)
newfstatat(AT_FDCWD, "/usr", {st_mode=S_IFDIR|0755, st_size=41, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/usr/local", {st_mode=S_IFDIR|0755, st_size=28, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/usr/local/lib", {st_mode=S_IFDIR|0755, st_size=71, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/usr/local/lib/aarch64-linux-gnu", 0xc0006562a8, AT_SYMLINK_NOFOLLOW) = -1 ENOENT (No such file or directory)
write(2, "\33[33mWARN\33[0m[0000] failed to va"..., 77) = 77

The environment of the driver-validation container in this pod is set as follows:

env:
    - name: WITH_WAIT
      value: "true"
    - name: COMPONENT
      value: driver
    - name: OPERATOR_NAMESPACE
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.namespace
    - name: DRIVER_INSTALL_DIR
      value: /usr/local
    - name: DRIVER_INSTALL_DIR_CTR_PATH
      value: /usr/local

The following output may also be helpful:

# find /usr/local -name nvidia-smi
/usr/local/bin/nvidia-smi
# find /usr/local -name 'libnvidia-ml.so*'
/usr/local/glibc/usr/lib/libnvidia-ml.so
/usr/local/glibc/usr/lib/libnvidia-ml.so.1
/usr/local/glibc/usr/lib/libnvidia-ml.so.570.172.08
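To make the mismatch between the strace output and the find output easier to see, here is a minimal Python sketch. It is not the nvidia-validator code. The list ASSUMED_REL_DIRS is only inferred from the newfstatat calls in the trace above (the repeated lookups of /usr/local/usr are assumed to be the first failing component of usr/lib64 and the two usr/lib/<arch> directories), and the helper names probe, locate, and demo are hypothetical. The sketch rebuilds the reported layout in a temporary directory and compares a fixed-path probe against a recursive search.

```python
import os
import tempfile

# Directories relative to DRIVER_INSTALL_DIR that the strace output suggests
# are probed. ASSUMPTION: this list is inferred from the trace, not taken
# from the nvidia-validator source.
ASSUMED_REL_DIRS = [
    "",
    "usr/lib64",
    "usr/lib/x86_64-linux-gnu",
    "usr/lib/aarch64-linux-gnu",
    "lib64",
    "lib/x86_64-linux-gnu",
    "lib/aarch64-linux-gnu",
]


def probe(root, lib_name, rel_dirs=ASSUMED_REL_DIRS):
    """Return the candidate paths under root where lib_name exists."""
    candidates = [os.path.join(root, rel, lib_name) for rel in rel_dirs]
    return [path for path in candidates if os.path.lexists(path)]


def locate(root, lib_name):
    """Walk root and return every path named lib_name, like `find -name`."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        if lib_name in filenames:
            hits.append(os.path.join(dirpath, lib_name))
    return sorted(hits)


def demo():
    """Rebuild the reported layout in a temp dir and compare both lookups."""
    with tempfile.TemporaryDirectory() as root:
        # Mirror /usr/local/glibc/usr/lib/libnvidia-ml.so.1 from the find output.
        lib_dir = os.path.join(root, "glibc", "usr", "lib")
        os.makedirs(lib_dir)
        open(os.path.join(lib_dir, "libnvidia-ml.so.1"), "w").close()
        probed = probe(root, "libnvidia-ml.so.1")
        found = [os.path.relpath(p, root) for p in locate(root, "libnvidia-ml.so.1")]
        return probed, found


if __name__ == "__main__":
    probed, found = demo()
    print("probe hits:", probed)
    print("actual locations:", found)
```

Under these assumptions the fixed-path probe finds nothing, while the recursive search locates the library under glibc/usr/lib. That matches what the trace and the find output show together: with DRIVER_INSTALL_DIR set to /usr/local, none of the probed directories contains libnvidia-ml.so.1.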

Also see: #1276 #1050
