Describe the bug
The nvidia-operator-validator-xxxxx pod stays in Init status because its driver-validation init container loops forever. The container's command, nvidia-validator, logs:
INFO[0000] version: 616690d8-amd64, commit: 616690d
INFO[0000] Attempting to validate a pre-installed driver on the host
INFO[0000] Attempting to validate a driver container installation
WARN[0000] failed to validate the driver, retrying after 5 seconds
INFO[0005] Attempting to validate a driver container installation
WARN[0005] failed to validate the driver, retrying after 5 seconds
To Reproduce
- Install Talos with factory.talos.dev/metal-installer/cecf9ff004dbd06f21360a148bb5b4e231ca87f18d55eafc1191038ad50b4255:v1.11.0
- Install the GPU Operator
Expected behavior
nvidia-validator should return successfully, but it keeps looping.
Environment (please provide the following information):
- GPU Operator Version: v25.3.2
- OS: factory.talos.dev/metal-installer/cecf9ff004dbd06f21360a148bb5b4e231ca87f18d55eafc1191038ad50b4255:v1.11.0
- Kernel Version: 6.12.45-talos
- Container Runtime Version: containerd://2.1.4
- Kubernetes Distro and Version: v1.33.0
Information to attach
# kubectl --namespace gpu-operator logs pods/nvidia-operator-validator-7hcvp --container driver-validation | head
time="2025-09-10T03:06:07Z" level=info msg="version: 616690d8-amd64, commit: 616690d"
time="2025-09-10T03:06:07Z" level=info msg="Attempting to validate a pre-installed driver on the host"
time="2025-09-10T03:06:07Z" level=info msg="Attempting to validate a driver container installation"
time="2025-09-10T03:06:07Z" level=warning msg="failed to validate the driver, retrying after 5 seconds\n"
time="2025-09-10T03:06:12Z" level=info msg="Attempting to validate a driver container installation"
time="2025-09-10T03:06:12Z" level=warning msg="failed to validate the driver, retrying after 5 seconds\n"
time="2025-09-10T03:06:17Z" level=info msg="Attempting to validate a driver container installation"
time="2025-09-10T03:06:17Z" level=warning msg="failed to validate the driver, retrying after 5 seconds\n"
time="2025-09-10T03:06:22Z" level=info msg="Attempting to validate a driver container installation"
time="2025-09-10T03:06:22Z" level=warning msg="failed to validate the driver, retrying after 5 seconds\n"
I also ran strace on it inside the pod. The key output of strace -o log -ff nvidia-validator is:
write(2, "\33[36mINFO\33[0m[0000] Attempting t"..., 79) = 79
newfstatat(AT_FDCWD, "/host/usr/lib/wsl/lib/nvidia-smi", 0xc00001a858, AT_SYMLINK_NOFOLLOW) = -1 ENOENT (No such file or directory)
newfstatat(AT_FDCWD, "/host/usr/bin/nvidia-smi", 0xc00001a928, AT_SYMLINK_NOFOLLOW) = -1 ENOENT (No such file or directory)
and
write(2, "\33[36mINFO\33[0m[0000] Attempting t"..., 76) = 76
newfstatat(AT_FDCWD, "/usr", {st_mode=S_IFDIR|0755, st_size=41, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/usr/local", {st_mode=S_IFDIR|0755, st_size=28, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/usr/local/libnvidia-ml.so.1", 0xc00001b218, AT_SYMLINK_NOFOLLOW) = -1 ENOENT (No such file or directory)
newfstatat(AT_FDCWD, "/usr", {st_mode=S_IFDIR|0755, st_size=41, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/usr/local", {st_mode=S_IFDIR|0755, st_size=28, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/usr/local/usr", 0xc00001b488, AT_SYMLINK_NOFOLLOW) = -1 ENOENT (No such file or directory)
newfstatat(AT_FDCWD, "/usr", {st_mode=S_IFDIR|0755, st_size=41, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/usr/local", {st_mode=S_IFDIR|0755, st_size=28, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/usr/local/usr", 0xc00001b6f8, AT_SYMLINK_NOFOLLOW) = -1 ENOENT (No such file or directory)
newfstatat(AT_FDCWD, "/usr", {st_mode=S_IFDIR|0755, st_size=41, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/usr/local", {st_mode=S_IFDIR|0755, st_size=28, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/usr/local/usr", 0xc00001b968, AT_SYMLINK_NOFOLLOW) = -1 ENOENT (No such file or directory)
newfstatat(AT_FDCWD, "/usr", {st_mode=S_IFDIR|0755, st_size=41, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/usr/local", {st_mode=S_IFDIR|0755, st_size=28, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/usr/local/lib64", 0xc00001bbd8, AT_SYMLINK_NOFOLLOW) = -1 ENOENT (No such file or directory)
newfstatat(AT_FDCWD, "/usr", {st_mode=S_IFDIR|0755, st_size=41, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/usr/local", {st_mode=S_IFDIR|0755, st_size=28, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/usr/local/lib", {st_mode=S_IFDIR|0755, st_size=71, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/usr/local/lib/x86_64-linux-gnu", 0xc000240378, AT_SYMLINK_NOFOLLOW) = -1 ENOENT (No such file or directory)
newfstatat(AT_FDCWD, "/usr", {st_mode=S_IFDIR|0755, st_size=41, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/usr/local", {st_mode=S_IFDIR|0755, st_size=28, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/usr/local/lib", {st_mode=S_IFDIR|0755, st_size=71, ...}, AT_SYMLINK_NOFOLLOW) = 0
newfstatat(AT_FDCWD, "/usr/local/lib/aarch64-linux-gnu", 0xc0006562a8, AT_SYMLINK_NOFOLLOW) = -1 ENOENT (No such file or directory)
write(2, "\33[33mWARN\33[0m[0000] failed to va"..., 77) = 77
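The lookups in the trace can be reproduced outside the validator to check a driver root by hand. The following is a minimal sketch, not the validator's actual implementation: the function name probe_driver_root and the constant TRACED_SUBDIRS are hypothetical, and the directory list contains only the locations visible in the strace output above. The three probes under "usr" stop at the missing /usr/local/usr directory, so their full sub-paths are unknown and are left out.

```python
import os

# Directories, relative to the driver root, whose lookups are visible in the
# strace output. This list is an assumption taken from the trace, not from
# the validator source. "" means the driver root itself.
TRACED_SUBDIRS = ["", "lib64", "lib/x86_64-linux-gnu", "lib/aarch64-linux-gnu"]


def probe_driver_root(root, library="libnvidia-ml.so.1", subdirs=TRACED_SUBDIRS):
    """Return the candidate paths under root at which the library exists."""
    hits = []
    for sub in subdirs:
        candidate = os.path.join(root, sub, library)
        # lexists does not follow symlinks, similar to the
        # AT_SYMLINK_NOFOLLOW flag seen in the trace.
        if os.path.lexists(candidate):
            hits.append(candidate)
    return hits
```

Calling probe_driver_root("/usr/local") inside the container would return an empty list whenever the library is absent from all of the traced directories, which is consistent with the ENOENT results above.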
The environment of the driver-validation container in this pod is set as follows:
env:
- name: WITH_WAIT
  value: "true"
- name: COMPONENT
  value: driver
- name: OPERATOR_NAMESPACE
  valueFrom:
    fieldRef:
      apiVersion: v1
      fieldPath: metadata.namespace
- name: DRIVER_INSTALL_DIR
  value: /usr/local
- name: DRIVER_INSTALL_DIR_CTR_PATH
  value: /usr/local
This information may also be helpful:
# find /usr/local -name nvidia-smi
/usr/local/bin/nvidia-smi
# find /usr/local -name 'libnvidia-ml.so*'
/usr/local/glibc/usr/lib/libnvidia-ml.so
/usr/local/glibc/usr/lib/libnvidia-ml.so.1
/usr/local/glibc/usr/lib/libnvidia-ml.so.570.172.08
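The find output places the library under glibc/usr/lib relative to /usr/local, which is not one of the directories that appear in the strace output. To run the same check on another node, a small helper can report which directories under a driver root contain the library. This is only a sketch that mirrors the find command; the function name locate_library is hypothetical.

```python
import os


def locate_library(root, prefix="libnvidia-ml.so"):
    """Return sorted directories, relative to root, that contain a file
    whose name starts with prefix."""
    dirs = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        # os.walk lists symlinks to files in filenames, so versioned
        # symlinks such as libnvidia-ml.so.1 are matched as well.
        if any(name.startswith(prefix) for name in filenames):
            dirs.add(os.path.relpath(dirpath, root))
    return sorted(dirs)
```

Comparing the returned directories with the paths the validator probes shows at a glance whether the library sits somewhere the validator does not look.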