1. Quick Debug Information
- OS/Version: Ubuntu 22.04
- Kernel Version: Linux 5.15.0-117-generic
- Container Runtime Type/Version(e.g. Containerd, CRI-O, Docker): containerd://1.7.17-k3s1.28
- K8s Flavor/Version(e.g. K8s, OCP, Rancher, GKE, EKS): k3s v1.28.11+k3s1
- GPU Operator Version: gpu-operator-v24.6.0
2. Issue or feature description
When I install the vGPU Manager, the nvidia-vgpu-manager-daemonset pod exits with the error "/usr/local/bin/nvidia-driver: line 1: popd: directory stack empty". When I check the dmesg log on the node, it shows "Direct firmware load for nvidia/550.54.10/gsp_ga10x.bin failed with error -2". It looks like the GSP firmware is not being loaded properly.
Can somebody help me resolve this error with a step-by-step reference?
3. Information to attach (optional if deemed irrelevant)
- Pod logs: nvidia-vgpu-manager-daemonset
+ DRIVER_VERSION=550.54.10
+ DRIVER_ARCH=x86_64
+ DRIVER_RESET_RETRIES=10
++ uname -r
+ KERNEL_VERSION=5.15.0-117-generic
+ RUN_DIR=/run/nvidia
+ export DEBIAN_FRONTEND=noninteractive
+ DEBIAN_FRONTEND=noninteractive
+ '[' 1 -eq 0 ']'
+ command=init
+ shift
+ case "${command}" in
++ getopt -l accept-license -o a --
+ options=' --'
+ '[' 0 -ne 0 ']'
+ eval set -- ' --'
++ set -- --
+ ACCEPT_LICENSE=
++ uname -r
+ KERNEL_VERSION=5.15.0-117-generic
+ PRIVATE_KEY=
+ PACKAGE_TAG=
+ for opt in ${options}
+ case "$opt" in
+ shift
+ break
+ '[' 0 -ne 0 ']'
+ init
+ trap 'echo '\''Caught signal'\''; exit 1' HUP INT QUIT PIPE TERM
+ trap _shutdown EXIT
+ _unload_driver
+ rmmod_args=()
+ local rmmod_args
+ local nvidia_deps=0
+ local nvidia_refs=0
+ local nvidia_vgpu_vfio_refs=0
+ echo 'Stopping NVIDIA vGPU Manager...'
+ '[' -f /var/run/nvidia-vgpu-mgr/nvidia-vgpu-mgr.pid ']'
+ echo 'Unloading NVIDIA driver kernel modules...'
+ '[' -f /sys/module/nvidia_vgpu_vfio/refcnt ']'
Stopping NVIDIA vGPU Manager...
Unloading NVIDIA driver kernel modules...
+ '[' -f /sys/module/nvidia/refcnt ']'
+ nvidia_refs=0
+ rmmod_args+=("nvidia")
+ '[' 1 -gt 0 ']'
+ rmmod nvidia
+ '[' 0 '!=' 0 ']'
+ return 0
+ _unmount_rootfs
Unmounting NVIDIA driver rootfs...
+ echo 'Unmounting NVIDIA driver rootfs...'
+ findmnt -r -o TARGET
+ grep /run/nvidia/driver
+ umount -l -R /run/nvidia/driver
Updating the package cache...
+ _update_package_cache
+ '[' '' '!=' builtin ']'
+ echo 'Updating the package cache...'
+ apt-get -qq update
+ _resolve_kernel_version
++ apt-cache show linux-headers-5.15.0-117-generic
++ sed -nE 's/^Version:\s+(([0-9]+\.){2}[0-9]+)[-.]([0-9]+).*/\1-\3/p'
++ head -1
+ local version=5.15.0-117
++ echo 5.15.0-117-generic
++ sed 's/[^a-z]*//'
++ grep -Ev '^generic|virtual'
+ local flavor=
+ echo 'Resolving Linux kernel version...'
+ '[' -z 5.15.0-117 ']'
Resolving Linux kernel version...
+ KERNEL_VERSION=5.15.0-117-generic
+ echo 'Proceeding with Linux kernel version 5.15.0-117-generic'
+ return 0
Proceeding with Linux kernel version 5.15.0-117-generic
+ _install_prerequisites
++ mktemp -d
+ local tmp_dir=/tmp/tmp.kG79fBJSB3
+ trap 'popd; rm -rf /tmp/tmp.kG79fBJSB3' RETURN EXIT
+ pushd /tmp/tmp.kG79fBJSB3
/tmp/tmp.kG79fBJSB3 /driver
+ rm -rf /lib/modules/5.15.0-117-generic
+ mkdir -p /lib/modules/5.15.0-117-generic/proc
+ echo 'Installing Linux kernel headers...'
Installing Linux kernel headers...
+ apt-get -qq install --no-install-recommends linux-headers-5.15.0-117-generic
+ echo 'Installing Linux kernel module files...'
+ apt-get -qq download linux-image-5.15.0-117-generic
Installing Linux kernel module files...
+ dpkg -x linux-image-5.15.0-117-generic_5.15.0-117.127_amd64.deb .
+ mv lib/modules/5.15.0-117-generic/modules.builtin lib/modules/5.15.0-117-generic/modules.builtin.modinfo lib/modules/5.15.0-117-generic/modules.order /lib/modules/5.15.0-117-generic
+ mv lib/modules/5.15.0-117-generic/kernel /lib/modules/5.15.0-117-generic
+ depmod 5.15.0-117-generic
+ echo 'Generating Linux kernel version string...'
Generating Linux kernel version string...
+ file boot/vmlinuz-5.15.0-117-generic
+ awk 'BEGIN { RS="," } $1=="version" { print $2 }' -
+ '[' -z 5.15.0-117-generic ']'
+ mv version /lib/modules/5.15.0-117-generic/proc
/driver
++ popd
++ rm -rf /tmp/tmp.kG79fBJSB3
Creating '/dev/char' directory
+ _create_dev_char_directory
+ '[' '!' -d /dev/char ']'
+ echo 'Creating '\''/dev/char'\'' directory'
+ mkdir -p /dev/char
+ _install_driver
++ mktemp -d
+ local tmp_dir=/tmp/tmp.GGwiHUdShK
+ sh NVIDIA-Linux-x86_64-550.54.10-vgpu-kvm.run --ui=none --no-questions --tmpdir /tmp/tmp.GGwiHUdShK --no-systemd
Verifying archive integrity... OK
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 550.54.10......................................................................................................................................................................................................................................................................................................................................................................................................................................................................
Welcome to the NVIDIA Software Installer for Unix/Linux
Detected 128 CPUs online; setting concurrency level to 32.
Unable to locate any tools for listing initramfs contents.
Unable to scan initramfs: no tool found
This system requires use of the NVIDIA open kernel modules; these will be selected by default.
Installing NVIDIA driver version 550.54.10.
Performing CC sanity check with CC="/usr/bin/cc".
Performing CC check.
Kernel source path: '/lib/modules/5.15.0-117-generic/build'
Kernel output path: '/lib/modules/5.15.0-117-generic/build'
Performing Compiler check.
Performing Dom0 check.
Performing Xen check.
Performing PREEMPT_RT check.
Performing vgpu_kvm check.
Cleaning kernel module build directory.
Building kernel modules:
[##############################] 100%
Kernel module compilation complete.
Kernel messages:
[ 3941.692907] nvidia 0000:03:00.0: driver left SR-IOV enabled after remove
[ 3941.693251] nvidia 0000:64:00.0: driver left SR-IOV enabled after remove
[ 3941.693477] nvidia 0000:63:00.0: driver left SR-IOV enabled after remove
[ 3941.693818] nvidia 0000:e4:00.0: driver left SR-IOV enabled after remove
[ 3941.694212] nvidia 0000:e3:00.0: driver left SR-IOV enabled after remove
[ 3941.694639] NVOC: __nvoc_objDelete: Child class OBJIOVASPACE not freed from parent class OBJVMM.
[ 3941.694790] nvidia-nvlink: Unregistered Nvlink Core, major device number 499
[ 3989.137665] nvidia-nvlink: Nvlink Core is being initialized, major device number 499
[ 3989.137675] NVRM: The NVIDIA probe routine was not called for 256 device(s).
[ 3989.570567] NVRM: This can occur when another driver was loaded and
NVRM: obtained ownership of the NVIDIA device(s).
[ 3989.570570] NVRM: Try unloading the conflicting kernel module (and/or
NVRM: reconfigure your kernel without the conflicting
NVRM: driver(s)), then try loading the NVIDIA kernel module
NVRM: again.
[ 3989.570590] NVRM: loading NVIDIA UNIX Open Kernel Module for x86_64 550.54.10 Release Build (dvs-builder@U16-I3-B13-2-1) Wed Feb 14 16:21:59 UTC 2024
[ 3989.716774] nvidia 0000:84:00.0: driver left SR-IOV enabled after remove
[ 3989.717546] nvidia 0000:83:00.0: driver left SR-IOV enabled after remove
[ 3989.718001] nvidia 0000:04:00.0: driver left SR-IOV enabled after remove
[ 3989.718288] nvidia 0000:03:00.0: driver left SR-IOV enabled after remove
[ 3989.718588] nvidia 0000:64:00.0: driver left SR-IOV enabled after remove
[ 3989.719319] nvidia 0000:63:00.0: driver left SR-IOV enabled after remove
[ 3989.719774] nvidia 0000:e4:00.0: driver left SR-IOV enabled after remove
[ 3989.720154] nvidia 0000:e3:00.0: driver left SR-IOV enabled after remove
[ 3989.720839] nvidia-nvlink: Unregistered Nvlink Core, major device number 499
Searching for conflicting files:: Searching
[##############################] 100%
Installing 'NVIDIA Accelerated Graphics Driver for Linux-x86_64' (550.54.10):: Installing
[# ] 0%
Unable to determine whether NVIDIA kernel modules are present in the initramfs. Existing NVIDIA kernel modules in the initramfs, if any, may interfere with the newly installed driver.
[##############################] 100%
Driver file installation is complete.
Running distribution scripts: Executing /usr/lib/nvidia/post-install
[##############################] 100%
Running post-install sanity check:: Checking
[##############################] 100%
Post-install sanity check passed.
Installation of the NVIDIA Accelerated Graphics Driver for Linux-x86_64 (version: 550.54.10) is now complete.
+ _load_driver
+ /usr/bin/nvidia-vgpud
+ '[' '!' -f /sys/module/nvidia_vgpu_vfio/refcnt ']'
+ /usr/bin/nvidia-vgpu-mgr
+ '[' '!' -f /sys/module/nvidia/refcnt ']'
+ return 0
+ _mount_rootfs
+ echo 'Mounting NVIDIA driver rootfs...'
+ mount -o remount,rw /sys
Mounting NVIDIA driver rootfs...
+ mount --make-runbindable /sys
+ mount --make-private /sys
+ mkdir -p /run/nvidia/driver
+ mount --rbind / /run/nvidia/driver
+ _enable_vfs
+ local retry
+ (( retry = 0 ))
+ (( retry <= 10 ))
+ /usr/lib/nvidia/sriov-manage -e ALL
GPU at 0000:03:00.0 already has VFs enabled.
GPU at 0000:04:00.0 already has VFs enabled.
GPU at 0000:63:00.0 already has VFs enabled.
GPU at 0000:64:00.0 already has VFs enabled.
GPU at 0000:83:00.0 already has VFs enabled.
GPU at 0000:84:00.0 already has VFs enabled.
GPU at 0000:e3:00.0 already has VFs enabled.
GPU at 0000:e4:00.0 already has VFs enabled.
+ return 0
+ pgrep nvidia-vgpu-mgr
+ nvidia-vgpud
+ echo 'Restarting nvidia-vgpu-mgr after previously killed'
+ nvidia-vgpu-mgr
Restarting nvidia-vgpu-mgr after previously killed
+ set +x
Done, now waiting for signal
ERROR: nvidia-vgpu-mgr daemon is no longer running. Exiting.
/usr/local/bin/nvidia-driver: line 1: popd: directory stack empty
- Dmesg log
Direct firmware load for nvidia/550.54.10/gsp_ga10x.bin failed with error -2
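Error -2 is ENOENT, so the kernel could not find that file in any directory it searched. As a rough way to confirm this, the sketch below looks for the file in the kernel's default firmware directories and in the optional custom directory set through /sys/module/firmware_class/parameters/path. The helper name find_firmware and its second argument (a root prefix, used only for testing) are made up for illustration and are not part of any NVIDIA tooling. It assumes it is run on the node itself rather than inside the pod, and it does not look for compressed variants such as .xz or .zst.

```shell
#!/usr/bin/env bash
# Sketch: search the kernel's firmware directories for one firmware file.
# find_firmware is a hypothetical helper, not part of the GPU Operator.
# Usage: find_firmware <relative-path> [root-prefix]
find_firmware() {
    local rel="$1" root="${2:-}" kver dir custom=""
    local param="${root}/sys/module/firmware_class/parameters/path"
    kver="$(uname -r)"

    # Optional custom search directory; it is empty unless someone set it.
    if [ -r "$param" ]; then
        custom="$(cat "$param")"
    fi

    # Custom directory first, then the kernel's built-in defaults.
    for dir in "$custom" "/lib/firmware/updates/${kver}" /lib/firmware/updates \
               "/lib/firmware/${kver}" /lib/firmware; do
        [ -n "$dir" ] || continue
        if [ -f "${root}${dir}/${rel}" ]; then
            printf '%s\n' "${root}${dir}/${rel}"
            return 0
        fi
    done
    return 1
}

# The file named in the dmesg line above.
if path="$(find_firmware nvidia/550.54.10/gsp_ga10x.bin)"; then
    echo "found: ${path}"
else
    echo "missing: nvidia/550.54.10/gsp_ga10x.bin"
fi
```

If this prints "missing", the firmware file is not in any directory the kernel checks on the node, which would be consistent with the dmesg line. It does not show why the file is absent.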

- Kernel Version: 5.15.0-117-generic

- GSP firmware version check (every GPU reports "GPU Firmware: N/A")
for gpu in /proc/driver/nvidia/gpus/*/information; do
    echo "File: $gpu"
    cat "$gpu"
    echo "-----------------------------"
done
File: /proc/driver/nvidia/gpus/0000:03:00.0/information
Model: NVIDIA L40S
IRQ: 94
GPU UUID: GPU-d03ff6db-34c7-dc00-484c-3adc1cc61b03
Video BIOS: ??.??.??.??.??
Bus Type: PCIe
DMA Size: 47 bits
DMA Mask: 0x7fffffffffff
Bus Location: 0000:03:00.0
Device Minor: 4
GPU Firmware: N/A
GPU Excluded: No
-----------------------------
File: /proc/driver/nvidia/gpus/0000:04:00.0/information
Model: NVIDIA L40S
IRQ: 58
GPU UUID: GPU-0418f843-80fe-7d93-cb41-72ecf0a117de
Video BIOS: ??.??.??.??.??
Bus Type: PCIe
DMA Size: 47 bits
DMA Mask: 0x7fffffffffff
Bus Location: 0000:04:00.0
Device Minor: 5
GPU Firmware: N/A
GPU Excluded: No
-----------------------------
File: /proc/driver/nvidia/gpus/0000:63:00.0/information
Model: NVIDIA L40S
IRQ: 91
GPU UUID: GPU-6c383654-2e10-7167-a1e0-fb8e8ba4b7bc
Video BIOS: ??.??.??.??.??
Bus Type: PCIe
DMA Size: 47 bits
DMA Mask: 0x7fffffffffff
Bus Location: 0000:63:00.0
Device Minor: 2
GPU Firmware: N/A
GPU Excluded: No
-----------------------------
File: /proc/driver/nvidia/gpus/0000:64:00.0/information
Model: NVIDIA L40S
IRQ: 51
GPU UUID: GPU-6d0940a2-6511-6aa0-2255-ce36f96b530b
Video BIOS: ??.??.??.??.??
Bus Type: PCIe
DMA Size: 47 bits
DMA Mask: 0x7fffffffffff
Bus Location: 0000:64:00.0
Device Minor: 3
GPU Firmware: N/A
GPU Excluded: No
-----------------------------
File: /proc/driver/nvidia/gpus/0000:83:00.0/information
Model: NVIDIA L40S
IRQ: 890
GPU UUID: GPU-9a36f396-473f-b5b7-ba8c-b6e6c2cfd93e
Video BIOS: ??.??.??.??.??
Bus Type: PCIe
DMA Size: 47 bits
DMA Mask: 0x7fffffffffff
Bus Location: 0000:83:00.0
Device Minor: 6
GPU Firmware: N/A
GPU Excluded: No
-----------------------------
File: /proc/driver/nvidia/gpus/0000:84:00.0/information
Model: NVIDIA L40S
IRQ: 70
GPU UUID: GPU-f6916d2a-2c75-c840-5106-af1e5b80f25c
Video BIOS: ??.??.??.??.??
Bus Type: PCIe
DMA Size: 47 bits
DMA Mask: 0x7fffffffffff
Bus Location: 0000:84:00.0
Device Minor: 7
GPU Firmware: N/A
GPU Excluded: No
-----------------------------
File: /proc/driver/nvidia/gpus/0000:e3:00.0/information
Model: NVIDIA L40S
IRQ: 889
GPU UUID: GPU-71d50d8a-31d7-2028-4bca-e728fe84441c
Video BIOS: ??.??.??.??.??
Bus Type: PCIe
DMA Size: 47 bits
DMA Mask: 0x7fffffffffff
Bus Location: 0000:e3:00.0
Device Minor: 0
GPU Firmware: N/A
GPU Excluded: No
-----------------------------
File: /proc/driver/nvidia/gpus/0000:e4:00.0/information
Model: NVIDIA L40S
IRQ: 44
GPU UUID: GPU-e580ef25-9fe7-74f7-33c9-03bfa563ebb2
Video BIOS: ??.??.??.??.??
Bus Type: PCIe
DMA Size: 47 bits
DMA Mask: 0x7fffffffffff
Bus Location: 0000:e4:00.0
Device Minor: 1
GPU Firmware: N/A
GPU Excluded: No
-----------------------------