How to Resolve "Failed to initialize NVML: Unknown Error" in Docker with NVIDIA GPUs
When running Docker containers that utilize NVIDIA GPUs, you may encounter the following error:
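```
Failed to initialize NVML: Unknown Error
```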
This error indicates that the NVIDIA Management Library (NVML) failed to initialize inside your Docker container. This guide explains the cause of the error and provides step-by-step solutions to resolve it.
Understanding the Cause
The error occurs when the host system performs a `daemon-reload` or a similar operation that reloads system daemons. If Docker uses `systemd` to manage cgroups (control groups), reloading daemons on the host triggers a reload of any unit files that reference NVIDIA GPUs. The container then loses access to its GPU device references, leading to the NVML initialization error.
Verifying the Issue
To confirm that this is the issue affecting your container:
- Ensure GPU Access in Container
First, verify that your container currently has access to the GPU:
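```bash
# Run inside the container
nvidia-smi
```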
This command should display the GPU information if everything is working correctly.
- Reload Daemons on Host
From the host system (outside the container), run:
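```bash
sudo systemctl daemon-reload
```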
- Check GPU Access Again
Return to your container and run:
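```bash
nvidia-smi
```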
  - If you now see the error `Failed to initialize NVML: Unknown Error`, it confirms that the issue is caused by the host's `daemon-reload` affecting the container's GPU access.
  - If the GPU information is still displayed correctly, your issue may have a different cause.
Solution 1: Switch Docker's cgroup Driver to cgroupfs
Switching Docker's cgroup driver from `systemd` to `cgroupfs` prevents containers from losing GPU access when the host reloads its daemons. This solution is suitable if your setup does not depend on the systemd cgroup driver for resource management.
Steps:
- Edit Docker Daemon Configuration
Open the Docker daemon configuration file with your preferred text editor:
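```bash
# nano is shown to match the Ctrl+O/Ctrl+X steps below; any editor works
sudo nano /etc/docker/daemon.json
```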
- Add cgroup Driver Configuration
Add the following parameter to the JSON configuration:
```json
{
  "exec-opts": ["native.cgroupdriver=cgroupfs"],
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "args": []
    }
  }
}
```
  - Ensure proper JSON syntax by including commas where necessary.
  - If the `daemon.json` file already contains other configuration, insert the `"exec-opts"` entry appropriately.
- Save and Close the File
  - Press `Ctrl+O` to save the file.
  - Press `Ctrl+X` to exit the editor.
- Restart Docker Service
Apply the changes by restarting the Docker service:
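```bash
sudo systemctl restart docker
```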
- Test the Configuration
Run a test Docker container to verify GPU access:
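For example (the CUDA image tag below is only an illustration; use any CUDA base image available to you):

```bash
# Runs nvidia-smi in a throwaway GPU container; the image tag is illustrative
sudo docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
```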
- You should see the GPU information without any errors.
Note:
- Impact on resource management: Switching the cgroup driver may affect how resources are allocated and managed in your containers. Ensure that this change does not negatively impact your applications.
Solution 2: Modify NVIDIA Container Runtime Configuration
If you prefer not to change Docker's cgroup driver globally, you can adjust the NVIDIA container runtime configuration instead.
Steps:
- Edit NVIDIA Container Runtime Configuration
Open the configuration file:
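```bash
# vim is shown to match the save-and-exit steps below; any editor works
sudo vim /etc/nvidia-container-runtime/config.toml
```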
- Set `no-cgroups` to `false`
Find the line that reads `no-cgroups = true` and change it to:
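```toml
no-cgroups = false
```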
- Save and Close the File
  - In `vim`, press `Esc`, type `:wq`, and press `Enter` to save and exit.
- Restart Docker Service
Restart the Docker daemon to apply changes:
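```bash
sudo systemctl restart docker
```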
- Test the Configuration
Run the test command:
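For example, using the same illustrative image as in Solution 1:

```bash
sudo docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
```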
- The GPU information should display correctly without the NVML error.
Note:
- This method keeps cgroups enabled but adjusts how the NVIDIA runtime interacts with them.
Additional Notes
- Awaiting Official Fix: NVIDIA has acknowledged this issue and plans to release a formal fix. Check the NVIDIA/nvidia-docker GitHub issue #1730 for updates.
- Containers with Own NVIDIA Drivers: If your container includes its own NVIDIA driver installation, these solutions may not resolve the issue. Refer to the GitHub issue linked above for alternative approaches.
- Impact on systemd and cgroups: Be cautious when modifying cgroup settings, as doing so may affect other services or containers that rely on them.
Conclusion
The "Failed to initialize NVML: Unknown Error" in Docker containers using NVIDIA GPUs is commonly caused by the host system reloading daemons, which affects GPU references in containers. By either disabling cgroups in Docker or adjusting the NVIDIA container runtime configuration, you can resolve this issue.
Choose the solution that best fits your environment and testing requirements. Always ensure that changes do not adversely affect other services or container functionalities.
References:
- NVIDIA/nvidia-docker GitHub issue #1730: https://github.com/NVIDIA/nvidia-docker/issues/1730