Description
CDI (Container Device Interface) mode is a newer mechanism for exposing hardware devices to containers, and it is supported by NVIDIA. We should look into using it in the AL2023 GPU ECS-optimized AMI.
There have been some reports that not using CDI mode might increase the chance of seeing CUDA_DEVICE_NOT_FOUND errors on the AL2023 AMI, possibly because of the systemd cgroup driver used with cgroups v2.
At a high level this would involve something like:
- Get rid of existing nvidia and oci-add-hooks setup: https://github.com/aws/amazon-ecs-ami/blob/main/scripts/enable-ecs-agent-gpu-support-al2023.sh#L51-L83
- Configure docker to run the following commands on startup, possibly in Amazon Linux docker's /usr/libexec/docker/docker-setup-runtimes.sh script. The CDI generate step needs to run every time docker starts, because its output depends on the installed NVIDIA version (so it needs to run again after an in-place update). The runtime configure command also modifies /etc/docker/daemon.json, so we would want to run it after customer userdata in cloud-init runs.
nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
nvidia-ctk config --in-place --set nvidia-container-runtime.mode=cdi
nvidia-ctk runtime configure --runtime=docker
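As a rough sketch of how the three commands above might be wrapped for a startup script, something like the following could work. The function names (`run`, `configure_nvidia_cdi`), the `DRY_RUN` switch, and the `CDI_SPEC` variable are hypothetical and only for illustration; how this would actually hook into docker-setup-runtimes.sh is not decided here. Only the `nvidia-ctk` invocations themselves come from the list above.

```shell
#!/bin/sh
# Hypothetical wrapper for the CDI setup steps; names are illustrative only.
set -eu

# Output path for the generated CDI spec (same path as in the commands above).
CDI_SPEC="${CDI_SPEC:-/etc/cdi/nvidia.yaml}"

# Print the command instead of executing it when DRY_RUN=1 (assumed debugging aid).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

configure_nvidia_cdi() {
    # Regenerate the CDI spec on every start, since it depends on the
    # installed NVIDIA version and may be stale after an in-place update.
    run nvidia-ctk cdi generate --output="$CDI_SPEC"
    # Switch the NVIDIA container runtime to CDI mode explicitly.
    run nvidia-ctk config --in-place --set nvidia-container-runtime.mode=cdi
    # Register the runtime with docker; this modifies /etc/docker/daemon.json.
    run nvidia-ctk runtime configure --runtime=docker
}

# Do nothing on instances without the NVIDIA toolkit, unless this is a dry run.
if [ "${DRY_RUN:-0}" = "1" ] || command -v nvidia-ctk >/dev/null 2>&1; then
    configure_nvidia_cdi
fi
```

Running it with `DRY_RUN=1` prints the three commands without touching the system, which may help when checking ordering against cloud-init userdata.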
References:
- CDI Mode (we would probably want to 'set it explicitly'): https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/cdi-support.html
- Known issue resolved by CDI mode: "docker: Containers losing access to GPUs with error: 'Failed to initialize NVML: Unknown Error'", NVIDIA/nvidia-container-toolkit#857 (comment)