Use CDI mode for the AL2023 GPU ECS-optimized AMI #483

@sparrc

CDI (Container Device Interface) is a newer framework for connecting containers with hardware devices, and NVIDIA supports it. We should look into using CDI mode in the AL2023 GPU ECS-optimized AMI.

There have been some reports that running without CDI mode might increase the chance of seeing CUDA_DEVICE_NOT_FOUND errors on the AL2023 AMI, possibly because of the systemd cgroup driver used with cgroups v2.

At a high level this would involve something like:

  1. Remove the existing nvidia and oci-add-hooks setup: https://github.com/aws/amazon-ecs-ami/blob/main/scripts/enable-ecs-agent-gpu-support-al2023.sh#L51-L83
  2. Configure Docker to run the following commands on startup, possibly in Amazon Linux docker's /usr/libexec/docker/docker-setup-runtimes.sh script. The reason to run them there is that the CDI generate step needs to run every time Docker starts: its output depends on the installed NVIDIA version, so it has to run again after an in-place update. The runtime configure command also modifies /etc/docker/daemon.json, so we would want it to run after customer userdata in cloud-init has run.

     nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
     nvidia-ctk config --in-place --set nvidia-container-runtime.mode=cdi
     nvidia-ctk runtime configure --runtime=docker
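
As a rough sketch of how step 2 might be wrapped for a startup script, the three commands could be grouped into one function. The function name, the `run` helper, and the `DRY_RUN` and `CDI_SPEC` variables are hypothetical and are not part of the existing AMI scripts. Dry-run is the default here so the sketch can be tried on a machine without the NVIDIA toolkit installed.

```shell
#!/usr/bin/env bash
# Sketch only: regenerate the NVIDIA CDI spec and switch Docker to CDI mode.
# DRY_RUN and CDI_SPEC are hypothetical knobs, not existing AMI settings.
set -euo pipefail

DRY_RUN="${DRY_RUN:-1}"                      # 1 = print commands, 0 = run them
CDI_SPEC="${CDI_SPEC:-/etc/cdi/nvidia.yaml}" # output path used in the issue

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

configure_nvidia_cdi() {
  # Must be regenerated on every Docker start, since the spec depends on
  # the currently installed NVIDIA version.
  run nvidia-ctk cdi generate --output="$CDI_SPEC"
  # Switch the NVIDIA container runtime to CDI mode.
  run nvidia-ctk config --in-place --set nvidia-container-runtime.mode=cdi
  # Rewrites /etc/docker/daemon.json, so this should run after userdata.
  run nvidia-ctk runtime configure --runtime=docker
}

configure_nvidia_cdi
```

Setting `DRY_RUN=0` would execute the commands for real. Whether docker-setup-runtimes.sh is the right place to call this from is still an open question.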
