This is basically a guided demo for getting started with RHAIIS on AWS with an NVIDIA GPU; take it as an ITLTRTD demo script... By the way, if you're not that lazy, here is the documentation of the product, where you can find all the steps necessary for getting started with RHAIIS on NVIDIA and AMD.
This guide will help you deploy vLLM (the inference engine behind RHAIIS) and run an LLM such as Llama, Granite, etc. You can expose it to the public internet and use a chatbot UI (like WebUI or AnythingLLM), or keep it local and test it from AWS itself; it's up to you.
[NOTE] You are going to provision a medium-small GPU instance for this demo, so please take into account that you need an AWS account with enough permissions.
You have to provision a g5.4xlarge instance, which costs roughly $1.7/h on demand (check the AWS On-Demand Pricing page for the real, up-to-date value).
Now let's provision that VM (a CLI alternative is sketched right after these steps):
- Go to EC2 home, then click on "Launch instance".
- Name your instance as you wish, for example "rhaiis-demo".
- Select "Red Hat" on "Quick Start" tab.
- Select "Red Hat Enterprise Linux 9 (HVM), SSD Volume Type"
- Choose Instance Type "g5.4xlarge"
- Create or choose a "Key pair" and download the private part of it.
- Make sure that "Allow SSH traffic from [Anywhere 0.0.0.0/0]" is selected.
- Change to a different AZ (Availability Zone) if necessary to avoid GPU capacity shortages. For this you may need to create a VPC for your non-default AZ.
- If you want to access the instance from your desktop, don't forget that it has to live in a public subnet of your VPC (with a public IP).
- Make the root volume 200GiB and make sure gp3 is selected.
Finally, click on "Launch instance".
Click on "View all instances" and wait until your instance is ready.
The documentation reminds us to:
- have Podman or Docker installed
- have access to a Linux server with NVIDIA or AMD GPUs and be logged in as a user with root privileges
Those are covered by default by the AMI we have selected (we will sanity-check them right after logging in and switching to root; see the sketch below).
Additionally, it says that if we want to use an NVIDIA GPU we have to:
- Install NVIDIA drivers
- Install the NVIDIA Container Toolkit
Let's get started. First of all, log in to your instance. You should have downloaded the private part of the key pair to your desktop (in our case the file is named rhaiis.pem). You will also need the DNS name or public IP address of your instance (e.g. ec2-W-X-Y-Z.AZ.compute.amazonaws.com). With that:
$ chmod 400 rhaiis.pem
$ ssh -i rhaiis.pem ec2-user@ec2-W-X-Y-Z.AZ.compute.amazonaws.com
The authenticity of host 'ec2-W-X-Y-Z.AZ.compute.amazonaws.com (W.X.Y.Z)' can't be established.
ED25519 key fingerprint is SHA256:2M5iKSLOUQZnUbsQtDooWMAYNjhoVYDWM8uPSdsHr8o.
This key is not known by any other names.
Are you sure you want to continue connecting (yes/no/[fingerprint])? yes
Warning: Permanently added 'ec2-W-X-Y-Z.AZ.compute.amazonaws.com' (ED25519) to the list of known hosts.
X11 forwarding request failed on channel 0
Register this system with Red Hat Insights: rhc connect
Example:
# rhc connect --activation-key <key> --organization <org>
The rhc client and Red Hat Insights will enable analytics and additional
management capabilities on your system.
View your connected systems at https://console.redhat.com/insights
You can learn more about how to register your system
using rhc at https://red.ht/registration
[ec2-user@ip-10-0-117-174 ~]$
Change to root:
[ec2-user@ip-10-0-117-174 ~]$ sudo -i
Now let's continue by installing the driver.
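Before touching the driver, here is the quick sanity check of the prerequisites mentioned earlier (a minimal sketch; pciutils is only needed if lspci is not already installed):
podman --version
dnf install -y pciutils
lspci | grep -i nvidia
The A10G GPU should show up in the lspci output.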
This is a simplified version of the full driver installation documentation, which you can find here.
Let's check that our Linux is the one we expect:
[root@ip-10-0-117-174 ~]# hostnamectl
Static hostname: ip-10-0-117-174.eu-central-1.compute.internal
Icon name: computer-vm
Chassis: vm 🖴
Machine ID: ec25997305d1554Wq6f1794fc9d38716
Boot ID: ff25fd91ce054aeFgfa222928e5c9a9c
Virtualization: amazon
Operating System: Red Hat Enterprise Linux 9.6 (Plow)
CPE OS Name: cpe:/o:redhat:enterprise_linux:9::baseos
Kernel: Linux 5.14.0-570.22.1.el9_6.x86_64
Architecture: x86-64
Hardware Vendor: Amazon EC2
Hardware Model: g5.4xlarge
Firmware Version: 1.0
[root@ip-10-0-117-174 ~]# uname -a
Linux ip-10-0-117-174.eu-central-1.compute.internal 5.14.0-570.22.1.el9_6.x86_64 #1 SMP PREEMPT_DYNAMIC Sun Jun 8 05:17:37 EDT 2025 x86_64 x86_64 x86_64 GNU/Linux
Install the kernel headers; these are needed to build the DKMS module later:
dnf install kernel-devel-matched kernel-headers
Satisfy third-party package dependencies:
dnf install https://dl.fedoraproject.org/pub/epel/epel-release-latest-9.noarch.rpm
Enable the network repository:
distro=rhel9
arch=x86_64
dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/$distro/$arch/cuda-$distro.repo
Clean the DNF repository cache:
dnf clean expire-cache
With RHEL 9 and later we have to install the Open Kernel Modules.
DNF module enablement: this is required for distributions where content is not distributed as a flat repository but as a repository containing DNF module streams.
dnf module enable nvidia-driver:open-dkms
Driver installation:
dnf install nvidia-open
Build the modules:
dkms autoinstall
Check:
# lsmod | grep nvidia
nvidia_uvm 4087808 0
nvidia_drm 151552 0
nvidia_modeset 1732608 1 nvidia_drm
nvidia 11583488 2 nvidia_uvm,nvidia_modeset
video 77824 1 nvidia_modeset
drm_ttm_helper 16384 1 nvidia_drm
drm_kms_helper 266240 2 drm_ttm_helper,nvidia_drm
drm 811008 6 drm_kms_helper,nvidia,drm_ttm_helper,nvidia_drm,ttm
# nvidia-smi
Wed Jul 23 15:36:13 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.57.08 Driver Version: 575.57.08 CUDA Version: 12.9 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A10G Off | 00000000:00:1E.0 Off | 0 |
| 0% 26C P8 11W / 300W | 0MiB / 23028MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
This is a simplified version of the full NVIDIA Container Toolkit documentation, which you can find here.
Let's install the NVIDIA Container Toolkit packages:
export NVIDIA_CONTAINER_TOOLKIT_VERSION=1.17.8-1
sudo dnf install -y \
nvidia-container-toolkit-${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
nvidia-container-toolkit-base-${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
libnvidia-container-tools-${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
libnvidia-container1-${NVIDIA_CONTAINER_TOOLKIT_VERSION}
Configuring Podman
For Podman, NVIDIA recommends using CDI for accessing NVIDIA devices in containers.
nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
Which results in output like:
INFO[0000] Using /usr/lib64/libnvidia-ml.so.575.57.08
INFO[0000] Auto-detected mode as 'nvml'
INFO[0000] Selecting /dev/nvidia0 as /dev/nvidia0
...
WARN[0000] Could not locate nvidia_drv.so: pattern nvidia_drv.so not found
WARN[0000] Could not locate libglxserver_nvidia.so.575.57.08: pattern libglxserver_nvidia.so.575.57.08 not found
INFO[0000] Generated CDI spec with version 0.8.0
Check:
# nvidia-ctk cdi list
INFO[0000] Found 3 CDI devices
nvidia.com/gpu=0
nvidia.com/gpu=GPU-cb2732e5-cd47-331e-2671-88077f88b986
nvidia.com/gpu=all
Ensure nvidia-uvm auto-loads at boot, or you will experience the "failed to stat CDI host device /dev/nvidia-uvm" error shown in the troubleshooting note at the end of this guide:
echo nvidia-uvm | sudo tee -a /etc/modules-load.d/nvidia.conf
Check:
# ls -l /dev/nvidia*
crw-rw-rw-. 1 root root 195, 0 Jul 23 15:36 /dev/nvidia0
crw-rw-rw-. 1 root root 195, 255 Jul 23 15:36 /dev/nvidiactl
crw-rw-rw-. 1 root root 235, 0 Jul 23 15:36 /dev/nvidia-uvm
crw-rw-rw-. 1 root root 235, 1 Jul 23 15:36 /dev/nvidia-uvm-tools
/dev/nvidia-caps:
total 0
cr--------. 1 root root 238, 1 Jul 23 15:36 nvidia-cap1
cr--r--r--. 1 root root 238, 2 Jul 23 15:36 nvidia-cap2
Testing a workload:
podman run --rm --device nvidia.com/gpu=all --security-opt=label=disable ubuntu nvidia-smi -L
You should see:
# podman run --rm --device nvidia.com/gpu=all --security-opt=label=disable ubuntu nvidia-smi -L
Resolved "ubuntu" as an alias (/etc/containers/registries.conf.d/000-shortnames.conf)
Trying to pull docker.io/library/ubuntu:latest...
Getting image source signatures
Copying blob 32f112e3802c done |
Copying config 65ae7a6f35 done |
Writing manifest to image destination
GPU 0: NVIDIA A10G (UUID: GPU-cb2732e5-cd47-331e-2671-88077f88b986)
Another test:
podman run --rm -it \
--security-opt=label=disable \
--device nvidia.com/gpu=all \
nvcr.io/nvidia/cuda:12.4.1-base-ubi9 \
nvidia-smi
Congratulations, you have all the NVIDIA drivers installed.
Pulling vLLM image
Let's pull the vLLM image; to do this you need a Red Hat Network account.
First, log in:
podman login registry.redhat.io
Next, pull the image:
podman pull registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.2.0
Let's prepare the file cache:
mkdir -p rhaiis-cache
chmod g+rwX rhaiis-cache
Setting the Hugging Face token environment variable is needed in order to download the weights of the LLM you will use later.
Create or append your Hugging Face token (HF_TOKEN) to the private.env file, then source it:
$ echo "export HF_TOKEN=<your_HF_token>" > private.env
$ source private.env
Start the container
Start the vLLM container.
podman run --rm -it \
--device nvidia.com/gpu=all \
--security-opt=label=disable \
--shm-size=4g -p 8000:8000 \
--userns=keep-id:uid=1001 \
--env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
--env "HF_HUB_OFFLINE=0" \
--env=VLLM_NO_USAGE_STATS=1 \
-v ./rhaiis-cache:/opt/app-root/src/.cache:Z \
registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.2.0 \
--model RedHatAI/Llama-3.2-1B-Instruct-FP8 \
--tensor-parallel-size 1
Notes on the flags above:
1. Required for systems where SELinux is enabled. --security-opt=label=disable prevents SELinux from relabeling files in the volume mount. If you choose not to use this argument, your container might not run successfully.
2. If you experience an issue with shared memory, increase --shm-size to 8GB.
3. Maps the host UID to the effective UID of the vLLM process in the container. You can also pass --user=0, but this is less secure than the --userns option. Setting --user=0 runs vLLM as root inside the container.
4. Set and export HF_TOKEN with your Hugging Face API access token.
5. Required for systems where SELinux is enabled. On Debian or Ubuntu operating systems, or when using Docker without SELinux, the :Z suffix is not available.
6. Set --tensor-parallel-size to match the number of GPUs when running the AI Inference Server container on multiple GPUs.
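Give the container a minute or two to download the model weights and start serving. From a second shell on the instance you can check that the server is up; the /v1/models endpoint is part of vLLM's OpenAI-compatible API:
curl -s http://localhost:8000/v1/models | jq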
Super basic test:
curl -X POST -H "Content-Type: application/json" -d '{
"prompt": "What is the capital of France?",
"max_tokens": 50
}' http://localhost:8000/v1/completions | jq
You should get:
{
"id": "cmpl-49fa8f5e45c345918e2fe2a40f92ec4a",
"object": "text_completion",
"created": 1753287408,
"model": "RedHatAI/Llama-3.2-1B-Instruct-FP8",
"choices": [
{
"index": 0,
"text": " Paris\nThe capital of France is Paris. Paris is the most populous city in France, known for its rich history, art, fashion, and cuisine. It is also home to the Eiffel Tower, the Louvre Museum, and Notre Dame",
"logprobs": null,
"finish_reason": "length",
"stop_reason": null,
"prompt_logprobs": null
}
],
"usage": {
"prompt_tokens": 8,
"total_tokens": 58,
"completion_tokens": 50,
"prompt_tokens_details": null
},
"kv_transfer_params": null
}
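The same server also exposes the OpenAI-compatible chat endpoint, which is what chat UIs like the ones mentioned at the beginning would talk to. A minimal sketch against the model we just started:
curl -X POST -H "Content-Type: application/json" -d '{
  "model": "RedHatAI/Llama-3.2-1B-Instruct-FP8",
  "messages": [{"role": "user", "content": "What is the capital of France?"}],
  "max_tokens": 50
}' http://localhost:8000/v1/chat/completions | jq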
Now let's run a quick benchmark. Connect as ec2-user (no need for sudo) and prepare the Python environment:
python -m ensurepip --upgrade
python -m pip install --upgrade pip
pip install vllm pandas datasets
Install git if necessary:
sudo dnf install git
Clone the repo:
git clone https://github.com/vllm-project/vllm.git
Execute the benchmark:
python vllm/benchmarks/benchmark_serving.py --backend vllm --model RedHatAI/Llama-3.2-1B-Instruct-FP8 --num-prompts 100 --dataset-name random --random-input 1024 --random-output 512 --port 8000
Expected result:
============ Serving Benchmark Result ============
Successful requests: 100
Benchmark duration (s): 14.27
Total input tokens: 102300
Total generated tokens: 40739
Request throughput (req/s): 7.01
Output token throughput (tok/s): 2855.01
Total Token throughput (tok/s): 10024.26
---------------Time to First Token----------------
Mean TTFT (ms): 2388.80
Median TTFT (ms): 2307.74
P99 TTFT (ms): 4861.15
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 35.39
Median TPOT (ms): 23.96
P99 TPOT (ms): 103.90
---------------Inter-token Latency----------------
Mean ITL (ms): 23.23
Median ITL (ms): 19.08
P99 ITL (ms): 105.38
==================================================
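Your numbers will vary with the model, GPU and load pattern. If you want to shape the load instead of sending every prompt at once, the upstream benchmark script accepts a --request-rate flag (an assumption based on the vLLM benchmarks; confirm with --help in your checkout) to simulate a fixed arrival rate:
# hypothetical variation: throttle to roughly 5 requests per second
python vllm/benchmarks/benchmark_serving.py --backend vllm --model RedHatAI/Llama-3.2-1B-Instruct-FP8 --num-prompts 100 --dataset-name random --random-input 1024 --random-output 512 --request-rate 5 --port 8000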
A troubleshooting note: for some reason, loading the nvidia_uvm module in the running session was not enough; without the /dev/nvidia-uvm device node the container failed to start with a CDI error:
# podman run --rm -it --device nvidia.com/gpu=all --security-opt=label=disable --shm-size=4g -p 8000:8000 --userns=keep-id:uid=1001 --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" --env "HF_HUB_OFFLINE=0" --env=VLLM_NO_USAGE_STATS=1 -v ./rhaiis-cache:/opt/app-root/src/.cache:Z registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.2.0 --model RedHatAI/Llama-3.2-1B-Instruct-FP8 --tensor-parallel-size 1
Error: setting up CDI devices: failed to inject devices: failed to stat CDI host device "/dev/nvidia-uvm": no such file or directory
[root@ip-10-0-117-174 ~]# lsmod | grep nvidia
nvidia_uvm 4087808 0
nvidia_drm 151552 0
nvidia_modeset 1732608 1 nvidia_drm
video 77824 1 nvidia_modeset
drm_ttm_helper 16384 1 nvidia_drm
drm_kms_helper 266240 2 drm_ttm_helper,nvidia_drm
nvidia 11583488 2 nvidia_uvm,nvidia_modeset
drm 811008 6 drm_kms_helper,nvidia,drm_ttm_helper,nvidia_drm,ttm
[root@ip-10-0-117-174 ~]# dkms autoinstall
[root@ip-10-0-117-174 ~]# podman run --rm -it --device nvidia.com/gpu=all --security-opt=label=disable --shm-size=4g -p 8000:8000 --userns=keep-id:uid=1001 --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" --env "HF_HUB_OFFLINE=0" --env=VLLM_NO_USAGE_STATS=1 -v ./rhaiis-cache:/opt/app-root/src/.cache:Z registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.2.0 --model RedHatAI/Llama-3.2-1B-Instruct-FP8 --tensor-parallel-size 1
Error: setting up CDI devices: failed to inject devices: failed to stat CDI host device "/dev/nvidia-uvm": no such file or directory
[root@ip-10-0-117-174 ~]# nvidia-smi
Wed Jul 23 16:14:24 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.57.08 Driver Version: 575.57.08 CUDA Version: 12.9 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A10G Off | 00000000:00:1E.0 Off | 0 |
| 0% 27C P8 10W / 300W | 0MiB / 23028MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
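This is exactly why this guide adds nvidia-uvm to /etc/modules-load.d earlier. The recovery that matches that note is sketched below; a reboot is the blunt but reliable way to get the /dev/nvidia-uvm device node recreated, and regenerating the CDI spec afterwards keeps it in sync:
echo nvidia-uvm | tee -a /etc/modules-load.d/nvidia.conf
reboot
# after logging back in and becoming root again:
ls -l /dev/nvidia-uvm
nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml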