
Red Hat AI Inference Server Getting Started Demo

This is basically a guided demo for getting started with RHAIIS on AWS with an NVIDIA GPU; take it as an ITLTRTD demo script. BTW, if you're not that lazy, here is the documentation of the product, where you can find all the steps needed to get started with RHAIIS on NVIDIA and AMD.

This guide will help you deploy vLLM (the inference engine behind RHAIIS) and run an LLM such as Llama, Granite, etc. You can expose it to the public internet and use a chatbot UI (like WebUI or AnythingLLM), or keep it local and test it from AWS itself; it's up to you.

Provisioning the VM

[NOTE] You are going to provision a medium-small GPU for this demo. Please take into account that you need an AWS account with enough permissions.

You have to provision a g5.4xlarge instance, which costs ~$1.7/h (check the AWS On-Demand Pricing site for the real/updated value).

Now let's provision that VM:

  • Go to EC2 home, then click on "Launch instance".
  • Name your instance as you wish, for example "rhaiis-demo".
  • Select "Red Hat" on "Quick Start" tab.
  • Select "Red Hat Enterprise Linux 9 (HVM), SSD Volume Type"
  • Choose Instance Type "g5.4xlarge"
  • Create or choose a "Key pair" and download the private part of it.
  • Make sure that "Allow SSH traffic from [0.0.0.0 Anywhere]" is selected
  • Change to a different AZ (Availability Zone) if necessary to work around GPU capacity shortages. For this you may need to create a VPC for your non-default AZ.
  • If you want to access the instance from your desktop, don't forget that your VPC has to be public.
  • Make the root volume 200GiB and make sure gp3 is selected.

Finally, click on "Launch instance".

Click on "View all instances" and wait until your instance is ready.

Preparing Your RHEL instance for RHAIIS

The documentation reminds us to:

  • have Podman or Docker installed
  • have access to a Linux server with NVIDIA or AMD GPUs and be logged in as a user with root privileges

Both requirements are covered by default by the AMI we have selected.

Additionally, it says that if we want to use an NVIDIA GPU we have to:

  • Install NVIDIA drivers
  • Install the NVIDIA Container Toolkit

Let's get started. First of all, log in to your instance. You should have downloaded the private key part of the key pair to your desktop (in our case the file is named rhaiis.pem). You will need the DNS name or public IP address of your instance (i.e. ec2-W-X-Y-Z.AZ.compute.amazonaws.com). With that:

$ chmod 400 rhaiis.pem
$ ssh -i rhaiis.pem ec2-user@ec2-W-X-Y-Z.AZ.compute.amazonaws.com
The authenticity of host 'ec2-W-X-Y-Z.AZ.compute.amazonaws.com (W.X.Y.Z)' can't be established.
ED25519 key fingerprint is SHA256:2M5iKSLOUQZnUbsQtDooWMAYNjhoVYDWM8uPSdsHr8o.
This key is not known by any other names.
Are you sure you want to continue connecting (yes/no/[fingerprint])? yes
Warning: Permanently added 'ec2-W-X-Y-Z.AZ.compute.amazonaws.com' (ED25519) to the list of known hosts.
X11 forwarding request failed on channel 0
Register this system with Red Hat Insights: rhc connect

Example:
# rhc connect --activation-key <key> --organization <org>

The rhc client and Red Hat Insights will enable analytics and additional
management capabilities on your system.
View your connected systems at https://console.redhat.com/insights

You can learn more about how to register your system 
using rhc at https://red.ht/registration
[ec2-user@ip-10-0-117-174 ~]$ 

Change to root.

[ec2-user@ip-10-0-117-174 ~]$ sudo -i

Now let's continue by installing the driver.

Install NVIDIA driver

This is a simplified script distilled from the full documentation, which you can find here.

Let's check that our Linux is the expected one:

[root@ip-10-0-117-174 ~]# hostnamectl
 Static hostname: ip-10-0-117-174.eu-central-1.compute.internal
       Icon name: computer-vm
         Chassis: vm 🖴
      Machine ID: ec25997305d1554Wq6f1794fc9d38716
         Boot ID: ff25fd91ce054aeFgfa222928e5c9a9c
  Virtualization: amazon
Operating System: Red Hat Enterprise Linux 9.6 (Plow)          
     CPE OS Name: cpe:/o:redhat:enterprise_linux:9::baseos
          Kernel: Linux 5.14.0-570.22.1.el9_6.x86_64
    Architecture: x86-64
 Hardware Vendor: Amazon EC2
  Hardware Model: g5.4xlarge
Firmware Version: 1.0
[root@ip-10-0-117-174 ~]# uname -a
Linux ip-10-0-117-174.eu-central-1.compute.internal 5.14.0-570.22.1.el9_6.x86_64 #1 SMP PREEMPT_DYNAMIC Sun Jun 8 05:17:37 EDT 2025 x86_64 x86_64 x86_64 GNU/Linux

Install the kernel headers; these are needed to build the DKMS modules later:

dnf install kernel-devel-matched kernel-headers
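
Before moving on, it's worth double-checking that the installed headers match the running kernel (a quick sanity check):

# the kernel-devel version should match the kernel reported by uname -r
uname -r
rpm -q kernel-devel kernel-headers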

Satisfy third-party package dependencies:

dnf install https://dl.fedoraproject.org/pub/epel/epel-release-latest-9.noarch.rpm

Enable the network repository:

distro=rhel9
arch=x86_64
dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/$distro/$arch/cuda-$distro.repo

Clean DNF repository cache:

dnf clean expire-cache

With RHEL 9 and later we have to install the open kernel modules.

DNF module enablement: this is required for distributions where content is not distributed as a flat repository but as a repository containing DNF module streams.

dnf module enable nvidia-driver:open-dkms

Driver Installation

dnf install nvidia-open

Building modules:

dkms autoinstall

Check:

# lsmod | grep nvidia
nvidia_uvm           4087808  0
nvidia_drm            151552  0
nvidia_modeset       1732608  1 nvidia_drm
nvidia              11583488  2 nvidia_uvm,nvidia_modeset
video                  77824  1 nvidia_modeset
drm_ttm_helper         16384  1 nvidia_drm
drm_kms_helper        266240  2 drm_ttm_helper,nvidia_drm
drm                   811008  6 drm_kms_helper,nvidia,drm_ttm_helper,nvidia_drm,ttm
# nvidia-smi 
Wed Jul 23 15:36:13 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.57.08              Driver Version: 575.57.08      CUDA Version: 12.9     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A10G                    Off |   00000000:00:1E.0 Off |                    0 |
|  0%   26C    P8             11W /  300W |       0MiB /  23028MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
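
You can also confirm that DKMS built the modules against the running kernel (an extra sanity check, not part of the NVIDIA docs):

dkms status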

Install the NVIDIA Container Toolkit

This is a simplified script distilled from the full documentation, which you can find here.

Let's install the NVIDIA Container Toolkit packages:

export NVIDIA_CONTAINER_TOOLKIT_VERSION=1.17.8-1
sudo dnf install -y \
    nvidia-container-toolkit-${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
    nvidia-container-toolkit-base-${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
    libnvidia-container-tools-${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
    libnvidia-container1-${NVIDIA_CONTAINER_TOOLKIT_VERSION}
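
A quick optional check that the toolkit is in place:

nvidia-ctk --version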

Configuring Podman

For Podman, NVIDIA recommends using CDI for accessing NVIDIA devices in containers.

nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml

Which results in an output like:

INFO[0000] Using /usr/lib64/libnvidia-ml.so.575.57.08   
INFO[0000] Auto-detected mode as 'nvml'                 
INFO[0000] Selecting /dev/nvidia0 as /dev/nvidia0       
...
WARN[0000] Could not locate nvidia_drv.so: pattern nvidia_drv.so not found 
WARN[0000] Could not locate libglxserver_nvidia.so.575.57.08: pattern libglxserver_nvidia.so.575.57.08 not found 
INFO[0000] Generated CDI spec with version 0.8.0 

Check:

# nvidia-ctk cdi list
INFO[0000] Found 3 CDI devices                          
nvidia.com/gpu=0
nvidia.com/gpu=GPU-cb2732e5-cd47-331e-2671-88077f88b986
nvidia.com/gpu=all

Ensure that nvidia-uvm auto-loads at boot, or you will hit the "failed to stat CDI host device /dev/nvidia-uvm" error shown in the Troubleshooting section below:

echo nvidia-uvm | sudo tee -a /etc/modules-load.d/nvidia.conf

Check:

# ls -l /dev/nvidia*
crw-rw-rw-. 1 root root 195,   0 Jul 23 15:36 /dev/nvidia0
crw-rw-rw-. 1 root root 195, 255 Jul 23 15:36 /dev/nvidiactl
crw-rw-rw-. 1 root root 235,   0 Jul 23 15:36 /dev/nvidia-uvm
crw-rw-rw-. 1 root root 235,   1 Jul 23 15:36 /dev/nvidia-uvm-tools

/dev/nvidia-caps:
total 0
cr--------. 1 root root 238, 1 Jul 23 15:36 nvidia-cap1
cr--r--r--. 1 root root 238, 2 Jul 23 15:36 nvidia-cap2

Testing a workload:

podman run --rm --device nvidia.com/gpu=all --security-opt=label=disable ubuntu nvidia-smi -L

You should see:

# podman run --rm --device nvidia.com/gpu=all --security-opt=label=disable ubuntu nvidia-smi -L
Resolved "ubuntu" as an alias (/etc/containers/registries.conf.d/000-shortnames.conf)
Trying to pull docker.io/library/ubuntu:latest...
Getting image source signatures
Copying blob 32f112e3802c done   | 
Copying config 65ae7a6f35 done   | 
Writing manifest to image destination
GPU 0: NVIDIA A10G (UUID: GPU-cb2732e5-cd47-331e-2671-88077f88b986)

Another test:

podman run --rm -it \
--security-opt=label=disable \
--device nvidia.com/gpu=all \
nvcr.io/nvidia/cuda:12.4.1-base-ubi9 \
nvidia-smi

Congratulations, you now have all the NVIDIA drivers installed.

Getting Started with RHAIIS

Pulling vLLM image

Let's pull the vLLM image; in order to do this you need a Red Hat Network account.

First, log in:

podman login registry.redhat.io

Next, pull the image:

podman pull registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.2.0

Let's prepare the cache directory:

mkdir -p rhaiis-cache
chmod g+rwX rhaiis-cache

Setting the Hugging Face token environment variable is needed in order to download the weights of the LLM you will use later.

Create (or append) your Hugging Face token as HF_TOKEN in the private.env file, then source it:

$ echo "export HF_TOKEN=<your_HF_token>" > private.env
$ source private.env
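
Optionally, you can verify that the token is valid before starting the container; a quick check against the Hugging Face Hub whoami endpoint:

curl -s -H "Authorization: Bearer $HF_TOKEN" https://huggingface.co/api/whoami-v2 | jq .name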

Start the container

Start the vLLM container.

podman run --rm -it \
  --device nvidia.com/gpu=all \
  --security-opt=label=disable \
  --shm-size=4g -p 8000:8000 \
  --userns=keep-id:uid=1001 \
  --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
  --env "HF_HUB_OFFLINE=0" \
  --env=VLLM_NO_USAGE_STATS=1 \
  -v ./rhaiis-cache:/opt/app-root/src/.cache:Z \
  registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.2.0 \
  --model RedHatAI/Llama-3.2-1B-Instruct-FP8 \
  --tensor-parallel-size 1

1. Required for systems where SELinux is enabled. --security-opt=label=disable prevents SELinux from relabeling files in the volume mount. If you choose not to use this argument, your container might not successfully run.
2. If you experience an issue with shared memory, increase --shm-size to 8GB.
3. Maps the host UID to the effective UID of the vLLM process in the container. You can also pass --user=0, but this is less secure than the --userns option. Setting --user=0 runs vLLM as root inside the container.
4. Set and export HF_TOKEN with your Hugging Face API access token.
5. Required for systems where SELinux is enabled. On Debian or Ubuntu operating systems, or when using Docker without SELinux, the :Z suffix is not available.
6. Set --tensor-parallel-size to match the number of GPUs when running the AI Inference Server container on multiple GPUs.
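
If you prefer to keep the server in the background instead of attached to your terminal, the same command works detached; the -d flag and the rhaiis container name below are my additions:

podman run -d --name rhaiis \
  --device nvidia.com/gpu=all \
  --security-opt=label=disable \
  --shm-size=4g -p 8000:8000 \
  --userns=keep-id:uid=1001 \
  --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
  --env "HF_HUB_OFFLINE=0" \
  --env=VLLM_NO_USAGE_STATS=1 \
  -v ./rhaiis-cache:/opt/app-root/src/.cache:Z \
  registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.2.0 \
  --model RedHatAI/Llama-3.2-1B-Instruct-FP8 \
  --tensor-parallel-size 1

# follow the startup logs until the OpenAI-compatible server reports it is listening on port 8000
podman logs -f rhaiis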

Testing that vLLM is running

Super basic test:

curl -X POST -H "Content-Type: application/json" -d '{
    "prompt": "What is the capital of France?",
    "max_tokens": 50
}' http://localhost:8000/v1/completions | jq

You should get:

{
  "id": "cmpl-49fa8f5e45c345918e2fe2a40f92ec4a",
  "object": "text_completion",
  "created": 1753287408,
  "model": "RedHatAI/Llama-3.2-1B-Instruct-FP8",
  "choices": [
    {
      "index": 0,
      "text": " Paris\nThe capital of France is Paris. Paris is the most populous city in France, known for its rich history, art, fashion, and cuisine. It is also home to the Eiffel Tower, the Louvre Museum, and Notre Dame",
      "logprobs": null,
      "finish_reason": "length",
      "stop_reason": null,
      "prompt_logprobs": null
    }
  ],
  "usage": {
    "prompt_tokens": 8,
    "total_tokens": 58,
    "completion_tokens": 50,
    "prompt_tokens_details": null
  },
  "kv_transfer_params": null
}
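
vLLM serves the OpenAI-compatible API, so besides /v1/completions you can also list the served models and use the chat endpoint; a quick sketch against the same server and model as above:

# list the models served by this instance
curl -s http://localhost:8000/v1/models | jq

# chat completions endpoint (OpenAI-compatible)
curl -s -X POST -H "Content-Type: application/json" -d '{
    "model": "RedHatAI/Llama-3.2-1B-Instruct-FP8",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
    "max_tokens": 50
}' http://localhost:8000/v1/chat/completions | jq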

Benchmarking

Connect as ec2-user; there is no need to sudo.

Prepare the Python environment:

python -m ensurepip --upgrade
python -m pip install --upgrade pip
pip install vllm pandas datasets
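
If you prefer not to touch the user site-packages, a virtual environment works just as well; a minimal alternative sketch:

python3 -m venv ~/bench-venv
source ~/bench-venv/bin/activate
pip install --upgrade pip
pip install vllm pandas datasets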

Install git if necessary:

sudo dnf install git

Clone repo:

git clone https://github.com/vllm-project/vllm.git

Execute the benchmark:

python vllm/benchmarks/benchmark_serving.py --backend vllm --model RedHatAI/Llama-3.2-1B-Instruct-FP8 --num-prompts 100 --dataset-name random  --random-input 1024 --random-output 512 --port 8000

Expected result:

============ Serving Benchmark Result ============
Successful requests:                     100       
Benchmark duration (s):                  14.27     
Total input tokens:                      102300    
Total generated tokens:                  40739     
Request throughput (req/s):              7.01      
Output token throughput (tok/s):         2855.01   
Total Token throughput (tok/s):          10024.26  
---------------Time to First Token----------------
Mean TTFT (ms):                          2388.80   
Median TTFT (ms):                        2307.74   
P99 TTFT (ms):                           4861.15   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          35.39     
Median TPOT (ms):                        23.96     
P99 TPOT (ms):                           103.90    
---------------Inter-token Latency----------------
Mean ITL (ms):                           23.23     
Median ITL (ms):                         19.08     
P99 ITL (ms):                            105.38    
==================================================

Tool calling

git clone vllm...
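
This section is still a stub. As a starting point, here is my own sketch based on upstream vLLM's tool-calling support; the --enable-auto-tool-choice and --tool-call-parser flags come from upstream vLLM, so verify that the RHAIIS image you pulled supports them:

# start the server with tool calling enabled (flags from upstream vLLM; verify them against the RHAIIS image)
podman run --rm -it \
  --device nvidia.com/gpu=all \
  --security-opt=label=disable \
  --shm-size=4g -p 8000:8000 \
  --userns=keep-id:uid=1001 \
  --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
  -v ./rhaiis-cache:/opt/app-root/src/.cache:Z \
  registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.2.0 \
  --model RedHatAI/Llama-3.2-1B-Instruct-FP8 \
  --enable-auto-tool-choice \
  --tool-call-parser llama3_json

# ask a question that should trigger a tool call (get_weather is a made-up example function)
curl -s -X POST -H "Content-Type: application/json" -d '{
    "model": "RedHatAI/Llama-3.2-1B-Instruct-FP8",
    "messages": [{"role": "user", "content": "What is the weather like in Paris right now?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }]
}' http://localhost:8000/v1/chat/completions | jq '.choices[0].message.tool_calls'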

Troubleshooting

For some reason, loading the nvidia_uvm module is not enough on its own...

# podman run --rm -it   --device nvidia.com/gpu=all   --security-opt=label=disable   --shm-size=4g -p 8000:8000   --userns=keep-id:uid=1001   --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN"   --env "HF_HUB_OFFLINE=0"   --env=VLLM_NO_USAGE_STATS=1   -v ./rhaiis-cache:/opt/app-root/src/.cache:Z   registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.2.0   --model RedHatAI/Llama-3.2-1B-Instruct-FP8   --tensor-parallel-size 1
Error: setting up CDI devices: failed to inject devices: failed to stat CDI host device "/dev/nvidia-uvm": no such file or directory
[root@ip-10-0-117-174 ~]# lsmod | grep nvidia
nvidia_uvm           4087808  0
nvidia_drm            151552  0
nvidia_modeset       1732608  1 nvidia_drm
video                  77824  1 nvidia_modeset
drm_ttm_helper         16384  1 nvidia_drm
drm_kms_helper        266240  2 drm_ttm_helper,nvidia_drm
nvidia              11583488  2 nvidia_uvm,nvidia_modeset
drm                   811008  6 drm_kms_helper,nvidia,drm_ttm_helper,nvidia_drm,ttm
[root@ip-10-0-117-174 ~]# dkms autoinstall
[root@ip-10-0-117-174 ~]# podman run --rm -it   --device nvidia.com/gpu=all   --security-opt=label=disable   --shm-size=4g -p 8000:8000   --userns=keep-id:uid=1001   --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN"   --env "HF_HUB_OFFLINE=0"   --env=VLLM_NO_USAGE_STATS=1   -v ./rhaiis-cache:/opt/app-root/src/.cache:Z   registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.2.0   --model RedHatAI/Llama-3.2-1B-Instruct-FP8   --tensor-parallel-size 1
Error: setting up CDI devices: failed to inject devices: failed to stat CDI host device "/dev/nvidia-uvm": no such file or directory
[root@ip-10-0-117-174 ~]# nvidia-smi 
Wed Jul 23 16:14:24 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.57.08              Driver Version: 575.57.08      CUDA Version: 12.9     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A10G                    Off |   00000000:00:1E.0 Off |                    0 |
|  0%   27C    P8             10W /  300W |       0MiB /  23028MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
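
If you hit this error even though lsmod shows nvidia_uvm loaded (as in the transcript above), one possible workaround, which is my own assumption based on the nvidia-modprobe utility shipped with the driver and not a step from the RHAIIS documentation, is to create the missing device nodes explicitly and regenerate the CDI spec before retrying:

# hypothetical workaround: create the /dev/nvidia-uvm and /dev/nvidia-uvm-tools device nodes
nvidia-modprobe -u -c0
ls -l /dev/nvidia-uvm*
# regenerate the CDI spec so it matches the devices that now exist
nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml

The earlier step of adding nvidia-uvm to /etc/modules-load.d/nvidia.conf is meant to avoid this situation on subsequent boots.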
