This guide demonstrates how to get up and running with llm-d on RHOAI, based on https://access.redhat.com/articles/7131048.
- Red Hat OpenShift AI 2.24
- OpenShift 4.18 - see ocp-4-18-setup for manual installation of `llm-d` dependencies
- OpenShift 4.19 - the dependencies needed for `llm-d` are shipped in OCP 4.19
RHOAI 2.x leverages Knative Serving by default. The following configurations disable Knative.
- Set `serviceMesh.managementState` to `Removed`, as shown in the following example (this requires an admin role):

```yaml
serviceMesh:
  ...
  managementState: Removed
```
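If you prefer the CLI to the UI, the setting above lives on the `DSCInitialization` resource; a minimal sketch of the relevant portion is shown below. The object name `default-dsci` is an assumption (check with `oc get dscinitialization`):

```yaml
# Sketch of the relevant portion of a DSCInitialization resource.
# The name default-dsci is an assumption; verify it on your cluster.
apiVersion: dscinitialization.opendatahub.io/v1
kind: DSCInitialization
metadata:
  name: default-dsci
spec:
  serviceMesh:
    managementState: Removed
```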
- You can also make this change through the RHOAI UI.
- Create a data science cluster (`DSC`) with the following information set in `kserve` and `serving`:

```yaml
kserve:
  defaultDeploymentMode: RawDeployment
  managementState: Managed
  ...
serving:
  ...
  managementState: Removed
  ...
```
- You can create the `DSC` through the RHOAI UI using the `dsc.yaml` provided in this repo.
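For reference, a minimal `DataScienceCluster` manifest with these settings might look like the sketch below. The exact nesting (in current schemas the Knative `serving` block sits under `kserve`) and the object name are assumptions; the `dsc.yaml` in this repo is authoritative:

```yaml
# Sketch only - compare against the dsc.yaml shipped in this repo.
apiVersion: datasciencecluster.opendatahub.io/v1
kind: DataScienceCluster
metadata:
  name: default-dsc   # assumed name
spec:
  components:
    kserve:
      defaultDeploymentMode: RawDeployment
      managementState: Managed
      serving:                      # Knative Serving, disabled for llm-d
        managementState: Removed
```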
`llm-d` leverages the Gateway API Inference Extension.
As described in Getting Started with Gateway API for the Ingress Operator, we can deploy a `GatewayClass` and a `Gateway` named `openshift-ai-inference` in the `openshift-ingress` namespace:

```shell
oc apply -f gateway.yaml
```
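A `gateway.yaml` along these lines would produce the resources shown in the next step; the listener details are assumptions based on the Gateway API docs, and the `gateway.yaml` in this repo is authoritative:

```yaml
# Sketch of a GatewayClass and Gateway; listener settings are assumptions.
apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: istio
spec:
  controllerName: istio.io/gateway-controller
---
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: openshift-ai-inference
  namespace: openshift-ingress
spec:
  gatewayClassName: istio
  listeners:
  - name: http
    port: 80
    protocol: HTTP
    allowedRoutes:
      namespaces:
        from: All
```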
We can see the Gateway is deployed:

```shell
oc get gateways -n openshift-ingress
>> NAME                     CLASS   ADDRESS                                                             PROGRAMMED   AGE
>> openshift-ai-inference   istio   openshift-ai-inference-istio.openshift-ingress.svc.cluster.local   True         9d
```
With the gateway deployed, we can now deploy an `LLMInferenceService` using KServe, which creates an inference pool of vLLM servers and an endpoint picker (EPP) for smart scheduling across the vLLM servers.
The `deployment.yaml` contains a sample manifest for deploying:

```shell
oc create ns llm-test
oc apply -f deployment.yaml -n llm-test
```
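The shape of such a manifest is sketched below; the field names follow the KServe `LLMInferenceService` API, but the specific values (model URI, replica count, router settings) are assumptions, and the `deployment.yaml` in this repo is authoritative:

```yaml
# Sketch of an LLMInferenceService; values are assumptions.
apiVersion: serving.kserve.io/v1alpha1
kind: LLMInferenceService
metadata:
  name: qwen
spec:
  model:
    uri: hf://Qwen/Qwen3-0.6B
    name: Qwen/Qwen3-0.6B
  replicas: 2          # matches the two vLLM pods shown below
  router:
    scheduler: {}      # the EPP / router-scheduler
    route: {}
    gateway: {}
```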
- We can see the `llminferenceservice` is deployed ...

```shell
oc get llminferenceservice -n llm-test
>> NAME   URL   READY   REASON   AGE
>> qwen         True             9m44s
```
- ... and that the `router-scheduler` and `vllm` pods are ready to go:

```shell
oc get pods -n llm-test
>> NAME                                            READY   STATUS    RESTARTS   AGE
>> qwen-kserve-c59dbf75-5ztf2                      1/1     Running   0          9m15s
>> qwen-kserve-c59dbf75-dlfj6                      1/1     Running   0          9m15s
>> qwen-kserve-router-scheduler-67dbbfb947-hn7ln   1/1     Running   0          9m15s
```
- We can query the model at the gateway's address:

```shell
curl -X POST http://openshift-ai-inference-istio.openshift-ingress.svc.cluster.local/llm-test/qwen/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "prompt": "Explain the difference between supervised and unsupervised learning in machine learning. Include examples of algorithms used in each type.",
    "max_tokens": 200,
    "temperature": 0.7,
    "top_p": 0.9
  }'
```
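Since vLLM exposes an OpenAI-compatible API, the same request can also be issued from Python. A minimal sketch using only the standard library, where the gateway address and the `/llm-test/qwen` path prefix come from the deployment above (it only resolves from inside the cluster):

```python
import json
from urllib import request

# In-cluster gateway address from the example above.
GATEWAY = "http://openshift-ai-inference-istio.openshift-ingress.svc.cluster.local"

def completion_payload(model, prompt, max_tokens=200, temperature=0.7, top_p=0.9):
    """Build an OpenAI-style /v1/completions request body."""
    return {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
    }

def query_completions(namespace, name, payload, gateway=GATEWAY):
    """POST the payload to <gateway>/<namespace>/<name>/v1/completions."""
    url = f"{gateway}/{namespace}/{name}/v1/completions"
    req = request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:  # requires in-cluster network access
        return json.load(resp)

# Example (requires cluster access):
# result = query_completions("llm-test", "qwen",
#                            completion_payload("Qwen/Qwen3-0.6B", "Hello"))
```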
- When finished, clean up by deleting the `LLMInferenceService`:

```shell
oc delete llminferenceservice qwen -n llm-test
```