
Commit 37b2ea4: update readme
1 parent: 7bf4dca

1 file changed: 11 additions, 8 deletions


README.md (11 additions, 8 deletions)
@@ -361,18 +361,12 @@ triton profile -m llama-3.1-8b-instruct --service-kind openai --endpoint-type ch
 ## Serving a HuggingFace LLM Model with LLM API
 
-> [!NOTE]
-> LLM API has not yet been integrated into the official triton server tensorrt_llm backend image yet.
-> To start the LLM API functionality, the user will only
 The LLM API is a high-level Python API and designed for Tensorrt LLM workflows. It could
 convert a LLM model in Hugging Face format into a Tensorrt LLM engine and serve the engine with a unified Python API without invoking different
 engine build and converting scripts.
 To use the LLM API with Triton CLI, import the model with `--backend llmapi`
 ```bash
-export MODEL_NAME="llama-3.1-8b-instruct"
-export HF_ID="meta-llama/Llama-3.1-8B-Instruct"
-triton import -m $MODEL_NAME --source "hf:$HF_ID" --backend llmapi
+triton import -m "llama-3.1-8b-instruct" --backend llmapi
 ```
 
 Huggingface models will be downloaded at runtime when starting the LLM API engine if not found
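The runtime download lands in the standard Hugging Face cache, which is why the commit's docker example mounts `~/.cache/huggingface` into the container: an existing download is reused instead of fetched again. A minimal sketch of the cache layout this relies on; the helper function is illustrative (not part of Triton CLI or the hub client), though the `models--<org>--<name>` snapshot-directory convention is the hub's actual one:

```python
from pathlib import Path

def hf_cache_dir(repo_id: str, cache_root: str = "~/.cache/huggingface") -> Path:
    """Illustrative helper: where the Hugging Face hub caches a model repo.

    Snapshots live under <cache_root>/hub/models--<org>--<name>, so mounting
    ~/.cache/huggingface into the container lets the LLM API engine find a
    previously downloaded model at startup.
    """
    org, name = repo_id.split("/")
    return Path(cache_root).expanduser() / "hub" / f"models--{org}--{name}"

print(hf_cache_dir("meta-llama/Llama-3.1-8B-Instruct"))
```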
@@ -383,6 +377,15 @@ startup time. tensorrt_llm>=0.18.0 is required.
 #### Example
 
 ```bash
+docker run -ti \
+  --gpus all \
+  --network=host \
+  --shm-size=1g --ulimit memlock=-1 \
+  -v /tmp:/tmp \
+  -v ${HOME}/models:/root/models \
+  -v ${HOME}/.cache/huggingface:/root/.cache/huggingface \
+  nvcr.io/nvidia/tritonserver:25.03-trtllm-python-py3
 # Install the Triton CLI
 pip install git+https://github.com/triton-inference-server/triton_cli.git@main
 
@@ -394,7 +397,7 @@ triton remove -m all
 triton import -m llama-3.1-8b-instruct --backend llmapi
 
 # Start Triton pointing at the default model repository
-triton start --frontend openai --mode docker
+triton start --frontend openai
 
 # Interact with model at http://localhost:9000
 curl -s http://localhost:9000/v1/chat/completions -H 'Content-Type: application/json' -d '{
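The curl command's JSON body is truncated in this view. The same request can be sketched in Python against the OpenAI-compatible frontend; the payload fields follow the standard chat-completions schema, the prompt text is an illustrative assumption, and the HTTP call is wrapped in a function so nothing is sent unless a running server is available:

```python
import json
import urllib.request

# Standard OpenAI-style chat-completions body; the prompt is illustrative.
payload = {
    "model": "llama-3.1-8b-instruct",
    "messages": [{"role": "user", "content": "Hello!"}],
}

def chat(url: str = "http://localhost:9000/v1/chat/completions") -> dict:
    """POST the payload to the OpenAI-compatible frontend started above."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# chat()  # requires the server from the steps above; uncomment to send
```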
