@@ -361,18 +361,12 @@ triton profile -m llama-3.1-8b-instruct --service-kind openai --endpoint-type ch
## Serving a HuggingFace LLM Model with LLM API

- > [!NOTE]
- > LLM API has not yet been integrated into the official triton server tensorrt_llm backend image yet.
- > To start the LLM API functionality, the user will only
-
The LLM API is a high-level Python API designed for TensorRT-LLM workflows. It can
convert an LLM model in Hugging Face format into a TensorRT-LLM engine and serve the engine with a unified Python API without invoking separate
engine build and conversion scripts.
To use the LLM API with Triton CLI, import the model with `--backend llmapi`:
```bash
- export MODEL_NAME="llama-3.1-8b-instruct"
- export HF_ID="meta-llama/Llama-3.1-8B-Instruct"
- triton import -m $MODEL_NAME --source "hf:$HF_ID" --backend llmapi
+ triton import -m "llama-3.1-8b-instruct" --backend llmapi
```

Hugging Face models will be downloaded at runtime when starting the LLM API engine if not found
@@ -383,6 +377,15 @@ startup time. tensorrt_llm>=0.18.0 is required.
#### Example

```bash
+ docker run -ti \
+   --gpus all \
+   --network=host \
+   --shm-size=1g --ulimit memlock=-1 \
+   -v /tmp:/tmp \
+   -v ${HOME}/models:/root/models \
+   -v ${HOME}/.cache/huggingface:/root/.cache/huggingface \
+   nvcr.io/nvidia/tritonserver:25.03-trtllm-python-py3
+
# Install the Triton CLI
pip install git+https://github.com/triton-inference-server/triton_cli.git@main
@@ -394,7 +397,7 @@ triton remove -m all
triton import -m llama-3.1-8b-instruct --backend llmapi

# Start Triton pointing at the default model repository
- triton start --frontend openai --mode docker
+ triton start --frontend openai

# Interact with model at http://localhost:9000
curl -s http://localhost:9000/v1/chat/completions -H 'Content-Type: application/json' -d '{