Commit b7c839c

Update prefix plugin guide to use vllm as default to be consistent (#1078)
1 parent 070cbfb · commit b7c839c

File tree

1 file changed (+3 −4 lines)

site-src/guides/epp-configuration/prefix-aware.md

Lines changed: 3 additions & 4 deletions
@@ -51,7 +51,6 @@ shows a detailed analysis on how to estimate this.
 ```
 max_kv_tokens_per_server = (HBM_size - model_size)/ kv_size_per_token
 lru_indexer_capacity_per_server = (max_kv_tokens_per_server * avg_chars_per_token)/prefix_indexer_hash_block_size
-lru_indexer_capacity_total = max_num_servers * lru_indexer_capacity_per_server
 ```
 
 Let's take an example:
@@ -78,9 +77,9 @@ Use the following reference command to install an inferencepool with the prefix
 cache plugin environment variable configurations:
 
 ```txt
-$ helm install triton-llama3-8b-instruct \
-  --set inferencePool.modelServers.matchLabels.app=triton-llama3-8b-instruct \
-  --set inferencePool.modelServerType=triton-tensorrt-llm \
+$ helm install vllm-llama3-8b-instruct \
+  --set inferencePool.modelServers.matchLabels.app=vllm-llama3-8b-instruct \
+  --set inferencePool.modelServerType=vllm \
   --set provider.name=[none|gke] \
   --set inferenceExtension.env.EXPERIMENTAL_USE_SCHEDULER_V2=true \
   --set inferenceExtension.env.ENABLE_PREFIX_CACHE_SCHEDULING=true \
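For context on the formulas retained by this change, here is a minimal Python sketch of the per-server capacity estimate. Every numeric value below is a hypothetical placeholder chosen for illustration, not a figure from the guide; the guide's own worked example follows "Let's take an example:".

```python
# Minimal sketch of the guide's LRU indexer capacity estimate.
# All numeric values are hypothetical placeholders, not figures from the guide.

hbm_size = 80e9                      # accelerator HBM in bytes (assumed)
model_size = 16e9                    # model weights in bytes (assumed)
kv_size_per_token = 128_000          # KV-cache bytes per token (assumed)
avg_chars_per_token = 4              # average characters per token (assumed)
prefix_indexer_hash_block_size = 64  # characters hashed per block (assumed)

max_kv_tokens_per_server = (hbm_size - model_size) / kv_size_per_token
lru_indexer_capacity_per_server = (
    max_kv_tokens_per_server * avg_chars_per_token
) / prefix_indexer_hash_block_size

print(f"max_kv_tokens_per_server: {max_kv_tokens_per_server:,.0f}")
print(f"lru_indexer_capacity_per_server: {lru_indexer_capacity_per_server:,.0f}")
```

With these assumed values the estimate works out to 500,000 KV tokens and an indexer capacity of 31,250 hash blocks per server.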
