Commit 09be4c9: Update README.md (#3494)

1 parent 1ebda4e commit 09be4c9

File tree: 1 file changed, +1 -0 lines changed


examples/cpu/llm/inference/README.md

Lines changed: 1 addition & 0 deletions
@@ -113,6 +113,7 @@ python run.py --help # for more detailed usages
 | token latency | enable "--token-latency" to print out the first or next token latency |
 | generation iterations | use "--num-iter" and "--num-warmup" to control the repeated iterations of generation, default: 100-iter/10-warmup |
 | streaming mode output | greedy search only (work with "--greedy"), use "--streaming" to enable the streaming generation output |
+| KV Cache dtype | default: auto, use "--kv-cache-dtype=fp8_e5m2" to enable e5m2 KV Cache. For more information, refer to [vLLM FP8 E5M2 KV Cache](https://docs.vllm.ai/en/v0.6.6/quantization/fp8_e5m2_kvcache.html) |
 
 *Note:* You may need to log in your HuggingFace account to access the model files. Please refer to [HuggingFace login](https://huggingface.co/docs/huggingface_hub/quick-start#login).
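The added `--kv-cache-dtype=fp8_e5m2` row refers to the FP8 E5M2 format: 1 sign bit, 5 exponent bits, and 2 mantissa bits, i.e. float16's exponent range with a much shorter mantissa. As a rough numerical sketch of the precision loss involved in storing KV-cache values this way (not code from this repository; the helper name is hypothetical, and real kernels round rather than truncate), e5m2 can be simulated by truncating a float16's 10-bit mantissa to 2 bits:

```python
import numpy as np

def to_fp8_e5m2(x: float) -> float:
    # float16 layout: 1 sign + 5 exponent + 10 mantissa bits.
    # E5M2 shares the sign and exponent fields, so keeping only the
    # top 8 bits of the float16 bit pattern (sign + exponent + 2
    # mantissa bits) yields an e5m2-representable value
    # (round-toward-zero, ignoring NaN/inf edge cases).
    bits = np.float16(x).view(np.uint16)
    bits = bits & np.uint16(0xFF00)  # zero the low 8 mantissa bits
    return float(bits.view(np.float16))
```

For example, values whose mantissa fits in 2 bits (1.0, 1.25, 2.5) survive unchanged, while 1.3 truncates down to 1.25, which illustrates why the docs treat fp8 KV cache as a memory/accuracy trade-off.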

0 commit comments