# guide: llama-cli help reformatted, organized, fleshed out and examples added #15709

*rosmur started this conversation in Show and tell*
As a non-SWE, I think the `--help` output could be made cleaner and clearer to drive adoption. Here's the help information reformatted, organized into clear sections, with parameter ranges and usage examples added.

# Llama CLI User Guide

A comprehensive guide to using the `llama-cli` command-line tool for text generation and chat conversations with Large Language Models.
## llama-cli Version

This guide is current for version: 6310 (c8d0d14)
## Quick Start

### Basic Commands
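A minimal sketch of the two most common invocations (the model path reuses the example from the parameter table below):

```bash
# One-shot text generation from a prompt
llama-cli -m models/llama-2-7b.gguf -p "Hello, how are you?" -n 100

# Interactive chat session
llama-cli -m models/llama-2-7b.gguf -cnv
```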
### Usage
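In general, every command has the shape:

```bash
llama-cli [options]
```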
### Essential Parameters

| Parameter | Description | Example |
|---|---|---|
| `-m, --model` | Path to the GGUF model file | `-m models/llama-2-7b.gguf` |
| `-p, --prompt` | Prompt text to start generation from | `-p "Hello, how are you?"` |
| `-n, --predict` | Number of tokens to generate | `-n 100` |
| `-sys, --system-prompt` | System prompt for chat mode | `-sys "You are a helpful AI"` |
| `-c, --ctx-size` | Context window size in tokens (default: `4096`) | |
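Putting the essentials together in one command (paths and values reuse the examples above):

```bash
llama-cli -m models/llama-2-7b.gguf \
  -sys "You are a helpful AI" \
  -p "Hello, how are you?" \
  -n 100 -c 4096
```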
## Basic Info and Logging

| Parameter | Description | Default |
|---|---|---|
| `-h, --help` | Print usage and exit | |
| `--version` | Print version and exit | |
| `-v, --verbose` | Enable verbose logging | `false` |
## Model Download Options

| Parameter | Description | Example |
|---|---|---|
| `--hf-repo` | Hugging Face repo (optionally with a quant tag) | `--hf-repo unsloth/phi-4-GGUF:q4_k_m` |
| `--hf-file` | Specific file within the repo | `--hf-file model-q4_k_m.gguf` |
| `--hf-token` | Hugging Face access token for gated models | `--hf-token your_token_here` |
| `--offline` | Use locally cached files only; never hit the network | `--offline` |
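For example, fetching and running a quantized model straight from Hugging Face (repo tag reused from the table above; downloads are cached locally):

```bash
llama-cli --hf-repo unsloth/phi-4-GGUF:q4_k_m -p "Explain entropy in one paragraph" -n 200
```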
## Model Adapters

| Parameter | Description |
|---|---|
| `--lora` | Apply a LoRA adapter file |
| `--lora-scaled` | Apply a LoRA adapter with a custom scaling factor |
| `--control-vector` | Apply a control vector file |
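A sketch of applying an adapter (both file names are placeholders):

```bash
# Full-strength adapter
llama-cli -m base-model.gguf --lora my-adapter.gguf -p "..."

# Same adapter at half strength
llama-cli -m base-model.gguf --lora-scaled my-adapter.gguf 0.5 -p "..."
```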
## Chat Configuration

| Parameter | Description | Default |
|---|---|---|
| `-cnv, --conversation` | Run in conversation (chat) mode | |
| `-no-cnv, --no-conversation` | Force conversation mode off | `false` |
| `-i, --interactive` | Interactive mode | `false` |
| `-if, --interactive-first` | Wait for user input before generating | `false` |
| `-st, --single-turn` | Run a single chat turn, then exit | `false` |
| `--jinja` | Use the Jinja chat-template engine | |
| `--chat-template` | Built-in chat template name | |
| `--chat-template-file` | Load a chat template from a file | |
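For example, a single chat turn that exits after one reply (model path is a placeholder):

```bash
llama-cli -m model.gguf -st -sys "You are a helpful AI" -p "Summarize what a context window is"
```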
### Available Built-in Chat Templates

Listed here: https://github.com/ggml-org/llama.cpp/tree/master/models/templates
## Input/Output Control

| Parameter | Description |
|---|---|
| `--in-prefix` | String prepended to user input |
| `--in-suffix` | String appended after user input |
| `--in-prefix-bos` | Prefix user input with the BOS token |
| `-r, --reverse-prompt` | Stop generation and return control when this string appears |
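A classic raw-completion chat loop built from these flags, with no chat template (prompt text is illustrative):

```bash
llama-cli -m model.gguf -i --in-prefix " " -r "User:" \
  -p "Transcript of a dialog between User and Assistant.
User: Hello!
Assistant:"
```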
## Text Generation Parameters

### Basic Generation Control

| Parameter | Default | Notes |
|---|---|---|
| `-n, --predict` | `-1` | `-1` = infinite generation |
| `--keep` | `0` | Tokens to retain from the initial prompt; `0` to context size |
| `--ignore-eos` | `false` | Continue generating past the end-of-sequence token |
### Context Management

| Parameter | Default | Description |
|---|---|---|
| `-c, --ctx-size` | `4096` | Context window size in tokens |
| `--no-context-shift` | `false` | Disable shifting the context window when it fills up |
| `-b, --batch-size` | `2048` | Logical maximum batch size |
| `-ub, --ubatch-size` | `512` | Physical maximum batch size |
## Sampling and Creativity Control

### Temperature and Randomness

| Parameter | Default | Typical Range |
|---|---|---|
| `--temp` | `0.8` | `0.1-2.0` |
| `-s, --seed` | `-1` (random) | |
| `--dynatemp-range` | `0.0` | `0.0-1.0` |
### Token Selection Methods

| Parameter | Default | Range |
|---|---|---|
| `--top-k` | `40` | `1-100` |
| `--top-p` | `0.9` | `0.0-1.0` |
| `--min-p` | `0.1` | `0.0-1.0` |
| `--typical` | `1.0` | `0.0-1.0` |
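A moderately creative configuration combining these samplers with a fixed seed for reproducibility (values are illustrative):

```bash
llama-cli -m model.gguf -p "Write a haiku about autumn" \
  --temp 0.8 --top-k 40 --top-p 0.9 --min-p 0.1 -s 42
```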
### Repetition Control

| Parameter | Default |
|---|---|
| `--repeat-penalty` | `1.0` |
| `--repeat-last-n` | `64` |
| `--presence-penalty` | `0.0` |
| `--frequency-penalty` | `0.0` |
## Advanced Sampling

### DRY (Don't Repeat Yourself) Sampling

| Parameter | Default |
|---|---|
| `--dry-multiplier` | `0.0` |
| `--dry-base` | `1.75` |
| `--dry-allowed-length` | `2` |

DRY is disabled while `--dry-multiplier` is `0.0`.
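A sketch of turning DRY on; the multiplier value `0.8` is an assumption often seen in community presets, not a value from the help text:

```bash
llama-cli -m model.gguf -p "..." \
  --dry-multiplier 0.8 --dry-base 1.75 --dry-allowed-length 2
```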
### Mirostat Sampling

| Parameter | Default | Notes |
|---|---|---|
| `--mirostat` | `0` | `0` = disabled, `1` = Mirostat, `2` = Mirostat 2.0 |
| `--mirostat-lr` | `0.1` | Learning rate |
| `--mirostat-ent` | `5.0` | Target entropy |
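Enabling Mirostat 2.0 with the default learning rate and target entropy (note that Mirostat takes over token selection, so top-k/top-p are ignored):

```bash
llama-cli -m model.gguf -p "..." --mirostat 2 --mirostat-lr 0.1 --mirostat-ent 5.0
```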
## Performance and Hardware

### CPU Configuration

| Parameter | Default | Notes |
|---|---|---|
| `-t, --threads` | `-1` (auto) | Threads used for generation |
| `-tb, --threads-batch` | same as `--threads` | Threads used for batch/prompt processing |
| `--cpu-mask` | `""` | Hex mask of CPUs to use |
| `--cpu-range` | | Range of CPUs to use, e.g. `0-7` |
### GPU Configuration

| Parameter | Default | Description |
|---|---|---|
| `-ngl, --gpu-layers` | `0` | Number of model layers to offload to the GPU |
| `-sm, --split-mode` | `layer` | How to split the model across multiple GPUs |
| `-mg, --main-gpu` | `0` | Index of the main GPU |
| `-ts, --tensor-split` | | Per-GPU split proportions, e.g. `3,1` |
#### GPU Split Modes

- `none`: Single GPU only
- `layer`: Split by layers (recommended)
- `row`: Split by tensor rows

### Memory Management
| Parameter | Description |
|---|---|
| `--mlock` | Lock the model in RAM (prevents swapping) |
| `--no-mmap` | Do not memory-map the model file |
| `--numa` | NUMA strategy (see options below) |
#### NUMA Options

- `distribute`: Spread across all nodes
- `isolate`: Use current node only
- `numactl`: Use numactl CPU map

## Advanced Features
### Structured Generation

| Parameter | Example |
|---|---|
| `--grammar` | `--grammar "root ::= [a-z]+"` |
| `--grammar-file` | `--grammar-file grammar.bnf` |
| `-j, --json-schema` | `-j '{"type": "object"}'` |
| `--json-schema-file` | `--json-schema-file schema.json` |
### Reasoning and Thinking

| Parameter | Values |
|---|---|
| `--reasoning-format` | `none`, `deepseek`, `auto` |
| `--reasoning-budget` | `-1` (unlimited), `0` (disabled) |
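For example, disabling the thinking phase of a reasoning model entirely (model path is a placeholder):

```bash
llama-cli -m reasoning-model.gguf --reasoning-budget 0 -p "What is 2+2?"
```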
### Caching

| Parameter | Description |
|---|---|
| `--prompt-cache` | File to cache the prompt state in |
| `--prompt-cache-all` | Also cache user input and generations, not just the prompt |
| `--prompt-cache-ro` | Use the prompt cache without updating it |
Logging and Debugging
--log-file
--log-colors
--log-timestamps
--log-verbosity
--no-perf
## Environment Variables

Many parameters can be set via environment variables:

| Environment Variable | Equivalent Flag |
|---|---|
| `LLAMA_ARG_MODEL` | `-m, --model` |
| `LLAMA_ARG_CTX_SIZE` | `-c, --ctx-size` |
| `LLAMA_ARG_THREADS` | `-t, --threads` |
| `LLAMA_ARG_N_PREDICT` | `-n, --predict` |
| `LLAMA_ARG_N_GPU_LAYERS` | `-ngl, --gpu-layers` |
| `HF_TOKEN` | `--hf-token` |
| `LLAMA_OFFLINE` | `--offline` |
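A sketch of configuring through the environment instead of flags (values are placeholders):

```bash
export LLAMA_ARG_MODEL=models/llama-2-7b.gguf
export LLAMA_ARG_N_GPU_LAYERS=35
llama-cli -p "Hello" -n 50   # equivalent to passing -m and -ngl explicitly
```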
## More Examples

### 1. Chat

With temperature setting and conversation mode:
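A minimal sketch of such a command (model path is a placeholder):

```bash
llama-cli -m model.gguf -cnv --temp 0.7
```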
#### Hugging Face Integration
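Chatting with a model pulled directly from Hugging Face (repo tag reused from the download section):

```bash
llama-cli --hf-repo unsloth/phi-4-GGUF:q4_k_m -cnv --temp 0.7
```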
#### Chat Templates
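Forcing a specific built-in template when the model's embedded one is missing or wrong (`chatml` is one of the built-in names):

```bash
llama-cli -m model.gguf -cnv --jinja --chat-template chatml
```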
### 2. Technical Q&A Assistant
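One way to set this up, pairing a low temperature with a focused system prompt (values are assumptions, not tuned recommendations):

```bash
llama-cli -m model.gguf -cnv --temp 0.3 \
  -sys "You are a precise technical assistant. Answer concisely and note any caveats."
```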
### 3. Creative Writing with High Randomness
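For example (parameter values are illustrative):

```bash
llama-cli -m model.gguf -n 400 --temp 1.4 --top-p 0.95 --repeat-penalty 1.1 \
  -p "Write the opening of a surreal short story"
```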
### 4. Structured JSON Output
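Constraining output with a JSON schema; the schema here is a toy example:

```bash
llama-cli -m model.gguf \
  -p "Describe this product as JSON: red wool scarf" \
  -j '{"type": "object", "properties": {"name": {"type": "string"}, "color": {"type": "string"}}}'
```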
### 5. Multi-GPU Setup

```bash
# Use multiple GPUs with layer splitting
llama-cli -m large-model.gguf -ngl 40 -sm layer -ts 3,1 --main-gpu 0
```
### 6. High-Performance CPU Setup

```bash
# Optimize for CPU performance
llama-cli -m model.gguf -t 8 --cpu-range 0-7 --mlock --numa distribute
```
### 7. Conversation with Custom Template
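A sketch with a template loaded from disk (the `.jinja` file name is a placeholder):

```bash
llama-cli -m model.gguf -cnv --jinja --chat-template-file custom-template.jinja
```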
### 8. Constrained Generation with Grammar
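Reusing the grammar from the Structured Generation section to force lowercase-letters-only output:

```bash
llama-cli -m model.gguf -p "Pick a random English word:" -n 16 --grammar "root ::= [a-z]+"
```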
### 9. Batch Processing with Caching

```bash
# Process multiple prompts with caching
llama-cli -m model.gguf --prompt-cache prompts.cache --prompt-cache-all -f input-prompts.txt
```
### 10. Debug and Development
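A verbose run with timestamped logs written to a file (flags taken from the Logging and Debugging section):

```bash
llama-cli -m model.gguf -p "test" -n 10 -v --log-timestamps --log-file debug.log
```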
## Tips for Beginners

- Start with just the three essentials: `-m`, `-p`, and `-n`
- Increase `--ctx-size` for longer conversations or documents
- Raise `--gpu-layers` to speed up inference significantly

## Common Issues and Solutions
### Performance Issues

- Generation too slow: increase `--gpu-layers` or `--threads`
- Out-of-memory errors: reduce `--ctx-size` or `--batch-size`
- CPU contention with other workloads: lower `--threads` or use `--cpu-range` to pin cores
### Generation Quality

- Repetitive output: raise `--repeat-penalty` or enable DRY sampling
- Output too random or incoherent: lower `--temp` or adjust `--top-p`
- Output too bland or deterministic: raise `--temp` and `--top-p`
### Model Loading

- Model getting swapped out of memory: use `--mlock` to keep the model in RAM

This guide covers the essential features of llama-cli. For the most up-to-date information, always refer to `llama-cli --help`.