# guide: llama-cli help reformatted, organized, fleshed out and examples added #15709

*rosmur started this conversation in Show and tell*
As a non-SWE, I think the `--help` output could be made cleaner and clearer to drive adoption. Here's the help information reformatted, organized into clear sections, with parameter ranges and usage examples added.

# Llama CLI User Guide

A comprehensive guide to using the `llama-cli` command-line tool for text generation and chat conversations with Large Language Models.
## llama-cli Version

This guide is current for version: 6310 (c8d0d14)
## Quick Start

### Basic Commands
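A minimal sketch of the two most common invocations (the model path reuses the example from the parameter table below):

```bash
# One-shot text generation from a prompt
llama-cli -m models/llama-2-7b.gguf -p "Hello, how are you?" -n 100

# Interactive chat session
llama-cli -m models/llama-2-7b.gguf -cnv
```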
### Usage
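In general, every command has the shape:

```bash
llama-cli [options]
```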
### Essential Parameters

| Parameter | Description | Example |
|---|---|---|
| `-m, --model` | Path to the GGUF model file | `-m models/llama-2-7b.gguf` |
| `-p, --prompt` | Prompt text to start generation from | `-p "Hello, how are you?"` |
| `-n, --predict` | Number of tokens to generate | `-n 100` |
| `-sys, --system-prompt` | System prompt for chat mode | `-sys "You are a helpful AI"` |
| `-c, --ctx-size` | Context window size in tokens (default: `4096`) | |
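Putting the essentials together in one command (paths and values reuse the examples above):

```bash
llama-cli -m models/llama-2-7b.gguf \
  -sys "You are a helpful AI" \
  -p "Hello, how are you?" \
  -n 100 -c 4096
```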
## Basic Info and Logging

| Parameter | Description | Default |
|---|---|---|
| `-h, --help` | Print usage and exit | |
| `--version` | Print version and exit | |
| `-v, --verbose` | Enable verbose logging | `false` |
## Model Download Options

| Parameter | Description | Example |
|---|---|---|
| `--hf-repo` | Hugging Face repo (optionally with a quant tag) | `--hf-repo unsloth/phi-4-GGUF:q4_k_m` |
| `--hf-file` | Specific file within the repo | `--hf-file model-q4_k_m.gguf` |
| `--hf-token` | Hugging Face access token for gated models | `--hf-token your_token_here` |
| `--offline` | Use locally cached files only; never hit the network | `--offline` |
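For example, fetching and running a quantized model straight from Hugging Face (repo tag reused from the table above; downloads are cached locally):

```bash
llama-cli --hf-repo unsloth/phi-4-GGUF:q4_k_m -p "Explain entropy in one paragraph" -n 200
```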
## Model Adapters

| Parameter | Description |
|---|---|
| `--lora` | Apply a LoRA adapter file |
| `--lora-scaled` | Apply a LoRA adapter with a custom scaling factor |
| `--control-vector` | Apply a control vector file |
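A sketch of applying an adapter (both file names are placeholders):

```bash
# Full-strength adapter
llama-cli -m base-model.gguf --lora my-adapter.gguf -p "..."

# Same adapter at half strength
llama-cli -m base-model.gguf --lora-scaled my-adapter.gguf 0.5 -p "..."
```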
## Chat Configuration

| Parameter | Description | Default |
|---|---|---|
| `-cnv, --conversation` | Run in conversation (chat) mode | |
| `-no-cnv, --no-conversation` | Force conversation mode off | `false` |
| `-i, --interactive` | Interactive mode | `false` |
| `-if, --interactive-first` | Wait for user input before generating | `false` |
| `-st, --single-turn` | Run a single chat turn, then exit | `false` |
| `--jinja` | Use the Jinja chat-template engine | |
| `--chat-template` | Built-in chat template name | |
| `--chat-template-file` | Load a chat template from a file | |
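For example, a single chat turn that exits after one reply (model path is a placeholder):

```bash
llama-cli -m model.gguf -st -sys "You are a helpful AI" -p "Summarize what a context window is"
```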
### Available Built-in Chat Templates

Listed here: https://github.com/ggml-org/llama.cpp/tree/master/models/templates
## Input/Output Control

| Parameter | Description |
|---|---|
| `--in-prefix` | String prepended to user input |
| `--in-suffix` | String appended after user input |
| `--in-prefix-bos` | Prefix user input with the BOS token |
| `-r, --reverse-prompt` | Stop generation and return control when this string appears |
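A classic raw-completion chat loop built from these flags, with no chat template (prompt text is illustrative):

```bash
llama-cli -m model.gguf -i --in-prefix " " -r "User:" \
  -p "Transcript of a dialog between User and Assistant.
User: Hello!
Assistant:"
```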
## Text Generation Parameters

### Basic Generation Control

| Parameter | Default | Notes |
|---|---|---|
| `-n, --predict` | `-1` | `-1` = infinite generation |
| `--keep` | `0` | Tokens to retain from the initial prompt; `0` to context size |
| `--ignore-eos` | `false` | Continue generating past the end-of-sequence token |
### Context Management

| Parameter | Default | Description |
|---|---|---|
| `-c, --ctx-size` | `4096` | Context window size in tokens |
| `--no-context-shift` | `false` | Disable shifting the context window when it fills up |
| `-b, --batch-size` | `2048` | Logical maximum batch size |
| `-ub, --ubatch-size` | `512` | Physical maximum batch size |
## Sampling and Creativity Control

### Temperature and Randomness

| Parameter | Default | Typical Range |
|---|---|---|
| `--temp` | `0.8` | `0.1-2.0` |
| `-s, --seed` | `-1` (random) | |
| `--dynatemp-range` | `0.0` | `0.0-1.0` |
### Token Selection Methods

| Parameter | Default | Range |
|---|---|---|
| `--top-k` | `40` | `1-100` |
| `--top-p` | `0.9` | `0.0-1.0` |
| `--min-p` | `0.1` | `0.0-1.0` |
| `--typical` | `1.0` | `0.0-1.0` |
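A moderately creative configuration combining these samplers with a fixed seed for reproducibility (values are illustrative):

```bash
llama-cli -m model.gguf -p "Write a haiku about autumn" \
  --temp 0.8 --top-k 40 --top-p 0.9 --min-p 0.1 -s 42
```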
### Repetition Control

| Parameter | Default |
|---|---|
| `--repeat-penalty` | `1.0` |
| `--repeat-last-n` | `64` |
| `--presence-penalty` | `0.0` |
| `--frequency-penalty` | `0.0` |
## Advanced Sampling

### DRY (Don't Repeat Yourself) Sampling

| Parameter | Default |
|---|---|
| `--dry-multiplier` | `0.0` |
| `--dry-base` | `1.75` |
| `--dry-allowed-length` | `2` |

DRY is disabled while `--dry-multiplier` is `0.0`.
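A sketch of turning DRY on; the multiplier value `0.8` is an assumption often seen in community presets, not a value from the help text:

```bash
llama-cli -m model.gguf -p "..." \
  --dry-multiplier 0.8 --dry-base 1.75 --dry-allowed-length 2
```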
### Mirostat Sampling

| Parameter | Default | Notes |
|---|---|---|
| `--mirostat` | `0` | `0` = disabled, `1` = Mirostat, `2` = Mirostat 2.0 |
| `--mirostat-lr` | `0.1` | Learning rate |
| `--mirostat-ent` | `5.0` | Target entropy |
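Enabling Mirostat 2.0 with the default learning rate and target entropy (note that Mirostat takes over token selection, so top-k/top-p are ignored):

```bash
llama-cli -m model.gguf -p "..." --mirostat 2 --mirostat-lr 0.1 --mirostat-ent 5.0
```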
## Performance and Hardware

### CPU Configuration

| Parameter | Default | Notes |
|---|---|---|
| `-t, --threads` | `-1` (auto) | Threads used for generation |
| `-tb, --threads-batch` | same as `--threads` | Threads used for batch/prompt processing |
| `--cpu-mask` | `""` | Hex mask of CPUs to use |
| `--cpu-range` | | Range of CPUs to use, e.g. `0-7` |
### GPU Configuration

| Parameter | Default | Description |
|---|---|---|
| `-ngl, --gpu-layers` | `0` | Number of model layers to offload to the GPU |
| `-sm, --split-mode` | `layer` | How to split the model across multiple GPUs |
| `-mg, --main-gpu` | `0` | Index of the main GPU |
| `-ts, --tensor-split` | | Per-GPU split proportions, e.g. `3,1` |
#### GPU Split Modes

- `none`: Single GPU only
- `layer`: Split by layers (recommended)
- `row`: Split by tensor rows

### Memory Management
| Parameter | Description |
|---|---|
| `--mlock` | Lock the model in RAM (prevents swapping) |
| `--no-mmap` | Do not memory-map the model file |
| `--numa` | NUMA strategy (see options below) |
#### NUMA Options

- `distribute`: Spread across all nodes
- `isolate`: Use current node only
- `numactl`: Use numactl CPU map

## Advanced Features
### Structured Generation

| Parameter | Example |
|---|---|
| `--grammar` | `--grammar "root ::= [a-z]+"` |
| `--grammar-file` | `--grammar-file grammar.bnf` |
| `-j, --json-schema` | `-j '{"type": "object"}'` |
| `--json-schema-file` | `--json-schema-file schema.json` |
### Reasoning and Thinking

| Parameter | Values |
|---|---|
| `--reasoning-format` | `none`, `deepseek`, `auto` |
| `--reasoning-budget` | `-1` (unlimited), `0` (disabled) |
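For example, disabling the thinking phase of a reasoning model entirely (model path is a placeholder):

```bash
llama-cli -m reasoning-model.gguf --reasoning-budget 0 -p "What is 2+2?"
```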
### Caching

| Parameter | Description |
|---|---|
| `--prompt-cache` | File to cache the prompt state in |
| `--prompt-cache-all` | Also cache user input and generations, not just the prompt |
| `--prompt-cache-ro` | Use the prompt cache without updating it |
Logging and Debugging
--log-file
--log-colors
--log-timestamps
--log-verbosity
--no-perf
## Environment Variables

Many parameters can be set via environment variables:

| Environment Variable | Equivalent Flag |
|---|---|
| `LLAMA_ARG_MODEL` | `-m, --model` |
| `LLAMA_ARG_CTX_SIZE` | `-c, --ctx-size` |
| `LLAMA_ARG_THREADS` | `-t, --threads` |
| `LLAMA_ARG_N_PREDICT` | `-n, --predict` |
| `LLAMA_ARG_N_GPU_LAYERS` | `-ngl, --gpu-layers` |
| `HF_TOKEN` | `--hf-token` |
| `LLAMA_OFFLINE` | `--offline` |
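A sketch of configuring through the environment instead of flags (values are placeholders):

```bash
export LLAMA_ARG_MODEL=models/llama-2-7b.gguf
export LLAMA_ARG_N_GPU_LAYERS=35
llama-cli -p "Hello" -n 50   # equivalent to passing -m and -ngl explicitly
```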
## More Examples

### 1. Chat

With temperature setting and conversation mode:
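A minimal sketch of such a command (model path is a placeholder):

```bash
llama-cli -m model.gguf -cnv --temp 0.7
```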
#### Hugging Face Integration
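Chatting with a model pulled directly from Hugging Face (repo tag reused from the download section):

```bash
llama-cli --hf-repo unsloth/phi-4-GGUF:q4_k_m -cnv --temp 0.7
```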
#### Chat Templates
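Forcing a specific built-in template when the model's embedded one is missing or wrong (`chatml` is one of the built-in names):

```bash
llama-cli -m model.gguf -cnv --jinja --chat-template chatml
```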
### 2. Technical Q&A Assistant
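One way to set this up, pairing a low temperature with a focused system prompt (values are assumptions, not tuned recommendations):

```bash
llama-cli -m model.gguf -cnv --temp 0.3 \
  -sys "You are a precise technical assistant. Answer concisely and note any caveats."
```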
### 3. Creative Writing with High Randomness
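For example (parameter values are illustrative):

```bash
llama-cli -m model.gguf -n 400 --temp 1.4 --top-p 0.95 --repeat-penalty 1.1 \
  -p "Write the opening of a surreal short story"
```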
### 4. Structured JSON Output
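Constraining output with a JSON schema; the schema here is a toy example:

```bash
llama-cli -m model.gguf \
  -p "Describe this product as JSON: red wool scarf" \
  -j '{"type": "object", "properties": {"name": {"type": "string"}, "color": {"type": "string"}}}'
```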
### 5. Multi-GPU Setup

```bash
# Use multiple GPUs with layer splitting
llama-cli -m large-model.gguf -ngl 40 -sm layer -ts 3,1 --main-gpu 0
```
### 6. High-Performance CPU Setup

```bash
# Optimize for CPU performance
llama-cli -m model.gguf -t 8 --cpu-range 0-7 --mlock --numa distribute
```
### 7. Conversation with Custom Template
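A sketch with a template loaded from disk (the `.jinja` file name is a placeholder):

```bash
llama-cli -m model.gguf -cnv --jinja --chat-template-file custom-template.jinja
```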
### 8. Constrained Generation with Grammar
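Reusing the grammar from the Structured Generation section to force lowercase-letters-only output:

```bash
llama-cli -m model.gguf -p "Pick a random English word:" -n 16 --grammar "root ::= [a-z]+"
```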
### 9. Batch Processing with Caching

```bash
# Process multiple prompts with caching
llama-cli -m model.gguf --prompt-cache prompts.cache --prompt-cache-all -f input-prompts.txt
```
### 10. Debug and Development
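A verbose run with timestamped logs written to a file (flags taken from the Logging and Debugging section):

```bash
llama-cli -m model.gguf -p "test" -n 10 -v --log-timestamps --log-file debug.log
```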
## Tips for Beginners

- Start with just the three essentials: `-m`, `-p`, and `-n`
- Increase `--ctx-size` for longer conversations or documents
- Raise `--gpu-layers` to speed up inference significantly

## Common Issues and Solutions
### Performance Issues

- Generation too slow: increase `--gpu-layers` or `--threads`
- Out-of-memory errors: reduce `--ctx-size` or `--batch-size`
- CPU contention with other workloads: lower `--threads` or use `--cpu-range` to pin cores
### Generation Quality

- Repetitive output: raise `--repeat-penalty` or enable DRY sampling
- Output too random or incoherent: lower `--temp` or adjust `--top-p`
- Output too bland or deterministic: raise `--temp` and `--top-p`
### Model Loading

- Model getting swapped out of memory: use `--mlock` to keep the model in RAM

This guide covers the essential features of llama-cli. For the most up-to-date information, always refer to `llama-cli --help`.