# Multi-Backend LLM Inference with Comprehensive Performance Testing: Ollama, OpenVINO, and llama.cpp

A comprehensive toolkit for running Large Language Models locally with three inference backends: Ollama (recommended), OpenVINO GenAI, and llama.cpp. It includes extensive benchmarking tools and performance comparisons to help you choose the best setup for your hardware.
- 📊 Evidence-Based: 11 configurations tested across 5 models, with detailed performance data
- ⚡ Multiple Backends: Compare Ollama, OpenVINO (GPU/CPU), and llama.cpp
- 🎯 Honest Results: Nuanced findings show the GPU isn't always faster on integrated Intel Xe
- 💼 Practical Guidance: Clear recommendations based on actual hardware testing
- 🔒 Privacy: Run models locally without cloud dependencies
- 📦 Complete Toolkit: Setup scripts, benchmarking tools, and comprehensive documentation
Tested on Intel Core i7-1185G7 + Iris Xe Graphics (11 configurations across 5 models):
| Model | Speed | Capabilities | Best For |
|---|---|---|---|
| Qwen3-VL 8B | 5.14 tok/s | Vision + Text | General use |
| Llama 3.1 8B | 4.50 tok/s | Tool calling | Function calling |
| Mistral 7B | 5.86 tok/s | Text only | Speed fallback |
Setup: `ollama pull qwen3-vl:8b-instruct` - a single command, no drivers needed
| Model | Ollama CPU | OpenVINO CPU | OpenVINO GPU | Winner |
|---|---|---|---|---|
| Mistral 7B | 5.86 tok/s | 9.5 tok/s | 9.4 tok/s | OpenVINO CPU (1.6x) |
| Llama 3.1 8B | 4.50 tok/s | 3.4 tok/s | 4.3 tok/s | Ollama (1.3x) |
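The speedup factors in the Winner column are plain throughput ratios taken from the table; a quick illustrative check (the `speedup` helper is ours, not part of the toolkit):

```python
# Speedup = faster backend's tok/s divided by the slower one's, from the table above.
def speedup(fast_tok_s: float, slow_tok_s: float) -> float:
    return round(fast_tok_s / slow_tok_s, 1)

print(speedup(9.5, 5.86))  # Mistral 7B: OpenVINO CPU over Ollama -> 1.6
print(speedup(4.50, 3.4))  # Llama 3.1 8B: Ollama over OpenVINO CPU -> 1.3
```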
Key Findings:
- ⚡ Framework performance is model-specific: no single framework wins for all models
- 🖥️ GPU provides no meaningful advantage on Intel Iris Xe integrated graphics
- ✅ Ollama recommended for simplicity plus competitive performance
- 🔧 OpenVINO CPU is worth the setup complexity only for maximum Mistral 7B speed
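For context on how figures like these are produced: decode throughput is just generated tokens divided by wall-clock time. A minimal, backend-agnostic sketch — the `fake_generate` stand-in is hypothetical so the snippet runs without any backend installed:

```python
import time

def measure_tokens_per_second(generate, prompt, max_new_tokens=100):
    """Time a generation call and report decode throughput.
    `generate` is any callable returning a list of generated tokens."""
    start = time.perf_counter()
    tokens = generate(prompt, max_new_tokens)
    elapsed = time.perf_counter() - start
    return len(tokens) / elapsed

# Stand-in "model" so the sketch runs anywhere; swap in a real backend call.
def fake_generate(prompt, max_new_tokens):
    time.sleep(0.01)
    return ["tok"] * max_new_tokens

print(f"{measure_tokens_per_second(fake_generate, 'hi'):.0f} tok/s")
```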
📊 Full analysis: `COMPREHENSIVE_PERFORMANCE_COMPARISON.md`
- `COMPREHENSIVE_PERFORMANCE_COMPARISON.md` - Complete test results & analysis
- `llm-benchmark-charts.html` - Performance comparison charts
- `GETTING_STARTED.md` - Complete beginner's guide
- `BENCHMARK_GUIDE.md` - Performance testing methodology
- `QUICK_PERFORMANCE_REFERENCE.md` - Quick performance guide
- `CONTRIBUTING.md` - How to contribute
- GPU: Intel Xe Graphics (Tiger Lake, Alder Lake, or newer)
  - Examples: Intel Iris Xe, Intel Arc Graphics
- OS: Ubuntu 22.04+ (or compatible Debian-based distributions)
- RAM: 8GB+ recommended
- Python: 3.8+ (for OpenVINO GenAI)
- Ollama ✅ Recommended - Start Here
  ```bash
  curl -fsSL https://ollama.com/install.sh | sh
  ollama pull qwen3-vl:8b-instruct
  ```
  - Simplest setup (single binary, no drivers)
  - Best performance for Llama 3.1 8B and Qwen3-VL
  - Extensive model library via `ollama pull`
- `setup-intel-gpu-llm.sh` (Optional - For Maximum Mistral 7B Speed)
  - Sets up OpenVINO GenAI with Intel GPU/CPU support
  - Creates an isolated Python virtual environment
  - Installs Intel compute runtime drivers
  - Use if: you need maximum Mistral 7B performance (9.5 tok/s vs 5.86 tok/s)
- `setup-llama-cpp.sh` (Optional - For Benchmarking)
  - Builds llama.cpp for CPU-only inference
  - Used as the performance comparison baseline
  - Creates convenience wrapper scripts
- `activate-intel-gpu.sh`
  - Helper script to activate the OpenVINO Python environment
  - Auto-generated by `setup-intel-gpu-llm.sh`
- `test-inference.py`
  - Tests individual models with OpenVINO on GPU or CPU
  - Supports Phi-3, Mistral, Llama 3
  - Flexible CLI with streaming support
- `test-models.sh`
  - Interactive menu for testing models
  - Handles model conversion automatically
- `benchmark.py` ⚡ Performance Comparison
  - Compares OpenVINO GPU vs llama.cpp CPU
  - Measures tokens/second and latency
  - Generates detailed performance reports
```bash
# Install Ollama (single command)
curl -fsSL https://ollama.com/install.sh | sh

# Pull recommended models
ollama pull qwen3-vl:8b-instruct       # Best all-around (vision + text)
ollama pull llama3.1:8b-instruct-q4_0  # Best tool calling
ollama pull mistral:7b-instruct-q4_0   # Speed fallback

# Start using immediately
ollama run qwen3-vl:8b-instruct "Explain quantum computing"
```

Why Ollama?
- ✅ No driver installation needed
- ✅ Best performance for Llama 3.1 8B (4.50 tok/s)
- ✅ Competitive performance for other models
- ✅ Huge model library
- ✅ Works immediately
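Beyond the CLI, a running `ollama serve` also exposes a local REST API on port 11434. A minimal sketch using only the standard library — the helper name `build_generate_request` is ours, not Ollama's:

```python
import json
import urllib.request

def build_generate_request(model, prompt, host="http://localhost:11434"):
    """Build a POST for Ollama's local REST API (/api/generate).
    stream=False requests a single JSON reply instead of a token stream."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

# With `ollama serve` running and the model pulled, you could then do:
# req = build_generate_request("qwen3-vl:8b-instruct", "Explain quantum computing")
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read())["response"])
```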
Only needed if you want maximum Mistral 7B performance (9.5 tok/s vs 5.86 tok/s with Ollama):

```bash
# Clone the repository
git clone <your-repo-url>
cd intel-gpu-llm-inference
git submodule update --init --recursive

# Run setup (one time, ~10 minutes)
./setup-intel-gpu-llm.sh
# If prompted, log out and back in for group changes

# Activate the environment
source activate-intel-gpu.sh

# Test with Mistral 7B
python test-inference.py --model mistral --prompt "Your prompt here"
```

To compare frameworks on the same model:

```bash
# 1. Install Ollama (for comparison)
curl -fsSL https://ollama.com/install.sh | sh
ollama pull mistral:7b-instruct-q4_0

# 2. Set up OpenVINO
./setup-intel-gpu-llm.sh
source activate-intel-gpu.sh

# 3. Test the same model on both frameworks
time ollama run mistral:7b-instruct-q4_0 "Your prompt" --verbose
python test-inference.py --model mistral --device CPU --prompt "Your prompt"
python test-inference.py --model mistral --device GPU --prompt "Your prompt"

# 4. Run comprehensive benchmarks
./benchmark.py \
  --openvino-model mistral_7b_ir \
  --llama-model models/mistral-7b-q4.gguf \
  --prompt "Explain quantum computing" \
  --max-tokens 100
```

What Ollama installs:
- System: a single binary at `/usr/local/bin/ollama`
- Models: downloaded to `~/.ollama/models/` (managed automatically)
- Size: ~15GB per 8B model (GGUF Q4_0 quantization)

What the OpenVINO setup installs:
- System packages: Intel OpenCL ICD, Intel Level Zero GPU drivers
- Python environment: virtual environment in `openvino_env/`
- Python packages: `openvino-genai`, `optimum-intel[openvino]`
- Models: converted to `*_ir/` directories (~15GB per 8B model)
If you prefer manual installation:

```bash
# Add the Intel GPU repository
wget -qO - https://repositories.intel.com/gpu/intel-graphics.key | \
  gpg --dearmor | sudo tee /usr/share/keyrings/intel-graphics.gpg > /dev/null
echo "deb [arch=amd64,i386 signed-by=/usr/share/keyrings/intel-graphics.gpg] https://repositories.intel.com/gpu/ubuntu jammy client" | \
  sudo tee /etc/apt/sources.list.d/intel-gpu-jammy.list
sudo apt update
sudo apt install -y intel-opencl-icd intel-level-zero-gpu level-zero

# Add your user to the render group
sudo usermod -aG render $USER
# Log out and back in for the changes to take effect

# Create the Python environment
python3 -m venv openvino_env
source openvino_env/bin/activate
pip install --upgrade pip
pip install openvino-genai optimum-intel[openvino]
```

Using Ollama:

```bash
# Basic usage
ollama run qwen3-vl:8b-instruct "Explain quantum computing"

# With timing
time ollama run llama3.1:8b-instruct-q4_0 "Your prompt here" --verbose

# Interactive chat
ollama run qwen3-vl:8b-instruct

# List available models
ollama list

# Remove a model
ollama rm mistral:7b-instruct-q4_0
```

Testing models with OpenVINO:

```bash
# Activate the environment first
source activate-intel-gpu.sh

# Test on CPU (recommended)
python test-inference.py --model mistral --device CPU --prompt "Write a story"

# Test on GPU (for comparison)
python test-inference.py --model mistral --device GPU --prompt "Write a story"

# With streaming
python test-inference.py --model mistral --stream --prompt "Write a poem"

# Test Llama 3.1 8B (requires HuggingFace auth)
huggingface-cli login
python test-inference.py --model llama31 --device CPU --prompt "What is AI?"
```

Benchmarking:

```bash
# Quick comparison
./benchmark.py \
  --openvino-model phi3_mini_ir \
  --llama-model models/phi3-mini-q4.gguf \
  --prompt "Write a story about robots"

# Detailed comparison with multiple runs
./benchmark.py --compare \
  --openvino-model mistral_7b_ir \
  --llama-model models/mistral-7b-q4.gguf \
  --runs 3 \
  --output results.json

# GPU-only benchmark
./benchmark.py --openvino-model phi3_mini_ir --gpu-only
```

Using OpenVINO GenAI from Python:

```python
import openvino_genai as ov_genai

# Use CPU (recommended - equals or beats GPU performance)
pipe = ov_genai.LLMPipeline("mistral_7b_ir", "CPU")

# Generate text
prompt = "Explain quantum computing in simple terms:"
response = pipe.generate(prompt, max_new_tokens=200)
print(response)

# For GPU comparison (not recommended on Iris Xe)
pipe_gpu = ov_genai.LLMPipeline("mistral_7b_ir", "GPU")
response_gpu = pipe_gpu.generate(prompt, max_new_tokens=200)
```

Streaming generation:

```python
import openvino_genai as ov_genai

# CPU recommended
pipe = ov_genai.LLMPipeline("mistral_7b_ir", "CPU")

# Streaming callback
def stream_callback(text):
    print(text, end='', flush=True)

config = ov_genai.GenerationConfig()
config.max_new_tokens = 100
pipe.generate("Write a short story:", config, stream_callback)
```

Managing Ollama models:

```bash
# Browse available models
ollama list | head

# Pull any model from the library
ollama pull qwen3-vl:8b-instruct
ollama pull llama3.1:8b-instruct-q4_0
ollama pull mistral:7b-instruct-q4_0
ollama pull phi3:3.8b-mini-instruct-4k-q4_0
# Models are automatically quantized and optimized
```

Converting models for OpenVINO:

```bash
# Activate the environment
source activate-intel-gpu.sh

# Export from Hugging Face to OpenVINO IR (int4 quantization)
optimum-cli export openvino \
  --model mistralai/Mistral-7B-Instruct-v0.2 \
  mistral_7b_ir \
  --weight-format int4

# Some models require HuggingFace authentication
huggingface-cli login
optimum-cli export openvino \
  --model meta-llama/Llama-3.1-8B-Instruct \
  llama31_8b_ir \
  --weight-format int4
```

GPU troubleshooting:

```bash
# Check whether the GPU is visible
lspci | grep -i vga

# Verify device files
ls -la /dev/dri/

# Check OpenCL detection
clinfo -l

# Verify user permissions
groups | grep render
```

OpenVINO environment troubleshooting:

```bash
# Ensure the virtual environment is activated
source openvino_env/bin/activate

# Reinstall if needed
pip install --force-reinstall openvino-genai
```

Performance tips:
- Use CPU instead of GPU: on Intel Iris Xe, CPU equals or beats GPU performance
- Try Ollama: simpler setup, competitive performance
- Monitor usage: `intel_gpu_top` (install: `sudo apt install intel-gpu-tools`)
- Check thermal throttling: monitor temperatures with `sensors`
- Reduce context length: use smaller `max_new_tokens` values
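The last tip works because generation time scales roughly linearly with output length. A back-of-the-envelope helper — the name and the linearity assumption are ours — using the throughputs measured above:

```python
def estimated_seconds(max_new_tokens: int, tokens_per_second: float) -> float:
    """Rough generation-time estimate: decoding dominates, and it is
    approximately linear in the number of generated tokens."""
    return max_new_tokens / tokens_per_second

# Mistral 7B: OpenVINO CPU (~9.5 tok/s) vs Ollama CPU (~5.86 tok/s)
print(round(estimated_seconds(200, 9.5), 1))   # -> 21.1 seconds
print(round(estimated_seconds(200, 5.86), 1))  # -> 34.1 seconds
```

Halving `max_new_tokens` roughly halves the wait, which is why trimming output length is often the cheapest speedup available.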
Use Ollama if:
- ✅ You want simplicity (single-command install)
- ✅ You're using Llama 3.1 8B (Ollama is 1.3x faster than OpenVINO)
- ✅ You want fast model loading (0.22s for Llama 3.1)
- ✅ You don't want to manage GPU drivers

Use OpenVINO if:
- ✅ You need maximum Mistral 7B speed (9.5 tok/s vs 5.86 tok/s with Ollama)
- ✅ You want to benchmark GPU vs CPU
- ✅ You need custom model conversions
- ⚠️ You're willing to invest setup time for a 1.6x speedup on one model

Don't bother with the GPU on Intel Iris Xe:
- ❌ No meaningful performance advantage over CPU
- ❌ Significantly longer load times (10-15s vs 0.2-6s)
- ❌ Complex driver setup for no benefit
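The decision rule above fits in a few lines — a sketch, with `pick_device` being a hypothetical helper, not part of this repo:

```python
def pick_device(available_devices, integrated_gpu=True):
    """Prefer CPU on integrated Intel Xe, where it ties or beats the GPU.
    Dedicated GPUs (e.g. Intel Arc) were not tested and may behave differently."""
    if "GPU" in available_devices and not integrated_gpu:
        return "GPU"
    return "CPU"

print(pick_device(["CPU", "GPU"]))                        # integrated Xe -> CPU
print(pick_device(["CPU", "GPU"], integrated_gpu=False))  # untested territory -> GPU
```

The returned string can be passed directly as the device argument of `ov_genai.LLMPipeline`.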
```
.
├── README.md                   # This file
├── .gitignore                  # Git ignore rules
│
├── Setup Scripts
│   ├── setup-intel-gpu-llm.sh      # OpenVINO GenAI setup (GPU)
│   ├── setup-llama-cpp.sh          # llama.cpp build (CPU comparison)
│   ├── setup-ollama-intel-gpu.sh   # Ollama setup (experimental)
│   └── activate-intel-gpu.sh       # Environment activation helper
│
├── Testing & Benchmarking
│   ├── test-inference.py       # Test individual models
│   ├── test-models.sh          # Interactive model testing
│   └── benchmark.py            # Performance comparison tool
│
└── Models & Environments
    ├── openvino_env/           # Python venv (excluded from git)
    ├── *_ir/                   # OpenVINO IR models (excluded from git)
    ├── models/                 # GGUF models for llama.cpp (excluded from git)
    └── llama.cpp/              # llama.cpp source & build
```
- GPU vs CPU: the Intel Iris Xe integrated GPU provides no meaningful advantage over the CPU for 4-8B models
  - Tested on the i7-1185G7: GPU tied or slower than CPU in all tests
  - CPU has faster load times (0.2-6s vs 10-15s on GPU)
  - Recommendation: use CPU-only inference on integrated Intel GPUs
- Framework performance is model-specific:
  - Mistral 7B: OpenVINO 1.6x faster than Ollama (9.5 vs 5.86 tok/s)
  - Llama 3.1 8B: Ollama 1.3x faster than OpenVINO (4.50 vs 3.4 tok/s)
  - No single framework wins for all models
- Model size constraints:
  - Intel Xe integrated GPUs share system RAM
  - 8B models work well; 14B+ models are not recommended (memory pressure)
  - Tested successfully: 1.1B, 3.8B, 7B, and 8B parameter models
- Best performance setup (based on testing):
  - Primary: Ollama with Qwen3-VL 8B (5.14 tok/s, vision + text)
  - Tool calling: Ollama with Llama 3.1 8B (4.50 tok/s, best function calling)
  - Max speed: OpenVINO CPU with Mistral 7B (9.5 tok/s, text only)
- Ollama Intel GPU Support: Limited - use CPU mode instead (better performance anyway)
- Driver Support: OpenVINO requires recent Linux kernel (5.15+) for GPU features
- Dedicated GPUs: Results may differ significantly on Intel Arc dedicated GPUs (not tested)
We welcome contributions! See `CONTRIBUTING.md` for guidelines.
Especially valuable:
- Hardware test reports from different Intel GPU generations (Arc A-series, newer Xe)
- Performance benchmarks on dedicated Intel Arc GPUs (results will likely differ)
- Testing on newer CPU generations (13th/14th gen Intel)
- Framework comparisons with other models (Gemma, Phi-3.5, etc.)
- Bug fixes and documentation improvements
- Model optimization tips and configurations
Note: Current findings are specific to Intel Core i7-1185G7 + Iris Xe integrated graphics. Results on dedicated Arc GPUs or newer CPUs may show different GPU vs CPU performance characteristics.
MIT License - See LICENSE for details.
- Ollama for creating the simplest LLM inference tool
- Intel for OpenVINO toolkit and GPU drivers
- Hugging Face for model hosting and optimization tools
- OpenVINO community for documentation and support
This toolkit demonstrates that Intel GPU acceleration isn't always necessary for local LLM inference. Based on comprehensive testing:
- For most users: Ollama on CPU is the best choice (simple + fast)
- For maximum Mistral 7B speed: OpenVINO CPU is worth the setup (1.6x faster)
- For Intel Iris Xe integrated GPUs: Skip GPU setup entirely (no performance benefit)
The project provides tools to test and validate these findings on your own hardware, as results may vary with different Intel GPU generations.
Note: This is a community project with honest, evidence-based recommendations. Results are specific to tested hardware (Intel Core i7-1185G7 + Iris Xe). For production deployments or different hardware, conduct your own benchmarks using the included tools.