SteveLeve/intel-gpu-llm-inference
🚀 Intel GPU LLM Inference Toolkit for Linux

Multi-backend LLM inference with comprehensive performance testing: Ollama, OpenVINO, and llama.cpp

License: MIT Intel GPU OpenVINO

A comprehensive toolkit for running Large Language Models locally with three inference backends: Ollama (recommended), OpenVINO GenAI, and llama.cpp. Includes extensive benchmarking tools and performance comparisons to help you choose the best setup for your hardware.

🎯 Why This Project?

  • 📊 Evidence-Based: 11 configurations tested across 5 models with detailed performance data
  • ⚡ Multiple Backends: Compare Ollama, OpenVINO (GPU/CPU), and llama.cpp
  • 🎯 Honest Results: Nuanced findings show GPU isn't always faster on integrated Intel Xe
  • 💼 Practical Guidance: Clear recommendations based on actual hardware testing
  • 🔒 Privacy: Run models locally without cloud dependencies
  • 🎓 Complete Toolkit: Setup scripts, benchmarking tools, and comprehensive documentation

📊 Performance Results

Tested on Intel Core i7-1185G7 + Iris Xe Graphics (11 configurations across 5 models):

Recommended Setup (Ollama on CPU)

| Model | Speed | Capabilities | Best For |
|---|---|---|---|
| Qwen3-VL 8B | 5.14 tok/s | Vision + Text | General use 🏆 |
| Llama 3.1 8B | 4.50 tok/s | Tool calling | Function calling ⭐ |
| Mistral 7B | 5.86 tok/s | Text only | Speed fallback |

Setup: `ollama pull qwen3-vl:8b-instruct` - single command, no drivers needed

Framework Comparison (Same Models)

| Model | Ollama CPU | OpenVINO CPU | OpenVINO GPU | Winner |
|---|---|---|---|---|
| Mistral 7B | 5.86 tok/s | 9.5 tok/s | 9.4 tok/s | OpenVINO CPU (1.6x) |
| Llama 3.1 8B | 4.50 tok/s | 3.4 tok/s | 4.3 tok/s | Ollama (1.3x) |

Key Findings:

  • ⚡ Framework performance is model-specific - no single framework wins for all models
  • 🖥️ GPU provides NO meaningful advantage on Intel Iris Xe integrated graphics
  • ✅ Ollama recommended for simplicity + competitive performance
  • 🔧 OpenVINO CPU worth setup complexity only for maximum Mistral 7B speed

📈 Full analysis: COMPREHENSIVE_PERFORMANCE_COMPARISON.md

📖 Documentation

🖥️ Hardware Requirements

  • GPU: Intel Xe Graphics (TigerLake, Alderlake, or newer)
    • Examples: Intel Iris Xe, Intel Arc Graphics
  • OS: Ubuntu 22.04+ (or compatible Debian-based distributions)
  • RAM: 8GB+ recommended
  • Python: 3.8+ (for OpenVINO GenAI)

📋 What's Included

Setup Scripts

  1. Ollama Setup ⭐ Recommended - Start Here

    curl -fsSL https://ollama.com/install.sh | sh
    ollama pull qwen3-vl:8b-instruct
    • Simplest setup (single binary, no drivers)
    • Best performance for Llama 3.1 8B and Qwen3-VL
    • Extensive model library via ollama pull
  2. setup-intel-gpu-llm.sh (Optional - For Maximum Mistral 7B Speed)

    • Sets up OpenVINO GenAI with Intel GPU/CPU support
    • Creates isolated Python virtual environment
    • Installs Intel compute runtime drivers
    • Use if: You need maximum Mistral 7B performance (9.5 tok/s vs 5.86 tok/s)
  3. setup-llama-cpp.sh (Optional - For Benchmarking)

    • Builds llama.cpp for CPU-only inference
    • Used for performance comparison baseline
    • Creates convenience wrapper scripts
  4. activate-intel-gpu.sh

    • Helper script to activate the OpenVINO Python environment
    • Auto-generated by setup-intel-gpu-llm.sh

Testing & Benchmarking Scripts

  1. test-inference.py

    • Test individual models with OpenVINO GPU
    • Supports Phi-3, Mistral, Llama 3
    • Flexible CLI with streaming support
  2. test-models.sh

    • Interactive menu for testing models
    • Handles model conversion automatically
  3. benchmark.py ⭐ Performance Comparison

    • Compare OpenVINO GPU vs llama.cpp CPU
    • Measures tokens/second, latency
    • Generates detailed performance reports

🚀 Quick Start

⚡ Recommended: Ollama Setup (Simplest & Fast)

# Install Ollama (single command)
curl -fsSL https://ollama.com/install.sh | sh

# Pull recommended models
ollama pull qwen3-vl:8b-instruct      # Best all-around (vision + text)
ollama pull llama3.1:8b-instruct-q4_0 # Best tool calling
ollama pull mistral:7b-instruct-q4_0  # Speed fallback

# Start using immediately
ollama run qwen3-vl:8b-instruct "Explain quantum computing"

Why Ollama?

  • ✅ No driver installation needed
  • ✅ Best performance for Llama 3.1 8B (4.50 tok/s)
  • ✅ Competitive performance for other models
  • ✅ Huge model library
  • ✅ Works immediately
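
Beyond the CLI, Ollama also serves a local HTTP API on port 11434. A minimal sketch, assuming a default install and the model tag used above: the final (non-streaming) `/api/generate` response reports `eval_count` (generated tokens) and `eval_duration` (nanoseconds), from which tok/s figures like those in the tables can be derived.

```python
import json
import urllib.request

def ask_ollama(prompt, model="mistral:7b-instruct-q4_0"):
    """One-shot, non-streaming request to a local Ollama server."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as r:
        return json.load(r)

def generation_speed(resp):
    """Tokens/second from a final /api/generate response: eval_count is
    the number of generated tokens, eval_duration is in nanoseconds."""
    return resp["eval_count"] / (resp["eval_duration"] / 1e9)

# Usage (needs `ollama serve` running and the model pulled):
#   resp = ask_ollama("Explain quantum computing in one sentence.")
#   print(resp["response"], f'{generation_speed(resp):.2f} tok/s')
```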

🔧 Optional: OpenVINO Setup (For Maximum Mistral Speed)

Only needed if you want maximum Mistral 7B performance (9.5 tok/s vs 5.86 tok/s Ollama):

# Clone repository
git clone <your-repo-url>
cd intel-gpu-llm-inference
git submodule update --init --recursive

# Run setup (one time, ~10 minutes)
./setup-intel-gpu-llm.sh

# If prompted, log out and back in for group changes
# Activate environment
source activate-intel-gpu.sh

# Test with Mistral 7B
python test-inference.py --model mistral --prompt "Your prompt here"

📊 Benchmarking Different Frameworks

# 1. Install Ollama (for comparison)
curl -fsSL https://ollama.com/install.sh | sh
ollama pull mistral:7b-instruct-q4_0

# 2. Setup OpenVINO
./setup-intel-gpu-llm.sh
source activate-intel-gpu.sh

# 3. Test same model on both frameworks
time ollama run mistral:7b-instruct-q4_0 "Your prompt" --verbose
python test-inference.py --model mistral --device CPU --prompt "Your prompt"
python test-inference.py --model mistral --device GPU --prompt "Your prompt"

# 4. Run comprehensive benchmarks
./benchmark.py \
  --openvino-model mistral_7b_ir \
  --llama-model models/mistral-7b-q4.gguf \
  --prompt "Explain quantum computing" \
  --max-tokens 100
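
As an illustration of roughly how a tok/s number can be measured (benchmark.py's actual accounting may differ), here is a backend-agnostic timing wrapper; whitespace token counting is a crude approximation used only when a backend doesn't report exact eval counts.

```python
import time

def timed_generate(generate_fn, prompt):
    """Run any text-generation callable and estimate throughput.

    Token count is approximated by whitespace splitting - a rough
    proxy when the backend doesn't report exact token counts.
    """
    start = time.perf_counter()
    text = generate_fn(prompt)
    elapsed = time.perf_counter() - start
    tokens = len(text.split())
    return text, tokens / elapsed if elapsed > 0 else 0.0

# Example with OpenVINO GenAI (assumes a converted model directory):
#   import openvino_genai as ov_genai
#   pipe = ov_genai.LLMPipeline("mistral_7b_ir", "CPU")
#   text, tok_s = timed_generate(
#       lambda p: pipe.generate(p, max_new_tokens=100),
#       "Explain quantum computing")
#   print(f"{tok_s:.2f} tok/s (approx.)")
```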

📦 What Gets Installed

Ollama (Recommended)

  • System: Single binary installed to /usr/local/bin/ollama
  • Models: Downloaded to ~/.ollama/models/ (managed automatically)
  • Size: ~15GB per 8B model (GGUF Q4_0 quantization)

OpenVINO Setup (Optional)

  • System Packages: Intel OpenCL ICD, Intel Level Zero GPU drivers
  • Python Environment: Virtual environment in openvino_env/
  • Python Packages: openvino-genai, optimum-intel[openvino]
  • Models: Converted to *_ir/ directories (~15GB per 8B model)

🔧 Manual Setup

If you prefer manual installation:

1. Install Intel GPU Drivers

# Add Intel GPU repository
wget -qO - https://repositories.intel.com/gpu/intel-graphics.key | \
    gpg --dearmor | sudo tee /usr/share/keyrings/intel-graphics.gpg > /dev/null

echo "deb [arch=amd64,i386 signed-by=/usr/share/keyrings/intel-graphics.gpg] https://repositories.intel.com/gpu/ubuntu jammy client" | \
    sudo tee /etc/apt/sources.list.d/intel-gpu-jammy.list

sudo apt update
sudo apt install -y intel-opencl-icd intel-level-zero-gpu level-zero

2. Add User to Render Group

sudo usermod -aG render $USER
# Log out and back in for changes to take effect

3. Setup Python Environment

python3 -m venv openvino_env
source openvino_env/bin/activate
pip install --upgrade pip
pip install openvino-genai "optimum-intel[openvino]"  # quote the extras so the shell doesn't glob-expand the brackets
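
To verify the environment afterwards, a small sketch: `openvino.Core().available_devices` lists the plugins OpenVINO can see (e.g. "CPU", "GPU"), and the helper below encodes this guide's "prefer CPU on Iris Xe" recommendation in plain Python.

```python
def detect_devices():
    """Return OpenVINO's device list, or ['CPU'] if OpenVINO is missing."""
    try:
        import openvino as ov
        return ov.Core().available_devices  # e.g. ['CPU', 'GPU']
    except ImportError:
        return ["CPU"]  # fall back when the venv isn't set up yet

def pick_device(available, prefer="CPU"):
    """Choose an inference device from the reported list.

    On Intel Iris Xe, CPU equals or beats GPU, so CPU is preferred
    even when a GPU plugin is present.
    """
    return prefer if prefer in available else available[0]

# print("Using", pick_device(detect_devices()))
```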

📚 Usage Examples

Using Ollama (Recommended)

# Basic usage
ollama run qwen3-vl:8b-instruct "Explain quantum computing"

# With timing
time ollama run llama3.1:8b-instruct-q4_0 "Your prompt here" --verbose

# Interactive chat
ollama run qwen3-vl:8b-instruct

# List available models
ollama list

# Remove a model
ollama rm mistral:7b-instruct-q4_0

Using OpenVINO (For Benchmarking)

# Activate environment first
source activate-intel-gpu.sh

# Test on CPU (recommended)
python test-inference.py --model mistral --device CPU --prompt "Write a story"

# Test on GPU (for comparison)
python test-inference.py --model mistral --device GPU --prompt "Write a story"

# With streaming
python test-inference.py --model mistral --stream --prompt "Write a poem"

# Test Llama 3.1 8B (requires HuggingFace auth)
huggingface-cli login
python test-inference.py --model llama31 --device CPU --prompt "What is AI?"

Performance Benchmarking

# Quick comparison
./benchmark.py \
  --openvino-model phi3_mini_ir \
  --llama-model models/phi3-mini-q4.gguf \
  --prompt "Write a story about robots"

# Detailed comparison with multiple runs
./benchmark.py --compare \
  --openvino-model mistral_7b_ir \
  --llama-model models/mistral-7b-q4.gguf \
  --runs 3 \
  --output results.json

# GPU-only benchmark
./benchmark.py --openvino-model phi3_mini_ir --gpu-only

Python Inference Script (OpenVINO)

import openvino_genai as ov_genai

# Use CPU (recommended - equals or beats GPU performance)
pipe = ov_genai.LLMPipeline("mistral_7b_ir", "CPU")

# Generate text
prompt = "Explain quantum computing in simple terms:"
response = pipe.generate(prompt, max_new_tokens=200)
print(response)

# For GPU comparison (not recommended on Iris Xe)
pipe_gpu = ov_genai.LLMPipeline("mistral_7b_ir", "GPU")
response_gpu = pipe_gpu.generate(prompt, max_new_tokens=200)

Streaming Generation (OpenVINO)

import openvino_genai as ov_genai

# CPU recommended
pipe = ov_genai.LLMPipeline("mistral_7b_ir", "CPU")

# Streaming callback
def stream_callback(text):
    print(text, end='', flush=True)

config = ov_genai.GenerationConfig()
config.max_new_tokens = 100

pipe.generate("Write a short story:", config, stream_callback)

Download and Convert Models

Ollama (Easiest)

# List locally installed models (browse the full library at ollama.com/library)
ollama list

# Pull any model from library
ollama pull qwen3-vl:8b-instruct
ollama pull llama3.1:8b-instruct-q4_0
ollama pull mistral:7b-instruct-q4_0
ollama pull phi3:3.8b-mini-instruct-4k-q4_0

# Models automatically quantized and optimized

OpenVINO (For Custom Models)

# Activate environment
source activate-intel-gpu.sh

# Export from Hugging Face to OpenVINO IR (int4 quantization)
optimum-cli export openvino \
  --model mistralai/Mistral-7B-Instruct-v0.2 \
  mistral_7b_ir \
  --weight-format int4

# Some models require HuggingFace authentication
huggingface-cli login
optimum-cli export openvino \
  --model meta-llama/Llama-3.1-8B-Instruct \
  llama31_8b_ir \
  --weight-format int4

πŸ› Troubleshooting

GPU Not Detected

# Check if GPU is visible
lspci | grep -i vga

# Verify device files
ls -la /dev/dri/

# Check OpenCL detection (install clinfo if missing: sudo apt install clinfo)
clinfo -l

# Verify user permissions
groups | grep render

OpenVINO Import Errors

# Ensure virtual environment is activated
source openvino_env/bin/activate

# Reinstall if needed
pip install --force-reinstall openvino-genai

Performance Issues

  • Use CPU instead of GPU: On Intel Iris Xe, CPU equals or beats GPU performance
  • Try Ollama: Simpler setup, competitive performance
  • Monitor usage: intel_gpu_top (install: sudo apt install intel-gpu-tools)
  • Check thermal throttling: Monitor temperatures with sensors
  • Reduce context length: Use smaller max_new_tokens values

Which Backend Should I Use?

Use Ollama if:

  • ✅ You want simplicity (single command install)
  • ✅ You're using Llama 3.1 8B (Ollama 1.3x faster than OpenVINO)
  • ✅ You want fast model loading (0.22s for Llama 3.1)
  • ✅ You don't want to manage GPU drivers

Use OpenVINO if:

  • ✅ You need maximum Mistral 7B speed (9.5 tok/s vs 5.86 tok/s Ollama)
  • ✅ You want to benchmark GPU vs CPU
  • ✅ You need custom model conversions
  • ⚠️ You're willing to invest setup time for 1.6x speedup on one model

Don't bother with GPU on Intel Iris Xe:

  • ❌ No meaningful performance advantage over CPU
  • ❌ Significantly longer load times (10-15s vs 0.2-6s)
  • ❌ Complex driver setup for no benefit

πŸ“ Directory Structure

.
β”œβ”€β”€ README.md                    # This file
β”œβ”€β”€ .gitignore                   # Git ignore rules
β”‚
β”œβ”€β”€ Setup Scripts
β”œβ”€β”€ setup-intel-gpu-llm.sh       # OpenVINO GenAI setup (GPU)
β”œβ”€β”€ setup-llama-cpp.sh           # llama.cpp build (CPU comparison)
β”œβ”€β”€ setup-ollama-intel-gpu.sh    # Ollama setup (experimental)
β”œβ”€β”€ activate-intel-gpu.sh        # Environment activation helper
β”‚
β”œβ”€β”€ Testing & Benchmarking
β”œβ”€β”€ test-inference.py            # Test individual models
β”œβ”€β”€ test-models.sh               # Interactive model testing
β”œβ”€β”€ benchmark.py                 # Performance comparison tool
β”‚
β”œβ”€β”€ Models & Environments
β”œβ”€β”€ openvino_env/                # Python venv (excluded from git)
β”œβ”€β”€ *_ir/                        # OpenVINO IR models (excluded from git)
β”œβ”€β”€ models/                      # GGUF models for llama.cpp (excluded from git)
└── llama.cpp/                   # llama.cpp source & build

⚠️ Known Limitations & Findings

Hardware-Specific Findings (Intel Iris Xe)

  1. GPU vs CPU: Intel Iris Xe integrated GPU provides no meaningful advantage over CPU for 4-8B models

    • Tested on i7-1185G7: GPU tied or slower than CPU in all tests
    • CPU has faster load times (0.2-6s vs 10-15s GPU)
    • Recommendation: Use CPU-only inference on integrated Intel GPUs
  2. Framework Performance is Model-Specific:

    • Mistral 7B: OpenVINO 1.6x faster than Ollama (9.5 vs 5.86 tok/s)
    • Llama 3.1 8B: Ollama 1.3x faster than OpenVINO (4.50 vs 3.4 tok/s)
    • No single framework wins for all models
  3. Model Size Constraints:

    • Intel Xe integrated GPUs share system RAM
    • 8B models work well, 14B+ models not recommended (memory pressure)
    • Tested successfully: 1.1B, 3.8B, 7B, 8B parameter models
  4. Best Performance Setup (Based on Testing):

    • Primary: Ollama with Qwen3-VL 8B (5.14 tok/s, vision + text)
    • Tool calling: Ollama with Llama 3.1 8B (4.50 tok/s, best function calling)
    • Max speed: OpenVINO CPU with Mistral 7B (9.5 tok/s, text only)

General Limitations

  1. Ollama Intel GPU Support: Limited - use CPU mode instead (better performance anyway)
  2. Driver Support: OpenVINO requires recent Linux kernel (5.15+) for GPU features
  3. Dedicated GPUs: Results may differ significantly on Intel Arc dedicated GPUs (not tested)

🔗 Resources

🤝 Contributing

We welcome contributions! See CONTRIBUTING.md for guidelines.

Especially valuable:

  • Hardware test reports from different Intel GPU generations (Arc A-series, newer Xe)
  • Performance benchmarks on dedicated Intel Arc GPUs (results will likely differ)
  • Testing on newer CPU generations (13th/14th gen Intel)
  • Framework comparisons with other models (Gemma, Phi-3.5, etc.)
  • Bug fixes and documentation improvements
  • Model optimization tips and configurations

Note: Current findings are specific to Intel Core i7-1185G7 + Iris Xe integrated graphics. Results on dedicated Arc GPUs or newer CPUs may show different GPU vs CPU performance characteristics.

📄 License

MIT License - See LICENSE for details.

πŸ™ Acknowledgments

  • Ollama for creating the simplest LLM inference tool
  • Intel for OpenVINO toolkit and GPU drivers
  • Hugging Face for model hosting and optimization tools
  • OpenVINO community for documentation and support

πŸ“ Project Summary

This toolkit demonstrates that Intel GPU acceleration isn't always necessary for local LLM inference. Based on comprehensive testing:

  • For most users: Ollama on CPU is the best choice (simple + fast)
  • For maximum Mistral 7B speed: OpenVINO CPU is worth the setup (1.6x faster)
  • For Intel Iris Xe integrated GPUs: Skip GPU setup entirely (no performance benefit)

The project provides tools to test and validate these findings on your own hardware, as results may vary with different Intel GPU generations.

Note: This is a community project with honest, evidence-based recommendations. Results are specific to tested hardware (Intel Core i7-1185G7 + Iris Xe). For production deployments or different hardware, conduct your own benchmarks using the included tools.
