SteveLeve/intel-gpu-llm-inference
🚀 Intel GPU LLM Inference Toolkit for Linux

Multi-backend LLM inference with comprehensive performance testing: Ollama, OpenVINO, and llama.cpp

License: MIT Intel GPU OpenVINO

A comprehensive toolkit for running Large Language Models locally with three inference backends: Ollama (recommended), OpenVINO GenAI, and llama.cpp. Includes extensive benchmarking tools and performance comparisons to help you choose the best setup for your hardware.

🎯 Why This Project?

  • 📊 Evidence-Based: 11 configurations tested across 5 models with detailed performance data
  • ⚡ Multiple Backends: Compare Ollama, OpenVINO (GPU/CPU), and llama.cpp
  • 🎯 Honest Results: Nuanced findings show GPU isn't always faster on integrated Intel Xe
  • 💼 Practical Guidance: Clear recommendations based on actual hardware testing
  • 🔒 Privacy: Run models locally without cloud dependencies
  • 🎓 Complete Toolkit: Setup scripts, benchmarking tools, and comprehensive documentation

📊 Performance Results

Tested on Intel Core i7-1185G7 + Iris Xe Graphics (11 configurations across 5 models):

Recommended Setup (Ollama on CPU)

| Model | Speed | Capabilities | Best For |
|---|---|---|---|
| Qwen3-VL 8B | 5.14 tok/s | Vision + Text | General use 🏆 |
| Llama 3.1 8B | 4.50 tok/s | Tool calling | Function calling ⭐ |
| Mistral 7B | 5.86 tok/s | Text only | Speed fallback |

Setup: `ollama pull qwen3-vl:8b-instruct` - single command, no drivers needed

Framework Comparison (Same Models)

| Model | Ollama CPU | OpenVINO CPU | OpenVINO GPU | Winner |
|---|---|---|---|---|
| Mistral 7B | 5.86 tok/s | 9.5 tok/s | 9.4 tok/s | OpenVINO CPU (1.6x) |
| Llama 3.1 8B | 4.50 tok/s | 3.4 tok/s | 4.3 tok/s | Ollama (1.3x) |

Key Findings:

  • ⚡ Framework performance is model-specific - no single framework wins for all models
  • 🖥️ GPU provides NO meaningful advantage on Intel Iris Xe integrated graphics
  • ✅ Ollama recommended for simplicity + competitive performance
  • 🔧 OpenVINO CPU worth setup complexity only for maximum Mistral 7B speed

📈 Full analysis: COMPREHENSIVE_PERFORMANCE_COMPARISON.md

📖 Documentation

🖥️ Hardware Requirements

  • GPU: Intel Xe Graphics (TigerLake, Alderlake, or newer)
    • Examples: Intel Iris Xe, Intel Arc Graphics
  • OS: Ubuntu 22.04+ (or compatible Debian-based distributions)
  • RAM: 8GB+ recommended
  • Python: 3.8+ (for OpenVINO GenAI)

📋 What's Included

Setup Scripts

  1. Ollama Setup ⭐ Recommended - Start Here

    curl -fsSL https://ollama.com/install.sh | sh
    ollama pull qwen3-vl:8b-instruct
    • Simplest setup (single binary, no drivers)
    • Best performance for Llama 3.1 8B and Qwen3-VL
    • Extensive model library via ollama pull
  2. setup-intel-gpu-llm.sh (Optional - For Maximum Mistral 7B Speed)

    • Sets up OpenVINO GenAI with Intel GPU/CPU support
    • Creates isolated Python virtual environment
    • Installs Intel compute runtime drivers
    • Use if: You need maximum Mistral 7B performance (9.5 tok/s vs 5.86 tok/s)
  3. setup-llama-cpp.sh (Optional - For Benchmarking)

    • Builds llama.cpp for CPU-only inference
    • Used for performance comparison baseline
    • Creates convenience wrapper scripts
  4. activate-intel-gpu.sh

    • Helper script to activate the OpenVINO Python environment
    • Auto-generated by setup-intel-gpu-llm.sh

Testing & Benchmarking Scripts

  1. test-inference.py

    • Test individual models with OpenVINO GPU
    • Supports Phi-3, Mistral, Llama 3
    • Flexible CLI with streaming support
  2. test-models.sh

    • Interactive menu for testing models
    • Handles model conversion automatically
  3. benchmark.py ⭐ Performance Comparison

    • Compare OpenVINO GPU vs llama.cpp CPU
    • Measures tokens/second, latency
    • Generates detailed performance reports

🚀 Quick Start

⚡ Recommended: Ollama Setup (Simplest & Fast)

# Install Ollama (single command)
curl -fsSL https://ollama.com/install.sh | sh

# Pull recommended models
ollama pull qwen3-vl:8b-instruct      # Best all-around (vision + text)
ollama pull llama3.1:8b-instruct-q4_0 # Best tool calling
ollama pull mistral:7b-instruct-q4_0  # Speed fallback

# Start using immediately
ollama run qwen3-vl:8b-instruct "Explain quantum computing"

Why Ollama?

  • ✅ No driver installation needed
  • ✅ Best performance for Llama 3.1 8B (4.50 tok/s)
  • ✅ Competitive performance for other models
  • ✅ Huge model library
  • ✅ Works immediately
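
Beyond the CLI, Ollama also serves a local HTTP API on port 11434. A minimal sketch, assuming a default install and the model tag used above: the final (non-streaming) `/api/generate` response reports `eval_count` (generated tokens) and `eval_duration` (nanoseconds), from which tok/s figures like those in the tables can be derived.

```python
import json
import urllib.request

def ask_ollama(prompt, model="mistral:7b-instruct-q4_0"):
    """One-shot, non-streaming request to a local Ollama server."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as r:
        return json.load(r)

def generation_speed(resp):
    """Tokens/second from a final /api/generate response: eval_count is
    the number of generated tokens, eval_duration is in nanoseconds."""
    return resp["eval_count"] / (resp["eval_duration"] / 1e9)

# Usage (needs `ollama serve` running and the model pulled):
#   resp = ask_ollama("Explain quantum computing in one sentence.")
#   print(resp["response"], f'{generation_speed(resp):.2f} tok/s')
```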

🔧 Optional: OpenVINO Setup (For Maximum Mistral Speed)

Only needed if you want maximum Mistral 7B performance (9.5 tok/s vs 5.86 tok/s Ollama):

# Clone repository
git clone <your-repo-url>
cd intel-gpu-llm-inference
git submodule update --init --recursive

# Run setup (one time, ~10 minutes)
./setup-intel-gpu-llm.sh

# If prompted, log out and back in for group changes
# Activate environment
source activate-intel-gpu.sh

# Test with Mistral 7B
python test-inference.py --model mistral --prompt "Your prompt here"

📊 Benchmarking Different Frameworks

# 1. Install Ollama (for comparison)
curl -fsSL https://ollama.com/install.sh | sh
ollama pull mistral:7b-instruct-q4_0

# 2. Setup OpenVINO
./setup-intel-gpu-llm.sh
source activate-intel-gpu.sh

# 3. Test same model on both frameworks
time ollama run mistral:7b-instruct-q4_0 "Your prompt" --verbose
python test-inference.py --model mistral --device CPU --prompt "Your prompt"
python test-inference.py --model mistral --device GPU --prompt "Your prompt"

# 4. Run comprehensive benchmarks
./benchmark.py \
  --openvino-model mistral_7b_ir \
  --llama-model models/mistral-7b-q4.gguf \
  --prompt "Explain quantum computing" \
  --max-tokens 100
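
As an illustration of roughly how a tok/s number can be measured (benchmark.py's actual accounting may differ), here is a backend-agnostic timing wrapper; whitespace token counting is a crude approximation used only when a backend doesn't report exact eval counts.

```python
import time

def timed_generate(generate_fn, prompt):
    """Run any text-generation callable and estimate throughput.

    Token count is approximated by whitespace splitting - a rough
    proxy when the backend doesn't report exact token counts.
    """
    start = time.perf_counter()
    text = generate_fn(prompt)
    elapsed = time.perf_counter() - start
    tokens = len(text.split())
    return text, tokens / elapsed if elapsed > 0 else 0.0

# Example with OpenVINO GenAI (assumes a converted model directory):
#   import openvino_genai as ov_genai
#   pipe = ov_genai.LLMPipeline("mistral_7b_ir", "CPU")
#   text, tok_s = timed_generate(
#       lambda p: pipe.generate(p, max_new_tokens=100),
#       "Explain quantum computing")
#   print(f"{tok_s:.2f} tok/s (approx.)")
```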

📦 What Gets Installed

Ollama (Recommended)

  • System: Single binary installed to /usr/local/bin/ollama
  • Models: Downloaded to ~/.ollama/models/ (managed automatically)
  • Size: ~15GB per 8B model (GGUF Q4_0 quantization)

OpenVINO Setup (Optional)

  • System Packages: Intel OpenCL ICD, Intel Level Zero GPU drivers
  • Python Environment: Virtual environment in openvino_env/
  • Python Packages: openvino-genai, optimum-intel[openvino]
  • Models: Converted to *_ir/ directories (~15GB per 8B model)

🔧 Manual Setup

If you prefer manual installation:

1. Install Intel GPU Drivers

# Add Intel GPU repository
wget -qO - https://repositories.intel.com/gpu/intel-graphics.key | \
    gpg --dearmor | sudo tee /usr/share/keyrings/intel-graphics.gpg > /dev/null

echo "deb [arch=amd64,i386 signed-by=/usr/share/keyrings/intel-graphics.gpg] https://repositories.intel.com/gpu/ubuntu jammy client" | \
    sudo tee /etc/apt/sources.list.d/intel-gpu-jammy.list

sudo apt update
sudo apt install -y intel-opencl-icd intel-level-zero-gpu level-zero

2. Add User to Render Group

sudo usermod -aG render $USER
# Log out and back in for changes to take effect

3. Setup Python Environment

python3 -m venv openvino_env
source openvino_env/bin/activate
pip install --upgrade pip
pip install openvino-genai "optimum-intel[openvino]"  # quote the extras so the shell doesn't glob-expand the brackets
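
To verify the environment afterwards, a small sketch: `openvino.Core().available_devices` lists the plugins OpenVINO can see (e.g. "CPU", "GPU"), and the helper below encodes this guide's "prefer CPU on Iris Xe" recommendation in plain Python.

```python
def detect_devices():
    """Return OpenVINO's device list, or ['CPU'] if OpenVINO is missing."""
    try:
        import openvino as ov
        return ov.Core().available_devices  # e.g. ['CPU', 'GPU']
    except ImportError:
        return ["CPU"]  # fall back when the venv isn't set up yet

def pick_device(available, prefer="CPU"):
    """Choose an inference device from the reported list.

    On Intel Iris Xe, CPU equals or beats GPU, so CPU is preferred
    even when a GPU plugin is present.
    """
    return prefer if prefer in available else available[0]

# print("Using", pick_device(detect_devices()))
```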

📚 Usage Examples

Using Ollama (Recommended)

# Basic usage
ollama run qwen3-vl:8b-instruct "Explain quantum computing"

# With timing
time ollama run llama3.1:8b-instruct-q4_0 "Your prompt here" --verbose

# Interactive chat
ollama run qwen3-vl:8b-instruct

# List available models
ollama list

# Remove a model
ollama rm mistral:7b-instruct-q4_0

Using OpenVINO (For Benchmarking)

# Activate environment first
source activate-intel-gpu.sh

# Test on CPU (recommended)
python test-inference.py --model mistral --device CPU --prompt "Write a story"

# Test on GPU (for comparison)
python test-inference.py --model mistral --device GPU --prompt "Write a story"

# With streaming
python test-inference.py --model mistral --stream --prompt "Write a poem"

# Test Llama 3.1 8B (requires HuggingFace auth)
huggingface-cli login
python test-inference.py --model llama31 --device CPU --prompt "What is AI?"

Performance Benchmarking

# Quick comparison
./benchmark.py \
  --openvino-model phi3_mini_ir \
  --llama-model models/phi3-mini-q4.gguf \
  --prompt "Write a story about robots"

# Detailed comparison with multiple runs
./benchmark.py --compare \
  --openvino-model mistral_7b_ir \
  --llama-model models/mistral-7b-q4.gguf \
  --runs 3 \
  --output results.json

# GPU-only benchmark
./benchmark.py --openvino-model phi3_mini_ir --gpu-only

Python Inference Script (OpenVINO)

import openvino_genai as ov_genai

# Use CPU (recommended - equals or beats GPU performance)
pipe = ov_genai.LLMPipeline("mistral_7b_ir", "CPU")

# Generate text
prompt = "Explain quantum computing in simple terms:"
response = pipe.generate(prompt, max_new_tokens=200)
print(response)

# For GPU comparison (not recommended on Iris Xe)
pipe_gpu = ov_genai.LLMPipeline("mistral_7b_ir", "GPU")
response_gpu = pipe_gpu.generate(prompt, max_new_tokens=200)

Streaming Generation (OpenVINO)

import openvino_genai as ov_genai

# CPU recommended
pipe = ov_genai.LLMPipeline("mistral_7b_ir", "CPU")

# Streaming callback
def stream_callback(text):
    print(text, end='', flush=True)

config = ov_genai.GenerationConfig()
config.max_new_tokens = 100

pipe.generate("Write a short story:", config, stream_callback)

Download and Convert Models

Ollama (Easiest)

# List locally installed models (browse the full library at ollama.com/library)
ollama list

# Pull any model from library
ollama pull qwen3-vl:8b-instruct
ollama pull llama3.1:8b-instruct-q4_0
ollama pull mistral:7b-instruct-q4_0
ollama pull phi3:3.8b-mini-instruct-4k-q4_0

# Models automatically quantized and optimized

OpenVINO (For Custom Models)

# Activate environment
source activate-intel-gpu.sh

# Export from Hugging Face to OpenVINO IR (int4 quantization)
optimum-cli export openvino \
  --model mistralai/Mistral-7B-Instruct-v0.2 \
  mistral_7b_ir \
  --weight-format int4

# Some models require HuggingFace authentication
huggingface-cli login
optimum-cli export openvino \
  --model meta-llama/Llama-3.1-8B-Instruct \
  llama31_8b_ir \
  --weight-format int4

πŸ› Troubleshooting

GPU Not Detected

# Check if GPU is visible
lspci | grep -i vga

# Verify device files
ls -la /dev/dri/

# Check OpenCL detection (install clinfo if missing: sudo apt install clinfo)
clinfo -l

# Verify user permissions
groups | grep render

OpenVINO Import Errors

# Ensure virtual environment is activated
source openvino_env/bin/activate

# Reinstall if needed
pip install --force-reinstall openvino-genai

Performance Issues

  • Use CPU instead of GPU: On Intel Iris Xe, CPU equals or beats GPU performance
  • Try Ollama: Simpler setup, competitive performance
  • Monitor usage: intel_gpu_top (install: sudo apt install intel-gpu-tools)
  • Check thermal throttling: Monitor temperatures with sensors
  • Reduce context length: Use smaller max_new_tokens values

Which Backend Should I Use?

Use Ollama if:

  • ✅ You want simplicity (single command install)
  • ✅ You're using Llama 3.1 8B (Ollama 1.3x faster than OpenVINO)
  • ✅ You want fast model loading (0.22s for Llama 3.1)
  • ✅ You don't want to manage GPU drivers

Use OpenVINO if:

  • ✅ You need maximum Mistral 7B speed (9.5 tok/s vs 5.86 tok/s Ollama)
  • ✅ You want to benchmark GPU vs CPU
  • ✅ You need custom model conversions
  • ⚠️ You're willing to invest setup time for 1.6x speedup on one model

Don't bother with GPU on Intel Iris Xe:

  • ❌ No meaningful performance advantage over CPU
  • ❌ Significantly longer load times (10-15s vs 0.2-6s)
  • ❌ Complex driver setup for no benefit

πŸ“ Directory Structure

.
β”œβ”€β”€ README.md                    # This file
β”œβ”€β”€ .gitignore                   # Git ignore rules
β”‚
β”œβ”€β”€ Setup Scripts
β”œβ”€β”€ setup-intel-gpu-llm.sh       # OpenVINO GenAI setup (GPU)
β”œβ”€β”€ setup-llama-cpp.sh           # llama.cpp build (CPU comparison)
β”œβ”€β”€ setup-ollama-intel-gpu.sh    # Ollama setup (experimental)
β”œβ”€β”€ activate-intel-gpu.sh        # Environment activation helper
β”‚
β”œβ”€β”€ Testing & Benchmarking
β”œβ”€β”€ test-inference.py            # Test individual models
β”œβ”€β”€ test-models.sh               # Interactive model testing
β”œβ”€β”€ benchmark.py                 # Performance comparison tool
β”‚
β”œβ”€β”€ Models & Environments
β”œβ”€β”€ openvino_env/                # Python venv (excluded from git)
β”œβ”€β”€ *_ir/                        # OpenVINO IR models (excluded from git)
β”œβ”€β”€ models/                      # GGUF models for llama.cpp (excluded from git)
└── llama.cpp/                   # llama.cpp source & build

⚠️ Known Limitations & Findings

Hardware-Specific Findings (Intel Iris Xe)

  1. GPU vs CPU: Intel Iris Xe integrated GPU provides no meaningful advantage over CPU for 4-8B models

    • Tested on i7-1185G7: GPU tied or slower than CPU in all tests
    • CPU has faster load times (0.2-6s vs 10-15s GPU)
    • Recommendation: Use CPU-only inference on integrated Intel GPUs
  2. Framework Performance is Model-Specific:

    • Mistral 7B: OpenVINO 1.6x faster than Ollama (9.5 vs 5.86 tok/s)
    • Llama 3.1 8B: Ollama 1.3x faster than OpenVINO (4.50 vs 3.4 tok/s)
    • No single framework wins for all models
  3. Model Size Constraints:

    • Intel Xe integrated GPUs share system RAM
    • 8B models work well, 14B+ models not recommended (memory pressure)
    • Tested successfully: 1.1B, 3.8B, 7B, 8B parameter models
  4. Best Performance Setup (Based on Testing):

    • Primary: Ollama with Qwen3-VL 8B (5.14 tok/s, vision + text)
    • Tool calling: Ollama with Llama 3.1 8B (4.50 tok/s, best function calling)
    • Max speed: OpenVINO CPU with Mistral 7B (9.5 tok/s, text only)

General Limitations

  1. Ollama Intel GPU Support: Limited - use CPU mode instead (better performance anyway)
  2. Driver Support: OpenVINO requires recent Linux kernel (5.15+) for GPU features
  3. Dedicated GPUs: Results may differ significantly on Intel Arc dedicated GPUs (not tested)

🔗 Resources

🤝 Contributing

We welcome contributions! See CONTRIBUTING.md for guidelines.

Especially valuable:

  • Hardware test reports from different Intel GPU generations (Arc A-series, newer Xe)
  • Performance benchmarks on dedicated Intel Arc GPUs (results will likely differ)
  • Testing on newer CPU generations (13th/14th gen Intel)
  • Framework comparisons with other models (Gemma, Phi-3.5, etc.)
  • Bug fixes and documentation improvements
  • Model optimization tips and configurations

Note: Current findings are specific to Intel Core i7-1185G7 + Iris Xe integrated graphics. Results on dedicated Arc GPUs or newer CPUs may show different GPU vs CPU performance characteristics.

📄 License

MIT License - See LICENSE for details.

πŸ™ Acknowledgments

  • Ollama for creating the simplest LLM inference tool
  • Intel for OpenVINO toolkit and GPU drivers
  • Hugging Face for model hosting and optimization tools
  • OpenVINO community for documentation and support

πŸ“ Project Summary

This toolkit demonstrates that Intel GPU acceleration isn't always necessary for local LLM inference. Based on comprehensive testing:

  • For most users: Ollama on CPU is the best choice (simple + fast)
  • For maximum Mistral 7B speed: OpenVINO CPU is worth the setup (1.6x faster)
  • For Intel Iris Xe integrated GPUs: Skip GPU setup entirely (no performance benefit)

The project provides tools to test and validate these findings on your own hardware, as results may vary with different Intel GPU generations.

Note: This is a community project with honest, evidence-based recommendations. Results are specific to tested hardware (Intel Core i7-1185G7 + Iris Xe). For production deployments or different hardware, conduct your own benchmarks using the included tools.
