
IPEX-LLM GPU Acceleration Exploration

Date: 2025-12-04
Hardware: Intel Core i7-1185G7 + Iris Xe Graphics (TigerLake-LP GT2)
Ollama Version: 0.13.1
Status: Exploration phase - NOT RECOMMENDED for implementation


Executive Summary

Explored enabling Intel GPU acceleration for Ollama via IPEX-LLM, Intel's LLM library built on Intel Extension for PyTorch. Conclusion: GPU acceleration is unlikely to provide a benefit on integrated Iris Xe graphics, based on existing OpenVINO GPU testing.

Key Finding

Previous OpenVINO GPU testing (documented in CLAUDE.md) demonstrated NO meaningful performance advantage:

  • Llama 3.1 8B: 4.50 tok/s (Ollama CPU) vs 4.3 tok/s (OpenVINO GPU)
  • GPU load times: 70x slower (15.4s vs 0.22s)
  • Similar technology stack: Both use Intel Level Zero + OpenCL drivers

Recommendation

Stick with current Ollama CPU setup - proven fast, simple, reliable.


Hardware Verification

System Check (2025-12-04)

$ lspci | grep -i vga
0000:00:02.0 VGA compatible controller: Intel Corporation TigerLake-LP GT2 [Iris Xe Graphics] (rev 01)

$ ls -la /dev/dri/
crw-rw----+ 1 root video  226,   1 Dec  3 13:19 card1
crw-rw----+ 1 root render 226, 128 Dec  2 09:19 renderD128

$ groups
steve-leve ... render  # ✓ GPU access confirmed

$ clinfo -l
Platform #0: Intel(R) OpenCL Graphics
 `-- Device #0: Intel(R) Iris(R) Xe Graphics  # ✓ OpenCL functional

Result: Hardware is technically supported but unlikely to benefit from GPU acceleration.

Installed Intel Packages

ii  intel-opencl-icd                    25.40.35563.7-1~24.04~ppa1  # GPU compute runtime
ii  level-zero                          1.16.15-881~22.04           # oneAPI Level Zero
ii  intel-oneapi-runtime-dpcpp-cpp      2025.3.1-760               # SYCL runtime
ii  intel-oneapi-runtime-opencl         2025.3.1-760               # OpenCL runtime

Status: Runtime components present, but oneAPI Base Toolkit NOT installed (missing /opt/intel/oneapi/setvars.sh).
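A quick check distinguishes this runtime-only state from a full install (a minimal sketch; /opt/intel/oneapi/setvars.sh is the toolkit's default entry point):

    # Verify whether the oneAPI Base Toolkit is installed, not just runtime packages
    if [ -f /opt/intel/oneapi/setvars.sh ]; then
        echo "oneAPI Base Toolkit present"
    else
        echo "oneAPI Base Toolkit missing - required for the IPEX-LLM setup below"
    fi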


IPEX-LLM Overview

What is IPEX-LLM?

IPEX-LLM is Intel's library for running large language models on Intel hardware, built on Intel Extension for PyTorch (IPEX). It provides optimized inference on Intel CPUs and GPUs.

Ollama Integration: Custom IPEX-LLM-optimized Ollama binary that can leverage Intel GPUs.

Technology Stack

  • oneAPI Base Toolkit: Intel's unified programming model
  • Level Zero: Low-level GPU interface
  • SYCL: Cross-platform abstraction layer
  • IPEX: PyTorch optimizations for Intel hardware

Supported Hardware

  • Discrete GPUs: Intel Arc A-series (A770, A750, A380)
  • Integrated GPUs: Iris Xe (11th-13th gen), UHD Graphics (select models)

Note: Iris Xe is supported, but integrated GPUs have fundamental limitations for LLM workloads.
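If the toolkit were installed, SYCL device visibility could be confirmed before any benchmarking (a sketch; sycl-ls ships with the oneAPI DPC++ toolchain, which is not currently installed here):

    # Enumerate SYCL backends; Iris Xe should appear under both Level Zero and OpenCL
    source /opt/intel/oneapi/setvars.sh
    sycl-ls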


Why GPU Likely Won't Help (Technical Analysis)

1. Existing Empirical Evidence

Your comprehensive OpenVINO testing already proved this:

| Model | Framework | Device | Speed | Load Time |
|---|---|---|---|---|
| Llama 3.1 8B | Ollama | CPU | 4.50 tok/s | 0.22s |
| Llama 3.1 8B | OpenVINO | GPU | 4.3 tok/s | 15.4s |
| Llama 3.1 8B | OpenVINO | CPU | 3.4 tok/s | 6.1s |

Interpretation: GPU was 5% slower than Ollama CPU, with 70x slower load times.

2. Integrated GPU Architectural Limitations

Memory Bottleneck:

  • Iris Xe shares system RAM (no dedicated VRAM)
  • Memory bandwidth: ~50 GB/s (shared with CPU)
  • LLM inference is memory-bound, not compute-bound
  • GPU compute advantage negated by memory bandwidth limit (see the estimate below)
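A back-of-envelope estimate makes that last point concrete (assumes ~0.56 bytes per weight for Q4_0 quantization and one full pass over the weights per generated token):

    Q4_0 8B weights:   8e9 params × ~0.56 bytes ≈ 4.5 GB
    Bandwidth ceiling: 50 GB/s ÷ 4.5 GB/token   ≈ 11 tok/s theoretical maximum
    Observed:          4.3-4.5 tok/s on CPU and GPU alike (~40% of ceiling)

Both devices sit under the same shared-bandwidth ceiling, so swapping compute units barely moves the result.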

Resource Competition:

  • CPU and GPU compete for same memory bus
  • OS/background processes reduce available bandwidth
  • No performance isolation

Thermal/Power Constraints:

  • Integrated GPUs throttle under sustained load
  • Share TDP budget with CPU cores
  • Cannot sustain peak performance
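If GPU testing were pursued, throttling could be observed live during generation (a sketch; intel_gpu_top comes from the intel-gpu-tools package, assumed not yet installed):

    # Watch GPU frequency and engine utilization during a sustained run;
    # a requested-vs-actual frequency gap indicates thermal/TDP throttling
    sudo apt install intel-gpu-tools
    sudo intel_gpu_top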

3. Technology Stack Overlap

IPEX-LLM uses the same driver stack as your OpenVINO setup:

  • Intel OpenCL drivers (intel-opencl-icd)
  • Level Zero runtime
  • Same GPU compute interface

Implication: If OpenVINO GPU didn't help, IPEX-LLM GPU likely won't either.
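The overlap is visible on disk: both frameworks reach the GPU through the same ICD registration (a sketch assuming the standard OpenCL ICD loader layout on Ubuntu):

    # List registered OpenCL vendor ICDs; intel-opencl-icd registers itself here,
    # and both OpenVINO and IPEX-LLM load the GPU through this same entry
    ls /etc/OpenCL/vendors/
    cat /etc/OpenCL/vendors/intel.icd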


Installation Requirements (If Pursued)

Full Setup Would Require:

  1. oneAPI Base Toolkit (~3GB download)

    wget https://registrationcenter-download.intel.com/akdlm/IRC_NAS/...
    sudo sh oneapi-basekit-*.sh
  2. IPEX-LLM for Ollama (Conda environment)

    conda create -n llm-ollama python=3.11
    conda activate llm-ollama
    pip install ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
  3. Environment Configuration

    export no_proxy=localhost,127.0.0.1
    export ZES_ENABLE_SYSMAN=1
    export OLLAMA_NUM_GPU=999
    source /opt/intel/oneapi/setvars.sh
    export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1  # Optional, sometimes helps
  4. Run IPEX-LLM Ollama Binary

    ./ollama serve  # IPEX-LLM custom binary
    ollama run llama3.1:8b-instruct-q4_0 "Test prompt"

Estimated Setup Time: 1-2 hours
Disk Space: ~5GB (toolkit + Python packages)
Complexity: High (environment management, driver compatibility)
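Steps 3-4 could be bundled into a launcher so the environment is always configured consistently (a hypothetical helper script, not part of the IPEX-LLM distribution; the binary path is assumed):

    #!/usr/bin/env bash
    # run-ipex-ollama.sh - hypothetical wrapper: set up oneAPI env, then serve
    set -e

    export no_proxy=localhost,127.0.0.1
    export ZES_ENABLE_SYSMAN=1            # expose GPU telemetry via Level Zero
    export OLLAMA_NUM_GPU=999             # request all layers offloaded to GPU
    source /opt/intel/oneapi/setvars.sh   # oneAPI Base Toolkit environment
    export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1  # optional toggle

    exec ./ollama serve                   # IPEX-LLM custom binary (path assumed)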


Performance Expectations

Realistic Prediction (Based on Existing Data)

| Metric | Current (Ollama CPU) | Predicted (IPEX-LLM GPU) |
|---|---|---|
| Speed | 4.50 tok/s | 3.5-4.5 tok/s |
| Load Time | 0.22s | 5-15s |
| Setup Complexity | Low (single binary) | High (toolkit + env) |
| Reliability | Excellent | Unknown (driver issues?) |

Expected Outcome: No improvement or slight degradation.

Why Discrete GPUs Perform Better

  • Dedicated VRAM: 8-16GB of high-bandwidth memory (500+ GB/s)
  • Memory isolation: No competition with CPU
  • Higher TDP: Sustained performance without throttling
  • Compute-oriented design: dedicated matrix engines (XMX on Arc)

Arc A770 Example: Would likely see 2-3x speedup over CPU for 8B models.


Scientific Value of Testing

Why You Might Still Want to Test

  1. Documentation completeness: Add IPEX-LLM data to your comprehensive benchmark suite
  2. Community value: Few public benchmarks for Iris Xe + IPEX-LLM + Ollama
  3. Verify hypothesis: Scientifically confirm GPU provides no benefit
  4. Future reference: If you upgrade to Arc GPU, setup already familiar

Recommended Test Methodology

If pursuing:

  1. Install IPEX-LLM with full environment
  2. Run Llama 3.1 8B with identical prompts (100 tokens)
  3. Test with/without SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS
  4. Compare: speed (tok/s), load time, memory usage
  5. Document in IPEX_LLM_PERFORMANCE_RESULTS.md

Control: Use existing Ollama CPU baseline (4.50 tok/s, 0.22s load)
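A minimal benchmark loop keeps the comparison honest (a sketch; it relies on the ollama CLI's --verbose flag, which prints load and eval-rate timings after each run):

    # Run an identical prompt three times and capture Ollama's timing lines;
    # repeat once against the CPU baseline and once against the IPEX-LLM binary
    MODEL="llama3.1:8b-instruct-q4_0"
    PROMPT="Explain how a CPU cache works in about 100 words."
    for i in 1 2 3; do
        ollama run "$MODEL" "$PROMPT" --verbose 2>&1 | grep -E "load duration|eval rate"
    done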


References

  1. IPEX-LLM Ollama Quickstart https://github.com/intel/ipex-llm/blob/main/docs/mddocs/Quickstart/llama3_llamacpp_ollama_quickstart.md

    • Environment variables: ZES_ENABLE_SYSMAN=1, OLLAMA_NUM_GPU=999
    • Requires oneAPI toolkit with setvars.sh
    • Conda-based Python environment
  2. Intel Builders Guide (PDF - not extracted) https://builders.intel.com/docs/networkbuilders/running-ollama-with-open-webui-on-intel-hardware-platform-1742810910.pdf

    • Likely covers end-to-end setup with Open WebUI
    • Could not extract text (binary PDF format)
  3. Existing Project Documentation

    • CLAUDE.md: Project overview with performance matrix
    • COMPREHENSIVE_PERFORMANCE_COMPARISON.md: Full 12-config benchmark
    • OpenVINO GPU testing: Established baseline proving GPU provides no benefit

Decision Matrix

| Factor | Ollama CPU (Current) | IPEX-LLM GPU |
|---|---|---|
| Performance | ⭐⭐⭐⭐⭐ Proven fast | ⭐⭐ Likely no improvement |
| Setup | ⭐⭐⭐⭐⭐ Simple | ⭐⭐ Complex (toolkit + env) |
| Reliability | ⭐⭐⭐⭐⭐ Rock solid | ⭐⭐⭐ Potential driver issues |
| Load Times | ⭐⭐⭐⭐⭐ 0.22s | ⭐⭐ 5-15s expected |
| Documentation | ⭐⭐⭐⭐⭐ Complete | ⭐ Would add data point |

Recommendation: Stick with Ollama CPU unless testing for scientific documentation purposes.


Conclusion

Final Assessment: Your hardware is supported, drivers are functional, but GPU acceleration is unlikely to provide meaningful benefit on integrated Iris Xe graphics.

Rationale:

  1. Empirical evidence from OpenVINO testing shows no GPU advantage
  2. Integrated GPU architectural limitations (shared memory bandwidth)
  3. High setup complexity vs. negligible expected benefit
  4. Current Ollama CPU setup is already optimal for this hardware

If Future Hardware Upgrade: Discrete Intel Arc GPU (A770/A750) would make IPEX-LLM GPU acceleration worthwhile (2-3x speedup expected).

Current Best Practice: Continue using Ollama CPU with proven 4.50 tok/s performance.