ExLlamaV3

ExLlamaV3 is an inference library for running local LLMs on modern consumer GPUs. Headline features:

New EXL3 quantization format based on QTIP
Flexible tensor-parallel and expert-parallel inference for consumer hardware setups
OpenAI-compatible server provided via TabbyAPI
Continuous, dynamic batching
HF Transformers plugin (see here)
HF model support (see supported architectures)
Speculative decoding
2-8 bit cache quantization
Multimodal support

The official and recommended backend server for ExLlamaV3 is TabbyAPI, which provides an OpenAI-compatible API for local or remote inference, with extended features like HF model downloading, embedding model support and support for HF Jinja2 chat templates.

⚠️ Important

Qwen3-Next support is currently experimental and still requires some optimization, so don't expect optimal performance just yet. Flash Linear Attention is required and this in turn requires Triton. causal-conv1d is supported and recommended but not required.
Qwen3-Next currently does not support tensor/expert parallelism.

Architecture support

AFM (ArceeForCausalLM)
Apertus (ApertursForCausalLM)
Command-R etc. (CohereForCausalLM)
Command-A, Command-R7B, Command-R+ etc. (Cohere2ForCausalLM)
DeciLM, Nemotron (DeciLMForCausalLM)
dots.llm1 (Dots1ForCausalLM)
ERNIE 4.5 (Ernie4_5_ForCausalLM, Ernie4_5_MoeForCausalLM)
EXAONE 4.0 (Exaone4ForCausalLM)
Gemma 2 (Gemma2ForCausalLM)
Gemma 3 (Gemma3ForCausalLM, Gemma3ForConditionalGeneration) - multimodal
GLM 4, GLM 4.5, GLM 4.5-Air, (Glm4ForCausalLM, Glm4MoeForCausalLM)
Llama, Llama 2, Llama 3, Llama 3.1-Nemotron etc. (LlamaForCausalLM)
MiMo-RL (MiMoForCausalLM)
Mistral, Mistral 3 etc. (MistralForCausalLM, Mistral3ForConditionalGeneration) - multimodal
Mixtral (MixtralForCausalLM)
Phi3, Phi4 (Phi3ForCausalLM)
Qwen 2, Qwen 2.5 (Qwen2ForCausalLM)
Qwen 3 (Qwen3ForCausalLM, Qwen3MoeForCausalLM)
Qwen 3-Next (Qwen3NextForCausalLM)
Seed-OSS (SeedOssForCausalLM)
SmolLM (SmolLM3ForCausalLM)

Always adding more, stay tuned.

What's missing?

Currently on the to-do list:

Lots of optimization
LoRA support
ROCm support
More sampling functions
More quantization modes (FP4 etc.)

As for what is implemented, expect that some things may be a little broken at first. Please be patient, raise issues and/or contribute. 👉👈

How to?

TabbyAPI has a startup script that manages and installs prerequisites if you want to get started quickly with inference in an OAI-compatible client.

Otherwise, start by making sure you have the appropriate version of PyTorch installed (CUDA 12.4 or later) since the Torch dependency is not automatically handled by pip. Then pick a method below:

Method 1: Installing from prebuilt wheel (recommended if you're unsure)

Pick a wheel from the releases page, then e.g.:

pip install https://github.com/turboderp-org/exllamav3/releases/download/v0.0.6/exllamav3-0.0.6+cu128.torch2.8.0-cp313-cp313-linux_x86_64.whl

Method 2: Installing from PyPi:

pip install exllamav3

Note that the PyPi package does not contain a prebuilt extension and requires the CUDA toolkit and build prerequisites (i.e. VS Build Tools on Windows, gcc on Linux, python-dev headers etc.).

Method 3: Building from source

# Clone the repo
git clone https://github.com/turboderp-org/exllamav3
cd exllamav3

# (Optional) switch to dev branch for latest in-progress features
git checkout dev

# Install requirements (make sure you install Torch separately)
pip install -r requirements.txt

At this point you should be able to run the conversion, eval and example scripts from the main repo directory, e.g. python convert.py -i ...

To install the library for the active venv, run from the repo directory:

pip install .

Relevant env variables for building:

MAX_JOBS: by default ninja may launch too many processes and run out of system memory for compilation. Set this to a reasonable value like 4 in that case.
EXLLAMA_NOCOMPILE: set to install the library without compiling the C++/CUDA extension. Torch will build/load it at runtime instead.

Conversion

To convert a model to EXL3 format, use:

# Convert model
python convert.py -i <input_dir> -o <output_dir> -w <working_dir> -b <bitrate>

# Resume an interrupted quant job
python convert.py -w <working_dir> -r

# More options
python convert.py -h

The working directory is temporary storage for state checkpoints and for storing quantized tensors until the converted model can be compiled. It should have enough free space to store an entire copy of the output model. Note that while EXL2 conversion by default resumes an interrupted job when pointed to an existing folder, EXL3 needs you to explicitly resume with the -r/--resume argument.

See here for more information.

Examples

A number of example scripts are provided to showcase the features of the backend and generator. Some of them have hardcoded model paths and should be edited before you run them, but there is a simple CLI chatbot that you can start with:

python examples/chat.py -m <input_dir> -mode <prompt_mode> 

# E.g.:
python examples/chat.py -m /mnt/models/llama3.1-8b-instruct-exl3 -mode llama3

# Wealth of options
python examples/chat.py -h

EXL3 quantization

Despite their amazing achievements, most SOTA quantization techniques remain cumbersome or even prohibitively expensive to use. For instance, AQLM quantization of a 70B model takes around 720 GPU-hours on an A100 server, costing $850 US at the time of writing. ExLlamaV3 aims to address this with the EXL3 format, which is a streamlined variant of QTIP from Cornell RelaxML. The conversion process is designed to be simple and efficient and requires only an input model (in HF format) and a target bitrate. By computing Hessians on the fly and thanks to a fused Viterbi kernel, the quantizer can convert a model in a single step, taking a couple of minutes for smaller models, up to a few hours for larger ones (70B+) (on a single RTX 4090 or equivalent GPU.)

The Marlin-inspired GEMM kernel achieves roughly memory-bound latency under optimal conditions (4bpw, RTX 4090), though it still needs some work to achieve the same efficiency on Ampere GPUs and to remain memory-bound at lower bitrates.

Since converted models largely retain the original file structure (unlike EXL2 which renames some tensors in its quest to turn every model into a Llama variant), it will be possible to extend EXL3 support to other frameworks like HF Transformers and vLLM.

There are some benchmark results here, and a full writeup on the format is coming soon.

Fun fact: Llama-3.1-70B-EXL3 is coherent at 1.6 bpw. With the output layer quantized to 3 bpw and a 4096-token cache, inference is possible in under 16 GB of VRAM.

Community

You are always welcome to join the ExLlama discord server ←🎮

🤗 HuggingFace repos

A selection of EXL3-quantized models is available here. Also shout out the following lovely people:

Acknowledgements

This project owes its existence to a wonderful community of FOSS developers and some very generous supporters (🐈❤️!) The following projects in particular deserve a special mention:

Name		Name	Last commit message	Last commit date
Latest commit History 550 Commits
.github		.github
doc		doc
eval		eval
examples		examples
exllamav3		exllamav3
science		science
tests		tests
util		util
.gitignore		.gitignore
LICENSE		LICENSE
MANIFEST.in		MANIFEST.in
README.md		README.md
convert.py		convert.py
requirements.txt		requirements.txt
requirements_eval.txt		requirements_eval.txt
requirements_examples.txt		requirements_examples.txt
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Repository files navigation

ExLlamaV3

⚠️ Important

Architecture support

What's missing?

How to?

Method 1: Installing from prebuilt wheel (recommended if you're unsure)

Method 2: Installing from PyPi:

Method 3: Building from source

Conversion

Examples

EXL3 quantization

Community

🤗 HuggingFace repos

Acknowledgements

About

Uh oh!

Releases 11

Sponsor this project

Uh oh!

Packages

Uh oh!

Contributors 9

Languages

Uh oh!

License

turboderp-org/exllamav3

Folders and files

Latest commit

History

Repository files navigation

ExLlamaV3

⚠️ Important

Architecture support

What's missing?

How to?

Method 1: Installing from prebuilt wheel (recommended if you're unsure)

Method 2: Installing from PyPi:

Method 3: Building from source

Conversion

Examples

EXL3 quantization

Community

🤗 HuggingFace repos

Acknowledgements

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 11

Sponsor this project

Uh oh!

Packages 0

Uh oh!

Contributors 9

Languages

Packages