
Initial implementation for the docs site and setup for LLM Compressor #1436

Merged · 3 commits · Jul 18, 2025
1 change: 1 addition & 0 deletions .gitignore
@@ -53,6 +53,7 @@ wheels/
.installed.cfg
*.egg
MANIFEST
.cache/*

# PyInstaller
# Usually these files are written by a python script from a template
22 changes: 22 additions & 0 deletions .readthedocs.yaml
@@ -0,0 +1,22 @@
# Read the Docs configuration file
# See https://docs.readthedocs.io/en/stable/config-file/v2.html for details

# Required
version: 2

# Set the OS, Python version, and other tools you might need
build:
  os: ubuntu-24.04
  tools:
    python: "3.12"

# Build documentation with Mkdocs
mkdocs:
  configuration: mkdocs.yml

python:
  install:
    - method: pip
      path: .
      extra_requirements:
        - dev
77 changes: 77 additions & 0 deletions CODE_OF_CONDUCT.md
@@ -0,0 +1,77 @@
# LLM Compressor Code of Conduct

## Our Pledge

We as members, contributors, and leaders pledge to make participation in our community a harassment-free experience for everyone, regardless of age, body size, visible or invisible disability, ethnicity, sex characteristics, gender identity and expression, level of experience, education, socio-economic status, nationality, personal appearance, race, religion, or sexual identity and orientation.

We pledge to act and interact in ways that contribute to an open, welcoming, diverse, inclusive, and healthy community.

## Our Standards

Examples of behavior that contributes to a positive environment for our community include:

- Demonstrating empathy and kindness toward other people
- Being respectful of differing opinions, viewpoints, and experiences
- Giving and gracefully accepting constructive feedback
- Accepting responsibility and apologizing to those affected by our mistakes, and learning from the experience
- Focusing on what is best not just for us as individuals, but for the overall community

Examples of unacceptable behavior include:

- The use of sexualized language or imagery, and sexual attention or advances of any kind
- Trolling, insulting or derogatory comments, and personal or political attacks
- Public or private harassment
- Publishing others’ private information, such as a physical or email address, without their explicit permission
- Other conduct which could reasonably be considered inappropriate in a professional setting

## Enforcement Responsibilities

Community leaders are responsible for clarifying and enforcing our standards of acceptable behavior and will take appropriate and fair corrective action in response to any behavior that they deem inappropriate, threatening, offensive, or harmful.

Community leaders have the right and responsibility to remove, edit, or reject comments, commits, code, wiki edits, issues, and other contributions that are not aligned to this Code of Conduct, and will communicate reasons for moderation decisions when appropriate.

## Scope

This Code of Conduct applies within all community spaces, and also applies when an individual is officially representing the community in public spaces. Examples of representing our community include using an official e-mail address, posting via an official social media account, or acting as an appointed representative at an online or offline event.

## Enforcement

Instances of abusive, harassing, or otherwise unacceptable behavior may be reported to the community leaders responsible for enforcement through GitHub, Slack, or Email. All complaints will be reviewed and investigated promptly and fairly.

All community leaders are obligated to respect the privacy and security of the reporter of any incident.

## Enforcement Guidelines

Community leaders will follow these Community Impact Guidelines in determining the consequences for any action they deem in violation of this Code of Conduct:

### 1. Correction

**Community Impact**: Use of inappropriate language or other behavior deemed unprofessional or unwelcome in the community.

**Consequence**: A private, written warning from community leaders, providing clarity around the nature of the violation and an explanation of why the behavior was inappropriate. A public apology may be requested.

### 2. Warning

**Community Impact**: A violation through a single incident or series of actions.

**Consequence**: A warning with consequences for continued behavior. No interaction with the people involved, including unsolicited interaction with those enforcing the Code of Conduct, for a specified period of time. This includes avoiding interactions in community spaces as well as external channels like social media. Violating these terms may lead to a temporary or permanent ban.

### 3. Temporary Ban

**Community Impact**: A serious violation of community standards, including sustained inappropriate behavior.

**Consequence**: A temporary ban from any sort of interaction or public communication with the community for a specified period of time. No public or private interaction with the people involved, including unsolicited interaction with those enforcing the Code of Conduct, is allowed during this period. Violating these terms may lead to a permanent ban.

### 4. Permanent Ban

**Community Impact**: Demonstrating a pattern of violation of community standards, including sustained inappropriate behavior, harassment of an individual, or aggression toward or disparagement of classes of individuals.

**Consequence**: A permanent ban from any sort of public interaction within the community.

## Attribution

This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 2.1, available at https://www.contributor-covenant.org/version/2/1/code_of_conduct.html.

Community Impact Guidelines were inspired by [Mozilla’s code of conduct enforcement ladder](https://github.com/mozilla/diversity).

For answers to common questions about this code of conduct, see the FAQ at https://www.contributor-covenant.org/faq. Translations are available at https://www.contributor-covenant.org/translations.
Binary file added docs/assets/llmcompressor-icon-white.png
Binary file added docs/assets/llmcompressor-icon.png
Binary file added docs/assets/llmcompressor-user-flows.png
39 changes: 39 additions & 0 deletions docs/developer/index.md
@@ -0,0 +1,39 @@
---
weight: -3
---

# Developer

Welcome to the Developer section of LLM Compressor! This area provides essential resources for developers who want to contribute to or extend LLM Compressor. Whether you're interested in fixing bugs, adding new features, improving documentation, or understanding the project's governance, you'll find comprehensive guides to help you get started.

LLM Compressor is an open-source project that values community contributions. We maintain high standards for code quality, documentation, and community interactions to ensure that LLM Compressor remains a robust, reliable, and user-friendly tool for compressing large language models.

## Developer Resources

<div class="grid cards" markdown>

- :material-handshake:{ .lg .middle } Code of Conduct

---

Our community guidelines ensure that participation in the LLM Compressor project is a positive, inclusive, and respectful experience for everyone.

[:octicons-arrow-right-24: Code of Conduct](code-of-conduct.md)

- :material-source-pull:{ .lg .middle } Contributing Guide

---

Learn how to effectively contribute to LLM Compressor, including reporting bugs, suggesting features, improving documentation, and submitting code.

[:octicons-arrow-right-24: Contributing Guide](contributing.md)

- :material-tools:{ .lg .middle } Development Guide

---

Detailed instructions for setting up your development environment, implementing changes, and adhering to the project's coding standards and best practices.

[:octicons-arrow-right-24: Development Guide](developing.md)

</div>
9 changes: 9 additions & 0 deletions docs/examples/index.md
@@ -0,0 +1,9 @@
---
weight: -4
---

# Examples

Welcome to the LLM Compressor examples section! Here, you'll find practical demonstrations showing how to use LLM Compressor to optimize large language models for faster and more efficient deployment with vLLM. These examples will help you understand the various compression techniques and functionalities available in LLM Compressor, making it easier to apply them to your own models.

To explore the examples, you can either navigate through the list provided in the sidebar or click next to see the next example in the series. Each example is designed to be self-contained, with clear instructions and code snippets that you can run directly.
67 changes: 67 additions & 0 deletions docs/getting-started/compress.md
@@ -0,0 +1,67 @@
---
weight: -8
---

# Compress Your Model

LLM Compressor provides a straightforward way to compress your models using various optimization techniques. This guide will walk you through the process of compressing a model using different quantization methods.

## Prerequisites

Before you begin, ensure you have the following prerequisites:
- **Operating System:** Linux (recommended for GPU support)
- **Python Version:** 3.9 or newer
- **Available GPU:** For optimal performance, it's recommended to use a GPU. LLM Compressor supports the latest PyTorch and CUDA versions for compatibility with NVIDIA GPUs.

## Select a Model and Dataset

Before you start compressing, select the model you'd like to compress and a calibration dataset that is representative of your use case. LLM Compressor supports a variety of models and integrates natively with Hugging Face Transformers, so a great starting point is a model from the Hugging Face Model Hub. LLM Compressor also supports many datasets from the Hugging Face Datasets library, making it easy to find a suitable dataset for calibration.

For this guide, we'll use the `TinyLlama` model and the `open_platypus` dataset for calibration. You can replace these with your own model and dataset as needed.
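
Optionally, you can sanity-check that the model's tokenizer and the calibration data are reachable before running compression. The snippet below is a minimal sketch; the `garage-bAInd/Open-Platypus` dataset name is an assumption about what the `open_platypus` alias resolves to on the Hugging Face Hub, so adjust the names if you use different assets.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Load the tokenizer for the model we plan to compress
tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

# Preview one calibration sample; the Hub dataset name is an assumption (see above)
dataset = load_dataset("garage-bAInd/Open-Platypus", split="train")
print(dataset[0])

# Confirm the tokenizer produces token IDs as expected
print(tokenizer("What is machine learning?").input_ids[:16])
```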

## Select a Quantization Method and Scheme

LLM Compressor supports several quantization methods and schemes, each with its own strengths and weaknesses. The choice of method and scheme will depend on your specific use case, hardware capabilities, and desired trade-offs between model size, speed, and accuracy.

Some common quantization schemes include:

| Scheme | Description | Hardware Compatibility |
|--------|-------------|------------------------|
| **FP W8A8** | 8-bit floating point (FP8) quantization for weights and activations, providing ~2X smaller weights with 8-bit arithmetic operations. Good for general performance and compression, especially for server and batch inference. | Latest NVIDIA GPUs (Ada Lovelace, Hopper, and later) and latest AMD GPUs |
| **INT W8A8** | 8-bit integer (INT8) quantization for weights and activations, providing ~2X smaller weights with 8-bit arithmetic operations. Good for general performance and compression, especially for server and batch inference. | All NVIDIA GPUs, AMD GPUs, TPUs, CPUs, and other accelerators |
| **W4A16** | 4-bit integer (INT4) weights with 16-bit floating point (FP16) activations, providing ~3.7X smaller weights but requiring 16-bit arithmetic operations. Maximum compression for latency-sensitive applications with limited memory. | All NVIDIA GPUs, AMD GPUs, TPUs, CPUs, and other accelerators |

Some common quantization methods include:

| Method | Description | Accuracy Recovery vs. Time |
|--------|-------------|----------------------------|
| **GPTQ** | Utilizes second-order layer-wise optimizations to prioritize important weights/activations and enables updates to remaining weights | High accuracy recovery but more expensive/slower to run |
| **AWQ** | Uses channelwise scaling to better preserve important outliers in weights and activations | Moderate accuracy recovery with faster runtime than GPTQ |
| **SmoothQuant** | Smooths outliers in activations by folding them into weights, ensuring better accuracy for weight and activation quantized models | Good accuracy recovery with minimal calibration time; composable with other methods |

For this guide, we'll use `GPTQ` composed with `SmoothQuant` to create an `INT W8A8` quantized model. This combination provides a good balance of performance, accuracy, and compatibility across a wide range of hardware.

## Apply the Recipe

LLM Compressor provides the `oneshot` API for simple and straightforward model compression. This API allows you to apply a pre-defined recipe to your model and dataset, making it easy to get started with compression. To apply what we discussed above, we'll import the necessary modifiers and create a recipe to apply to our model and dataset:

```python
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor import oneshot

recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),
    GPTQModifier(scheme="W8A8", targets="Linear", ignore=["lm_head"]),
]
oneshot(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    dataset="open_platypus",
    recipe=recipe,
    output_dir="TinyLlama-1.1B-Chat-v1.0-INT8",
    max_seq_length=2048,
    num_calibration_samples=512,
)
```

Once the above code is run, it will save the compressed model to the specified output directory: `TinyLlama-1.1B-Chat-v1.0-INT8`. You can then load this model using the Hugging Face Transformers library or vLLM for inference and testing.
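
As a quick sanity check before deployment, you can load the saved checkpoint back through Hugging Face Transformers and generate a short completion. The snippet below is a minimal sketch, assuming a single available GPU and the `compressed-tensors` package installed alongside LLM Compressor:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "TinyLlama-1.1B-Chat-v1.0-INT8"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto")

# Generate a short completion to confirm the compressed checkpoint loads and runs
inputs = tokenizer("What is machine learning?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
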
57 changes: 57 additions & 0 deletions docs/getting-started/deploy.md
@@ -0,0 +1,57 @@
---
weight: -6
---

# Deploy with vLLM

Once you've compressed your model using LLM Compressor, you can deploy it for efficient inference using vLLM. This guide walks you through the deployment process, using the output from the [Compress Your Model](compress.md) guide. If you haven't completed that step, change the model arguments in the code snippets below to point to your desired model.

vLLM is a high-performance inference engine designed for large language models, providing support for various quantization formats and optimized for both single and multi-GPU setups. It also offers an OpenAI-compatible API for easy integration with existing applications.

## Prerequisites

Before deploying your model, ensure you have the following prerequisites:
- **Operating System:** Linux (recommended for GPU support)
- **Python Version:** 3.9 or newer
- **Available GPU:** For optimal performance, it's recommended to use a GPU. vLLM supports a range of accelerators, including NVIDIA GPUs, AMD GPUs, TPUs, and other accelerators.
- **vLLM Installed:** Ensure you have vLLM installed. You can install it using pip:
```bash
pip install vllm
```

## Python API

vLLM provides a Python API for easy integration with your applications, enabling you to load and use your compressed model directly in your Python code. To test the compressed model, use the following code:

```python
from vllm import LLM, SamplingParams

# Load the compressed model and generate a single completion
model = LLM("./TinyLlama-1.1B-Chat-v1.0-INT8")
outputs = model.generate("What is machine learning?", SamplingParams(max_tokens=256))
print(outputs[0].outputs[0].text)
```

After running the above code, you should see the generated output from your compressed model. This confirms that the model is loaded and ready for inference.

## HTTP Server

vLLM also provides an HTTP server for serving your model via a RESTful API that is compatible with OpenAI's API definitions. This allows you to easily integrate your model into existing applications or services.
To start the HTTP server, use the following command:

```bash
vllm serve "./TinyLlama-1.1B-Chat-v1.0-INT8"
```

By default, the server will run on `localhost:8000`. You can change the host and port by using the `--host` and `--port` flags. Now that the server is running, you can send requests to it using any HTTP client. For example, you can use `curl` to send a request:

```bash
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "TinyLlama-1.1B-Chat-v1.0-INT8",
    "messages": [{"role": "user", "content": "What is machine learning?"}],
    "max_tokens": 256
  }'
```

This will return a JSON response with the generated text from your model. You can also use any HTTP client library in your programming language of choice to send requests to the server. For example, you can use the OpenAI Python client, as shown in the sketch below.
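
Since the server speaks the OpenAI API, the official `openai` Python client works against it as well. The following is a minimal sketch, assuming the `openai` package is installed, the server is running on the default host and port, and the model name matches the one the server registers for your checkpoint (you can list registered names via `GET /v1/models`):

```python
from openai import OpenAI

# vLLM's OpenAI-compatible server does not require a real API key by default
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="TinyLlama-1.1B-Chat-v1.0-INT8",
    messages=[{"role": "user", "content": "What is machine learning?"}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```
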
41 changes: 41 additions & 0 deletions docs/getting-started/index.md
@@ -0,0 +1,41 @@
---
weight: -10
---

# Getting Started

Welcome to LLM Compressor! This section will guide you through the process of installing the library, compressing your first model, and deploying it with vLLM for faster, more efficient inference.

LLM Compressor makes it simple to optimize large language models for deployment, offering various quantization techniques that help you find the perfect balance between model quality, performance, and resource efficiency.

## Quick Start Guides

Follow the guides below to get started with LLM Compressor and optimize your models for production deployment.

<div class="grid cards" markdown>

- :material-package-variant:{ .lg .middle } Installation

---

Learn how to install LLM Compressor using pip or from source.

[:octicons-arrow-right-24: Installation Guide](install.md)

- :material-memory:{ .lg .middle } Compress Your Model

---

Learn how to apply quantization to your models using different algorithms and formats.

[:octicons-arrow-right-24: Compression Guide](compress.md)

- :material-rocket-launch:{ .lg .middle } Deploy with vLLM

---

Deploy your compressed model for efficient inference using vLLM.

[:octicons-arrow-right-24: Deployment Guide](deploy.md)

</div>
67 changes: 67 additions & 0 deletions docs/getting-started/install.md
@@ -0,0 +1,67 @@
---
weight: -10
---

# Installation

LLM Compressor can be installed using several methods depending on your requirements. Below are the detailed instructions for each installation pathway.

## Prerequisites

Before installing LLM Compressor, ensure you have the following prerequisites:

- **Operating System:** Linux (recommended for GPU support)
- **Python Version:** 3.9 or newer
- **Pip Version:** Ensure you have the latest version of pip installed. You can upgrade pip using the following command:

```bash
python -m pip install --upgrade pip
```

## Installation Methods

### Install from PyPI

The simplest way to install LLM Compressor is via pip from the Python Package Index (PyPI):

```bash
pip install llmcompressor
```

This will install the latest stable release of LLM Compressor.
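
To confirm the installation succeeded, you can query the installed package version. This is a minimal check; the exact version string depends on the release you installed:

```python
from importlib.metadata import version

# Print the installed LLM Compressor version to confirm the package is available
print(version("llmcompressor"))
```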

### Install a Specific Version from PyPI

If you need a specific version of LLM Compressor, you can specify the version number during installation:

```bash
pip install llmcompressor==0.5.1
```

Replace `0.5.1` with your desired version number.

### Install from Source

To install the latest development version of LLM Compressor from the main branch, use the following command:

```bash
pip install git+https://github.com/vllm-project/llm-compressor.git
```

This will clone the repository and install LLM Compressor directly from the main branch.

### Install from a Local Clone

If you have cloned the LLM Compressor repository locally and want to install it, navigate to the repository directory and run:

```bash
pip install .
```

For development purposes, you can install it in editable mode with the `dev` extra:

```bash
pip install -e .[dev]
```

This allows you to make changes to the source code and have them reflected immediately without reinstalling.
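
To verify that the editable install resolves to your local checkout rather than a released wheel, you can check where Python imports the package from. The printed path is illustrative and depends on where you cloned the repository:

```python
import llmcompressor

# With an editable install, this should point into your local clone,
# for example .../llm-compressor/src/llmcompressor/__init__.py (path is illustrative)
print(llmcompressor.__file__)
```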