Measurement config (`mode: MEASURE`), added as a single-line file (pretty-printed here):

```json
{
  "method": "HOOKS",
  "mode": "MEASURE",
  "observer": "maxabs",
  "allowlist": {"types": [], "names": []},
  "blocklist": {"types": [], "names": []},
  "quantize_weight": false,
  "dump_stats_path": "/eager_output/llama-4-maverick-17b-128e-instruct/g3/inc_output"
}
```
Quantization config (`mode: QUANTIZE`), added as a single-line file (pretty-printed here):

```json
{
  "mode": "QUANTIZE",
  "observer": "maxabs",
  "scale_method": "maxabs_hw",
  "allowlist": {"types": [], "names": []},
  "blocklist": {"types": [], "names": []},
  "dump_stats_path": "./g3/inc_output"
}
```
41 changes: 41 additions & 0 deletions .static_quant/1.22.0/README.md
# Static Quantization
The steps below use the Llama-4-Maverick-17B-128E-Instruct model as an example.

## Configuration

1. Locate the file `maxabs_quant_g3.json` inside the model's quantization folder (for this example, `.static_quant/1.22.0/Llama-4-Maverick-17B-128E-Instruct`).
2. Edit it and set the `dump_stats_path` parameter to an absolute path inside your clone of the repository.

Example:

```json
"dump_stats_path": "/root/vllm-fork/.static_quant/1.22.0/Llama-4-Maverick-17B-128E-Instruct/g3/inc_output"
```
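If you prefer not to edit the file by hand, a small script can rewrite `dump_stats_path` for you. This is a sketch, not part of the official flow; it assumes you run it from the directory containing `maxabs_quant_g3.json` and that the stats directory follows the `g3/inc_output` layout shown above:

```shell
# Rewrite dump_stats_path in maxabs_quant_g3.json to an absolute path
# rooted at the current working directory (assumed to be inside your clone).
CFG=maxabs_quant_g3.json
python3 - "$CFG" <<'EOF'
import json, os, sys

path = sys.argv[1]
with open(path) as f:
    cfg = json.load(f)

# Resolve the stats directory relative to the current working directory.
cfg["dump_stats_path"] = os.path.abspath("g3/inc_output")

with open(path, "w") as f:
    json.dump(cfg, f, indent=2)
EOF
grep dump_stats_path "$CFG"
```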

## Environment Variable
Export the `QUANT_CONFIG` environment variable before running the server. It must point to the location of `maxabs_quant_g3.json`.

Example:

```bash
export QUANT_CONFIG='/root/vllm-fork/.static_quant/1.22.0/Llama-4-Maverick-17B-128E-Instruct/maxabs_quant_g3.json'
```
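A quick sanity check before launching confirms the variable resolves to a real file (the messages here are illustrative, not produced by vLLM):

```shell
# Fail fast if QUANT_CONFIG is unset or does not point at an existing file.
if [ -f "$QUANT_CONFIG" ]; then
  echo "QUANT_CONFIG ok: $QUANT_CONFIG"
else
  echo "QUANT_CONFIG missing or unset: $QUANT_CONFIG" >&2
fi
```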

## Run vLLM Server

Start the vLLM server with quantization enabled:

```bash
vllm serve meta-llama/Llama-4-Maverick-17B-128E-Instruct \
--quantization inc \
--kv-cache-dtype fp8_inc \
--weights-load-device cpu \
--tensor-parallel-size 8 \
--max-model-len 2048
```
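Once the server is up, you can smoke-test it through vLLM's OpenAI-compatible API. The host and port below are assumptions for a default local run (vLLM listens on port 8000 unless `--port` is set); the model name matches the one passed to `vllm serve`:

```shell
# Send a minimal chat completion request to the local vLLM endpoint.
curl -s http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
        "model": "meta-llama/Llama-4-Maverick-17B-128E-Instruct",
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 32
      }'
```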

## Notes

1. The `dump_stats_path` in `maxabs_quant_g3.json` must be an absolute path.
2. `QUANT_CONFIG` must be exported before running `vllm serve`.
3. Adjust `--tensor-parallel-size` and `--max-model-len` according to your system resources.