A distributed hyperparameter optimization framework for vLLM serving, built with Ray and Optuna.
- 🚀 Distributed Optimization: Scale across multiple GPUs and nodes using Ray
 - 🎯 Flexible Backends: Run locally or on Ray clusters
 - 📊 Rich Benchmarking: Built-in GuideLLM support + custom benchmark providers
 - 🗄️ Centralized Storage: PostgreSQL for trials, metrics, and logs
 - ⚙️ Easy Configuration: YAML-based study and parameter configuration
 - 📈 Multi-Objective: Support for throughput vs latency trade-offs
 - 🔧 Extensible: Plugin system for custom benchmarks
 
For a detailed starter guide, see the Quick Start Guide.

```bash
git clone https://github.com/openshift-psap/auto-tuning-vllm.git
cd auto-tuning-vllm
pip install -e .

# Run optimization study
auto-tune-vllm optimize --config config.yaml --max-concurrent 2
# Stream live logs  
auto-tune-vllm logs --study-id 42 --trial-number 15
# Resume interrupted study
auto-tune-vllm resume --study-name study_35884
```

See also:

- Ray Cluster Setup - Important for distributed optimization
 - Configuration Reference
 
Requirements:

- Python 3.10+
 - NVIDIA GPU with CUDA support
 - PostgreSQL database
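
A quick way to confirm these prerequisites are in place (generic shell commands, not part of the auto-tune-vllm CLI; the PostgreSQL host and user below are placeholders for your own setup):

```bash
# Confirm the Python version (3.10 or newer is required)
python3 --version

# Confirm the NVIDIA driver and CUDA-capable GPUs are visible
nvidia-smi

# Confirm the PostgreSQL server is reachable (replace host and user with your own)
psql -h localhost -U postgres -c "SELECT version();"
```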
 
All ML dependencies (vLLM, Ray, GuideLLM, BoTorch) are included automatically.
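
To verify that the editable install pulled them in, you can ask pip for the expected distributions (a quick sanity check; the package names are assumed to match their PyPI distribution names):

```bash
# Installed packages print their name and version; anything missing produces a warning
pip show vllm ray guidellm botorch
```
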
Issue: The `--max-concurrent` parameter is not validated against available Ray cluster resources.

Details: When using the Ray backend, the system does not check whether the requested concurrency level is feasible given the cluster's GPU/CPU resources. For example, setting `--max-concurrent 10` on a cluster with only 4 GPUs will not warn the user that only 4 trials can actually run concurrently.

Reason: There is no reliable way to know in advance whether every trial will use the same number of GPUs. Trials may explore different parallelism-related settings, so different trials can require different numbers of GPUs: on a 4-GPU cluster, four trials each using a single GPU can run concurrently, but one trial configured for 4-way parallelism occupies all of them.
Current Behavior:
- Excess trials are queued by Ray until resources become available
 - No warning or guidance is provided to users
 - May lead to confusion about why trials aren't running as expected
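
To see this queueing directly, Ray's own CLI can summarize the cluster's total resources, current usage, and pending resource demands (a standard Ray command, independent of auto-tune-vllm):

```bash
# Print node status, resource usage, and pending resource demands for the cluster
ray status
```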
 
Workaround:
- Use `auto-tune-vllm check-env --ray-cluster` to inspect available resources
 - Set concurrency based on available GPUs (typically 1 GPU per trial)
 - Monitor the Ray dashboard at `http://<head-node>:8265` for resource utilization

Example:

```bash
# Check cluster resources first
auto-tune-vllm check-env --ray-cluster
# Set realistic concurrency (e.g., if you have 4 GPUs)
auto-tune-vllm optimize --config study.yaml --max-concurrent 4
```

Apache License 2.0 - see LICENSE file for details.