A distributed hyperparameter optimization framework for vLLM serving, built with Ray and Optuna.
- 🚀 Distributed Optimization: Scale across multiple GPUs and nodes using Ray
 - 🎯 Flexible Backends: Run locally or on Ray clusters
 - 📊 Rich Benchmarking: Built-in GuideLLM support + custom benchmark providers
 - 🗄️ Centralized Storage: PostgreSQL for trials, metrics, and logs
 - ⚙️ Easy Configuration: YAML-based study and parameter configuration
 - 📈 Multi-Objective: Support for throughput vs latency trade-offs
 - 🔧 Extensible: Plugin system for custom benchmarks
 
For a detailed starter guide, see the Quick Start Guide.

```bash
git clone https://github.com/openshift-psap/auto-tuning-vllm.git
cd auto-tuning-vllm
pip install -e .

# Run optimization study
auto-tune-vllm optimize --config config.yaml --max-concurrent 2
# Stream live logs  
auto-tune-vllm logs --study-id 42 --trial-number 15
# Resume interrupted study
auto-tune-vllm resume --study-name study_35884
```

See also:

- Ray Cluster Setup - Important for distributed optimization
 - Configuration Reference
 
Requirements:

- Python 3.10+
 - NVIDIA GPU with CUDA support
 - PostgreSQL database
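
A quick way to confirm these prerequisites are in place (generic shell commands, not part of the auto-tune-vllm CLI; the PostgreSQL host and user below are placeholders for your own setup):

```bash
# Confirm the Python version (3.10 or newer is required)
python3 --version

# Confirm the NVIDIA driver and CUDA-capable GPUs are visible
nvidia-smi

# Confirm the PostgreSQL server is reachable (replace host and user with your own)
psql -h localhost -U postgres -c "SELECT version();"
```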
 
All ML dependencies (vLLM, Ray, GuideLLM, BoTorch) are included automatically.
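
To verify that the editable install pulled them in, you can ask pip for the expected distributions (a quick sanity check; the package names are assumed to match their PyPI distribution names):

```bash
# Installed packages print their name and version; anything missing produces a warning
pip show vllm ray guidellm botorch
```
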
Issue: The `--max-concurrent` parameter is not validated against available Ray cluster resources.

Details: When using the Ray backend, the system does not check whether the requested concurrency level is feasible given the cluster's GPU/CPU resources. For example, setting `--max-concurrent 10` on a cluster with only 4 GPUs will not warn the user that only 4 trials can actually run concurrently.

Reason: There is no reliable way to know in advance whether every trial will use the same number of GPUs. Trials may explore different parallelism-related settings, so different trials can require different numbers of GPUs: on a 4-GPU cluster, four trials each using a single GPU can run concurrently, but one trial configured for 4-way parallelism occupies all of them.
Current Behavior:
- Excess trials are queued by Ray until resources become available
 - No warning or guidance is provided to users
 - May lead to confusion about why trials aren't running as expected
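
To see this queueing directly, Ray's own CLI can summarize the cluster's total resources, current usage, and pending resource demands (a standard Ray command, independent of auto-tune-vllm):

```bash
# Print node status, resource usage, and pending resource demands for the cluster
ray status
```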
 
Workaround:
- Use `auto-tune-vllm check-env --ray-cluster` to inspect available resources
 - Set concurrency based on available GPUs (typically 1 GPU per trial)
 - Monitor the Ray dashboard at `http://<head-node>:8265` for resource utilization

Example:

```bash
# Check cluster resources first
auto-tune-vllm check-env --ray-cluster
# Set realistic concurrency (e.g., if you have 4 GPUs)
auto-tune-vllm optimize --config study.yaml --max-concurrent 4
```

Apache License 2.0 - see LICENSE file for details.