This repository stores the source code for the paper: "Design of a GPU Dynamic LLM Inference Task Scheduling Architecture Based on KubeAI".
The rapid advancement of artificial intelligence has recently spurred the growth of diverse technological fields. Since the launch of Large Language Models (LLMs) like OpenAI's ChatGPT in late 2022, there has been a surge in demand for various generative AI applications, such as programming assistants, image generation, Retrieval-Augmented Generation (RAG), and AI agent systems. In response, enterprises are increasingly building these applications in-house, often leveraging on-premises Kubernetes infrastructure to address security concerns. However, setting up a dedicated environment for generative AI inference is a complex process. The open-source project KubeAI simplifies this by allowing developers to quickly establish an inference environment by deploying model configurations.
Nevertheless, KubeAI utilizes the default Kubernetes scheduler, which is designed for general-purpose workloads and bases its scheduling decisions on metrics like CPU and memory usage, without considering GPU utilization. This limitation makes it inadequate for dynamic GPU resource scheduling. Consequently, KubeAI cannot dynamically allocate GPU resources based on the real-time demands of LLM inference tasks across the cluster, relying instead on static GPU allocations defined at deployment. This can lead to significant performance degradation when the assigned GPU resources become insufficient during task execution.
This study, therefore, introduces the design of a dynamic GPU scheduling architecture for LLM inference tasks, built upon KubeAI. By developing a specialized KubeAI LLM GPU Scheduler, we aim to improve upon KubeAI's native GPU scheduling capabilities, enabling more efficient resource allocation and enhanced inference performance.
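As a rough illustration of the core idea (not the scheduler's actual implementation), a GPU-aware dispatcher can rank candidate worker nodes by their current GPU utilization and free GPU memory before placing an inference task. The node names, metric fields, and function below are hypothetical and only sketch the scheduling decision described above.

```python
from dataclasses import dataclass

@dataclass
class GPUNodeStatus:
    """Snapshot of a worker node's GPU state (hypothetical fields)."""
    name: str
    gpu_utilization: float    # 0.0 - 1.0, averaged over the node's GPUs
    free_gpu_memory_mib: int  # unallocated GPU memory in MiB

def select_node(nodes: list[GPUNodeStatus], required_memory_mib: int) -> GPUNodeStatus | None:
    """Pick the least-loaded node that still has enough free GPU memory."""
    candidates = [n for n in nodes if n.free_gpu_memory_mib >= required_memory_mib]
    if not candidates:
        return None  # no node can host the task right now
    return min(candidates, key=lambda n: n.gpu_utilization)

# Example: choose a node for a task that needs roughly 6 GiB of GPU memory.
nodes = [
    GPUNodeStatus("worker-1", gpu_utilization=0.85, free_gpu_memory_mib=2048),
    GPUNodeStatus("worker-2", gpu_utilization=0.30, free_gpu_memory_mib=10240),
]
print(select_node(nodes, required_memory_mib=6144))
```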
Follow these instructions to get a copy of the project up and running on your local machine for development and testing purposes.
- Python 3.10+
- `make` utility

- Clone the repository

```bash
git clone https://github.com/leoho0722/gpu-delegater-based-on-kubeai.git
cd gpu-delegater-based-on-kubeai
```

- Create and activate the virtual environment

The following command creates a Python virtual environment and installs all necessary dependencies from `src/requirements.txt`.

```bash
make build-env
source .venv/bin/activate
```
This command starts the main scheduler server, which listens for incoming inference requests.
```bash
# Make sure the virtual environment is activated
# source .venv/bin/activate
make serve
```
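Once the server is running, inference requests can be sent to it over HTTP. The host, port, and route below are placeholders for illustration only; the actual values are defined by the configuration in `src/config/` and the route definitions in `src/routes/`.

```python
import requests

# Hypothetical endpoint: adjust host, port, and path to match the
# scheduler's actual configuration in src/config/ and src/routes/.
url = "http://localhost:8000/v1/chat/completions"

payload = {
    "model": "gemma2:2b",  # one of the experimental models listed below
    "messages": [{"role": "user", "content": "Hello, what can you do?"}],
}

response = requests.post(url, json=payload, timeout=60)
response.raise_for_status()
print(response.json())
```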
This script executes a series of tests against the running scheduler to validate its functionality. It uses predefined configurations and prompts.
```bash
# Make sure the virtual environment is activated
# source .venv/bin/activate
python testing/testing_cli.py \
    --config-file testing/testing-config.yaml \
    --prompt-file testing/testing-prompts.json
```
A brief overview of the key directories in this project:
```
.
├── Makefile                     # Shortcuts for common tasks (e.g., `make build-env`, `make serve`)
├── README.md                    # Project documentation (English)
├── README_zh-TW.md              # Project documentation (Traditional Chinese)
├── assets/                      # Diagrams and images used in the documentation
├── deploy/                      # Kubernetes manifests and deployment configurations
│   ├── dockerfiles/             # Dockerfiles for building various KubeAI LLM Model images
│   ├── kubeai/                  # KubeAI-specific Kubernetes resources
│   │   ├── helm/                # Helm charts for deploying KubeAI and related components
│   │   ├── jobs/                # Kubernetes Jobs for pre-loading models into storage
│   │   ├── models/              # KubeAI custom resource definitions for models
│   │   └── pvc/                 # Persistent Volume Claim for model storage
│   └── ollama-worker.yaml       # Example deployment for a standalone Ollama worker
├── scripts/                     # Helper shell scripts for development and deployment
├── src/                         # Main application source code
│   ├── config/                  # Configuration parsing and management (config.yaml)
│   ├── controllers/             # API endpoint logic (FastAPI routers)
│   ├── k8s/                     # Kubernetes API client for interacting with the cluster
│   │   └── kubeai/              # KubeAI custom resource client
│   ├── llms/                    # Wrappers for different LLM runtimes
│   │   ├── ollama/              # Ollama runtime client
│   │   ├── openai/              # OpenAI-compatible runtime client
│   │   └── vllm/                # vLLM runtime client
│   ├── models/                  # Data models and API models
│   │   ├── apis/                # API request/response models
│   │   └── errors/              # Custom error models
│   ├── routes/                  # API route definitions
│   ├── scheduler/               # The core logic for the KubeAI LLM GPU Scheduler
│   │   ├── dispatcher/          # GPU resource allocation logic
│   │   └── monitoring/          # Cluster GPU resource monitoring logic
│   ├── services/                # Business logic services (e.g., LLM service, network service)
│   ├── utils/                   # Shared utility functions
│   │   └── constants/           # Application-wide constants
│   ├── requirements.txt         # Python package dependencies
│   ├── server.py                # Application entry point (FastAPI server)
│   └── supported-model.yaml     # A list of this research's experimental LLM models
└── testing/                     # Test scripts, configurations, and sample prompts
    ├── testing_apis/            # API testing clients
    └── testing_llms/            # LLM runtime testing clients
```
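As a sketch of what the `src/scheduler/monitoring/` side of the design needs, per-node GPU state can be read from Prometheus (part of the experimental environment below), for example via the DCGM exporter metrics shipped with the NVIDIA GPU Operator. The Prometheus address, metric name, and label used here are assumptions for illustration, not the project's actual configuration.

```python
import requests

# Hypothetical Prometheus address; the real one depends on how Prometheus
# is exposed in the cluster (e.g. an in-cluster Service or a NodePort).
PROMETHEUS_URL = "http://prometheus.monitoring.svc:9090"

def gpu_utilization_by_node() -> dict[str, float]:
    """Query per-node GPU utilization (%) from DCGM exporter metrics."""
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query",
        params={"query": "avg by (Hostname) (DCGM_FI_DEV_GPU_UTIL)"},
        timeout=10,
    )
    resp.raise_for_status()
    results = resp.json()["data"]["result"]
    return {r["metric"]["Hostname"]: float(r["value"][1]) for r in results}

if __name__ == "__main__":
    print(gpu_utilization_by_node())
```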
The experiments were conducted on a Kubernetes cluster with the following hardware/software specifications.
The cluster consists of one master node and two worker nodes.
**Master Node**

| Category | Specification |
| --- | --- |
| **Hardware** | |
| CPU | Intel Core i7-7700 (4C/8T, 3.6 GHz) |
| Memory | 32 GB DDR4 |
| Storage | 2.5" 1 TB SATA3 SSD |
| **Software** | |
| OS | Ubuntu 22.04.3 LTS |
| Kubernetes | 1.31.3 |
| Container Runtime | containerd 1.7.24 |
| CNI | Flannel 0.26.1 |
| Monitoring | Prometheus 3.1.0 |
| Tracing | Langtrace 3.8.17 |
**Worker Node 1**

| Category | Specification |
| --- | --- |
| **Hardware** | |
| CPU | Intel Core i7-7700 (4C/8T, 3.6 GHz) |
| Memory | 32 GB DDR4 |
| Storage | 2.5" 1 TB SATA3 SSD |
| GPU | 1x NVIDIA GeForce RTX 3070 Ti (8 GB GDDR6X) |
| **Software** | |
| OS | Ubuntu 22.04.3 LTS |
| Kubernetes | 1.31.3 |
| Container Runtime | containerd 1.7.24 |
| CNI | Flannel 0.26.1 |
| NVIDIA Driver | 550.54.15 |
| CUDA | 12.4 |
| cuDNN | 9.1.70 |
| NVIDIA Container Toolkit | 1.17.2 |
| NVIDIA GPU Operator | 24.9.1 |
| KubeAI | 0.18.0 |
**Worker Node 2**

| Category | Specification |
| --- | --- |
| **Hardware** | |
| CPU | Intel Core i7-13700 (16C/24T, 2.1 GHz) |
| Memory | 128 GB DDR4 |
| Storage | 2.5" 1 TB SATA3 SSD |
| GPU | 2x NVIDIA GeForce RTX 4070 (12 GB GDDR6X) |
| **Software** | |
| OS | Ubuntu 22.04.3 LTS |
| Kubernetes | 1.31.3 |
| Container Runtime | containerd 1.7.24 |
| CNI | Flannel 0.26.1 |
| NVIDIA Driver | 535.129.03 |
| CUDA | 12.2 |
| cuDNN | 9.1.70 |
| NVIDIA Container Toolkit | 1.17.2 |
| NVIDIA GPU Operator | 24.9.1 |
| KubeAI | 0.18.0 |
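Before running the experiments, it is worth confirming that the NVIDIA GPU Operator has exposed the worker GPUs to Kubernetes. Below is a small sketch (not part of this repository) using the official Kubernetes Python client to list the allocatable `nvidia.com/gpu` resources per node.

```python
from kubernetes import client, config

# Load the local kubeconfig (use config.load_incluster_config() inside a Pod).
config.load_kube_config()

# Print how many nvidia.com/gpu devices each node advertises as allocatable.
for node in client.CoreV1Api().list_node().items:
    gpus = node.status.allocatable.get("nvidia.com/gpu", "0")
    print(f"{node.metadata.name}: {gpus} GPU(s) allocatable")
```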
The following LLM models were used in the experiments.

**Ollama Models**

| Model Family | Models |
| --- | --- |
| Gemma 2 | `gemma2:2b`, `gemma2:9b`, `gemma2:27b` |
| Gemma 3 | `gemma3:4b`, `gemma3:12b`, `gemma3:27b` |
| Llama 3 | `llama3.1:8b`, `llama3.2:3b` |
**Hugging Face Models**

| Model Family | Models |
| --- | --- |
| Gemma 2 | `google/gemma-2-2b-it` |
| Gemma 3 | `google/gemma-3-1b-it` |
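The manifests under `deploy/kubeai/models/` register these models with KubeAI as custom resources. Below is a minimal sketch of doing the same programmatically with the official Kubernetes Python client; the API group/version and the spec fields (`features`, `url`, `engine`, `resourceProfile`) reflect my understanding of KubeAI's Model CRD around v0.18.0 and should be treated as assumptions, with the manifests in this repository as the authoritative reference.

```python
from kubernetes import client, config

# Load kubeconfig (use config.load_incluster_config() inside a Pod).
config.load_kube_config()

# Assumed shape of a KubeAI Model custom resource; verify the fields
# against the manifests in deploy/kubeai/models/.
model = {
    "apiVersion": "kubeai.org/v1",
    "kind": "Model",
    "metadata": {"name": "gemma2-2b-ollama"},
    "spec": {
        "features": ["TextGeneration"],
        "url": "ollama://gemma2:2b",
        "engine": "OLlama",
        "resourceProfile": "nvidia-gpu:1",  # assumed resource profile name
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="kubeai.org",
    version="v1",
    namespace="kubeai",
    plural="models",
    body=model,
)
```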