This repository stores the source code for the paper: "Design of a GPU Dynamic LLM Inference Task Scheduling Architecture Based on KubeAI".
The rapid advancement of artificial intelligence has recently spurred the growth of diverse technological fields. Since the launch of Large Language Models (LLMs) like OpenAI's ChatGPT in late 2022, there has been a surge in demand for various generative AI applications, such as programming assistants, image generation, Retrieval-Augmented Generation (RAG), and AI agent systems. In response, enterprises are increasingly building these applications in-house, often leveraging on-premises Kubernetes infrastructure to address security concerns. However, setting up a dedicated environment for generative AI inference is a complex process. The open-source project KubeAI simplifies this by allowing developers to quickly establish an inference environment by deploying model configurations.
Nevertheless, KubeAI utilizes the default Kubernetes scheduler, which is designed for general-purpose workloads and bases its scheduling decisions on metrics like CPU and memory usage, without considering GPU utilization. This limitation makes it inadequate for dynamic GPU resource scheduling. Consequently, KubeAI cannot dynamically allocate GPU resources based on the real-time demands of LLM inference tasks across the cluster, relying instead on static GPU allocations defined at deployment. This can lead to significant performance degradation when the assigned GPU resources become insufficient during task execution.
This study, therefore, introduces the design of a dynamic GPU scheduling architecture for LLM inference tasks, built upon KubeAI. By developing a specialized KubeAI LLM GPU Scheduler, we aim to improve upon KubeAI's native GPU scheduling capabilities, enabling more efficient resource allocation and enhanced inference performance.
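As a rough illustration of the core idea (not the scheduler's actual implementation), a GPU-aware dispatcher can rank candidate worker nodes by their current GPU utilization and free GPU memory before placing an inference task. The node names, metric fields, and function below are hypothetical and only sketch the scheduling decision described above.

```python
from dataclasses import dataclass

@dataclass
class GPUNodeStatus:
    """Snapshot of a worker node's GPU state (hypothetical fields)."""
    name: str
    gpu_utilization: float    # 0.0 - 1.0, averaged over the node's GPUs
    free_gpu_memory_mib: int  # unallocated GPU memory in MiB

def select_node(nodes: list[GPUNodeStatus], required_memory_mib: int) -> GPUNodeStatus | None:
    """Pick the least-loaded node that still has enough free GPU memory."""
    candidates = [n for n in nodes if n.free_gpu_memory_mib >= required_memory_mib]
    if not candidates:
        return None  # no node can host the task right now
    return min(candidates, key=lambda n: n.gpu_utilization)

# Example: choose a node for a task that needs roughly 6 GiB of GPU memory.
nodes = [
    GPUNodeStatus("worker-1", gpu_utilization=0.85, free_gpu_memory_mib=2048),
    GPUNodeStatus("worker-2", gpu_utilization=0.30, free_gpu_memory_mib=10240),
]
print(select_node(nodes, required_memory_mib=6144))
```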
Follow these instructions to get a copy of the project up and running on your local machine for development and testing purposes.
- Python 3.10+
- `make` utility

- Clone the repository

```bash
git clone https://github.com/leoho0722/gpu-delegater-based-on-kubeai.git
cd gpu-delegater-based-on-kubeai
```

- Create and activate the virtual environment

The following command creates a Python virtual environment and installs all necessary dependencies from `src/requirements.txt`.

```bash
make build-env
source .venv/bin/activate
```
This command starts the main scheduler server, which listens for incoming inference requests.
```bash
# Make sure the virtual environment is activated
# source .venv/bin/activate
make serve
```
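Once the server is running, inference requests can be sent to it over HTTP. The host, port, and route below are placeholders for illustration only; the actual values are defined by the configuration in `src/config/` and the route definitions in `src/routes/`.

```python
import requests

# Hypothetical endpoint: adjust host, port, and path to match the
# scheduler's actual configuration in src/config/ and src/routes/.
url = "http://localhost:8000/v1/chat/completions"

payload = {
    "model": "gemma2:2b",  # one of the experimental models listed below
    "messages": [{"role": "user", "content": "Hello, what can you do?"}],
}

response = requests.post(url, json=payload, timeout=60)
response.raise_for_status()
print(response.json())
```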
This script executes a series of tests against the running scheduler to validate its functionality. It uses predefined configurations and prompts.
```bash
# Make sure the virtual environment is activated
# source .venv/bin/activate
python testing/testing_cli.py \
    --config-file testing/testing-config.yaml \
    --prompt-file testing/testing-prompts.json
```
A brief overview of the key directories in this project:
```
.
├── Makefile                     # Shortcuts for common tasks (e.g., `make build-env`, `make serve`)
├── README.md                    # Project documentation (English)
├── README_zh-TW.md              # Project documentation (Traditional Chinese)
├── assets/                      # Diagrams and images used in the documentation
├── deploy/                      # Kubernetes manifests and deployment configurations
│   ├── dockerfiles/             # Dockerfiles for building various KubeAI LLM Model images
│   ├── kubeai/                  # KubeAI-specific Kubernetes resources
│   │   ├── helm/                # Helm charts for deploying KubeAI and related components
│   │   ├── jobs/                # Kubernetes Jobs for pre-loading models into storage
│   │   ├── models/              # KubeAI custom resource definitions for models
│   │   └── pvc/                 # Persistent Volume Claim for model storage
│   └── ollama-worker.yaml       # Example deployment for a standalone Ollama worker
├── scripts/                     # Helper shell scripts for development and deployment
├── src/                         # Main application source code
│   ├── config/                  # Configuration parsing and management (config.yaml)
│   ├── controllers/             # API endpoint logic (FastAPI routers)
│   ├── k8s/                     # Kubernetes API client for interacting with the cluster
│   │   └── kubeai/              # KubeAI custom resource client
│   ├── llms/                    # Wrappers for different LLM runtimes
│   │   ├── ollama/              # Ollama runtime client
│   │   ├── openai/              # OpenAI-compatible runtime client
│   │   └── vllm/                # vLLM runtime client
│   ├── models/                  # Data models and API models
│   │   ├── apis/                # API request/response models
│   │   └── errors/              # Custom error models
│   ├── routes/                  # API route definitions
│   ├── scheduler/               # The core logic for the KubeAI LLM GPU Scheduler
│   │   ├── dispatcher/          # GPU resource allocation logic
│   │   └── monitoring/          # Cluster GPU resource monitoring logic
│   ├── services/                # Business logic services (e.g., LLM service, network service)
│   ├── utils/                   # Shared utility functions
│   │   └── constants/           # Application-wide constants
│   ├── requirements.txt         # Python package dependencies
│   ├── server.py                # Application entry point (FastAPI server)
│   └── supported-model.yaml     # A list of this research's experimental LLM models
└── testing/                     # Test scripts, configurations, and sample prompts
    ├── testing_apis/            # API testing clients
    └── testing_llms/            # LLM runtime testing clients
```
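As a sketch of what the `src/scheduler/monitoring/` side of the design needs, per-node GPU state can be read from Prometheus (part of the experimental environment below), for example via the DCGM exporter metrics shipped with the NVIDIA GPU Operator. The Prometheus address, metric name, and label used here are assumptions for illustration, not the project's actual configuration.

```python
import requests

# Hypothetical Prometheus address; the real one depends on how Prometheus
# is exposed in the cluster (e.g. an in-cluster Service or a NodePort).
PROMETHEUS_URL = "http://prometheus.monitoring.svc:9090"

def gpu_utilization_by_node() -> dict[str, float]:
    """Query per-node GPU utilization (%) from DCGM exporter metrics."""
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query",
        params={"query": "avg by (Hostname) (DCGM_FI_DEV_GPU_UTIL)"},
        timeout=10,
    )
    resp.raise_for_status()
    results = resp.json()["data"]["result"]
    return {r["metric"]["Hostname"]: float(r["value"][1]) for r in results}

if __name__ == "__main__":
    print(gpu_utilization_by_node())
```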
The experiments were conducted on a Kubernetes cluster with the following hardware/software specifications.
The cluster consists of one master node and two worker nodes.
**Master Node**

| Category | Specification |
| --- | --- |
| **Hardware** | |
| CPU | Intel Core i7-7700 (4C/8T, 3.6 GHz) |
| Memory | 32 GB DDR4 |
| Storage | 2.5" 1 TB SATA3 SSD |
| **Software** | |
| OS | Ubuntu 22.04.3 LTS |
| Kubernetes | 1.31.3 |
| Container Runtime | containerd 1.7.24 |
| CNI | Flannel 0.26.1 |
| Monitoring | Prometheus 3.1.0 |
| Tracing | Langtrace 3.8.17 |
**Worker Node 1**

| Category | Specification |
| --- | --- |
| **Hardware** | |
| CPU | Intel Core i7-7700 (4C/8T, 3.6 GHz) |
| Memory | 32 GB DDR4 |
| Storage | 2.5" 1 TB SATA3 SSD |
| GPU | 1x NVIDIA GeForce RTX 3070 Ti (8 GB GDDR6X) |
| **Software** | |
| OS | Ubuntu 22.04.3 LTS |
| Kubernetes | 1.31.3 |
| Container Runtime | containerd 1.7.24 |
| CNI | Flannel 0.26.1 |
| NVIDIA Driver | 550.54.15 |
| CUDA | 12.4 |
| cuDNN | 9.1.70 |
| NVIDIA Container Toolkit | 1.17.2 |
| NVIDIA GPU Operator | 24.9.1 |
| KubeAI | 0.18.0 |
**Worker Node 2**

| Category | Specification |
| --- | --- |
| **Hardware** | |
| CPU | Intel Core i7-13700 (16C/24T, 2.1 GHz) |
| Memory | 128 GB DDR4 |
| Storage | 2.5" 1 TB SATA3 SSD |
| GPU | 2x NVIDIA GeForce RTX 4070 (12 GB GDDR6X) |
| **Software** | |
| OS | Ubuntu 22.04.3 LTS |
| Kubernetes | 1.31.3 |
| Container Runtime | containerd 1.7.24 |
| CNI | Flannel 0.26.1 |
| NVIDIA Driver | 535.129.03 |
| CUDA | 12.2 |
| cuDNN | 9.1.70 |
| NVIDIA Container Toolkit | 1.17.2 |
| NVIDIA GPU Operator | 24.9.1 |
| KubeAI | 0.18.0 |
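Before running the experiments, it is worth confirming that the NVIDIA GPU Operator has exposed the worker GPUs to Kubernetes. Below is a small sketch (not part of this repository) using the official Kubernetes Python client to list the allocatable `nvidia.com/gpu` resources per node.

```python
from kubernetes import client, config

# Load the local kubeconfig (use config.load_incluster_config() inside a Pod).
config.load_kube_config()

# Print how many nvidia.com/gpu devices each node advertises as allocatable.
for node in client.CoreV1Api().list_node().items:
    gpus = node.status.allocatable.get("nvidia.com/gpu", "0")
    print(f"{node.metadata.name}: {gpus} GPU(s) allocatable")
```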
The following LLM models were used in the experiments.

**Ollama Models**

| Model Family | Models |
| --- | --- |
| Gemma 2 | `gemma2:2b`, `gemma2:9b`, `gemma2:27b` |
| Gemma 3 | `gemma3:4b`, `gemma3:12b`, `gemma3:27b` |
| Llama 3 | `llama3.1:8b`, `llama3.2:3b` |
**Hugging Face Models**

| Model Family | Models |
| --- | --- |
| Gemma 2 | `google/gemma-2-2b-it` |
| Gemma 3 | `google/gemma-3-1b-it` |
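The manifests under `deploy/kubeai/models/` register these models with KubeAI as custom resources. Below is a minimal sketch of doing the same programmatically with the official Kubernetes Python client; the API group/version and the spec fields (`features`, `url`, `engine`, `resourceProfile`) reflect my understanding of KubeAI's Model CRD around v0.18.0 and should be treated as assumptions, with the manifests in this repository as the authoritative reference.

```python
from kubernetes import client, config

# Load kubeconfig (use config.load_incluster_config() inside a Pod).
config.load_kube_config()

# Assumed shape of a KubeAI Model custom resource; verify the fields
# against the manifests in deploy/kubeai/models/.
model = {
    "apiVersion": "kubeai.org/v1",
    "kind": "Model",
    "metadata": {"name": "gemma2-2b-ollama"},
    "spec": {
        "features": ["TextGeneration"],
        "url": "ollama://gemma2:2b",
        "engine": "OLlama",
        "resourceProfile": "nvidia-gpu:1",  # assumed resource profile name
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="kubeai.org",
    version="v1",
    namespace="kubeai",
    plural="models",
    body=model,
)
```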