KubeAI LLM GPU Scheduler

Traditional Chinese Version (README_zh-TW.md) | English Version

Python 3.10

This repository stores the source code for the paper: "Design of a GPU Dynamic LLM Inference Task Scheduling Architecture Based on KubeAI".

📖 Abstract

The rapid advancement of artificial intelligence has recently spurred the growth of diverse technological fields. Since the launch of Large Language Models (LLMs) like OpenAI's ChatGPT in late 2022, there has been a surge in demand for various generative AI applications, such as programming assistants, image generation, Retrieval-Augmented Generation (RAG), and AI agent systems. In response, enterprises are increasingly building these applications in-house, often leveraging on-premises Kubernetes infrastructure to address security concerns. However, setting up a dedicated environment for generative AI inference is a complex process. The open-source project KubeAI simplifies this by allowing developers to quickly establish an inference environment by deploying model configurations.

Nevertheless, KubeAI utilizes the default Kubernetes scheduler, which is designed for general-purpose workloads and bases its scheduling decisions on metrics like CPU and memory usage, without considering GPU utilization. This limitation makes it inadequate for dynamic GPU resource scheduling. Consequently, KubeAI cannot dynamically allocate GPU resources based on the real-time demands of LLM inference tasks across the cluster, relying instead on static GPU allocations defined at deployment. This can lead to significant performance degradation when the assigned GPU resources become insufficient during task execution.

This study, therefore, introduces the design of a dynamic GPU scheduling architecture for LLM inference tasks, built upon KubeAI. By developing a specialized KubeAI LLM GPU Scheduler, we aim to improve upon KubeAI's native GPU scheduling capabilities, enabling more efficient resource allocation and enhanced inference performance.

🏗️ System Architecture

[Diagram: System Architecture]

KubeAI LLM GPU Scheduler Architecture

[Diagram: KubeAI LLM GPU Scheduler Module Architecture]

GPU Monitor Architecture

[Diagram: GPU Monitor Module Architecture]

🚀 Getting Started

Follow these instructions to get a copy of the project up and running on your local machine for development and testing purposes.

Prerequisites

  • Python 3.10+
  • make utility

Installation

  1. Clone the repository

    git clone https://github.com/leoho0722/gpu-delegater-based-on-kubeai.git
    cd gpu-delegater-based-on-kubeai
  2. Create and activate the virtual environment

    The following command creates a Python virtual environment and installs all necessary dependencies from src/requirements.txt.

    make build-env
    source .venv/bin/activate

💻 Usage

Start the Scheduler Server

This command starts the main scheduler server, which listens for incoming inference requests.

# Make sure the virtual environment is activated
# source .venv/bin/activate

make serve
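
Once the server is up, inference requests can be submitted to it over HTTP. The snippet below is only a minimal sketch of such a request: the host, port, endpoint path, and payload fields are assumptions for illustration, not the project's actual API. Check src/routes/ and src/models/apis/ for the real route definitions and request schema.

import requests  # third-party HTTP client

# NOTE: URL and payload keys are hypothetical; see src/routes/ for the actual API.
SCHEDULER_URL = "http://localhost:8000/api/v1/inference"

payload = {
    "model": "gemma2:2b",  # one of the models listed under "Tested LLM Models and Runtimes"
    "prompt": "Briefly explain Kubernetes scheduling.",
}

response = requests.post(SCHEDULER_URL, json=payload, timeout=60)
response.raise_for_status()
print(response.json())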

Run Tests

This script executes a series of tests against the running scheduler to validate its functionality. It uses predefined configurations and prompts.

# Make sure the virtual environment is activated
# source .venv/bin/activate

python testing/testing_cli.py \
    --config-file testing/testing-config.yaml \
    --prompt-file testing/testing-prompts.json
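
The test CLI reads its run parameters from a YAML config file and its inputs from a JSON prompt file. The sketch below shows one plausible way these files could be loaded; the file contents and field names are not documented here, so treat this purely as an illustration of the flow. The authoritative schema is defined in testing/testing_cli.py.

import json
import yaml  # PyYAML

# Hypothetical loading logic; the actual parsing lives in testing/testing_cli.py.
with open("testing/testing-config.yaml") as f:
    config = yaml.safe_load(f)

with open("testing/testing-prompts.json") as f:
    prompts = json.load(f)

print(f"Loaded test config and {len(prompts)} prompt entries")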

📂 Project Structure

A brief overview of the key directories in this project:

.
├── Makefile                 # Shortcuts for common tasks (e.g., `make build-env`, `make serve`)
├── README.md                # Project documentation (English)
├── README_zh-TW.md          # Project documentation (Traditional Chinese)
├── assets/                  # Diagrams and images used in the documentation
├── deploy/                  # Kubernetes manifests and deployment configurations
│   ├── dockerfiles/         # Dockerfiles for building various KubeAI LLM Model images
│   ├── kubeai/              # KubeAI-specific Kubernetes resources
│   │   ├── helm/            # Helm charts for deploying KubeAI and related components
│   │   ├── jobs/            # Kubernetes Jobs for pre-loading models into storage
│   │   ├── models/          # KubeAI custom resource definitions for models
│   │   └── pvc/             # Persistent Volume Claim for model storage
│   └── ollama-worker.yaml   # Example deployment for a standalone Ollama worker
├── scripts/                 # Helper shell scripts for development and deployment
├── src/                     # Main application source code
│   ├── config/              # Configuration parsing and management (config.yaml)
│   ├── controllers/         # API endpoint logic (FastAPI routers)
│   ├── k8s/                 # Kubernetes API client for interacting with the cluster
│   │   └── kubeai/          # KubeAI custom resource client
│   ├── llms/                # Wrappers for different LLM runtimes
│   │   ├── ollama/          # Ollama runtime client
│   │   ├── openai/          # OpenAI-compatible runtime client
│   │   └── vllm/            # vLLM runtime client
│   ├── models/              # Data models and API models
│   │   ├── apis/            # API request/response models
│   │   └── errors/          # Custom error models
│   ├── routes/              # API route definitions
│   ├── scheduler/           # The core logic for the KubeAI LLM GPU Scheduler
│   │   ├── dispatcher/      # GPU resource allocation logic
│   │   └── monitoring/      # Cluster GPU resource monitoring logic
│   ├── services/            # Business logic services (e.g., LLM service, network service)
│   ├── utils/               # Shared utility functions
│   │   └── constants/       # Application-wide constants
│   ├── requirements.txt     # Python package dependencies
│   ├── server.py            # Application entry point (FastAPI server)
│   └── supported-model.yaml # LLM models used in this study's experiments
└── testing/                 # Test scripts, configurations, and sample prompts
    ├── testing_apis/        # API testing clients
    └── testing_llms/        # LLM runtime testing clients
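
The scheduler/ package above is where the core decision making lives: monitoring/ gathers per-GPU utilization across the cluster and dispatcher/ chooses a placement for each incoming inference task. The following is a minimal, self-contained sketch of that idea (pick the least-utilized GPU that still has enough free memory), not the repository's actual implementation; the class and field names are made up for illustration.

from dataclasses import dataclass

@dataclass
class GPUStatus:
    node: str            # Kubernetes node hosting the GPU
    index: int           # GPU index on that node
    utilization: float   # 0.0 - 1.0, e.g. derived from DCGM/nvidia-smi metrics
    free_memory_mib: int

def dispatch(gpus: list[GPUStatus], required_memory_mib: int) -> GPUStatus | None:
    """Pick the least-utilized GPU that can hold the model; None if no GPU fits."""
    candidates = [g for g in gpus if g.free_memory_mib >= required_memory_mib]
    if not candidates:
        return None
    return min(candidates, key=lambda g: g.utilization)

# Example: place an ~6 GiB model on a snapshot of a small two-node cluster.
snapshot = [
    GPUStatus("worker-1", 0, utilization=0.85, free_memory_mib=2048),
    GPUStatus("worker-2", 0, utilization=0.30, free_memory_mib=10240),
    GPUStatus("worker-2", 1, utilization=0.10, free_memory_mib=4096),
]
print(dispatch(snapshot, required_memory_mib=6144))  # -> worker-2, GPU 0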

🛠️ Experimental Environment

The experiments were conducted on a Kubernetes cluster with the following hardware/software specifications.

Kubernetes Cluster

The cluster consists of one master node and two worker nodes.

Master Node

Hardware
  • CPU: Intel Core i7-7700 (4C/8T, 3.6 GHz)
  • Memory: 32 GB DDR4
  • Storage: 2.5" 1 TB SATA3 SSD

Software
  • OS: Ubuntu 22.04.3 LTS
  • Kubernetes: 1.31.3
  • Container Runtime: containerd 1.7.24
  • CNI: Flannel 0.26.1
  • Monitoring: Prometheus 3.1.0
  • Tracing: Langtrace 3.8.17

Worker Node 1

Hardware
  • CPU: Intel Core i7-7700 (4C/8T, 3.6 GHz)
  • Memory: 32 GB DDR4
  • Storage: 2.5" 1 TB SATA3 SSD
  • GPU: 1x NVIDIA GeForce RTX 3070 Ti (8 GB GDDR6X)

Software
  • OS: Ubuntu 22.04.3 LTS
  • Kubernetes: 1.31.3
  • Container Runtime: containerd 1.7.24
  • CNI: Flannel 0.26.1
  • NVIDIA Driver: 550.54.15
  • CUDA: 12.4
  • cuDNN: 9.1.70
  • NVIDIA Container Toolkit: 1.17.2
  • NVIDIA GPU Operator: 24.9.1
  • KubeAI: 0.18.0

Worker Node 2

Hardware
  • CPU: Intel Core i7-13700 (16C/24T, 2.1 GHz)
  • Memory: 128 GB DDR4
  • Storage: 2.5" 1 TB SATA3 SSD
  • GPU: 2x NVIDIA GeForce RTX 4070 (12 GB GDDR6X)

Software
  • OS: Ubuntu 22.04.3 LTS
  • Kubernetes: 1.31.3
  • Container Runtime: containerd 1.7.24
  • CNI: Flannel 0.26.1
  • NVIDIA Driver: 535.129.03
  • CUDA: 12.2
  • cuDNN: 9.1.70
  • NVIDIA Container Toolkit: 1.17.2
  • NVIDIA GPU Operator: 24.9.1
  • KubeAI: 0.18.0

Tested LLM Models and Runtimes

Ollama

  • Gemma 2: gemma2:2b, gemma2:9b, gemma2:27b
  • Gemma 3: gemma3:4b, gemma3:12b, gemma3:27b
  • Llama 3: llama3.1:8b, llama3.2:3b

vLLM

  • Gemma 2: google/gemma-2-2b-it
  • Gemma 3: google/gemma-3-1b-it
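
In a KubeAI deployment, these models are described by KubeAI Model custom resources (see deploy/kubeai/models/). One quick way to inspect which Model resources exist in a cluster is the Kubernetes custom-objects API; the sketch below assumes the CRD group is kubeai.org with version v1 and plural models, and that KubeAI is installed in a kubeai namespace. Verify these against the deployed CRDs (for example with kubectl get crd) before relying on them.

from kubernetes import client, config  # official Kubernetes Python client

config.load_kube_config()  # use config.load_incluster_config() when running inside a Pod

# Assumed CRD coordinates and namespace for KubeAI Model resources; verify in your cluster.
GROUP, VERSION, PLURAL, NAMESPACE = "kubeai.org", "v1", "models", "kubeai"

api = client.CustomObjectsApi()
models = api.list_namespaced_custom_object(GROUP, VERSION, NAMESPACE, PLURAL)

for item in models.get("items", []):
    name = item["metadata"]["name"]
    engine = item.get("spec", {}).get("engine", "<unknown>")
    print(f"{name}: engine={engine}")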
