InferStack-rs

A high-performance, production-ready inference server written in Rust that supports model versioning, A/B testing, caching, and comprehensive monitoring.

Features

Model Versioning: Support multiple model versions with traffic allocation
A/B Testing: Configure traffic distribution across different model versions
Caching: Redis-based caching for inference results
Rate Limiting: Configurable rate limiting per client
Monitoring: Comprehensive metrics via Prometheus and Grafana dashboards
Batch Processing: Efficient handling of batch inference requests
Input Validation: Configurable input size limits and validation
Health Checks: Built-in health monitoring endpoints
Graceful Shutdown: Proper shutdown handling with cleanup
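
As an illustration of what traffic allocation across model versions can look like, here is a minimal sketch of weighted version selection. It is not taken from the InferStack-rs source: the VersionWeight struct, its fields, and the pick_version function are hypothetical names, and the caller is assumed to supply the random roll (for example from an RNG or a hash of the client ID).

```rust
/// A model version and the share of traffic it should receive.
/// Hypothetical type for illustration; not part of the InferStack-rs API.
#[derive(Debug, Clone, PartialEq)]
pub struct VersionWeight {
    pub version: String,
    pub weight: u32,
}

/// Picks a version for one request.
/// `roll` is any u32 supplied by the caller; it is wrapped into the
/// range `0..total_weight`, so a uniform roll yields a weighted pick.
/// Returns `None` when the list is empty or every weight is zero.
pub fn pick_version(versions: &[VersionWeight], roll: u32) -> Option<&str> {
    let total: u32 = versions.iter().map(|v| v.weight).sum();
    if total == 0 {
        return None;
    }
    let mut remaining = roll % total;
    for v in versions {
        // Each version owns a contiguous slice of the range, sized by its weight.
        if remaining < v.weight {
            return Some(v.version.as_str());
        }
        remaining -= v.weight;
    }
    None // not reached when total > 0
}
```

With weights of 90 for "v1" and 10 for "v2", rolls 0 through 89 select "v1" and rolls 90 through 99 select "v2", which corresponds to a 90/10 split under a uniform roll.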

Quick Start

Prerequisites

Rust (latest stable version)
Redis (optional, for caching)
Docker and Docker Compose (optional, for containerization)

Installation

# Clone the repository
git clone https://github.com/Pewpenguin/inferstack-rs
cd inferstack-rs

# Build the project
cargo build --release

Configuration

The server is configured through environment variables defined in a .env file.

Use .env.example as the reference for the expected variable names and structure, and make sure every required variable is defined before starting the server.

Running

# Run directly
cargo run --release

# Or using Docker
docker-compose up -d

API Endpoints

Health Check

GET /health

Inference

POST /inference
Content-Type: application/json

{
  "input": [[1.0, 2.0, 3.0]],
  "model_version": "v1"
}

The model_version field is optional.

Batch Inference

POST /inference
Content-Type: application/json

{
  "input": [
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0]
  ],
  "batch": true,
  "model_version": "v1"
}

The model_version field is optional.
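
The two request shapes above can be produced from client code with a small helper. The following is a sketch using only the standard library; the function names inference_body and inference_request are hypothetical, and the host value is whatever address your deployment listens on. It assumes finite input values and a version label that needs no JSON escaping.

```rust
/// Builds the JSON body for POST /inference.
/// `batch` adds the "batch": true field; `model_version` is optional.
/// Assumes finite floats (NaN and infinity are not valid JSON).
pub fn inference_body(input: &[Vec<f64>], batch: bool, model_version: Option<&str>) -> String {
    let rows: Vec<String> = input
        .iter()
        .map(|row| {
            // Debug formatting keeps the decimal point, e.g. 1.0 rather than 1.
            let cells: Vec<String> = row.iter().map(|x| format!("{:?}", x)).collect();
            format!("[{}]", cells.join(", "))
        })
        .collect();
    let mut body = format!("{{\"input\": [{}]", rows.join(", "));
    if batch {
        body.push_str(", \"batch\": true");
    }
    if let Some(v) = model_version {
        // Assumes the label contains no quotes or backslashes.
        body.push_str(&format!(", \"model_version\": \"{}\"", v));
    }
    body.push('}');
    body
}

/// Wraps a body in a minimal HTTP/1.1 request that can be written to a TcpStream.
pub fn inference_request(host: &str, body: &str) -> String {
    format!(
        "POST /inference HTTP/1.1\r\nHost: {}\r\nContent-Type: application/json\r\nContent-Length: {}\r\nConnection: close\r\n\r\n{}",
        host,
        body.len(),
        body
    )
}
```

A real client would more likely use an HTTP library and a JSON serializer; the hand-built strings here only keep the example dependency-free.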

Monitoring

Metrics are exposed at the /metrics endpoint. Key metrics include:

inferstack_inference_total: Total inference requests
inferstack_model_version_usage_total: Usage by model version
inferstack_inference_duration_seconds: Inference latency
inferstack_cache_operations_total: Cache operation statistics
inferstack_batch_throughput_items_per_second: Batch processing performance
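
To read one of these metrics outside Grafana, the /metrics response can be parsed directly, since Prometheus endpoints serve a line-based text format. The sketch below sums all samples of a given metric name; metric_total is a hypothetical helper, and the "version" label in the usage note is an assumed label name, not one confirmed by the project.

```rust
/// Sums every sample of `name` in Prometheus text exposition output.
/// Labelled series (name{...} value) are added together.
/// Returns `None` if no sample with that exact name is present.
pub fn metric_total(exposition: &str, name: &str) -> Option<f64> {
    let mut total = 0.0;
    let mut found = false;
    for line in exposition.lines() {
        let line = line.trim();
        // Skip blank lines and # HELP / # TYPE comments.
        if line.is_empty() || line.starts_with('#') {
            continue;
        }
        // The sample name ends at the label set or at the first whitespace.
        let name_end = line
            .find(|c: char| c == '{' || c.is_whitespace())
            .unwrap_or(line.len());
        if &line[..name_end] != name {
            continue;
        }
        let after_name = &line[name_end..];
        // Skip the label set, if any; the value is the next token.
        let rest = if after_name.starts_with('{') {
            match after_name.rfind('}') {
                Some(i) => &after_name[i + 1..],
                None => continue, // malformed line, ignore it
            }
        } else {
            after_name
        };
        if let Some(value) = rest.split_whitespace().next().and_then(|v| v.parse::<f64>().ok()) {
            total += value;
            found = true;
        }
    }
    if found { Some(total) } else { None }
}
```

For example, given the lines `inferstack_model_version_usage_total{version="v1"} 30` and `inferstack_model_version_usage_total{version="v2"} 12`, calling metric_total with that metric name returns Some(42.0).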

Grafana Dashboard

A pre-configured Grafana dashboard is available in monitoring/grafana/dashboards/.

Development

Contributing

Fork the repository
Create your feature branch
Commit your changes
Push to the branch
Create a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.
