Aggressive LLM switching operator for resource-starved Kubernetes clusters.
snipsnap manages a single GPU, loading one model at a time. When a different model is requested via the OpenAI-compatible API, the current model is immediately terminated and the new one is spun up. Persistent volume caching avoids re-downloading model weights.
Client (OpenAI SDK)
|
v
snipsnap Proxy (:8000) <-- OpenAI-compatible API
|
v
Workspace Controller <-- Detects model mismatch, kills old pod, creates new
|
v
Inference Pod (Ollama/vLLM) <-- Mounts cache PVC, claims GPU
|
v
GPU (RTX 4090)
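The switching behaviour described above can be sketched as a small in-memory simulation. This is only an illustration of the "one model at a time, kill on mismatch" rule, not the operator's actual controller code; the Workspace class, the ensure_model method, and the event strings below are hypothetical names introduced for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Workspace:
    """Toy stand-in for the single-GPU workspace: at most one active model."""

    active_model: Optional[str] = None
    events: List[str] = field(default_factory=list)

    def ensure_model(self, requested: str) -> bool:
        """Make `requested` the active model; return True if a switch happened."""
        if self.active_model == requested:
            return False  # already loaded, nothing to do
        if self.active_model is not None:
            # Mismatch: the old inference pod is terminated first, freeing the GPU.
            self.events.append(f"kill {self.active_model}")
        # The new pod is then created (here we only record the event).
        self.events.append(f"start {requested}")
        self.active_model = requested
        return True


ws = Workspace()
ws.ensure_model("llama3")      # cold start, nothing to kill
ws.ensure_model("llama3")      # same model requested again: no-op
ws.ensure_model("mistral-7b")  # mismatch: llama3 is killed, mistral-7b started
print(ws.events)
```

The sketch deliberately ignores in-flight requests and pod readiness; it only shows that a request for a different model always evicts the current one before the new one starts.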
- Kubernetes cluster with GPU support (NVIDIA device plugin)
- kubectl configured
- Helm 3+
CRDs ship inside the chart at charts/snipsnap/crds/, so Helm installs them on the first apply:
helm install snipsnap charts/snipsnap --namespace snipsnap --create-namespace
For a local dev cluster, use the values-dev.yaml overrides (always-pull image, metrics on, sample model pre-seeded) via the convenience target:
make dev-deploy
The chart templates Model CRs from the models: array in your values file. Add entries inline:
models:
  - name: llama3
    url: "ollama://llama3"
    engine: OLlama
    cache:
      enabled: true
      storageSize: "20Gi"
    resources:
      limits:
        nvidia.com/gpu: "1"
Then run helm upgrade snipsnap charts/snipsnap -f your-values.yaml to apply.
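To make the templating step concrete, here is a rough sketch of how one models: entry could map to a Model CR. The chart's real template is not shown in this README, so the mapping below (name goes to metadata.name, every other key goes under spec) is an assumption, and model_manifest is a hypothetical helper name.

```python
import json
from typing import Any, Dict


def model_manifest(entry: Dict[str, Any]) -> Dict[str, Any]:
    """Build a Model CR dict from one `models:` values entry (assumed mapping)."""
    # Assumption: everything except `name` is copied verbatim into `spec`.
    spec = {key: value for key, value in entry.items() if key != "name"}
    return {
        "apiVersion": "snipsnap.xgeeks.com/v1",
        "kind": "Model",
        "metadata": {"name": entry["name"]},
        "spec": spec,
    }


# The same entry as in the values snippet above.
entry = {
    "name": "llama3",
    "url": "ollama://llama3",
    "engine": "OLlama",
    "cache": {"enabled": True, "storageSize": "20Gi"},
    "resources": {"limits": {"nvidia.com/gpu": "1"}},
}

manifest = model_manifest(entry)
print(json.dumps(manifest, indent=2))
```

Printing the result as JSON is just for inspection; Helm itself would render YAML.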
# The proxy auto-switches models. First request to llama3 will load it:
curl http://snipsnap-api.snipsnap:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3", "messages": [{"role": "user", "content": "Hello!"}]}'
# Requesting mistral-7b will kill llama3 and load mistral:
curl http://snipsnap-api.snipsnap:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "mistral-7b", "messages": [{"role": "user", "content": "Hello!"}]}'
# List available models:
curl http://snipsnap-api.snipsnap:8000/v1/models
Defines an LLM that can be loaded. Supports Ollama and vLLM engines.
apiVersion: snipsnap.xgeeks.com/v1
kind: Model
metadata:
  name: llama3
spec:
  url: "ollama://llama3"
  engine: OLlama
  cache:
    enabled: true
    storageSize: "20Gi"
  resources:
    limits:
      nvidia.com/gpu: "1"
Tracks which model is currently active on the GPU.
apiVersion: snipsnap.xgeeks.com/v1
kind: Workspace
metadata:
  name: default
spec:
  activeModel: "llama3"
# Generate CRDs and code
make generate manifests
# Run locally against a cluster
make run
# Run tests
make test
# Build container image
make docker-build IMG=snipsnap:dev
Apache License 2.0