
Commit ee082b4

add pvc to k8s & update docs
Signed-off-by: JaredforReal <[email protected]>
1 parent 3d11e5f commit ee082b4

File tree

8 files changed: +111 −92 lines


deploy/kubernetes/README.md

Lines changed: 27 additions & 14 deletions
````diff
@@ -28,8 +28,11 @@ The deployment consists of:
 
 ### Standard Kubernetes Deployment
 
+First-time apply (creates PVC via storage overlay):
+
 ```bash
-kubectl apply -k deploy/kubernetes/
+kubectl apply -k deploy/kubernetes/overlays/storage
+kubectl apply -k deploy/kubernetes/overlays/core   # or overlays/llm-katan
 
 # Check deployment status
 kubectl get pods -l app=semantic-router -n vllm-semantic-router-system
@@ -39,6 +42,12 @@ kubectl get services -l app=semantic-router -n vllm-semantic-router-system
 kubectl logs -l app=semantic-router -n vllm-semantic-router-system -f
 ```
 
+Day-2 updates (do not touch PVC):
+
+```bash
+kubectl apply -k deploy/kubernetes/overlays/core   # or overlays/llm-katan
+```
+
 ### Kind (Kubernetes in Docker) Deployment
 
 For local development and testing, you can deploy to a kind cluster with optimized resource settings.
@@ -86,6 +95,10 @@ kubectl wait --for=condition=Ready nodes --all --timeout=300s
 **Step 2: Deploy the application**
 
 ```bash
+# First-time storage (PVC)
+kubectl apply -k deploy/kubernetes/overlays/storage
+
+# Deploy app
 kubectl apply -k deploy/kubernetes/
 
 # Wait for deployment to be ready
@@ -298,7 +311,7 @@ kubectl logs -n semantic-router -l app=semantic-router -c model-downloader
 # Check resource usage
 kubectl top pods -n semantic-router
 
-# Adjust resource limits in deployment.yaml if needed
+# Adjust resource limits in base/deployment.yaml if needed
 ```
 
 ### Storage sizing
@@ -314,23 +327,23 @@ For different environments, you can adjust resource requirements:
 - **Testing**: 4Gi memory, 1 CPU
 - **Production**: 8Gi+ memory, 2+ CPU
 
-Edit the `resources` section in `deployment.yaml` accordingly.
+Edit the `resources` section in `base/deployment.yaml` accordingly.
 
 ## Files Overview
 
 ### Kubernetes Manifests (`deploy/kubernetes/`)
 
-- `base/` - Shared resources (Namespace, PVC, Service, ConfigMap)
-- `overlays/core/` - Core deployment (no llm-katan)
-- `overlays/llm-katan/` - Deployment with llm-katan sidecar
-- `deployment.yaml` - Plain deployment (used by core overlay)
-- `deployment.katan.yaml` - Sidecar deployment (used by llm-katan overlay)
-- `service.yaml` - gRPC, HTTP API, and metrics services
-- `pvc.yaml` - Persistent volume claim for model storage
-- `namespace.yaml` - Dedicated namespace for the application
-- `config.yaml` - Application configuration (defaults to qwen3 @ 127.0.0.1:8002)
-- `tools_db.json` - Tools database for semantic routing
-- `kustomization.yaml` - Root entry (defaults to core overlay)
+- `base/` - Shared resources (Namespace, Service, ConfigMap, Deployment)
+  - `namespace.yaml` - Dedicated namespace for the application
+  - `service.yaml` - gRPC, HTTP API, and metrics services
+  - `deployment.yaml` - App deployment (init downloads by default; imagePullPolicy IfNotPresent)
+  - `config.yaml` - Application configuration (defaults to qwen3 @ 127.0.0.1:8002)
+  - `tools_db.json` - Tools database for semantic routing
+  - `pv.yaml` - OPTIONAL hostPath PV for local models (edit path as needed)
+- `overlays/core/` - Core deployment (no llm-katan), references `base/`
+- `overlays/llm-katan/` - Adds llm-katan sidecar via local patch (no parent file references)
+- `overlays/storage/` - PVC only (self-contained `namespace.yaml` + `pvc.yaml`), run once to create storage
+- `kustomization.yaml` - Root entry (defaults to `overlays/core`)
 
 ### Development Tools
 
````
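The body of the new `overlays/storage/pvc.yaml` is not shown in this commit. A minimal sketch consistent with the notes elsewhere in these docs (claim name `semantic-router-models`, 30Gi default, `storageClassName: standard` are all assumptions drawn from the surrounding documentation, not from the diff):

```yaml
# Hypothetical sketch of overlays/storage/pvc.yaml; the actual file may differ.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: semantic-router-models          # claim name assumed from the hostPath PV example
  namespace: vllm-semantic-router-system
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 30Gi                     # docs: default 30Gi; size to 2-3x of model footprint
  storageClassName: standard            # may differ per cluster (standard-rwo, gp2, local-path, ...)
```

Because the claim lives in its own overlay, re-running `kubectl apply -k deploy/kubernetes/overlays/core` for day-2 updates never touches it.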

deploy/kubernetes/base/kustomization.yaml

Lines changed: 0 additions & 1 deletion
```diff
@@ -3,7 +3,6 @@ kind: Kustomization
 
 resources:
 - ./namespace.yaml
-- ./pvc.yaml
 - ./service.yaml
 - ./deployment.yaml
```

File renamed without changes.
Lines changed: 6 additions & 0 deletions
```diff
@@ -0,0 +1,6 @@
+apiVersion: kustomize.config.k8s.io/v1beta1
+kind: Kustomization
+
+resources:
+- ./namespace.yaml
+- ./pvc.yaml
```
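For contrast with this self-contained storage overlay, the core overlay's kustomization (its body is not part of this diff) plausibly just points back at `base/`, per the README's "references `base/`" note. A sketch under that assumption, not the actual file:

```yaml
# Hypothetical overlays/core/kustomization.yaml ("Uses only base" per the docs tree).
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:
- ../../base
```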
Lines changed: 4 additions & 0 deletions
```diff
@@ -0,0 +1,4 @@
+apiVersion: v1
+kind: Namespace
+metadata:
+  name: vllm-semantic-router-system
```
File renamed without changes.

website/docs/installation/kubernetes.md

Lines changed: 29 additions & 20 deletions
````diff
@@ -31,11 +31,11 @@ kind create cluster --name semantic-router-cluster --config tools/kind/kind-conf
 kubectl wait --for=condition=Ready nodes --all --timeout=300s
 ```
 
-**Note**: The kind configuration provides sufficient resources (8GB+ RAM, 4+ CPU cores) for running the semantic router and AI gateway components.
+Note: The kind configuration provides sufficient resources (8GB+ RAM, 4+ CPU cores).
 
 ## Step 2: Deploy vLLM Semantic Router
 
-Configure the semantic router by editing `deploy/kubernetes/config.yaml`. This file contains the vLLM configuration, including model config, endpoints, and policies. The repository provides two Kustomize overlays similar to docker-compose profiles:
+Edit `deploy/kubernetes/config.yaml` (models, endpoints, policies). Two overlays are provided:
 
 - core (default): only the semantic-router
   - Path: `deploy/kubernetes/overlays/core` (root `deploy/kubernetes/` points here by default)
@@ -49,41 +49,50 @@ deploy/kubernetes/
 base/
   kustomization.yaml     # base kustomize: namespace, PVC, service, deployment
   namespace.yaml         # Namespace for all resources
-  pvc.yaml               # PVC for models (storageClass and size adjustable)
   service.yaml           # Service exposing gRPC/metrics/HTTP ports
   deployment.yaml        # Semantic Router Deployment (init downloads by default)
   config.yaml            # Router config (mounted via ConfigMap)
   tools_db.json          # Tools DB (mounted via ConfigMap)
-  pv.example.yaml        # OPTIONAL: hostPath PV example for local models
+  pv.yaml                # OPTIONAL: hostPath PV for local models (edit path as needed)
 overlays/
   core/
     kustomization.yaml   # Uses only base
   llm-katan/
     kustomization.yaml   # Patches base to add llm-katan sidecar
     patch-llm-katan.yaml # Strategic-merge patch injecting sidecar
+  storage/
+    kustomization.yaml   # PVC only; run once to create storage, not for day-2 updates
+    namespace.yaml       # Local copy for self-contained apply
+    pvc.yaml             # PVC definition
 kustomization.yaml       # Root points to overlays/core by default
 README.md                # Additional notes
 namespace.yaml, pvc.yaml, service.yaml (top-level shortcuts kept for backward compat)
 ```
 
 Notes:
 
-- The base deployment includes an initContainer that downloads required models on first run.
-- If your cluster has limited egress, prefer mounting local models via a PV/PVC and skip downloads:
-  - Copy `base/pv.example.yaml` to `base/pv.yaml`, apply it, and ensure `base/pvc.yaml` is bound to that PV.
-  - Mount point remains `/app/models` in the pod.
-  - See “Network Tips” for details on hostPath PV, image mirrors, and preloading images.
-
-Important notes before you apply manifests:
-
-- `vllm_endpoints.address` must be an IP address (not hostname) reachable from inside the cluster. If your LLM backends run as K8s Services, use the ClusterIP (for example `10.96.0.10`) and set `port` accordingly. Do not include protocol or path.
-- The PVC in `deploy/kubernetes/pvc.yaml` uses `storageClassName: standard`. On some clouds or local clusters, the default StorageClass name may differ (e.g., `standard-rwo`, `gp2`, or a provisioner like local-path). Adjust as needed.
-- Default PVC size is 30Gi. Size it to at least 2–3x of your total model footprint to leave room for indexes and updates.
-- The initContainer downloads several models from Hugging Face on first run and writes them into the PVC. Ensure outbound egress to Hugging Face is allowed and there is at least ~6–8 GiB free space for the models specified.
-- Per mode, the init container downloads differ:
-  - core: classifiers + the embedding model `sentence-transformers/all-MiniLM-L12-v2` into `/app/models/all-MiniLM-L12-v2`.
-  - llm-katan: everything in core, plus `Qwen/Qwen3-0.6B` into `/app/models/Qwen/Qwen3-0.6B`.
-- The default `config.yaml` points to `qwen3` at `127.0.0.1:8002`, which matches the llm-katan overlay. If you use core (no sidecar), either change `vllm_endpoints` to your actual backend Service IP:Port, or deploy the llm-katan overlay.
+- Base downloads models on first run (initContainer).
+- In restricted networks, prefer local models via PV/PVC; see Network Tips for hostPath PV, mirrors, and image preload. Mount point is `/app/models`.
+
+First-time apply (creates PVC):
+
+```bash
+kubectl apply -k deploy/kubernetes/overlays/storage
+kubectl apply -k deploy/kubernetes/overlays/core   # or overlays/llm-katan
+```
+
+Day-2 updates (do not touch PVC):
+
+```bash
+kubectl apply -k deploy/kubernetes/overlays/core   # or overlays/llm-katan
+```
+
+Important:
+
+- `vllm_endpoints.address` must be an IP reachable inside the cluster (no scheme/path).
+- PVC default size is 30Gi; adjust to model footprint. StorageClass name may differ by cluster.
+- core downloads classifiers + `all-MiniLM-L12-v2`; llm-katan also prepares `Qwen/Qwen3-0.6B`.
+- Default config uses `qwen3 @ 127.0.0.1:8002` (matches llm-katan); if using core, update endpoints accordingly.
 
 Deploy the semantic router service with all required components (core mode by default):
 
````
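The note that `vllm_endpoints.address` must be a bare IP can be made concrete with a config fragment. The schema of `config.yaml` is not shown in this diff, so the field layout and the `10.96.0.10` ClusterIP below are illustrative only, pieced together from the endpoint notes in these docs:

```yaml
# Illustrative fragment for deploy/kubernetes/config.yaml (schema assumed, not from the diff).
vllm_endpoints:
  - name: qwen3
    address: 10.96.0.10   # ClusterIP of your backend Service; an IP, never a hostname
    port: 8002            # no scheme, no path
```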

website/docs/troubleshooting/network-tips.md

Lines changed: 45 additions & 57 deletions
````diff
@@ -180,10 +180,43 @@ Container runtimes on Kubernetes nodes do not automatically reuse the host Docke
 
 ### 5.1 Configure containerd or CRI mirrors
 
-- For clusters backed by containerd (Kind, k3s, kubeadm), edit `/etc/containerd/config.toml` or use Kind’s `containerdConfigPatches` to add regional mirror endpoints for registries such as `docker.io`, `ghcr.io`, or `quay.io`.
+- For clusters backed by containerd (Kind, k3s, kubeadm), edit `/etc/containerd/config.toml` or use Kind’s `containerdConfigPatches` to add regional mirror endpoints for registries such as `docker.io`, `ghcr.io`, or `registry.k8s.io`.
 - Restart containerd and kubelet after changes so the new mirrors take effect.
 - Avoid pointing mirrors to loopback proxies unless every node can reach that proxy address.
 
+Example `/etc/containerd/config.toml` mirrors (China):
+
+```toml
+[plugins."io.containerd.grpc.v1.cri".registry.mirrors]
+  [plugins."io.containerd.grpc.v1.cri".registry.mirrors."docker.io"]
+    endpoint = [
+      "https://docker.m.daocloud.io",
+      "https://mirror.ccs.tencentyun.com",
+      "https://mirror.baidubce.com",
+      "https://docker.mirrors.ustc.edu.cn",
+      "https://hub-mirror.c.163.com"
+    ]
+  [plugins."io.containerd.grpc.v1.cri".registry.mirrors."ghcr.io"]
+    endpoint = [
+      "https://ghcr.nju.edu.cn",
+      "https://ghcr.dockerproxy.com",
+      "https://ghcr.bj.bcebos.com"
+    ]
+  [plugins."io.containerd.grpc.v1.cri".registry.mirrors."registry.k8s.io"]
+    endpoint = [
+      "https://k8s.m.daocloud.io",
+      "https://mirror.ccs.tencentyun.com",
+      "https://registry.aliyuncs.com"
+    ]
+```
+
+Apply and restart:
+
+```bash
+sudo systemctl restart containerd
+sudo systemctl restart kubelet
+```
+
 ### 5.2 Preload or sideload images
 
 - Build required images locally, then push them into the cluster runtime. For Kind, run `kind load docker-image --name <cluster> <image:tag>`; for other clusters, use `crictl pull` or `ctr -n k8s.io images import` on each node.
@@ -203,74 +236,29 @@ Container runtimes on Kubernetes nodes do not automatically reuse the host Docke
 - Use `kubectl describe pod <name>` or `kubectl get events` to confirm pull errors disappear.
 - Check that services such as `semantic-router-metrics` now expose endpoints and respond via port-forward (`kubectl port-forward svc/<service> <local-port>:<service-port>`).
 
-### 5.6 Mount local models via hostPath PV (no external HF)
+### 5.6 Mount local models via PV/PVC (no external HF)
 
-When you already have models under `./models` locally, you can mount them into the Pod without any download:
+When you already have models under `./models` locally, mount them into the Pod and skip downloads:
 
-1. Create a hostPath PV and a matching PVC (example paths assume Kind; for other clusters, pick a node path visible to kubelet):
-
-```yaml
-apiVersion: v1
-kind: PersistentVolume
-metadata:
-  name: semantic-router-models-pv
-spec:
-  capacity:
-    storage: 50Gi
-  accessModes: ["ReadWriteOnce"]
-  storageClassName: standard
-  persistentVolumeReclaimPolicy: Retain
-  hostPath:
-    path: /tmp/hostpath-provisioner/models
-    type: DirectoryOrCreate
----
-apiVersion: v1
-kind: PersistentVolumeClaim
-metadata:
-  name: semantic-router-models
-spec:
-  accessModes: ["ReadWriteOnce"]
-  resources:
-    requests:
-      storage: 30Gi
-  storageClassName: standard
-  volumeName: semantic-router-models-pv
-```
+1. Create a PV (optional; edit `deploy/kubernetes/base/pv.yaml` hostPath to your node path and apply it). If you use a dynamic StorageClass, you can skip the PV.
 
-2. Copy your local models into the node path (Kind example):
+2. Create the PVC once via the storage overlay:
 
 ```bash
-docker cp ./models semantic-router-cluster-control-plane:/tmp/hostpath-provisioner/
+kubectl apply -k deploy/kubernetes/overlays/storage
 ```
 
-3. Ensure the Deployment mounts the PVC at `/app/models` and set `imagePullPolicy: IfNotPresent`:
+3. Copy your local models to the node path (hostPath example for kind):
 
-```yaml
-volumes:
-  - name: models-volume
-    persistentVolumeClaim:
-      claimName: semantic-router-models
-containers:
-  - name: semantic-router
-    imagePullPolicy: IfNotPresent
-    volumeMounts:
-      - name: models-volume
-        mountPath: /app/models
+```bash
+docker cp ./models semantic-router-cluster-control-plane:/tmp/hostpath-provisioner/
 ```
 
-4. If the PV is tied to a specific node path, schedule the Pod onto that node via `nodeSelector` or add tolerations if you untainted the control-plane node:
+4. Ensure the Deployment mounts the PVC at `/app/models` and set `imagePullPolicy: IfNotPresent` (already configured in `base/deployment.yaml`).
 
-```yaml
-spec:
-  nodeSelector:
-    kubernetes.io/hostname: semantic-router-cluster-control-plane
-  tolerations:
-    - key: "node-role.kubernetes.io/control-plane"
-      effect: "NoSchedule"
-      operator: "Exists"
-```
+5. If the PV is tied to a specific node path, pin the Pod to that node using `nodeSelector` or add tolerations if you untainted the control-plane node.
 
-This approach completely avoids Hugging Face downloads inside the cluster and is the most reliable in restricted networks.
+This path avoids Hugging Face downloads and is the most reliable in restricted networks.
 
 ## 6. Troubleshooting
 
````
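The section above mentions Kind's `containerdConfigPatches` as an alternative to editing `/etc/containerd/config.toml` on every node. A minimal Kind cluster config carrying one of the mirrors from the example (the mirror choice is illustrative; this snippet is not part of the commit):

```yaml
# Sketch of a kind cluster config: bake a docker.io mirror into node containerd at creation time.
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
containerdConfigPatches:
  - |-
    [plugins."io.containerd.grpc.v1.cri".registry.mirrors."docker.io"]
      endpoint = ["https://docker.m.daocloud.io"]
```

Pass it at cluster creation with `kind create cluster --config <file>`, as in Step 1 of the installation guide; no post-creation containerd restart is needed on the Kind nodes.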
