**Note**: The kind configuration provides sufficient resources (8GB+ RAM, 4+ CPU cores) for running the semantic router and AI gateway components.
## Step 2: Deploy vLLM Semantic Router
Configure the semantic router by editing `deploy/kubernetes/config.yaml`. This file contains the vLLM configuration, including model config, endpoints, and policies. The repository provides two Kustomize overlays similar to docker-compose profiles:
- core (default): only the semantic-router
- Path: `deploy/kubernetes/overlays/core` (root `deploy/kubernetes/` points here by default)
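
With Kustomize, each overlay is applied with `kubectl apply -k`. A sketch of the typical invocations (the root path defaults to the core overlay, as noted above):

```shell
# Default (core): the root kustomization points at overlays/core
kubectl apply -k deploy/kubernetes/

# Or select an overlay explicitly:
kubectl apply -k deploy/kubernetes/overlays/core
kubectl apply -k deploy/kubernetes/overlays/llm-katan
```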
```
deploy/kubernetes/
  base/
    kustomization.yaml   # Base kustomize: namespace, PVC, service, deployment
    namespace.yaml       # Namespace for all resources
    pvc.yaml             # PVC for models (storageClass and size adjustable)
    service.yaml         # Service exposing gRPC/metrics/HTTP ports
    deployment.yaml      # Semantic Router Deployment (init downloads by default)
    config.yaml          # Router config (mounted via ConfigMap)
    tools_db.json        # Tools DB (mounted via ConfigMap)
    pv.yaml              # OPTIONAL: hostPath PV for local models (edit path as needed)
  overlays/
    core/
      kustomization.yaml # Uses only base
    llm-katan/
      kustomization.yaml # Patches base to add llm-katan sidecar
    pvc/
      kustomization.yaml # PVC only; run once to create storage, not for day-2 updates
      namespace.yaml     # Local copy for self-contained apply
      pvc.yaml           # PVC definition
  kustomization.yaml     # Root points to overlays/core by default
  README.md              # Additional notes
  namespace.yaml, pvc.yaml, service.yaml   # Top-level shortcuts kept for backward compat
```
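
Since the root kustomization only points at the core overlay, it can be as small as the following sketch (the repository's actual file may add labels or patches):

```yaml
# deploy/kubernetes/kustomization.yaml — minimal sketch
resources:
  - overlays/core
```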

Notes:

- The base deployment includes an initContainer that downloads the required models on first run.
- If your cluster has limited egress, prefer mounting local models via a PV/PVC and skip the downloads:
  - Edit the `hostPath` in `base/pv.yaml` for your environment, apply it, and ensure `base/pvc.yaml` binds to that PV.
  - The mount point remains `/app/models` in the pod.
  - See "Network Tips" for details on hostPath PVs, image mirrors, and preloading images.

Important notes before you apply the manifests:

- `vllm_endpoints.address` must be an IP address (not a hostname) reachable from inside the cluster. If your LLM backends run as Kubernetes Services, use the ClusterIP (for example `10.96.0.10`) and set `port` accordingly. Do not include a protocol or path.
- The PVC in `deploy/kubernetes/pvc.yaml` uses `storageClassName: standard`. On some clouds or local clusters the default StorageClass name differs (e.g., `standard-rwo`, `gp2`, or a provisioner such as local-path); adjust as needed.
- The default PVC size is 30Gi. Size it to at least 2–3x your total model footprint to leave room for indexes and updates.
- On first run, the initContainer downloads several models from Hugging Face and writes them into the PVC. Ensure outbound egress to Hugging Face is allowed and that there is at least ~6–8 GiB of free space for the models specified.
- The initContainer's downloads differ per mode:
  - core: classifiers plus the embedding model `sentence-transformers/all-MiniLM-L12-v2`, stored in `/app/models/all-MiniLM-L12-v2`.
  - llm-katan: everything in core, plus `Qwen/Qwen3-0.6B`, stored in `/app/models/Qwen/Qwen3-0.6B`.
- The default `config.yaml` points to `qwen3` at `127.0.0.1:8002`, which matches the llm-katan overlay. If you use core (no sidecar), either change `vllm_endpoints` to your actual backend Service IP:port, or deploy the llm-katan overlay.
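
For illustration, a `vllm_endpoints` entry following the rules above might look like this (the endpoint name and port come from the default `config.yaml`; the IP is a placeholder ClusterIP, and the exact schema of surrounding fields may differ):

```yaml
vllm_endpoints:
  - name: qwen3
    address: 10.96.0.10   # ClusterIP only — no hostname, protocol, or path
    port: 8002
```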

## Network Tips (`website/docs/troubleshooting/network-tips.md`)
Container runtimes on Kubernetes nodes do not automatically reuse the host Docker image cache.

### 5.1 Configure containerd or CRI mirrors

- For clusters backed by containerd (Kind, k3s, kubeadm), edit `/etc/containerd/config.toml` or use Kind's `containerdConfigPatches` to add regional mirror endpoints for registries such as `docker.io`, `ghcr.io`, or `registry.k8s.io`.
- Restart containerd and kubelet after changes so the new mirrors take effect.
- Avoid pointing mirrors to loopback proxies unless every node can reach that proxy address.

Example `/etc/containerd/config.toml` mirrors (China):
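
A minimal sketch using containerd 1.x's legacy `registry.mirrors` syntax; the mirror hostnames below are placeholders — substitute regional mirrors you trust:

```toml
# Sketch only — replace the example.com endpoints with real mirror hosts.
[plugins."io.containerd.grpc.v1.cri".registry.mirrors]
  [plugins."io.containerd.grpc.v1.cri".registry.mirrors."docker.io"]
    endpoint = ["https://docker-mirror.example.com"]
  [plugins."io.containerd.grpc.v1.cri".registry.mirrors."ghcr.io"]
    endpoint = ["https://ghcr-mirror.example.com"]
  [plugins."io.containerd.grpc.v1.cri".registry.mirrors."registry.k8s.io"]
    endpoint = ["https://k8s-mirror.example.com"]
```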
- Build required images locally, then push them into the cluster runtime. For Kind, run `kind load docker-image --name <cluster> <image:tag>`; for other clusters, use `crictl pull` or `ctr -n k8s.io images import` on each node.
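
The build-and-preload flow above can be sketched as follows; the image tag and cluster name are placeholders, and the `ctr` import applies to non-Kind nodes:

```shell
# Build locally, then copy the image into the Kind cluster's containerd.
docker build -t semantic-router:dev .
kind load docker-image --name <cluster> semantic-router:dev

# On non-Kind nodes, export a tarball and import it on each node instead:
docker save semantic-router:dev -o semantic-router.tar
sudo ctr -n k8s.io images import semantic-router.tar
```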
- Use `kubectl describe pod <name>` or `kubectl get events` to confirm pull errors disappear.
- Check that services such as `semantic-router-metrics` now expose endpoints and respond via port-forward (`kubectl port-forward svc/<service> <local-port>:<service-port>`).
### 5.6 Mount local models via PV/PVC (no external HF)

When you already have models under `./models` locally, mount them into the Pod and skip downloads:
1. Create a PV (optional): edit the `hostPath` in `deploy/kubernetes/base/pv.yaml` to a node path visible to kubelet and apply it. If you use a dynamic StorageClass, you can skip the PV. A hostPath PV with a matching PVC looks like this (example path assumes Kind):

   ```yaml
   apiVersion: v1
   kind: PersistentVolume
   metadata:
     name: semantic-router-models-pv
   spec:
     capacity:
       storage: 50Gi
     accessModes: ["ReadWriteOnce"]
     storageClassName: standard
     persistentVolumeReclaimPolicy: Retain
     hostPath:
       path: /tmp/hostpath-provisioner/models
       type: DirectoryOrCreate
   ---
   apiVersion: v1
   kind: PersistentVolumeClaim
   metadata:
     name: semantic-router-models
   spec:
     accessModes: ["ReadWriteOnce"]
     resources:
       requests:
         storage: 30Gi
     storageClassName: standard
     volumeName: semantic-router-models-pv
   ```
2. Copy your local models into the node path (Kind example):
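
   With Kind, each node is a Docker container, so one way to stage the files (the node name is a placeholder; the target path matches the hostPath PV above) is:

   ```shell
   # Copy the local ./models directory into the node path backing the hostPath PV.
   docker cp ./models/. <cluster>-control-plane:/tmp/hostpath-provisioner/models/

   # Verify the files are visible from inside the node.
   docker exec <cluster>-control-plane ls /tmp/hostpath-provisioner/models
   ```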
4. Ensure the Deployment mounts the PVC at `/app/models` and set `imagePullPolicy: IfNotPresent` (already configured in `base/deployment.yaml`).
5. If the PV is tied to a specific node path, pin the Pod to that node using `nodeSelector` or add tolerations if you untainted the control-plane node.
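
   A sketch of the corresponding Deployment patch; the node name is a placeholder, and the toleration is only needed if the control-plane taint is still present:

   ```yaml
   spec:
     template:
       spec:
         nodeSelector:
           kubernetes.io/hostname: <node-name>
         tolerations:
           - key: node-role.kubernetes.io/control-plane
             operator: Exists
             effect: NoSchedule
   ```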
This path avoids Hugging Face downloads and is the most reliable in restricted networks.