**Note**: The kind configuration provides sufficient resources (8GB+ RAM, 4+ CPU cores) for running the semantic router and AI gateway components.
## Step 2: Deploy vLLM Semantic Router
Configure the semantic router by editing `deploy/kubernetes/config.yaml`. This file contains the vLLM configuration, including model config, endpoints, and policies. The repository provides two Kustomize overlays similar to docker-compose profiles:
- core (default): only the semantic-router
- Path: `deploy/kubernetes/overlays/core` (root `deploy/kubernetes/` points here by default)
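
With Kustomize, each overlay is applied with `kubectl apply -k`. A sketch of the typical invocations (the root path defaults to the core overlay, as noted above):

```shell
# Default (core): the root kustomization points at overlays/core
kubectl apply -k deploy/kubernetes/

# Or select an overlay explicitly:
kubectl apply -k deploy/kubernetes/overlays/core
kubectl apply -k deploy/kubernetes/overlays/llm-katan
```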
```
deploy/kubernetes/
  base/
    kustomization.yaml   # Base kustomize: namespace, PVC, service, deployment
    namespace.yaml       # Namespace for all resources
    pvc.yaml             # PVC for models (storageClass and size adjustable)
    service.yaml         # Service exposing gRPC/metrics/HTTP ports
    deployment.yaml      # Semantic Router Deployment (init downloads by default)
    config.yaml          # Router config (mounted via ConfigMap)
    tools_db.json        # Tools DB (mounted via ConfigMap)
    pv.yaml              # OPTIONAL: hostPath PV for local models (edit path as needed)
  overlays/
    core/
      kustomization.yaml # Uses only base
    llm-katan/
      kustomization.yaml # Patches base to add llm-katan sidecar
    pvc/
      kustomization.yaml # PVC only; run once to create storage, not for day-2 updates
      namespace.yaml     # Local copy for self-contained apply
      pvc.yaml           # PVC definition
  kustomization.yaml     # Root points to overlays/core by default
  README.md              # Additional notes
  namespace.yaml, pvc.yaml, service.yaml   # Top-level shortcuts kept for backward compat
```
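
Since the root kustomization only points at the core overlay, it can be as small as the following sketch (the repository's actual file may add labels or patches):

```yaml
# deploy/kubernetes/kustomization.yaml — minimal sketch
resources:
  - overlays/core
```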

Notes:

- The base deployment includes an initContainer that downloads the required models on first run.
- If your cluster has limited egress, prefer mounting local models via a PV/PVC and skip the downloads:
  - Edit the `hostPath` in `base/pv.yaml` for your environment, apply it, and ensure `base/pvc.yaml` binds to that PV.
  - The mount point remains `/app/models` in the pod.
  - See "Network Tips" for details on hostPath PVs, image mirrors, and preloading images.

Important notes before you apply the manifests:

- `vllm_endpoints.address` must be an IP address (not a hostname) reachable from inside the cluster. If your LLM backends run as Kubernetes Services, use the ClusterIP (for example `10.96.0.10`) and set `port` accordingly. Do not include a protocol or path.
- The PVC in `deploy/kubernetes/pvc.yaml` uses `storageClassName: standard`. On some clouds or local clusters the default StorageClass name differs (e.g., `standard-rwo`, `gp2`, or a provisioner such as local-path); adjust as needed.
- The default PVC size is 30Gi. Size it to at least 2–3x your total model footprint to leave room for indexes and updates.
- On first run, the initContainer downloads several models from Hugging Face and writes them into the PVC. Ensure outbound egress to Hugging Face is allowed and that there is at least ~6–8 GiB of free space for the models specified.
- The initContainer's downloads differ per mode:
  - core: classifiers plus the embedding model `sentence-transformers/all-MiniLM-L12-v2`, stored in `/app/models/all-MiniLM-L12-v2`.
  - llm-katan: everything in core, plus `Qwen/Qwen3-0.6B`, stored in `/app/models/Qwen/Qwen3-0.6B`.
- The default `config.yaml` points to `qwen3` at `127.0.0.1:8002`, which matches the llm-katan overlay. If you use core (no sidecar), either change `vllm_endpoints` to your actual backend Service IP:port, or deploy the llm-katan overlay.
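
For illustration, a `vllm_endpoints` entry following the rules above might look like this (the endpoint name and port come from the default `config.yaml`; the IP is a placeholder ClusterIP, and the exact schema of surrounding fields may differ):

```yaml
vllm_endpoints:
  - name: qwen3
    address: 10.96.0.10   # ClusterIP only — no hostname, protocol, or path
    port: 8002
```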

## Network Tips (`website/docs/troubleshooting/network-tips.md`)
Container runtimes on Kubernetes nodes do not automatically reuse the host Docker image cache.

### 5.1 Configure containerd or CRI mirrors

- For clusters backed by containerd (Kind, k3s, kubeadm), edit `/etc/containerd/config.toml` or use Kind's `containerdConfigPatches` to add regional mirror endpoints for registries such as `docker.io`, `ghcr.io`, or `registry.k8s.io`.
- Restart containerd and kubelet after changes so the new mirrors take effect.
- Avoid pointing mirrors to loopback proxies unless every node can reach that proxy address.

Example `/etc/containerd/config.toml` mirrors (China):
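
A minimal sketch using containerd 1.x's legacy `registry.mirrors` syntax; the mirror hostnames below are placeholders — substitute regional mirrors you trust:

```toml
# Sketch only — replace the example.com endpoints with real mirror hosts.
[plugins."io.containerd.grpc.v1.cri".registry.mirrors]
  [plugins."io.containerd.grpc.v1.cri".registry.mirrors."docker.io"]
    endpoint = ["https://docker-mirror.example.com"]
  [plugins."io.containerd.grpc.v1.cri".registry.mirrors."ghcr.io"]
    endpoint = ["https://ghcr-mirror.example.com"]
  [plugins."io.containerd.grpc.v1.cri".registry.mirrors."registry.k8s.io"]
    endpoint = ["https://k8s-mirror.example.com"]
```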
- Build required images locally, then push them into the cluster runtime. For Kind, run `kind load docker-image --name <cluster> <image:tag>`; for other clusters, use `crictl pull` or `ctr -n k8s.io images import` on each node.
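
The build-and-preload flow above can be sketched as follows; the image tag and cluster name are placeholders, and the `ctr` import applies to non-Kind nodes:

```shell
# Build locally, then copy the image into the Kind cluster's containerd.
docker build -t semantic-router:dev .
kind load docker-image --name <cluster> semantic-router:dev

# On non-Kind nodes, export a tarball and import it on each node instead:
docker save semantic-router:dev -o semantic-router.tar
sudo ctr -n k8s.io images import semantic-router.tar
```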
- Use `kubectl describe pod <name>` or `kubectl get events` to confirm pull errors disappear.
- Check that services such as `semantic-router-metrics` now expose endpoints and respond via port-forward (`kubectl port-forward svc/<service> <local-port>:<service-port>`).
### 5.6 Mount local models via PV/PVC (no external HF)

When you already have models under `./models` locally, mount them into the Pod and skip downloads:
1. Create a PV (optional): edit the `hostPath` in `deploy/kubernetes/base/pv.yaml` to a node path visible to kubelet and apply it. If you use a dynamic StorageClass, you can skip the PV. A hostPath PV with a matching PVC looks like this (example path assumes Kind):

   ```yaml
   apiVersion: v1
   kind: PersistentVolume
   metadata:
     name: semantic-router-models-pv
   spec:
     capacity:
       storage: 50Gi
     accessModes: ["ReadWriteOnce"]
     storageClassName: standard
     persistentVolumeReclaimPolicy: Retain
     hostPath:
       path: /tmp/hostpath-provisioner/models
       type: DirectoryOrCreate
   ---
   apiVersion: v1
   kind: PersistentVolumeClaim
   metadata:
     name: semantic-router-models
   spec:
     accessModes: ["ReadWriteOnce"]
     resources:
       requests:
         storage: 30Gi
     storageClassName: standard
     volumeName: semantic-router-models-pv
   ```
2. Copy your local models into the node path (Kind example):
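
   With Kind, each node is a Docker container, so one way to stage the files (the node name is a placeholder; the target path matches the hostPath PV above) is:

   ```shell
   # Copy the local ./models directory into the node path backing the hostPath PV.
   docker cp ./models/. <cluster>-control-plane:/tmp/hostpath-provisioner/models/

   # Verify the files are visible from inside the node.
   docker exec <cluster>-control-plane ls /tmp/hostpath-provisioner/models
   ```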
4. Ensure the Deployment mounts the PVC at `/app/models` and set `imagePullPolicy: IfNotPresent` (already configured in `base/deployment.yaml`).
5. If the PV is tied to a specific node path, pin the Pod to that node using `nodeSelector` or add tolerations if you untainted the control-plane node.
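
   A sketch of the corresponding Deployment patch; the node name is a placeholder, and the toleration is only needed if the control-plane taint is still present:

   ```yaml
   spec:
     template:
       spec:
         nodeSelector:
           kubernetes.io/hostname: <node-name>
         tolerations:
           - key: node-role.kubernetes.io/control-plane
             operator: Exists
             effect: NoSchedule
   ```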
This path avoids Hugging Face downloads and is the most reliable in restricted networks.