What happened?
With the changes in PR #391, we found that the metadata (annotations/labels) from engine/router on pods is no longer passed down to the pod spec reconcile logic, causing wrong determinations wherever that logic relies on the values of those annotations/labels.
The wrong determinations found so far in the pod spec reconcile logic (caused by #391):
- wrong env variable added;
- wrong volume mount added;
- wrong volume added.
What did you expect to happen?
It is okay to make changes that avoid adding unnecessary annotations/labels to components that don't need them. The expectation is that wherever the logic relies on that metadata (engine/router annotations/labels on pods), the correct metadata must still be passed down so the existing reconcile logic does not break.
The expectations for such changes:
- Mention explicitly in the PR description which components you are changing to carry different metadata;
- Keep passing the original metadata wherever there is logic that relies on its value.
If your goal conflicts with the current reconcile logic, open an issue first.
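The second expectation can be sketched as a simple metadata merge when building the pod template. This is a minimal illustration, not OME's actual code; the function and variable names are hypothetical:

```go
package main

import "fmt"

// mergeMetadata sketches the fix this issue asks for: engine/router
// annotations (and labels) must be carried onto the pod template so
// that the downstream pod spec reconcile logic can still read them.
// Entries already present on the destination are not overwritten.
func mergeMetadata(dst, src map[string]string) map[string]string {
	if dst == nil {
		dst = map[string]string{}
	}
	for k, v := range src {
		if _, exists := dst[k]; !exists {
			dst[k] = v
		}
	}
	return dst
}

func main() {
	// Hypothetical engine-level annotations from the runtime spec.
	engineAnnotations := map[string]string{"ome.io/inject-model-init": "true"}
	podAnnotations := mergeMetadata(nil, engineAnnotations)
	fmt.Println(podAnnotations["ome.io/inject-model-init"]) // prints "true"
}
```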
How can we reproduce it (as minimally and precisely as possible)?
Take a runtime spec with the annotation ome.io/inject-model-init: 'true' under engineConfig, which we use to inject the model init for model enigma usage. With the changes from #391, this annotation from engine is not passed down to the pod spec reconcile logic, so an unwanted env variable (MODEL_PATH) and an unwanted volume mount are added to our serving container.
Example runtime with ome.io/inject-model-init: 'true' annotation:
```yaml
apiVersion: ome.io/v1beta1
kind: ClusterServingRuntime
metadata:
  annotations:
    meta.helm.sh/release-name: ome-model-serving
    meta.helm.sh/release-namespace: ome
  labels:
    app.kubernetes.io/managed-by: Helm
  name: sample-runtime
spec:
  engineConfig:
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
            - matchExpressions:
                - key: node.kubernetes.io/instance-type
                  operator: In
                  values:
                    - BM.GPU.H100.8
    annotations:
      ome.io/inject-model-init: 'true'
    labels:
      logging-forward: enabled
    runner:
      image: sample-image:sample-tag
      ports:
        - containerPort: 8080
          name: http1
          protocol: TCP
      resources:
        limits:
          nvidia.com/gpu: '2'
        requests:
          cpu: 16
          memory: 120Gi
          nvidia.com/gpu: '2'
      volumeMounts:
        - mountPath: /dev/shm
          name: shared-memory
        - mountPath: /opt/ml/model
          name: model-empty-dir
    tolerations:
      - effect: NoSchedule
        key: nvidia.com/gpu
        operator: Exists
    volumes:
      - emptyDir:
          medium: Memory
          sizeLimit: 1Gi
        name: shared-memory
      - emptyDir:
          medium: Memory
        name: model-empty-dir
  modelSizeRange:
    max: 120B
    min: 100B
  supportedModelFormats:
    - autoSelect: true
      modelArchitecture: CohereForCausalLM
      modelFormat:
        name: tensorrtllm
        operator: Equal
        version: v0.8.0
        weight: 1
      modelFramework:
        name: tensorrtllm
        operator: Equal
        version: v0.8.0
        weight: 1
      version: '1.0'
```
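The failure mode can be illustrated with a sketch of the reconcile-side decision that broke. This is not OME's actual code; the function name is hypothetical, but the logic matches the bug description: injection of the default MODEL_PATH env variable and model volume mount into the serving container is keyed off the engine annotation, so when the annotation is not propagated, the reconciler falls back to the wrong default:

```go
package main

import "fmt"

// shouldAddDefaultModelMount sketches the decision that went wrong.
// When model init is injected (annotation set to "true"), the init
// container provides the model, so the default MODEL_PATH env variable
// and volume mount must be skipped. If the annotation is dropped on
// the way to the reconcile logic (the bug from #391), this returns
// true and the unwanted env variable and mount are added.
func shouldAddDefaultModelMount(podAnnotations map[string]string) bool {
	return podAnnotations["ome.io/inject-model-init"] != "true"
}

func main() {
	// Annotation correctly propagated from engineConfig to the pod.
	propagated := map[string]string{"ome.io/inject-model-init": "true"}
	fmt.Println(shouldAddDefaultModelMount(propagated)) // false: default mount correctly skipped
	// Annotation dropped (the regression): reconciler sees no metadata.
	fmt.Println(shouldAddDefaultModelMount(nil)) // true: unwanted MODEL_PATH env/mount added
}
```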
Anything else we need to know?
It is related to the changes in PR #391. That PR has been reverted; please follow this issue for the corresponding changes moving forward.
Environment
- OME version:
- Kubernetes version (use `kubectl version`):
- Cloud provider or hardware configuration:
- OS (e.g., from `/etc/os-release`):
- Runtime (SGLang, vLLM, etc.) and version:
- Model being served (if applicable):
- Install method (Helm, kubectl, etc.):