Open
Labels
bug (POINT - Bugfix entry in the CHANGELOG), question (Further information is requested)
Description
CDP Control Plane Region: EU-1
Configuration:
ml_worker:
  instance_type: m6a.8xlarge
  instance_count: 1
  min_instances: 1
  max_instances: 4
  root_volume: 512
  instance_tier: ON_DEMAND
ml_worker_gpu:
  min_instances: 0
  max_instances: 3
  instance_count: 0
  instance_tier: ON_DEMAND
  instance_type: g4dn.2xlarge
  root_volume: 512
enable_governance: false
Module configs
- name: Create instance groups
  block:
    - name: Set standard non-gpu instance groups
      set_fact:
        instance_groups:
          - name: cpu_settings
            autoscaling:
              maxInstances: "{{ ml_worker['max_instances'] }}"
              minInstances: "{{ ml_worker['min_instances'] }}"
            instanceType: "{{ ml_worker['instance_type'] }}"
            instanceTier: "{{ ml_worker['instance_tier'] }}"
            rootVolume:
              size: "{{ ml_worker['root_volume'] }}"
    - name: Add GPU instance group if defined
      set_fact:
        instance_groups: "{{ instance_groups + gpu_instance_group }}"
      when: "'ml_worker_gpu' in cml_cluster"
      vars:
        ml_worker_gpu: "{{ cml_cluster['ml_worker_gpu'] }}"
        gpu_instance_group:
          - name: gpu_settings
            autoscaling:
              maxInstances: "{{ ml_worker_gpu['max_instances'] }}"
              minInstances: "{{ ml_worker_gpu['min_instances'] }}"
            instanceType: "{{ ml_worker_gpu['instance_type'] }}"
            instanceTier: "{{ ml_worker_gpu['instance_tier'] }}"
            rootVolume:
              size: "{{ ml_worker_gpu['root_volume'] }}"
  vars:
    ml_worker: "{{ cml_cluster['ml_worker'] }}"
- name: "Install ML workspace {{ cml_cluster_name }}"
  cloudera.cloud.ml:
    name: "{{ cml_cluster_name }}"
    env: "{{ env_name }}"
    k8s_request:
      environmentName: "{{ env_name }}"
      instanceGroups: "{{ instance_groups }}"
      tags: "{{ cml_cluster['tags'] }}"
    governance: "{{ cml_cluster['enable_governance'] }}"
    public_loadbalancer: false
    monitoring: true
    ip_addresses: []
    debug: true
    timeout: 7200
    cp_region: "{{ cp_region }}"
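For reference, with the configuration above the two `set_fact` tasks should resolve `instance_groups` to roughly the following. This is a hand-rendered sketch, not captured output: it assumes the configuration block sits under `cml_cluster`, that the GPU group is appended because `ml_worker_gpu` is defined, and that numeric values keep their integer type after templating (they may come through as strings depending on the Ansible and Jinja settings in use).

```yaml
# Hand-rendered from the configuration above; not actual task output.
instance_groups:
  - name: cpu_settings
    autoscaling:
      maxInstances: 4
      minInstances: 1
    instanceType: m6a.8xlarge
    instanceTier: ON_DEMAND
    rootVolume:
      size: 512
  - name: gpu_settings
    autoscaling:
      maxInstances: 3
      minInstances: 0
    instanceType: g4dn.2xlarge
    instanceTier: ON_DEMAND
    rootVolume:
      size: 512
```

Note that `instance_count` from the configuration is not passed through by either task. A `debug` task with `var: instance_groups` placed before the install task would confirm what is actually sent in `k8s_request`.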
Errors
│ Normal Scheduled 5m4s default-scheduler Successfully assigned mlx/ds-operator-5b64cfc648-x7nxp to ip-10-132-9-62.eu-central-1.compute.internal │
│ Warning FailedMount 5m2s (x2 over 5m3s) kubelet MountVolume.SetUp failed for volume "ds-operator-tls" : secret "ds-operator-tls2" not found │
│ Warning FailedMount 5m2s (x2 over 5m3s) kubelet MountVolume.SetUp failed for volume "ds-vfs-crt" : secret "ds-vfs-tls2" not found │
│ Warning FailedMount 5m2s (x2 over 5m3s) kubelet MountVolume.SetUp failed for volume "s2i-registry-auth-crt" : secret "s2i-registry-auth-tls2" not found │
│ Warning FailedMount 5m2s (x2 over 5m3s) kubelet MountVolume.SetUp failed for volume "tgtgen-tls" : secret "tgtgen-tls2" not found │
│ Warning FailedMount 5m2s (x2 over 5m3s) kubelet MountVolume.SetUp failed for volume "tcp-ingress-controller-crt" : secret "tcp-ingress-controller-tls2" not found │
│ Warning FailedMount 5m1s (x3 over 5m3s) kubelet MountVolume.SetUp failed for volume "s2i-registry-crt" : secret "s2i-registry-tls2" not found │
│ Warning FailedMount 5m1s (x3 over 5m3s) kubelet MountVolume.SetUp failed for volume "host-ssh-keys" : secret "cdsw-host-ssh-keys" not found │
│ Warning FailedMount 5m1s (x3 over 5m3s) kubelet MountVolume.SetUp failed for volume "ds-web-crt" : secret "web-tls2" not found │
│ Warning FailedMount 5m1s (x3 over 5m3s) kubelet MountVolume.SetUp failed for volume "ds-cdh-client-crt" : secret "ds-cdh-client-tls2" not found │
│ Warning FailedMount 5m1s kubelet MountVolume.SetUp failed for volume "api-crt" : secret "api-tls2" not found │
│ Warning FailedMount 4m44s (x2 over 4m49s) kubelet (combined from similar events): MountVolume.SetUp failed for volume "api-crt" : secret "api-tls2" not found
│ Type Reason Age From Message │
│ ---- ------ ---- ---- ------- │
│ Normal Scheduled 9m6s default-scheduler Successfully assigned mlx/grafana-core-c88b74df5-nfvlp to ip-10-132-9-95.eu-central-1.compute.internal │
│ Normal Pulling 8m55s kubelet Pulling image "container.repository.cloudera.com/cloudera/cdsw/cdsw-ubi-minimal:2.0.34-b116" │
│ Normal Pulled 8m52s kubelet Successfully pulled image "container.repository.cloudera.com/cloudera/cdsw/cdsw-ubi-minimal:2.0.34-b116" in 3.107336927s │
│ Normal Created 8m52s kubelet Created container grafana-root-migration │
│ Normal Started 8m52s kubelet Started container grafana-root-migration │
│ Normal Pulling 8m51s kubelet Pulling image "container.repository.cloudera.com/cloudera_thirdparty/ubi-grafana:6.7.4-ubi-8.5-239.cldr.1" │
│ Normal Pulled 8m41s kubelet Successfully pulled image "container.repository.cloudera.com/cloudera_thirdparty/ubi-grafana:6.7.4-ubi-8.5-239.cldr.1" in 10.000649349s │
│ Normal Created 8m41s kubelet Created container grafana-core │
│ Normal Started 8m41s kubelet Started container grafana-core │
│ Warning Unhealthy 8m28s (x4 over 8m40s) kubelet Readiness probe failed: Get "http://100.100.74.70:3000/login": dial tcp 100.100.74.70:3000: connect: connection refused │
│ Warning Unhealthy 3m27s (x26 over 7m7s) kubelet Readiness probe failed: Get "http://100.100.74.70:3000/login": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Normal Scheduled 9m43s default-scheduler Successfully assigned mlx/tcp-ingress-controller-56597b95cf-nfpk7 to ip-10-132-9-95.eu-central-1.compute.internal │
│ Warning FailedMount 9m24s (x6 over 9m40s) kubelet MountVolume.SetUp failed for volume "web-crt" : secret "web-tls2" not found │
│ Warning FailedMount 9m24s (x6 over 9m40s) kubelet MountVolume.SetUp failed for volume "operator-crt" : secret "ds-operator-tls2" not found │
│ Warning FailedMount 9m24s (x6 over 9m40s) kubelet MountVolume.SetUp failed for volume "tcp-ingress-controller-tls" : secret "tcp-ingress-controller-tls2" not found │
│ Normal Pulling 8m58s kubelet Pulling image "container.repository.cloudera.com/cloudera/cdsw/tcp-ingress-controller:2.0.34-b116" │
│ Normal Pulled 8m54s kubelet Successfully pulled image "container.repository.cloudera.com/cloudera/cdsw/tcp-ingress-controller:2.0.34-b116" in 3.324504111s │
│ Normal Created 8m54s kubelet Created container tcp-ingress-controller │
│ Normal Started 8m54s kubelet Started container tcp-ingress-controller │
│ Warning Unhealthy 8m18s kubelet Liveness probe failed: dial tcp 100.100.74.82:8000: connect: connection refused │
│ Warning Unhealthy 4m28s (x31 over 8m38s) kubelet Readiness probe failed: dial tcp 100.100.74.82:8000: connect: connection refused
│ Warning Unhealthy 9m46s (x30 over 13m) kubelet Readiness probe failed: Get "http://100.100.74.75:3000/internal/load-balancer/health-ping": dial tcp 100.100.74.75:3000: connect: connection refused │
Normal EnsuredLoadBalancer 60m Ensured load balancer
2022-12-16T12:14:16.777Z Service: MLXControlPlane, Message: &ServiceStatus{LoadBalancer:LoadBalancerStatus{Ingress:[]LoadBalancerIngress{LoadBalancerIngress{IP:,Hostname:ac74be2de1e8c4bc6a9d551978d9ab77-4127b221295ea1bb.elb.eu-central-1.amazonaws.com,Ports:[]PortStatus{},},},},Conditions:[]Condition{},}
2022-12-16T12:14:16.965Z Service: MLXControlPlane, Message: Pod(s) not ready: [api-67488979d7-8h46b ds-reconciler-6dd6ccf448-5kgq6 grafana-core-c88b74df5-6sz96 runtime-addon-trigger-2.0.34-b116-pzhzh web-65c7f5c99c-skmfd]
2022-12-16T12:17:17.208Z Service: MLXControlPlane, Message: api-67488979d7-8h46b: Warning BackOff 62m Back-off restarting failed container
2022-12-16T12:17:17.229Z Service: MLXControlPlane, Message: ds-reconciler-6dd6ccf448-5kgq6: Warning BackOff 60m Back-off restarting failed container
2022-12-16T12:17:17.252Z Service: MLXControlPlane, Message: grafana-core-c88b74df5-6sz96: Warning Unhealthy 60m Readiness probe failed: Get "http://100.100.184.74:3000/login": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
2022-12-16T12:17:17.269Z Service: MLXControlPlane, Message: runtime-addon-trigger-2.0.34-b116-pzhzh: Normal Created 62m Created container runtime-addon-trigger
Normal Started 62m Started container runtime-addon-trigger
Normal Pulled 61m Container image "container.repository.cloudera.com/cloudera/cdsw/runtime-addon-loader:2.0.34-b116" already present on machine
Warning BackOff 60m Back-off restarting failed container
2022-12-16T12:17:17.291Z Service: MLXControlPlane, Message: web-65c7f5c99c-skmfd:
2022-12-16T12:17:17.297Z Service: MLXControlPlane, Message: Failed to install ML workspace. Reason:client rate limiter Wait returned an error: rate: Wait(n=1) would exceed