CML provisioning fails #81

@nmarian85

CDP Control Plane Region: EU-1

Configuration:

    ml_worker:
      instance_type: m6a.8xlarge
      instance_count: 1
      min_instances: 1
      max_instances: 4
      root_volume: 512
      instance_tier: ON_DEMAND
    ml_worker_gpu:
      min_instances: 0
      max_instances: 3
      instance_count: 0
      instance_tier: ON_DEMAND
      instance_type: g4dn.2xlarge
      root_volume: 512
    enable_governance: false

Module configs

- name: Create instance groups
  block:
    - name: Set standard non-gpu instance groups
      set_fact:
        instance_groups:
          - name: cpu_settings
            autoscaling:
              maxInstances: "{{ ml_worker['max_instances'] }}"
              minInstances: "{{ ml_worker['min_instances'] }}"
            instanceType: "{{ ml_worker['instance_type'] }}"
            instanceTier: "{{ ml_worker['instance_tier'] }}"
            rootVolume:
              size: "{{ ml_worker['root_volume'] }}"

    - name: Add GPU instance group if defined
      set_fact:
        instance_groups: "{{ instance_groups + gpu_instance_group }}"
      when: "'ml_worker_gpu' in cml_cluster"
      vars:
        ml_worker_gpu: "{{ cml_cluster['ml_worker_gpu'] }}"
        gpu_instance_group:
          - name: gpu_settings
            autoscaling:
              maxInstances: "{{ ml_worker_gpu['max_instances'] }}"
              minInstances: "{{ ml_worker_gpu['min_instances'] }}"
            instanceType: "{{ ml_worker_gpu['instance_type'] }}"
            instanceTier: "{{ ml_worker_gpu['instance_tier'] }}"
            rootVolume:
              size: "{{ ml_worker_gpu['root_volume'] }}"
  vars:
    ml_worker: "{{ cml_cluster['ml_worker'] }}"

- name: "Install ML workspace {{ cml_cluster_name }}"
  cloudera.cloud.ml:
    name: "{{ cml_cluster_name }}"
    env: "{{ env_name }}"
    k8s_request:
      environmentName: "{{ env_name }}"
      instanceGroups: "{{ instance_groups }}"
      tags: "{{ cml_cluster['tags'] }}"
    governance: "{{ cml_cluster['enable_governance'] }}"
    public_loadbalancer: false
    monitoring: true
    ip_addresses: []
    debug: true
    timeout: 7200
    cp_region: "{{ cp_region }}"

Errors

Normal   Scheduled    5m4s                   default-scheduler  Successfully assigned mlx/ds-operator-5b64cfc648-x7nxp to ip-10-132-9-62.eu-central-1.compute.internal
Warning  FailedMount  5m2s (x2 over 5m3s)    kubelet            MountVolume.SetUp failed for volume "ds-operator-tls" : secret "ds-operator-tls2" not found
Warning  FailedMount  5m2s (x2 over 5m3s)    kubelet            MountVolume.SetUp failed for volume "ds-vfs-crt" : secret "ds-vfs-tls2" not found
Warning  FailedMount  5m2s (x2 over 5m3s)    kubelet            MountVolume.SetUp failed for volume "s2i-registry-auth-crt" : secret "s2i-registry-auth-tls2" not found
Warning  FailedMount  5m2s (x2 over 5m3s)    kubelet            MountVolume.SetUp failed for volume "tgtgen-tls" : secret "tgtgen-tls2" not found
Warning  FailedMount  5m2s (x2 over 5m3s)    kubelet            MountVolume.SetUp failed for volume "tcp-ingress-controller-crt" : secret "tcp-ingress-controller-tls2" not found
Warning  FailedMount  5m1s (x3 over 5m3s)    kubelet            MountVolume.SetUp failed for volume "s2i-registry-crt" : secret "s2i-registry-tls2" not found
Warning  FailedMount  5m1s (x3 over 5m3s)    kubelet            MountVolume.SetUp failed for volume "host-ssh-keys" : secret "cdsw-host-ssh-keys" not found
Warning  FailedMount  5m1s (x3 over 5m3s)    kubelet            MountVolume.SetUp failed for volume "ds-web-crt" : secret "web-tls2" not found
Warning  FailedMount  5m1s (x3 over 5m3s)    kubelet            MountVolume.SetUp failed for volume "ds-cdh-client-crt" : secret "ds-cdh-client-tls2" not found
Warning  FailedMount  5m1s                   kubelet            MountVolume.SetUp failed for volume "api-crt" : secret "api-tls2" not found
Warning  FailedMount  4m44s (x2 over 4m49s)  kubelet            (combined from similar events): MountVolume.SetUp failed for volume "api-crt" : secret "api-tls2" not found

Type     Reason     Age                    From               Message
----     ------     ----                   ----               -------
Normal   Scheduled  9m6s                   default-scheduler  Successfully assigned mlx/grafana-core-c88b74df5-nfvlp to ip-10-132-9-95.eu-central-1.compute.internal
Normal   Pulling    8m55s                  kubelet            Pulling image "container.repository.cloudera.com/cloudera/cdsw/cdsw-ubi-minimal:2.0.34-b116"
Normal   Pulled     8m52s                  kubelet            Successfully pulled image "container.repository.cloudera.com/cloudera/cdsw/cdsw-ubi-minimal:2.0.34-b116" in 3.107336927s
Normal   Created    8m52s                  kubelet            Created container grafana-root-migration
Normal   Started    8m52s                  kubelet            Started container grafana-root-migration
Normal   Pulling    8m51s                  kubelet            Pulling image "container.repository.cloudera.com/cloudera_thirdparty/ubi-grafana:6.7.4-ubi-8.5-239.cldr.1"
Normal   Pulled     8m41s                  kubelet            Successfully pulled image "container.repository.cloudera.com/cloudera_thirdparty/ubi-grafana:6.7.4-ubi-8.5-239.cldr.1" in 10.000649349s
Normal   Created    8m41s                  kubelet            Created container grafana-core
Normal   Started    8m41s                  kubelet            Started container grafana-core
Warning  Unhealthy  8m28s (x4 over 8m40s)  kubelet            Readiness probe failed: Get "http://100.100.74.70:3000/login": dial tcp 100.100.74.70:3000: connect: connection refused
Warning  Unhealthy  3m27s (x26 over 7m7s)  kubelet            Readiness probe failed: Get "http://100.100.74.70:3000/login": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

Normal   Scheduled    9m43s                   default-scheduler  Successfully assigned mlx/tcp-ingress-controller-56597b95cf-nfpk7 to ip-10-132-9-95.eu-central-1.compute.internal
Warning  FailedMount  9m24s (x6 over 9m40s)   kubelet            MountVolume.SetUp failed for volume "web-crt" : secret "web-tls2" not found
Warning  FailedMount  9m24s (x6 over 9m40s)   kubelet            MountVolume.SetUp failed for volume "operator-crt" : secret "ds-operator-tls2" not found
Warning  FailedMount  9m24s (x6 over 9m40s)   kubelet            MountVolume.SetUp failed for volume "tcp-ingress-controller-tls" : secret "tcp-ingress-controller-tls2" not found
Normal   Pulling      8m58s                   kubelet            Pulling image "container.repository.cloudera.com/cloudera/cdsw/tcp-ingress-controller:2.0.34-b116"
Normal   Pulled       8m54s                   kubelet            Successfully pulled image "container.repository.cloudera.com/cloudera/cdsw/tcp-ingress-controller:2.0.34-b116" in 3.324504111s
Normal   Created      8m54s                   kubelet            Created container tcp-ingress-controller
Normal   Started      8m54s                   kubelet            Started container tcp-ingress-controller
Warning  Unhealthy    8m18s                   kubelet            Liveness probe failed: dial tcp 100.100.74.82:8000: connect: connection refused
Warning  Unhealthy    4m28s (x31 over 8m38s)  kubelet            Readiness probe failed: dial tcp 100.100.74.82:8000: connect: connection refused
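The FailedMount events in the pod listings above all follow one pattern, and several volumes point at the same secret. As a rough aid for triage, the sketch below collects the distinct secret names from pasted event text. It assumes the events are available as a plain string; the helper name `missing_secrets` and the shortened `sample` are illustrative only and are not part of kubectl or the cloudera.cloud collection.

```python
import re
from collections import defaultdict

# Matches kubelet FailedMount messages in the form shown in the events above.
MISSING_SECRET = re.compile(
    r'MountVolume\.SetUp failed for volume "(?P<volume>[^"]+)" : '
    r'secret "(?P<secret>[^"]+)" not found'
)


def missing_secrets(event_text: str) -> dict[str, set[str]]:
    """Map each missing secret name to the volumes that tried to mount it."""
    found: dict[str, set[str]] = defaultdict(set)
    for match in MISSING_SECRET.finditer(event_text):
        found[match.group("secret")].add(match.group("volume"))
    return dict(found)


# Three event lines copied from the report, with column padding shortened.
sample = '''
Warning  FailedMount  5m2s (x2 over 5m3s)  kubelet  MountVolume.SetUp failed for volume "ds-operator-tls" : secret "ds-operator-tls2" not found
Warning  FailedMount  9m24s (x6 over 9m40s)  kubelet  MountVolume.SetUp failed for volume "operator-crt" : secret "ds-operator-tls2" not found
Warning  FailedMount  5m1s  kubelet  MountVolume.SetUp failed for volume "api-crt" : secret "api-tls2" not found
'''

for secret, volumes in sorted(missing_secrets(sample).items()):
    print(secret, sorted(volumes))
```

Grouping by secret rather than by volume keeps the list short, since different pods mount the same secret under different volume names.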



Warning  Unhealthy    9m46s (x30 over 13m)  kubelet            Readiness probe failed: Get "http://100.100.74.75:3000/internal/load-balancer/health-ping": dial tcp 100.100.74.75:3000: connect: connection refused
Normal	EnsuredLoadBalancer	60m	Ensured load balancer
2022-12-16T12:14:16.777Z	Service: MLXControlPlane, Message: &ServiceStatus{LoadBalancer:LoadBalancerStatus{Ingress:[]LoadBalancerIngress{LoadBalancerIngress{IP:,Hostname:ac74be2de1e8c4bc6a9d551978d9ab77-4127b221295ea1bb.elb.eu-central-1.amazonaws.com,Ports:[]PortStatus{},},},},Conditions:[]Condition{},}
2022-12-16T12:14:16.965Z	Service: MLXControlPlane, Message: Pod(s) not ready: [api-67488979d7-8h46b ds-reconciler-6dd6ccf448-5kgq6 grafana-core-c88b74df5-6sz96 runtime-addon-trigger-2.0.34-b116-pzhzh web-65c7f5c99c-skmfd]
2022-12-16T12:17:17.208Z	Service: MLXControlPlane, Message: api-67488979d7-8h46b: Warning	BackOff	62m	Back-off restarting failed container
2022-12-16T12:17:17.229Z	Service: MLXControlPlane, Message: ds-reconciler-6dd6ccf448-5kgq6: Warning	BackOff	60m	Back-off restarting failed container
2022-12-16T12:17:17.252Z	Service: MLXControlPlane, Message: grafana-core-c88b74df5-6sz96: Warning	Unhealthy	60m	Readiness probe failed: Get "http://100.100.184.74:3000/login": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
2022-12-16T12:17:17.269Z	Service: MLXControlPlane, Message: runtime-addon-trigger-2.0.34-b116-pzhzh: Normal	Created	62m	Created container runtime-addon-trigger
Normal	Started	62m	Started container runtime-addon-trigger
Normal	Pulled	61m	Container image "container.repository.cloudera.com/cloudera/cdsw/runtime-addon-loader:2.0.34-b116" already present on machine
Warning	BackOff	60m	Back-off restarting failed container
2022-12-16T12:17:17.291Z	Service: MLXControlPlane, Message: web-65c7f5c99c-skmfd: 
2022-12-16T12:17:17.297Z	Service: MLXControlPlane, Message: Failed to install ML workspace. Reason:client rate limiter Wait returned an error: rate: Wait(n=1) would exceed 
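The control plane log above names the pods that never became ready in a single bracketed list. A minimal sketch for pulling those names out of a saved log, assuming the `Pod(s) not ready: [...]` wording shown in this report; the function name `not_ready_pods` is hypothetical.

```python
import re

# Matches the bracketed, space-separated pod list in the log message above.
NOT_READY = re.compile(r"Pod\(s\) not ready: \[(?P<pods>[^\]]*)\]")


def not_ready_pods(log_text: str) -> list[str]:
    """Return pod names from every 'Pod(s) not ready' message, in order."""
    pods: list[str] = []
    for match in NOT_READY.finditer(log_text):
        pods.extend(match.group("pods").split())
    return pods


# Log line copied from the report.
sample_log = (
    "2022-12-16T12:14:16.965Z\tService: MLXControlPlane, Message: "
    "Pod(s) not ready: [api-67488979d7-8h46b ds-reconciler-6dd6ccf448-5kgq6 "
    "grafana-core-c88b74df5-6sz96 runtime-addon-trigger-2.0.34-b116-pzhzh "
    "web-65c7f5c99c-skmfd]"
)

for pod in not_ready_pods(sample_log):
    print(pod)
```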

Labels

bug (POINT - Bugfix entry in the CHANGELOG), question (Further information is requested)
