SparkApplication submission fails with Vert.x Thread blocked #2852

@hellekin08

Description

A SparkApplication launched by Airflow fails during submission, before the driver and executor pods start. The dominant symptom in the submission logs is a blocked Vert.x event-loop thread (io.vertx.core.VertxException: Thread blocked), after which the spark-submit process exits with a Kubernetes client exception or timeout.

26/02/24 17:24:31 WARN BlockedThreadChecker: Thread Thread[vert.x-eventloop-thread-1,5,main] has been blocked for 2888 ms, time limit is 2000 ms
26/02/24 17:24:32 WARN BlockedThreadChecker: Thread Thread[vert.x-eventloop-thread-1,5,main] has been blocked for 3888 ms, time limit is 2000 ms
26/02/24 17:24:33 WARN BlockedThreadChecker: Thread Thread[vert.x-eventloop-thread-1,5,main] has been blocked for 4888 ms, time limit is 2000 ms
26/02/24 17:24:34 WARN BlockedThreadChecker: Thread Thread[vert.x-eventloop-thread-1,5,main] has been blocked for 5888 ms, time limit is 2000 ms
io.vertx.core.VertxException: Thread blocked
  at ... java.lang.invoke.* (lambda/metafactory compilation)
  at app//io.vertx.core.net.impl.NetClientImpl.connectInternal2(NetClientImpl.java:309)
  ...
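
To quantify how long the event loop stalled across a full submission log, the warnings above can be summarized with a short script. This is a minimal sketch, not part of the original report: the function name `summarize_blocked` and the `SAMPLE` constant are hypothetical, and the regular expression assumes the exact BlockedThreadChecker wording quoted above.

```python
import re

# Assumption: warnings follow the BlockedThreadChecker format quoted in this issue.
BLOCKED_RE = re.compile(
    r"Thread Thread\[(?P<thread>[^,\]]+),.*?\] has been blocked for "
    r"(?P<blocked>\d+) ms, time limit is (?P<limit>\d+) ms"
)


def summarize_blocked(log_text):
    """Return {thread_name: (max_blocked_ms, limit_ms)} for each blocked thread."""
    summary = {}
    for line in log_text.splitlines():
        match = BLOCKED_RE.search(line)
        if match is None:
            continue  # not a blocked-thread warning
        thread = match.group("thread")
        blocked = int(match.group("blocked"))
        limit = int(match.group("limit"))
        previous = summary.get(thread)
        # Keep only the longest stall seen for each thread.
        if previous is None or blocked > previous[0]:
            summary[thread] = (blocked, limit)
    return summary


# Two of the warnings from this issue, used as sample input.
SAMPLE = (
    "26/02/24 17:24:31 WARN BlockedThreadChecker: Thread "
    "Thread[vert.x-eventloop-thread-1,5,main] has been blocked for 2888 ms, "
    "time limit is 2000 ms\n"
    "26/02/24 17:24:34 WARN BlockedThreadChecker: Thread "
    "Thread[vert.x-eventloop-thread-1,5,main] has been blocked for 5888 ms, "
    "time limit is 2000 ms\n"
)

if __name__ == "__main__":
    for name, (blocked_ms, limit_ms) in summarize_blocked(SAMPLE).items():
        print(f"{name}: blocked up to {blocked_ms} ms (limit {limit_ms} ms)")
```

Running it over the complete log shows whether the stall is one continuous block that keeps growing (as in the excerpt above) or several separate stalls.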

We are currently on Spark version 3.5.1. This is the SparkApplication manifest we submit:

apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: "azure-load-data-{{ ts_nodash }}"
  namespace: "{{ dag_run.conf.get('NAMESPACE', params.NAMESPACE) }}"
spec:
  type: Python
  pythonVersion: "3"
  mode: cluster
  image: "xxxx"
  imagePullPolicy: Always
  mainApplicationFile: local:///opt/app/main.py
  sparkVersion: "3.5.1"
  timeToLiveSeconds: 3000
  restartPolicy:
    type: Never

  nodeSelector: {}

  sparkConf:
    spark.kubernetes.submission.waitAppCompletion: "true"
    spark.kubernetes.driver.deleteOnTermination: "false"
    spark.kubernetes.executor.deleteOnTermination: "false"
    spark.kubernetes.driver.pod.seccompProfile.type: RuntimeDefault
    spark.kubernetes.executor.pod.seccompProfile.type: RuntimeDefault
    spark.sql.legacy.parquet.datetimeRebaseModeInWrite: LEGACY
    spark.sql.legacy.parquet.int96RebaseModeInWrite: LEGACY
    spark.sql.session.timeZone: UTC
    spark.pyspark.python: python3
    spark.pyspark.driver.python: python3

  driver:
    serviceAccount: spark
    coreRequest: "100m"
    coreLimit: "1"
    memory: "1024m"
    env:
      # --- Required: Airflow context ---
      - name: AIRFLOW_DAG_ID
        value: "{{ dag.dag_id }}"
      - name: AIRFLOW_RUN_ID
        value: "{{ run_id }}"

    securityContext:
      runAsNonRoot: true
      runAsUser: 1000
      runAsGroup: 1000
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: false
      capabilities:
        drop: ["ALL"]
    labels: { version: "3.5.1" }

  executor:
    instances: 1
    coreRequest: "100m"
    coreLimit: "1"
    memory: "2048m"
    securityContext:
      runAsNonRoot: true
      runAsUser: 1000
      runAsGroup: 1000
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: false
      capabilities:
        drop: ["ALL"]
    labels: { version: "3.5.1" }
    env:
      - name: PYTHONUNBUFFERED
        value: "1"

Has anyone experienced similar Vert.x Thread blocked errors during Spark 3.5.x submission on Kubernetes?

In particular, did this turn out to be related to any of the following?

  • CPU throttling of the submission process
  • Kubernetes API server latency
  • Java 17 + Vert.x interaction
  • Fabric8 client timeout behavior

Any guidance or similar experiences would be greatly appreciated.
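
One of the hypotheses listed above, CPU throttling of the submission process, can be checked by comparing two snapshots of the container's cgroup `cpu.stat` file taken before and after a submission attempt. The sketch below is only an illustration under stated assumptions: the helper names `parse_cpu_stat` and `throttled_fraction` are hypothetical, the location of `cpu.stat` depends on the cgroup version and node setup, and the snapshots must come from whichever container actually runs spark-submit, which this report does not establish.

```python
def parse_cpu_stat(text):
    """Parse the 'key value' lines of a cgroup cpu.stat file into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def throttled_fraction(before, after):
    """Fraction of CFS enforcement periods that were throttled between two snapshots.

    Uses the nr_periods and nr_throttled counters, which cpu.stat exposes
    under both cgroup v1 and cgroup v2.
    """
    periods = after.get("nr_periods", 0) - before.get("nr_periods", 0)
    throttled = after.get("nr_throttled", 0) - before.get("nr_throttled", 0)
    if periods <= 0:
        return 0.0  # no enforcement periods elapsed, nothing to report
    return throttled / periods


if __name__ == "__main__":
    # Hypothetical snapshots; real values come from reading cpu.stat in the container.
    before = parse_cpu_stat("nr_periods 100\nnr_throttled 10\n")
    after = parse_cpu_stat("nr_periods 200\nnr_throttled 60\n")
    print(f"throttled in {throttled_fraction(before, after):.0%} of periods")
```

A high fraction during the submission window would support the throttling hypothesis; a fraction near zero would point toward the other candidates, such as API server latency or client timeouts.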
