A SparkApplication launched by Airflow fails during submission (before driver/executor pods start). The dominant symptom in the submission logs is a Vert.x event loop blocked thread error (io.vertx.core.VertxException: Thread blocked) followed by the spark-submit process exiting with a Kubernetes client exception / timeout.
```
26/02/24 17:24:31 WARN BlockedThreadChecker: Thread Thread[vert.x-eventloop-thread-1,5,main] has been blocked for 2888 ms, time limit is 2000 ms
26/02/24 17:24:32 WARN BlockedThreadChecker: Thread Thread[vert.x-eventloop-thread-1,5,main] has been blocked for 3888 ms, time limit is 2000 ms
26/02/24 17:24:33 WARN BlockedThreadChecker: Thread Thread[vert.x-eventloop-thread-1,5,main] has been blocked for 4888 ms, time limit is 2000 ms
26/02/24 17:24:34 WARN BlockedThreadChecker: Thread Thread[vert.x-eventloop-thread-1,5,main] has been blocked for 5888 ms, time limit is 2000 ms
io.vertx.core.VertxException: Thread blocked
	at ... java.lang.invoke.* (lambda/metafactory compilation)
	at app//io.vertx.core.net.impl.NetClientImpl.connectInternal2(NetClientImpl.java:309)
	...
```
We are currently on Spark 3.5.1. This is the SparkApplication manifest we submit:
```yaml
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: "azure-load-data-{{ ts_nodash }}"
  namespace: "{{ dag_run.conf.get('NAMESPACE', params.NAMESPACE) }}"
spec:
  type: Python
  pythonVersion: "3"
  mode: cluster
  image: "xxxx"
  imagePullPolicy: Always
  mainApplicationFile: local:///opt/app/main.py
  sparkVersion: "3.5.1"
  timeToLiveSeconds: 3000
  restartPolicy:
    type: Never
  nodeSelector: {}
  sparkConf:
    spark.kubernetes.submission.waitAppCompletion: "true"
    spark.kubernetes.driver.deleteOnTermination: "false"
    spark.kubernetes.executor.deleteOnTermination: "false"
    spark.kubernetes.driver.pod.seccompProfile.type: RuntimeDefault
    spark.kubernetes.executor.pod.seccompProfile.type: RuntimeDefault
    spark.sql.legacy.parquet.datetimeRebaseModeInWrite: LEGACY
    spark.sql.legacy.parquet.int96RebaseModeInWrite: LEGACY
    spark.sql.session.timeZone: UTC
    spark.pyspark.python: python3
    spark.pyspark.driver.python: python3
  driver:
    serviceAccount: spark
    coreRequest: "100m"
    coreLimit: "1"
    memory: "1024m"
    env:
      # --- Required: Airflow context ---
      - name: AIRFLOW_DAG_ID
        value: "{{ dag.dag_id }}"
      - name: AIRFLOW_RUN_ID
        value: "{{ run_id }}"
    securityContext:
      runAsNonRoot: true
      runAsUser: 1000
      runAsGroup: 1000
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: false
      capabilities:
        drop: ["ALL"]
    labels: { version: "3.5.1" }
  executor:
    instances: 1
    coreRequest: "100m"
    coreLimit: "1"
    memory: "2048m"
    securityContext:
      runAsNonRoot: true
      runAsUser: 1000
      runAsGroup: 1000
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: false
      capabilities:
        drop: ["ALL"]
    labels: { version: "3.5.1" }
    env:
      - name: PYTHONUNBUFFERED
        value: "1"
```
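One mitigation we are considering, assuming the event loop is blocked by slow Kubernetes API calls during submission: raising the Kubernetes client timeouts that Spark exposes for the submission and driver clients (`spark.kubernetes.submission.connectionTimeout` / `requestTimeout` and their `driver` counterparts, documented in the Spark on Kubernetes configuration; the values below are illustrative, not tuned):

```yaml
  sparkConf:
    # Fabric8 client timeouts for the spark-submit-side Kubernetes client
    # (defaults are 10000 ms); raising them may mask API-server latency
    spark.kubernetes.submission.connectionTimeout: "30000"
    spark.kubernetes.submission.requestTimeout: "30000"
    # Same knobs for the Kubernetes client running inside the driver
    spark.kubernetes.driver.connectionTimeout: "30000"
    spark.kubernetes.driver.requestTimeout: "30000"
```

This would only paper over latency rather than explain the blocked Vert.x thread, but it may help distinguish a timeout problem from a CPU starvation problem.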
Has anyone experienced similar Vert.x "Thread blocked" errors during Spark 3.5.x submission on Kubernetes? In particular, did this turn out to be related to:
- CPU throttling of the submission process?
- Kubernetes API server latency?
- A Java 17 + Vert.x interaction?
- Fabric8 client timeout behavior?

Any guidance or similar experiences would be greatly appreciated.
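To check the CPU-throttling hypothesis ourselves, we have been inspecting the cgroup throttling counters from inside the container that runs `spark-submit`. A minimal sketch (assumes cgroup v2; the cgroup v1 path `/sys/fs/cgroup/cpu/cpu.stat` differs, and the counters only appear where the cpu controller is enabled):

```shell
# Print CPU throttling counters for the current container's cgroup.
# nr_throttled > 0 / growing throttled_usec means the CPU limit is biting,
# which could starve the Vert.x event loop during submission.
STAT=/sys/fs/cgroup/cpu.stat
if [ -r "$STAT" ]; then
  grep -E 'nr_throttled|throttled_usec' "$STAT" \
    || echo "no throttling counters in $STAT (cpu controller not enabled here?)"
else
  echo "$STAT not readable; host may use cgroup v1 (/sys/fs/cgroup/cpu/cpu.stat)"
fi
```

If the counters climb while the submission log shows `BlockedThreadChecker` warnings, that would point at the `100m` CPU request/limit rather than at the API server.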