Description
Exceptions result when running as above, for both driver and executor processes. Stack traces for both include
/workspace/src/common/cuda_rt_utils.cc: 60: cudaErrorDevicesUnavailable: CUDA-capable device(s) is/are busy or unavailable
indicating that XGBoost code is attempting to create CUDA contexts even though the corresponding Spark processes should either not be using GPUs at all (the driver) or have been assigned other device ids by Spark (the executors).
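For context, a minimal sketch (not XGBoost code) of why this particular error surfaces: with a device in EXCLUSIVE_PROCESS compute mode and its context already held by another process, any CUDA runtime call that lazily creates a context on it fails with cudaErrorDevicesUnavailable.

```cpp
// Minimal repro sketch, assuming device 0 is in EXCLUSIVE_PROCESS mode
// and another process already holds a context on it.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
  // cudaSetDevice alone is cheap; the context is created lazily by the
  // first runtime call that needs one, e.g. cudaFree(nullptr).
  cudaSetDevice(0);
  cudaError_t err = cudaFree(nullptr);  // forces context creation
  if (err != cudaSuccess) {
    // Expected here: "CUDA-capable device(s) is/are busy or unavailable"
    std::printf("%s\n", cudaGetErrorString(err));
  }
  return 0;
}
```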
On the driver side, one source (there may be more than one, with the first triggering the error) appears to be RabitTracker initialization, which uses the common InitNewThread function; its logic seems to call setCurrentDevice on the default device. A sketch of this pattern follows. The executor-side source is not clear at this point, but it is XGBoost code.
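A hypothetical simplification of the suspected pattern (InitNewThread's actual logic may differ; InitNewThreadLike and SetCurrentDevice here are illustrative names, not the real XGBoost functions): thread-entry setup that pins the *default* device creates a context on a GPU the process was never assigned.

```cpp
#include <thread>
#include <cuda_runtime.h>

// Illustrative stand-in for a setCurrentDevice-style helper.
void SetCurrentDevice(int ordinal) {
  cudaSetDevice(ordinal);  // context on `ordinal` is created lazily
}

// Illustrative stand-in for an InitNewThread-style helper.
std::thread InitNewThreadLike() {
  return std::thread([] {
    SetCurrentDevice(0);  // default device, not the Spark-assigned one
    cudaFree(nullptr);    // first real runtime call materializes the context
    // ... tracker work that may need no GPU at all ...
  });
}
```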
When switching GPUs to the default compute mode, I believe this behavior can lead to race conditions with spark-rapids over the current thread/stream device setting (depending on the executor-side root cause above, i.e. whether it comes from the main executor thread or a new thread), resulting in spurious CUDA kernels launched by executors on the non-assigned device 0, in turn causing errors like:
Caused by: ai.rapids.cudf.CudfException: after determining tmp storage requirements for exclusive_scan: cudaErrorInvalidDevice: invalid device ordinal
first noted in NVIDIA/spark-rapids-examples#565
Specifically, executor processes can be observed in nsys traces launching kernels on both their assigned device and device 0.
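A hedged sketch of the suspected mechanism, assuming the XGBoost device reset happens on (or affects) the same thread that spark-rapids launches kernels from; ExecutorTaskThread and Work are illustrative names, not spark-rapids or XGBoost code. The CUDA runtime tracks the current device *per thread*, so a reset to device 0 mid-task silently redirects subsequent launches.

```cpp
#include <cuda_runtime.h>

__global__ void Work() {}

void ExecutorTaskThread(int assigned_ordinal) {
  cudaSetDevice(assigned_ordinal);  // spark-rapids: use the assigned GPU
  Work<<<1, 1>>>();                 // lands on assigned_ordinal, as intended

  cudaSetDevice(0);                 // XGBoost init path resets to default

  Work<<<1, 1>>>();                 // now silently lands on device 0; in
  cudaDeviceSynchronize();          // DEFAULT mode this succeeds, and nsys
                                    // shows the process on two GPUs
}
```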