
xgboost4j-spark-gpu fails to run on multi-GPU server with GPUs in exclusive process mode and spark-rapids plugin #11884

@eordentlich

Description


Exceptions result when running in this configuration (GPUs in exclusive-process mode with the spark-rapids plugin) for both the driver and executor processes. The stack traces for both include
/workspace/src/common/cuda_rt_utils.cc: 60: cudaErrorDevicesUnavailable: CUDA-capable device(s) is/are busy or unavailable
indicating that XGBoost code is attempting to create CUDA contexts even though the corresponding Spark processes should either not be using GPUs at all (the driver) or have been assigned other device ids by Spark (the executors).
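For context, in exclusive-process compute mode each GPU admits a context from only one process at a time, so a stray context creation on an already-claimed device fails immediately with the error above. A toy Python model of that semantics (the class and method names are illustrative, not CUDA's API):

```python
class Gpu:
    """Toy model of a GPU in exclusive-process compute mode."""

    def __init__(self, ordinal: int) -> None:
        self.ordinal = ordinal
        self.owner = None  # pid of the single process allowed a context

    def create_context(self, pid: int) -> None:
        # In exclusive-process mode, a second process is rejected.
        if self.owner is not None and self.owner != pid:
            raise RuntimeError(
                "cudaErrorDevicesUnavailable: "
                "CUDA-capable device(s) is/are busy or unavailable"
            )
        self.owner = pid


gpus = [Gpu(i) for i in range(2)]
gpus[0].create_context(pid=100)   # spark-rapids executor claims device 0

# Another process (e.g. the driver's tracker thread) touches device 0:
try:
    gpus[0].create_context(pid=200)
except RuntimeError as e:
    print(e)  # device busy, mirroring the stack trace above
```

Under this model, any process other than the one Spark assigned to a device cannot even initialize on it, which is why an unconditional touch of device 0 is fatal here.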

On the driver side, one source (there may be more than one, with the first triggering the error) appears to be RabitTracker initialization, which uses the common InitNewThread function whose logic seems to setCurrentDevice on the default device. The source on the executor side is not yet clear, but it is in XGBoost code.
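The per-thread nature of the CUDA current-device setting matters here: a newly spawned thread does not inherit the spawning thread's device selection and starts on device 0, so a thread-startup helper that touches the runtime before selecting a device targets device 0. A minimal sketch of that semantics, using a thread-local to stand in for the CUDA runtime's per-thread device (hypothetical function names, not XGBoost's actual code):

```python
import threading

# Stand-in for the CUDA runtime: the "current device" is per-thread
# state, and every new thread starts on device 0.
_state = threading.local()

def cuda_get_device() -> int:
    return getattr(_state, "device", 0)  # new threads default to device 0

def cuda_set_device(ordinal: int) -> None:
    _state.device = ordinal

def worker(results: list) -> None:
    # Mimics a thread-startup path like InitNewThread: the spawned
    # thread does NOT inherit the parent's device, so touching the
    # runtime here targets device 0 — which fails in exclusive-process
    # mode if another process already owns that device.
    results.append(cuda_get_device())

cuda_set_device(1)          # e.g. the executor's assigned device
results: list = []
t = threading.Thread(target=worker, args=(results,))
t.start()
t.join()

print(cuda_get_device())    # 1: parent thread kept its selection
print(results[0])           # 0: new thread fell back to device 0
```

This is why a helper that spawns threads has to re-select the intended device (or avoid touching the runtime) inside each new thread, rather than relying on the launching thread's setting.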

When the GPUs are switched to default process mode, I believe this behavior (depending on the root cause above for executors, i.e. whether it originates in the main executor thread or a new thread) can lead to race conditions with spark-rapids over the current stream/thread device settings, resulting in spurious CUDA kernels launched by executors on the non-assigned device 0, in turn causing errors like:

Caused by: ai.rapids.cudf.CudfException: after determining tmp storage requirements for exclusive_scan: cudaErrorInvalidDevice: invalid device ordinal

first noted in NVIDIA/spark-rapids-examples#565

Specifically, executor processes launching kernels on both their assigned device and device 0 can be observed in nsys traces.
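The spurious device-0 kernels are consistent with an interleaving in which XGBoost resets a thread's current device after spark-rapids has selected the assigned one, so the next kernel launch on that thread targets device 0. A minimal single-thread sketch of that clobbering (hypothetical function names, not the real spark-rapids or XGBoost APIs):

```python
# Stand-in for one executor thread's CUDA current-device state.
current_device = 0

def rapids_set_assigned_device(ordinal: int) -> None:
    """spark-rapids pins the executor thread to its assigned GPU."""
    global current_device
    current_device = ordinal

def xgboost_thread_init() -> None:
    """Hypothetical init path that resets to the default device."""
    global current_device
    current_device = 0

def launch_kernel() -> int:
    """A kernel launch goes to whatever device the thread has current."""
    return current_device

rapids_set_assigned_device(2)  # executor assigned, say, device 2
xgboost_thread_init()          # interleaved init clobbers the setting
print(launch_kernel())         # 0: kernel lands on non-assigned device 0
```

In default process mode this launch succeeds silently on the wrong GPU, matching the nsys observation; subsequent cudf calls that assume the assigned device can then fail with errors like the invalid device ordinal above.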
