
xgboost4j-spark-gpu fails to run on multi-GPU server with GPUs in exclusive process mode and spark-rapids plugin #11884

@eordentlich

Description


Exceptions result when running in this configuration (GPUs in exclusive-process mode with the spark-rapids plugin) for both the driver and executor processes. The stack traces for both include
/workspace/src/common/cuda_rt_utils.cc: 60: cudaErrorDevicesUnavailable: CUDA-capable device(s) is/are busy or unavailable
indicating that XGBoost code is attempting to create CUDA contexts even though the corresponding Spark processes should either not be using GPUs at all (the driver) or have been assigned other device ids by Spark (the executors).
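For context, in exclusive-process compute mode each GPU admits a context from only one process at a time, so a stray context creation on an already-claimed device fails immediately with the error above. A toy Python model of that semantics (the class and method names are illustrative, not CUDA's API):

```python
class Gpu:
    """Toy model of a GPU in exclusive-process compute mode."""

    def __init__(self, ordinal: int) -> None:
        self.ordinal = ordinal
        self.owner = None  # pid of the single process allowed a context

    def create_context(self, pid: int) -> None:
        # In exclusive-process mode, a second process is rejected.
        if self.owner is not None and self.owner != pid:
            raise RuntimeError(
                "cudaErrorDevicesUnavailable: "
                "CUDA-capable device(s) is/are busy or unavailable"
            )
        self.owner = pid


gpus = [Gpu(i) for i in range(2)]
gpus[0].create_context(pid=100)   # spark-rapids executor claims device 0

# Another process (e.g. the driver's tracker thread) touches device 0:
try:
    gpus[0].create_context(pid=200)
except RuntimeError as e:
    print(e)  # device busy, mirroring the stack trace above
```

Under this model, any process other than the one Spark assigned to a device cannot even initialize on it, which is why an unconditional touch of device 0 is fatal here.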

On the driver side, one source (there may be more than one, with the first triggering the error) appears to be RabitTracker initialization, which uses the common InitNewThread function whose logic seems to setCurrentDevice on the default device. The source on the executor side is not yet clear, but it is in XGBoost code.
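The per-thread nature of the CUDA current-device setting matters here: a newly spawned thread does not inherit the spawning thread's device selection and starts on device 0, so a thread-startup helper that touches the runtime before selecting a device targets device 0. A minimal sketch of that semantics, using a thread-local to stand in for the CUDA runtime's per-thread device (hypothetical function names, not XGBoost's actual code):

```python
import threading

# Stand-in for the CUDA runtime: the "current device" is per-thread
# state, and every new thread starts on device 0.
_state = threading.local()

def cuda_get_device() -> int:
    return getattr(_state, "device", 0)  # new threads default to device 0

def cuda_set_device(ordinal: int) -> None:
    _state.device = ordinal

def worker(results: list) -> None:
    # Mimics a thread-startup path like InitNewThread: the spawned
    # thread does NOT inherit the parent's device, so touching the
    # runtime here targets device 0 — which fails in exclusive-process
    # mode if another process already owns that device.
    results.append(cuda_get_device())

cuda_set_device(1)          # e.g. the executor's assigned device
results: list = []
t = threading.Thread(target=worker, args=(results,))
t.start()
t.join()

print(cuda_get_device())    # 1: parent thread kept its selection
print(results[0])           # 0: new thread fell back to device 0
```

This is why a helper that spawns threads has to re-select the intended device (or avoid touching the runtime) inside each new thread, rather than relying on the launching thread's setting.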

When the GPUs are switched to default process mode, I believe this behavior (depending on the root cause above for executors, i.e. whether it originates in the main executor thread or a new thread) can lead to race conditions with spark-rapids over the current stream/thread device settings, resulting in spurious CUDA kernels launched by executors on the non-assigned device 0, in turn causing errors like:

Caused by: ai.rapids.cudf.CudfException: after determining tmp storage requirements for exclusive_scan: cudaErrorInvalidDevice: invalid device ordinal

first noted in NVIDIA/spark-rapids-examples#565

Specifically, executor processes launching kernels on both their assigned device and device 0 can be observed in nsys traces.
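The spurious device-0 kernels are consistent with an interleaving in which XGBoost resets a thread's current device after spark-rapids has selected the assigned one, so the next kernel launch on that thread targets device 0. A minimal single-thread sketch of that clobbering (hypothetical function names, not the real spark-rapids or XGBoost APIs):

```python
# Stand-in for one executor thread's CUDA current-device state.
current_device = 0

def rapids_set_assigned_device(ordinal: int) -> None:
    """spark-rapids pins the executor thread to its assigned GPU."""
    global current_device
    current_device = ordinal

def xgboost_thread_init() -> None:
    """Hypothetical init path that resets to the default device."""
    global current_device
    current_device = 0

def launch_kernel() -> int:
    """A kernel launch goes to whatever device the thread has current."""
    return current_device

rapids_set_assigned_device(2)  # executor assigned, say, device 2
xgboost_thread_init()          # interleaved init clobbers the setting
print(launch_kernel())         # 0: kernel lands on non-assigned device 0
```

In default process mode this launch succeeds silently on the wrong GPU, matching the nsys observation; subsequent cudf calls that assume the assigned device can then fail with errors like the invalid device ordinal above.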
