/usr/bin/ld: cannot find -lcurand: No such file or directory #68

@WANGDUNDUN1

<x_grpo_trainer.XGRPOTrainer object at 0x7f49e9163e50>
2025-03-26 23:00:15 - INFO - __main__ - *** Train ***
[INFO|deepspeed.py:386] 2025-03-26 23:00:15,856 >> Detected ZeRO Offload and non-DeepSpeed optimizers: This combination should work as long as the custom optimizer has both CPU and GPU implementation (except LAMB)
Using /root/.cache/torch_extensions/py311_cu124 as PyTorch extensions root...
Emitting ninja build file /root/.cache/torch_extensions/py311_cu124/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
Using /root/.cache/torch_extensions/py311_cu124 as PyTorch extensions root...
[1/1] c++ cpu_adam.o cpu_adam_impl.o -shared -lcurand -L/opt/conda/lib -L/opt/conda/lib/python3.11/site-packages/torch/lib -lc10 -ltorch_cpu -ltorch -ltorch_python -o cpu_adam.so
FAILED: cpu_adam.so
c++ cpu_adam.o cpu_adam_impl.o -shared -lcurand -L/opt/conda/lib -L/opt/conda/lib/python3.11/site-packages/torch/lib -lc10 -ltorch_cpu -ltorch -ltorch_python -o cpu_adam.so
/usr/bin/ld: cannot find -lcurand: No such file or directory
collect2: error: ld returned 1 exit status
ninja: build stopped: subcommand failed.
Loading extension module cpu_adam...
[rank1]: Traceback (most recent call last):
[rank1]: File "/opt/conda/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 2104, in _run_ninja_build
[rank1]: subprocess.run(
[rank1]: File "/opt/conda/lib/python3.11/subprocess.py", line 571, in run
[rank1]: raise CalledProcessError(retcode, process.args,
[rank1]: subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

[rank1]: The above exception was the direct cause of the following exception:

[rank1]: Traceback (most recent call last):
[rank1]: File "/root/hy-nas/X-R1/src/x_r1/grpo.py", line 275, in <module>
[rank1]: main(script_args, training_args, model_args )
[rank1]: File "/root/hy-nas/X-R1/src/x_r1/grpo.py", line 239, in main
[rank1]: train_result = trainer.train(resume_from_checkpoint=checkpoint)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/opt/conda/lib/python3.11/site-packages/transformers/trainer.py", line 2245, in train
[rank1]: return inner_training_loop(
[rank1]: ^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/opt/conda/lib/python3.11/site-packages/transformers/trainer.py", line 2369, in _inner_training_loop
[rank1]: model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/opt/conda/lib/python3.11/site-packages/accelerate/accelerator.py", line 1392, in prepare
[rank1]: result = self._prepare_deepspeed(*args)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/opt/conda/lib/python3.11/site-packages/accelerate/accelerator.py", line 1942, in _prepare_deepspeed
[rank1]: optimizer = map_pytorch_optim_to_deepspeed(optimizer)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/opt/conda/lib/python3.11/site-packages/accelerate/utils/deepspeed.py", line 97, in map_pytorch_optim_to_deepspeed
[rank1]: return optimizer_class(optimizer.param_groups, **defaults)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/opt/conda/lib/python3.11/site-packages/deepspeed/ops/adam/cpu_adam.py", line 94, in __init__
[rank1]: self.ds_opt_adam = CPUAdamBuilder().load()
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/opt/conda/lib/python3.11/site-packages/deepspeed/ops/op_builder/builder.py", line 540, in load
[rank1]: return self.jit_load(verbose)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/opt/conda/lib/python3.11/site-packages/deepspeed/ops/op_builder/builder.py", line 587, in jit_load
[rank1]: op_module = load(name=self.name,
[rank1]: ^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/opt/conda/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1314, in load
[rank1]: return _jit_compile(
[rank1]: ^^^^^^^^^^^^^
[rank1]: File "/opt/conda/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1721, in _jit_compile
[rank1]: _write_ninja_file_and_build_library(
[rank1]: File "/opt/conda/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1833, in _write_ninja_file_and_build_library
[rank1]: _run_ninja_build(
[rank1]: File "/opt/conda/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 2120, in _run_ninja_build
[rank1]: raise RuntimeError(message) from e
[rank1]: RuntimeError: Error building extension 'cpu_adam'
[rank2]: Traceback (most recent call last):
[rank2]: File "/root/hy-nas/X-R1/src/x_r1/grpo.py", line 275, in <module>
[rank2]: main(script_args, training_args, model_args )
[rank2]: File "/root/hy-nas/X-R1/src/x_r1/grpo.py", line 239, in main
[rank2]: train_result = trainer.train(resume_from_checkpoint=checkpoint)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/opt/conda/lib/python3.11/site-packages/transformers/trainer.py", line 2245, in train
[rank2]: return inner_training_loop(
[rank2]: ^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/opt/conda/lib/python3.11/site-packages/transformers/trainer.py", line 2369, in _inner_training_loop
[rank2]: model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/opt/conda/lib/python3.11/site-packages/accelerate/accelerator.py", line 1392, in prepare
[rank2]: result = self._prepare_deepspeed(*args)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/opt/conda/lib/python3.11/site-packages/accelerate/accelerator.py", line 1942, in _prepare_deepspeed
[rank2]: optimizer = map_pytorch_optim_to_deepspeed(optimizer)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/opt/conda/lib/python3.11/site-packages/accelerate/utils/deepspeed.py", line 97, in map_pytorch_optim_to_deepspeed
[rank2]: return optimizer_class(optimizer.param_groups, **defaults)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/opt/conda/lib/python3.11/site-packages/deepspeed/ops/adam/cpu_adam.py", line 94, in __init__
[rank2]: self.ds_opt_adam = CPUAdamBuilder().load()
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/opt/conda/lib/python3.11/site-packages/deepspeed/ops/op_builder/builder.py", line 540, in load
[rank2]: return self.jit_load(verbose)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/opt/conda/lib/python3.11/site-packages/deepspeed/ops/op_builder/builder.py", line 587, in jit_load
[rank2]: op_module = load(name=self.name,
[rank2]: ^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/opt/conda/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1314, in load
[rank2]: return _jit_compile(
[rank2]: ^^^^^^^^^^^^^
[rank2]: File "/opt/conda/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1746, in _jit_compile
[rank2]: return _import_module_from_library(name, build_directory, is_python_module)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/opt/conda/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 2140, in _import_module_from_library
[rank2]: module = importlib.util.module_from_spec(spec)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "<frozen importlib._bootstrap>", line 573, in module_from_spec
[rank2]: File "<frozen importlib._bootstrap_external>", line 1233, in create_module
[rank2]: File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
[rank2]: ImportError: /root/.cache/torch_extensions/py311_cu124/cpu_adam/cpu_adam.so: cannot open shared object file: No such file or directory
Exception ignored in: <function DeepSpeedCPUAdam.__del__ at 0x7f82abc1d8a0>
Traceback (most recent call last):
File "/opt/conda/lib/python3.11/site-packages/deepspeed/ops/adam/cpu_adam.py", line 102, in __del__
self.ds_opt_adam.destroy_adam(self.opt_id)
^^^^^^^^^^^^^^^^
AttributeError: 'DeepSpeedCPUAdam' object has no attribute 'ds_opt_adam'
Using /root/.cache/torch_extensions/py311_cu124 as PyTorch extensions root...
Emitting ninja build file /root/.cache/torch_extensions/py311_cu124/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/1] c++ cpu_adam.o cpu_adam_impl.o -shared -lcurand -L/opt/conda/lib -L/opt/conda/lib/python3.11/site-packages/torch/lib -lc10 -ltorch_cpu -ltorch -ltorch_python -o cpu_adam.so
FAILED: cpu_adam.so
c++ cpu_adam.o cpu_adam_impl.o -shared -lcurand -L/opt/conda/lib -L/opt/conda/lib/python3.11/site-packages/torch/lib -lc10 -ltorch_cpu -ltorch -ltorch_python -o cpu_adam.so
/usr/bin/ld: cannot find -lcurand: No such file or directory
collect2: error: ld returned 1 exit status
ninja: build stopped: subcommand failed.
[rank0]: Traceback (most recent call last):
[rank0]: File "/opt/conda/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 2104, in _run_ninja_build
[rank0]: subprocess.run(
[rank0]: File "/opt/conda/lib/python3.11/subprocess.py", line 571, in run
[rank0]: raise CalledProcessError(retcode, process.args,
[rank0]: subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

[rank0]: The above exception was the direct cause of the following exception:

[rank0]: Traceback (most recent call last):
[rank0]: File "/root/hy-nas/X-R1/src/x_r1/grpo.py", line 275, in <module>
[rank0]: main(script_args, training_args, model_args )
[rank0]: File "/root/hy-nas/X-R1/src/x_r1/grpo.py", line 239, in main
[rank0]: train_result = trainer.train(resume_from_checkpoint=checkpoint)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/lib/python3.11/site-packages/transformers/trainer.py", line 2245, in train
[rank0]: return inner_training_loop(
[rank0]: ^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/lib/python3.11/site-packages/transformers/trainer.py", line 2369, in _inner_training_loop
[rank0]: model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/lib/python3.11/site-packages/accelerate/accelerator.py", line 1392, in prepare
[rank0]: result = self._prepare_deepspeed(*args)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/lib/python3.11/site-packages/accelerate/accelerator.py", line 1942, in _prepare_deepspeed
[rank0]: optimizer = map_pytorch_optim_to_deepspeed(optimizer)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/lib/python3.11/site-packages/accelerate/utils/deepspeed.py", line 97, in map_pytorch_optim_to_deepspeed
[rank0]: return optimizer_class(optimizer.param_groups, **defaults)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/lib/python3.11/site-packages/deepspeed/ops/adam/cpu_adam.py", line 94, in __init__
[rank0]: self.ds_opt_adam = CPUAdamBuilder().load()
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/lib/python3.11/site-packages/deepspeed/ops/op_builder/builder.py", line 540, in load
[rank0]: return self.jit_load(verbose)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/lib/python3.11/site-packages/deepspeed/ops/op_builder/builder.py", line 587, in jit_load
[rank0]: op_module = load(name=self.name,
[rank0]: ^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1314, in load
[rank0]: return _jit_compile(
[rank0]: ^^^^^^^^^^^^^
[rank0]: File "/opt/conda/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1721, in _jit_compile
[rank0]: _write_ninja_file_and_build_library(
[rank0]: File "/opt/conda/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1833, in _write_ninja_file_and_build_library
[rank0]: _run_ninja_build(
[rank0]: File "/opt/conda/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 2120, in _run_ninja_build
[rank0]: raise RuntimeError(message) from e
[rank0]: RuntimeError: Error building extension 'cpu_adam'
Exception ignored in: <function DeepSpeedCPUAdam.__del__ at 0x7f14a58bd8a0>
Traceback (most recent call last):
File "/opt/conda/lib/python3.11/site-packages/deepspeed/ops/adam/cpu_adam.py", line 102, in __del__
AttributeError: 'DeepSpeedCPUAdam' object has no attribute 'ds_opt_adam'
Exception ignored in: <function DeepSpeedCPUAdam.__del__ at 0x7f49eb72d8a0>
Traceback (most recent call last):
File "/opt/conda/lib/python3.11/site-packages/deepspeed/ops/adam/cpu_adam.py", line 102, in __del__
AttributeError: 'DeepSpeedCPUAdam' object has no attribute 'ds_opt_adam'
[rank0]:[W326 23:00:20.472864575 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
W0326 23:00:21.625000 40238 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 40383 closing signal SIGTERM
W0326 23:00:21.627000 40238 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 40385 closing signal SIGTERM
E0326 23:00:21.843000 40238 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 1 (pid: 40384) of binary: /opt/conda/bin/python3.11
Traceback (most recent call last):
File "/opt/conda/bin/accelerate", line 8, in <module>
sys.exit(main())
^^^^^^
File "/opt/conda/lib/python3.11/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
args.func(args)
File "/opt/conda/lib/python3.11/site-packages/accelerate/commands/launch.py", line 1177, in launch_command
deepspeed_launcher(args)
File "/opt/conda/lib/python3.11/site-packages/accelerate/commands/launch.py", line 863, in deepspeed_launcher
distrib_run.run(args)
File "/opt/conda/lib/python3.11/site-packages/torch/distributed/run.py", line 910, in run
elastic_launch(
File "/opt/conda/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

src/x_r1/grpo.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2025-03-26_23:00:21
host : 3ac636845a0a
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 40384)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
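
The link command in the log passes `-lcurand` with only two `-L` search paths (`/opt/conda/lib` and the `torch/lib` directory). A minimal diagnostic sketch for checking whether a linkable `libcurand` is present in those or other directories is below. The helper names `find_curand` and `default_search_dirs`, the extra `/usr/local/cuda/lib64` path, and the use of the `CUDA_HOME` environment variable are assumptions for illustration and do not come from the log.

```python
import os
from pathlib import Path


def find_curand(search_dirs):
    """Return (linkable, versioned) lists of libcurand files found in search_dirs.

    `ld -lcurand` resolves only `libcurand.so` or `libcurand.a`; a versioned
    file such as `libcurand.so.10` on its own is not picked up at link time.
    """
    linkable, versioned = [], []
    for d in search_dirs:
        d = Path(d)
        if not d.is_dir():
            continue  # skip directories that do not exist on this machine
        for f in sorted(d.glob("libcurand*")):
            if f.name in ("libcurand.so", "libcurand.a"):
                linkable.append(f)
            elif f.name.startswith("libcurand.so."):
                versioned.append(f)
    return linkable, versioned


def default_search_dirs():
    # Hypothetical defaults: the two -L paths from the log, plus locations
    # where a CUDA toolkit is often installed (assumed, not from the log).
    dirs = [
        "/opt/conda/lib",
        "/opt/conda/lib/python3.11/site-packages/torch/lib",
        "/usr/local/cuda/lib64",
    ]
    cuda_home = os.environ.get("CUDA_HOME")
    if cuda_home:
        dirs.append(os.path.join(cuda_home, "lib64"))
    return dirs


if __name__ == "__main__":
    linkable, versioned = find_curand(default_search_dirs())
    print("linkable:", [str(p) for p in linkable])
    print("versioned:", [str(p) for p in versioned])
```

If nothing linkable turns up in the two `-L` directories, that would be consistent with the `cannot find -lcurand` error above. Where the library actually lives, and how to expose it to the build, depends on the environment and cannot be determined from the log alone.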
