🐛 Bug
PyTorch/XLA does not raise an OOM error (e.g. resource exhausted) when running a computation that must allocate a tensor too large to fit in memory. This might be an OpenXLA bug.
In the program below, the process hangs after trying to print `b`. To get the stack trace shown, I had to send a SIGTERM to kill the execution.
$ python -X faulthandler
Python 3.10.12 (main, Feb 4 2025, 14:57:36) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> import torch_xla
WARNING:root:libtpu.so and TPU device found. Setting PJRT_DEVICE=TPU.
>>> a = torch.rand(1024, 1024, 1024, 1024, 1024, device=torch_xla.device())
>>> b = a.sum()
>>> b
https://symbolize.stripped_domain/r/?trace=7f9fd18f4cfc,7fa11b84251f,7f9fd186d964,7f9fd18f73f3,7f9fd155c8f0,7f9fd14c74d7,7f9fce456d76,7f9fc8d7edbc,7f9fc8d9b90c,7f9fc8d9c439,7f9fc550cb6b,7f9fc550a7b1,7f9fc55085ad,7f9fc8b00e7d,7f9fc54b4422,7f9fc545674c,7f9fc54483ae,7fa04327da4d,7fa04327dd67,7fa0380bc9b4,7fa0380c3624,7fa0380bd7e2,7fa037aa8ad2,7fa037aa9f9f,7fa037a9c3e7,7fa03777ea4d,7fa037779fd0,7fa03777d2b9,7fa0375d885d,7fa037b19cb2,7fa037bdc8d2,7fa100458e09,7fa1002eb526&map=
*** SIGTERM received by PID 42326 (TID 42326) on cpu 81 from PID 43410; stack trace: ***
PC: @ 0x7f9fd18f4cfc (unknown) (unknown)
@ 0x7f9fd186afe5 1904 (unknown)
@ 0x7fa11b842520 1918670160 (unknown)
@ 0x7f9fd186d965 64 (unknown)
@ 0x7f9fd18f73f4 128 (unknown)
@ 0x7f9fd155c8f1 352 (unknown)
@ 0x7f9fd14c74d8 144 (unknown)
@ 0x7f9fce456d77 448 (unknown)
@ 0x7f9fc8d7edbd 2528 (unknown)
@ 0x7f9fc8d9b90d 5904 (unknown)
@ 0x7f9fc8d9c43a 816 (unknown)
@ 0x7f9fc550cb6c 1120 (unknown)
@ 0x7f9fc550a7b2 1984 (unknown)
@ 0x7f9fc55085ae 2544 (unknown)
@ 0x7f9fc8b00e7e 4528 (unknown)
@ 0x7f9fc54b4423 4400 (unknown)
@ 0x7f9fc545674d 2176 (unknown)
@ 0x7f9fc54483af 4896 (unknown)
@ 0x7fa04327da4e 832 xla::InitializeArgsAndCompile()
@ 0x7fa04327dd68 176 xla::PjRtCApiClient::CompileAndLoad()
@ 0x7fa0380bc9b5 2384 torch_xla::runtime::PjRtComputationClient::Compile()::{lambda()#5}::operator()()
@ 0x7fa0380c3625 144 torch_xla::runtime::util::RaisePythonValueErrorOnFailure<>()
@ 0x7fa0380bd7e3 3792 torch_xla::runtime::PjRtComputationClient::Compile()
@ 0x7fa037aa8ad3 3872 torch_xla::XLAGraphExecutor::Compile()
@ 0x7fa037aa9fa0 1120 torch_xla::XLAGraphExecutor::SyncTensorsGraphInternal()
@ 0x7fa037a9c3e8 528 torch_xla::XLAGraphExecutor::SyncTensorsGraph()
@ 0x7fa03777ea4e 144 torch_xla::XLATensor::ApplyPendingGraph()
@ 0x7fa037779fd1 528 torch_xla::XLATensor::GetXlaData()
@ 0x7fa03777d2ba 208 torch_xla::XLATensor::ToTensor()
@ 0x7fa0375d885e 560 torch_xla::XLANativeFunctions::_to_copy()
@ 0x7fa037b19cb3 64 at::(anonymous namespace)::(anonymous namespace)::wrapper_XLA___to_copy()
@ 0x7fa037bdc8d3 192 c10::impl::wrap_kernel_functor_unboxed_<>::call()
@ 0x7fa100458e0a 224 c10::callUnboxedKernelFunction<>()
@ 0x7fa1002eb527 320 c10::Dispatcher::redispatch<>()
@ 0x7fa1001b1750 224 at::_ops::_to_copy::redispatch()
@ 0x7fa10117d0ec 128 at::(anonymous namespace)::_to_copy()
@ 0x7fa1011a0f64 192 c10::impl::wrap_kernel_functor_unboxed_<>::call()
@ 0x7fa100458e0a 224 c10::callUnboxedKernelFunction<>()
@ 0x7fa1001b132a 448 at::_ops::_to_copy::call()
@ 0x7fa0feee0b95 80 at::_to_copy()
@ 0x7fa0feed7153 144 _to_copy_functionalize()
@ 0x7fa0feedc54b 192 c10::impl::wrap_kernel_functor_unboxed_<>::call()
@ 0x7fa100458e0a 224 c10::callUnboxedKernelFunction<>()
@ 0x7fa1002eb527 320 c10::Dispatcher::redispatch<>()
@ 0x7fa1001b1750 224 at::_ops::_to_copy::redispatch()
@ 0x7fa104131db7 112 at::redispatch::_to_copy()
@ 0x7fa103ff4396 160 torch::autograd::VariableType::(anonymous namespace)::_to_copy()::{lambda()#1}::operator()()
@ 0x7fa103ff478d 272 torch::autograd::VariableType::(anonymous namespace)::_to_copy()
@ 0x7fa1040ec494 240 c10::impl::wrap_kernel_functor_unboxed_<>::call()
@ 0x7fa100458e0a 224 c10::callUnboxedKernelFunction<>()
@ 0x7fa1001b132a 448 at::_ops::_to_copy::call()
@ 0x7fa0feee0b95 80 at::_to_copy()
@ 0x7fa0ff9c06bf 96 at::native::to_impl()
@ 0x7fa0ff9c0b49 128 at::native::to()
@ 0x7fa1019385af 112 at::(anonymous namespace)::(anonymous namespace)::wrapper_CompositeImplicitAutograd_dtype_layout_to()
@ 0x7fa101a41f10 224 c10::impl::wrap_kernel_functor_unboxed_<>::call()
@ 0x7fa1007f9b08 240 c10::callUnboxedKernelFunction<>()
@ 0x7fa1005bff19 464 at::_ops::to_dtype_layout::call()
@ 0x7fa118aad8ab 144 at::Tensor::to()
@ 0x7fa1189d22e5 128 torch::autograd::dispatch_to()
@ 0x7fa1189da781 352 torch::autograd::THPVariable_to()
@ 0x5572b23db0c7 (unknown) (unknown)
@ 0x5572b27c8ba0 (unknown) (unknown)
https://symbolize.stripped_domain/r/?trace=7f9fd18f4cfc,7f9fd186afe4,7fa11b84251f,7f9fd186d964,7f9fd18f73f3,7f9fd155c8f0,7f9fd14c74d7,7f9fce456d76,7f9fc8d7edbc,7f9fc8d9b90c,7f9fc8d9c439,7f9fc550cb6b,7f9fc550a7b1,7f9fc55085ad,7f9fc8b00e7d,7f9fc54b4422,7f9fc545674c,7f9fc54483ae,7fa04327da4d,7fa04327dd67,7fa0380bc9b4,7fa0380c3624,7fa0380bd7e2,7fa037aa8ad2,7fa037aa9f9f,7fa037a9c3e7,7fa03777ea4d,7fa037779fd0,7fa03777d2b9,7fa0375d885d,7fa037b19cb2,7fa037bdc8d2,7fa100458e09,7fa1002eb526,7fa1001b174f,7fa10117d0eb,7fa1011a0f63,7fa100458e09,7fa1001b1329,7fa0feee0b94,7fa0feed7152,7fa0feedc54a,7fa100458e09,7fa1002eb526,7fa1001b174f,7fa104131db6,7fa103ff4395,7fa103ff478c,7fa1040ec493,7fa100458e09,7fa1001b1329,7fa0feee0b94,7fa0ff9c06be,7fa0ff9c0b48,7fa1019385ae,7fa101a41f0f,7fa1007f9b07,7fa1005bff18,7fa118aad8aa,7fa1189d22e4,7fa1189da780,5572b23db0c6,5572b27c8b9f&map=
E0723 16:55:47.733320 42326 coredump_hook.cc:247] RAW: Remote crash gathering disabled for SIGTERM.
E0723 16:55:47.838193 42326 process_state.cc:808] RAW: Raising signal 15 with default behavior
Terminated
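For scale, a quick back-of-the-envelope check (plain Python, no torch needed) shows the requested tensor could never fit on any device, so an OOM at compile or allocation time is the only correct outcome:

```python
# The repro requests a float32 tensor of shape (1024,) * 5.
elements = 1024 ** 5           # number of elements in the tensor
bytes_needed = elements * 4    # float32 = 4 bytes per element
print(f"{bytes_needed / 2**40:.0f} TiB")  # 4096 TiB, i.e. 4 PiB
```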
Expected behavior
An out-of-memory error (e.g. resource exhausted) should be raised instead of hanging.
Environment
- Reproducible on XLA backend [CPU/TPU/CUDA]: TPU
- torch_xla version: built from PR #9429 ("Error Handling: refactor `ComputationClient::TransferFromDevice` to propagate status")