
TPU hangs on too-large tensor #9502

@ysiraichi

Description

🐛 Bug

PyTorch/XLA does not raise an OOM error (e.g. RESOURCE_EXHAUSTED) when running a computation that has to allocate a tensor too large to fit in device memory. I suspect this might be an OpenXLA bug.

In the program below, the process hangs when trying to print b. To get the stack trace shown, I had to send SIGTERM to kill the process. (A quick check of the requested allocation size follows the trace.)

$ python -X faulthandler
Python 3.10.12 (main, Feb  4 2025, 14:57:36) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.

>>> import torch
>>> import torch_xla
WARNING:root:libtpu.so and TPU device found. Setting PJRT_DEVICE=TPU.

>>> a = torch.rand(1024, 1024, 1024, 1024, 1024, device=torch_xla.device())
>>> b = a.sum()

>>> b
https://symbolize.stripped_domain/r/?trace=7f9fd18f4cfc,7fa11b84251f,7f9fd186d964,7f9fd18f73f3,7f9fd155c8f0,7f9fd14c74d7,7f9fce456d76,7f9fc8d7edbc,7f9fc8d9b90c,7f9fc8d9c439,7f9fc550cb6b,7f9fc550a7b1,7f9fc55085ad,7f9fc8b00e7d,7f9fc54b4422,7f9fc545674c,7f9fc54483ae,7fa04327da4d,7fa04327dd67,7fa0380bc9b4,7fa0380c3624,7fa0380bd7e2,7fa037aa8ad2,7fa037aa9f9f,7fa037a9c3e7,7fa03777ea4d,7fa037779fd0,7fa03777d2b9,7fa0375d885d,7fa037b19cb2,7fa037bdc8d2,7fa100458e09,7fa1002eb526&map=
*** SIGTERM received by PID 42326 (TID 42326) on cpu 81 from PID 43410; stack trace: ***
PC: @     0x7f9fd18f4cfc  (unknown)  (unknown)
    @     0x7f9fd186afe5       1904  (unknown)
    @     0x7fa11b842520  1918670160  (unknown)
    @     0x7f9fd186d965         64  (unknown)
    @     0x7f9fd18f73f4        128  (unknown)
    @     0x7f9fd155c8f1        352  (unknown)
    @     0x7f9fd14c74d8        144  (unknown)
    @     0x7f9fce456d77        448  (unknown)
    @     0x7f9fc8d7edbd       2528  (unknown)
    @     0x7f9fc8d9b90d       5904  (unknown)
    @     0x7f9fc8d9c43a        816  (unknown)
    @     0x7f9fc550cb6c       1120  (unknown)
    @     0x7f9fc550a7b2       1984  (unknown)
    @     0x7f9fc55085ae       2544  (unknown)
    @     0x7f9fc8b00e7e       4528  (unknown)
    @     0x7f9fc54b4423       4400  (unknown)
    @     0x7f9fc545674d       2176  (unknown)
    @     0x7f9fc54483af       4896  (unknown)
    @     0x7fa04327da4e        832  xla::InitializeArgsAndCompile()
    @     0x7fa04327dd68        176  xla::PjRtCApiClient::CompileAndLoad()
    @     0x7fa0380bc9b5       2384  torch_xla::runtime::PjRtComputationClient::Compile()::{lambda()#5}::operator()()
    @     0x7fa0380c3625        144  torch_xla::runtime::util::RaisePythonValueErrorOnFailure<>()
    @     0x7fa0380bd7e3       3792  torch_xla::runtime::PjRtComputationClient::Compile()
    @     0x7fa037aa8ad3       3872  torch_xla::XLAGraphExecutor::Compile()
    @     0x7fa037aa9fa0       1120  torch_xla::XLAGraphExecutor::SyncTensorsGraphInternal()
    @     0x7fa037a9c3e8        528  torch_xla::XLAGraphExecutor::SyncTensorsGraph()
    @     0x7fa03777ea4e        144  torch_xla::XLATensor::ApplyPendingGraph()
    @     0x7fa037779fd1        528  torch_xla::XLATensor::GetXlaData()
    @     0x7fa03777d2ba        208  torch_xla::XLATensor::ToTensor()
    @     0x7fa0375d885e        560  torch_xla::XLANativeFunctions::_to_copy()
    @     0x7fa037b19cb3         64  at::(anonymous namespace)::(anonymous namespace)::wrapper_XLA___to_copy()
    @     0x7fa037bdc8d3        192  c10::impl::wrap_kernel_functor_unboxed_<>::call()
    @     0x7fa100458e0a        224  c10::callUnboxedKernelFunction<>()
    @     0x7fa1002eb527        320  c10::Dispatcher::redispatch<>()
    @     0x7fa1001b1750        224  at::_ops::_to_copy::redispatch()
    @     0x7fa10117d0ec        128  at::(anonymous namespace)::_to_copy()
    @     0x7fa1011a0f64        192  c10::impl::wrap_kernel_functor_unboxed_<>::call()
    @     0x7fa100458e0a        224  c10::callUnboxedKernelFunction<>()
    @     0x7fa1001b132a        448  at::_ops::_to_copy::call()
    @     0x7fa0feee0b95         80  at::_to_copy()
    @     0x7fa0feed7153        144  _to_copy_functionalize()
    @     0x7fa0feedc54b        192  c10::impl::wrap_kernel_functor_unboxed_<>::call()
    @     0x7fa100458e0a        224  c10::callUnboxedKernelFunction<>()
    @     0x7fa1002eb527        320  c10::Dispatcher::redispatch<>()
    @     0x7fa1001b1750        224  at::_ops::_to_copy::redispatch()
    @     0x7fa104131db7        112  at::redispatch::_to_copy()
    @     0x7fa103ff4396        160  torch::autograd::VariableType::(anonymous namespace)::_to_copy()::{lambda()#1}::operator()()
    @     0x7fa103ff478d        272  torch::autograd::VariableType::(anonymous namespace)::_to_copy()
    @     0x7fa1040ec494        240  c10::impl::wrap_kernel_functor_unboxed_<>::call()
    @     0x7fa100458e0a        224  c10::callUnboxedKernelFunction<>()
    @     0x7fa1001b132a        448  at::_ops::_to_copy::call()
    @     0x7fa0feee0b95         80  at::_to_copy()
    @     0x7fa0ff9c06bf         96  at::native::to_impl()
    @     0x7fa0ff9c0b49        128  at::native::to()
    @     0x7fa1019385af        112  at::(anonymous namespace)::(anonymous namespace)::wrapper_CompositeImplicitAutograd_dtype_layout_to()
    @     0x7fa101a41f10        224  c10::impl::wrap_kernel_functor_unboxed_<>::call()
    @     0x7fa1007f9b08        240  c10::callUnboxedKernelFunction<>()
    @     0x7fa1005bff19        464  at::_ops::to_dtype_layout::call()
    @     0x7fa118aad8ab        144  at::Tensor::to()
    @     0x7fa1189d22e5        128  torch::autograd::dispatch_to()
    @     0x7fa1189da781        352  torch::autograd::THPVariable_to()
    @     0x5572b23db0c7  (unknown)  (unknown)
    @     0x5572b27c8ba0  (unknown)  (unknown)
https://symbolize.stripped_domain/r/?trace=7f9fd18f4cfc,7f9fd186afe4,7fa11b84251f,7f9fd186d964,7f9fd18f73f3,7f9fd155c8f0,7f9fd14c74d7,7f9fce456d76,7f9fc8d7edbc,7f9fc8d9b90c,7f9fc8d9c439,7f9fc550cb6b,7f9fc550a7b1,7f9fc55085ad,7f9fc8b00e7d,7f9fc54b4422,7f9fc545674c,7f9fc54483ae,7fa04327da4d,7fa04327dd67,7fa0380bc9b4,7fa0380c3624,7fa0380bd7e2,7fa037aa8ad2,7fa037aa9f9f,7fa037a9c3e7,7fa03777ea4d,7fa037779fd0,7fa03777d2b9,7fa0375d885d,7fa037b19cb2,7fa037bdc8d2,7fa100458e09,7fa1002eb526,7fa1001b174f,7fa10117d0eb,7fa1011a0f63,7fa100458e09,7fa1001b1329,7fa0feee0b94,7fa0feed7152,7fa0feedc54a,7fa100458e09,7fa1002eb526,7fa1001b174f,7fa104131db6,7fa103ff4395,7fa103ff478c,7fa1040ec493,7fa100458e09,7fa1001b1329,7fa0feee0b94,7fa0ff9c06be,7fa0ff9c0b48,7fa1019385ae,7fa101a41f0f,7fa1007f9b07,7fa1005bff18,7fa118aad8aa,7fa1189d22e4,7fa1189da780,5572b23db0c6,5572b27c8b9f&map=
E0723 16:55:47.733320   42326 coredump_hook.cc:247] RAW: Remote crash gathering disabled for SIGTERM.
E0723 16:55:47.838193   42326 process_state.cc:808] RAW: Raising signal 15 with default behavior
Terminated
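
For reference, the requested tensor is 1024^5 = 2^50 float32 elements, i.e. about 4 PiB, far beyond any TPU's HBM, so an OOM is clearly the right outcome. Note also that the trace above is stuck inside xla::InitializeArgsAndCompile / xla::PjRtCApiClient::CompileAndLoad, which suggests the hang happens while compiling the graph, not while executing it. A quick sanity check of the requested size (plain PyTorch, no TPU needed):

>>> import math, torch
>>> shape = (1024, 1024, 1024, 1024, 1024)
>>> n_bytes = math.prod(shape) * torch.finfo(torch.float32).bits // 8
>>> n_bytes / 2**50  # in PiB
4.0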

Expected behavior

An out-of-memory error (e.g. RESOURCE_EXHAUSTED) should be raised instead of hanging.
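
Until this is fixed, one possible client-side mitigation is to bound requested allocations before they enter the lazy graph. The following is a minimal sketch, not a torch_xla API: checked_rand and HBM_BUDGET_BYTES are hypothetical names introduced here for illustration, and the 16 GiB budget is an assumption to be replaced with the actual per-chip HBM.

import math
import torch

# Hypothetical per-device budget (assumption: 16 GiB; adjust to your chip's HBM).
HBM_BUDGET_BYTES = 16 * 2**30

def checked_rand(*shape, dtype=torch.float32, device=None):
    # Refuse allocations that cannot possibly fit, instead of hanging later.
    n_bytes = math.prod(shape) * torch.finfo(dtype).bits // 8
    if n_bytes > HBM_BUDGET_BYTES:
        raise RuntimeError(
            f"requested {n_bytes:,} bytes > budget {HBM_BUDGET_BYTES:,}")
    return torch.rand(*shape, dtype=dtype, device=device)

# checked_rand(1024, 1024, 1024, 1024, 1024, device=torch_xla.device())
# -> raises RuntimeError eagerly instead of hanging at sync time.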

Environment


Labels

bug (Something isn't working), xla:tpu (TPU specific issues and PRs)
