Skip to content

[MLIR][Python] Fix bug when garbage collecting a DenseResourceElementsAttribute #149414

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from

Conversation

mamanain
Copy link
Contributor

Bug description

When running iree.turbine example script we have encountered the following error:

import torch
import iree.turbine.aot as aot
module = torch.nn.Linear(2, 2, bias=True)
exported = aot.export(module, torch.rand(2, 2))

> python test.py
Fatal Python error: _PyImport_Init: global import state already initialized
Python runtime state: preinitialized

Current thread 0x00007dd45de45300 (most recent call first):
  Garbage-collecting
  <no Python frame>

This error reproduces in multiple environments from a simple installation from the main branch. Other people have also noticed this issue, however it was not caught in turbine's CI since this test is disabled there:

tests/examples/aot_mlp_test.py::AOTMLPTest::testMLPExportSimple 
[gw0] [ 95%] SKIPPED tests/examples/aot_mlp_test.py::AOTMLPTest::testMLPExportSimple 

While figuring out the origins of the issue we have found the export part which leads to this issue happening and it was related to MLIR's python bindings:

from iree.compiler.ir import DenseResourceElementsAttr, RankedTensorType, IntegerType, Context, Location
import numpy as np

ctx = Context()

with Location.unknown(ctx):
    DenseResourceElementsAttr.get_from_buffer(
        memoryview(np.array([1,2,3])),
        "some_resource",
        RankedTensorType.get((3,), IntegerType.get_signed(32)),
        context=ctx
    )

> python bindings.py
Fatal Python error: _PyImport_Init: global import state already initialized
Python runtime state: preinitialized

Current thread 0x00007dd45de45300 (most recent call first):
  Garbage-collecting
  <no Python frame>

This reproduces with bindings not built by IREE.
Turned out that if we explicitly destroy the context object issue from the previous script disappears, same thing happens if we explicitly destroy the export result in the first script.

...
del ctx
> python bindings.py
>
...
del exported
gc.collect()
> python test.py
>

I would argue that this is a bug and needs to be fixed, since not explicitly destroying all the objects shouldn't lead to a fatal python error.

Bug origins

Bug comes from this modification to the bindings - here people introduced explicit construction of python interpreter and GIL ca when deleting a DenseResource object.

    // The userData is a Py_buffer* that the deleter owns.
    auto deleter = [](void *userData, const void *data, size_t size,
                      size_t align) {
      if (!Py_IsInitialized())
        Py_Initialize();
      Py_buffer *ownedView = static_cast<Py_buffer *>(userData);
      nb::gil_scoped_acquire gil;
      PyBuffer_Release(ownedView);
      delete ownedView;
    };

However, when python terminates the interpreter it sets IsInitilized variable to false before doing garbage collection. So when the deleter is called Py_IsInitialized returns false while actually not all of the interpreter state has been stripped down. Import state still exist and so this leads to the error that we see in the scripts:

PyStatus
_PyImport_Init(void)
{
    if (INITTAB != NULL) {
        return _PyStatus_ERR("global import state already initialized");
    }
    if (init_builtin_modules_table() != 0) {
        return PyStatus_NoMemory();
    }
    return _PyStatus_OK();
}
// this function is ran as a part of initialization process, so it throws an error since INITTAB hasn't been deleted yet

Solution

We need to separate cases when interpreter says it is not initialized due to never existing or being killed in the moment. There is a function which can do that.

if (!(Py_IsInitialized() || _Py_IsFinalizing()))
        Py_Initialize();

However this function was deprecated in python3.13 and another one came into its place. To support all versions of Python we introduce a function which compiles to different variants based on python version. We took this function from lldb.

Maksim Manainen added 2 commits July 17, 2025 14:45
Copy link

Thank you for submitting a Pull Request (PR) to the LLVM Project!

This PR will be automatically labeled and the relevant teams will be notified.

If you wish to, you can add reviewers by using the "Reviewers" section on this page.

If this is not working for you, it is probably because you do not have write permissions for the repository. In which case you can instead tag reviewers by name in a comment by using @ followed by their GitHub username.

If you have received no comments on your PR for a week, you can request a review by "ping"ing the PR by adding a comment “Ping”. The common courtesy "ping" rate is once a week. Please remember that you are asking for valuable time from other developers.

If you have further questions, they may be answered by the LLVM GitHub User Guide.

You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums.

@llvmbot llvmbot added the mlir label Jul 17, 2025
@llvmbot
Copy link
Member

llvmbot commented Jul 17, 2025

@llvm/pr-subscribers-mlir

Author: Max Manainen (mamanain)

Changes

Bug description

When running iree.turbine example script we have encountered the following error:

import torch
import iree.turbine.aot as aot
module = torch.nn.Linear(2, 2, bias=True)
exported = aot.export(module, torch.rand(2, 2))

&gt; python test.py
Fatal Python error: _PyImport_Init: global import state already initialized
Python runtime state: preinitialized

Current thread 0x00007dd45de45300 (most recent call first):
  Garbage-collecting
  &lt;no Python frame&gt;

This error reproduces in multiple environments from a simple installation from the main branch. Other people have also noticed this issue, however it was not caught in turbine's CI since this test is disabled there:

tests/examples/aot_mlp_test.py::AOTMLPTest::testMLPExportSimple 
[gw0] [ 95%] SKIPPED tests/examples/aot_mlp_test.py::AOTMLPTest::testMLPExportSimple 

While figuring out the origins of the issue we have found the export part which leads to this issue happening and it was related to MLIR's python bindings:

from iree.compiler.ir import DenseResourceElementsAttr, RankedTensorType, IntegerType, Context, Location
import numpy as np

ctx = Context()

with Location.unknown(ctx):
    DenseResourceElementsAttr.get_from_buffer(
        memoryview(np.array([1,2,3])),
        "some_resource",
        RankedTensorType.get((3,), IntegerType.get_signed(32)),
        context=ctx
    )

&gt; python bindings.py
Fatal Python error: _PyImport_Init: global import state already initialized
Python runtime state: preinitialized

Current thread 0x00007dd45de45300 (most recent call first):
  Garbage-collecting
  &lt;no Python frame&gt;

This reproduces with bindings not built by IREE.
Turned out that if we explicitly destroy the context object issue from the previous script disappears, same thing happens if we explicitly destroy the export result in the first script.

...
del ctx
&gt; python bindings.py
&gt;
...
del exported
gc.collect()
&gt; python test.py
&gt;

I would argue that this is a bug and needs to be fixed, since not explicitly destroying all the objects shouldn't lead to a fatal python error.

Bug origins

Bug comes from this modification to the bindings - here people introduced explicit construction of python interpreter and GIL ca when deleting a DenseResource object.

    // The userData is a Py_buffer* that the deleter owns.
    auto deleter = [](void *userData, const void *data, size_t size,
                      size_t align) {
      if (!Py_IsInitialized())
        Py_Initialize();
      Py_buffer *ownedView = static_cast&lt;Py_buffer *&gt;(userData);
      nb::gil_scoped_acquire gil;
      PyBuffer_Release(ownedView);
      delete ownedView;
    };

However, when python terminates the interpreter it sets IsInitilized variable to false before doing garbage collection. So when the deleter is called Py_IsInitialized returns false while actually not all of the interpreter state has been stripped down. Import state still exist and so this leads to the error that we see in the scripts:

PyStatus
_PyImport_Init(void)
{
    if (INITTAB != NULL) {
        return _PyStatus_ERR("global import state already initialized");
    }
    if (init_builtin_modules_table() != 0) {
        return PyStatus_NoMemory();
    }
    return _PyStatus_OK();
}
// this function is ran as a part of initialization process, so it throws an error since INITTAB hasn't been deleted yet

Solution

We need to separate cases when interpreter says it is not initialized due to never existing or being killed in the moment. There is a function which can do that.

if (!(Py_IsInitialized() || _Py_IsFinalizing()))
        Py_Initialize();

However this function was deprecated in python3.13 and another one came into its place. To support all versions of Python we introduce a function which compiles to different variants based on python version. We took this function from lldb.


Full diff: https://github.com/llvm/llvm-project/pull/149414.diff

2 Files Affected:

  • (modified) mlir/lib/Bindings/Python/IRAttributes.cpp (+11-1)
  • (modified) mlir/test/python/ir/array_attributes.py (+13)
diff --git a/mlir/lib/Bindings/Python/IRAttributes.cpp b/mlir/lib/Bindings/Python/IRAttributes.cpp
index 8f79caf08a6d0..fad7d2ad25db0 100644
--- a/mlir/lib/Bindings/Python/IRAttributes.cpp
+++ b/mlir/lib/Bindings/Python/IRAttributes.cpp
@@ -1428,6 +1428,16 @@ class PyDenseIntElementsAttribute
   }
 };
 
+static bool pythonIsFinalizing() {
+#if PY_VERSION_HEX >= 0x030d0000
+  return Py_IsFinalizing();
+#elif PY_VERSION_HEX >= 0x03070000
+  return _Py_IsFinalizing();
+#else
+  return _Py_Finalizing != nullptr;
+#endif
+}
+
 class PyDenseResourceElementsAttribute
     : public PyConcreteAttribute<PyDenseResourceElementsAttribute> {
 public:
@@ -1474,7 +1484,7 @@ class PyDenseResourceElementsAttribute
     // The userData is a Py_buffer* that the deleter owns.
     auto deleter = [](void *userData, const void *data, size_t size,
                       size_t align) {
-      if (!Py_IsInitialized())
+      if (!(Py_IsInitialized() || pythonIsFinalizing()))
         Py_Initialize();
       Py_buffer *ownedView = static_cast<Py_buffer *>(userData);
       nb::gil_scoped_acquire gil;
diff --git a/mlir/test/python/ir/array_attributes.py b/mlir/test/python/ir/array_attributes.py
index ef1d835fc6401..48ae2fdaf567f 100644
--- a/mlir/test/python/ir/array_attributes.py
+++ b/mlir/test/python/ir/array_attributes.py
@@ -617,3 +617,16 @@ def test_attribute(context, mview):
     # CHECK: BACKING MEMORY DELETED
     # CHECK: EXIT FUNCTION
     print("EXIT FUNCTION")
+
+
+print("TEST: danglingResource")
+# This error occurs only when there is an alive context with a DenseResourceElementsAttr 
+# in the end of the program, so we put it here without an encapsulating function.
+ctx = Context()
+
+with ctx, Location.unknown():
+    DenseResourceElementsAttr.get_from_buffer(
+        memoryview(np.array([1,2,3])),
+        "some_resource",
+        RankedTensorType.get((3,), IntegerType.get_signed(32))
+    )

Copy link

⚠️ Python code formatter, darker found issues in your code. ⚠️

You can test this locally with the following command:
darker --check --diff -r HEAD~1...HEAD mlir/test/python/ir/array_attributes.py
View the diff from darker here.
--- array_attributes.py	2025-07-17 14:50:31.000000 +0000
+++ array_attributes.py	2025-07-17 22:19:01.649894 +0000
@@ -618,15 +618,15 @@
     # CHECK: EXIT FUNCTION
     print("EXIT FUNCTION")
 
 
 print("TEST: danglingResource")
-# This error occurs only when there is an alive context with a DenseResourceElementsAttr 
+# This error occurs only when there is an alive context with a DenseResourceElementsAttr
 # in the end of the program, so we put it here without an encapsulating function.
 ctx = Context()
 
 with ctx, Location.unknown():
     DenseResourceElementsAttr.get_from_buffer(
-        memoryview(np.array([1,2,3])),
+        memoryview(np.array([1, 2, 3])),
         "some_resource",
-        RankedTensorType.get((3,), IntegerType.get_signed(32))
+        RankedTensorType.get((3,), IntegerType.get_signed(32)),
     )

@makslevental makslevental self-assigned this Jul 17, 2025
@@ -617,3 +617,16 @@ def test_attribute(context, mview):
# CHECK: BACKING MEMORY DELETED
# CHECK: EXIT FUNCTION
print("EXIT FUNCTION")


print("TEST: danglingResource")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR - I'll have a closer look soon - but just heads up this test won't actually "check" anything. Notice the #CHECK comments in the rest of the file.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also can you put it into its own function.

Copy link
Contributor Author

@mamanain mamanain Jul 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It won't check anything specifically but it still will fail since this bug leads to return code of the whole script to be non 0. Without the fix the test fails.

I've tried to put it into a function but I think that in this case it has to be outside of one for bug to replicate. You need a context object to exist when garbage collecting during script termination.

Thank you!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@makslevental hi! Just gently pinging you on this PR. Please let me know if I shouldn't. There is no rush here, the bug is quite mild.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry man I've been caught up - I'll get to it today I promise.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No worries at all

makslevental added a commit that referenced this pull request Jul 25, 2025
This PR melds #150137 and
#149414 *and* partially reverts
#124832.

The summary is the `PyDenseResourceElementsAttribute` finalizer/deleter
has/had two problems

1. wasn't threadsafe (can be called from a different thread than that
which currently holds the GIL)
2. can be called while the interpreter is "not initialized"

#124832 for some reason decides
to re-initialize the interpreter to avoid case 2 and runs afoul of the
fact that `Py_IsInitialized` can be false during the finalization of the
interpreter itself (e.g., at the end of a script).

I don't know why this decision was made (I missed the PR) but I believe
we should never be calling
[Py_Initialize](https://docs.python.org/3/c-api/init.html#c.Py_Initialize):

> In an application \*\*\*\***embedding Python**\*\*\*\*, this should be
called before using any other Python/C API functions

**but we aren't embedding Python**!

So therefore we will only be in case 2 when the interpreter is being
finalized and in that case we should just leak the buffer.

Note,
[lldb](https://github.com/llvm/llvm-project/blob/548ca9e97673a168023a616d311d901ca04b29a3/lldb/source/Plugins/ScriptInterpreter/Python/PythonDataObjects.cpp#L81-L93)
does a similar sort of thing for its finalizers.

Co-authored-by: Anton Korobeynikov <[email protected]>
Co-authored-by: Max Manainen <[email protected]>

Co-authored-by: Anton Korobeynikov <[email protected]>
Co-authored-by: Max Manainen <[email protected]>
@makslevental
Copy link
Contributor

Fixed by #150561

llvm-sync bot pushed a commit to arm/arm-toolchain that referenced this pull request Jul 25, 2025
…zer (#150561)

This PR melds llvm/llvm-project#150137 and
llvm/llvm-project#149414 *and* partially reverts
llvm/llvm-project#124832.

The summary is the `PyDenseResourceElementsAttribute` finalizer/deleter
has/had two problems

1. wasn't threadsafe (can be called from a different thread than that
which currently holds the GIL)
2. can be called while the interpreter is "not initialized"

llvm/llvm-project#124832 for some reason decides
to re-initialize the interpreter to avoid case 2 and runs afoul of the
fact that `Py_IsInitialized` can be false during the finalization of the
interpreter itself (e.g., at the end of a script).

I don't know why this decision was made (I missed the PR) but I believe
we should never be calling
[Py_Initialize](https://docs.python.org/3/c-api/init.html#c.Py_Initialize):

> In an application \*\*\*\***embedding Python**\*\*\*\*, this should be
called before using any other Python/C API functions

**but we aren't embedding Python**!

So therefore we will only be in case 2 when the interpreter is being
finalized and in that case we should just leak the buffer.

Note,
[lldb](https://github.com/llvm/llvm-project/blob/548ca9e97673a168023a616d311d901ca04b29a3/lldb/source/Plugins/ScriptInterpreter/Python/PythonDataObjects.cpp#L81-L93)
does a similar sort of thing for its finalizers.

Co-authored-by: Anton Korobeynikov <[email protected]>
Co-authored-by: Max Manainen <[email protected]>

Co-authored-by: Anton Korobeynikov <[email protected]>
Co-authored-by: Max Manainen <[email protected]>
@mamanain mamanain closed this Jul 25, 2025
mahesh-attarde pushed a commit to mahesh-attarde/llvm-project that referenced this pull request Jul 28, 2025
…50561)

This PR melds llvm#150137 and
llvm#149414 *and* partially reverts
llvm#124832.

The summary is the `PyDenseResourceElementsAttribute` finalizer/deleter
has/had two problems

1. wasn't threadsafe (can be called from a different thread than that
which currently holds the GIL)
2. can be called while the interpreter is "not initialized"

llvm#124832 for some reason decides
to re-initialize the interpreter to avoid case 2 and runs afoul of the
fact that `Py_IsInitialized` can be false during the finalization of the
interpreter itself (e.g., at the end of a script).

I don't know why this decision was made (I missed the PR) but I believe
we should never be calling
[Py_Initialize](https://docs.python.org/3/c-api/init.html#c.Py_Initialize):

> In an application \*\*\*\***embedding Python**\*\*\*\*, this should be
called before using any other Python/C API functions

**but we aren't embedding Python**!

So therefore we will only be in case 2 when the interpreter is being
finalized and in that case we should just leak the buffer.

Note,
[lldb](https://github.com/llvm/llvm-project/blob/548ca9e97673a168023a616d311d901ca04b29a3/lldb/source/Plugins/ScriptInterpreter/Python/PythonDataObjects.cpp#L81-L93)
does a similar sort of thing for its finalizers.

Co-authored-by: Anton Korobeynikov <[email protected]>
Co-authored-by: Max Manainen <[email protected]>

Co-authored-by: Anton Korobeynikov <[email protected]>
Co-authored-by: Max Manainen <[email protected]>
ajaden-codes pushed a commit to Jaddyen/llvm-project that referenced this pull request Jul 28, 2025
…50561)

This PR melds llvm#150137 and
llvm#149414 *and* partially reverts
llvm#124832.

The summary is the `PyDenseResourceElementsAttribute` finalizer/deleter
has/had two problems

1. wasn't threadsafe (can be called from a different thread than that
which currently holds the GIL)
2. can be called while the interpreter is "not initialized"

llvm#124832 for some reason decides
to re-initialize the interpreter to avoid case 2 and runs afoul of the
fact that `Py_IsInitialized` can be false during the finalization of the
interpreter itself (e.g., at the end of a script).

I don't know why this decision was made (I missed the PR) but I believe
we should never be calling
[Py_Initialize](https://docs.python.org/3/c-api/init.html#c.Py_Initialize):

> In an application \*\*\*\***embedding Python**\*\*\*\*, this should be
called before using any other Python/C API functions

**but we aren't embedding Python**!

So therefore we will only be in case 2 when the interpreter is being
finalized and in that case we should just leak the buffer.

Note,
[lldb](https://github.com/llvm/llvm-project/blob/548ca9e97673a168023a616d311d901ca04b29a3/lldb/source/Plugins/ScriptInterpreter/Python/PythonDataObjects.cpp#L81-L93)
does a similar sort of thing for its finalizers.

Co-authored-by: Anton Korobeynikov <[email protected]>
Co-authored-by: Max Manainen <[email protected]>

Co-authored-by: Anton Korobeynikov <[email protected]>
Co-authored-by: Max Manainen <[email protected]>
makslevental added a commit to makslevental/python_bindings_fork that referenced this pull request Aug 13, 2025
This PR melds llvm/llvm-project#150137 and
llvm/llvm-project#149414 *and* partially reverts
llvm/llvm-project#124832.

The summary is the `PyDenseResourceElementsAttribute` finalizer/deleter
has/had two problems

1. wasn't threadsafe (can be called from a different thread than that
which currently holds the GIL)
2. can be called while the interpreter is "not initialized"

llvm/llvm-project#124832 for some reason decides
to re-initialize the interpreter to avoid case 2 and runs afoul of the
fact that `Py_IsInitialized` can be false during the finalization of the
interpreter itself (e.g., at the end of a script).

I don't know why this decision was made (I missed the PR) but I believe
we should never be calling
[Py_Initialize](https://docs.python.org/3/c-api/init.html#c.Py_Initialize):

> In an application \*\*\*\***embedding Python**\*\*\*\*, this should be
called before using any other Python/C API functions

**but we aren't embedding Python**!

So therefore we will only be in case 2 when the interpreter is being
finalized and in that case we should just leak the buffer.

Note,
[lldb](https://github.com/llvm/llvm-project/blob/548ca9e97673a168023a616d311d901ca04b29a3/lldb/source/Plugins/ScriptInterpreter/Python/PythonDataObjects.cpp#L81-L93)
does a similar sort of thing for its finalizers.

Co-authored-by: Anton Korobeynikov <[email protected]>
Co-authored-by: Max Manainen <[email protected]>

Co-authored-by: Anton Korobeynikov <[email protected]>
Co-authored-by: Max Manainen <[email protected]>
makslevental added a commit to makslevental/python_bindings_fork that referenced this pull request Aug 14, 2025
This PR melds llvm/llvm-project#150137 and
llvm/llvm-project#149414 *and* partially reverts
llvm/llvm-project#124832.

The summary is the `PyDenseResourceElementsAttribute` finalizer/deleter
has/had two problems

1. wasn't threadsafe (can be called from a different thread than that
which currently holds the GIL)
2. can be called while the interpreter is "not initialized"

llvm/llvm-project#124832 for some reason decides
to re-initialize the interpreter to avoid case 2 and runs afoul of the
fact that `Py_IsInitialized` can be false during the finalization of the
interpreter itself (e.g., at the end of a script).

I don't know why this decision was made (I missed the PR) but I believe
we should never be calling
[Py_Initialize](https://docs.python.org/3/c-api/init.html#c.Py_Initialize):

> In an application \*\*\*\***embedding Python**\*\*\*\*, this should be
called before using any other Python/C API functions

**but we aren't embedding Python**!

So therefore we will only be in case 2 when the interpreter is being
finalized and in that case we should just leak the buffer.

Note,
[lldb](https://github.com/llvm/llvm-project/blob/548ca9e97673a168023a616d311d901ca04b29a3/lldb/source/Plugins/ScriptInterpreter/Python/PythonDataObjects.cpp#L81-L93)
does a similar sort of thing for its finalizers.

Co-authored-by: Anton Korobeynikov <[email protected]>
Co-authored-by: Max Manainen <[email protected]>

Co-authored-by: Anton Korobeynikov <[email protected]>
Co-authored-by: Max Manainen <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants