-
Notifications
You must be signed in to change notification settings - Fork 14.7k
[MLIR][Python] Fix bug when garbage collecting a DenseResourceElementsAttribute #149414
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Thank you for submitting a Pull Request (PR) to the LLVM Project! This PR will be automatically labeled and the relevant teams will be notified. If you wish to, you can add reviewers by using the "Reviewers" section on this page. If this is not working for you, it is probably because you do not have write permissions for the repository. In which case you can instead tag reviewers by name in a comment by using If you have received no comments on your PR for a week, you can request a review by "ping"ing the PR by adding a comment “Ping”. The common courtesy "ping" rate is once a week. Please remember that you are asking for valuable time from other developers. If you have further questions, they may be answered by the LLVM GitHub User Guide. You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums. |
@llvm/pr-subscribers-mlir Author: Max Manainen (mamanain) ChangesBug descriptionWhen running import torch
import iree.turbine.aot as aot
module = torch.nn.Linear(2, 2, bias=True)
exported = aot.export(module, torch.rand(2, 2))
> python test.py
Fatal Python error: _PyImport_Init: global import state already initialized
Python runtime state: preinitialized
Current thread 0x00007dd45de45300 (most recent call first):
Garbage-collecting
<no Python frame> This error reproduces in multiple environments from a simple installation from the main branch. Other people have also noticed this issue, however it was not caught in turbine's CI since this test is disabled there:
While figuring out the origins of the issue we have found the export part which leads to this issue happening and it was related to MLIR's python bindings: from iree.compiler.ir import DenseResourceElementsAttr, RankedTensorType, IntegerType, Context, Location
import numpy as np
ctx = Context()
with Location.unknown(ctx):
DenseResourceElementsAttr.get_from_buffer(
memoryview(np.array([1,2,3])),
"some_resource",
RankedTensorType.get((3,), IntegerType.get_signed(32)),
context=ctx
)
> python bindings.py
Fatal Python error: _PyImport_Init: global import state already initialized
Python runtime state: preinitialized
Current thread 0x00007dd45de45300 (most recent call first):
Garbage-collecting
<no Python frame> This reproduces with bindings not built by IREE. ...
del ctx
> python bindings.py
> ...
del exported
gc.collect()
> python test.py
> I would argue that this is a bug and needs to be fixed, since not explicitly destroying all the objects shouldn't lead to a fatal python error. Bug originsBug comes from this modification to the bindings - here people introduced explicit construction of python interpreter and GIL ca when deleting a DenseResource object. // The userData is a Py_buffer* that the deleter owns.
auto deleter = [](void *userData, const void *data, size_t size,
size_t align) {
if (!Py_IsInitialized())
Py_Initialize();
Py_buffer *ownedView = static_cast<Py_buffer *>(userData);
nb::gil_scoped_acquire gil;
PyBuffer_Release(ownedView);
delete ownedView;
}; However, when python terminates the interpreter it sets IsInitilized variable to false before doing garbage collection. So when the deleter is called PyStatus
_PyImport_Init(void)
{
if (INITTAB != NULL) {
return _PyStatus_ERR("global import state already initialized");
}
if (init_builtin_modules_table() != 0) {
return PyStatus_NoMemory();
}
return _PyStatus_OK();
}
// this function is ran as a part of initialization process, so it throws an error since INITTAB hasn't been deleted yet SolutionWe need to separate cases when interpreter says it is not initialized due to never existing or being killed in the moment. There is a function which can do that. if (!(Py_IsInitialized() || _Py_IsFinalizing()))
Py_Initialize(); However this function was deprecated in python3.13 and another one came into its place. To support all versions of Python we introduce a function which compiles to different variants based on python version. We took this function from lldb. Full diff: https://github.com/llvm/llvm-project/pull/149414.diff 2 Files Affected:
diff --git a/mlir/lib/Bindings/Python/IRAttributes.cpp b/mlir/lib/Bindings/Python/IRAttributes.cpp
index 8f79caf08a6d0..fad7d2ad25db0 100644
--- a/mlir/lib/Bindings/Python/IRAttributes.cpp
+++ b/mlir/lib/Bindings/Python/IRAttributes.cpp
@@ -1428,6 +1428,16 @@ class PyDenseIntElementsAttribute
}
};
+static bool pythonIsFinalizing() {
+#if PY_VERSION_HEX >= 0x030d0000
+ return Py_IsFinalizing();
+#elif PY_VERSION_HEX >= 0x03070000
+ return _Py_IsFinalizing();
+#else
+ return _Py_Finalizing != nullptr;
+#endif
+}
+
class PyDenseResourceElementsAttribute
: public PyConcreteAttribute<PyDenseResourceElementsAttribute> {
public:
@@ -1474,7 +1484,7 @@ class PyDenseResourceElementsAttribute
// The userData is a Py_buffer* that the deleter owns.
auto deleter = [](void *userData, const void *data, size_t size,
size_t align) {
- if (!Py_IsInitialized())
+ if (!(Py_IsInitialized() || pythonIsFinalizing()))
Py_Initialize();
Py_buffer *ownedView = static_cast<Py_buffer *>(userData);
nb::gil_scoped_acquire gil;
diff --git a/mlir/test/python/ir/array_attributes.py b/mlir/test/python/ir/array_attributes.py
index ef1d835fc6401..48ae2fdaf567f 100644
--- a/mlir/test/python/ir/array_attributes.py
+++ b/mlir/test/python/ir/array_attributes.py
@@ -617,3 +617,16 @@ def test_attribute(context, mview):
# CHECK: BACKING MEMORY DELETED
# CHECK: EXIT FUNCTION
print("EXIT FUNCTION")
+
+
+print("TEST: danglingResource")
+# This error occurs only when there is an alive context with a DenseResourceElementsAttr
+# in the end of the program, so we put it here without an encapsulating function.
+ctx = Context()
+
+with ctx, Location.unknown():
+ DenseResourceElementsAttr.get_from_buffer(
+ memoryview(np.array([1,2,3])),
+ "some_resource",
+ RankedTensorType.get((3,), IntegerType.get_signed(32))
+ )
|
You can test this locally with the following command:darker --check --diff -r HEAD~1...HEAD mlir/test/python/ir/array_attributes.py View the diff from darker here.--- array_attributes.py 2025-07-17 14:50:31.000000 +0000
+++ array_attributes.py 2025-07-17 22:19:01.649894 +0000
@@ -618,15 +618,15 @@
# CHECK: EXIT FUNCTION
print("EXIT FUNCTION")
print("TEST: danglingResource")
-# This error occurs only when there is an alive context with a DenseResourceElementsAttr
+# This error occurs only when there is an alive context with a DenseResourceElementsAttr
# in the end of the program, so we put it here without an encapsulating function.
ctx = Context()
with ctx, Location.unknown():
DenseResourceElementsAttr.get_from_buffer(
- memoryview(np.array([1,2,3])),
+ memoryview(np.array([1, 2, 3])),
"some_resource",
- RankedTensorType.get((3,), IntegerType.get_signed(32))
+ RankedTensorType.get((3,), IntegerType.get_signed(32)),
)
|
@@ -617,3 +617,16 @@ def test_attribute(context, mview): | |||
# CHECK: BACKING MEMORY DELETED | |||
# CHECK: EXIT FUNCTION | |||
print("EXIT FUNCTION") | |||
|
|||
|
|||
print("TEST: danglingResource") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR - I'll have a closer look soon - but just heads up this test won't actually "check" anything. Notice the #CHECK
comments in the rest of the file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also can you put it into its own function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It won't check anything specifically but it still will fail since this bug leads to return code of the whole script to be non 0. Without the fix the test fails.
I've tried to put it into a function but I think that in this case it has to be outside of one for bug to replicate. You need a context object to exist when garbage collecting during script termination.
Thank you!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@makslevental hi! Just gently pinging you on this PR. Please let me know if I shouldn't. There is no rush here, the bug is quite mild.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry man I've been caught up - I'll get to it today I promise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No worries at all
This PR melds #150137 and #149414 *and* partially reverts #124832. The summary is the `PyDenseResourceElementsAttribute` finalizer/deleter has/had two problems 1. wasn't threadsafe (can be called from a different thread than that which currently holds the GIL) 2. can be called while the interpreter is "not initialized" #124832 for some reason decides to re-initialize the interpreter to avoid case 2 and runs afoul of the fact that `Py_IsInitialized` can be false during the finalization of the interpreter itself (e.g., at the end of a script). I don't know why this decision was made (I missed the PR) but I believe we should never be calling [Py_Initialize](https://docs.python.org/3/c-api/init.html#c.Py_Initialize): > In an application \*\*\*\***embedding Python**\*\*\*\*, this should be called before using any other Python/C API functions **but we aren't embedding Python**! So therefore we will only be in case 2 when the interpreter is being finalized and in that case we should just leak the buffer. Note, [lldb](https://github.com/llvm/llvm-project/blob/548ca9e97673a168023a616d311d901ca04b29a3/lldb/source/Plugins/ScriptInterpreter/Python/PythonDataObjects.cpp#L81-L93) does a similar sort of thing for its finalizers. Co-authored-by: Anton Korobeynikov <[email protected]> Co-authored-by: Max Manainen <[email protected]> Co-authored-by: Anton Korobeynikov <[email protected]> Co-authored-by: Max Manainen <[email protected]>
Fixed by #150561 |
…zer (#150561) This PR melds llvm/llvm-project#150137 and llvm/llvm-project#149414 *and* partially reverts llvm/llvm-project#124832. The summary is the `PyDenseResourceElementsAttribute` finalizer/deleter has/had two problems 1. wasn't threadsafe (can be called from a different thread than that which currently holds the GIL) 2. can be called while the interpreter is "not initialized" llvm/llvm-project#124832 for some reason decides to re-initialize the interpreter to avoid case 2 and runs afoul of the fact that `Py_IsInitialized` can be false during the finalization of the interpreter itself (e.g., at the end of a script). I don't know why this decision was made (I missed the PR) but I believe we should never be calling [Py_Initialize](https://docs.python.org/3/c-api/init.html#c.Py_Initialize): > In an application \*\*\*\***embedding Python**\*\*\*\*, this should be called before using any other Python/C API functions **but we aren't embedding Python**! So therefore we will only be in case 2 when the interpreter is being finalized and in that case we should just leak the buffer. Note, [lldb](https://github.com/llvm/llvm-project/blob/548ca9e97673a168023a616d311d901ca04b29a3/lldb/source/Plugins/ScriptInterpreter/Python/PythonDataObjects.cpp#L81-L93) does a similar sort of thing for its finalizers. Co-authored-by: Anton Korobeynikov <[email protected]> Co-authored-by: Max Manainen <[email protected]> Co-authored-by: Anton Korobeynikov <[email protected]> Co-authored-by: Max Manainen <[email protected]>
…50561) This PR melds llvm#150137 and llvm#149414 *and* partially reverts llvm#124832. The summary is the `PyDenseResourceElementsAttribute` finalizer/deleter has/had two problems 1. wasn't threadsafe (can be called from a different thread than that which currently holds the GIL) 2. can be called while the interpreter is "not initialized" llvm#124832 for some reason decides to re-initialize the interpreter to avoid case 2 and runs afoul of the fact that `Py_IsInitialized` can be false during the finalization of the interpreter itself (e.g., at the end of a script). I don't know why this decision was made (I missed the PR) but I believe we should never be calling [Py_Initialize](https://docs.python.org/3/c-api/init.html#c.Py_Initialize): > In an application \*\*\*\***embedding Python**\*\*\*\*, this should be called before using any other Python/C API functions **but we aren't embedding Python**! So therefore we will only be in case 2 when the interpreter is being finalized and in that case we should just leak the buffer. Note, [lldb](https://github.com/llvm/llvm-project/blob/548ca9e97673a168023a616d311d901ca04b29a3/lldb/source/Plugins/ScriptInterpreter/Python/PythonDataObjects.cpp#L81-L93) does a similar sort of thing for its finalizers. Co-authored-by: Anton Korobeynikov <[email protected]> Co-authored-by: Max Manainen <[email protected]> Co-authored-by: Anton Korobeynikov <[email protected]> Co-authored-by: Max Manainen <[email protected]>
…50561) This PR melds llvm#150137 and llvm#149414 *and* partially reverts llvm#124832. The summary is the `PyDenseResourceElementsAttribute` finalizer/deleter has/had two problems 1. wasn't threadsafe (can be called from a different thread than that which currently holds the GIL) 2. can be called while the interpreter is "not initialized" llvm#124832 for some reason decides to re-initialize the interpreter to avoid case 2 and runs afoul of the fact that `Py_IsInitialized` can be false during the finalization of the interpreter itself (e.g., at the end of a script). I don't know why this decision was made (I missed the PR) but I believe we should never be calling [Py_Initialize](https://docs.python.org/3/c-api/init.html#c.Py_Initialize): > In an application \*\*\*\***embedding Python**\*\*\*\*, this should be called before using any other Python/C API functions **but we aren't embedding Python**! So therefore we will only be in case 2 when the interpreter is being finalized and in that case we should just leak the buffer. Note, [lldb](https://github.com/llvm/llvm-project/blob/548ca9e97673a168023a616d311d901ca04b29a3/lldb/source/Plugins/ScriptInterpreter/Python/PythonDataObjects.cpp#L81-L93) does a similar sort of thing for its finalizers. Co-authored-by: Anton Korobeynikov <[email protected]> Co-authored-by: Max Manainen <[email protected]> Co-authored-by: Anton Korobeynikov <[email protected]> Co-authored-by: Max Manainen <[email protected]>
This PR melds llvm/llvm-project#150137 and llvm/llvm-project#149414 *and* partially reverts llvm/llvm-project#124832. The summary is the `PyDenseResourceElementsAttribute` finalizer/deleter has/had two problems 1. wasn't threadsafe (can be called from a different thread than that which currently holds the GIL) 2. can be called while the interpreter is "not initialized" llvm/llvm-project#124832 for some reason decides to re-initialize the interpreter to avoid case 2 and runs afoul of the fact that `Py_IsInitialized` can be false during the finalization of the interpreter itself (e.g., at the end of a script). I don't know why this decision was made (I missed the PR) but I believe we should never be calling [Py_Initialize](https://docs.python.org/3/c-api/init.html#c.Py_Initialize): > In an application \*\*\*\***embedding Python**\*\*\*\*, this should be called before using any other Python/C API functions **but we aren't embedding Python**! So therefore we will only be in case 2 when the interpreter is being finalized and in that case we should just leak the buffer. Note, [lldb](https://github.com/llvm/llvm-project/blob/548ca9e97673a168023a616d311d901ca04b29a3/lldb/source/Plugins/ScriptInterpreter/Python/PythonDataObjects.cpp#L81-L93) does a similar sort of thing for its finalizers. Co-authored-by: Anton Korobeynikov <[email protected]> Co-authored-by: Max Manainen <[email protected]> Co-authored-by: Anton Korobeynikov <[email protected]> Co-authored-by: Max Manainen <[email protected]>
This PR melds llvm/llvm-project#150137 and llvm/llvm-project#149414 *and* partially reverts llvm/llvm-project#124832. The summary is the `PyDenseResourceElementsAttribute` finalizer/deleter has/had two problems 1. wasn't threadsafe (can be called from a different thread than that which currently holds the GIL) 2. can be called while the interpreter is "not initialized" llvm/llvm-project#124832 for some reason decides to re-initialize the interpreter to avoid case 2 and runs afoul of the fact that `Py_IsInitialized` can be false during the finalization of the interpreter itself (e.g., at the end of a script). I don't know why this decision was made (I missed the PR) but I believe we should never be calling [Py_Initialize](https://docs.python.org/3/c-api/init.html#c.Py_Initialize): > In an application \*\*\*\***embedding Python**\*\*\*\*, this should be called before using any other Python/C API functions **but we aren't embedding Python**! So therefore we will only be in case 2 when the interpreter is being finalized and in that case we should just leak the buffer. Note, [lldb](https://github.com/llvm/llvm-project/blob/548ca9e97673a168023a616d311d901ca04b29a3/lldb/source/Plugins/ScriptInterpreter/Python/PythonDataObjects.cpp#L81-L93) does a similar sort of thing for its finalizers. Co-authored-by: Anton Korobeynikov <[email protected]> Co-authored-by: Max Manainen <[email protected]> Co-authored-by: Anton Korobeynikov <[email protected]> Co-authored-by: Max Manainen <[email protected]>
Bug description
When running
iree.turbine
example script we have encountered the following error:This error reproduces in multiple environments from a simple installation from the main branch. Other people have also noticed this issue, however it was not caught in turbine's CI since this test is disabled there:
While figuring out the origins of the issue we have found the export part which leads to this issue happening and it was related to MLIR's python bindings:
This reproduces with bindings not built by IREE.
Turned out that if we explicitly destroy the context object issue from the previous script disappears, same thing happens if we explicitly destroy the export result in the first script.
I would argue that this is a bug and needs to be fixed, since not explicitly destroying all the objects shouldn't lead to a fatal python error.
Bug origins
Bug comes from this modification to the bindings - here people introduced explicit construction of python interpreter and GIL ca when deleting a DenseResource object.
However, when python terminates the interpreter it sets IsInitilized variable to false before doing garbage collection. So when the deleter is called
Py_IsInitialized
returns false while actually not all of the interpreter state has been stripped down. Import state still exist and so this leads to the error that we see in the scripts:Solution
We need to separate cases when interpreter says it is not initialized due to never existing or being killed in the moment. There is a function which can do that.
However this function was deprecated in python3.13 and another one came into its place. To support all versions of Python we introduce a function which compiles to different variants based on python version. We took this function from lldb.