Skip to content

To fix the shutdown errors, we need to stop the tokio loop... #750

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 6 commits into
base: gh/zdevito/62/base
Choose a base branch
from

Conversation

zdevito
Copy link
Contributor

@zdevito zdevito commented Aug 4, 2025

Stack from ghstack (oldest at bottom):

I am pretty sure our finalization errors are some combo of:

  1. schedule some work that will hold or take GIL on a thread owned by tokio. This includes anything that holds a reference to a PyObject, because Drop on PyObject will take the GIL.
  2. call Py_Finalize from the main thread
  3. Python unloads
  4. tokio thread tries to use GIL and crashes.

We also start at least one std::thread which runs python that will encounter this problem and not be on the tokio loop.

To avoid this, we need to shutdown the tokio loop (i.e. allow all tasks to reach an await, and then cancel everything).

However, pyo3_async_runtime makes this nearly impossible because it requires a 'static lifetime reference to the loop, so it cannot be shutdown.

This diff shows how to call shutdown on the tokio loop, and I observed it able to fix the test_proc_mesh_size finalization issues. However, I have to avoid initializing pyo3_async_runtime.

One way to make this shippable would be to remove our usein of pyo3_async_runtime entirely. We primarily need it for future_into_py but we can just reimplement it directly ourself.

Differential Revision: D79533010

I am pretty sure our finalization errors are some combo of:

1. schedule some work that will hold or take GIL on a thread owned by tokio. This includes anything that holds a reference to a PyObject, because Drop on PyObject will take the GIL.
2. call Py_Finalize from the main thread
3. Python unloads
4. tokio thread tries to use GIL and crashes.

We also start at least one std::thread which runs python that will encounter this problem and not be on the tokio loop.

To avoid this, we need to shutdown the tokio loop (i.e. allow all tasks to reach an await, and then cancel everything).

However, pyo3_async_runtime makes this nearly impossible because it requires a 'static lifetime reference to the loop, so it cannot be shutdown.

This diff shows how to call shutdown on the tokio loop, and I observed it able to fix the `test_proc_mesh_size` finalization issues. However, I have to avoid initializing pyo3_async_runtime.

One way to make this shippable would be to remove our usein of pyo3_async_runtime entirely. We primarily need it for future_into_py but we can just reimplement it directly ourself.

Differential Revision: [D79533010](https://our.internmc.facebook.com/intern/diff/D79533010/)

[ghstack-poisoned]
zdevito added a commit that referenced this pull request Aug 4, 2025
I am pretty sure our finalization errors are some combo of:

1. schedule some work that will hold or take GIL on a thread owned by tokio. This includes anything that holds a reference to a PyObject, because Drop on PyObject will take the GIL.
2. call Py_Finalize from the main thread
3. Python unloads
4. tokio thread tries to use GIL and crashes.

We also start at least one std::thread which runs python that will encounter this problem and not be on the tokio loop.

To avoid this, we need to shutdown the tokio loop (i.e. allow all tasks to reach an await, and then cancel everything).

However, pyo3_async_runtime makes this nearly impossible because it requires a 'static lifetime reference to the loop, so it cannot be shutdown.

This diff shows how to call shutdown on the tokio loop, and I observed it able to fix the `test_proc_mesh_size` finalization issues. However, I have to avoid initializing pyo3_async_runtime.

One way to make this shippable would be to remove our usein of pyo3_async_runtime entirely. We primarily need it for future_into_py but we can just reimplement it directly ourself.

Differential Revision: [D79533010](https://our.internmc.facebook.com/intern/diff/D79533010/)

ghstack-source-id: 300628159
Pull Request resolved: #750
@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Aug 4, 2025
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D79533010

…..."

I am pretty sure our finalization errors are some combo of:

1. schedule some work that will hold or take GIL on a thread owned by tokio. This includes anything that holds a reference to a PyObject, because Drop on PyObject will take the GIL.
2. call Py_Finalize from the main thread
3. Python unloads
4. tokio thread tries to use GIL and crashes.

We also start at least one std::thread which runs python that will encounter this problem and not be on the tokio loop.

To avoid this, we need to shutdown the tokio loop (i.e. allow all tasks to reach an await, and then cancel everything).

However, pyo3_async_runtime makes this nearly impossible because it requires a 'static lifetime reference to the loop, so it cannot be shutdown.

This diff shows how to call shutdown on the tokio loop, and I observed it able to fix the `test_proc_mesh_size` finalization issues. However, I have to avoid initializing pyo3_async_runtime.

One way to make this shippable would be to remove our usein of pyo3_async_runtime entirely. We primarily need it for future_into_py but we can just reimplement it directly ourself.

Differential Revision: [D79533010](https://our.internmc.facebook.com/intern/diff/D79533010/)

[ghstack-poisoned]
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D79533010

…..."

I am pretty sure our finalization errors are some combo of:

1. schedule some work that will hold or take GIL on a thread owned by tokio. This includes anything that holds a reference to a PyObject, because Drop on PyObject will take the GIL.
2. call Py_Finalize from the main thread
3. Python unloads
4. tokio thread tries to use GIL and crashes.

We also start at least one std::thread which runs python that will encounter this problem and not be on the tokio loop.

To avoid this, we need to shutdown the tokio loop (i.e. allow all tasks to reach an await, and then cancel everything).

However, pyo3_async_runtime makes this nearly impossible because it requires a 'static lifetime reference to the loop, so it cannot be shutdown.

This diff shows how to call shutdown on the tokio loop, and I observed it able to fix the `test_proc_mesh_size` finalization issues. However, I have to avoid initializing pyo3_async_runtime.

One way to make this shippable would be to remove our usein of pyo3_async_runtime entirely. We primarily need it for future_into_py but we can just reimplement it directly ourself.

Differential Revision: [D79533010](https://our.internmc.facebook.com/intern/diff/D79533010/)

[ghstack-poisoned]
zdevito added a commit that referenced this pull request Aug 5, 2025
Pull Request resolved: #750

I am pretty sure our finalization errors are some combo of:

1. schedule some work that will hold or take GIL on a thread owned by tokio. This includes anything that holds a reference to a PyObject, because Drop on PyObject will take the GIL.
2. call Py_Finalize from the main thread
3. Python unloads
4. tokio thread tries to use GIL and crashes.

We also start at least one std::thread which runs python that will encounter this problem and not be on the tokio loop.

To avoid this, we need to shutdown the tokio loop (i.e. allow all tasks to reach an await, and then cancel everything).

However,` pyo3_async_runtime::tokio` makes this nearly impossible because it requires a 'static lifetime reference to the loop, so it cannot be shutdown.

This diff shows how to call shutdown on the tokio loop, and I observed it able to fix the `test_proc_mesh_size` finalization issues. However, I have to avoid initializing pyo3_async_runtime::tokio.

This diff removes our use of  pyo3_async_runtime::tokio. We instead replace with our own wrapper to get a python future which uses our get_tokio_runtime function.

ghstack-source-id: 300750441

Differential Revision: [D79533010](https://our.internmc.facebook.com/intern/diff/D79533010/)
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D79533010

…..."

I am pretty sure our finalization errors are some combo of:

1. schedule some work that will hold or take GIL on a thread owned by tokio. This includes anything that holds a reference to a PyObject, because Drop on PyObject will take the GIL.
2. call Py_Finalize from the main thread
3. Python unloads
4. tokio thread tries to use GIL and crashes.

We also start at least one std::thread which runs python that will encounter this problem and not be on the tokio loop.

To avoid this, we need to shutdown the tokio loop (i.e. allow all tasks to reach an await, and then cancel everything).

However, pyo3_async_runtime makes this nearly impossible because it requires a 'static lifetime reference to the loop, so it cannot be shutdown.

This diff shows how to call shutdown on the tokio loop, and I observed it able to fix the `test_proc_mesh_size` finalization issues. However, I have to avoid initializing pyo3_async_runtime.

One way to make this shippable would be to remove our usein of pyo3_async_runtime entirely. We primarily need it for future_into_py but we can just reimplement it directly ourself.

Differential Revision: [D79533010](https://our.internmc.facebook.com/intern/diff/D79533010/)

[ghstack-poisoned]
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D79533010

…..."

I am pretty sure our finalization errors are some combo of:

1. schedule some work that will hold or take GIL on a thread owned by tokio. This includes anything that holds a reference to a PyObject, because Drop on PyObject will take the GIL.
2. call Py_Finalize from the main thread
3. Python unloads
4. tokio thread tries to use GIL and crashes.

We also start at least one std::thread which runs python that will encounter this problem and not be on the tokio loop.

To avoid this, we need to shutdown the tokio loop (i.e. allow all tasks to reach an await, and then cancel everything).

However, pyo3_async_runtime makes this nearly impossible because it requires a 'static lifetime reference to the loop, so it cannot be shutdown.

This diff shows how to call shutdown on the tokio loop, and I observed it able to fix the `test_proc_mesh_size` finalization issues. However, I have to avoid initializing pyo3_async_runtime.

One way to make this shippable would be to remove our usein of pyo3_async_runtime entirely. We primarily need it for future_into_py but we can just reimplement it directly ourself.

Differential Revision: [D79533010](https://our.internmc.facebook.com/intern/diff/D79533010/)

[ghstack-poisoned]
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D79533010

…..."

I am pretty sure our finalization errors are some combo of:

1. schedule some work that will hold or take GIL on a thread owned by tokio. This includes anything that holds a reference to a PyObject, because Drop on PyObject will take the GIL.
2. call Py_Finalize from the main thread
3. Python unloads
4. tokio thread tries to use GIL and crashes.

We also start at least one std::thread which runs python that will encounter this problem and not be on the tokio loop.

To avoid this, we need to shutdown the tokio loop (i.e. allow all tasks to reach an await, and then cancel everything).

However, pyo3_async_runtime makes this nearly impossible because it requires a 'static lifetime reference to the loop, so it cannot be shutdown.

This diff shows how to call shutdown on the tokio loop, and I observed it able to fix the `test_proc_mesh_size` finalization issues. However, I have to avoid initializing pyo3_async_runtime.

One way to make this shippable would be to remove our usein of pyo3_async_runtime entirely. We primarily need it for future_into_py but we can just reimplement it directly ourself.

Differential Revision: [D79533010](https://our.internmc.facebook.com/intern/diff/D79533010/)

[ghstack-poisoned]
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D79533010

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Meta Open Source bot. fb-exported
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants