Add support for the free-threaded build #178

Merged: 71 commits, Jul 28, 2025

Conversation

@ngoldbaum (Contributor) commented Jun 19, 2025

Fixes #126.

Adds support for the free-threaded build of Python 3.13 and 3.14 by improving the thread safety of CFFI internals.

Overview

Mutable state that is accessible to more than one thread in the CFFI backend is now mediated via atomic operations, critical sections, or mutexes, depending on the use case (see the sketch after this list):

  • Replaced uses of the PyDict and PyWeakref C APIs that return borrowed references with newer APIs (or shims on older Python versions) that return strong references.
  • get_primitive_type: all primitive types are now initialized at module initialization instead of lazily. This is inexpensive (~100 µs).
  • file_struct: now initialized during module initialization.
  • _get_ct_int: now initialized during module initialization.
  • init_once_cache: the cache itself is initialized under a critical section, and lookups use APIs that return strong references instead of borrowed references.
  • malloc_closure: now protected by a global mutex.
  • Moved mutable type state into uint8 flags synchronized via atomic operations.
  • b_complete_struct_or_union: added a critical section.
  • Made the test suite runnable under pytest-run-parallel and set up pytest-run-parallel CI on Linux, macOS, and Windows, plus a run against a TSAN-instrumented build on Python 3.14.
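As a rough sketch of the flag pattern above (not CFFI's actual code; the names ct_flags_mut and CT_UNDER_CONSTRUCTION appear in the commit log below, while the struct, helpers, and flag value here are illustrative, with C11 atomics standing in for whatever shim the backend actually uses):

```c
#include <stdatomic.h>
#include <stdint.h>

#define CT_UNDER_CONSTRUCTION 0x01   /* hypothetical value: type is still being mutated */

typedef struct {
    _Atomic uint8_t ct_flags_mut;    /* mutable flags shared between threads */
} ctypedescr_sketch;

static void
ct_set_flag(ctypedescr_sketch *ct, uint8_t flag)
{
    /* atomic read-modify-write: no lock needed for a single flag update */
    atomic_fetch_or_explicit(&ct->ct_flags_mut, flag, memory_order_release);
}

static int
ct_has_flag(ctypedescr_sketch *ct, uint8_t flag)
{
    return (atomic_load_explicit(&ct->ct_flags_mut, memory_order_acquire) & flag) != 0;
}
```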

Open questions

  • More docs?
  • The ctypes module in CPython isn't thread-safe in 3.13t. Should we worry about the ctypes backend on 3.13t?
  • More explicit multithreaded tests?

cc @kumaraditya303 @colesbury

@colesbury also wants to do another round of code review on the final version of this PR.

colesbury and others added 30 commits starting March 12, 2025, including:

  • fix various errors and warnings seen on clang and gcc
  • Add CT_UNDER_CONSTRUCTION to indicate that a type is being mutated (Kumar will fix this soon!)
  • Move CT_CUSTOM_FIELD_POS and CT_WITH_PACKED_CHANGE to ct_flags_mut
  • fix headers to avoid duplicated definitions
  • fix thread safety of `_realize_c_struct_or_union`
@ngoldbaum (Contributor Author)

OK great, please feel free to push directly to this PR branch if that's convenient for you. I'll also try to reproduce what you were seeing with the test extensions.

Having a per-test check in pytest-run-parallel like you describe is probably a good idea. We could add it to the test executor, which is already wrapped in a try/finally.

Of course that doesn't help if the GIL gets re-enabled in a subprocess, which might be what's happening here.

@ngoldbaum (Contributor Author)

> Looking at several of the cases from 1), it feels like there needs to be an extension kwarg (or some other knob) to force Py_GIL_DISABLED to be defined (or explicitly undefined) for a given generated extension, rather than relying solely on whatever preprocessor state happens to be floating around when we build an extension (kinda like all the effort that goes into deciding whether or not to set Py_LIMITED_API).

This wording is making me think that maybe you're on Windows and are using the python.org Python installer, which unfortunately has a pretty major issue that won't be fixed: python/cpython#127294.

In short, if you install both the free-threaded and GIL-enabled interpreter, they will share a Python.h and site-packages folder, which can cause issues like you describe above. In particular, on Windows, the build for a free-threaded extension has to manually pass in Py_GIL_DISABLED, since Python.h doesn't set it.

The Windows python.org installer is the only distribution that has this issue - if you install Python via the new PyManager installer or NuGet, the free-threaded and GIL-enabled interpreters won't share an overlapping installation.

@ngoldbaum (Contributor Author)

I looked at the extensions created by the tests and it looks like they all set Py_MOD_GIL_NOT_USED, at least on my Mac testing environment. I'm also unable to reproduce the GIL being re-enabled. It looks like it's not happening on any of the CI runs either, unless I'm missing something.

I spent a little time drilling down into the extensions generated by recompiler.py, but ultimately all the modules created there are modified versions of the "main" _cffi_backend module, which has Py_MOD_GIL_NOT_USED set.
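For reference, this is the mechanism being checked: an extension opts out of having the GIL re-enabled by declaring the Py_mod_gil slot. A minimal sketch (illustrative module name, not CFFI's actual module definition):

```c
#include <Python.h>

/* Minimal sketch: declare that this module does not need the GIL.
 * The slot only exists on free-threaded builds, hence the #ifdef. */
static PyModuleDef_Slot example_slots[] = {
#ifdef Py_GIL_DISABLED
    {Py_mod_gil, Py_MOD_GIL_NOT_USED},
#endif
    {0, NULL},
};

static struct PyModuleDef example_module = {
    PyModuleDef_HEAD_INIT,
    .m_name = "example",
    .m_size = 0,
    .m_slots = example_slots,
};

PyMODINIT_FUNC
PyInit_example(void)
{
    return PyModuleDef_Init(&example_module);
}
```

If a module doesn't declare the slot (or it gets compiled out), importing it on a free-threaded interpreter re-enables the GIL and emits the warning being chased in this thread.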

@colesbury tells me he thinks the most likely result of not having Py_GIL_DISABLED set is a crash, rather than the GIL being re-enabled.

I'll wait until I have more info to proceed further.

@colesbury (Contributor)

I'm not sure why @nitzmahone is seeing GIL enabled warnings. I don't see any locally and I don't see any in the GitHub CI logs. Could you have some unrelated or stale modules in the same directory that are pulled in by pytest?

The Windows explanation doesn't make sense to me. First, setuptools passes the necessary Py_GIL_DISABLED compiler flag to work around the Windows installer issue that @ngoldbaum mentioned. Even if that didn't happen, a missing Py_GIL_DISABLED flag when compiling would lead to crashes on import due to incorrect struct definitions; I don't think you'd see "GIL enabled" warnings.

@nitzmahone - when you get a chance to look at it more, would you please share the relevant logs?

@nitzmahone (Member)

The issues I've been seeing were all from manual poking around on my Linux dev box (Fedora 42 x86_64) with a bone-stock build from pyenv install 3.14.0rc1t.

Most of the test extensions are being correctly built with Py_GIL_DISABLED, which was why I hacked in the more granular per-test pre/post check (which currently has to be run sans worker threads to blame the correct test). I'd just started digging into the why last night when I ran out of time. From my past experience with it, CFFI's test suite is notoriously "leaky", and there are a number of one-off tests sprinkled around that do things ... differently. It's also complicated by so many operational modes- I've fixed some tests in the past that were directly (or indirectly) invoking the wrong Python and/or config, so it's possible that's what's happening here. I was just disabling the individual tests that were re-enabling the GIL last night to come up with the full list, so I haven't dived into the why yet.

Now that I'm less pudding-brain, I'll go back over it with fresh eyes and make sure I'm not getting tripped up by random stale non-t test build artifacts or something else. I might also temporarily switch to an xdist/forked run model like we do for Ansible- getting rid of intra-test interpreter state leakage makes it a lot easier to pinpoint problems.

More to come...

@nitzmahone (Member) commented Jul 24, 2025

Grr, sorry for the fire-drill - I don't know exactly where it was coming from because I just blew it all away, but it must've been some cached intermediate build cruft from previous local/manual test runs where the extension-init GIL opt-out wasn't occurring.

I'm just validating a tweak to the CIBW test config to force warning-as-error on GIL-re-enable for all the t targets- once I'm sure it's behaving correctly, I'll add a commit to this PR, then assuming it's all green we can merge and do a release.

Thanks all!

@nitzmahone (Member) commented Jul 24, 2025

Hrm, after kicking off a few manual workflow runs with all targets enabled to simulate a release, the Linux and macOS 3.13t parallel runs (skipped for PRs) are both segfaulting in CI in exactly the same place (testing/cffi1/test_pkgconfig.py), whereas 3.14t is fine.

I'm able to very reliably repro the segfault locally on Fedora 42's packaged build of 3.13.5t with:

pytest testing/cffi1/test_re_python.py --parallel-threads=3  # 2 breaks sometimes, >=3 breaks quite reliably

where various similar permutations seem fine on 3.14.0rc1t.

I'm not opposed to proceeding with merge/beta without fixing this, but I would really like to at least understand what's going on there. I need to hang it up for the day pretty soon, and I'm unavailable on Friday morning. If someone else that can repro this locally wants to try and catch it in the act against a debug build, cool, otherwise I'll take a stab at it Friday afternoon.

@ngoldbaum (Contributor Author) commented Jul 25, 2025

Argh! So close...

I didn't see this when I triggered the full CI back in June: https://github.com/ngoldbaum/cffi-ft/actions/runs/15769464108. I just double-checked, and that run did include the commit that makes sure the GIL doesn't get re-enabled in the tests, so that test run did pass on 3.13t with the GIL disabled.

Too bad I didn't think to redo that exercise in the month since then. Apologies for the oversight.

I ran the tests tonight using 3.13.5t on my Mac and I see similar results (TSAN reports and random test failures, although no segfaults).

We did some updates to the PR after I triggered that full CI run - in particular, we adjusted the locking strategy to do something a lot simpler: just use a single critical section on a single globally allocated dummy PyObject. On Python 3.14, this acts more or less like a "per-library GIL". On Python 3.13 there are different semantics for when critical sections can be suspended, and I suspect that's the source of the differences here. That's just my guess anyway.
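To make that concrete, here is a sketch of the pattern (the critical-section macros are the public CPython 3.13+ API; the object and function names are illustrative, not CFFI's actual code):

```c
#include <Python.h>

/* One module-level object whose per-object lock acts as a "per-library GIL".
 * Illustrative name; created once during module initialization. */
static PyObject *cffi_lock_object;

static void
mutate_shared_state(void)
{
    /* Every code path that touches the shared state enters a critical
     * section on the same object, so at most one thread mutates it at a
     * time.  Critical sections can be suspended around blocking calls,
     * and the rules for when that happens differ between 3.13 and 3.14,
     * which is the suspected source of the 3.13t failures. */
    Py_BEGIN_CRITICAL_SECTION(cffi_lock_object);
    /* ... read and write shared state here ... */
    Py_END_CRITICAL_SECTION();
}
```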

I pinged @kumaraditya303 to help track this down. Hopefully he has time to look at this tomorrow and there is something straightforward we can do, but it may take a little time to work out the correct fix.

Just a thought - maybe we can disable pytest-run-parallel CI for the wheel builds on 3.13t while we work out these issues, just to unblock people who want to enable builds downstream. We can note in the release notes that there are known thread safety issues on 3.13t.

@mattip (Contributor) commented Jul 25, 2025

I wouldn't want the first release to go out with known problems, hopefully it will be straightforward to fix this.

@nitzmahone (Member) commented Jul 25, 2025

Yeah, I'd prefer to have the beta working without caveat- hopefully there's a quick fix once we figure out exactly what's going on. If we can't get an easy fix, well, we can cross that bridge if we come to it.

Painfully slow as it is, it'd probably be wise to at least temporarily enable the full test suite for a Windows target or two as well - IIRC I had the whole thing passing for either an early 3.14 alpha or a late 3.13 pre-release (only tested with GIL-ful threading), so I'm not sure what its current operational state is. I'll pick that up Friday afternoon as well.

@ngoldbaum (Contributor Author) commented Jul 25, 2025

@kumaraditya303 did some digging and we're now pretty confident the issue is what I described above: recursive critical sections can be suspended more often in 3.13 than they could be in 3.14. This comes down to this change: python/cpython#128126. While that is billed as a performance improvement, it also has a side-effect of making recursive critical sections behave more like recursive locks in 3.14 than they did in 3.13.

We now think the clearest path forward is probably just to not try to support 3.13t in CFFI, at least not at first. If it turns out that 3.13t support is critical for whatever reason, we can add support later, but it'd be a shame to not ship 3.14t support using our current approach when we know that works fine.

Another reason not to support 3.13t is that there are also races in CPython internals that are triggered by the CFFI test suite (see the gist I shared yesterday). Some of them come from PyDict internals -- CFFI uses Python dictionaries heavily in its implementation -- and as I understand it the fixes for these bugs probably won't be backported to 3.13.

If we do need to support 3.13t then we probably need to replace the CFFI_LOCK critical section with a recursive mutex. The problem there is there isn't an obvious, portable choice to use in C. In NumPy we solved a similar issue by using C++ standard library features but we probably can't do that here. The Python C API doesn't expose a recursive mutex.
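To illustrate what that would involve, here is a sketch of a recursive mutex built on the public PyMutex API (added in 3.13) plus C11 atomics; every name is hypothetical, CFFI does not ship this, and MSVC's incomplete stdatomic.h support is exactly the portability problem mentioned above:

```c
#include <Python.h>
#include <stdatomic.h>

typedef struct {
    PyMutex mutex;                /* non-recursive lock from the C API */
    _Atomic unsigned long owner;  /* thread id of the holder, 0 if unheld */
    Py_ssize_t count;             /* recursion depth, touched only by the owner */
} recursive_mutex;

static void
recursive_mutex_lock(recursive_mutex *m)
{
    unsigned long tid = PyThread_get_thread_ident();  /* assumed nonzero */
    if (atomic_load(&m->owner) == tid) {
        m->count++;               /* re-entry from the owning thread */
        return;
    }
    PyMutex_Lock(&m->mutex);
    atomic_store(&m->owner, tid);
    m->count = 1;
}

static void
recursive_mutex_unlock(recursive_mutex *m)
{
    if (--m->count == 0) {
        atomic_store(&m->owner, 0);
        PyMutex_Unlock(&m->mutex);
    }
}
```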

If anyone really does need to use 3.13t, they can, they'll just need to deal with the thread safety issues. You can work around them by initializing types in the main thread before spawning worker threads. Obviously that's not good and CFFI shouldn't officially support that, but 3.13t is experimental as-is and anyone using it can be expected to go a little out of their way to get things working.

@nitzmahone @mattip if you're ok with not supporting 3.13t then I'll go ahead and remove the 3.13t CI and wheel builds from this PR.

@ofek (Contributor) commented Jul 25, 2025

As a user, we are okay with only 3.14 support.

@nitzmahone (Member) commented Jul 25, 2025

Yeah, I'm fine with that. It's not ideal, but IIUC the experimental label wasn't retroactively removed from 3.13t- 3.14t is the first release that's been "blessed" for production use. If 3.13t support would complicate things with bespoke/non-portable sync primitives, I'm all for skipping it.

Are you thinking that CFFI 2.0+ should explicitly refuse to build against 3.13t, or just that it's documented as "YMMV" and that we won't offer wheels for it?

@ngoldbaum (Contributor Author) commented Jul 25, 2025

> If 3.13t support would complicate things with bespoke/non-portable sync primitives, I'm all for skipping it.

Awesome! I'll work on this now.

> Are you thinking that CFFI 2.0+ should explicitly refuse to build against 3.13t, or just that it's documented as "YMMV" and that we won't offer wheels for it?

I can do it either way, but let's be conservative and add a check to setup.py that bails with a RuntimeError saying that 3.13t is unsupported and that free-threaded support starts in 3.14. Since there will only be an sdist available for 3.13t, anyone with a build or runtime dependency on CFFI will continue to see build errors on 3.13t.

To make that concrete, right now you see errors if you try to install something that depends on CFFI due to trying to use the limited API. Here's the output of pip install cryptography today: https://gist.github.com/ngoldbaum/cf271f81ef13802e1575d3770ed0928c

If I add a sys.version_info check to the cffi setup.py and force cryptography to use my local clone of CFFI as a build dependency, I see this output instead, with a much more friendly error message that explains what the problem and solution are: https://gist.github.com/ngoldbaum/232efb3b79ebd4c69a376be1b3ffbddb.

Someone can patch that error away and the build will succeed, but then they know they're doing unsupported things.

Does that sound reasonable?

@ngoldbaum (Contributor Author) commented Jul 25, 2025

See latest commits. I also kicked off a "full" CI run on my fork: https://github.com/ngoldbaum/cffi-ft/actions/runs/16530469003

@ngoldbaum (Contributor Author)

It looks like everything is passing. Still waiting on PPC64le and s390x but that will take a while since they're emulated. Maybe I should re-enable Windows too?

@nitzmahone (Member)

Yeah, that all looks great- I'm ready to merge if you are.

@ngoldbaum (Contributor Author) commented Jul 25, 2025

Let's ship it! :shipit:

@ngoldbaum (Contributor Author)

Also if you need any support in any tasks for shipping the beta or final release, feel free to ping me 😀

@alex (Contributor) commented Jul 28, 2025

Is anything else required to merge? (Or did we livelock here :D)

@nitzmahone (Member)

Nah, just woke up to lots of other things on fire this morning- merging now...

@nitzmahone merged commit 7ed073d into python-cffi:main on Jul 28, 2025; 32 checks passed.
@alex (Contributor) commented Jul 28, 2025

Woooo! Huge thank you to everyone who made this happen!
