
Fix race condition in pool close (#3217) #3299


Closed
wants to merge 2 commits into from

Conversation

madadam
Contributor

@madadam madadam commented Jun 19, 2024

Attempt to fix #3217.

@abonander
Collaborator

@madadam if you rebase it should fix the CI failure.

@madadam madadam force-pushed the pool-close-race-condition branch from eaaa536 to cdb5707 Compare August 8, 2024 12:05
@abonander
Collaborator

Given that the PgListener test is consistently failing even after multiple re-runs, I'm wondering if there's some subtle problem with the fix here.

@madadam madadam force-pushed the pool-close-race-condition branch from cdb5707 to 5a6d8b5 Compare February 6, 2025 12:07
@madadam
Contributor Author

madadam commented Feb 6, 2025

Finally found some time to look into this. The test was failing due to a deadlock: there was still one checked-out connection inside the PgListener, so Pool::close was waiting for it to be released, which never happened. The test passed before only because it accidentally relied on the old buggy behaviour of Pool::close, where it didn't always wait for all connections to close. I fixed the test, rebased against main and updated the PR.

@abonander
Collaborator

That's weird, now some of the migrations tests are timing out.

@madadam
Contributor Author

madadam commented Feb 10, 2025

Yeah I noticed. I'll try to look into it when I can. Btw, how do you guys run these tests locally? I noticed that tests/x.py doesn't run the same test suite as what's run on the CI. In fact, I'm getting a compile error currently:

# unit test core
 $ cargo test --no-default-features --manifest-path sqlx-core/Cargo.toml --features json,offline,migrate,_rt-async-std,_tls-rustls 
warning: /home/adam/projects/sqlx/Cargo.toml: file `/home/adam/projects/sqlx/tests/sqlite/macros.rs` found to be present in multiple build targets:
  * `integration-test` target `sqlite-macros`
  * `integration-test` target `sqlite-unbundled-macros`
warning: /home/adam/projects/sqlx/sqlx-macros-core/Cargo.toml: unused manifest key: lints.rust.unexpected_cfgs.check-cfg
   Compiling sqlx-core v0.8.3 (/home/adam/projects/sqlx/sqlx-core)
error[E0425]: cannot find value `provider` in this scope
   --> sqlx-core/src/net/tls/tls_rustls.rs:107:54
    |
107 |     let config = ClientConfig::builder_with_provider(provider.clone())
    |                                                      ^^^^^^^^ not found in this scope

Also, trying to run a single target using the --target option throws an exception:

# test postgres 17
Traceback (most recent call last):
  File "/home/adam/projects/sqlx/tests/./x.py", line 179, in <module>
    run(
  File "/home/adam/projects/sqlx/tests/./x.py", line 90, in run
    database_url = start_database(service, database="sqlite/sqlite.db" if service == "sqlite" else "sqlx", cwd=dir_tests)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/adam/projects/sqlx/tests/docker.py", line 24, in start_database
    res = subprocess.run(
          ^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/subprocess.py", line 548, in run
    with Popen(*popenargs, **kwargs) as process:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/subprocess.py", line 1026, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/usr/lib/python3.12/subprocess.py", line 1955, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'docker-compose'

@madadam
Contributor Author

madadam commented Mar 3, 2025

Ok, I think the problem is that when a parent pool is used (which is the case in those failing tests), the child pool's semaphore is created with zero initial permits. So trying to acquire any permits on it in close causes a deadlock. I need to think about how to fix this.
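
The zero-permit deadlock described above can be illustrated with a small std-only model. This is a sketch under stated assumptions, not sqlx's implementation: `Semaphore`, `acquire_many_timeout` and `close_pool` are hypothetical names, the semaphore is a blocking one built on `Mutex`/`Condvar` rather than an async one, and a timeout stands in for the hang so the example terminates.

```rust
use std::sync::{Condvar, Mutex};
use std::time::Duration;

/// Minimal blocking counting semaphore (illustrative only, hypothetical type).
pub struct Semaphore {
    permits: Mutex<usize>,
    released: Condvar,
}

impl Semaphore {
    /// Create a semaphore holding `initial` permits.
    pub fn new(initial: usize) -> Self {
        Self {
            permits: Mutex::new(initial),
            released: Condvar::new(),
        }
    }

    /// Return `n` permits and wake any waiters.
    pub fn release(&self, n: usize) {
        *self.permits.lock().unwrap() += n;
        self.released.notify_all();
    }

    /// Try to take `n` permits at once, giving up after `timeout`.
    /// Returns `true` if the permits were acquired.
    pub fn acquire_many_timeout(&self, n: usize, timeout: Duration) -> bool {
        let guard = self.permits.lock().unwrap();
        // Wait while fewer than `n` permits are available.
        let (mut guard, result) = self
            .released
            .wait_timeout_while(guard, timeout, |available| *available < n)
            .unwrap();
        if result.timed_out() {
            return false;
        }
        *guard -= n;
        true
    }
}

/// Hypothetical model of closing a pool: wait until every one of the
/// `max_connections` permits has been returned, then hold them all.
/// The timeout only exists so that the would-be deadlock is observable.
pub fn close_pool(sem: &Semaphore, max_connections: usize, timeout: Duration) -> bool {
    sem.acquire_many_timeout(max_connections, timeout)
}
```

In this model, a semaphore created with `Semaphore::new(4)` lets `close_pool(&sem, 4, timeout)` succeed immediately, while one created with `Semaphore::new(0)` can never satisfy the request because nothing will ever release permits into it. The same thing happens when one permit is still held by a checked-out connection that is never returned.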

@abonander
Collaborator

abonander commented Apr 13, 2025

@madadam I think we could just get rid of the parent/child pool thing. I've been conceptualizing a whole new architecture for Pool that it wouldn't fit into anyway.

Instead, we could just divide a default max_connections value, say, 64, by the number of test threads being spawned, and use a semaphore to lock that many permits at a time and give that many connections to each test (edit: actually, I'm not sure this is necessary, and it would seem to break when using nextest anyway).

We could use an environment variable, SQLX_TEST_MAX_CONNECTIONS, to control the number of connections being divided up, and a control attribute on #[sqlx::test] to adjust the max_connections the pool should have (less or more).
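
The arithmetic behind this proposal could look roughly like the following. It is a sketch only: `SQLX_TEST_MAX_CONNECTIONS` is the variable name proposed above and is not claimed to exist in sqlx, and `total_connections`, `connections_per_test` and `DEFAULT_TOTAL_CONNECTIONS` are hypothetical names introduced for illustration.

```rust
/// Default total suggested in the discussion ("say, 64").
pub const DEFAULT_TOTAL_CONNECTIONS: u32 = 64;

/// Interpret the raw value of the proposed `SQLX_TEST_MAX_CONNECTIONS`
/// variable. Falls back to the default when the value is missing,
/// unparsable, or zero.
pub fn total_connections(raw: Option<&str>) -> u32 {
    raw.and_then(|s| s.trim().parse::<u32>().ok())
        .filter(|&n| n > 0)
        .unwrap_or(DEFAULT_TOTAL_CONNECTIONS)
}

/// Divide the total evenly across test threads. Every test gets at
/// least one connection, and a thread count of zero is treated as one.
pub fn connections_per_test(total: u32, test_threads: u32) -> u32 {
    (total / test_threads.max(1)).max(1)
}
```

A caller would typically pass `std::env::var("SQLX_TEST_MAX_CONNECTIONS").ok().as_deref()` to `total_connections`. How the number of test threads would be discovered is left open here, which matches the caveat above about nextest.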

@abonander
Collaborator

Re. tests/x.py, I don't personally use it and the CI doesn't use it, so it's at the mercy of someone bothering to update it when it breaks. I've been meaning to get rid of it, but some people find it useful so it's not an easy decision. I also don't know what I would replace it with. Justfile, maybe? If anything?

Being able to run the same tests CI performs locally would be awesome, but there's also the issue of having a single source of truth for the tests. If commands get added to x.py/the Justfile that aren't tested in CI, we have the same problem again. But I don't want CI to just be x.py --all-tests because that would have awful concurrency and wouldn't give great feedback on GitHub without setting up bots. So then adding a new test means adding it to the x.py/Justfile/whatever, and also adding it to CI.

https://github.com/nektos/act seems promising but it needs some tweaking since it doesn't support ubuntu-24.04 out of the box yet.

The top result I get from Reddit about locally runnable CI is "just use Makefiles"... gross.

@abonander
Collaborator

I'm thinking it'd be really neat if cargo test just worked. Maybe using testcontainers.

@jpmelos
Contributor

jpmelos commented Jul 29, 2025

@madadam @abonander I'm suggesting a fix here: #3952. Would you like to continue the discussion there? I'd love to help expedite the fix for this bug with any additional help I can offer, since these leftover connections are negatively affecting a project of mine, plus I want to contribute back to this project!

@madadam
Contributor Author

madadam commented Aug 13, 2025

@jpmelos Looks good to me and can be a good quick fix until the more substantial changes @abonander is talking about are implemented.

@abonander
Collaborator

Closed by #3952

@abonander abonander closed this Aug 16, 2025
Development

Successfully merging this pull request may close these issues.

Pool::close does not always wait for all connections to close
3 participants