refactor: idle pool and fix non-deterministic teardown of idle cleanup goroutine #1687
Conversation
Thanks @GeorgeMac for the PR. Your test case using `testing/synctest` demonstrates the leak well. Although, I wonder if we need the whole refactor to fix it. I quickly tried a simple fix:
```go
type clickhouse struct {
	opt    *Options
	idle   chan nativeTransport
	open   chan struct{}
	exit   chan struct{}
	closed atomic.Bool
	connID int64
}

func (ch *clickhouse) Close() error {
	if ch.closed.Load() {
		return nil
	}
	for {
		select {
		case conn := <-ch.idle:
			conn.debugf("[close: closing pool]")
			conn.close()
		default:
			ch.exit <- struct{}{} // (2) block until the goroutine receives
			ch.closed.Store(true)
			return nil
		}
	}
}
```

The idea in (2) is to wait till the goroutine (`startAutoCloseIdleConnections`) has actually received the exit signal. I'm curious how badly "not having an age-based priority queue" is impacting your use cases? Happy to discuss in detail.
@kavirajk we could definitely go after a simpler fix that keeps it more in line with what is already here, if that is preferred 👍

What you proposed works for ensuring that we stop leaking the `startAutoCloseIdleConnections` goroutines. However, one thing to keep in mind is that it doesn't guarantee to drain the idle pool on shutdown, which leaves the real connections unclosed. The `release` path can still return connections to the idle channel after `Close` has run, and then we have released connections which will never be closed, because the auto-close idle routine is gone. We could go a step further with your proposal there and attempt to drain the idle channel as well. This is also something I have addressed in my implementation, and I suspect that the existing idle connection counting tests would surface the difference.

You are right, the ordering by age is not necessary to solve the problem, and I am happy to remove it if it is not favourable. However, I would challenge the decision to use a channel as an idle pool, and would instead suggest we would be better served by a slice and a lock for this, much like how `database/sql.DB` manages its connection pool.
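For illustration, a rough sketch of the slice-and-lock shape I mean (names here are hypothetical, not the actual implementation):

```go
package pool

import "sync"

// nativeTransport stands in for the client's connection type.
type nativeTransport interface {
	close() error
}

// idlePool is a slice-backed pool guarded by a mutex, in the spirit
// of how database/sql.DB keeps its free connections.
type idlePool struct {
	mu       sync.Mutex
	closed   bool
	capacity int
	conns    []nativeTransport
}

// put offers a connection back to the pool. It reports false when the
// pool is closed or full, in which case the caller must close the
// connection itself; this is the coordination a bare channel cannot
// give us.
func (p *idlePool) put(conn nativeTransport) bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.closed || len(p.conns) >= p.capacity {
		return false
	}
	p.conns = append(p.conns, conn)
	return true
}

// close marks the pool closed and returns every idle connection so
// the caller can close them; nothing can be re-added afterwards.
func (p *idlePool) close() []nativeTransport {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.closed = true
	conns := p.conns
	p.conns = nil
	return conns
}
```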
Update: Actually, you are right to question ordering connections by age. I am not sure what I was thinking there; I guess it was just prioritising older connections for reuse before releasing them. It also had the nice side effect that, when putting a connection into a pool at capacity, we always throw away the connection with the soonest expiration. However, in hindsight (after sleeping on it and your pushback) I see that FIFO reuse of connections is probably a fairer use of all connections, rather than potentially never using newer connections, which could happen with the age-based priority queue. In that case, a channel has good semantics for the behaviour, but I still think it has ergonomic issues, and we still really need a lock to coordinate that, once closed, any returned connections aren't added back but are simply closed themselves. Let me shuffle this around, see if I can keep it closer to the original implementation, remove the leak, and ensure all connections are drained.
Sorry @kavirajk, in an effort to remove complexity, I made the change bigger 😅 The change breaks the logic up into three separate parts: the circular queue, the idle pool, and the changes to the `clickhouse` struct itself.

I updated the description above to go into the details. While bigger, I hope the implementation is more digestible in some ways (or maybe not, you tell me), and it comes with far more tests than before.
No worries @GeorgeMac. Thanks for trying to fix this issue. I will take a deep look today :) |
Summary
This PR does a couple of things:

- fixes the leaking of `startAutoCloseIdleConnections` goroutines by making their teardown deterministic
- refactors the idle connection pool around a new circular queue implementation

We observed in the CH operator that `startAutoCloseIdleConnections` goroutines were being leaked. In one long-running operator, we found 35k instances of this goroutine running unchecked.

These get leaked if `conn.Close()` is not called. However, it appears that the client implementation can also leak them when the receiver on `ch.exit` is not ready at the moment `conn.Close()` is called, since the close path signals it via a select with a default case, which is best effort and needs the receiver to already be waiting.

A quick test with the new `testing/synctest` package showed that this is very easy to demonstrate by opening a connection and then immediately closing it. I wanted to include this test in this PR; however, the package only stabilized in Go 1.25, and in 1.24 it sits behind a `GOEXPERIMENT` env var with a different library API. So I have omitted it from the code and included it as an example in this description.
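The shape of it is roughly the following; this is a minimal sketch using a hypothetical `leakyClient` that reproduces the client's signalling pattern, since a synctest bubble cannot perform the real network I/O the client needs:

```go
package clickhouse_test

import (
	"testing"
	"testing/synctest" // stabilized in Go 1.25
)

// leakyClient is a hypothetical stand-in reproducing the problematic
// pattern: a background goroutine selecting on an exit channel, and a
// Close that signals it with a best-effort (select/default) send.
type leakyClient struct {
	exit chan struct{}
}

func open() *leakyClient {
	ch := &leakyClient{exit: make(chan struct{})}
	go func() {
		// Stands in for startAutoCloseIdleConnections: runs until
		// it receives the exit signal.
		<-ch.exit
	}()
	return ch
}

func (ch *leakyClient) Close() {
	select {
	case ch.exit <- struct{}{}:
	default:
		// The receiver is not ready yet, so the signal is dropped
		// and the goroutine above never exits.
	}
}

func TestCloseStopsBackgroundGoroutine(t *testing.T) {
	synctest.Test(t, func(t *testing.T) {
		open().Close()
		// synctest.Test fails if a goroutine started inside the
		// bubble is still running (or durably blocked) when this
		// function returns, which is exactly what happens here.
	})
}
```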
And it fails deterministically, reporting the goroutine that never exited. After this change, this test passes.
Additional Refactor
To tackle this, I wanted to revisit the idle pool, which so far has used a channel for holding connections.
I think the channel works OK; however, it has some drawbacks: it needs extra locking around it anyway to coordinate shutdown (once closed, returned connections must be closed rather than re-added), and it gives no guarantee that the pool is drained of real connections when the client closes.
Instead, I have proposed here a new idle pool and circular queue implementation.
Circular Queue
The circular queue is a ring with a max capacity and constant memory usage (it avoids resizing).
Pushing and pulling from the queue wraps items around in the slice under the hood.
When the queue is full, any attempts to push new entries are rejected.
https://github.com/GeorgeMac/clickhouse-go/blob/71a7247f7421d7939fd6b9ead2a3808121763f1f/internal/circular/queue.go#L40-L51
The queue supports clearing elements based on a filter function (which is used for clearing expired connections).
This returns an iterator over the items which were cleared (we use this later for closing cleared connections).
The queue does not protect against concurrent access, it is expected that the caller will handle that.
https://github.com/GeorgeMac/clickhouse-go/blob/71a7247f7421d7939fd6b9ead2a3808121763f1f/internal/circular/queue.go#L86-L93
There is a set of unit tests and benchmarks focused on validating these semantics and memory guarantees.
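To make the semantics concrete, here is a condensed sketch of the queue described above (the linked `queue.go` is the real implementation; this version returns a slice from `Clear` instead of an iterator, purely for brevity):

```go
package circular

// Queue is a fixed-capacity ring buffer with constant memory usage.
type Queue[T any] struct {
	items []T
	head  int // index of the oldest element
	n     int // number of elements currently stored
}

func New[T any](capacity int) *Queue[T] {
	return &Queue[T]{items: make([]T, capacity)}
}

// Push appends to the tail, wrapping around the backing slice; it
// reports false (rejecting the item) when the queue is already full.
func (q *Queue[T]) Push(v T) bool {
	if q.n == len(q.items) {
		return false
	}
	q.items[(q.head+q.n)%len(q.items)] = v
	q.n++
	return true
}

// Pull removes and returns the oldest element.
func (q *Queue[T]) Pull() (T, bool) {
	var zero T
	if q.n == 0 {
		return zero, false
	}
	v := q.items[q.head]
	q.items[q.head] = zero // drop the reference
	q.head = (q.head + 1) % len(q.items)
	q.n--
	return v, true
}

// Clear removes every element matching fn, preserving the order of
// the rest, and returns the removed elements so the caller can, for
// example, close expired connections.
func (q *Queue[T]) Clear(fn func(T) bool) []T {
	var removed []T
	for i := q.n; i > 0; i-- {
		v, _ := q.Pull()
		if fn(v) {
			removed = append(removed, v)
		} else {
			q.Push(v) // cannot fail: we just made room
		}
	}
	return removed
}
```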
Idle Pool
The idle pool focuses on supporting concurrent access to connections in the backing queue, background cleanup of expired connections, and safe teardown of these processes on shutdown (to address the original idle cleanup goroutine problem).
It exposes a Get and a Put method for managing connections in the pool.
Put automatically drops connections which are expired, or which arrive when the pool is at capacity.
Get removes connections from the queue until it finds a non-expired one or the queue is empty, closing each expired connection it comes across along the way.
The cleanup routine periodically clears connections which have expired.
It blocks until either its ticker ticks, or it is signalled to finish, in which case it returns.
Close signals the cleanup routine to finish, then waits for it to do so.
Once the cleanup routine is finished, Close clears all remaining connections from the pool.
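For orientation, a condensed sketch of that shape, building on the queue sketch above (names like `conn` and `isExpired` are illustrative; the PR's code is authoritative):

```go
import (
	"sync"
	"time"
)

// conn stands in for the pooled connection type.
type conn interface {
	isExpired() bool
	close() error
}

type idlePool struct {
	mu    sync.Mutex
	queue *circular.Queue[conn] // the ring described above
	done  chan struct{}         // signals the cleanup goroutine to stop
	wg    sync.WaitGroup        // lets Close wait for it to exit
}

// Put offers a connection back; expired connections and connections
// that do not fit (queue full) are closed instead of stored.
func (p *idlePool) Put(c conn) {
	p.mu.Lock()
	ok := !c.isExpired() && p.queue.Push(c)
	p.mu.Unlock()
	if !ok {
		c.close()
	}
}

// Get pulls connections until it finds a non-expired one, closing
// each expired connection it comes across.
func (p *idlePool) Get() (conn, bool) {
	p.mu.Lock()
	defer p.mu.Unlock()
	for {
		c, ok := p.queue.Pull()
		if !ok {
			return nil, false
		}
		if !c.isExpired() {
			return c, true
		}
		c.close()
	}
}

// cleanup runs in the background, clearing expired connections on
// each tick, until Close signals done.
func (p *idlePool) cleanup(interval time.Duration) {
	defer p.wg.Done()
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			p.mu.Lock()
			expired := p.queue.Clear(func(c conn) bool { return c.isExpired() })
			p.mu.Unlock()
			for _, c := range expired {
				c.close()
			}
		case <-p.done:
			return
		}
	}
}

// Close stops the cleanup goroutine, waits for it to exit, and only
// then drains and closes everything left in the pool.
func (p *idlePool) Close() {
	close(p.done)
	p.wg.Wait() // deterministic teardown of the cleanup goroutine
	p.mu.Lock()
	remaining := p.queue.Clear(func(conn) bool { return true })
	p.mu.Unlock()
	for _, c := range remaining {
		c.close()
	}
}
```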
Changes to `type clickhouse struct {}`
Primarily, this swaps the channel used for idle connections with the new `*idlePool`.
This removes the need for idle cleanup handling in this part of the code.
Now both acquire and release interface with the idle pool to fetch existing idle connections and to return them.
There is also some further simplification, such as leaning on context timeouts for some of the dial timeout handling.
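Roughly, the interaction looks like this (a sketch only; `dial` and the exact shapes of `acquire`/`release` are simplified from the real code):

```go
// acquire hands out an idle connection when one is available and
// still valid, otherwise it dials a new one. The dial timeout is
// enforced via the context rather than a separate timer.
func (ch *clickhouse) acquire(ctx context.Context) (nativeTransport, error) {
	if conn, ok := ch.idle.Get(); ok {
		return conn, nil
	}
	dialCtx, cancel := context.WithTimeout(ctx, ch.opt.DialTimeout)
	defer cancel()
	return ch.dial(dialCtx)
}

// release returns a healthy connection to the idle pool; Put itself
// closes the connection if the pool is full or already closed.
func (ch *clickhouse) release(conn nativeTransport, err error) {
	if err != nil {
		conn.close()
		return
	}
	ch.idle.Put(conn)
}
```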
Let me know what you think 🙏