fix: Reconnect on a closed connection when checked out #304
jrogov wants to merge 1 commit into plausible:master from …
Conversation
```diff
 def request_chunked(conn, method, path, headers, stream, opts) do
-  with {:ok, conn, ref} <- send_request(conn, method, path, headers, :stream),
-       {:ok, conn} <- stream_body(conn, ref, stream),
-       do: receive_full_response(conn, timeout(conn, opts))
+  with_retry_if_stale_connection(conn, fn conn ->
+    with {:ok, conn, ref} <- send_request(conn, method, path, headers, :stream),
+         {:ok, conn} <- stream_body(conn, ref, stream),
+         do: receive_full_response(conn, timeout(conn, opts))
+  end)
```
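The helper itself isn't shown in this excerpt; a single-retry wrapper matching the call site could look roughly like this (a sketch only; `reconnect/1` and the exact error shape are assumptions, not necessarily the PR's actual code):

```elixir
# Sketch of a single-retry wrapper matching the call site above.
# Assumes a Mint-style error tuple and a hypothetical reconnect/1
# that re-establishes the underlying socket.
defp with_retry_if_stale_connection(conn, fun) do
  case fun.(conn) do
    # A kept-alive socket dropped by the server surfaces as :closed.
    {:error, _conn, %Mint.TransportError{reason: :closed}, _responses} ->
      with {:ok, conn} <- reconnect(conn), do: fun.(conn)

    other ->
      other
  end
end
```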
---
I'm pretty sure we should not do this in `request_chunked` when stream-writing the data, as it would mean that we could write duplicated data. Just wanted to check if that's correct.

If that's so, let me know, and I'll remove this part.
---
> as it would mean that we could write duplicated data
I think since it's an ongoing HTTP request, if the connection gets closed for it, ClickHouse would drop all received data, at least for non-async inserts. But I haven't actually checked that. So it might be that we don't gain anything from reconnecting since we need to start sending the stream anew anyway.
---
Also, `request_chunked` should probably be removed (I'll open a PR), I added it because I didn't know at the time about the ability to do `Stream.into/2` with DBConnection:
```elixir
DBConnection.run(ch, fn conn ->
  File.stream!("demo.csv", 512_000)
  |> Stream.into(Ch.stream(conn, "insert into demo format CSV\n"))
  |> Stream.run()
end)
```

But the issue would remain: do we reconnect in those scenarios or not?
---
> I think since it's an ongoing HTTP request, if the connection gets closed for it, ClickHouse would drop all received data, at least for non-async inserts.
Good catch!
But thinking about it, the problem we're trying to solve here will never manifest in the middle of the insert anyway: the connection gets closed because of the keep-alive timeout, in particular on the ClickHouse side.

Even if it's because of a proxy in between, it'll still happen at the very beginning, before we actually manage to insert any data at all.

So the flow will be: initiate chunked request -> try to send data -> try to receive ack -> get `socket: closed`, with no data inserted at all -> reconnect (this PR) -> start actually inserting the data.

The only side effect would be (partially) running the Stream twice?

But that gets covered by your notion that...
> Also `request_chunked` should probably be removed (I'll open a PR), I added it because I didn't know at the time about the ability to do `Stream.into/2` with DBConnection.
What would be the flow here, then? I see that `lib/ch/stream.ex` gets a portion of data and then simply sends it, IIUC. Then it wouldn't rerun the Stream if it fails on sending that data?
But IIUC we should add a retry mechanism there as well: https://github.com/adjust/ch/blob/fcfe1e3fbdb9999a84719425896b919f9a84f049/lib/ch/connection.ex#L193-L223
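Applied there, it would presumably be the same kind of wrap (a sketch; the actual function heads in `connection.ex` may differ):

```elixir
# Sketch: wrapping a plain (non-chunked) request the same way.
# The signature mirrors request_chunked/6 above; the real one may differ.
def request(conn, method, path, headers, body, opts) do
  with_retry_if_stale_connection(conn, fn conn ->
    with {:ok, conn, ref} <- send_request(conn, method, path, headers, body),
         do: receive_full_response(conn, timeout(conn, opts))
  end)
end
```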
WDYT?
---
👋 @jrogov I wonder, what's your use-case? If something is pausing for that long, why not do two separate checkouts?

---
We run queries on a considerably big dataset. Even with CH these queries can run up to 10 minutes. The queries need so much memory that they frequently OOM unless we split them into smaller queries and chain them together (think …).

Obviously, we don't want to hand over connections between these "sub-queries" – we'd rather optimize for single-stream efficiency than for request parallelization. So we need …
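For illustration, the workload has roughly this shape (a sketch; table names and queries are made up, not the actual workload):

```elixir
# Sketch of the chained "sub-query" pattern on one checked-out connection.
DBConnection.run(ch, fn conn ->
  # First "sub-query": stage intermediate results into a scratch table.
  Ch.query!(conn, "insert into scratch_step1 select user_id, sum(bytes) b from events group by user_id")
  # A pause longer than keep_alive_timeout between steps is what
  # gets the socket closed on the server side.
  Ch.query!(conn, "insert into scratch_step2 select user_id from scratch_step1 where b > 1000000")
  Ch.query!(conn, "select count() from scratch_step2")
end)
```

---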
Why not? Compared to 10-minute queries, checkout costs are probably negligible. And these HTTP connections should probably be considered interchangeable. I think I am missing something...
You might also want to consider not using a shared pool for it at all but a "dedicated" one, with a single connection. It's not easy to do right now because of the way Ch uses ….

One problem with reconnects/retries like this, I think, is that we can only know about a closed connection on …. Maybe we can have something like …. I don't know if "closed socket" errors ever happen on …. Maybe in the future we could use ….
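For the "dedicated pool" idea, the rough shape would be (a sketch; assumes `Ch.start_link/1` forwards DBConnection's `:pool_size` option):

```elixir
# Sketch: a separate single-connection pool reserved for the long job,
# assuming Ch.start_link/1 forwards DBConnection's :pool_size option.
{:ok, long_ch} =
  Ch.start_link(
    scheme: "http",
    hostname: "localhost",
    port: 8123,
    pool_size: 1
  )

# The long-running, chained queries then use long_ch only,
# leaving the shared pool free for everything else.
```

---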
(In collaboration with @trollfred)
Hiya 👋
We've been experimenting with `checkout` a bit more, and stumbled upon a rather inconvenient behaviour of `Ch` when a connection's been checked out of the `echo_ch` repo.

TL;DR: due to ClickHouse's `keep_alive_timeout` (defaults to 10 seconds), when a connection is checked out and there's a pause between DB operations, the connection gets closed by the server, and instead of reconnecting, `ch` just fails.

This PR provides a simple single reconnect in such cases.
Steps to reproduce:
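Based on the description above, a minimal reproduction looks roughly like this (connection options and queries are illustrative, not the original snippet):

```elixir
# Illustrative reproduction: connection options and queries are assumptions.
{:ok, ch} = Ch.start_link(scheme: "http", hostname: "localhost", port: 8123)

DBConnection.run(ch, fn conn ->
  Ch.query!(conn, "select 1")
  # Pause past ClickHouse's keep_alive_timeout (10 s by default)...
  Process.sleep(:timer.seconds(15))
  # ...after which the server has closed the kept-alive socket.
  Ch.query!(conn, "select 1")
end)
```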
Result: the second query fails with a socket `:closed` error instead of reconnecting. With this PR applied, `ch` reconnects once and the query succeeds.