Gracefully overload portal NET-215#91

Open
kalabukdima wants to merge 18 commits into master from reliability

Conversation

Contributor

@kalabukdima commented Apr 1, 2026

Goal

When the portal is overloaded, it misbehaves: for example, it saturates its own network link and then bans workers because it wrongly considers them too slow.

This PR aims to keep it running at capacity.

Approach

Two resources mainly limit the capacity:

  • Workers available for running the queries
  • Network bandwidth for downloading responses

This PR addresses both:

  • Worker reservation is used as a backpressure mechanism: if there are not enough standby workers to serve a request reliably, it immediately fails with "service overloaded".
  • Queries now run in two steps: awaiting the first byte, then the remaining body. Parallel downloads are limited and prioritized globally. More on that below.

Changes

Congestion control

All ongoing downloads now go through a global scheduler. Before reading the next chunk of data, a download waits for a permit from a semaphore. The semaphore capacity is adjusted over time (AIMD: additive increase, multiplicative decrease) but can be configured to stay constant.

Downloads are also prioritized: the query that started earliest gets the permit first. This allows sending more fetch-ahead queries without stalling more important downloads.
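The scheduler described above can be sketched roughly as follows. This is a minimal, synchronous illustration, not the portal's actual API: `DownloadScheduler`, `try_acquire`, and the other names are mine, and the real implementation is presumably async.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Minimal sketch of a global download scheduler with AIMD capacity
/// and earliest-query-first permit handout (illustrative names only).
struct DownloadScheduler {
    capacity: usize,               // current max parallel chunk reads
    in_flight: usize,              // permits currently held
    waiters: BinaryHeap<Reverse<u64>>, // min-heap on query start time
}

impl DownloadScheduler {
    fn new(initial_capacity: usize) -> Self {
        Self { capacity: initial_capacity, in_flight: 0, waiters: BinaryHeap::new() }
    }

    /// Try to get a permit for a query that started at `start_time`.
    /// Returns true if granted immediately; otherwise the query queues.
    fn try_acquire(&mut self, start_time: u64) -> bool {
        if self.in_flight < self.capacity {
            self.in_flight += 1;
            true
        } else {
            self.waiters.push(Reverse(start_time));
            false
        }
    }

    /// Release a permit and hand it to the earliest-started waiter, if any.
    /// Returns the start time of the query that was woken.
    fn release(&mut self) -> Option<u64> {
        self.in_flight -= 1;
        if self.in_flight < self.capacity {
            if let Some(Reverse(t)) = self.waiters.pop() {
                self.in_flight += 1;
                return Some(t);
            }
        }
        None
    }

    /// AIMD: grow capacity by one on a clean round-trip...
    fn on_success(&mut self) { self.capacity += 1; }

    /// ...and halve it when congestion (e.g. a timeout) is observed.
    fn on_congestion(&mut self) { self.capacity = (self.capacity / 2).max(1); }
}
```

The `Reverse` wrapper turns the max-heap into a min-heap, so the query with the smallest (earliest) start time is always woken first.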

Worker reservation

Previously:

  • each stream has a limited buffer of "slots" for fetch-ahead
  • for each slot, we start a query to some worker
  • later, if the query fails, we look for another worker to retry

The problem: with a large buffer, too many workers are allocated for future slots. Then, when a query fails, no worker is left for a retry, and the stream is interrupted.

Now:

  • pre-allocate all the workers we may need for this slot
  • if we couldn't find enough workers, don't even open the slot
  • when the slot is finished, the workers are released to the pool again

This way, buffers simply become shorter under load, and problematic retries are avoided. Slower syncing is better than interrupting running streams.

The code became much simpler. It also made it possible to form error messages that explain why each query attempt failed, not just the last one.
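The pre-allocation scheme can be sketched like this (a simplified model under my own naming; the real pool presumably tracks much more state):

```rust
use std::collections::VecDeque;

type WorkerId = u32;

/// Sketch of per-slot worker reservation (hypothetical names).
struct WorkerPool {
    free: VecDeque<WorkerId>,
}

impl WorkerPool {
    /// Reserve `needed` workers up front: one for the first attempt
    /// plus enough for the allowed retries. If the pool cannot cover
    /// them all, the slot is not opened at all, so the fetch-ahead
    /// buffer simply shrinks under load instead of failing mid-stream.
    fn open_slot(&mut self, needed: usize) -> Option<Vec<WorkerId>> {
        if self.free.len() < needed {
            return None; // backpressure: don't even open the slot
        }
        Some((0..needed).map(|_| self.free.pop_front().unwrap()).collect())
    }

    /// When the slot is finished, its workers return to the pool.
    fn close_slot(&mut self, workers: Vec<WorkerId>) {
        self.free.extend(workers);
    }
}
```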

Priorities

Workers are now prioritized based on observed throughput (keeping the backoff on hard errors). Throughput is measured only during periods when a download was actively running. Thanks to the limited number of parallel downloads, I don't expect it to be too noisy, though it may still depend heavily on the response size.
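A stripped-down version of the ordering might look like this; `WorkerStats` and `pick_order` are illustrative names, and the real priority key is richer (the review below mentions a three-part tuple):

```rust
/// Sketch of throughput-based worker ordering (illustrative only).
#[derive(Clone)]
struct WorkerStats {
    id: u32,
    /// Bytes/s, measured only while a download was actively running.
    throughput: f64,
    /// Set on hard errors; the worker sorts last until the backoff expires.
    backed_off: bool,
}

/// Order workers for selection: non-backed-off first,
/// then by descending observed throughput.
fn pick_order(mut workers: Vec<WorkerStats>) -> Vec<u32> {
    workers.sort_by(|a, b| {
        a.backed_off
            .cmp(&b.backed_off) // false (healthy) sorts before true
            .then(b.throughput.total_cmp(&a.throughput)) // higher first
    });
    workers.into_iter().map(|w| w.id).collect()
}
```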

Connections

Previously:

  • if the portal wants to connect to an unknown worker, a Kademlia query is triggered to find its IP, and then the connection is established (usually 3-5 s)
  • the timeout applies to the whole pipeline

Now:

  • a background job continuously tries to establish connections to all existing workers
  • if there is no established connection when the portal needs one, it immediately gets an error and backs off
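The fail-fast query path can be sketched as a lookup against a cache that the background job fills in; `ConnectionCache` and its methods are hypothetical names:

```rust
use std::collections::HashSet;

/// Sketch of the connection-cache behaviour (hypothetical names).
struct ConnectionCache {
    established: HashSet<u32>, // workers with a live connection
}

impl ConnectionCache {
    /// The query path no longer waits for Kademlia lookup + dialing
    /// (usually 3-5 s); it either has a live connection or fails fast
    /// so the caller can back off.
    fn get(&self, worker: u32) -> Result<u32, &'static str> {
        if self.established.contains(&worker) {
            Ok(worker)
        } else {
            Err("no established connection; backing off")
        }
    }

    /// Called by the background job as dials to known workers complete.
    fn on_connected(&mut self, worker: u32) {
        self.established.insert(worker);
    }
}
```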

Remaining issues

  • if a slot errors, the following ones should be cancelled, because the stream will be interrupted at that point anyway
  • the worker priority pool still doesn't look reliable and should be redesigned
  • the worker priority pool is not covered by metrics

@kalabukdima kalabukdima requested a review from define-null April 1, 2026 13:29
@mo4islona
Contributor

I reviewed the overload-control changes and found a few correctness issues that look worth fixing before merge:

  1. Congestion accounting misses the wait-for-first-byte phase (src/network/client.rs)
    The scheduler permit is released right after send_query_request() returns, but the request can still spend a long time waiting in worker execution / TTFB before receive_first_byte() finishes. Because download_utilization() is derived from the scheduler state, that whole phase is invisible to the overload gate, so the portal can keep admitting new streams while many requests are already in flight.

  2. ReadError::TooLarge is treated like a transport timeout (src/network/client.rs)
    convert_read_error() routes both TooLarge and transport errors through report_query_failure(), which applies the long timeout-style penalty to the worker. An oversized response is not a timeout; the worker did respond, so this can incorrectly sideline healthy workers for the full timeout window.

  3. Readiness can become true with zero connections when there is only one worker (src/network/client.rs)
    active_connections >= num_workers * 3 / 4 truncates to 0 when num_workers == 1, so the readiness endpoint can report healthy even with no active connection to the only worker.

  4. New / unknown workers are silently deprioritized (src/network/priorities.rs)
    Truly unknown workers get default_priority() == (Best, 0, 0), while any measured worker gets a negative throughput key and therefore sorts ahead. In practice this means replacement / freshly joined workers may never be sampled while the existing pool stays healthy.

  5. Low-severity note: get_worker now briefly leases the worker just to return its id (src/http_server.rs)
    The old path explicitly did a non-leasing pick. The new path briefly increments running_queries and then drops the lease immediately, which creates a small race where concurrent callers can observe the worker as busy.
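The truncation in point 3 is easy to reproduce. The sketch below uses my own function names; one possible fix (among others) is ceiling division, so that at least one connection is required whenever any workers exist:

```rust
/// Reproduces the bug from point 3: with one worker,
/// `num_workers * 3 / 4` truncates to 0, so zero active
/// connections already pass the readiness check.
fn ready_truncating(active: usize, num_workers: usize) -> bool {
    active >= num_workers * 3 / 4
}

/// A possible fix: round the threshold up instead of down.
fn ready_ceiling(active: usize, num_workers: usize) -> bool {
    active >= (num_workers * 3).div_ceil(4)
}
```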

@kalabukdima
Contributor Author

> I reviewed the overload-control changes and found a few correctness issues that look worth fixing before merge:

Did you review it, or did AI?
