Gracefully overload portal NET-215#91

Open
kalabukdima wants to merge 18 commits into master from reliability

Conversation

Contributor

@kalabukdima commented Apr 1, 2026

Goal

When the portal is overloaded, it misbehaves: for example, it saturates its own network link and then bans workers because it wrongly considers them too slow.

This PR aims to keep it running at capacity.

Approach

Two resources mainly limit the capacity:

  • Workers available for running the queries
  • Network bandwidth for downloading responses

This PR addresses both:

  • Worker reservation is used as a backpressure mechanism: if there are not enough standby workers to serve a request reliably, it immediately fails with "service overloaded".
  • Queries now run in two steps: awaiting the first byte, then the remaining body. Parallel downloads are limited and prioritized globally. More on that below.

Changes

Congestion control

All ongoing downloads now go through a global scheduler. Before reading the next chunk of data, a download waits for a permit from a semaphore. The semaphore capacity is adjusted over time (AIMD: additive increase, multiplicative decrease) but can be configured to stay constant.

Downloads are also prioritized: the query that started earliest gets the permit first. This allows sending more fetch-ahead queries without stalling more important downloads.
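The scheduler described above can be sketched roughly as follows. This is a minimal, synchronous illustration, not the portal's actual API: `DownloadScheduler`, `try_acquire`, and the other names are mine, and the real implementation is presumably async.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Minimal sketch of a global download scheduler with AIMD capacity
/// and earliest-query-first permit handout (illustrative names only).
struct DownloadScheduler {
    capacity: usize,               // current max parallel chunk reads
    in_flight: usize,              // permits currently held
    waiters: BinaryHeap<Reverse<u64>>, // min-heap on query start time
}

impl DownloadScheduler {
    fn new(initial_capacity: usize) -> Self {
        Self { capacity: initial_capacity, in_flight: 0, waiters: BinaryHeap::new() }
    }

    /// Try to get a permit for a query that started at `start_time`.
    /// Returns true if granted immediately; otherwise the query queues.
    fn try_acquire(&mut self, start_time: u64) -> bool {
        if self.in_flight < self.capacity {
            self.in_flight += 1;
            true
        } else {
            self.waiters.push(Reverse(start_time));
            false
        }
    }

    /// Release a permit and hand it to the earliest-started waiter, if any.
    /// Returns the start time of the query that was woken.
    fn release(&mut self) -> Option<u64> {
        self.in_flight -= 1;
        if self.in_flight < self.capacity {
            if let Some(Reverse(t)) = self.waiters.pop() {
                self.in_flight += 1;
                return Some(t);
            }
        }
        None
    }

    /// AIMD: grow capacity by one on a clean round-trip...
    fn on_success(&mut self) { self.capacity += 1; }

    /// ...and halve it when congestion (e.g. a timeout) is observed.
    fn on_congestion(&mut self) { self.capacity = (self.capacity / 2).max(1); }
}
```

The `Reverse` wrapper turns the max-heap into a min-heap, so the query with the smallest (earliest) start time is always woken first.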

Worker reservation

Previously:

  • each stream has a limited buffer of "slots" for fetch-ahead
  • for each slot, we start a query to some worker
  • later, if the query fails, we look for another worker to retry

The problem: with a large buffer, too many workers are allocated for future slots. Then, when a query fails, no worker is left for a retry, and the stream is interrupted.

Now:

  • pre-allocate all the workers we may need for this slot
  • if we couldn't find enough workers, don't even open the slot
  • when the slot is finished, the workers are released to the pool again

This way, buffers simply become shorter under load, and problematic retries are avoided. Slower syncing is better than interrupting running streams.

The code became much simpler. It also made it possible to form error messages that explain why each query attempt failed, not just the last one.
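The pre-allocation scheme can be sketched like this (a simplified model under my own naming; the real pool presumably tracks much more state):

```rust
use std::collections::VecDeque;

type WorkerId = u32;

/// Sketch of per-slot worker reservation (hypothetical names).
struct WorkerPool {
    free: VecDeque<WorkerId>,
}

impl WorkerPool {
    /// Reserve `needed` workers up front: one for the first attempt
    /// plus enough for the allowed retries. If the pool cannot cover
    /// them all, the slot is not opened at all, so the fetch-ahead
    /// buffer simply shrinks under load instead of failing mid-stream.
    fn open_slot(&mut self, needed: usize) -> Option<Vec<WorkerId>> {
        if self.free.len() < needed {
            return None; // backpressure: don't even open the slot
        }
        Some((0..needed).map(|_| self.free.pop_front().unwrap()).collect())
    }

    /// When the slot is finished, its workers return to the pool.
    fn close_slot(&mut self, workers: Vec<WorkerId>) {
        self.free.extend(workers);
    }
}
```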

Priorities

Workers are now prioritized based on observed throughput (keeping the backoff on hard errors). Throughput is measured only during periods when a download was actively running. Thanks to the limited number of parallel downloads, I don't expect it to be too noisy, though it may still depend heavily on the response size.
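A stripped-down version of the ordering might look like this; `WorkerStats` and `pick_order` are illustrative names, and the real priority key is richer (the review below mentions a three-part tuple):

```rust
/// Sketch of throughput-based worker ordering (illustrative only).
#[derive(Clone)]
struct WorkerStats {
    id: u32,
    /// Bytes/s, measured only while a download was actively running.
    throughput: f64,
    /// Set on hard errors; the worker sorts last until the backoff expires.
    backed_off: bool,
}

/// Order workers for selection: non-backed-off first,
/// then by descending observed throughput.
fn pick_order(mut workers: Vec<WorkerStats>) -> Vec<u32> {
    workers.sort_by(|a, b| {
        a.backed_off
            .cmp(&b.backed_off) // false (healthy) sorts before true
            .then(b.throughput.total_cmp(&a.throughput)) // higher first
    });
    workers.into_iter().map(|w| w.id).collect()
}
```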

Connections

Previously:

  • if the portal wants to connect to an unknown worker, a Kademlia query is triggered to find its IP, and then the connection is established (usually 3-5 s)
  • the timeout applies to the whole pipeline

Now:

  • a background job continuously tries to establish connections to all existing workers
  • if there is no established connection when the portal needs one, it immediately gets an error and backs off
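The fail-fast query path can be sketched as a lookup against a cache that the background job fills in; `ConnectionCache` and its methods are hypothetical names:

```rust
use std::collections::HashSet;

/// Sketch of the connection-cache behaviour (hypothetical names).
struct ConnectionCache {
    established: HashSet<u32>, // workers with a live connection
}

impl ConnectionCache {
    /// The query path no longer waits for Kademlia lookup + dialing
    /// (usually 3-5 s); it either has a live connection or fails fast
    /// so the caller can back off.
    fn get(&self, worker: u32) -> Result<u32, &'static str> {
        if self.established.contains(&worker) {
            Ok(worker)
        } else {
            Err("no established connection; backing off")
        }
    }

    /// Called by the background job as dials to known workers complete.
    fn on_connected(&mut self, worker: u32) {
        self.established.insert(worker);
    }
}
```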

Remaining issues

  • if a slot errors, the following ones should be cancelled, because the stream will be interrupted at that point anyway
  • the worker priority pool still doesn't look reliable and should be redesigned
  • the worker priority pool is not covered by metrics

@kalabukdima kalabukdima requested a review from define-null April 1, 2026 13:29
@mo4islona
Contributor

I reviewed the overload-control changes and found a few correctness issues that look worth fixing before merge:

  1. Congestion accounting misses the wait-for-first-byte phase (src/network/client.rs)
    The scheduler permit is released right after send_query_request() returns, but the request can still spend a long time waiting in worker execution / TTFB before receive_first_byte() finishes. Because download_utilization() is derived from the scheduler state, that whole phase is invisible to the overload gate, so the portal can keep admitting new streams while many requests are already in flight.

  2. ReadError::TooLarge is treated like a transport timeout (src/network/client.rs)
    convert_read_error() routes both TooLarge and transport errors through report_query_failure(), which applies the long timeout-style penalty to the worker. An oversized response is not a timeout; the worker did respond, so this can incorrectly sideline healthy workers for the full timeout window.

  3. Readiness can become true with zero connections when there is only one worker (src/network/client.rs)
    active_connections >= num_workers * 3 / 4 truncates to 0 when num_workers == 1, so the readiness endpoint can report healthy even with no active connection to the only worker.

  4. New / unknown workers are silently deprioritized (src/network/priorities.rs)
    Truly unknown workers get default_priority() == (Best, 0, 0), while any measured worker gets a negative throughput key and therefore sorts ahead. In practice this means replacement / freshly joined workers may never be sampled while the existing pool stays healthy.

  5. Low-severity note: get_worker now briefly leases the worker just to return its id (src/http_server.rs)
    The old path explicitly did a non-leasing pick. The new path briefly increments running_queries and then drops the lease immediately, which creates a small race where concurrent callers can observe the worker as busy.
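The truncation in point 3 is easy to reproduce. The sketch below uses my own function names; one possible fix (among others) is ceiling division, so that at least one connection is required whenever any workers exist:

```rust
/// Reproduces the bug from point 3: with one worker,
/// `num_workers * 3 / 4` truncates to 0, so zero active
/// connections already pass the readiness check.
fn ready_truncating(active: usize, num_workers: usize) -> bool {
    active >= num_workers * 3 / 4
}

/// A possible fix: round the threshold up instead of down.
fn ready_ceiling(active: usize, num_workers: usize) -> bool {
    active >= (num_workers * 3).div_ceil(4)
}
```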

@kalabukdima
Contributor Author

> I reviewed the overload-control changes and found a few correctness issues that look worth fixing before merge:

Did you review it, or did AI?
