
Conversation

Contributor

@deuszx deuszx commented Oct 8, 2025

Motivation

The Linera client needs to interact with multiple validator nodes efficiently. Previously, the
client would make individual requests to validators without:

  1. Performance tracking: No mechanism to prefer faster, more reliable validators
  2. Request deduplication: Concurrent requests for the same data would all hit the network, wasting
    bandwidth and validator resources
  3. Response caching: Repeated requests for the same data would always go to validators
  4. Load balancing: No rate limiting per validator, risking overload
  5. Resilience: No fallback mechanism when a validator is slow or unresponsive

This led to:

  • Unnecessary network traffic and validator load
  • Poor user experience with redundant waiting
  • No optimization based on validator performance
  • Risk of overwhelming validators with too many concurrent requests
  • No recovery mechanism when validators are slow

Proposal

This PR introduces ValidatorManager, a sophisticated request orchestration layer that provides
intelligent peer selection, request deduplication, caching, and performance-based routing.

Key Features

  1. Performance Tracking with Exponential Moving Averages (EMA)
  • Tracks latency, success rate, and current load for each validator
  • Uses configurable weights to compute a composite performance score (see the scoring sketch after the module layout below)
  • Intelligently selects the best available validator for each request
  • Weighted random selection from top performers to avoid hotspots
  2. Request Deduplication
  • Exact matching: Multiple concurrent requests for identical data are deduplicated
  • Subsumption-based matching: Smaller requests are satisfied by larger in-flight requests that
    contain the needed data (e.g., a request for blocks 10-12 can be satisfied by an in-flight request
    for blocks 10-20)
  • Broadcast mechanism ensures all waiting requesters receive the result when the request completes
  • Timeout handling: Stale in-flight requests (>200ms) are not deduplicated against, allowing fresh
    attempts
  3. Response Caching
  • Successfully completed requests are cached with configurable TTL (default: 2 seconds)
  • LRU eviction when cache reaches maximum size (default: 1000 entries)
  • Works with both exact and subsumption matching
  • Only successful results are cached
  4. Slot-Based Rate Limiting
  • Each validator has a maximum concurrent request limit (default: 100)
  • Async await mechanism: requests wait for available slots without polling
  • Prevents overloading individual validators
  • Automatic slot release on request completion
  5. Alternative Peer Handling
  • When multiple callers request the same data, they register as "alternative peers"
  • If the original request times out (>200ms), any alternative peer can complete the request
  • The result is broadcast to all waiting requesters
  • Provides resilience against slow validators
  6. Modular Architecture

Created a new validator_manager module with clear separation of concerns:

  validator_manager/
  ├── mod.rs              - Module exports and constants
  ├── manager.rs          - ValidatorManager orchestration logic
  ├── in_flight_tracker.rs - In-flight request tracking and deduplication
  ├── node_info.rs        - Per-validator performance tracking
  ├── request.rs          - Request types and result extraction
  └── scoring.rs          - Configurable scoring weights
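
The scoring sketch referenced in feature 1 above: a minimal, hedged illustration of the EMA tracking and the composite score. The type and method names below are illustrative and do not claim to match the actual contents of node_info.rs and scoring.rs.

  struct ScoringWeights {
      latency: f64,
      success: f64,
      load: f64,
  }

  struct NodePerf {
      ema_latency_ms: f64, // EMA of observed request latency
      ema_success: f64,    // EMA of success (1.0) / failure (0.0)
      in_flight: usize,    // current number of outstanding requests
  }

  impl NodePerf {
      // Fold one completed request into the EMAs with smoothing factor `alpha`.
      fn record_sample(&mut self, latency_ms: f64, success: bool, alpha: f64) {
          let outcome = if success { 1.0 } else { 0.0 };
          self.ema_latency_ms = alpha * latency_ms + (1.0 - alpha) * self.ema_latency_ms;
          self.ema_success = alpha * outcome + (1.0 - alpha) * self.ema_success;
      }

      // Composite score in [0, 1]; higher is better.
      fn score(&self, w: &ScoringWeights, max_latency_ms: f64, max_in_flight: usize) -> f64 {
          let latency_score = 1.0 - (self.ema_latency_ms / max_latency_ms).min(1.0);
          let success_score = self.ema_success;
          let load_score = 1.0 - (self.in_flight as f64 / max_in_flight as f64).min(1.0);
          w.latency * latency_score + w.success * success_score + w.load * load_score
      }
  }

With the default weights shown in the configuration below (latency 0.4, success 0.4, load 0.2), latency and reliability dominate the ranking, while current load acts mostly as a tiebreaker.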

API

High-level APIs:

  // Execute with best available validator
  manager.with_best(request_key, |peer| async {
      peer.download_certificates(chain_id, start, limit).await
  }).await

  // Execute with specific validator
  manager.with_peer(request_key, peer, |peer| async {
      peer.download_blob(blob_id).await
  }).await

Configuration:

  let manager = ValidatorManager::with_config(
      validator_nodes,
      max_requests_per_node: 100,
      weights: ScoringWeights { latency: 0.4, success: 0.4, load: 0.2 },
      alpha: 0.1,              // EMA smoothing factor
      max_expected_latency_ms: 5000.0,
      cache_ttl: Duration::from_secs(2),
      max_cache_size: 1000,
  );
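
As a usage sketch (not taken verbatim from the PR: it reuses the RequestKey::Certificates variant quoted in the review below and assumes RequestKey implements Clone), deduplication means that concurrent callers with the same key share a single network round trip:

  // Two tasks ask for the same certificate range concurrently; the second
  // joins the first in-flight request (or hits the cache if it already
  // completed), so the operation runs only once.
  let key = RequestKey::Certificates { chain_id, start, limit };
  let (first, second) = tokio::join!(
      manager.with_best(key.clone(), |peer| async move {
          peer.download_certificates(chain_id, start, limit).await
      }),
      manager.with_best(key, |peer| async move {
          peer.download_certificates(chain_id, start, limit).await
      }),
  );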

Benefits

  • Reduced network load: Deduplication and caching eliminate redundant requests
  • Better performance: Intelligent peer selection routes to fastest validators
  • Improved reliability: Alternative peer mechanism provides resilience
  • Protection for validators: Rate limiting prevents overload
  • Efficient resource usage: EMA-based scoring optimizes validator selection
  • Clean architecture: Modular design makes code maintainable and testable

Metrics

In production usage, this should significantly reduce:

  • Network traffic between clients and validators
  • Validator CPU/memory usage from redundant requests
  • Client request latency through caching and smart routing
  • Failed requests through performance tracking and rate limiting

The following metrics have been added to Prometheus (when compiled with --features metrics):

  • validator_manager_response_time_ms - Response time for requests to validators in milliseconds
  • validator_manager_request_total - Total number of requests made to each validator
  • validator_manager_request_success - Number of successful requests to each validator (the error rate is (validator_manager_request_total - validator_manager_request_success) / validator_manager_request_total)
  • validator_manager_request_deduplication_total - Number of requests that were deduplicated by joining an in-flight request
  • validator_manager_request_cache_hit_total - Number of requests that were served from cache

Test Plan

Existing CI makes sure we maintain backwards compatibility. Some tests have been added to the new modules.

Release Plan

  • Nothing to do / These changes follow the usual release cycle.

Links

@deuszx deuszx force-pushed the conway_validator-manager branch 8 times, most recently from f922f82 to b345f8c on October 20, 2025 10:50
@deuszx deuszx requested review from Twey, afck and bart-linera October 20, 2025 10:51
Comment on lines +361 to +371
.with_peer(
RequestKey::Certificates {
chain_id,
start: next_height,
limit,
},
remote_node.clone(),
async move |peer| {
peer.download_certificates_from(chain_id, next_height, limit)
.await
},
Contributor

We seem to be passing the same info twice: once in the RequestKey, once in the operation to be executed. Maybe it would be possible to make the RequestKey determine the operation?

Contributor Author

I know, but I couldn't find a way to get rid of this duplication. The caller chooses the operation (the closure), but the code that executes the closure (track_request) and the caching logic (deduplicated_request) are separate functions. What I could do is merge those two, but then testing would be much more difficult.

Contributor

What I was thinking was that maybe the operation could be chosen by matching on the RequestKey inside the ValidatorManager (or some method on the RequestKey, even?), instead of by passing a closure. But maybe that wouldn't work with some use cases? I haven't reviewed most of the code yet, those are loose ideas I'm getting while reading the PR.

Contributor Author

Sure. Initially, ValidatorManager had an API mirroring RemoteNode – i.e. it had methods like download_certificate, download_blob, and download_certificate_for_blob – but with time I moved away from this and worked with the closure (although there are still methods for downloading blobs). I could revert to that approach, but I thought it was more limiting.

@deuszx deuszx force-pushed the conway_validator-manager branch from a016601 to 0fa1c8a on October 20, 2025 15:31
@deuszx deuszx force-pushed the conway_validator-manager branch from 0fa1c8a to e7638a2 on October 20, 2025 16:00
/// # Returns
/// - `Some((chain_id, heights))` for certificate requests, where heights are sorted
/// - `None` for non-certificate requests (Blob, PendingBlob, CertificateForBlob)
fn height_range(&self) -> Option<Vec<BlockHeight>> {
Contributor

We could return a BTreeSet instead, to make contains more efficient in the subsumes implementation. Or, if they are always ordered anyway, we could avoid calling contains below at all, and just iterate over both?
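
For reference, a minimal sketch of that sorted-scan idea (assuming both height lists are sorted ascending, as the doc comment above promises; illustrative only, not necessarily what landed in the PR):

// Checks that every height in `needed` also appears in `available`,
// scanning each sorted list at most once (O(n + m)) instead of calling
// `contains` per element.
fn is_sorted_subset(needed: &[BlockHeight], available: &[BlockHeight]) -> bool {
    let mut available = available.iter();
    needed
        .iter()
        .all(|height| available.any(|candidate| candidate == height))
}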

Contributor Author

I meant to change the impl to use the ordering and I forgot 🙈

Contributor Author

I think I addressed this in a9c836a

Comment on lines 161 to 168
let certificates = match result {
RequestResult::Certificates(certs) => certs,
_ => return None,
};

if self.chain_id().is_none() || from.chain_id().is_none() {
return None;
}
Contributor

Instead of all this, shouldn't we just if !self.subsumes(result) { return None; }?

Otherwise it isn't guaranteed that the filtered results below are complete.

Contributor Author

Good point.

Contributor Author

I think I addressed this in a9c836a

+ (self.weights.success * success_score)
+ (self.weights.load * load_score);

// Apply confidence factor to penalize nodes with too few samples
Contributor

Is that necessary? If a node has too few samples, this makes us avoid getting more samples, doesn't it?

Contributor Author

The idea was to not prioritise nodes that haven't received enough requests to trust the score.
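
For illustration, the dampening being described might look roughly like this (MIN_SAMPLES, sample_count, and composite_score are assumed names, not the PR's actual code):

// Scale the composite score down until enough samples have been observed,
// so freshly added or rarely used nodes are not ranked purely on noise.
let confidence = (self.sample_count as f64 / MIN_SAMPLES as f64).min(1.0);
let final_score = composite_score * confidence;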

let elapsed = Instant::now().duration_since(entry.started_at);

let outcome = if elapsed > self.timeout {
SubscribeOutcome::TimedOut(elapsed)
Contributor

But there might still be a more recent, subsuming (non-exact) one?

Contributor Author

You're right.

Contributor Author

Done in bbbd690

"request completed; broadcasting result to waiters",
);
if waiter_count != 0 {
let _ = entry.sender.send(result);
Contributor

Maybe log something if that fails. (In general, we are trying to avoid let _ = .)

Contributor Author

Done in aab9839

});

/// Counter for requests that were deduplicated (joined an in-flight request)
pub static REQUEST_CACHE_DEDUPLICATION: LazyLock<IntCounter> = LazyLock::new(|| {
Contributor

Isn't this always the same as REQUEST_CACHE_HIT?

Contributor Author

Not really – REQUEST_CACHE_HIT is incremented when the request (key) is found in the response cache, and REQUEST_CACHE_DEDUPLICATION is incremented when the request is deduplicated with another one in-flight.

});

// Give the first request time to register as in-flight
tokio::time::sleep(Duration::from_millis(10)).await;
Contributor

Then why use spawn at all? In 10 ms the request will probably have completed?

assert_eq!(result1, result2);

// Operation should only have been executed once (deduplication worked)
assert_eq!(execution_count.load(Ordering::SeqCst), 1);
Contributor

What does this test add compared to test_cache_hit_returns_cached_result (and vice versa)?


// Operation should only have been executed once (all requests were deduplicated)
assert_eq!(execution_count.load(Ordering::SeqCst), 1);
}
Contributor

Doesn't this just extend the first two tests, i.e. make them obsolete?

tokio::time::sleep(Duration::from_millis(10)).await;

// Wait for the timeout to elapse
tokio::time::sleep(Duration::from_millis(MAX_REQUEST_TTL_MS + 1)).await;
Contributor

We're trying to avoid sleep in tests. I guess it's difficult in this case, but maybe there's a way to use the fake clock?
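
One possible direction, hedged: Tokio's paused test clock (behind the test-util feature) lets the timeout elapse without real sleeping, though it only affects tokio::time, so the tracker would need to measure elapsed time with tokio::time::Instant rather than std::time::Instant for this to help. A minimal, self-contained illustration:

use std::time::Duration;

// With `start_paused = true`, the Tokio clock only advances when told to,
// so this test finishes immediately instead of sleeping for 201 ms.
#[tokio::test(start_paused = true)]
async fn timeout_elapses_without_real_sleep() {
    let start = tokio::time::Instant::now();
    tokio::time::advance(Duration::from_millis(201)).await;
    assert!(start.elapsed() >= Duration::from_millis(201));
}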

// - Completes the first request to release a slot
// - Verifies the third request now acquires the freed slot and executes (execution count becomes 3)
// - Confirms all requests complete successfully
use linera_base::identifiers::BlobType;
Contributor

Should probably be at the start of the file.

Contributor Author

It's used in this single test 🤔

Contributor

@afck afck left a comment

Looks great! No blockers so far. 👍

stream::iter(blob_ids.into_iter().map(|blob_id| {
communicate_concurrently(
remote_nodes,
async move |remote_node| {
Contributor

If this task gets canceled, does that cause any issues, e.g. with the in-flight tracker? (Maybe fixed if we use a Tokio semaphore anyway.)

Contributor Author

Yes, hopefully when the task with the permit is dropped the semaphore permit is released.

Contributor Author

but good point!
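
For what it's worth, if the slots are backed by tokio::sync::Semaphore as suggested above, a permit acquired with acquire_owned is returned when it is dropped, which also covers the case where the task holding it is cancelled. A minimal sketch (the function name is illustrative):

use std::sync::Arc;
use tokio::sync::Semaphore;

async fn with_slot(semaphore: Arc<Semaphore>) {
    // The permit is released when `_permit` is dropped, whether the request
    // completes, fails, or the whole task is cancelled mid-await.
    let _permit = semaphore
        .acquire_owned()
        .await
        .expect("the semaphore is never closed");
    // ... perform the request while holding the slot ...
}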

// Filter nodes that can accept requests and calculate their scores
let mut scored_nodes = Vec::new();
for info in nodes.values() {
if info.can_accept_request(self.max_requests_per_node).await {
Contributor

Should we already try to acquire permits here? Otherwise there's no guarantee they can still accept requests below.

Contributor Author

@deuszx deuszx Oct 21, 2025

We don't always follow up with the request after this call, so I didn't want to do it. Maybe this method should be changed – it's not guaranteed that the returned peers will still be able to accept a request when the time comes (since the slots can be used up in the meantime) – and instead we'd calculate the score for everyone, and the peers with higher current load would be scored lower due to the lack of available slots:

let current_load =
    (self.max_in_flight as f64) - (self.in_flight_semaphore.available_permits() as f64);
let load_score =
    1.0 - (current_load.min(self.max_in_flight as f64) / self.max_in_flight as f64);

i.e. the load_score part would be 0 for them.

@deuszx deuszx force-pushed the conway_validator-manager branch from 59e2df1 to d4c0f2c on October 21, 2025 17:14
@deuszx deuszx force-pushed the conway_validator-manager branch from 808b2b5 to 7fc2842 on October 21, 2025 19:03
@deuszx deuszx force-pushed the conway_validator-manager branch from 7fc2842 to bbbd690 on October 21, 2025 19:04