
Conversation

Contributor

@deuszx deuszx commented Oct 8, 2025

Motivation

The Linera client needs to interact with multiple validator nodes efficiently. Previously, the
client would make individual requests to validators without:

  1. Performance tracking: No mechanism to prefer faster, more reliable validators
  2. Request deduplication: Concurrent requests for the same data would all hit the network, wasting
    bandwidth and validator resources
  3. Response caching: Repeated requests for the same data would always go to validators
  4. Load balancing: No rate limiting per validator, risking overload
  5. Resilience: No fallback mechanism when a validator is slow or unresponsive

This led to:

  • Unnecessary network traffic and validator load
  • Poor user experience with redundant waiting
  • No optimization based on validator performance
  • Risk of overwhelming validators with too many concurrent requests
  • No recovery mechanism when validators are slow

Proposal

This PR introduces ValidatorManager, a sophisticated request orchestration layer that provides
intelligent peer selection, request deduplication, caching, and performance-based routing.

Key Features

  1. Performance Tracking with Exponential Moving Averages (EMA)
  • Tracks latency, success rate, and current load for each validator
  • Uses configurable weights to compute a composite performance score (see the scoring sketch after the module layout below)
  • Intelligently selects the best available validator for each request
  • Weighted random selection from top performers to avoid hotspots
  2. Request Deduplication
  • Exact matching: Multiple concurrent requests for identical data are deduplicated
  • Subsumption-based matching: Smaller requests are satisfied by larger in-flight requests that
    contain the needed data (e.g., a request for blocks 10-12 can be satisfied by an in-flight request
    for blocks 10-20)
  • Broadcast mechanism ensures all waiting requesters receive the result when the request completes
  • Timeout handling: Stale in-flight requests (>200ms) are not deduplicated against, allowing fresh
    attempts
  3. Response Caching
  • Successfully completed requests are cached with configurable TTL (default: 2 seconds)
  • LRU eviction when cache reaches maximum size (default: 1000 entries)
  • Works with both exact and subsumption matching
  • Only successful results are cached
  4. Slot-Based Rate Limiting
  • Each validator has a maximum concurrent request limit (default: 100)
  • Async await mechanism: requests wait for available slots without polling
  • Prevents overloading individual validators
  • Automatic slot release on request completion
  5. Alternative Peer Handling
  • When multiple callers request the same data, they register as "alternative peers"
  • If the original request times out (>200ms), any alternative peer can complete the request
  • The result is broadcast to all waiting requesters
  • Provides resilience against slow validators
  6. Modular Architecture

Created a new validator_manager module with clear separation of concerns:

  validator_manager/
  ├── mod.rs              - Module exports and constants
  ├── manager.rs          - ValidatorManager orchestration logic
  ├── in_flight_tracker.rs - In-flight request tracking and deduplication
  ├── node_info.rs        - Per-validator performance tracking
  ├── request.rs          - Request types and result extraction
  └── scoring.rs          - Configurable scoring weights
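
The scoring sketch referenced in feature 1 above: a minimal, hedged illustration of the EMA tracking and the composite score. The type and method names below are illustrative and do not claim to match the actual contents of node_info.rs and scoring.rs.

  struct ScoringWeights {
      latency: f64,
      success: f64,
      load: f64,
  }

  struct NodePerf {
      ema_latency_ms: f64, // EMA of observed request latency
      ema_success: f64,    // EMA of success (1.0) / failure (0.0)
      in_flight: usize,    // current number of outstanding requests
  }

  impl NodePerf {
      // Fold one completed request into the EMAs with smoothing factor `alpha`.
      fn record_sample(&mut self, latency_ms: f64, success: bool, alpha: f64) {
          let outcome = if success { 1.0 } else { 0.0 };
          self.ema_latency_ms = alpha * latency_ms + (1.0 - alpha) * self.ema_latency_ms;
          self.ema_success = alpha * outcome + (1.0 - alpha) * self.ema_success;
      }

      // Composite score in [0, 1]; higher is better.
      fn score(&self, w: &ScoringWeights, max_latency_ms: f64, max_in_flight: usize) -> f64 {
          let latency_score = 1.0 - (self.ema_latency_ms / max_latency_ms).min(1.0);
          let success_score = self.ema_success;
          let load_score = 1.0 - (self.in_flight as f64 / max_in_flight as f64).min(1.0);
          w.latency * latency_score + w.success * success_score + w.load * load_score
      }
  }

With the default weights shown in the configuration below (latency 0.4, success 0.4, load 0.2), latency and reliability dominate the ranking, while current load acts mostly as a tiebreaker.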

API

High-level APIs:

  // Execute with best available validator
  manager.with_best(request_key, |peer| async {
      peer.download_certificates(chain_id, start, limit).await
  }).await

  // Execute with specific validator
  manager.with_peer(request_key, peer, |peer| async {
      peer.download_blob(blob_id).await
  }).await

Configuration:

  let manager = ValidatorManager::with_config(
      validator_nodes,
      max_requests_per_node: 100,
      weights: ScoringWeights { latency: 0.4, success: 0.4, load: 0.2 },
      alpha: 0.1,              // EMA smoothing factor
      max_expected_latency_ms: 5000.0,
      cache_ttl: Duration::from_secs(2),
      max_cache_size: 1000,
  );
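
As a usage sketch (not taken verbatim from the PR: it reuses the RequestKey::Certificates variant quoted in the review below and assumes RequestKey implements Clone), deduplication means that concurrent callers with the same key share a single network round trip:

  // Two tasks ask for the same certificate range concurrently; the second
  // joins the first in-flight request (or hits the cache if it already
  // completed), so the operation runs only once.
  let key = RequestKey::Certificates { chain_id, start, limit };
  let (first, second) = tokio::join!(
      manager.with_best(key.clone(), |peer| async move {
          peer.download_certificates(chain_id, start, limit).await
      }),
      manager.with_best(key, |peer| async move {
          peer.download_certificates(chain_id, start, limit).await
      }),
  );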

Benefits

  • Reduced network load: Deduplication and caching eliminate redundant requests
  • Better performance: Intelligent peer selection routes to fastest validators
  • Improved reliability: Alternative peer mechanism provides resilience
  • Protection for validators: Rate limiting prevents overload
  • Efficient resource usage: EMA-based scoring optimizes validator selection
  • Clean architecture: Modular design makes code maintainable and testable

Metrics

In production usage, this should significantly reduce:

  • Network traffic between clients and validators
  • Validator CPU/memory usage from redundant requests
  • Client request latency through caching and smart routing
  • Failed requests through performance tracking and rate limiting

The following metrics have been added to Prometheus (when compiled with --features metrics):

  • validator_manager_response_time_ms - Response time for requests to validators in milliseconds
  • validator_manager_request_total - Total number of requests made to each validator
  • validator_manager_request_success - Number of successful requests to each validator (the error rate is (validator_manager_request_total - validator_manager_request_success) / validator_manager_request_total)
  • validator_manager_request_deduplication_total - Number of requests that were deduplicated by joining an in-flight request
  • validator_manager_request_cache_hit_total - Number of requests that were served from cache

Test Plan

Existing CI makes sure we maintain backwards compatibility. Some tests have been added to the new modules.

Release Plan

  • Nothing to do / These changes follow the usual release cycle.

Links

@deuszx deuszx force-pushed the conway_validator-manager branch 8 times, most recently from f922f82 to b345f8c on October 20, 2025 10:50
@deuszx deuszx requested review from Twey, afck and bart-linera October 20, 2025 10:51
Comment on lines +361 to +371
.with_peer(
RequestKey::Certificates {
chain_id,
start: next_height,
limit,
},
remote_node.clone(),
async move |peer| {
peer.download_certificates_from(chain_id, next_height, limit)
.await
},
Contributor

We seem to be passing the same info twice: once in the RequestKey, once in the operation to be executed. Maybe it would be possible to make the RequestKey determine the operation?

Contributor Author

I know, but I couldn't find a way to get rid of this duplication. The caller chooses the operation (the closure), but the code that executes the closure (track_request) and the caching logic (deduplicated_request) are separate functions. What I could do is merge those two, but then testing would be much more difficult.

Contributor

What I was thinking was that maybe the operation could be chosen by matching on the RequestKey inside the ValidatorManager (or some method on the RequestKey, even?), instead of by passing a closure. But maybe that wouldn't work with some use cases? I haven't reviewed most of the code yet, those are loose ideas I'm getting while reading the PR.

Contributor Author

Sure. Initially, ValidatorManager had an API mirroring RemoteNode – i.e. it had methods like download_certificate, download_blob, and download_certificate_for_blob – but with time I moved away from this and worked with the closure (although there are still methods for downloading blobs). I could revert to that approach, but I thought it was more limiting.

@deuszx deuszx force-pushed the conway_validator-manager branch from a016601 to 0fa1c8a on October 20, 2025 15:31
@deuszx deuszx force-pushed the conway_validator-manager branch from 0fa1c8a to e7638a2 on October 20, 2025 16:00
/// # Returns
/// - `Some((chain_id, heights))` for certificate requests, where heights are sorted
/// - `None` for non-certificate requests (Blob, PendingBlob, CertificateForBlob)
fn height_range(&self) -> Option<Vec<BlockHeight>> {
Contributor

We could return a BTreeSet instead, to make contains more efficient in the subsumes implementation. Or, if they are always ordered anyway, we could avoid calling contains below at all, and just iterate over both?
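
For reference, a minimal sketch of that sorted-scan idea (assuming both height lists are sorted ascending, as the doc comment above promises; illustrative only, not necessarily what landed in the PR):

// Checks that every height in `needed` also appears in `available`,
// scanning each sorted list at most once (O(n + m)) instead of calling
// `contains` per element.
fn is_sorted_subset(needed: &[BlockHeight], available: &[BlockHeight]) -> bool {
    let mut available = available.iter();
    needed
        .iter()
        .all(|height| available.any(|candidate| candidate == height))
}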

Contributor Author

I meant to change the impl to use the ordering and I forgot 🙈

Contributor Author

I think I addressed this in a9c836a

Comment on lines 161 to 168
let certificates = match result {
RequestResult::Certificates(certs) => certs,
_ => return None,
};

if self.chain_id().is_none() || from.chain_id().is_none() {
return None;
}
Contributor

Instead of all this, shouldn't we just if !self.subsumes(result) { return None; }?

Otherwise it isn't guaranteed that the filtered results below are complete.

Contributor Author

Good point.

Contributor Author

I think I addressed this in a9c836a

+ (self.weights.success * success_score)
+ (self.weights.load * load_score);

// Apply confidence factor to penalize nodes with too few samples
Contributor

Is that necessary? If a node has too few samples, this makes us avoid getting more samples, doesn't it?

Contributor Author

The idea was to not prioritise nodes that haven't received enough requests to trust the score.
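
For illustration, the dampening being described might look roughly like this (MIN_SAMPLES, sample_count, and composite_score are assumed names, not the PR's actual code):

// Scale the composite score down until enough samples have been observed,
// so freshly added or rarely used nodes are not ranked purely on noise.
let confidence = (self.sample_count as f64 / MIN_SAMPLES as f64).min(1.0);
let final_score = composite_score * confidence;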

let elapsed = Instant::now().duration_since(entry.started_at);

let outcome = if elapsed > self.timeout {
SubscribeOutcome::TimedOut(elapsed)
Contributor

But there might still be a more recent, subsuming (non-exact) one?

Contributor Author

You're right.

Contributor Author

Done in bbbd690

"request completed; broadcasting result to waiters",
);
if waiter_count != 0 {
let _ = entry.sender.send(result);
Contributor

Maybe log something if that fails. (In general, we are trying to avoid let _ = .)

Contributor Author

Done in aab9839

});

/// Counter for requests that were deduplicated (joined an in-flight request)
pub static REQUEST_CACHE_DEDUPLICATION: LazyLock<IntCounter> = LazyLock::new(|| {
Contributor

Isn't this always the same as REQUEST_CACHE_HIT?

Contributor Author

Not really – REQUEST_CACHE_HIT is incremented when the request (key) is found in the response cache, and REQUEST_CACHE_DEDUPLICATION is incremented when the request is deduplicated with another one in-flight.

});

// Give the first request time to register as in-flight
tokio::time::sleep(Duration::from_millis(10)).await;
Contributor

Then why use spawn at all? In 10 ms the request will probably have completed?

assert_eq!(result1, result2);

// Operation should only have been executed once (deduplication worked)
assert_eq!(execution_count.load(Ordering::SeqCst), 1);
Contributor

What does this test add compared to test_cache_hit_returns_cached_result (and vice versa)?


// Operation should only have been executed once (all requests were deduplicated)
assert_eq!(execution_count.load(Ordering::SeqCst), 1);
}
Contributor

Doesn't this just extend the first two tests, i.e. make them obsolete?

tokio::time::sleep(Duration::from_millis(10)).await;

// Wait for the timeout to elapse
tokio::time::sleep(Duration::from_millis(MAX_REQUEST_TTL_MS + 1)).await;
Contributor

We're trying to avoid sleep in tests. I guess it's difficult in this case, but maybe there's a way to use the fake clock?
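
One possible direction, hedged: Tokio's paused test clock (behind the test-util feature) lets the timeout elapse without real sleeping, though it only affects tokio::time, so the tracker would need to measure elapsed time with tokio::time::Instant rather than std::time::Instant for this to help. A minimal, self-contained illustration:

use std::time::Duration;

// With `start_paused = true`, the Tokio clock only advances when told to,
// so this test finishes immediately instead of sleeping for 201 ms.
#[tokio::test(start_paused = true)]
async fn timeout_elapses_without_real_sleep() {
    let start = tokio::time::Instant::now();
    tokio::time::advance(Duration::from_millis(201)).await;
    assert!(start.elapsed() >= Duration::from_millis(201));
}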

// - Completes the first request to release a slot
// - Verifies the third request now acquires the freed slot and executes (execution count becomes 3)
// - Confirms all requests complete successfully
use linera_base::identifiers::BlobType;
Contributor

Should probably be at the start of the file.

Contributor Author

It's used in this single test 🤔

Contributor

@afck afck left a comment

Looks great! No blockers so far. 👍

stream::iter(blob_ids.into_iter().map(|blob_id| {
communicate_concurrently(
remote_nodes,
async move |remote_node| {
Contributor

If this task gets canceled, does that cause any issues, e.g. with the in-flight tracker? (Maybe fixed if we use a Tokio semaphore anyway.)

Contributor Author

Yes, hopefully when the task with the permit is dropped the semaphore permit is released.

Contributor Author

but good point!
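
For what it's worth, if the slots are backed by tokio::sync::Semaphore as suggested above, a permit acquired with acquire_owned is returned when it is dropped, which also covers the case where the task holding it is cancelled. A minimal sketch (the function name is illustrative):

use std::sync::Arc;
use tokio::sync::Semaphore;

async fn with_slot(semaphore: Arc<Semaphore>) {
    // The permit is released when `_permit` is dropped, whether the request
    // completes, fails, or the whole task is cancelled mid-await.
    let _permit = semaphore
        .acquire_owned()
        .await
        .expect("the semaphore is never closed");
    // ... perform the request while holding the slot ...
}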

// Filter nodes that can accept requests and calculate their scores
let mut scored_nodes = Vec::new();
for info in nodes.values() {
if info.can_accept_request(self.max_requests_per_node).await {
Contributor

Should we already try to acquire permits here? Otherwise there's no guarantee they can still accept requests below.

Contributor Author

@deuszx deuszx Oct 21, 2025

We don't always follow up with the request after this call, so I didn't want to do it. Maybe this method should be changed – it's not guaranteed that the returned peers will still be able to accept a request when the time comes (since the slots can be used up in the meantime) – and instead we'd calculate the score for everyone, and the peers with higher current load would be scored lower due to the lack of available slots:

let current_load =
    (self.max_in_flight as f64) - (self.in_flight_semaphore.available_permits() as f64);
let load_score =
    1.0 - (current_load.min(self.max_in_flight as f64) / self.max_in_flight as f64);

i.e. the load_score part would be 0 for them.

@deuszx deuszx force-pushed the conway_validator-manager branch from 59e2df1 to d4c0f2c on October 21, 2025 17:14
@deuszx deuszx force-pushed the conway_validator-manager branch from 808b2b5 to 7fc2842 on October 21, 2025 19:03
@deuszx deuszx force-pushed the conway_validator-manager branch from 7fc2842 to bbbd690 on October 21, 2025 19:04