Simulate impact of shard movement using shard-level write load #131406
Conversation
Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)
public class WriteLoadPerShardSimulator {

    private final ObjectFloatMap<String> writeLoadDeltas;
nit: simulatedNodesLoad?
👍 I changed it to simulatedWriteLoadDeltas; we only store the delta from the reported/original write load here. The idea is that if no delta is present, we can just return the original NodeUsageStatsForThreadPools instance.
    }
}
writeShardsOnNode.forEach(
    shardId -> writeLoadPerShard.computeIfAbsent(shardId, k -> new Average()).add(writeUtilisation / writeShardsOnNode.size())
Do you equally divide the write load across all write shards on the node?
Yeah, this is just a stop-gap until we get actual shard loads, which should work as a drop-in replacement.
OK, I was thinking maybe we should have some heuristic based on already-available data; otherwise the signal-to-noise ratio is too high. It's not uncommon to have hundreds of shards, and the estimation has little to no impact on a single shard.
For example, use a shardSize heuristic: the larger the shard, the more likely it is to have write load. Say we linearly increase the weight of a shard as its size approaches 15GB, and then decrease the weight as it approaches 30GB, since we would (most of the time) roll it over: if size < 15GB then size/15GB, else max(0, 1 - size/30GB). A sketch of this weighting follows below.
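A minimal sketch of that proposed weighting (the method name and the GB conversion are illustrative, not from the PR):

```java
// Weight rises linearly towards 1.0 as the shard approaches 15GB, then falls
// back towards 0 as it approaches 30GB, where rollover would usually kick in.
static double writeLoadWeight(long shardSizeInBytes) {
    final double sizeGb = shardSizeInBytes / (1024.0 * 1024.0 * 1024.0);
    if (sizeGb < 15.0) {
        return sizeGb / 15.0;
    }
    return Math.max(0.0, 1.0 - sizeGb / 30.0);
}
```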
We'll have actual shard write loads shortly. Hopefully we can avoid all this guessing entirely.
LGTM
I might hold off merging until we get #131496 merged; I think we can avoid fudging the shard write loads.
I left one comment where I'm concerned there might be a bug, but the other requests are just improvements.
We ignore queue latency in the modelling because I don't think we're going to look at it in the decider, and I can't see how we could estimate how it would change in response to shard movements (it's a function of the amount the node is overloaded AND how long it's been like that, and back-pressure should ideally keep a lid on it).
I was originally imagining that we could (in future, not now) collect some per-shard stats for queuing, and make some kind of estimate of additional shard write load based on that, like auto-scaling except at the shard rather than the node level. But it may turn out that we don't need something like that: we'll probably see how it goes in production. And I haven't explored the feasibility of collecting a stat like that.
Alternatively, we could choose to be more generous with how much write load is moved away from a node, based on the queue latency: we don't know how much load to attribute to a particular shard, but we could extrapolate that when the queue latency is X seconds, we need to move off X seconds of additional thread execution time, translated into shard write load (which is itself thread execution time). A rough sketch of that idea is below.
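To make that extrapolation concrete, a rough sketch (the names and the window normalisation are my assumptions, not from the PR):

```java
// Assumption: write load is expressed as thread execution time per unit time
// ("threads"), so X seconds of queue latency observed over a stats window of
// W seconds translates into roughly X/W of extra load to move off the node.
static double extraWriteLoadToMoveOff(double queueLatencySeconds, double statsWindowSeconds) {
    return queueLatencySeconds / statsWindowSeconds;
}
```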
import java.util.Map;
import java.util.stream.Collectors;

public class WriteLoadPerShardSimulator {
Comment explaining the purpose/use of the class, please.
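Something along these lines, perhaps (my wording, just to illustrate the kind of comment being asked for):

```java
/**
 * Simulates the effect of proposed shard movements on each node's write
 * thread pool utilisation, tracking per-node deltas against the write load
 * reported in the {@link ClusterInfo} so that nodes without any simulated
 * movement can return their original stats unchanged.
 */
public class WriteLoadPerShardSimulator {
```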
public class WriteLoadPerShardSimulator {

    private final ObjectDoubleMap<String> simulatedWriteLoadDeltas;
Is there a performance gain over Map<String, Double>? I'm essentially wondering why we use it.
simulatedWriteLoadDeltas.addTo(shardRouting.relocatingNodeId(), -1 * writeLoadForShard);
simulatedWriteLoadDeltas.addTo(shardRouting.currentNodeId(), writeLoadForShard);
} else {
    // not sure how this would come about, perhaps when allocating a replica after a delay?
It would be good to explain your reasoning here in the comment.
If I understand correctly, you're wondering how it would be possible to have a write load estimate for a new index? I haven't verified, but I expect data stream rollover to create a new index with an estimated write load calculated from the previous index in the data stream. But the write load estimate we're feeding into this simulation is currently the peak write load estimate, right? I think we'll want to be able to make an estimate for a new index, eventually. Could you file a ticket to keep track of that work, and throw it into the Milestone 3 epic (https://elasticco.atlassian.net/browse/ES-11977)? I'm currently thinking of that milestone as a bucket for follow-up optimizations.
if (shardRouting.relocatingNodeId() != null) {
    // relocating
    simulatedWriteLoadDeltas.addTo(shardRouting.relocatingNodeId(), -1 * writeLoadForShard);
    simulatedWriteLoadDeltas.addTo(shardRouting.currentNodeId(), writeLoadForShard);
Could you add some testing to ClusterInfoSimulatorTests, to verify that we're adding and subtracting with the same expectations? This looks right, but it took a little digging for me to realize that we're tracking free space in DiskUsage and that flipping adding/subtracting here makes sense. So having a bit of testing in ClusterInfoSimulatorTests seems like it'd remove any doubts.
I don't feel too strongly about this, if you disagree.
}

public Map<String, NodeUsageStatsForThreadPools> nodeUsageStatsForThreadPools() {
    return routingAllocation.clusterInfo()
My understanding of streams is that Armin identified a while ago (in a tech talk) that they perform poorly compared to for/while loops. Since this code will run frequently, can we use some kind of loop instead of a stream? A loop-based sketch follows below.
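For illustration, a loop-based equivalent might look something like this (a sketch only: applySimulatedDelta is a hypothetical stand-in for the existing lambda body):

```java
public Map<String, NodeUsageStatsForThreadPools> nodeUsageStatsForThreadPools() {
    final Map<String, NodeUsageStatsForThreadPools> original =
        routingAllocation.clusterInfo().getNodeUsageStatsForThreadPools();
    final Map<String, NodeUsageStatsForThreadPools> adjusted = new HashMap<>(original.size());
    for (Map.Entry<String, NodeUsageStatsForThreadPools> entry : original.entrySet()) {
        if (simulatedWriteLoadDeltas.containsKey(entry.getKey())) {
            // Apply the simulated delta for nodes whose write load has changed.
            adjusted.put(entry.getKey(), applySimulatedDelta(entry.getValue(), simulatedWriteLoadDeltas.get(entry.getKey())));
        } else {
            // No delta recorded: reuse the original instance unchanged.
            adjusted.put(entry.getKey(), entry.getValue());
        }
    }
    return Collections.unmodifiableMap(adjusted);
}
```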
.flatMap(index -> IntStream.range(0, 3).mapToObj(shardNum -> new ShardId(index.getIndex(), shardNum)))
.collect(Collectors.toUnmodifiableMap(shardId -> shardId, shardId -> randomDoubleBetween(0.1, 5.0, true)))
)
.build();
Could you log the ClusterInfo to a string? There isn't any debug information to look at if any of the tests fail (I think?), and some logging of the values might help.
final var writeLoadPerShardSimulator = new WriteLoadPerShardSimulator(allocation);

// Relocate a random shard from node_0 to node_1
final var randomShard = randomFrom(StreamSupport.stream(allocation.routingNodes().node("node_0").spliterator(), false).toList());
Log randomShard? For debug purposes; then we can match it with the ClusterInfo I suggest logging in createRoutingAllocation. Something like the line below.
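For instance (ESTestCase already provides a logger):

```java
logger.info("--> relocating shard {} from node_0 to node_1", randomShard);
```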
);
}

public void testMovementFollowedByMovementBackWillNotChangeAnything() {
This test is essentially identical to testMovementOfAShardWillReduceThreadPoolUtilisation, except for some additional work at the end to move the shard back. It seems redundant? Maybe delete the shorter one, and rename the remaining test?
public void simulateShardStarted(ShardRouting shardRouting) {
    final Double writeLoadForShard = writeLoadsPerShard.get(shardRouting.shardId());
    if (writeLoadForShard != null) {
Could you test the case where a shard's write load is 0/null, as would be reported for a non-data-stream index shard? A sketch of what I mean is below.
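Roughly along these lines (the helper names are made up; the real setup would reuse this file's existing fixtures):

```java
public void testShardWithNoWriteLoadDoesNotChangeNodeUsageStats() {
    // Build an allocation whose ClusterInfo reports no write load for any shard
    // (createRoutingAllocationWithoutWriteLoads is a hypothetical helper).
    final var allocation = createRoutingAllocationWithoutWriteLoads();
    final var simulator = new WriteLoadPerShardSimulator(allocation);
    final var before = simulator.nodeUsageStatsForThreadPools();

    // Starting a shard with an unknown write load should be a no-op.
    simulator.simulateShardStarted(randomShardOn(allocation, "node_0")); // hypothetical helper
    assertEquals(before, simulator.nodeUsageStatsForThreadPools());
}
```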
.getNodeUsageStatsForThreadPools()
.entrySet()
.stream()
.collect(Collectors.toUnmodifiableMap(Map.Entry::getKey, e -> {
So the ClusterInfoSimulator#getClusterInfo method will replace the ClusterInfo set on the RoutingAllocation object repeatedly (inside a while loop) throughout a desired-balance computation. The ClusterInfoSimulator (and consequently this shard write load simulator) is created once at the start of compute(); it's never reset.
So, if I am understanding the code correctly, the diffs in this write load simulator will continue to be added to, never reset, but will be applied on top of a ClusterInfo that keeps getting updated by the diffs. This doesn't seem like what we want?
The ClusterInfoSimulator never looks at the RoutingAllocation's ClusterInfo again after initializing the ClusterInfoSimulator's private variables as a starting point.
I've been back and forth on this a bit, but I think going for something simple is best. When we start receiving shard write load estimates from the nodes, we should be able to plug those in and this should "just work" (assuming I've understood shard-level write load correctly).