Commit 4c97d87

jan-elastic authored and albertzaharovits committed
Inference autoscaling (#109667)
* Python dev tool for inference autoscaling simulation.

Squashed commit of the following:

commit d98bd3d39d833329ab83a8274885473db41ed08a
Author: Jan Kuipers <[email protected]>
Date:   Mon May 13 17:27:38 2024 +0200

    Increase measurement interval to 10secs

commit e808ae5be52c5ea4d5ff8ccb881a4a80de0254f9
Author: Jan Kuipers <[email protected]>
Date:   Mon May 13 17:09:33 2024 +0200

    jump -> jumps

commit c38cbdebfcec43e6982bb8bd1670519293161154
Author: Jan Kuipers <[email protected]>
Date:   Mon May 13 14:32:42 2024 +0200

    Remove unused estimator

commit 16101f32b539481cd4d648ebb5637a3309853552
Author: Jan Kuipers <[email protected]>
Date:   Mon May 13 14:31:30 2024 +0200

    Measure latency periodically + documentation

commit bc73bf29fde1d772701f0b71a7c8a0908669eb0f
Author: Jan Kuipers <[email protected]>
Date:   Mon May 13 12:53:19 2024 +0200

    Init variance to None

commit 0e73fa836fa9deec6ba55ef1161cc0dd71f35044
Author: Jan Kuipers <[email protected]>
Date:   Mon May 13 11:18:21 2024 +0200

    No autodetection of dynamics changes for latency

commit 75924a744d26a72835529598a6df1a2d22bdaddc
Author: Jan Kuipers <[email protected]>
Date:   Mon May 13 10:10:34 2024 +0200

    Move autoscaling code to own class

commit 23553bb8cccd6ed80ac667b12ec38a6d5562dd29
Author: Jan Kuipers <[email protected]>
Date:   Wed May 8 18:01:55 2024 +0200

    Improved autoscaling simulation

commit 2db606b2bba69d741fa231f369c633ea793294d5
Author: Tom Veasey <[email protected]>
Date:   Tue Apr 30 15:01:40 2024 +0100

    Correct the dependency on allocations

commit 0e45cfbaf901cf9d440efa9b404058a67d000653
Author: Tom Veasey <[email protected]>
Date:   Tue Apr 30 11:11:05 2024 +0100

    Tweak

commit a0f23a4a05875cd5df3863e5ad067b46a67c8cda
Author: Tom Veasey <[email protected]>
Date:   Tue Apr 30 11:09:30 2024 +0100

    Correction

commit f9cdb140d298bd99c64c79f020c058d60bfba134
Author: Tom Veasey <[email protected]>
Date:   Tue Apr 30 09:57:59 2024 +0100

    Allow extrapolation

commit 57eb1a661a2b97412f479606c23c54dfb7887f52
Author: Tom Veasey <[email protected]>
Date:   Tue Apr 30 09:55:17 2024 +0100

    Simplify and estimate average duration rather than rate

commit 36dff17194f2bcf816013b112cf07d70c9eec161
Author: Tom Veasey <[email protected]>
Date:   Mon Apr 29 21:42:25 2024 +0100

    Kalman filter for simple state model for average inference duration as a function of time and allocation count

commit a1b85bd0deeabd5162f2ccd5a28672299025cee5
Author: Jan Kuipers <[email protected]>
Date:   Mon Apr 29 12:15:59 2024 +0200

    Improvements

commit 51040655fcfbfd221f2446542a955fb0f19fb145
Author: Jan Kuipers <[email protected]>
Date:   Mon Apr 29 09:33:10 2024 +0200

    Account for virtual cores / hyperthreading

commit 7a93407ecae6b6044108299a1d05f72cdf0d752a
Author: Jan Kuipers <[email protected]>
Date:   Fri Apr 26 16:58:25 2024 +0200

    Simulator for inference autoscaling.

* Better process variance upon dynamics changes, and propagate dynamics changes to the next iteration.
* Inference autoscaling (WIP)
* Inference autoscaling test scripts
* Debug logs
* Inference autoscaling API
* Update Autoscalers upon cluster changes
* Polish code / fix bugs
* Use correct string formatter
* More fixes
* Autoscaling tests
* spotless
* Remove scripts (moved to ml-data)
* Rebrand to "adaptive allocations".
* Move serialized field to end
* Rebranding leftover
* Improve adaptive allocation timing
* SystemAuditor for scaling messages
* Fix test
* Add documentation
* Update docs/changelog/109667.yaml
* Cooldown of 5mins after scaleup
* Polish code
* High-variance adaptive allocations test
* Fix AdaptiveAllocationsScalerServiceTests
* Fix typo in package name
* Wire adaptive allocations setting into put inference API
* Checkstyle
* Fix serialization of ElserInternalServiceSettings.
* Propagate adaptive allocations settings from put inference request to create trained model request
* Fix CustomElandInternalTextEmbeddingServiceSettingsTests
* Javadocs
* Improvements / fixes
* Disallow setting num_allocations when adaptive allocations is enabled
* Fix AdaptiveAllocationsScalerServiceTests
* spotless
* NPE fixes
* spotless
* Allow autoscaler to update num allocations
* Fix AdaptiveAllocationsScalerServiceTests.
* Fix bug in inference stats api
* Fix PyTorchResultProcessorTests
1 parent 479dd67 commit 4c97d87

72 files changed: +2517 -445 lines changed

docs/changelog/109667.yaml

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
+pr: 109667
+summary: Inference autoscaling
+area: Machine Learning
+type: feature
+issues: []

server/src/main/java/org/elasticsearch/TransportVersions.java

Lines changed: 1 addition & 0 deletions
@@ -210,6 +210,7 @@ static TransportVersion def(int id) {
     public static final TransportVersion VERSIONED_MASTER_NODE_REQUESTS = def(8_701_00_0);
     public static final TransportVersion ML_INFERENCE_AMAZON_BEDROCK_ADDED = def(8_702_00_0);
     public static final TransportVersion ML_INFERENCE_DONT_DELETE_WHEN_SEMANTIC_TEXT_EXISTS = def(8_703_00_0);
+    public static final TransportVersion INFERENCE_ADAPTIVE_ALLOCATIONS = def(8_704_00_0);

     /*
      * STOP! READ THIS FIRST! No, really,

x-pack/plugin/core/src/main/java/org/elasticsearch/xpack/core/ml/action/CreateTrainedModelAssignmentAction.java

Lines changed: 20 additions & 3 deletions
@@ -7,6 +7,7 @@

 package org.elasticsearch.xpack.core.ml.action;

+import org.elasticsearch.TransportVersions;
 import org.elasticsearch.action.ActionRequestValidationException;
 import org.elasticsearch.action.ActionResponse;
 import org.elasticsearch.action.ActionType;
@@ -18,6 +19,7 @@
 import org.elasticsearch.xcontent.ToXContentObject;
 import org.elasticsearch.xcontent.XContentBuilder;
 import org.elasticsearch.xcontent.XContentParser;
+import org.elasticsearch.xpack.core.ml.inference.assignment.AdaptiveAllocationsSettings;
 import org.elasticsearch.xpack.core.ml.inference.assignment.TrainedModelAssignment;
 import org.elasticsearch.xpack.core.ml.utils.ExceptionsHelper;

@@ -34,15 +36,22 @@ private CreateTrainedModelAssignmentAction() {

     public static class Request extends MasterNodeRequest<Request> {
         private final StartTrainedModelDeploymentAction.TaskParams taskParams;
+        private final AdaptiveAllocationsSettings adaptiveAllocationsSettings;

-        public Request(StartTrainedModelDeploymentAction.TaskParams taskParams) {
+        public Request(StartTrainedModelDeploymentAction.TaskParams taskParams, AdaptiveAllocationsSettings adaptiveAllocationsSettings) {
             super(TRAPPY_IMPLICIT_DEFAULT_MASTER_NODE_TIMEOUT);
             this.taskParams = ExceptionsHelper.requireNonNull(taskParams, "taskParams");
+            this.adaptiveAllocationsSettings = adaptiveAllocationsSettings;
         }

         public Request(StreamInput in) throws IOException {
             super(in);
             this.taskParams = new StartTrainedModelDeploymentAction.TaskParams(in);
+            if (in.getTransportVersion().onOrAfter(TransportVersions.INFERENCE_ADAPTIVE_ALLOCATIONS)) {
+                this.adaptiveAllocationsSettings = in.readOptionalWriteable(AdaptiveAllocationsSettings::new);
+            } else {
+                this.adaptiveAllocationsSettings = null;
+            }
         }

         @Override
@@ -54,24 +63,32 @@ public ActionRequestValidationException validate() {
         public void writeTo(StreamOutput out) throws IOException {
             super.writeTo(out);
             taskParams.writeTo(out);
+            if (out.getTransportVersion().onOrAfter(TransportVersions.INFERENCE_ADAPTIVE_ALLOCATIONS)) {
+                out.writeOptionalWriteable(adaptiveAllocationsSettings);
+            }
         }

         @Override
         public boolean equals(Object o) {
             if (this == o) return true;
             if (o == null || getClass() != o.getClass()) return false;
             Request request = (Request) o;
-            return Objects.equals(taskParams, request.taskParams);
+            return Objects.equals(taskParams, request.taskParams)
+                && Objects.equals(adaptiveAllocationsSettings, request.adaptiveAllocationsSettings);
         }

         @Override
         public int hashCode() {
-            return Objects.hash(taskParams);
+            return Objects.hash(taskParams, adaptiveAllocationsSettings);
         }

         public StartTrainedModelDeploymentAction.TaskParams getTaskParams() {
             return taskParams;
         }
+
+        public AdaptiveAllocationsSettings getAdaptiveAllocationsSettings() {
+            return adaptiveAllocationsSettings;
+        }
     }

     public static class Response extends ActionResponse implements ToXContentObject {
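
Reviewer note on the hunks above: the new field is only read and written when the peer's transport version is at or after `INFERENCE_ADAPTIVE_ALLOCATIONS`, so older nodes never see the extra bytes. The following is a minimal, JDK-only sketch of that pattern. `VersionGatedField`, its methods, the plain `int` version, and the use of a single nullable `Integer` in place of `AdaptiveAllocationsSettings` are all hypothetical stand-ins for illustration; this is not the Elasticsearch `StreamInput`/`StreamOutput` API.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical, JDK-only stand-in for a version-gated optional field.
class VersionGatedField {
    // Assumed numeric id, mirroring def(8_704_00_0) in the diff.
    static final int ADAPTIVE_ALLOCATIONS_VERSION = 8_704_00_0;

    // Writes the always-present field, then the optional one only for new-enough peers.
    static byte[] write(int peerVersion, String taskParams, Integer minAllocations) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(taskParams);
            if (peerVersion >= ADAPTIVE_ALLOCATIONS_VERSION) {
                // Presence flag followed by the value, similar in spirit to an optional writeable.
                out.writeBoolean(minAllocations != null);
                if (minAllocations != null) {
                    out.writeInt(minAllocations);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Reads the optional field back; peers older than the gate always yield null.
    static Integer read(int peerVersion, byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            in.readUTF(); // skip the always-present field
            if (peerVersion >= ADAPTIVE_ALLOCATIONS_VERSION) {
                return in.readBoolean() ? Integer.valueOf(in.readInt()) : null;
            }
            return null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In this sketch both sides must apply the same version check, otherwise the reader would consume bytes the writer never produced; that is why the diff gates the constructor and `writeTo` on the same constant.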

x-pack/plugin/core/src/main/java/org/elasticsearch/xpack/core/ml/action/StartTrainedModelDeploymentAction.java

Lines changed: 77 additions & 13 deletions
@@ -29,8 +29,10 @@
 import org.elasticsearch.xcontent.XContentParser;
 import org.elasticsearch.xpack.core.ml.MlConfigVersion;
 import org.elasticsearch.xpack.core.ml.inference.TrainedModelConfig;
+import org.elasticsearch.xpack.core.ml.inference.assignment.AdaptiveAllocationsSettings;
 import org.elasticsearch.xpack.core.ml.inference.assignment.AllocationStatus;
 import org.elasticsearch.xpack.core.ml.inference.assignment.Priority;
+import org.elasticsearch.xpack.core.ml.inference.assignment.TrainedModelAssignment;
 import org.elasticsearch.xpack.core.ml.job.messages.Messages;
 import org.elasticsearch.xpack.core.ml.utils.ExceptionsHelper;
 import org.elasticsearch.xpack.core.ml.utils.MlTaskParams;
@@ -40,7 +42,6 @@
 import java.util.Optional;
 import java.util.concurrent.TimeUnit;

-import static org.elasticsearch.xcontent.ConstructingObjectParser.optionalConstructorArg;
 import static org.elasticsearch.xpack.core.ml.MlTasks.trainedModelAssignmentTaskDescription;

 public class StartTrainedModelDeploymentAction extends ActionType<CreateTrainedModelAssignmentAction.Response> {
@@ -99,6 +100,7 @@ public static class Request extends MasterNodeRequest<Request> implements ToXCon
         public static final ParseField QUEUE_CAPACITY = TaskParams.QUEUE_CAPACITY;
         public static final ParseField CACHE_SIZE = TaskParams.CACHE_SIZE;
         public static final ParseField PRIORITY = TaskParams.PRIORITY;
+        public static final ParseField ADAPTIVE_ALLOCATIONS = TrainedModelAssignment.ADAPTIVE_ALLOCATIONS;

         public static final ObjectParser<Request, Void> PARSER = new ObjectParser<>(NAME, Request::new);

@@ -117,6 +119,12 @@ public static class Request extends MasterNodeRequest<Request> implements ToXCon
                 ObjectParser.ValueType.VALUE
             );
             PARSER.declareString(Request::setPriority, PRIORITY);
+            PARSER.declareObjectOrNull(
+                Request::setAdaptiveAllocationsSettings,
+                (p, c) -> AdaptiveAllocationsSettings.PARSER.parse(p, c).build(),
+                null,
+                ADAPTIVE_ALLOCATIONS
+            );
         }

         public static Request parseRequest(String modelId, String deploymentId, XContentParser parser) {
@@ -140,7 +148,8 @@ public static Request parseRequest(String modelId, String deploymentId, XContent
         private TimeValue timeout = DEFAULT_TIMEOUT;
         private AllocationStatus.State waitForState = DEFAULT_WAITFOR_STATE;
         private ByteSizeValue cacheSize;
-        private int numberOfAllocations = DEFAULT_NUM_ALLOCATIONS;
+        private Integer numberOfAllocations;
+        private AdaptiveAllocationsSettings adaptiveAllocationsSettings = null;
         private int threadsPerAllocation = DEFAULT_NUM_THREADS;
         private int queueCapacity = DEFAULT_QUEUE_CAPACITY;
         private Priority priority = DEFAULT_PRIORITY;
@@ -160,7 +169,11 @@ public Request(StreamInput in) throws IOException {
             modelId = in.readString();
             timeout = in.readTimeValue();
             waitForState = in.readEnum(AllocationStatus.State.class);
-            numberOfAllocations = in.readVInt();
+            if (in.getTransportVersion().onOrAfter(TransportVersions.INFERENCE_ADAPTIVE_ALLOCATIONS)) {
+                numberOfAllocations = in.readOptionalVInt();
+            } else {
+                numberOfAllocations = in.readVInt();
+            }
             threadsPerAllocation = in.readVInt();
             queueCapacity = in.readVInt();
             if (in.getTransportVersion().onOrAfter(TransportVersions.V_8_4_0)) {
@@ -171,12 +184,16 @@ public Request(StreamInput in) throws IOException {
             } else {
                 this.priority = Priority.NORMAL;
             }
-
             if (in.getTransportVersion().onOrAfter(TransportVersions.V_8_8_0)) {
                 this.deploymentId = in.readString();
             } else {
                 this.deploymentId = modelId;
             }
+            if (in.getTransportVersion().onOrAfter(TransportVersions.INFERENCE_ADAPTIVE_ALLOCATIONS)) {
+                this.adaptiveAllocationsSettings = in.readOptionalWriteable(AdaptiveAllocationsSettings::new);
+            } else {
+                this.adaptiveAllocationsSettings = null;
+            }
         }

         public final void setModelId(String modelId) {
@@ -212,14 +229,34 @@ public Request setWaitForState(AllocationStatus.State waitForState) {
             return this;
         }

-        public int getNumberOfAllocations() {
+        public Integer getNumberOfAllocations() {
             return numberOfAllocations;
         }

-        public void setNumberOfAllocations(int numberOfAllocations) {
+        public int computeNumberOfAllocations() {
+            if (numberOfAllocations != null) {
+                return numberOfAllocations;
+            } else {
+                if (adaptiveAllocationsSettings == null || adaptiveAllocationsSettings.getMinNumberOfAllocations() == null) {
+                    return DEFAULT_NUM_ALLOCATIONS;
+                } else {
+                    return adaptiveAllocationsSettings.getMinNumberOfAllocations();
+                }
+            }
+        }
+
+        public void setNumberOfAllocations(Integer numberOfAllocations) {
             this.numberOfAllocations = numberOfAllocations;
         }

+        public AdaptiveAllocationsSettings getAdaptiveAllocationsSettings() {
+            return adaptiveAllocationsSettings;
+        }
+
+        public void setAdaptiveAllocationsSettings(AdaptiveAllocationsSettings adaptiveAllocationsSettings) {
+            this.adaptiveAllocationsSettings = adaptiveAllocationsSettings;
+        }
+
         public int getThreadsPerAllocation() {
             return threadsPerAllocation;
         }
@@ -258,7 +295,11 @@ public void writeTo(StreamOutput out) throws IOException {
             out.writeString(modelId);
             out.writeTimeValue(timeout);
             out.writeEnum(waitForState);
-            out.writeVInt(numberOfAllocations);
+            if (out.getTransportVersion().onOrAfter(TransportVersions.INFERENCE_ADAPTIVE_ALLOCATIONS)) {
+                out.writeOptionalVInt(numberOfAllocations);
+            } else {
+                out.writeVInt(numberOfAllocations);
+            }
             out.writeVInt(threadsPerAllocation);
             out.writeVInt(queueCapacity);
             if (out.getTransportVersion().onOrAfter(TransportVersions.V_8_4_0)) {
@@ -270,6 +311,9 @@ public void writeTo(StreamOutput out) throws IOException {
             if (out.getTransportVersion().onOrAfter(TransportVersions.V_8_8_0)) {
                 out.writeString(deploymentId);
             }
+            if (out.getTransportVersion().onOrAfter(TransportVersions.INFERENCE_ADAPTIVE_ALLOCATIONS)) {
+                out.writeOptionalWriteable(adaptiveAllocationsSettings);
+            }
         }

         @Override
@@ -279,7 +323,12 @@ public XContentBuilder toXContent(XContentBuilder builder, Params params) throws
             builder.field(DEPLOYMENT_ID.getPreferredName(), deploymentId);
             builder.field(TIMEOUT.getPreferredName(), timeout.getStringRep());
             builder.field(WAIT_FOR.getPreferredName(), waitForState);
-            builder.field(NUMBER_OF_ALLOCATIONS.getPreferredName(), numberOfAllocations);
+            if (numberOfAllocations != null) {
+                builder.field(NUMBER_OF_ALLOCATIONS.getPreferredName(), numberOfAllocations);
+            }
+            if (adaptiveAllocationsSettings != null) {
+                builder.field(ADAPTIVE_ALLOCATIONS.getPreferredName(), adaptiveAllocationsSettings);
+            }
             builder.field(THREADS_PER_ALLOCATION.getPreferredName(), threadsPerAllocation);
             builder.field(QUEUE_CAPACITY.getPreferredName(), queueCapacity);
             if (cacheSize != null) {
@@ -301,12 +350,25 @@ public ActionRequestValidationException validate() {
                         + Strings.arrayToCommaDelimitedString(VALID_WAIT_STATES)
                 );
             }
-            if (numberOfAllocations < 1) {
-                validationException.addValidationError("[" + NUMBER_OF_ALLOCATIONS + "] must be a positive integer");
+            if (numberOfAllocations != null) {
+                if (numberOfAllocations < 1) {
+                    validationException.addValidationError("[" + NUMBER_OF_ALLOCATIONS + "] must be a positive integer");
+                }
+                if (adaptiveAllocationsSettings != null && adaptiveAllocationsSettings.getEnabled()) {
+                    validationException.addValidationError(
+                        "[" + NUMBER_OF_ALLOCATIONS + "] cannot be set if adaptive allocations is enabled"
+                    );
+                }
             }
             if (threadsPerAllocation < 1) {
                 validationException.addValidationError("[" + THREADS_PER_ALLOCATION + "] must be a positive integer");
             }
+            ActionRequestValidationException autoscaleException = adaptiveAllocationsSettings == null
+                ? null
+                : adaptiveAllocationsSettings.validate();
+            if (autoscaleException != null) {
+                validationException.addValidationErrors(autoscaleException.validationErrors());
+            }
             if (threadsPerAllocation > MAX_THREADS_PER_ALLOCATION || isPowerOf2(threadsPerAllocation) == false) {
                 validationException.addValidationError(
                     "[" + THREADS_PER_ALLOCATION + "] must be a power of 2 less than or equal to " + MAX_THREADS_PER_ALLOCATION
@@ -322,7 +384,7 @@ public ActionRequestValidationException validate() {
                 validationException.addValidationError("[" + TIMEOUT + "] must be positive");
             }
             if (priority == Priority.LOW) {
-                if (numberOfAllocations > 1) {
+                if (numberOfAllocations != null && numberOfAllocations > 1) {
                     validationException.addValidationError("[" + NUMBER_OF_ALLOCATIONS + "] must be 1 when [" + PRIORITY + "] is low");
                 }
                 if (threadsPerAllocation > 1) {
@@ -344,6 +406,7 @@ public int hashCode() {
                 timeout,
                 waitForState,
                 numberOfAllocations,
+                adaptiveAllocationsSettings,
                 threadsPerAllocation,
                 queueCapacity,
                 cacheSize,
@@ -365,7 +428,8 @@ public boolean equals(Object obj) {
                 && Objects.equals(timeout, other.timeout)
                 && Objects.equals(waitForState, other.waitForState)
                 && Objects.equals(cacheSize, other.cacheSize)
-                && numberOfAllocations == other.numberOfAllocations
+                && Objects.equals(numberOfAllocations, other.numberOfAllocations)
+                && Objects.equals(adaptiveAllocationsSettings, other.adaptiveAllocationsSettings)
                 && threadsPerAllocation == other.threadsPerAllocation
                 && queueCapacity == other.queueCapacity
                 && priority == other.priority;
@@ -430,7 +494,7 @@ public static boolean mayAssignToNode(@Nullable DiscoveryNode node) {
             PARSER.declareInt(ConstructingObjectParser.optionalConstructorArg(), THREADS_PER_ALLOCATION);
             PARSER.declareInt(ConstructingObjectParser.constructorArg(), QUEUE_CAPACITY);
             PARSER.declareField(
-                optionalConstructorArg(),
+                ConstructingObjectParser.optionalConstructorArg(),
                 (p, c) -> ByteSizeValue.parseBytesSizeValue(p.text(), CACHE_SIZE.getPreferredName()),
                 CACHE_SIZE,
                 ObjectParser.ValueType.VALUE