
Conversation

GabrielCWT
Contributor

Description

The aim of this PR is to enable changing the capacity of the KubernetesTaskRunner. This is done through the existing POST API /druid/indexer/v1/k8s/taskrunner/executionconfig.

K8s configuration changes

In order to do this, I have added a new interface KubernetesTaskRunnerConfig and renamed the existing config to KubernetesTaskRunnerStaticConfig. The interface is implemented by the existing static config and by a new KubernetesTaskRunnerEffectiveConfig, a wrapper class that encapsulates both the dynamic and static configs.

The effective config will fall back to the static config's capacity if the dynamic config has not been set.
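For illustration, a minimal sketch of that fallback; only the three class/interface names come from the PR, and everything else (the single-getter interface, the dynamic-capacity supplier) is a simplifying assumption rather than the actual code:

```java
import java.util.function.Supplier;

// Minimal sketch of the capacity fallback described above. Only the names
// KubernetesTaskRunnerConfig, KubernetesTaskRunnerStaticConfig and
// KubernetesTaskRunnerEffectiveConfig come from the PR; the single-getter
// interface and the nullable dynamic-capacity supplier are assumptions.
interface KubernetesTaskRunnerConfig
{
  int getCapacity();
}

class KubernetesTaskRunnerStaticConfig implements KubernetesTaskRunnerConfig
{
  private final int capacity;

  KubernetesTaskRunnerStaticConfig(int capacity)
  {
    this.capacity = capacity;
  }

  @Override
  public int getCapacity()
  {
    return capacity;
  }
}

class KubernetesTaskRunnerEffectiveConfig implements KubernetesTaskRunnerConfig
{
  private final KubernetesTaskRunnerStaticConfig staticConfig;
  private final Supplier<Integer> dynamicCapacity; // returns null when not set dynamically

  KubernetesTaskRunnerEffectiveConfig(KubernetesTaskRunnerStaticConfig staticConfig, Supplier<Integer> dynamicCapacity)
  {
    this.staticConfig = staticConfig;
    this.dynamicCapacity = dynamicCapacity;
  }

  @Override
  public int getCapacity()
  {
    // Fall back to the static capacity when no dynamic value has been set.
    Integer fromDynamicConfig = dynamicCapacity.get();
    return fromDynamicConfig != null ? fromDynamicConfig : staticConfig.getCapacity();
  }
}
```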

Changes to /druid/indexer/v1/k8s/taskrunner/executionconfig behaviour

The API now takes a new capacity field. In addition, all fields are now optional: if a field is null or not passed, the existing dynamic config value is kept.
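A rough sketch of these merge semantics, with the dynamic config simplified to two fields (the second field and the merge helper are purely illustrative, not the exact code in the PR):

```java
// Sketch of the "null means keep the current value" behaviour described above.
// For example, posting {"capacity": 20} to
// /druid/indexer/v1/k8s/taskrunner/executionconfig would update only the
// capacity and leave the other fields unchanged.
class KubernetesTaskRunnerDynamicConfig
{
  private final Integer capacity;                 // null = not set in the request
  private final String podTemplateSelectStrategy; // illustrative second field

  KubernetesTaskRunnerDynamicConfig(Integer capacity, String podTemplateSelectStrategy)
  {
    this.capacity = capacity;
    this.podTemplateSelectStrategy = podTemplateSelectStrategy;
  }

  /** Merges an incoming update into the currently stored config. */
  KubernetesTaskRunnerDynamicConfig mergeWith(KubernetesTaskRunnerDynamicConfig update)
  {
    return new KubernetesTaskRunnerDynamicConfig(
        update.capacity != null ? update.capacity : this.capacity,
        update.podTemplateSelectStrategy != null
            ? update.podTemplateSelectStrategy
            : this.podTemplateSelectStrategy
    );
  }

  Integer getCapacity()
  {
    return capacity;
  }
}
```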

Release note

A new capacity field has been added to the /druid/indexer/v1/k8s/taskrunner/executionconfig POST API. It changes the capacity of the KubernetesTaskRunner.

Challenges

To update the capacity for the task runner, I am calling a new method syncCapacityWithDynamicConfig before every task is run, which updates the thread pool to the newest config.

The issue with this is that changes made by the user are not immediately reflected in the "Tasks" widget on the web console's homepage; the "task slots" count is only updated after a new task has run.

I could not find a way to add a callback on updates to the dynamic configuration, and I felt that checking every few seconds to see whether the dynamic configuration had been updated was unnecessarily complex. I am open to suggestions if there are better ways to update the task runner.
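For concreteness, a sketch of the sync-before-run approach, assuming the runner tracks tasks with a resizable ThreadPoolExecutor; apart from the name syncCapacityWithDynamicConfig, everything here is illustrative rather than the PR's actual code:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadPoolExecutor;

// Illustrative sketch: re-read the effective capacity and resize the
// status-tracking pool just before each task is submitted.
class CapacitySyncSketch
{
  private final ThreadPoolExecutor taskExecutor =
      (ThreadPoolExecutor) Executors.newFixedThreadPool(10);
  private volatile int currentCapacity = 10;

  private void syncCapacityWithDynamicConfig(int effectiveCapacity)
  {
    if (effectiveCapacity == currentCapacity) {
      return;
    }
    if (effectiveCapacity > currentCapacity) {
      // Grow the max bound first so core <= max holds throughout the resize.
      taskExecutor.setMaximumPoolSize(effectiveCapacity);
      taskExecutor.setCorePoolSize(effectiveCapacity);
    } else {
      taskExecutor.setCorePoolSize(effectiveCapacity);
      taskExecutor.setMaximumPoolSize(effectiveCapacity);
    }
    currentCapacity = effectiveCapacity;
  }

  void submit(Runnable task, int effectiveCapacity)
  {
    syncCapacityWithDynamicConfig(effectiveCapacity); // called before every task run
    taskExecutor.execute(task);
  }
}
```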

[Screenshot: 2025-10-01 at 5:47:05 PM]

```java
@Override
public boolean isSidecarSupport()
{
  return staticConfig.isSidecarSupport();
}
```
Check notice (Code scanning / CodeQL): Deprecated method or constructor invocation
Invoking KubernetesTaskRunnerStaticConfig.isSidecarSupport should be avoided because it has been deprecated.
Contributor

@GWphua left a comment


LGTM, can remove the deprecated defaultIfNull calls.

@kfaraz
Contributor

kfaraz commented Oct 3, 2025

@GabrielCWT , thanks for putting this together.

I might be mistaken but I am not entirely sure if the task capacity of a cluster should be a dynamic config. For any cluster admin, this can have large cost implications. It seems odd to be able to override the capacity of the cluster defined in the Overlord runtime property simply by calling an API.

Changing the task capacity should not be a very frequent requirement for any cluster.
And when needed, it should be fairly reasonable to require an Overlord restart.

Could you please elaborate on why you feel the current setup is not adequate?

@GabrielCWT
Contributor Author

> @GabrielCWT , thanks for putting this together.
>
> I might be mistaken but I am not entirely sure if the task capacity of a cluster should be a dynamic config. For any cluster admin, this can have large cost implications. It seems odd to be able to override the capacity of the cluster defined in the Overlord runtime property simply by calling an API.
>
> Changing the task capacity should not be a very frequent requirement for any cluster. And when needed, it should be fairly reasonable to require an Overlord restart.
>
> Could you please elaborate on why you feel the current setup is not adequate?

Restarting the Overlord can be time-consuming and potentially risky as we could face issues when trying to redeploy the Overlord instance. Updating the task capacity dynamically allows admins to adjust the cluster safely, reducing operational downtime and complexity. While changes to task capacity are infrequent, I feel that providing a safer runtime option would help to minimize disruptions.

@FrankChen021
Member

> @GabrielCWT , thanks for putting this together.
>
> I might be mistaken but I am not entirely sure if the task capacity of a cluster should be a dynamic config. For any cluster admin, this can have large cost implications. It seems odd to be able to override the capacity of the cluster defined in the Overlord runtime property simply by calling an API.
>
> Changing the task capacity should not be a very frequent requirement for any cluster. And when needed, it should be fairly reasonable to require an Overlord restart.
>
> Could you please elaborate on why you feel the current setup is not adequate?

I think it's true that changing the task slots for the MiddleManager requires a restart, since we might need to redeploy the MiddleManager to servers with bigger resources. But for K8s-based task scheduling, the resources are allocated on the K8s side, outside of Druid, so restarting the Overlord does not make much sense; we should have the ability to reload the capacity dynamically. As @GabrielCWT has stated above, restarting the Overlord is a heavy and risky operation in production.

@kfaraz
Contributor

kfaraz commented Oct 3, 2025

Thanks for the responses, @FrankChen021 , @GabrielCWT .

Even though with the K8s task runner, the task pods are not technically a part of the Druid cluster, the Overlord still has to manage those tasks. The KubernetesTaskRunner itself keeps separate threads for each running task to track their status (this PR also updates that thread count, if I am not mistaken).

So I should imagine that a major change in task capacity would also require some kind of scaling of the Overlord itself.

Also, why is the Overlord restart "risky" or even "slow"? Doesn't a version upgrade require an Overlord restart too?

@FrankChen021
Member

> Also, why is the Overlord restart "risky" or even "slow"? Doesn't a version upgrade require an Overlord restart too?

In a K8s deployment, increasing the Overlord's resources may not always be needed after increasing the task capacity. For example, we generally set the CPU limit to a higher value while keeping the CPU request relatively low in the initial deployment, so when the capacity is increased there is no need to increase the CPU resources. I mean risky because the Overlord needs to restore all tasks; previously we had some problems (maybe a bug) where, after switching leaders, the Overlord failed to elect a new leader. We try our best not to restart the Coordinator/Overlord in production.

@kfaraz
Contributor

kfaraz commented Oct 6, 2025

> I mean risky because the Overlord needs to restore all tasks; previously we had some problems (maybe a bug) where, after switching leaders, the Overlord failed to elect a new leader.

Yes, there might be some bugs around that. Also, the K8s task runner makes certain list-pod calls, which are pretty heavy and need to be addressed. I think @capistrant is doing some work to improve that code flow.

> We try our best not to restart the Coordinator/Overlord in production.

Oh, how frequently do you upgrade your cluster? Is changing the task capacity going to be much more frequent than that?

I agree that the K8s task runner is buggy and we should improve upon it. But making the task capacity dynamic doesn't seem like the best solution. It will open a whole other can of worms and only make this piece more complicated.

Instead, we should try to fix the actual problems in the task runner that make the Overlord leader switch erroneous.

What are your thoughts, @FrankChen021 ?

@FrankChen021
Member

> > I mean risky because the Overlord needs to restore all tasks; previously we had some problems (maybe a bug) where, after switching leaders, the Overlord failed to elect a new leader.
>
> Yes, there might be some bugs around that. Also, the K8s task runner makes certain list-pod calls, which are pretty heavy and need to be addressed. I think @capistrant is doing some work to improve that code flow.
>
> > We try our best not to restart the Coordinator/Overlord in production.
>
> Oh, how frequently do you upgrade your cluster? Is changing the task capacity going to be much more frequent than that?

We don't upgrade clusters very frequently, maybe once a year or even less often. But we do adjust the capacity (upsize or downsize) regularly based on load and requirements.

> I agree that the K8s task runner is buggy and we should improve upon it. But making the task capacity dynamic doesn't seem like the best solution. It will open a whole other can of worms and only make this piece more complicated.
>
> Instead, we should try to fix the actual problems in the task runner that make the Overlord leader switch erroneous.
>
> What are your thoughts, @FrankChen021 ?

The main idea of dynamic configuration is not to circumvent problems during the restart phase; it's about reducing operational complexity and saving time. Even if restarting the Overlord is smooth, I don't think changing such a configuration should require a restart from the users'/operators' point of view. For static configurations, operators have to change configuration files, sync the files to Kubernetes, and restart components; it's a heavy workflow.

FrankChen021 added this to the 35.0.0 milestone on Oct 8, 2025
@kfaraz
Contributor

kfaraz commented Oct 8, 2025

Thanks for the clarification, @FrankChen021 !

I am just a little apprehensive since the K8s task runner is already pretty buggy.
Also, it feels weird to have a config be specified by both static and dynamic means.
But I suppose the static config can be thought of as the default value.

I haven't gone through the whole PR yet. Will try to do a thorough review today.
@GabrielCWT , it would be nice if you could hold off on merging this PR until then.

@FrankChen021
Member

Hi @kfaraz, are you reviewing this PR? I hope we can merge it into Druid 35.

}
}

private void syncCapacityWithDynamicConfig()
Contributor


This code is not maintainable.
Can we do this in a separate thread instead of calling sync everywhere?
Is currentCapacity thread safe?

Contributor Author


It is possible to make currentCapacity thread safe.

> Can we do this in a separate thread instead of calling sync everywhere?

This was an alternative solution, though I am not sure if we are thinking of the same implementation. My solution was to have a thread which periodically checks whether there have been any changes to the capacity and updates currentCapacity accordingly. However, I wasn't sure if this was a waste of resources, as there would need to be a trade-off between responsiveness (how quickly the changes become visible) and resources (since we need to wake the thread up every X seconds).
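Roughly, that polling alternative would look like the sketch below; the 30-second period and the names are arbitrary, and this is not the approach taken in the PR:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch of the periodic-check alternative discussed above. A smaller period
// makes config changes visible sooner but wakes the thread more often; a
// larger period saves wake-ups at the cost of responsiveness.
class CapacityPollerSketch
{
  private final ScheduledExecutorService poller =
      Executors.newSingleThreadScheduledExecutor();

  void start(Runnable syncCapacityWithDynamicConfig)
  {
    poller.scheduleWithFixedDelay(syncCapacityWithDynamicConfig, 0, 30, TimeUnit.SECONDS);
  }
}
```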

Member


Using a separate thread to check is also a bad idea; it does not solve the real problem here but increases the complexity.

The real problem is that the config manager does not provide a notification mechanism when it detects configuration changes.

If we look at the config manager implementation, it provides a method swapIfNew to check and set new values. This is the place where we can add a notification. I think we can add a new overloaded watch method which accepts a Runnable as a callback. This callback is kept in the internal ConfigHolder.
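A rough sketch of the proposed hook, assuming a much-simplified ConfigHolder; only the names swapIfNew and watch come from the comment above, the rest is illustrative:

```java
import java.util.List;
import java.util.Objects;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicReference;

// Simplified stand-in for the real ConfigManager/ConfigHolder classes, showing
// where a change-notification callback could be attached.
class ConfigHolder<T>
{
  private final AtomicReference<T> reference = new AtomicReference<>();
  private final List<Runnable> watchers = new CopyOnWriteArrayList<>();

  /** Registers a callback that runs whenever the held config value changes. */
  void watch(Runnable callback)
  {
    watchers.add(callback);
  }

  /** Swaps in the new value and notifies watchers only if it actually changed. */
  boolean swapIfNew(T newValue)
  {
    T oldValue = reference.get();
    if (Objects.equals(oldValue, newValue)) {
      return false;
    }
    reference.set(newValue);
    watchers.forEach(Runnable::run);
    return true;
  }
}
```

With a hook like this, the task runner could register a single callback that resizes its pool, rather than calling a sync method before every task.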

Contributor Author


I've added observer support for ConfigHolders.

@kfaraz
Contributor

kfaraz commented Oct 10, 2025

> Hi @kfaraz, are you reviewing this PR? I hope we can merge it into Druid 35.

Yes, @FrankChen021 , I will go through the changes either today or tomorrow.
