
Conversation

GabrielCWT
Contributor

Description

The aim of this PR is to enable changing the capacity of the KubernetesTaskRunner. This is done through the existing POST API /druid/indexer/v1/k8s/taskrunner/executionconfig.

K8s configuration changes

In order to do this, I have added a new interface KubernetesTaskRunnerConfig and renamed the existing config to KubernetesTaskRunnerStaticConfig. The interface is implemented by the existing static config and by a new KubernetesTaskRunnerEffectiveConfig, a wrapper class that encapsulates both the dynamic and static configs.

The effective config will fall back to the static config's capacity if the dynamic config has not been set.
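For illustration, a minimal sketch of that fallback; only the three class/interface names come from the PR, and everything else (the single-getter interface, the dynamic-capacity supplier) is a simplifying assumption rather than the actual code:

```java
import java.util.function.Supplier;

// Minimal sketch of the capacity fallback described above. Only the names
// KubernetesTaskRunnerConfig, KubernetesTaskRunnerStaticConfig and
// KubernetesTaskRunnerEffectiveConfig come from the PR; the single-getter
// interface and the nullable dynamic-capacity supplier are assumptions.
interface KubernetesTaskRunnerConfig
{
  int getCapacity();
}

class KubernetesTaskRunnerStaticConfig implements KubernetesTaskRunnerConfig
{
  private final int capacity;

  KubernetesTaskRunnerStaticConfig(int capacity)
  {
    this.capacity = capacity;
  }

  @Override
  public int getCapacity()
  {
    return capacity;
  }
}

class KubernetesTaskRunnerEffectiveConfig implements KubernetesTaskRunnerConfig
{
  private final KubernetesTaskRunnerStaticConfig staticConfig;
  private final Supplier<Integer> dynamicCapacity; // returns null when not set dynamically

  KubernetesTaskRunnerEffectiveConfig(KubernetesTaskRunnerStaticConfig staticConfig, Supplier<Integer> dynamicCapacity)
  {
    this.staticConfig = staticConfig;
    this.dynamicCapacity = dynamicCapacity;
  }

  @Override
  public int getCapacity()
  {
    // Fall back to the static capacity when no dynamic value has been set.
    Integer fromDynamicConfig = dynamicCapacity.get();
    return fromDynamicConfig != null ? fromDynamicConfig : staticConfig.getCapacity();
  }
}
```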

Changes to /druid/indexer/v1/k8s/taskrunner/executionconfig behaviour

The API now takes a new capacity field. In addition, all fields are now optional: if a field is null or not passed, the existing dynamic config value is kept.
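A rough sketch of these merge semantics, with the dynamic config simplified to two fields (the second field and the merge helper are purely illustrative, not the exact code in the PR):

```java
// Sketch of the "null means keep the current value" behaviour described above.
// For example, posting {"capacity": 20} to
// /druid/indexer/v1/k8s/taskrunner/executionconfig would update only the
// capacity and leave the other fields unchanged.
class KubernetesTaskRunnerDynamicConfig
{
  private final Integer capacity;                 // null = not set in the request
  private final String podTemplateSelectStrategy; // illustrative second field

  KubernetesTaskRunnerDynamicConfig(Integer capacity, String podTemplateSelectStrategy)
  {
    this.capacity = capacity;
    this.podTemplateSelectStrategy = podTemplateSelectStrategy;
  }

  /** Merges an incoming update into the currently stored config. */
  KubernetesTaskRunnerDynamicConfig mergeWith(KubernetesTaskRunnerDynamicConfig update)
  {
    return new KubernetesTaskRunnerDynamicConfig(
        update.capacity != null ? update.capacity : this.capacity,
        update.podTemplateSelectStrategy != null
            ? update.podTemplateSelectStrategy
            : this.podTemplateSelectStrategy
    );
  }

  Integer getCapacity()
  {
    return capacity;
  }
}
```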

Release note

A new capacity field has been added to the /druid/indexer/v1/k8s/taskrunner/executionconfig POST API. It changes the capacity of the KubernetesTaskRunner.

Challenges

To update the capacity for the task runner, I am calling a new method syncCapacityWithDynamicConfig before every task is run, which updates the thread pool to the newest config.

The issue with this is that changes made by the user are not immediately reflected in the "Tasks" widget on the web console's homepage; the "task slots" count is only updated after a new task has run.

I could not find a way to add a callback on updates to the dynamic configuration, and I felt that checking every few seconds to see whether the dynamic configuration had been updated was unnecessarily complex. I am open to suggestions if there are better ways to update the task runner.
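For concreteness, a sketch of the sync-before-run approach, assuming the runner tracks tasks with a resizable ThreadPoolExecutor; apart from the name syncCapacityWithDynamicConfig, everything here is illustrative rather than the PR's actual code:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadPoolExecutor;

// Illustrative sketch: re-read the effective capacity and resize the
// status-tracking pool just before each task is submitted.
class CapacitySyncSketch
{
  private final ThreadPoolExecutor taskExecutor =
      (ThreadPoolExecutor) Executors.newFixedThreadPool(10);
  private volatile int currentCapacity = 10;

  private void syncCapacityWithDynamicConfig(int effectiveCapacity)
  {
    if (effectiveCapacity == currentCapacity) {
      return;
    }
    if (effectiveCapacity > currentCapacity) {
      // Grow the max bound first so core <= max holds throughout the resize.
      taskExecutor.setMaximumPoolSize(effectiveCapacity);
      taskExecutor.setCorePoolSize(effectiveCapacity);
    } else {
      taskExecutor.setCorePoolSize(effectiveCapacity);
      taskExecutor.setMaximumPoolSize(effectiveCapacity);
    }
    currentCapacity = effectiveCapacity;
  }

  void submit(Runnable task, int effectiveCapacity)
  {
    syncCapacityWithDynamicConfig(effectiveCapacity); // called before every task run
    taskExecutor.execute(task);
  }
}
```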

[Screenshot: 2025-10-01 at 5:47:05 PM]

```java
@Override
public boolean isSidecarSupport()
{
  return staticConfig.isSidecarSupport();
}
```
Check notice (Code scanning / CodeQL): Deprecated method or constructor invocation
Invoking KubernetesTaskRunnerStaticConfig.isSidecarSupport should be avoided because it has been deprecated.
Contributor

@GWphua left a comment


LGTM, can remove the deprecated defaultIfNull calls.

@kfaraz
Contributor

kfaraz commented Oct 3, 2025

@GabrielCWT , thanks for putting this together.

I might be mistaken but I am not entirely sure if the task capacity of a cluster should be a dynamic config. For any cluster admin, this can have large cost implications. It seems odd to be able to override the capacity of the cluster defined in the Overlord runtime property simply by calling an API.

Changing the task capacity should not be a very frequent requirement for any cluster.
And when needed, it should be fairly reasonable to require an Overlord restart.

Could you please elaborate on why you feel the current setup is not adequate?

@GabrielCWT
Contributor Author

> @GabrielCWT , thanks for putting this together.
>
> I might be mistaken but I am not entirely sure if the task capacity of a cluster should be a dynamic config. For any cluster admin, this can have large cost implications. It seems odd to be able to override the capacity of the cluster defined in the Overlord runtime property simply by calling an API.
>
> Changing the task capacity should not be a very frequent requirement for any cluster. And when needed, it should be fairly reasonable to require an Overlord restart.
>
> Could you please elaborate on why you feel the current setup is not adequate?

Restarting the Overlord can be time-consuming and potentially risky as we could face issues when trying to redeploy the Overlord instance. Updating the task capacity dynamically allows admins to adjust the cluster safely, reducing operational downtime and complexity. While changes to task capacity are infrequent, I feel that providing a safer runtime option would help to minimize disruptions.

@FrankChen021
Member

> @GabrielCWT , thanks for putting this together.
>
> I might be mistaken but I am not entirely sure if the task capacity of a cluster should be a dynamic config. For any cluster admin, this can have large cost implications. It seems odd to be able to override the capacity of the cluster defined in the Overlord runtime property simply by calling an API.
>
> Changing the task capacity should not be a very frequent requirement for any cluster. And when needed, it should be fairly reasonable to require an Overlord restart.
>
> Could you please elaborate on why you feel the current setup is not adequate?

I think it's true that changing the task slots for the MiddleManager requires a restart, since we might need to redeploy the MiddleManager to servers with bigger resources. But for K8s-based task scheduling, the resources are allocated on the K8s side, outside of Druid, so restarting the Overlord does not make much sense; we should have the ability to reload the capacity dynamically. As @GabrielCWT has stated above, restarting the Overlord is a heavy and risky operation in production.

@kfaraz
Contributor

kfaraz commented Oct 3, 2025

Thanks for the responses, @FrankChen021 , @GabrielCWT .

Even though with the K8s task runner, the task pods are not technically a part of the Druid cluster, the Overlord still has to manage those tasks. The KubernetesTaskRunner itself keeps separate threads for each running task to track their status (this PR also updates that thread count, if I am not mistaken).

So I should imagine that a major change in task capacity would also require some kind of scaling of the Overlord itself.

Also, why is the Overlord restart "risky" or even "slow"? Doesn't a version upgrade require an Overlord restart too?

@FrankChen021
Member

> Also, why is the Overlord restart "risky" or even "slow"? Doesn't a version upgrade require an Overlord restart too?

In a K8s deployment, increasing the Overlord's resources may not always be needed after increasing the task capacity. For example, we generally set the CPU limit to a higher value while keeping the CPU request relatively low in the initial deployment, so when the capacity is increased there is no need to increase the CPU resources. I mean risky because the Overlord needs to restore all tasks; previously we had some problems (maybe a bug) where, after switching leaders, the Overlord failed to elect a new leader. We try our best not to restart the Coordinator/Overlord in production.

@kfaraz
Contributor

kfaraz commented Oct 6, 2025

> I mean risky because the Overlord needs to restore all tasks; previously we had some problems (maybe a bug) where, after switching leaders, the Overlord failed to elect a new leader.

Yes, there might be some bugs around that. Also, the K8s task runner makes certain list-pod calls, which are pretty heavy and need to be addressed. I think @capistrant is doing some work to improve that code flow.

> We try our best not to restart the Coordinator/Overlord in production.

Oh, how frequently do you upgrade your cluster? Is changing the task capacity going to be much more frequent than that?

I agree that the K8s task runner is buggy and we should improve upon it. But making the task capacity dynamic doesn't seem like the best solution. It will open a whole other can of worms and only make this piece more complicated.

Instead, we should try to fix the actual problems in the task runner that make the Overlord leader switch erroneous.

What are your thoughts, @FrankChen021 ?

@FrankChen021
Member

> > I mean risky because the Overlord needs to restore all tasks; previously we had some problems (maybe a bug) where, after switching leaders, the Overlord failed to elect a new leader.
>
> Yes, there might be some bugs around that. Also, the K8s task runner makes certain list-pod calls, which are pretty heavy and need to be addressed. I think @capistrant is doing some work to improve that code flow.
>
> > We try our best not to restart the Coordinator/Overlord in production.
>
> Oh, how frequently do you upgrade your cluster? Is changing the task capacity going to be much more frequent than that?

We don't upgrade clusters very frequently, maybe once a year or even less often. But we do adjust the capacity (upsize or downsize) regularly based on load and requirements.

> I agree that the K8s task runner is buggy and we should improve upon it. But making the task capacity dynamic doesn't seem like the best solution. It will open a whole other can of worms and only make this piece more complicated.
>
> Instead, we should try to fix the actual problems in the task runner that make the Overlord leader switch erroneous.
>
> What are your thoughts, @FrankChen021 ?

The main idea of dynamic configuration is not to circumvent problems during the restart phase; it's about reducing operational complexity and saving time. Even if restarting the Overlord is smooth, I don't think changing such a configuration should require a restart from the users'/operators' point of view. For static configurations, operators have to change configuration files, sync the files to Kubernetes, and restart components; it's a heavy workflow.

FrankChen021 added this to the 35.0.0 milestone on Oct 8, 2025
@kfaraz
Contributor

kfaraz commented Oct 8, 2025

Thanks for the clarification, @FrankChen021 !

I am just a little apprehensive since the K8s task runner is already pretty buggy.
Also, it feels weird to have a config be specified by both static and dynamic means.
But I suppose the static config can be thought of as the default value.

I haven't gone through the whole PR yet. Will try to do a thorough review today.
@GabrielCWT , it would be nice if you could hold off on merging this PR until then.

@FrankChen021
Member

Hi @kfaraz, are you reviewing this PR? I hope we can merge it into Druid 35.

}
}

private void syncCapacityWithDynamicConfig()
Contributor


This code is not maintainable.
Can we do this in a separate thread instead of calling sync everywhere?
Is currentCapacity thread safe?

Contributor Author


It is possible to make currentCapacity thread safe.

> Can we do this in a separate thread instead of calling sync everywhere?

This was an alternative solution, though I am not sure if we are thinking of the same implementation. My solution was to have a thread which periodically checks whether there have been any changes to the capacity and updates currentCapacity accordingly. However, I wasn't sure if this was a waste of resources, as there would need to be a trade-off between responsiveness (how quickly the changes become visible) and resources (since we need to wake the thread up every X seconds).
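Roughly, that polling alternative would look like the sketch below; the 30-second period and the names are arbitrary, and this is not the approach taken in the PR:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch of the periodic-check alternative discussed above. A smaller period
// makes config changes visible sooner but wakes the thread more often; a
// larger period saves wake-ups at the cost of responsiveness.
class CapacityPollerSketch
{
  private final ScheduledExecutorService poller =
      Executors.newSingleThreadScheduledExecutor();

  void start(Runnable syncCapacityWithDynamicConfig)
  {
    poller.scheduleWithFixedDelay(syncCapacityWithDynamicConfig, 0, 30, TimeUnit.SECONDS);
  }
}
```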

Member


Using a separate thread to check is also a bad idea; it does not solve the real problem here but increases the complexity.

The real problem is that the config manager does not provide a notification mechanism when it detects configuration changes.

If we look at the config manager implementation, it provides a method swapIfNew to check and set new values. This is the place where we can add a notification. I think we can add a new overloaded watch method which accepts a Runnable as a callback. This callback is kept in the internal ConfigHolder.
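A rough sketch of the proposed hook, assuming a much-simplified ConfigHolder; only the names swapIfNew and watch come from the comment above, the rest is illustrative:

```java
import java.util.List;
import java.util.Objects;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicReference;

// Simplified stand-in for the real ConfigManager/ConfigHolder classes, showing
// where a change-notification callback could be attached.
class ConfigHolder<T>
{
  private final AtomicReference<T> reference = new AtomicReference<>();
  private final List<Runnable> watchers = new CopyOnWriteArrayList<>();

  /** Registers a callback that runs whenever the held config value changes. */
  void watch(Runnable callback)
  {
    watchers.add(callback);
  }

  /** Swaps in the new value and notifies watchers only if it actually changed. */
  boolean swapIfNew(T newValue)
  {
    T oldValue = reference.get();
    if (Objects.equals(oldValue, newValue)) {
      return false;
    }
    reference.set(newValue);
    watchers.forEach(Runnable::run);
    return true;
  }
}
```

With a hook like this, the task runner could register a single callback that resizes its pool, rather than calling a sync method before every task.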

Contributor Author


I've added observer support for ConfigHolders.

@kfaraz
Contributor

kfaraz commented Oct 10, 2025

> Hi @kfaraz, are you reviewing this PR? I hope we can merge it into Druid 35.

Yes, @FrankChen021 , I will go through the changes either today or tomorrow.
