[WIP] implement autoscaling #242

Draft: wants to merge 23 commits into base: main
Commits
125977c
feat: implement the core auto-scaling functionality
knave Jun 19, 2025
1de2fe9
test: refactor test when processing workloads
knave Jun 20, 2025
dbea96d
feat: implement LeaderElectionRunnable explicitly and add compile-time…
knave Jun 23, 2025
d5f3053
feat: aggregate samples into histogram per tflops
knave Jun 23, 2025
db9c6ce
feat: implement metrics provider
knave Jun 25, 2025
4142a36
feat: add allocator logic
knave Jun 27, 2025
3e8076a
refactor: optimize update worker method
knave Jun 27, 2025
36b76a5
feat: add config parsing
knave Jun 28, 2025
c748df6
feat: apply updates to specified target resources
knave Jun 28, 2025
e22b847
feat: add auto-scaling switch config parsing and apply, TargetResourc…
knave Jun 29, 2025
03a3267
feat: merge AutoSetLimits and AutoSetRequests into AutoSetResources
knave Jul 3, 2025
f73e0cb
feat: implement adjust allocation
knave Jul 4, 2025
c25d013
fix: linter issues
knave Jul 4, 2025
3bca869
fix: linter issues
knave Jul 5, 2025
9be998c
refactor: support multiple recommenders
knave Jul 11, 2025
e326a41
refactor: code organization
knave Jul 19, 2025
acca957
feat: define cron scaler crd
knave Jul 21, 2025
1d16f92
feat: implement cron scaling
knave Jul 28, 2025
714284b
feat: implement cron scaling
knave Jul 28, 2025
22cd81e
feat: implement merging recommendations
knave Jul 30, 2025
a8a3a7c
feat: implement restoring resources upon cron scaling termination
knave Jul 30, 2025
84a9a46
fix: properly handle the isScaleUp
knave Jul 30, 2025
23b5c03
refactor: each recommender is responsible for managing its own annota…
knave Jul 31, 2025

70 changes: 64 additions & 6 deletions api/v1/schedulingconfigtemplate_types.go
@@ -86,17 +86,75 @@ type GPUFilter struct {
}

 type AutoScalingConfig struct {
-	// layer 1 vertical auto-scaling, turbo burst to existing GPU cards quickly
-	// VPA-like, aggregate metrics data <1m
-	AutoSetLimits AutoSetLimits `json:"autoSetLimits,omitempty"`
+	// layer 1 adjusting, to match the actual usage in the long run, only for N:M remote vGPU mode
+	// Adjust baseline requests to match the actual usage in longer period, such as 1day - 2weeks
+	AutoSetResources AutoSetResources `json:"autoSetResources,omitempty"`

 	// layer 2 horizontal auto-scaling, scale up to more GPU cards if max limits threshold hit
 	// HPA-like, aggregate metrics data 1m-1h (when tf-worker scaled-up, should also trigger client pod's owner[Deployment etc.]'s replica increasing, check if KNative works)
 	AutoSetReplicas AutoSetReplicas `json:"autoSetReplicas,omitempty"`

-	// layer 3 adjusting, to match the actual usage in the long run, only for N:M remote vGPU mode, not impl yet
-	// Adjust baseline requests to match the actual usage in longer period, such as 1day - 2weeks
-	AutoSetRequests AutoSetRequests `json:"autoSetRequests,omitempty"`
+	// CronScalingRules defines a list of CronScaling rules used to schedule scaling actions based on cron expressions.
+	CronScalingRules []CronScalingRule `json:"cronScalingRules,omitempty"`
 }

+// CronScalingRule defines the rule for scaling resources based on a cron schedule.
+// It allows enabling/disabling the scaler, specifying the time window for scaling,
+// and configuring the desired resources and replicas during the scheduled period.
+type CronScalingRule struct {
+	// Enable specifies whether the cron scaler is enabled.
+	Enable bool `json:"enable,omitempty"`
+	// Name is the identifier for the cron scaler.
+	Name string `json:"name,omitempty"`
+	// Start is the start time for the scaling schedule, in cron format.
+	Start string `json:"start,omitempty"`
+	// End is the end time for the scaling schedule, in cron format.
+	End string `json:"end,omitempty"`
+	// DesiredResources specifies the target resources to scale to during the schedule.
+	DesiredResources Resources `json:"desiredResources,omitempty"`
+	// ResourceMultiplier is a string representing the multiplier to apply to resources.
+	ResourceMultiplier string `json:"resourceMultiplier,omitempty"`
+	// DesiredReplicas is the target number of replicas during the schedule.
+	DesiredReplicas *int32 `json:"desiredReplicas,omitempty"`
+	// ReplicasMultiplier is a string representing the multiplier to apply to replicas.
+	ReplicasMultiplier string `json:"replicasMultiplier,omitempty"`
+}
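
The multiplier fields are declared as strings, so a consumer has to parse them before use. A minimal sketch of that step for replicas, assuming the multiplier is a plain decimal such as "1.5", that an explicit DesiredReplicas takes precedence over ReplicasMultiplier, and that fractional results round up. None of these rules is stated in the diff, and `scaledReplicas` is a hypothetical helper, not a function from this PR.

```go
package main

import (
	"fmt"
	"math"
	"strconv"
)

// scaledReplicas is a hypothetical helper, not part of this PR.
// It returns the replica count to use while a cron rule is active.
// Assumptions: desired (if non-nil) wins over the multiplier, the multiplier
// is a positive decimal string, and fractional results round up.
func scaledReplicas(current int32, desired *int32, multiplier string) (int32, error) {
	if desired != nil {
		return *desired, nil
	}
	if multiplier == "" {
		return current, nil // nothing configured: leave replicas unchanged
	}
	m, err := strconv.ParseFloat(multiplier, 64)
	if err != nil {
		return 0, fmt.Errorf("invalid replicasMultiplier %q: %w", multiplier, err)
	}
	if math.IsNaN(m) || math.IsInf(m, 0) || m <= 0 {
		return 0, fmt.Errorf("replicasMultiplier must be a positive finite number, got %q", multiplier)
	}
	return int32(math.Ceil(float64(current) * m)), nil
}

func main() {
	n, err := scaledReplicas(3, nil, "1.5")
	fmt.Println(n, err) // 5 <nil>
}
```

Keeping the precedence rule in one function makes it easy to change if the controller instead treats setting both fields as a validation error.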

+type AutoSetResources struct {
+	Enable bool `json:"enable,omitempty"`
+
+	// Target resource to scale, such as "tflops", "vram", or "all" by default
+	TargetResource string `json:"targetResource,omitempty"`
+
+	// Tflops usage percentile that will be used as a base for tflops target recommendation. Default: 0.9
+	TargetTflopsPercentile string `json:"targettflopspercentile,omitempty"`
+
+	// Tflops usage percentile that will be used for the lower bound on tflops recommendation. Default: 0.5
+	LowerBoundTflopsPercentile string `json:"lowerboundtflopspercentile,omitempty"`
+
+	// Tflops usage percentile that will be used for the upper bound on tflops recommendation. Default: 0.95
+	UpperBoundTflopsPercentile string `json:"upperboundtflopspercentile,omitempty"`
+
+	// Vram usage percentile that will be used as a base for vram target recommendation. Default: 0.9
+	TargetVramPercentile string `json:"targetvrampercentile,omitempty"`
+
+	// Vram usage percentile that will be used for the lower bound on vram recommendation. Default: 0.5
+	LowerBoundVramPercentile string `json:"lowerboundvrampercentile,omitempty"`
+
+	// Vram usage percentile that will be used for the upper bound on vram recommendation. Default: 0.95
+	UpperBoundVramPercentile string `json:"upperboundvrampercentile,omitempty"`
+
+	// Fraction of usage added as the safety margin to the recommended request. Default: 0.15
+	RequestMarginFraction string `json:"requestMarginFraction,omitempty"`
+
+	// The time interval used for computing the confidence multiplier for the lower and upper bound. Default: 24h
+	ConfidenceInterval string `json:"confidenceInterval,omitempty"`
+
+	// How far back the TSDB is queried for historical metrics. Default: 1d
+	HistoryLength string `json:"historyLength,omitempty"`
+
+	// Resolution at which TSDB is queried for historical metrics. Default: 1m
+	HistoryResolution string `json:"historyResolution,omitempty"`
+}
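
To illustrate how the percentile and margin settings could combine into a recommendation, here is a rough sketch. It is not the recommender from this PR: the commit list mentions histogram aggregation, whereas this sketch sorts raw samples and takes a nearest-rank percentile. It also applies the margin to all three bounds and leaves out the confidence multiplier; both are assumptions. `parseFraction`, `percentile`, `recommend`, and `Recommendation` are hypothetical names.

```go
package main

import (
	"fmt"
	"math"
	"sort"
	"strconv"
)

// parseFraction parses a string-typed field such as "0.9", falling back to
// def when the field is empty. Values outside [0, 1] are rejected.
func parseFraction(s string, def float64) (float64, error) {
	if s == "" {
		return def, nil
	}
	v, err := strconv.ParseFloat(s, 64)
	if err != nil {
		return 0, err
	}
	if math.IsNaN(v) || v < 0 || v > 1 {
		return 0, fmt.Errorf("fraction %q out of range [0, 1]", s)
	}
	return v, nil
}

// percentile returns the nearest-rank percentile p (0..1) of samples.
// It sorts a copy, so the caller's slice is left untouched.
func percentile(samples []float64, p float64) float64 {
	if len(samples) == 0 {
		return 0
	}
	sorted := append([]float64(nil), samples...)
	sort.Float64s(sorted)
	rank := int(math.Ceil(p * float64(len(sorted))))
	if rank < 1 {
		rank = 1
	}
	return sorted[rank-1]
}

// Recommendation holds the three bounds for one resource (hypothetical type).
type Recommendation struct {
	LowerBound, Target, UpperBound float64
}

// recommend uses the documented defaults (0.5 / 0.9 / 0.95 percentiles and a
// 0.15 margin) whenever the corresponding string is empty.
func recommend(samples []float64, lower, target, upper, margin string) (Recommendation, error) {
	lo, err := parseFraction(lower, 0.5)
	if err != nil {
		return Recommendation{}, fmt.Errorf("lower bound percentile: %w", err)
	}
	tg, err := parseFraction(target, 0.9)
	if err != nil {
		return Recommendation{}, fmt.Errorf("target percentile: %w", err)
	}
	up, err := parseFraction(upper, 0.95)
	if err != nil {
		return Recommendation{}, fmt.Errorf("upper bound percentile: %w", err)
	}
	mg, err := parseFraction(margin, 0.15)
	if err != nil {
		return Recommendation{}, fmt.Errorf("margin fraction: %w", err)
	}
	scale := 1 + mg // assumption: the margin scales every bound
	return Recommendation{
		LowerBound: percentile(samples, lo) * scale,
		Target:     percentile(samples, tg) * scale,
		UpperBound: percentile(samples, up) * scale,
	}, nil
}

func main() {
	usage := []float64{1, 2, 3, 4, 5, 6, 7, 8, 9, 10}
	rec, err := recommend(usage, "", "", "", "")
	fmt.Printf("%+v %v\n", rec, err)
}
```

Because the fields are strings rather than numbers, validating them once at config-parse time (as `parseFraction` does here) keeps malformed values from reaching the scaling loop.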

// A typical autoLimits algorithm could be checking every 5m, look back 1 day data,
21 changes: 21 additions & 0 deletions api/v1/tensorfusionconnection_types.go
@@ -21,6 +21,13 @@ import (
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

+type ResourceName string
+
+const (
+	ResourceTflops ResourceName = "tflops"
+	ResourceVram   ResourceName = "vram"
+)
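
These constants line up with the values documented for `targetResource` ("tflops", "vram", or "all" by default). A small sketch of how a recommender might decide whether a given resource is selected. It assumes that an empty value and "all" both select every resource and that matching is case-insensitive; `targets` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"strings"
)

type ResourceName string

const (
	ResourceTflops ResourceName = "tflops"
	ResourceVram   ResourceName = "vram"
)

// targets reports whether a targetResource setting selects the given resource.
// Assumption: an empty value and "all" both mean every resource, and matching
// is case-insensitive; the diff only documents the three accepted values.
func targets(targetResource string, name ResourceName) bool {
	t := strings.ToLower(strings.TrimSpace(targetResource))
	return t == "" || t == "all" || t == string(name)
}

func main() {
	fmt.Println(targets("", ResourceTflops))     // true
	fmt.Println(targets("vram", ResourceTflops)) // false
	fmt.Println(targets("ALL", ResourceVram))    // true
}
```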

type Resource struct {
Tflops resource.Quantity `json:"tflops"`
Vram resource.Quantity `json:"vram"`
@@ -31,6 +38,20 @@ type Resources struct {
Limits Resource `json:"limits"`
}

+func (r *Resources) Equal(t *Resources) bool {
+	return r.Requests.Tflops.Equal(t.Requests.Tflops) &&
+		r.Requests.Vram.Equal(t.Requests.Vram) &&
+		r.Limits.Tflops.Equal(t.Limits.Tflops) &&
+		r.Limits.Vram.Equal(t.Limits.Vram)
+}
+
+func (r *Resources) IsZero() bool {
+	return r.Requests.Tflops.IsZero() &&
+		r.Requests.Vram.IsZero() &&
+		r.Limits.Tflops.IsZero() &&
+		r.Limits.Vram.IsZero()
+}
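
As written, `Equal` dereferences both `r` and `t`, so a nil pointer on either side would panic. Whether callers can ever pass nil is not visible in this diff. If they can, a guard along these lines might help. The sketch uses a stand-in `quantity` type instead of `resource.Quantity` so that it compiles with the standard library alone; the nil semantics chosen here (two nil pointers are equal) are an assumption.

```go
package main

import "fmt"

// quantity is a stand-in for resource.Quantity so the sketch needs no
// external modules; it stores a value in milli-units.
type quantity struct{ milli int64 }

func (q quantity) Equal(o quantity) bool { return q.milli == o.milli }

type Resource struct{ Tflops, Vram quantity }

type Resources struct{ Requests, Limits Resource }

// Equal mirrors the method added in this PR, plus a nil guard (an addition of
// this sketch): two nil pointers compare equal, nil and non-nil do not.
func (r *Resources) Equal(t *Resources) bool {
	if r == nil || t == nil {
		return r == t
	}
	return r.Requests.Tflops.Equal(t.Requests.Tflops) &&
		r.Requests.Vram.Equal(t.Requests.Vram) &&
		r.Limits.Tflops.Equal(t.Limits.Tflops) &&
		r.Limits.Vram.Equal(t.Limits.Vram)
}

func main() {
	a := &Resources{Requests: Resource{Tflops: quantity{1000}}}
	var none *Resources
	fmt.Println(a.Equal(a), a.Equal(none), none.Equal(none)) // true false true
}
```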

// TensorFusionConnectionSpec defines the desired state of TensorFusionConnection.
type TensorFusionConnectionSpec struct {
WorkloadName string `json:"workloadName"`
46 changes: 44 additions & 2 deletions api/v1/zz_generated.deepcopy.go

Some generated files are not rendered by default.

@@ -50,41 +50,6 @@ spec:
autoScaling:
description: scale the workload based on the usage and traffic
properties:
autoSetLimits:
description: |-
layer 1 vertical auto-scaling, turbo burst to existing GPU cards quickly
VPA-like, aggregate metrics data <1m
properties:
enable:
type: boolean
evaluationPeriod:
type: string
extraTFlopsBufferRatio:
type: string
ignoredDeltaRange:
type: string
maxRatioToRequests:
description: the multiplier of requests, to avoid limit set
too high, like 5.0
type: string
prediction:
properties:
enable:
type: boolean
historyDataPeriod:
type: string
model:
type: string
predictionPeriod:
type: string
type: object
scaleUpStep:
type: string
targetResource:
description: target resource to scale limits, such as "tflops",
"vram", or "all" by default
type: string
type: object
autoSetReplicas:
description: |-
layer 2 horizontal auto-scaling, scale up to more GPU cards if max limits threshold hit
@@ -105,40 +70,141 @@ spec:
targetTFlopsOfLimits:
type: string
type: object
autoSetRequests:
autoSetResources:
description: |-
layer 3 adjusting, to match the actual usage in the long run, only for N:M remote vGPU mode, not impl yet
layer 1 adjusting, to match the actual usage in the long run, only for N:M remote vGPU mode
Adjust baseline requests to match the actual usage in longer period, such as 1day - 2weeks
properties:
aggregationPeriod:
confidenceInterval:
description: 'The time interval used for computing the confidence
multiplier for the lower and upper bound. Default: 24h'
type: string
enable:
type: boolean
evaluationPeriod:
historyLength:
description: 'How far back the TSDB is queried for historical
metrics. Default: 1d'
type: string
extraBufferRatio:
description: the request buffer ratio, for example actual
usage is 1.0, 10% buffer will be 1.1 as final preferred
requests
historyResolution:
description: 'Resolution at which TSDB is queried for historical
metrics. Default: 1m'
type: string
percentileForAutoRequests:
lowerboundtflopspercentile:
description: 'Tflops usage percentile that will be used for
the lower bound on tflops recommendation. Default: 0.5'
type: string
lowerboundvrampercentile:
description: 'Vram usage percentile that will be used for
the lower bound on vram recommendation. Default: 0.5'
type: string
requestMarginFraction:
description: 'Fraction of usage added as the safety margin
to the recommended request. Default: 0.15'
type: string
prediction:
properties:
enable:
type: boolean
historyDataPeriod:
type: string
model:
type: string
predictionPeriod:
type: string
type: object
targetResource:
description: target resource to scale requests, such as "tflops",
"vram", or "all" by default
description: Target resource to scale, such as "tflops", "vram",
or "all" by default
type: string
targettflopspercentile:
description: 'Tflops usage percentile that will be used as
a base for tflops target recommendation. Default: 0.9'
type: string
targetvrampercentile:
description: 'Vram usage percentile that will be used as a
base for vram target recommendation. Default: 0.9'
type: string
upperboundtflopspercentile:
description: 'Tflops usage percentile that will be used for
the upper bound on tflops recommendation. Default: 0.95'
type: string
upperboundvrampercentile:
description: 'Vram usage percentile that will be used for
the upper bound on vram recommendation. Default: 0.95'
type: string
type: object
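
One detail worth checking for the window fields: the documented default for `historyLength` is `1d`, but Go's `time.ParseDuration` has no day unit, so that value needs custom handling if the standard parser is used. The diff does not show how the controller parses these strings. The following is only a sketch of one option, with `parseWindow` as a hypothetical helper that treats a trailing "d" as whole days of 24 hours.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"time"
)

// parseWindow parses values such as "24h", "1m", or "1d". An empty string
// yields def. A trailing "d" is read as a whole number of 24-hour days
// (an assumption of this sketch); everything else goes to time.ParseDuration.
func parseWindow(s string, def time.Duration) (time.Duration, error) {
	if s == "" {
		return def, nil
	}
	if strings.HasSuffix(s, "d") {
		days, err := strconv.Atoi(strings.TrimSuffix(s, "d"))
		if err != nil || days < 0 {
			return 0, fmt.Errorf("invalid day count in %q", s)
		}
		return time.Duration(days) * 24 * time.Hour, nil
	}
	return time.ParseDuration(s)
}

func main() {
	for _, in := range []string{"1d", "24h", "1m", ""} {
		d, err := parseWindow(in, 24*time.Hour)
		fmt.Println(in, d, err)
	}
}
```

No unit accepted by `time.ParseDuration` ends in "d", so the suffix check does not shadow any standard duration string.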
cronScalingRules:
description: CronScalingRules defines a list of CronScaling rules
used to schedule scaling actions based on cron expressions.
items:
description: |-
CronScalingRule defines the rule for scaling resources based on a cron schedule.
It allows enabling/disabling the scaler, specifying the time window for scaling,
and configuring the desired resources and replicas during the scheduled period.
properties:
desiredReplicas:
description: DesiredReplicas is the target number of replicas
during the schedule.
format: int32
type: integer
desiredResources:
description: DesiredResources specifies the target resources
to scale to during the schedule.
properties:
limits:
properties:
tflops:
anyOf:
- type: integer
- type: string
pattern: ^(\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))(([KMGTPE]i)|[numkMGTPE]|([eE](\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))))?$
x-kubernetes-int-or-string: true
vram:
anyOf:
- type: integer
- type: string
pattern: ^(\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))(([KMGTPE]i)|[numkMGTPE]|([eE](\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))))?$
x-kubernetes-int-or-string: true
required:
- tflops
- vram
type: object
requests:
properties:
tflops:
anyOf:
- type: integer
- type: string
pattern: ^(\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))(([KMGTPE]i)|[numkMGTPE]|([eE](\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))))?$
x-kubernetes-int-or-string: true
vram:
anyOf:
- type: integer
- type: string
pattern: ^(\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))(([KMGTPE]i)|[numkMGTPE]|([eE](\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))))?$
x-kubernetes-int-or-string: true
required:
- tflops
- vram
type: object
required:
- limits
- requests
type: object
enable:
description: Enable specifies whether the cron scaler is
enabled.
type: boolean
end:
description: End is the end time for the scaling schedule,
in cron format.
type: string
name:
description: Name is the identifier for the cron scaler.
type: string
replicasMultiplier:
description: ReplicasMultiplier is a string representing
the multiplier to apply to replicas.
type: string
resourceMultiplier:
description: ResourceMultiplier is a string representing
the multiplier to apply to resources.
type: string
start:
description: Start is the start time for the scaling schedule,
in cron format.
type: string
type: object
type: array
type: object
hypervisor:
description: single GPU device multi-process queuing and fair scheduling