@tedcm tedcm commented Feb 16, 2023

Which component this PR applies to?

Vertical-pod-autoscaler

What type of PR is this?

/kind documentation

What this PR does / why we need it:

adds "pci" labels to vpa PRs, requiring at least one non-contributing reviewer.

Which issue(s) this PR fixes:

Special notes for your reviewer:

Does this PR introduce a user-facing change?


Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


bpineau and others added 23 commits May 31, 2022 17:07
This is meant to isolate, contain and wrap access to our custom
resource (storageclass/local-data) and label (/local-storage)
in a single place, separated in our processors/datadog/ namespace.

Avoid spreading local names and constraints over all the code base.
Main functional change compared to upstream's is the "storageclass/local-data"
hack, meant to support pods claiming local-data (no-provisioner) volumes.

filterOutSchedulable (+test) is mostly that of upstream (copied for later
modifications), lightly modified to compose with older clusters where taint
based eviction isn't enabled.

The frontend is meant to hook in future sub-processors for pods.
This gives a foothold for some local-only improvements, without touching upstream code; plans include:
* Runaway upscale prevention
* Schedulable metrics (pending pods pressure, long pending pods)
* Pod label selectors (for dedicated autoscaler instances)
* Possibly: deprioritize pods from cronjobs

Hooking it in the least intrusive way we could.
Goals of that PodListProcessor are twofold:
* Lower pressure on runonce loops by evaluating long pending pods less frequently
* More importantly: free autoscaler cycles to recover from scaledown cooldown, so nodes created for pods causing infinite upscales get gc'ed rather than filling a cluster (and those pods are slowed down)

The delay penalty could be made progressive (eg. 2m then 5m then 10m etc), but for now a static value makes evaluating benefits and impacts easier.
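
To make the static delay penalty concrete, here is a minimal sketch, not the actual processor. The `pendingPod` and `longPendingFilter` types, the 10m cutoff and the 2m delay are all assumptions made up for illustration; only the idea (recent pods are always evaluated, long pending pods at most once per delay) comes from the description above.

```go
package main

import (
	"fmt"
	"time"
)

// pendingPod is a hypothetical, simplified stand-in for a pending pod.
type pendingPod struct {
	Name      string
	CreatedAt time.Time
}

// longPendingFilter evaluates pods older than cutoff at most once per delay.
type longPendingFilter struct {
	cutoff   time.Duration        // age after which a pod counts as long pending (assumed)
	delay    time.Duration        // static penalty between two evaluations (assumed)
	nextEval map[string]time.Time // pod name -> earliest next evaluation
}

func newLongPendingFilter(cutoff, delay time.Duration) *longPendingFilter {
	return &longPendingFilter{cutoff: cutoff, delay: delay, nextEval: map[string]time.Time{}}
}

// filter returns the pods to consider in this loop iteration.
func (f *longPendingFilter) filter(pods []pendingPod, now time.Time) []pendingPod {
	var out []pendingPod
	for _, p := range pods {
		if now.Sub(p.CreatedAt) < f.cutoff {
			out = append(out, p) // recent pods are always evaluated
			continue
		}
		if next, ok := f.nextEval[p.Name]; ok && now.Before(next) {
			continue // still serving its delay penalty
		}
		f.nextEval[p.Name] = now.Add(f.delay)
		out = append(out, p)
	}
	return out
}

// runExample filters two pods at t, t+1m and t+3m and returns how many pods
// were kept at each step.
func runExample() []int {
	now := time.Now()
	f := newLongPendingFilter(10*time.Minute, 2*time.Minute)
	pods := []pendingPod{
		{Name: "fresh", CreatedAt: now.Add(-time.Minute)},
		{Name: "stuck", CreatedAt: now.Add(-time.Hour)},
	}
	var kept []int
	for _, offset := range []time.Duration{0, time.Minute, 3 * time.Minute} {
		kept = append(kept, len(f.filter(pods, now.Add(offset))))
	}
	return kept
}

func main() {
	fmt.Println(runExample())
}
```

A progressive penalty would only need `delay` to become a per-pod value that grows at each evaluation.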

Metrics to come in follow-up PR.
This is meant to replace a patch that was setting new nodes as NotReady
until their lvp pod was there, with something bound to and contained in
our podsListProcessor, not touching autoscaler core or cloudproviders at
all anymore. The decision is entirely based on the local-data:true label: no
guessing or per cloud provider instance-type allowlists needed anymore.

Instead of setting them NotReady, the new nodes that just joined are now
considered as schedulable for pods requesting local-data once they are
ready, which naturally prevents spurious re-upscales.

For now the change is restricted to new local-data nodes that became
ready less than 5 minutes ago, as we're assessing wider impact.

The downside is we need to modify the clusterSnapshot content before
filterOutSchedulable runs scheduler predicates with those nodes, which
happens later, also in our own podsListProcessor.
And stop trying to make that a nodeinfo processor: this was an attempt
to follow upstream suggestions, but excessively intrusive (not rebase
friendly) for a feature we might keep locally/forked for a long time.

We're also submitting a "nodeinfos provider processor" to upstream, which
(if accepted) will help integrate that kind of changes much more cleanly
(eg. not breaking tests, not touching core/). For now, let's assume we
might not have an upstream processor entry point for a long time.
The new option `--node-infos-processor-podtemplate` will be used to enable
support for the PodTemplate processor.
The PodTemplate processor extracts, from PodTemplate resources, the Pods
that should be considered as Daemonset Pods. This solution allows custom
Daemonset controllers to have their workloads considered as Daemonset workloads.
The PodTemplate processor watches `PodTemplates` with a specific label in any namespace.
From a `PodTemplate`, the processor generates a `Pod` that will be considered as a Daemonset Pod
by the cluster-autoscaler.
The PodTemplate processor is plugged into the Datadog NodeInfosProcessors to benefit from
the cache mechanism, which limits the simulation processing overhead.

It also limits possible merge conflicts with the upstream cluster-autoscaler code base.
Various cloudproviders' `NodeGroupForNode()` implementations (including
aws, azure, and gce) can return a `nil` error _and_ a `nil` nodegroup.
Eg. we're seeing AWS returning that on failed upscales on live clusters.
Checking that `deleteCreatedNodesWithErrors` doesn't return an error is
not enough to safely dereference the nodegroup (as returned by
`NodeGroupForNode()`) by calling nodegroup.Id().

In that situation, logging and returning early seems the safest option,
to give various caches (eg. clusterstateregistry's and cloud provider's)
the opportunity to eventually converge.
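
A minimal sketch of that guard, under stated assumptions: the `NodeGroup` interface is trimmed down to `Id()`, and `lookup` and `nodeGroupID` are hypothetical helpers standing in for `NodeGroupForNode()` and its caller. This is not the actual patch, only the shape of the check.

```go
package main

import (
	"errors"
	"fmt"
)

// NodeGroup is a hypothetical, trimmed-down version of the cloud provider interface.
type NodeGroup interface {
	Id() string
}

type fakeGroup struct{ id string }

func (g *fakeGroup) Id() string { return g.id }

// lookup mimics a NodeGroupForNode() that may return (nil, nil).
func lookup(node string) (NodeGroup, error) {
	switch node {
	case "known":
		return &fakeGroup{id: "asg-1"}, nil
	case "broken":
		return nil, errors.New("cloud provider error")
	default:
		return nil, nil // no error, but no node group either
	}
}

// nodeGroupID returns the group id, or ok=false when dereferencing is not safe.
func nodeGroupID(node string) (id string, ok bool) {
	ng, err := lookup(node)
	if err != nil {
		fmt.Printf("failed to find node group for %s: %v\n", node, err)
		return "", false
	}
	// A nil error is not enough: the node group itself can be nil.
	// (A typed nil pointer wrapped in the interface would need an extra check.)
	if ng == nil {
		fmt.Printf("no node group found for %s, skipping\n", node)
		return "", false
	}
	return ng.Id(), true
}

func main() {
	for _, n := range []string{"known", "broken", "missing"} {
		if id, ok := nodeGroupID(n); ok {
			fmt.Printf("node %s belongs to %s\n", n, id)
		}
	}
}
```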
Brings a few recent Standard_L*s_v3, Standard_HB120 and Standard_NC* instance types.
/!\ This is an unfortunately unavoidable change to vendor/, not meant to stay indefinitely.

The implementation tries to be non-intrusive and to avoid refactoring, to ease future rebases; that change should be removed when the cluster-autoscaler no longer needs to support Kubernetes clusters < k8s v1.24.

The topology spread constraints' skew accounting changed slightly, which can lead cluster-autoscaler (>= 1.24) to leave pending pods on k8s clusters < 1.24:

When evaluating node options for a pending pod having topology spread constraints, Kubernetes used (and continues) to inventory all the possible topology domains (eg. zones us-east-1a, us-east-1b, etc., which it will try to use in a balanced way, with respect to the configured skew) by listing nodes running pods matching the provided labelSelector, and filtering out those that don't pass the tested pod's nodeAffinities.

But when computing the number of instances per topology domain to evaluate skew, the Kubernetes scheduler (< 1.24) used to count all nodes having pods that match the labelSelector, irrespective of their conformance to the tested pod's nodeAffinity.
This changed with Kubernetes commit 935cbc8e625e6f175a44e4490fecc7a25edf6d45 (refactored later on) which I think is part of k8s v1.24: now the scheduler also filters out nodes that don't match the tested pod nodeAffinities when counting pods per topology domain (computing the skew).

Since the cluster-autoscaler 1.24 uses upstream's scheduler framework, it inherited that behaviour, and this can lead to diverging evaluations vs k8s scheduler (if the cluster's scheduler is < 1.24): one node could be considered as schedulable by the autoscaler (not triggering an upscale) while k8s scheduler would consider it wouldn't satisfy skew constraints.

One example that can trigger that situation would be a deployment changing its affinities (eg. to move to a new set of nodes) while older pods/nodes are already at maximum skew tolerance (eg. slightly unbalanced). For instance in that situation:

We have a deployment configured like so:
```
  labels:
    app: myapp

  replicas: 4

  topologySpreadConstraints:
  - labelSelector:
      matchLabels:
        app: myapp
    maxSkew: 1
    topologyKey: zone
    whenUnsatisfiable: DoNotSchedule

  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
          - key: nodeset
            operator: In
            values: [foo]
```

At first the application could end-up distributed on nodeset=foo nodes as such:
```
node1 nodeset=foo  zone=1           pod-1 app=myapp
node2 nodeset=foo  zone=2           pod-2 app=myapp
node3 nodeset=foo  zone=3           pod-3 app=myapp
node4 nodeset=foo  zone=1 # again   pod-4 app=myapp

node5 nodeset=bar  zone=1   not used yet because doesn't match nodeaffinity
node6 nodeset=bar  zone=2   unschedulable (eg. full, or cordoned, ...)
node7 nodeset=bar  zone=3   unschedulable (eg. full, or cordoned, ...)
```

Then the application's affinity is updated to eg. `values: [bar]` and creates a new pod, part of its rollout (or is upscaled).

With the older scheduler `podtopologyspread` predicate, we'd count:
```
  zone 1: 2 app=myapp pods
  zone 2: 1 app=myapp pod
  zone 3: 1 app=myapp pod
```
so we can't use node5 on zone 1, because we're already hitting the `maxSkew: 1` budget (have one excess pod) on that zone: we need a nodeset=bar upscale on zone 2 or 3.

While the newer scheduler would only count pods running on nodeset=bar to compute skew, which would give:
```
  zone 1: 0 app=myapp pods
  zone 2: 0 app=myapp pods
  zone 3: 0 app=myapp pods
```
which means the new pod can use any node, including the already available node5: no need for an upscale.
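
The two counting modes in the example above can be reproduced with a small sketch. It is a simplified model, not the scheduler's code: the `node` type and the `countPerZone` and `fits` helpers are made up for illustration, node affinity is reduced to a single `nodeset` value, and the skew check is the usual "count in the candidate zone, plus one, minus the lowest count".

```go
package main

import "fmt"

// node is a hypothetical, simplified view of a node in the example above.
type node struct {
	name, nodeset, zone string
	matchingPods        int // pods matching the labelSelector (app=myapp)
}

// exampleNodes returns the seven nodes from the scenario.
func exampleNodes() []node {
	return []node{
		{"node1", "foo", "1", 1}, {"node2", "foo", "2", 1},
		{"node3", "foo", "3", 1}, {"node4", "foo", "1", 1},
		{"node5", "bar", "1", 0}, {"node6", "bar", "2", 0},
		{"node7", "bar", "3", 0},
	}
}

// countPerZone counts matching pods per zone. Zones (topology domains) come
// from nodes passing the affinity in both modes. When filterByAffinity is
// false (older behaviour), pods on nodes failing the affinity still count.
func countPerZone(nodes []node, wantNodeset string, filterByAffinity bool) map[string]int {
	counts := map[string]int{}
	for _, n := range nodes {
		if n.nodeset == wantNodeset {
			counts[n.zone] = 0
		}
	}
	for _, n := range nodes {
		if _, known := counts[n.zone]; !known {
			continue
		}
		if filterByAffinity && n.nodeset != wantNodeset {
			continue
		}
		counts[n.zone] += n.matchingPods
	}
	return counts
}

// fits reports whether adding one pod in zone keeps the skew within maxSkew.
func fits(counts map[string]int, zone string, maxSkew int) bool {
	if len(counts) == 0 {
		return false
	}
	lowest := -1
	for _, c := range counts {
		if lowest < 0 || c < lowest {
			lowest = c
		}
	}
	return counts[zone]+1-lowest <= maxSkew
}

func main() {
	older := countPerZone(exampleNodes(), "bar", false)
	newer := countPerZone(exampleNodes(), "bar", true)
	fmt.Println("older counting:", older, "node5 usable:", fits(older, "1", 1))
	fmt.Println("newer counting:", newer, "node5 usable:", fits(newer, "1", 1))
}
```

With the older counting, zone 1 gives 2+1-1 = 2, above `maxSkew: 1`; with the newer counting it gives 0+1-0 = 1, so node5 is accepted.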
The skewer's library cache is re-created at every call, which causes
pressure on Azure API, and slows down the cluster-autoscaler startup
time by two minutes on my small (120 nodes, 300 VMSS) test cluster.

This was hitting the API twice on cache miss to look for non-promo
instance types (even when the instance name doesn't end with "_Promo").
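
A rough sketch of both ideas (build the cache once, and only retry without the suffix when the name actually ends with "_Promo"). The `skuCache` type, `getCache`, `findSKU` and the two counters are hypothetical stand-ins; nothing here is the skewer library's API.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// skuCache is a hypothetical stand-in for the skewer library cache.
type skuCache struct {
	skus map[string]bool
}

var (
	cacheOnce  sync.Once
	cache      *skuCache
	buildCalls int // number of expensive (API-backed) cache builds, for illustration
	lookups    int // number of cache queries issued, for illustration
)

// getCache builds the cache on first use and reuses it afterwards.
func getCache() *skuCache {
	cacheOnce.Do(func() {
		buildCalls++
		cache = &skuCache{skus: map[string]bool{"Standard_D4s_v3": true}}
	})
	return cache
}

func (c *skuCache) has(name string) bool {
	lookups++
	return c.skus[name]
}

// findSKU retries without the "_Promo" suffix only when that suffix is present.
func findSKU(name string) bool {
	c := getCache()
	if c.has(name) {
		return true
	}
	if base := strings.TrimSuffix(name, "_Promo"); base != name {
		return c.has(base)
	}
	return false
}

func main() {
	fmt.Println(findSKU("Standard_D4s_v3_Promo"), findSKU("Standard_Unknown"))
	fmt.Println("cache builds:", buildCalls, "lookups:", lookups)
}
```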
First draft to support lvm storage (topolvm)
This commit adds the possibility to define extended resources for a node group on GCE,
so that the cluster-autoscaler can account for them when taking scaling decisions.

This is done through the `extended_resources` key inside the AUTOSCALER_ENV_VARS variable set on a MIG template.

Signed-off-by: Mayeul Blanzat <[email protected]>
…add more tests

* Malformed extended resource definition should not fail the template building function. Instead, log the error and ignore extended resources
* Remove useless existence check
* Add tests around the extractExtendedResourcesFromKubeEnv function
* Add a test case to verify that malformed extended resource definition does not fail the template build function

Signed-off-by: Mayeul Blanzat <[email protected]>
…esource-support-in-gce

Cherry-pick: add extended resource support in GCE
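
For illustration, a sketch of how such a definition could be parsed while tolerating malformed input, as the commits above describe. The exact wire format is an assumption here (`;`-separated variables, with `extended_resources` holding `name=quantity` pairs separated by commas), the helper names are made up, and quantities are kept as plain strings rather than Kubernetes resource quantities.

```go
package main

import (
	"fmt"
	"strings"
)

// extractVar returns the value of key from a ';'-separated list of key=value
// pairs (assumed layout of AUTOSCALER_ENV_VARS).
func extractVar(envVars, key string) (string, bool) {
	for _, entry := range strings.Split(envVars, ";") {
		k, v, ok := strings.Cut(strings.TrimSpace(entry), "=")
		if ok && k == key {
			return v, true
		}
	}
	return "", false
}

// parseExtendedResources parses "name1=qty1,name2=qty2".
func parseExtendedResources(value string) (map[string]string, error) {
	out := map[string]string{}
	for _, item := range strings.Split(value, ",") {
		name, qty, ok := strings.Cut(item, "=")
		if !ok || name == "" || qty == "" {
			return nil, fmt.Errorf("malformed extended resource %q", item)
		}
		out[name] = qty
	}
	return out, nil
}

// extendedResourcesOrEmpty logs and ignores malformed definitions instead of
// failing the whole template build.
func extendedResourcesOrEmpty(envVars string) map[string]string {
	value, found := extractVar(envVars, "extended_resources")
	if !found {
		return map[string]string{}
	}
	res, err := parseExtendedResources(value)
	if err != nil {
		fmt.Printf("ignoring extended resources: %v\n", err)
		return map[string]string{}
	}
	return res
}

func main() {
	good := "os=linux;extended_resources=example.com/gpu=2,example.com/fpga=1"
	bad := "os=linux;extended_resources=example.com/gpu"
	fmt.Println(extendedResourcesOrEmpty(good))
	fmt.Println(extendedResourcesOrEmpty(bad))
}
```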
There's a small window between the time the ASG list is refreshed (happens
every 1mn), and the time expired or new instance-types cache entries are
fetched again from ASG's LaunchConfigurations or LaunchTemplates.

An ASG's LC might have been replaced during that window; in which case
attempts to refresh that ASG's instance type would use the stale LC name
we got when we last listed ASGs, possibly deleted since then.

DescribeLaunchConfigurations would not err if some of the provided
LaunchConfigurationNames are missing from the result set. Which is fine,
as we can cache what we could retrieve, and try again/converge the missing
entries once we retry with a refreshed ASG list (at most 1mn in the future),
avoiding collecting everything again (expensive API calls).

The issue is that getInstanceTypesForAsgs() (the only place we call
getInstanceTypeByLaunchConfigNames (-> DescribeLaunchConfigurations) from,
itself called for missing cache entries) would set entries for each
ASG, irrespective of getInstanceTypeByLaunchConfigNames() result set size;
so we can end up caching empty ("") instance types. This causes getAsgTemplate
failures ('ASG %q uses the unknown EC2 instance type ""') and degenerates
into the cluster-autoscaler aborting its main loop cycle for as long as the
bogus entries remain in cache.

On that topic: getInstanceTypeForAsg was swallowing getInstanceTypesForAsgs
error message, which doesn't help with diagnostics.
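
The intended behaviour (cache only what was actually retrieved, leave the rest for the next refresh) can be sketched as follows. `describeLaunchConfigs` and `refreshInstanceTypes` are hypothetical helpers operating on plain maps; they only model the partial result set described above and are not the AWS SDK or the actual cache code.

```go
package main

import "fmt"

// describeLaunchConfigs is a hypothetical stand-in for DescribeLaunchConfigurations:
// names that no longer exist are silently absent from the result, with no error.
func describeLaunchConfigs(names []string, existing map[string]string) map[string]string {
	found := map[string]string{}
	for _, n := range names {
		if t, ok := existing[n]; ok {
			found[n] = t
		}
	}
	return found
}

// refreshInstanceTypes caches instance types only for ASGs whose launch
// configuration was actually returned; the others stay uncached and get
// retried once the ASG list has been refreshed.
func refreshInstanceTypes(cache, asgToLC, existing map[string]string) {
	names := make([]string, 0, len(asgToLC))
	for _, lc := range asgToLC {
		names = append(names, lc)
	}
	found := describeLaunchConfigs(names, existing)
	for asg, lc := range asgToLC {
		instanceType, ok := found[lc]
		if !ok || instanceType == "" {
			continue // never cache an empty instance type
		}
		cache[asg] = instanceType
	}
}

func main() {
	cache := map[string]string{}
	asgToLC := map[string]string{"asg-a": "lc-a", "asg-b": "lc-stale"}
	existing := map[string]string{"lc-a": "m5.large"} // lc-stale was deleted
	refreshInstanceTypes(cache, asgToLC, existing)
	fmt.Println(cache)
}
```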
Expander requests' payloads can be rather heavy under upscale pressure,
as they're compounding all candidate options and the unschedulable pods
that could fit each option. Expander responses are a subset of the
requests' payload items.

We're allowing ourselves to send arbitrary payload sizes (gRPC
`defaultClientMaxSendMessageSize` is `math.MaxInt32`), but we're prone
to dropping expander server responses on the floor, due to the `4MiB`
`defaultClientMaxReceiveMessageSize`.

The arbitrary 128MiB value is meant to be huge (enough to support eg.
several dozen fat 1MiB pods) but not unlimited. Let me know if you'd
rather see that turned into a command line flag, or another value.

Also logging the possible gRPC call errors, as that is of great help to
diagnose that kind of issue.
@tedcm tedcm requested a review from lallydd February 16, 2023 20:25
@tedcm tedcm self-assigned this Feb 16, 2023