Merged (153 commits)
58af0b0
Merge pull request #14 from Phantom-Intruder/karpenter-tuning
Phantom-Intruder Apr 17, 2025
63 changes: 62 additions & 1 deletion Autoscaler101/helpers.md
@@ -6,6 +6,9 @@ We have now discussed HPA's, VPA's, and you might have even read the section on
- Graceful shutdowns
- Annotations that help with scaling.
- Pod priority/disruption
- Pod requests/limits
- Rollover strategy
- Pod topology skews

You may have already come across these concepts before, and just about every Kubernetes-based tool uses them to ensure stability. We will discuss each of the above points and follow up with a lab where we test out the above concepts using a simple Nginx server.

@@ -37,6 +40,14 @@ Depending on the type of web application you are running, you may not need to co

While microservices generally take in traffic through their endpoints, your application might differ. It might do batch processing by reading messages off RabbitMQ, or it might occasionally read a database and transform the data within it. In cases like these, having the pod or job terminated for scaling reasons might leave your database table in an inconsistent state, or mean that the message your pod was processing never finishes. In any of these cases, graceful shutdowns can delay termination long enough for your pod to either finish what it started or hand its work off so a different pod can pick up where it left off.

One major problem you might encounter if you use a command string to start your application is that the shell running the string, not your application, becomes the container's main process (PID 1), so the SIGTERM signal is never forwarded to your application. This typically happens when you chain multiple commands together before your main command runs, since the shell stays in the foreground for the whole chain. To hand the main process role over to your application, use `exec`. For example:

```sh
cp somefile . && rm -rf somefolder && exec java -jar yourapplication.jar
```

Without the `exec`, SIGTERM is delivered to the shell running the chain, which does not forward it, so your actual `java` process never gets a chance to shut down gracefully.
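
As a sketch of how this fits into a pod spec (the pod name, image, file names, and grace period below are placeholders, not from the original):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: yourapplication
spec:
  # Time allowed between SIGTERM and SIGKILL for in-flight work to finish.
  terminationGracePeriodSeconds: 60
  containers:
    - name: app
      image: yourapplication:latest   # placeholder image
      command: ["sh", "-c"]
      # exec replaces the shell, so java becomes PID 1 and receives SIGTERM directly.
      args: ["cp somefile . && rm -rf somefolder && exec java -jar yourapplication.jar"]
```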

If the jobs you are running are mission-critical, and each of your jobs must run to completion, then even graceful shutdowns might not be enough. In this case, you can turn to annotations to help you out.

## Annotations
@@ -605,7 +616,57 @@ Apply the configuration:
kubectl apply -f nginx-pdb.yaml
```

Now your number of pods won't go below the minimum available pod count, meaning that if pods are evicted due to autoscaling, if a new version is deployed, or if your pods need to restart for any reason, at least 2 pods will always be up. This, however, does not cover the case where your pod or node runs out of memory, becomes unreachable, or becomes unschedulable. If your node doesn't have enough resources to give, even a PDB insisting that the pod stay up won't help. The same applies if the node were suddenly removed. To minimize the chance of this happening, you will have to set pod requests and limits properly so that the resource requirements of a pod never exceed what the node can provide. On that note, let's take a look at pod requests and limits.

## Pod requests/limits

You have undoubtedly seen pod requests and limits, and have likely even used them at some point. Requests and limits define, respectively, the CPU and memory reserved for a pod and the maximum it is allowed to consume. To break it down:

- Requests: A pod will not schedule on a node unless the pod's memory and CPU requests can be fulfilled. If no node can fulfill a pod's requested resources, a new node will have to be added. If you use the Cluster Autoscaler or Karpenter, this resource requirement will be noted and machines will be provisioned automatically.

- Limits: This is the maximum amount of a resource a container can consume. Once a container breaches its memory limit, it will be terminated (OOM-killed) to keep it within bounds; CPU usage above the limit is throttled rather than killed.
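
Both are set per container under `resources`; a minimal sketch with illustrative values (the container name and image are placeholders):

```yaml
containers:
  - name: app
    image: nginx          # example image
    resources:
      requests:
        cpu: 250m         # reserved at scheduling time
        memory: 500Mi
      limits:
        cpu: 500m         # maximums the container may consume
        memory: 1Gi
```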

With this in mind, it might seem sensible to set requests to a lower value and limits to a higher value. This works fine for development workflows and is called a burstable workload. However, this method has one major downside: node OOM issues. If you have set a pod to have requests of 500MB and limits of 1GB, your pod may get scheduled on a node with only 700MB remaining. Since your pod is allowed to grow up to 1GB, it can keep growing until it consumes that remaining 700MB. At this point, the node goes into a memory pressure state since it has no more memory to give, which causes issues for every pod on the node. The best case here is that the node evicts a pod at random to get out of the memory pressure state. The worst case is that the node itself crashes. Generally, this results in a new node being started, but you will experience performance degradation until that happens. This might be acceptable in development but is quite unacceptable in production.

This is where guaranteed resource allocation comes into play. In this scenario, you set the requests equal to the limits. Now, when a pod is looking to schedule, the scheduler will pick a machine that has the full requested amount of resources available, and the pod can never grow beyond that amount, meaning the node itself will never run out of memory because of it. The result is no chance of node failures or random pods getting forcibly evicted.

If a pod itself reaches its memory limit, the container will be killed and restarted. Be aware that this OOM kill is immediate (a SIGKILL), so unlike evictions and reschedules, your graceful shutdown hooks and termination grace periods will not be honored here. To keep ongoing transactions from failing without warning, size your memory limits so the application never reaches them in normal operation.

Note that this only applies to memory limits. When it comes to CPU limits, the general recommendation is not to set them at all. CPU is elastic and can be acquired and released as needed. Even if a pod takes the full CPU of the machine, the machine itself won't crash, and once the CPU usage drops, the available CPU will be reallocated. On the other hand, if you have a stringent CPU limit in place and the pod reaches it, the application will be throttled and slow down. This is especially true during startup, when an application can consume up to 10 times its normal amount of CPU; with a limit in place, startup can slow down considerably. For these reasons, even for production workloads, it's advisable not to set a CPU limit.
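
Combining these recommendations (memory requests equal to limits for guaranteed allocation, a CPU request for scheduling, and no CPU limit), a production-leaning sketch with illustrative values could look like:

```yaml
containers:
  - name: app
    image: yourapplication:latest   # placeholder image
    resources:
      requests:
        cpu: "1"          # reserved for scheduling; no cpu limit, so the pod can burst
        memory: 1Gi       # equal to the limit, giving guaranteed memory
      limits:
        memory: 1Gi
```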

## Rollover strategy

Now, let's talk about rollover strategies. When you deploy a new version of an application, you might have to do it in the middle of peak hours. You don't want your application going down or even taking a performance hit, which means your deployment strategy should first create the new pods that will replace the existing ones before the old pods are destroyed. Kubernetes already does this by default to a certain degree, allowing up to 25% of the desired replicas to be created above the replica count and up to 25% to be unavailable. However, if you only run about 2 replicas, that means one of them will go down before its replacement comes up, resulting in performance degradation. So you might want to specify these values explicitly yourself. You can do so with:

```yaml
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 1
    maxUnavailable: 0
```

The above block specifies that 1 new pod is allowed to come up in addition to the existing pods during a deployment, and that the maximum number of unavailable pods is 0. If you are running around 6 pods and want deployments to finish faster, you can set `maxSurge` to 3 or more, which means 3 new pods will come up, 3 old ones will go down, then another 3 will come up before the last 3 old pods finally shut down. The downside is that the additional pods need extra resources to run on; if the cluster is short on capacity, Kubernetes will scale the number of nodes up and then back down as needed. But if you have node disruption budgets in place to prevent nodes from going down during business hours, and the budget has been reached, the extra nodes will remain until the budget allows them to be removed.
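
For the 6-replica case just described, the faster rollout could be sketched as:

```yaml
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 3        # up to 3 extra pods may surge during the rollout
    maxUnavailable: 0  # never drop below the desired replica count
```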

## Pod topology skews

Next, let's look at topology skews. One of the major advantages of a microservice architecture is high availability. To support this, most cloud providers offer three or more availability zones (AZs) per region. Each AZ operates independently of the others, meaning that if one AZ were to go down, the other AZs would be unaffected. To use these zones effectively, you should always spread your Kubernetes resources across 2 or more AZs. While it is true that major cloud providers don't generally have AZ-wide outages, this is an extra step you can take to keep your business availability at a maximum.

When we talk about replicas, the idea is that if one replica goes down, another serves the traffic, so there is no service outage. However, if all your replicas are in a single AZ and that AZ goes down, you still get a service outage. So when scheduling your pods, ideally you want each replica on a separate machine, and each machine in a separate AZ. Note, though, that this might not be practical from a cost perspective if you only have a few small applications: separate network gateways for each AZ, as well as the extra machines, will eventually cost you.

As such, a compromise between the two is to schedule pods in a different zone when a node is available there, but fall back to the same machine or the same availability zone when there isn't. The YAML for this would be:

```yaml
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app: nginx
```

This spreads the pods across zones, but if no machines are available to keep the spread balanced, the constraint is ignored. If you have several applications that can share resources across multiple machines, you can instead force pods into separate zones by replacing `ScheduleAnyway` with `DoNotSchedule`. With that setting, if no machine is available in a suitable zone, the pod will stay unscheduled; once that happens, your node scaler will kick in to satisfy the requirement.

# Conclusion
