Merged (153 commits)
58af0b0
Merge pull request #14 from Phantom-Intruder/karpenter-tuning
Phantom-Intruder Apr 17, 2025
63 changes: 62 additions & 1 deletion Autoscaler101/helpers.md
@@ -6,6 +6,9 @@ We have now discussed HPA's, VPA's, and you might have even read the section on
- Graceful shutdowns
- Annotations that help with scaling.
- Pod priority/disruption
- Pod requests/limits
- Rollover strategy
- Pod topology skews

You may have already come across these concepts before, and just about every Kubernetes-based tool uses them to ensure stability. We will discuss each of the above points and follow up with a lab where we test out the above concepts using a simple Nginx server.

@@ -37,6 +40,14 @@ Depending on the type of web application you are running, you may not need to co

While microservices generally take in traffic through their endpoints, your application might differ. It might do batch processing by reading messages off RabbitMQ, or it might occasionally read a database and transform the data within it. In cases like these, having the pod or job terminated for scaling reasons might leave your database table in an inconsistent state, or mean that the message your pod was processing never finishes. In any of these cases, graceful shutdowns can delay termination long enough for your pod to either finish what it started or hand its work off so a different pod can pick up where it left off.

One major problem you might encounter if you use a command string to start your application is that the shell running the string, not your application, becomes the container's main process (PID 1), so the SIGTERM signal is never forwarded to your application. This typically happens when you chain multiple commands together before your main command runs, since the shell stays in the foreground for the whole chain. To hand the main process role over to your application, use `exec`. For example:

```sh
cp somefile . && rm -rf somefolder && exec java -jar yourapplication.jar
```

Without the `exec`, SIGTERM is delivered to the shell running the chain, which does not forward it, so your actual `java` process never gets a chance to shut down gracefully.
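
As a sketch of how this fits into a pod spec (the pod name, image, file names, and grace period below are placeholders, not from the original):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: yourapplication
spec:
  # Time allowed between SIGTERM and SIGKILL for in-flight work to finish.
  terminationGracePeriodSeconds: 60
  containers:
    - name: app
      image: yourapplication:latest   # placeholder image
      command: ["sh", "-c"]
      # exec replaces the shell, so java becomes PID 1 and receives SIGTERM directly.
      args: ["cp somefile . && rm -rf somefolder && exec java -jar yourapplication.jar"]
```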

If the jobs you are running are mission-critical, and each of your jobs must run to completion, then even graceful shutdowns might not be enough. In this case, you can turn to annotations to help you out.

## Annotations
@@ -605,7 +616,57 @@ Apply the configuration:
kubectl apply -f nginx-pdb.yaml
```

Now your number of pods won't go below the minimum available pod count, meaning that if pods are evicted due to autoscaling, if a new version is deployed, or if your pods need to restart for any reason, at least 2 pods will always be up. This, however, does not cover the case where your pod or node runs out of memory, becomes unreachable, or becomes unschedulable. If your node doesn't have enough resources to give, even a PDB insisting that the pod stay up won't help. The same applies if the node were suddenly removed. To minimize the chance of this happening, you will have to set pod requests and limits properly so that the resource requirements of a pod never exceed what the node can provide. On that note, let's take a look at pod requests and limits.

## Pod requests/limits

You have undoubtedly seen pod requests and limits, and have likely even used them at some point. Requests and limits define, respectively, the CPU and memory reserved for a pod and the maximum it is allowed to consume. To break it down:

- Requests: A pod will not schedule on a node unless the pod's memory and CPU requests can be fulfilled. If no node can fulfill a pod's requested resources, a new node will have to be added. If you use the Cluster Autoscaler or Karpenter, this resource requirement will be noted and machines will be provisioned automatically.

- Limits: This is the maximum amount of a resource a container can consume. Once a container breaches its memory limit, it will be terminated (OOM-killed) to keep it within bounds; CPU usage above the limit is throttled rather than killed.
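
Both are set per container under `resources`; a minimal sketch with illustrative values (the container name and image are placeholders):

```yaml
containers:
  - name: app
    image: nginx          # example image
    resources:
      requests:
        cpu: 250m         # reserved at scheduling time
        memory: 500Mi
      limits:
        cpu: 500m         # maximums the container may consume
        memory: 1Gi
```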

With this in mind, it might seem sensible to set requests to a lower value and limits to a higher value. This works fine for development workflows and is called a burstable workload. However, this method has one major downside: node OOM issues. If you have set a pod to have requests of 500MB and limits of 1GB, your pod may get scheduled on a node with only 700MB remaining. Since your pod is allowed to grow up to 1GB, it can keep growing until it consumes that remaining 700MB. At this point, the node goes into a memory pressure state since it has no more memory to give, which causes issues for every pod on the node. The best case here is that the node evicts a pod at random to get out of the memory pressure state. The worst case is that the node itself crashes. Generally, this results in a new node being started, but you will experience performance degradation until that happens. This might be acceptable in development but is quite unacceptable in production.

This is where guaranteed resource allocation comes into play. In this scenario, you set the requests equal to the limits. Now, when a pod is looking to schedule, the scheduler will pick a machine that has the full requested amount of resources available, and the pod can never grow beyond that amount, meaning the node itself will never run out of memory because of it. The result is no chance of node failures or random pods getting forcibly evicted.

If a pod itself reaches its memory limit, the container will be killed and restarted. Be aware that this OOM kill is immediate (a SIGKILL), so unlike evictions and reschedules, your graceful shutdown hooks and termination grace periods will not be honored here. To keep ongoing transactions from failing without warning, size your memory limits so the application never reaches them in normal operation.

Note that this only applies to memory limits. When it comes to CPU limits, the general recommendation is not to set them at all. CPU is elastic and can be acquired and released as needed. Even if a pod takes the full CPU of the machine, the machine itself won't crash, and once the CPU usage drops, the available CPU will be reallocated. On the other hand, if you have a stringent CPU limit in place and the pod reaches it, the application will be throttled and slow down. This is especially true during startup, when an application can consume up to 10 times its normal amount of CPU; with a limit in place, startup can slow down considerably. For these reasons, even for production workloads, it's advisable not to set a CPU limit.
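
Combining these recommendations (memory requests equal to limits for guaranteed allocation, a CPU request for scheduling, and no CPU limit), a production-leaning sketch with illustrative values could look like:

```yaml
containers:
  - name: app
    image: yourapplication:latest   # placeholder image
    resources:
      requests:
        cpu: "1"          # reserved for scheduling; no cpu limit, so the pod can burst
        memory: 1Gi       # equal to the limit, giving guaranteed memory
      limits:
        memory: 1Gi
```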

## Rollover strategy

Now, let's talk about rollover strategies. When you deploy a new version of an application, you might have to do it in the middle of peak hours. You don't want your application going down or even taking a performance hit, which means your deployment strategy should first create the new pods that will replace the existing ones before the old pods are destroyed. Kubernetes already does this by default to a certain degree, allowing up to 25% of the desired replicas to be created above the replica count and up to 25% to be unavailable. However, if you only run about 2 replicas, that means one of them will go down before its replacement comes up, resulting in performance degradation. So you might want to specify these values explicitly yourself. You can do so with:

```yaml
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 1
    maxUnavailable: 0
```

The above block specifies that 1 new pod is allowed to come up in addition to the existing pods during a deployment, and that the maximum number of unavailable pods is 0. If you are running around 6 pods and want deployments to finish faster, you can set `maxSurge` to 3 or more, which means 3 new pods will come up, 3 old ones will go down, then another 3 will come up before the last 3 old pods finally shut down. The downside is that the additional pods need extra resources to run on; if the cluster is short on capacity, Kubernetes will scale the number of nodes up and then back down as needed. But if you have node disruption budgets in place to prevent nodes from going down during business hours, and the budget has been reached, the extra nodes will remain until the budget allows them to be removed.
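
For the 6-replica case just described, the faster rollout could be sketched as:

```yaml
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 3        # up to 3 extra pods may surge during the rollout
    maxUnavailable: 0  # never drop below the desired replica count
```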

## Pod topology skews

Next, let's look at topology skews. One of the major advantages of a microservice architecture is high availability. To support this, most cloud providers offer three or more availability zones (AZs) per region. Each AZ operates independently of the others, meaning that if one AZ were to go down, the other AZs would be unaffected. To use these zones effectively, you should always spread your Kubernetes resources across 2 or more AZs. While it is true that major cloud providers don't generally have AZ-wide outages, this is an extra step you can take to keep your business availability at a maximum.

When we talk about replicas, the idea is that if one replica goes down, another serves the traffic, so there is no service outage. However, if all your replicas are in a single AZ and that AZ goes down, you still get a service outage. So when scheduling your pods, ideally you want each replica on a separate machine, and each machine in a separate AZ. Note, though, that this might not be practical from a cost perspective if you only have a few small applications: separate network gateways for each AZ, as well as the extra machines, will eventually cost you.

As such, a compromise between the two is to schedule pods in a different zone when a node is available there, but fall back to the same machine or the same availability zone when there isn't. The YAML for this would be:

```yaml
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app: nginx
```

This spreads the pods across zones, but if no machines are available to keep the spread balanced, the constraint is ignored. If you have several applications that can share resources across multiple machines, you can instead force pods into separate zones by replacing `ScheduleAnyway` with `DoNotSchedule`. With that setting, if no machine is available in a suitable zone, the pod will stay unscheduled; once that happens, your node scaler will kick in to satisfy the requirement.

# Conclusion
