One key benefit of running application containers on Kubernetes is the flexibility it provides for scaling application pods. When working with microservices, you may not need to scale every application component. Say you have an e-commerce application that consists of a frontend and a backend microservice. During peak hours, the load on the application rises sharply and the backend has many more requests to process, so it needs more resources. The load on the frontend grows less, so it may need fewer resources.
When you run apps on Kubernetes, application containers are deployed inside pods. A single application can have multiple pods across which the load is distributed. Since pods are controlled by the Deployment object, scaling by hand means that you, as a DevOps engineer, update the number of pod replicas in the Deployment's manifest. Manual updates are time-consuming, and you may misjudge the number of replicas required to handle the incoming requests.
Kubernetes has a number of different autoscaling mechanisms that can help dynamically adjust the number of resources available for managing incoming traffic requests. In this blog post, you will learn about the Horizontal Pod Autoscaler (HPA), one of the many autoscaling mechanisms present within Kubernetes, that helps make your cluster reliable even at scale.
Types of Kubernetes Autoscaling
Before going in depth into the Kubernetes HPA, it is important to understand the basic types of Kubernetes autoscaling. Autoscaling automatically scales certain Kubernetes resources up or down when certain conditions are met. For example, the number of pods can be increased or decreased based on the amount of CPU or memory that each pod uses.
There are three main types of Kubernetes Autoscaling techniques:
- Horizontal Pod Autoscaler (HPA): Increases or decreases the number of pods for a certain application.
- Vertical Pod Autoscaler (VPA): Increases or decreases the system resources, such as CPU and memory, allocated to a pod.
- Cluster Autoscaler: Increases or decreases the number of nodes that are part of the cluster.
These autoscalers work at either the pod level (HPA and VPA) or the cluster level (Cluster Autoscaler). Different types of autoscalers can be combined to achieve more robust autoscaling behavior for the cluster, which helps prevent downtime caused by insufficient resources.
What is Kubernetes Horizontal Pod Autoscaler (HPA)?
The Kubernetes Horizontal Pod Autoscaler (HPA) is one of the types of Kubernetes autoscaler. It scales the number of pods of a Deployment, ReplicaSet, or StatefulSet based on criteria such as the CPU or memory utilization of the pods. Because this type of scaling changes the number of pod instances rather than the size of each pod, it is known as horizontal scaling.
The primary goal of the Kubernetes HPA is to ensure that the deployed workloads have the right amount of resources and that the number of resources is automatically increased or decreased for handling the incoming traffic load. This in turn ensures that all end users get a smooth experience while using your applications, and also reduces costs by removing unused resources.
Kubernetes HPA relies on the Kubernetes Metrics Server to decide when a workload should be scaled up or down. Using the Metrics Server, Kubernetes HPA observes the CPU and memory utilization of the pods it targets. If the resource utilization is above the set threshold, the number of pods is increased. Conversely, if the utilization is below the threshold, Kubernetes HPA reduces the number of running pods to optimize resource utilization.
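The threshold comparison described above can be sketched in a few lines of Python. This is a simplified illustration of the replica calculation documented for the HPA controller (desired replicas = ceil(current replicas × current metric / target metric)), not the controller's actual code. The function name and its parameters are made up for this example, and the sketch ignores details such as pods that are not ready or have missing metrics.

```python
import math

def desired_replicas(current_replicas, current_utilization, target_utilization,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Approximate the HPA replica calculation for one utilization metric."""
    ratio = current_utilization / target_utilization
    # Inside the tolerance band (10% by default) the replica count is left alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # The result is always clamped to the configured minimum and maximum.
    return max(min_replicas, min(max_replicas, desired))

print(desired_replicas(1, 150, 50))  # 3: utilization is three times the target
print(desired_replicas(4, 10, 50))   # 1: load is low, scale down
print(desired_replicas(2, 52, 50))   # 2: within tolerance, no change
```

Rounding up with `ceil` means the autoscaler errs on the side of one extra pod rather than leaving the workload slightly under-provisioned.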
Kubernetes HPA Use Cases
The Horizontal Pod Autoscaler (HPA) has several use cases, especially in dynamic environments where the traffic load is unpredictable. Here are some common use cases where HPA proves valuable:
- Handling traffic spikes in web applications: Environments with variable and unpredictable traffic can use Kubernetes HPA to meet increased traffic demands and to reduce costs when resources are underutilized. The best example of this is an e-commerce website during a sales event.
- Batch Processing Jobs: For systems that handle batch data processing or stream analytics, Kubernetes HPA can adjust the number of pods based on queue sizes or processing rates. This ensures timely data processing during high-load periods without over-provisioning resources when the workload is lighter.
- Scaling based on Custom Metrics: Kubernetes HPA can use custom metrics instead of CPU and memory to determine when to scale the pods up or down. This is useful for applications with unique performance characteristics. For example, in gaming applications, the HPA might scale pods based on metrics like active users or match instances.
- Distributed Microservice Architecture: A microservice architecture involves many tiny components that work together. Each microservice handles a different functionality of the application. Kubernetes HPA can independently scale each microservice based on its specific resource usage patterns, ensuring that critical services receive the necessary resources without impacting less demanding services.
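To make the custom metrics use case above more concrete, here is a minimal Python sketch of what a per-pod metric entry in an `autoscaling/v2` HPA spec could look like, together with the arithmetic for an average-value target. The metric name `active_users`, the target of 100, and the sample readings are all hypothetical; a real cluster would also need a metrics adapter that serves this metric through the custom metrics API.

```python
import math

# Hypothetical metric entry for the "metrics" list of an autoscaling/v2 HPA.
# "active_users" is an invented name; an adapter must expose it per pod.
active_users_metric = {
    "type": "Pods",
    "pods": {
        "metric": {"name": "active_users"},
        "target": {"type": "AverageValue", "averageValue": "100"},
    },
}

def replicas_for_average_value(per_pod_values, target_average):
    """Replicas needed so that the per-pod average drops to the target."""
    return math.ceil(sum(per_pod_values) / target_average)

# Three pods currently serve 180, 220 and 200 active users (made-up numbers).
print(replicas_for_average_value([180, 220, 200], 100))  # 6
```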
Configure Kubernetes HPA
Prerequisites:
- Ensure that you have a running Kubernetes Cluster and kubectl, version 1.2 or later.
- Deploy Metrics Server in the cluster to provide metrics through the resource metrics API. You can also install it using Devtron's Helm capabilities.
- If you want to make use of custom metrics, your cluster must be able to communicate with the API server providing the custom metrics API.
Below are the steps to deploy an application and configure HPA on a Kubernetes cluster:
Deploy an Application using Docker
- Here, we are using a custom Docker image based on the php-apache image.
- Create a Dockerfile with the following content:
FROM php:5-apache
ADD index.php /var/www/html/index.php
RUN chmod a+rx index.php
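Below is the index.php page, which performs calculations to generate intensive CPU load.
<?php $x = 0.0001;
for ($i = 0; $i <= 1000000; $i++) { $x += sqrt($x); } echo "OK!"; // CPU-heavy loop, then a short response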
Then start a Deployment running the image and expose it as a Service using the following YAML configuration. Let's save it in a file called php-app.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: php-apache
spec:
  selector:
    matchLabels:
      run: php-apache
  replicas: 1
  template:
    metadata:
      labels:
        run: php-apache
    spec:
      containers:
      - name: php-apache
        image: k8s.gcr.io/hpa-example
        ports:
        - containerPort: 80
        resources:
          limits:
            cpu: 500m
          requests:
            cpu: 200m
---
apiVersion: v1
kind: Service
metadata:
  name: php-apache
  labels:
    run: php-apache
spec:
  ports:
  - port: 80
  selector:
    run: php-apache
Then, run the following command:
kubectl apply -f php-app.yaml
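The CPU request in this manifest matters for autoscaling, because the HPA expresses CPU utilization as a percentage of the requested CPU, not of the limit or of the node's capacity. The Python sketch below illustrates that relationship using the 200m request from php-app.yaml. The function name and the usage figures are invented for the example.

```python
def cpu_utilization_percent(usage_millicores, request_millicores):
    """Average CPU utilization across pods, as a percentage of the CPU request."""
    total_usage = sum(usage_millicores)
    total_request = request_millicores * len(usage_millicores)
    return 100 * total_usage / total_request

# One pod using 300m against the 200m request from php-app.yaml.
print(cpu_utilization_percent([300], 200))            # 150.0
# The same load spread over three pods.
print(cpu_utilization_percent([100, 100, 100], 200))  # 50.0
```

Utilization can exceed 100% here because the container's limit (500m) is higher than its request (200m). A container without a CPU request gives the HPA nothing to compute a percentage against, which is why the manifest sets one.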
Create the Horizontal Pod Autoscaler (HPA)
Now that the server is running, you can create a Kubernetes HPA resource using kubectl autoscale.
When you create a Horizontal Pod Autoscaler (HPA), it maintains between 1 and 10 replicas of the pods controlled by the php-apache Deployment that you created in the previous step.
Kubernetes HPA increases or decreases the number of replicas to keep the average CPU utilization across all pods at 50%. You can create the Horizontal Pod Autoscaler using the following kubectl autoscale command:
kubectl autoscale deployment php-apache --cpu-percent=50 --min=1 --max=10
You can check the current status of the autoscaler using the command:
kubectl get hpa
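If you prefer to keep the autoscaler in version control rather than creating it imperatively, the same settings can be written as a manifest. The Python sketch below builds what should be an equivalent `autoscaling/v2` HorizontalPodAutoscaler object and prints it as JSON, which kubectl accepts alongside YAML. The function name and the file names in the usage note are illustrative, and `autoscaling/v2` assumes a cluster running Kubernetes 1.23 or later.

```python
import json

def build_hpa_manifest(name="php-apache", min_replicas=1, max_replicas=10,
                       cpu_percent=50):
    """Build an autoscaling/v2 HPA object targeting a Deployment of the same name."""
    return {
        "apiVersion": "autoscaling/v2",
        "kind": "HorizontalPodAutoscaler",
        "metadata": {"name": name},
        "spec": {
            "scaleTargetRef": {
                "apiVersion": "apps/v1",
                "kind": "Deployment",
                "name": name,
            },
            "minReplicas": min_replicas,
            "maxReplicas": max_replicas,
            "metrics": [{
                "type": "Resource",
                "resource": {
                    "name": "cpu",
                    "target": {
                        "type": "Utilization",
                        "averageUtilization": cpu_percent,
                    },
                },
            }],
        },
    }

print(json.dumps(build_hpa_manifest(), indent=2))
```

Assuming the script is saved as build_hpa.py, you could redirect its output to a file such as php-hpa.json and apply it with kubectl apply -f php-hpa.json.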
Kubernetes HPA Behaviour
Let's simulate a scenario where the application faces an increase in load to see how the Horizontal Pod Autoscaler (HPA) that we created will behave.
Run the following commands in a separate terminal:
- First, start a temporary busybox pod with an interactive shell
kubectl run -it --rm load-generator --image=busybox --restart=Never -- /bin/sh
- Then, from the shell inside that pod, send an infinite loop of queries to the php-apache service
while true; do wget -q -O- http://php-apache.default.svc.cluster.local; done
- Back in your original terminal, you can check the higher CPU load by executing:
kubectl get hpa
Terminate the Load
You can now stop the user load. Switch to the terminal where you started the busybox pod and press Ctrl + C.
You can verify that the load has stopped using the command:
kubectl get hpa
The CPU utilization will drop to 0%, and Kubernetes HPA will scale the number of replicas back down to 1. Scaling down may take a few minutes.
Result State:
NAME         REFERENCE                     TARGET     MINPODS   MAXPODS   REPLICAS   AGE
php-apache   Deployment/php-apache/scale   0% / 50%   1         10        1          11
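The delay before the replica count drops comes from the scale-down stabilization window, which defaults to 5 minutes. During that window the controller acts on the highest recent recommendation, so a brief dip in load does not remove pods that may be needed again moments later. The Python sketch below is a simplified model of that idea, not the controller's implementation; the class name and the timestamps are invented for illustration.

```python
from collections import deque

class ScaleDownStabilizer:
    """Keep recent replica recommendations and act on the highest one."""

    def __init__(self, window_seconds=300):
        self.window_seconds = window_seconds
        self.history = deque()  # (timestamp_seconds, recommended_replicas)

    def stabilize(self, now, recommendation):
        self.history.append((now, recommendation))
        # Forget recommendations that have fallen out of the window.
        while self.history and self.history[0][0] < now - self.window_seconds:
            self.history.popleft()
        return max(replicas for _, replicas in self.history)

stabilizer = ScaleDownStabilizer()
print(stabilizer.stabilize(0, 7))    # 7: load is still high
print(stabilizer.stabilize(60, 1))   # 7: load stopped, but 7 is still in the window
print(stabilizer.stabilize(299, 1))  # 7
print(stabilizer.stabilize(301, 1))  # 1: the old recommendation has expired
```

Because the highest value in the window wins, a higher recommendation takes effect immediately, so this model delays only scale-down, not scale-up.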
Conclusion
The Kubernetes Horizontal Pod Autoscaler (HPA) offers a powerful way to scale the pods and maintain application performance during fluctuating resource demands. By automating the scaling of pods based on real-time metrics, it enhances the resilience and responsiveness of applications and reduces costs.
Kubernetes HPA depends on accurate metrics to scale pods properly. It is an essential tool for most applications where traffic is unpredictable and can spike. By understanding its capabilities and constraints, teams can effectively leverage the Horizontal Pod Autoscaler and even combine it with complementary tools like Vertical Pod Autoscaler (VPA) or Kubernetes Event-Driven Autoscaler (KEDA) for more advanced use cases.