Securing Kubernetes Production: Avoid These 11 Common Misconfigurations

We know how scary it is when production goes down because of a misconfiguration. Do you know what is even scarier than the misconfiguration itself? Presenting the RCA (root cause analysis) to your manager.

In production, there is a mountain of configurations, and when production goes down, someone has to find that single annoying misconfiguration. Finding it feels like looking for a needle in a haystack.

It's important to know the common misconfigurations anyone can make and how to prevent them from turning into scary situations. Here are 11 common misconfigurations that should be avoided in any environment, be it production or UAT.

1. Not enough available IPs in the VNet/subnet

IPs are required for communication both within and outside of a cluster. When creating a Kubernetes cluster with EKS, AKS, or GKE, a VPC or VNet is used for IP address management, network isolation, and security. The IPs for pods, services, and nodes are allocated from its subnets.

Each node requires an IP address within the network, and depending on the CNI, each pod may need one as well. If the available IP addresses are limited, there will come a point where no more nodes can be added to the cluster, which restricts the cluster's ability to handle additional load and to autoscale.

Difficulty in Maintenance: Maintenance activities, such as updating nodes or migrating workloads, become more challenging if IP addresses are scarce. The lack of available IPs can complicate tasks like gracefully moving pods during maintenance.

To avoid issues caused by a shortage of available IPs in the subnet, take precautions: before any activity such as upgrading the cluster, adding a node, increasing replica counts, or deploying a large application, check how many IPs are available and estimate how many will be needed.
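
As a rough, hypothetical sizing example (assuming the AWS VPC CNI, where every pod also consumes a subnet IP): a /24 subnet provides 256 addresses, of which AWS reserves 5, leaving 251 usable. A 20-node cluster averaging 12 pods per node needs roughly 20 + (20 × 12) = 260 IPs, which already exceeds the subnet, so the next scale-up or rolling upgrade that adds surge nodes will fail to get addresses.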

2. Missing or misconfigured liveness and readiness probes

Liveness and readiness probes are mechanisms used to determine the health and availability of an application. They play a crucial role in ensuring the reliability and stability of applications.

  • If an application encounters a deadlock, memory leak, or any other internal issue that causes it to become unresponsive, the liveness probe will detect this state and restart the container, attempting to bring the application back to a healthy state.
  • A readiness probe determines whether a pod is ready to receive traffic. Before a pod is added to the Service's endpoints (and hence the load balancer), the readiness probe checks if it is ready to handle incoming traffic. This prevents traffic from being routed to pods that may not be fully initialized or ready to serve requests (see the sample manifest after this list).
  • When it comes to scaling down an application or performing upgrades, readiness probes are essential. When a pod's readiness probes fail, it will not receive additional traffic, allowing it to finish processing existing requests and perform a graceful shutdown without impacting the user experience.
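
As a reference, here is a minimal sketch of a container with both probes configured (the endpoint paths, port, and timing values are illustrative assumptions and should be tuned to the application):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-app                  # hypothetical application name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: demo-app
  template:
    metadata:
      labels:
        app: demo-app
    spec:
      containers:
        - name: demo-app
          image: demo-app:1.0.0   # placeholder image
          ports:
            - containerPort: 8080
          livenessProbe:          # restarts the container if the app stops responding
            httpGet:
              path: /healthz      # assumed health endpoint
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 10
            failureThreshold: 3
          readinessProbe:         # keeps the pod out of Service endpoints until it is ready
            httpGet:
              path: /ready        # assumed readiness endpoint
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5
```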

These are some issues that may be encountered when liveness and readiness probes are configured incorrectly:

  • Unnecessary Pod Restarts
  • Delayed Failure Detection
  • Delayed Scaling
  • Rollback Issues

3. PDBs not configured or configured incorrectly

PDBs (Pod Disruption Budgets) are particularly useful for applications that require high availability and where maintaining a minimum number of running instances is critical. They help prevent scenarios where all pods of an application are taken offline simultaneously, ensuring that a certain level of service availability is maintained even during maintenance or unexpected failures.

Incorrect PDB configurations can also be problematic. For instance, when upgrading the Kubernetes version of a cluster, if a workload runs with replicas: 1 and its PDB requires minAvailable: 1 (or maxUnavailable: 0), Kubernetes is not allowed to evict that pod. Node drains get stuck and the cluster upgrade is blocked. It's important to find a balance between availability and maintenance requirements.
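
For reference, here is a minimal sketch of a PDB that tolerates one voluntary disruption at a time (the name and label selector are illustrative assumptions):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: demo-app-pdb        # hypothetical name
spec:
  maxUnavailable: 1         # allow one pod to be evicted at a time during node drains
  selector:
    matchLabels:
      app: demo-app         # must match the workload's pod labels
```

Note that such a budget is only meaningful when the workload runs more than one replica; with a single replica, any eviction still briefly takes the application down.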

4. Horizontal Pod Autoscaler not enabled

HPA (Horizontal Pod Autoscaler) is a controller in Kubernetes that automatically adjusts the number of pod replicas in a Deployment, StatefulSet, or ReplicaSet based on observed CPU/memory utilization or custom metrics. Key components and concepts related to the Horizontal Pod Autoscaler include:

  • Metrics: indicate the current utilization of resources or custom application metrics
  • Target Resource Utilization
  • Current Resource Utilization
  • Scaling Algorithm

Some basic misconfigurations that may cause issues:

  • An inappropriate “--horizontal-pod-autoscaler-sync-period” value, which delays scaling decisions.
  • If some of the pod's containers do not have the relevant resource requests set, CPU utilization for the pod will not be defined and the autoscaler will not take any action for that metric (see the sketch after this list).
  • Selecting highly fluctuating metrics, which makes the HPA flap between scaling up and scaling down.
  • If the wrong metrics are chosen, or metrics that don't accurately reflect the application's behavior, the HPA might make incorrect scaling decisions. This can result in over- or under-scaling, affecting performance and resource utilization.
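
To make the resource-request point concrete, here is a minimal sketch of an autoscaling/v2 HPA targeting a Deployment (the names and thresholds are illustrative assumptions, and the target Deployment's containers must declare CPU requests for the utilization metric to work):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: demo-app-hpa          # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: demo-app            # must reference an existing Deployment
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out when average CPU utilization crosses 70%
```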

5. Incorrect Autoscaling configurations

We all know how important autoscaling is, but all that effort is in vain when the autoscaler is not configured correctly. Here are some common misconfigurations:

  1. Defining the wrong tags or region for a node group or node pool. In simple words, the cluster autoscaler doesn't know which node groups or node pools it is allowed to scale (see the sketch after this list).
  2. The maximum size of the node group or node pool has already been reached, so the autoscaler cannot increase the node count.
  3. Incorrect IAM permissions for the identity the cluster autoscaler pod runs with. In the case of AWS, check the cluster autoscaler pod logs for permission issues; the node IAM role must have the required policy attached. https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/aws/README.md#iam-policy
  4. The node type is not available in the availability zone. Suppose the autoscaler wants to add a new node of type “t3.medium”, but that instance type is not offered in one of the availability zones configured for the node group; the scale-up will fail. To overcome this, configure node types that are actually available in the chosen region and zones.
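
For reference, here is a minimal, illustrative sketch of how the cluster autoscaler on AWS is typically pointed at its node groups via auto-discovery tags (the cluster name and image version are assumptions; the exact flags and IAM policy are documented in the link above):

```yaml
# Excerpt from the cluster-autoscaler Deployment's container spec (AWS)
containers:
  - name: cluster-autoscaler
    image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.28.2   # assumed version; match your cluster version
    command:
      - ./cluster-autoscaler
      - --cloud-provider=aws
      - --balance-similar-node-groups
      - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/my-cluster   # the ASGs must carry these tags
```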

Feel free to check out this blog for a detailed understanding of autoscaling and how to scale applications in Kubernetes.

6. Not using the latest Kubernetes API versions

Using old or deprecated Kubernetes API versions opens the door to various unexpected issues, such as:

  • Security Vulnerabilities
  • Compatibility Issues
  • Lack of Support
  • Limited Access to Tools and Integrations
  • Incompatible Add-ons
  • Configuration Complexity

Plan regular maintenance and upgrades of the Kubernetes version to reduce these vulnerabilities. Before updating the cluster, take a backup and have a well-defined rollback plan in case any issues arise during the upgrade process.
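
As a concrete example of this kind of breakage, Ingress objects written against the deprecated extensions/v1beta1 API stopped being served in Kubernetes 1.22 and must be migrated to networking.k8s.io/v1 (the names below are illustrative):

```yaml
# Before: deprecated API group, no longer served since Kubernetes 1.22
# apiVersion: extensions/v1beta1
# kind: Ingress

# After: the current API group
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: demo-ingress            # hypothetical name
spec:
  rules:
    - host: demo.example.com    # hypothetical host
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: demo-app
                port:
                  number: 80
```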

7. Running "kubectl apply" manually in clusters and editing resources directly from cloud console

When production goes down, what is the first question your manager asks? “Who did this?” If so, your manager is like most managers. You need auditing and better monitoring to answer that question. When someone runs “kubectl apply” or “kubectl edit” directly against the cluster, it becomes difficult to track the changes, and auditing them is even harder.

Common issues that can be seen:

  • Configuration Drift: The actual state of resources diverges from the desired state.
  • Inconsistent Changes: When multiple team members manually apply changes, there's a risk of conflicting updates or unintentional changes, leading to confusion and unexpected behavior.
  • Overwriting Changes: When manually editing resources, critical configurations might accidentally be overwritten, leading to outages or unexpected behavior.
  • Difficulty in Reproducing Issues: Troubleshooting issues becomes harder when changes are applied manually without proper records, as it's challenging to reproduce the state where the issue occurred.
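
One way to at least answer the “who did this?” question is API server audit logging. On managed clusters (EKS, AKS, GKE) it is enabled through the cloud provider's logging settings; on self-managed clusters it is driven by an audit policy file. A minimal, illustrative sketch (the resource list and levels are assumptions to adapt):

```yaml
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
  # Record full request/response bodies for writes to workload resources
  - level: RequestResponse
    verbs: ["create", "update", "patch", "delete"]
    resources:
      - group: "apps"
        resources: ["deployments", "statefulsets", "daemonsets"]
  # Record only metadata (who, what, when) for everything else
  - level: Metadata
```

The policy file is passed to the API server via --audit-policy-file, along with a log backend such as --audit-log-path. The more durable fix, however, is to keep manifests in Git and apply them through a pipeline instead of ad hoc kubectl commands.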

A platform engineering tool like Devtron is tailored to provide auditing and monitoring of an application and everything that happens to it. Feel free to check this out for access management at scale in Kubernetes.

8. Sharing admin Kubeconfig with users or not having separate creds for each team member

In a Kubernetes cluster, sharing an admin kubeconfig or failing to provide distinct credentials for each team member can cause serious security and accountability problems. Sharing administrator credentials allows uncontrolled access, which can lead to misconfigurations, security hazards, and a lack of accountability for actions taken. On the other hand, not having separate credentials prevents proper access control and segregation of duties, making it difficult to track and audit user actions. To solve these problems, it's critical to implement granular Role-Based Access Control (RBAC), create individual user accounts, and enforce best practices such as credential rotation and multi-factor authentication. This ensures that access is properly restricted, operations are traceable, and the cluster's overall security posture is maintained.
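
As a starting point, here is a minimal sketch of a namespaced, read-only Role bound to an individual user (the namespace, user identity, and permissions are illustrative assumptions):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: dev-read-only           # hypothetical role name
  namespace: dev                # hypothetical namespace
rules:
  - apiGroups: ["", "apps"]
    resources: ["pods", "services", "deployments", "replicasets"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: alice-dev-read-only
  namespace: dev
subjects:
  - kind: User
    name: alice@example.com     # the identity provided by your OIDC/cloud IAM integration
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: dev-read-only
  apiGroup: rbac.authorization.k8s.io
```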

To learn more about securing Kubernetes CI/CD, feel free to check this blog.

9. Not setting up logging and monitoring stacks

In a Kubernetes cluster, neglecting to set up appropriate logging and monitoring stacks can result in multiple problems and risks. Without these critical tools in place, visibility into application behaviour, system health, and performance is hampered, making it difficult to debug issues and ensure optimal performance. Furthermore, the lack of comprehensive monitoring can hinder security efforts, limit compliance with industry standards, and make it hard to gain clear visibility into the state of deployed applications. To prevent these consequences, it is critical to establish strong logging and monitoring practices, using technologies such as Prometheus, Grafana, Elasticsearch, and Fluentd to collect and analyze metrics, logs, and events efficiently.
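
For example, with the kube-prometheus-stack installed, a ServiceMonitor like the minimal, illustrative sketch below tells Prometheus to scrape an application's metrics endpoint (the names, labels, and port are assumptions and must match your Service):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: demo-app                     # hypothetical name
  labels:
    release: kube-prometheus-stack   # must match the Prometheus serviceMonitorSelector
spec:
  namespaceSelector:
    matchNames: ["dev"]              # hypothetical application namespace
  selector:
    matchLabels:
      app: demo-app                  # must match the Service's labels
  endpoints:
    - port: metrics                  # named Service port exposing /metrics
      interval: 30s
```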

Check out this blog for setting up a monitoring stack on Kubernetes.

10. Not setting up autoscaling for cluster infra essentials

In a Kubernetes cluster, failing to enable autoscaling for cluster essentials such as the nginx-ingress controller, kube-prometheus-stack, Elasticsearch, CoreDNS, etc. can result in significant performance bottlenecks and decreased reliability. These components play key roles in network traffic management and DNS resolution, making them vulnerable to increased demand during traffic spikes or heavy workloads. If autoscaling is not implemented, the cluster may struggle to handle additional traffic efficiently, resulting in latency, interruptions, and even application unavailability. Enabling autoscaling for these critical services ensures that resources are dynamically allocated to meet demand, boosting performance, optimizing resource utilization, and ensuring uninterrupted service delivery even under variable workloads.

11. Not adding deletion protection on ALB

Failure to enable deletion protection on an AWS Application Load Balancer (ALB) can result in accidental data loss, service outages, and unwanted consequences. Deletion protection is a safety feature that prevents essential resources such as ALBs from being removed by mistake, protecting the infrastructure from human error or unauthorized actions. Without deletion protection, a vital ALB that routes traffic to the applications could be deleted, resulting in application downtime, a degraded user experience, and potential revenue loss. Enabling deletion protection on ALBs adds an extra layer of security against accidental or malicious deletions, assuring the stability and availability of applications.
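
If the ALB is provisioned through the AWS Load Balancer Controller, deletion protection can be requested from the Ingress itself via load balancer attributes; a minimal, illustrative sketch (names and hosts are assumptions):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: demo-ingress               # hypothetical name
  annotations:
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/load-balancer-attributes: deletion_protection.enabled=true   # prevents accidental deletion of the ALB
spec:
  ingressClassName: alb
  # ... rules omitted for brevity
```

For ALBs managed outside the cluster, the same attribute can be enabled directly from the AWS console or CLI.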

Conclusion

Preventing common misconfigurations is essential for maintaining operational stability. This blog highlighted 11 such pitfalls, including IP scarcity, improper probes, and inadequate logging. Proper strategies, like Pod Disruption Budgets and accurate HPA setups, can enhance availability. Staying updated with Kubernetes API versions, avoiding manual changes, and securing credentials are vital for a robust system. Robust monitoring, autoscaling of cluster essentials, and deletion protection for ALBs ensure optimal performance. Proactively avoiding these misconfigurations enhances reliability, security, and overall operational efficiency.

If you have any questions feel free to reach out to us. Our thriving Discord Community is just a click away!