Troubleshooting Pod CrashLoopBackOff Errors in K8s
What is the CrashLoopBackOff error?
A pod stuck in CrashLoopBackOff is one of the most common errors faced while deploying applications to Kubernetes. While in CrashLoopBackOff, the pod keeps crashing at the same point shortly after it is deployed and started. It usually occurs because the pod is not starting correctly.
Every pod has a spec field, which in turn has a restartPolicy field. The possible values of this field are Always, OnFailure, and Never, and the value applies to all the containers in that particular pod. This policy refers to restarts of the containers by the kubelet. When the restartPolicy is Always or OnFailure, every time a failed container is restarted by the kubelet, it is restarted with an exponential back-off delay. This delay is capped at five minutes, and if the container runs successfully for 10 minutes, the delay is reset to its initial value.
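For reference, the restart policy is set at the pod level in the manifest. The snippet below is a minimal sketch with placeholder names (demo-app and its image are illustrative, not values from this article):

apiVersion: v1
kind: Pod
metadata:
  name: demo-app
spec:
  restartPolicy: Always        # Always | OnFailure | Never; applies to every container in this pod
  containers:
    - name: demo-app
      image: nginx:1.25        # placeholder image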
How to spot a CrashLoopBackOff error?
This error can be spotted with a single command. All you have to do is run your standard
kubectl get pods -n <namespace>
command and check the STATUS column to see whether any of your pods are in CrashLoopBackOff.
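The output will look roughly like the following (the ready count, restart count, and age here are illustrative):

NAME                            READY   STATUS             RESTARTS   AGE
demo-app-dev-6cfd6448db-v86d6   0/1     CrashLoopBackOff   7          16m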
Once you have narrowed down the pods in CrashLoopBackOff, run the following command:
kubectl describe po <pod name> -n <namespace>
In the output of the above command:
- Check the Events section to see whether any of the probes (liveness, readiness, startup) are failing.
- Check the Events section for an OOMKilled event.
- Look at the container's state in the status section and check whether a Reason of Error is displayed along with the exit code.
The output you get will be similar to the below examples:
Events:
  Type     Reason     Age                   From                                                Message
  ----     ------     ----                  ----                                                -------
  Normal   Pulled     35m (x14 over 73m)    kubelet, ip-172-31-190.us-east-2.compute.internal   Container image "686244538589.dkr.ecr.us-east-2.amazonaws.com/devtron:eefccal191-223-191" already present on machine
  Warning  Unhealthy  10m (x124 over 144m)  kubelet, ip-172-31-190.us-east-2.compute.internal   Readiness probe failed: HTTP probe failed with statuscode: 404
  Warning  BackOff    32s (x217 over 67m)   kubelet, ip-172-31-190.us-east-2.compute.internal   Back-off restarting failed container
Events:
  Type     Reason     Age               From                                                    Message
  ----     ------     ----              ----                                                    -------
  Normal   Scheduled  28s               default-scheduler                                       Successfully assigned dev/demo-app-dev-6cfd6448db-v86d6 to ip-172-31-10-124.us-east-2.compute.internal
  Warning  BackOff    14s               kubelet, ip-172-31-10-124.us-east-2.compute.internal    Back-off restarting failed container
  Normal   Pulled     1s (x3 over 28s)  kubelet, ip-172-31-10-124.us-east-2.compute.internal    Container image "686244538589.dkr.ecr.us-east-2.amazonaws.com/devtron:4773aaa2-165-3023" already present on machine
  Normal   Created    0s (x3 over 28s)  kubelet, ip-172-31-10-124.us-east-2.compute.internal    Created container
  Normal   Started    0s (x3 over 27s)  kubelet, ip-172-31-10-124.us-east-2.compute.internal    Started container
Events:
  Type     Reason     Age                  From                                                    Message
  ----     ------     ----                 ----                                                    -------
  Normal   Scheduled  2m34s                default-scheduler                                       Successfully assigned dev/demo-app-dev-5c46b97695-wcvr8 to ip-172-31-10-124.us-east-2.compute.internal
  Normal   Pulling    2m33s                kubelet, ip-172-31-10-124.us-east-2.compute.internal    pulling image "686244538589.dkr.ecr.us-east-2.amazonaws.com/devtron:588bf608-165-2589"
  Normal   Pulled     2m30s                kubelet, ip-172-31-10-124.us-east-2.compute.internal    Successfully pulled image "686244538589.dkr.ecr.us-east-2.amazonaws.com/devtron:588bf608-165-2589"
  Normal   Pulled     67s (x4 over 2m29s)  kubelet, ip-172-31-10-124.us-east-2.compute.internal    Container image "686244538589.dkr.ecr.us-east-2.amazonaws.com/devtron:588bf608-165-2589" already present on machine
  Normal   Created    66s (x5 over 2m30s)  kubelet, ip-172-31-10-124.us-east-2.compute.internal    Created container
  Normal   Started    66s (x5 over 2m30s)  kubelet, ip-172-31-10-124.us-east-2.compute.internal    Started container
  Warning  BackOff    53s (x9 over 2m28s)  kubelet, ip-172-31-10-124.us-east-2.compute.internal    Back-off restarting failed container
This information will help you get to the root of the error.
Why does CrashLoopBackOff usually occur?
This error can be caused by a number of different things, but there are a few commonly spotted reasons:
1. Probe failure
The kubelet uses liveness, readiness, and startup probes to keep a check on the container. If the liveness probe or the startup probe fails, the container is restarted at that point. (A failing readiness probe does not restart the container, but it does keep the pod out of service until the probe passes.)
To solve this, you first need to check whether the probes have been configured properly. Ensure that all the specs (endpoint, port, SSL config, timeout, command) are correctly specified. If there are no errors there, check the application logs. If you are still not able to spot any errors, attach an ephemeral container and run curl or other relevant commands against the probe endpoint to confirm the application is responding correctly, as shown in the sketches below.
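As a reference point for that check, a typical probe block in the container spec looks something like the sketch below; the /healthz and /ready paths, port 8080, and the timings are placeholders, not values taken from this article:

containers:
  - name: demo-app
    image: <your-image>
    livenessProbe:
      httpGet:
        path: /healthz          # must match an endpoint the application actually serves
        port: 8080              # must match the container port
      initialDelaySeconds: 10   # give the application time to start before probing
      periodSeconds: 10
      timeoutSeconds: 2
      failureThreshold: 3       # container is restarted after 3 consecutive failures
    readinessProbe:
      httpGet:
        path: /ready
        port: 8080
      periodSeconds: 5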
Host Port:     0/TCP
State:         Waiting
  Reason:      CrashLoopBackOff
Last State:    Terminated
  Reason:      Error
  Exit Code:   137
  Started:     Tue, 26 Jan 2021 03:54:43 +0000
  Finished:    Tue, 26 Jan 2021 03:56:02 +0000
Ready:         False
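If the probe configuration looks correct but the failures continue, an ephemeral debug container (mentioned above) lets you hit the probe endpoint from inside the pod. A minimal sketch, assuming the application listens on port 8080 and exposes /healthz (both placeholders):

# Attach an ephemeral container that targets the crashing container
kubectl debug -it <pod-name> -n <namespace> --image=curlimages/curl --target=<container-name> -- sh

# From inside the ephemeral container, call the probe endpoint directly
curl -v http://localhost:8080/healthz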
2. Out of memory failure (OOM)
Every pod has a specified amount of memory allocated to it, and when it tries to consume more memory than that, the pod will keep crashing. This can occur if the pod is allocated less memory than it actually requires to run, or if there is an error in the application and it keeps consuming all the available memory while it is running.
When you take a look at the events log, you will notice an OOMKilled event, which clearly indicates that the pod crashed because it used up all the memory allocated to it.
To solve this error, you can increase the memory allocated to the pod. This does the trick in most cases. But if the pod is consuming excessive amounts of memory, you will have to look into the application itself for the cause. If it is a Java application, check the heap configuration. A sketch of the memory settings is shown below.
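Memory for a container is controlled through resource requests and limits in the pod spec. A minimal sketch, with placeholder values you would tune for your own workload:

containers:
  - name: demo-app
    image: <your-image>
    resources:
      requests:
        memory: "256Mi"   # what the scheduler reserves for the container
      limits:
        memory: "512Mi"   # exceeding this limit gets the container OOMKilled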
State:          Waiting
  Reason:       CrashLoopBackOff
Last State:     Terminated
  Reason:       OOMKilled
  Exit Code:    137
  Started:      Tue, 26 Jan 2021 06:50:20 +0000
  Finished:     Tue, 26 Jan 2021 06:50:26 +0000
Ready:          False
Restart Count:  1
3. Application failure
At times, the application within the container itself keeps crashing because of some error, and that can cause the pod to crash repeatedly. In this case you will have to look at the application code and debug it. Run the following command:
kubectl logs -n <namespace> <podName> -c <containerName> --previous
The last lines of the logs usually help narrow down the source of the error within the application. The exit code in the pod's status is also a useful hint; for example, exit code 127 usually means a command was not found, while 137 means the container was killed with SIGKILL (often by the OOM killer).
State:          Waiting
  Reason:       CrashLoopBackOff
Last State:     Terminated
  Reason:       Error
  Exit Code:    127
  Started:      Tue, 26 Jan 2021 06:58:47 +0000
  Finished:     Tue, 26 Jan 2021 06:58:47 +0000
Ready:          False
Restart Count:  4
If none of the three causes mentioned above explains why your pod is in CrashLoopBackOff, you will need to increase the logging level of the application and check the logs as described in step 3.