Troubleshooting Pod CrashLoopBackOff Errors in K8s

What is the CrashLoopBackOff error?

CrashLoopBackOff is one of the most common errors faced while deploying applications to Kubernetes. A pod in this state crashes shortly after it starts, gets restarted by the kubelet, and crashes again, over and over. It usually means the application inside the pod is not starting correctly.

Every pod has a spec field, which in turn contains a restartPolicy field. Its possible values are Always, OnFailure, and Never, and the chosen value applies to all the containers in that pod. The policy governs how the kubelet restarts containers: when restartPolicy is Always or OnFailure, every time a failed container is restarted by the kubelet, it is restarted with an exponential back-off delay (10s, 20s, 40s, and so on). This delay is capped at 5 minutes, and it resets to the initial value once the container has run successfully for 10 minutes.
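
For reference, the field sits directly under the pod's spec. A minimal, purely illustrative manifest (the name and image below are placeholders, not from any real deployment) would look like this:

apiVersion: v1
kind: Pod
metadata:
  name: demo-app                # hypothetical name, for illustration only
spec:
  restartPolicy: Always         # the default; the alternatives are OnFailure and Never
  containers:
    - name: demo
      image: nginx:1.25         # placeholder image, swap in your own

The back-off behaviour described above only applies when the policy is Always or OnFailure; with Never, a failed container is simply left in a failed state.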

How to spot a CrashLoopBackOff error?

This error can be spotted with a single command. Run the standard kubectl get pods -n <namespace> command and check the STATUS column to see whether any of your pods are in CrashLoopBackOff.
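
For example, reusing the namespace and pod name that appear in the describe output later in this article (the RESTARTS and AGE values here are illustrative), the output would look similar to:

kubectl get pods -n dev
NAME                            READY   STATUS             RESTARTS   AGE
demo-app-dev-6cfd6448db-v86d6   0/1     CrashLoopBackOff   7          12m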

Once you have narrowed down the pods in CrashLoopBackOff, run the following command:

kubectl describe po <pod name> -n <namespace>

In the output of the above command:

  1. Check the Events section to see whether any of the probes (liveness, readiness, or startup) are failing.
  2. Check the Events section for an OOMKilled event.
  3. Look in the status section of the pod for an 'Error' reason along with its exit code.

The output you get will be similar to the below examples:

Events:
  Type     Reason     Age                   From                                                   Message
  ----     ------     ----                  ----                                                   -------
  Normal   Pulled     35m (x14 over 73m)    kubelet, ip-172-31-190.us-east-2.compute.internal      Container image "686244538589.dkr.ecr.us-east-2.amazonaws.com/devtron:eefccal191-223-191" already present on machine
  Warning  Unhealthy  10m (x124 over 144m)  kubelet, ip-172-31-190.us-east-2.compute.internal      Readiness probe failed: HTTP probe failed with statuscode: 404
  Warning  BackOff    32s (x217 over 67m)   kubelet, ip-172-31-190.us-east-2.compute.internal      Back-off restarting failed container

Events:
  Type     Reason     Age               From                                                   Message
  ----     ------     ----              ----                                                   -------
  Normal   Scheduled  28s               default-scheduler                                      Successfully assigned dev/demo-app-dev-6cfd6448db-v86d6 to ip-172-31-10-124.us-east-2.compute.internal
  Warning  BackOff    14s               kubelet, ip-172-31-10-124.us-east-2.compute.internal   Back-off restarting failed container
  Normal   Pulled     1s (x3 over 28s)  kubelet, ip-172-31-10-124.us-east-2.compute.internal   Container image "686244538589.dkr.ecr.us-east-2.amazonaws.com/devtron:4773aaa2-165-3023" already present on machine
  Normal   Created    0s (x3 over 28s)  kubelet, ip-172-31-10-124.us-east-2.compute.internal   Created container
  Normal   Started    0s (x3 over 27s)  kubelet, ip-172-31-10-124.us-east-2.compute.internal   Started container

Events:
  Type     Reason     Age                   From                                                   Message
  ----     ------     ----                  ----                                                   -------
  Normal   Scheduled  2m34s                 default-scheduler                                      Successfully assigned dev/demo-app-dev-5c46b97695-wcvr8 to ip-172-31-10-124.us-east-2.compute.internal
  Normal   Pulling    2m33s                 kubelet, ip-172-31-10-124.us-east-2.compute.internal   pulling image "686244538589.dkr.ecr.us-east-2.amazonaws.com/devtron:588bf608-165-2589"
  Normal   Pulled     2m30s                 kubelet, ip-172-31-10-124.us-east-2.compute.internal   Successfully pulled image "686244538589.dkr.ecr.us-east-2.amazonaws.com/devtron:588bf608-165-2589"
  Normal   Pulled     67s (x4 over 2m29s)   kubelet, ip-172-31-10-124.us-east-2.compute.internal   Container image "686244538589.dkr.ecr.us-east-2.amazonaws.com/devtron:588bf608-165-2589" already present on machine
  Normal   Created    66s (x5 over 2m30s)   kubelet, ip-172-31-10-124.us-east-2.compute.internal   Created container
  Normal   Started    66s (x5 over 2m30s)   kubelet, ip-172-31-10-124.us-east-2.compute.internal   Started container
  Warning  BackOff    53s (x9 over 2m28s)   kubelet, ip-172-31-10-124.us-east-2.compute.internal   Back-off restarting failed container

This information will help you get to the root of the error.

Why does CrashLoopBackOff usually occur?

This error can be caused by many different things, but a few causes show up most often:

1. Probe failure

The kubelet uses liveness, readiness, and startup probes to keep a check on the container. If the liveness or the startup probe fails, the kubelet kills and restarts the container at that point.

To solve this, first check whether the probes have been configured properly. Ensure that all the settings (endpoint, port, SSL config, timeout, command) are specified correctly. If there are no errors there, check the container logs. If you still cannot spot the problem, attach an ephemeral container and run curl or other relevant commands against the application to confirm it is behaving as expected.
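
As a reference point, a correctly filled-in probe covers exactly those settings. The sketch below goes under the container entry in the pod spec; the path, port, and thresholds are illustrative assumptions, not recommendations:

livenessProbe:
  httpGet:
    path: /healthz              # hypothetical health endpoint exposed by the app
    port: 8080                  # must match the port the container actually listens on
    scheme: HTTP                # set to HTTPS if the app serves TLS
  initialDelaySeconds: 10       # give the app time to start before probing
  periodSeconds: 10
  timeoutSeconds: 2
  failureThreshold: 3           # the container is restarted only after this many consecutive failures

For the ephemeral-container check, something like kubectl debug -it <pod name> -n <namespace> --image=curlimages/curl --target=<container name> (on clusters where ephemeral containers are enabled; the image is just an example) lets you curl the probe endpoint from inside the pod. When a probe keeps failing and the kubelet keeps killing the container, kubectl describe shows a container state similar to: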

Host Port:      0/TCP
State:          Waiting
  Reason:       CrashLoopBackOff
Last State:     Terminated
  Reason:       Error
  Exit Code:    137
  Started:      Tue, 26 Jan 2021 03:54:43 +0000
  Finished:     Tue, 26 Jan 2021 03:56:02 +0000
Ready:          False

2. Out of memory failure (OOM)

Every pod is given a bounded amount of memory, and when it tries to consume more than what has been allocated to it, it will keep crashing. This can happen if the pod is allocated less memory than it actually needs to run, or if there is a bug in the application (such as a memory leak) that keeps eating memory while it runs.

When you take a look at the events and the container status, you will notice an OOMKilled reason, which clearly indicates that the pod crashed because it used up all the RAM allocated to it.

To solve this error, you can increase the memory allocated to the pod; in the usual cases that does the trick. But if the pod is consuming excessive amounts of RAM, you will have to look into the application itself for the cause. If it is a Java application, check the heap configuration.
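
Memory requests and limits are set per container under resources in the pod spec. A minimal sketch, with purely illustrative sizes:

resources:
  requests:
    memory: "256Mi"             # amount the scheduler reserves for the container
  limits:
    memory: "512Mi"             # exceeding this gets the container OOM killed

With a limit in place, an OOM-killed container shows up in kubectl describe like this: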

State:          Waiting
  Reason:       CrashLoopBackOff
Last State:     Terminated
  Reason:       OOMKilled
  Exit Code:    137
  Started:      Tue, 26 Jan 2021 06:50:20 +0000
  Finished:     Tue, 26 Jan 2021 06:50:26 +0000
Ready:          False
Restart Count:  1

3. Application failure

At times, the application inside the container itself keeps crashing because of some error, and that makes the pod crash repeatedly. In this case, you will have to look at the application code and debug it. Start with the logs of the previous (crashed) container:

kubectl logs -n <namespace> <podName> -c <containerName> --previous

The last line of the logs usually helps in narrowing down the source of the error within the application.
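
The exit code in the container status is another useful clue: 137 means the process was killed (128 + SIGKILL, typical of OOM kills and probe-triggered restarts), while 127 means a command was not found. It can be read directly with jsonpath; the pod name below is a placeholder:

kubectl get pod <podName> -n <namespace> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'

In kubectl describe, a crashing application looks similar to: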

State:          Waiting
  Reason:       CrashLoopBackOff
Last State:     Terminated
  Reason:       Error
  Exit Code:    127
  Started:      Tue, 26 Jan 2021 06:58:47 +0000
  Finished:     Tue, 26 Jan 2021 06:58:47 +0000
Ready:          False
Restart Count:  4

If none of the three causes above applies to your pod in CrashLoopBackOff, increase the application's logging level and check the logs again, as described in step 3.
