Prometheus Alertmanager for Kubernetes
In dynamic Kubernetes environments, monitoring and alerting are crucial to maintaining reliable clusters. Kubernetes administrators, DevOps engineers, and SREs rely on metrics and alerts as their window into what is happening inside those clusters. Open-source tools like Prometheus excel at collecting metrics from Kubernetes clusters, and Prometheus Alertmanager turns the alerts derived from those metrics into actionable notifications. Configuring Alertmanager is therefore a key step toward building a robust incident response process and ensuring critical issues never go unnoticed. In this blog, we'll focus on setting up alerts for two critical scenarios: production application restarts and resource throttling.
Why Alerts in Kubernetes?
Kubernetes clusters and their workloads are essentially black boxes running our applications in highly dynamic and complex environments. While tools like Prometheus effectively monitor these environments and provide actionable insights through Grafana dashboards, the challenge lies in continuous observation. It's impractical and inefficient for teams to constantly watch these dashboards for potential issues. This is where alerts and notifications become crucial. Without an automated alerting system, critical issues could go unnoticed until they cause significant disruption, as teams can't maintain 24/7 manual monitoring of metrics dashboards. Alerts transform passive monitoring into active incident response, enabling teams to address issues before they impact service reliability. Some of the major scenarios that call for a proper alerting system in Kubernetes include:
Production Application Restarts
Unexpected restarts can indicate issues with your application, such as crashes, resource constraints, or configuration problems. Monitoring these restarts helps you maintain high availability and quickly respond to potential issues.
Throttles
Throttling occurs when your application hits resource limits, potentially affecting performance and user experience. Alerts for throttles can help you identify and address resource bottlenecks before they become critical.
What is Prometheus Alertmanager?
Prometheus Alertmanager is part of the open-source Prometheus monitoring project. It works alongside the Prometheus server to handle and route the alerts that Prometheus generates; you can think of it as the notification layer for your Kubernetes environment. Prometheus Alertmanager takes care of:
Alert Handling
- Receives alerts from one or multiple Prometheus servers
- Deduplicates alerts to avoid notification spam
- Groups related alerts together (like multiple server failures in the same region)
- Silences alerts during planned maintenance periods
Notification Management
- Routes alerts to the right teams based on predefined rules
- Supports multiple notification channels like:
  - Slack
  - PagerDuty
  - OpsGenie
  - Discord
  - Custom Webhooks
How Does Prometheus Alertmanager Work?
Key components of Prometheus Alertmanager:
- Prometheus Server: The main Prometheus server, which collects key metrics from your Kubernetes infrastructure and evaluates alerting rules against them (see the wiring sketch after this list).
- Alertmanager: A component of the Prometheus ecosystem that handles the resulting alerts and notifications.
- Grouping: A mechanism that combines multiple related alerts (like multiple pod failures in the same namespace) into a single notification to reduce alert noise.
- Deduplication: A process that automatically identifies and removes duplicate alerts from multiple Prometheus servers to prevent notification spam and ensure each unique issue is reported only once.
- Silencing: A feature that temporarily suppresses specific alerts during planned maintenance windows.
- Route to Receiver: A configuration system that directs specific alerts to appropriate notification channels (receivers).
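To make the flow concrete: the Prometheus server evaluates alerting rules and pushes any firing alerts to Alertmanager, so it needs to know where Alertmanager is reachable. Below is a minimal sketch of that wiring for a standalone Prometheus server; if you deploy with the kube-prometheus-stack Helm chart, the operator typically handles this for you, and the service address shown here is only an assumption for illustration.

# prometheus.yml (standalone setup; the Alertmanager address is an assumed example)
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager.monitoring.svc:9093']
rule_files:
  - /etc/prometheus/rules/*.yaml   # alerting rules such as the ones defined below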
How to Configure Prometheus Alertmanager in Kubernetes
Let’s dive into how we can set up alerts for various incidents in our Kubernetes infrastructure.
High CPU Usage
An alert condition that triggers when a node's CPU utilization stays above a defined threshold, indicating that the workloads scheduled on it are demanding more processing power than the node can comfortably provide. Sustained high CPU usage directly impacts application performance and responsiveness; container-level CPU throttling is covered in a later section.
Configuring alerts for high CPU usage ensures you catch overloaded nodes before performance degrades for every workload running on them.
- alert: NodeCPUUsage
  expr: (100 * instance:node_cpu_utilisation:rate5m) > 85
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "{{ $labels.node }}: High CPU usage detected"
    description: '`{{ $labels.node }}`:- CPU usage is above 85% (current value is: {{ $value | printf "%.2f" }}%)'
    resolved_desc: '`{{ $labels.node }}`:- *RESOLVED* : CPU usage is within thresholds (current value is: {{ $value | printf "%.2f" }}%)'
High Memory Usage
These alerts are triggered when a node's memory utilization approaches capacity, typically around 80-90% utilization.
High memory usage alerts are vital because they anticipate Out of Memory (OOM) situations that can cause application instability or service disruptions.
- alert: NodeMemoryUsage
  expr: (100 * instance:node_memory_utilisation:ratio) > 85
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "{{ $labels.node }}: High memory usage detected"
    description: '`{{ $labels.node }}`: Memory usage is above 85% (current value is: {{ $value | printf "%.2f" }}%)'
    resolved_desc: '`{{ $labels.node }}`:- *RESOLVED* : Memory usage is within thresholds (current value is: {{ $value | printf "%.2f" }}%)'
Node Disk Usage
Monitors and alerts when disk space consumption reaches critical thresholds across Kubernetes nodes and container storage.
Node disk usage alerts are critical because full disks can paralyze node operations, prevent new container creation, and stop log collection. Monitoring disk usage helps maintain node health by allowing teams to implement cleanup procedures, adjust retention policies, or add capacity before reaching critical levels that could force unplanned downtimes or service degradation.
- alert: NodeDiskUsage
  expr: 100 - (100 * (node_filesystem_avail_bytes / node_filesystem_size_bytes)) > 85
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "{{ $labels.node }}: High Disk Usage detected"
    description: '`{{ $labels.node }}`:- Disk usage is above 85% on {{ $labels.device }} (current value is: {{ $value | printf "%.2f" }}%)'
    resolved_desc: '`{{ $labels.node }}`:- *RESOLVED* : Disk usage is within thresholds on {{ $labels.device }} (current value is: {{ $value | printf "%.2f" }}%)'
Container High Memory Usage
Monitoring high memory usage in containers is essential to ensure optimal performance across your cluster. Setting up alerts that notify you when memory consumption consistently exceeds defined thresholds over a specified duration can help you identify and address potential issues promptly. Excessive memory usage by a single container can lead to degraded performance for the entire cluster, resulting in application slowdowns or crashes.
- alert: ContainerMemoryUsageHigh
  expr: (sum(container_memory_working_set_bytes{namespace=~".*"}) by (instance, name, pod, namespace) / sum(container_spec_memory_limit_bytes{namespace=~".*"} > 0) by (instance, name, pod, namespace) * 100) > 80
  for: 5m
  labels:
    severity: critical
  annotations:
    description: '{{ $value | printf "%.2f" }}% High memory usage in namespace {{ $labels.namespace }} for pod {{ $labels.pod }}.'
    resolved_desc: '*RESOLVED* : Memory usage is normal now (current value is: {{ $value | printf "%.2f" }})'
    summary: Container Memory Usage is too high.
Pod Restarts and OOMKilled
Monitors pod lifecycle events focusing on abnormal terminations, tracking both frequent restarts and Out-of-memory (OOM) kills in your Kubernetes cluster.
These alerts are crucial operational indicators as they signal underlying problems that can severely impact service reliability. Frequent pod restarts often reveal application stability issues, improper resource limits, or configuration problems that could affect user experience. OOM kills are particularly critical as they represent hard failures where pods are forcibly terminated, potentially losing in-flight transactions or causing service outages.
- alert: Too many Container restarts
  annotations:
    summary: Container named {{ $labels.container }} in {{ $labels.pod }} in {{ $labels.namespace }} was restarted
    description: "Namespace: {{ $labels.namespace }}\nPod name: {{ $labels.pod }}\nContainer name: {{ $labels.container }}\n"
  expr: sum(increase(kube_pod_container_status_restarts_total{pod_template_hash=""}[15m])) by (pod, namespace, container) > 5
  for: 0m
  labels:
    severity: critical
    app: "{{ $labels.pod }}"
---
- alert: PodOOMKilled
  annotations:
    summary: Pod named {{ $labels.pod }} in {{ $labels.namespace }} was OOMKilled
    description: "Namespace: {{ $labels.namespace }}\nPod name: {{ $labels.pod }}\nContainer name: {{ $labels.container }}\n"
  expr: (kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} > 0) and on (namespace, pod, container) (increase(kube_pod_container_status_restarts_total[24h]) > 0)
  for: 0m
  labels:
    severity: critical
    app: "{{ $labels.pod }}"
CPU Throttle High Override
An alert that fires when a container spends a large share of its CPU periods throttled, meaning the kernel's CFS scheduler is repeatedly capping it at its configured CPU limit. The "-override" suffix indicates that these rules adjust the thresholds of the stock CPUThrottlingHigh rule shipped with common Prometheus Kubernetes setups: the first rule fires when more than 50% of a container's CPU periods are throttled for 20 minutes, and the second escalates within 2 minutes when throttling is near-constant (above 95%).
CPU throttling alerts are essential in production environments because heavily throttled containers respond slowly even though they may still report as healthy.
- alert: CPUThrottlingHigh-override
  annotations:
    summary: CPUThrottlingHigh for container {{ $labels.container }} in pod {{ $labels.pod }}.
    message: "{{ $value | humanizePercentage }} throttling of CPU in namespace {{ $labels.namespace }} for container {{ $labels.container }} in pod {{ $labels.pod }}."
  expr: >-
    sum(increase(container_cpu_cfs_throttled_periods_total{container!=""}[5m])) by (container, pod, namespace)
    /
    sum(increase(container_cpu_cfs_periods_total{}[5m])) by (container, pod, namespace)
    > ( 50 / 100 )
  for: 20m
  labels:
    severity: critical
---
- alert: 100%-CPUThrottling-Critical
  annotations:
    summary: 100% Throttling of CPU for container {{ $labels.container }} in pod {{ $labels.pod }}.
    message: "{{ $value | humanizePercentage }} throttling of CPU in namespace {{ $labels.namespace }} for container {{ $labels.container }} in pod {{ $labels.pod }}."
  expr: >-
    sum(increase(container_cpu_cfs_throttled_periods_total{container!=""}[5m])) by (container, pod, namespace)
    /
    sum(increase(container_cpu_cfs_periods_total{}[5m])) by (container, pod, namespace)
    > ( 95 / 100 )
  for: 2m
  labels:
    severity: critical
Kube Pod is Not Ready
Monitors the readiness state of pods in your Kubernetes cluster, triggering when pods remain in a Pending or Unknown state beyond their expected initialization time.
Pod readiness alerts are fundamental to service reliability as they directly impact application availability and user experience. When pods aren't ready, Kubernetes removes them from service load balancing, potentially reducing system capacity and increasing the load on the remaining pods.
- alert: KubePodNotReady-override
  expr: sum by (namespace, pod) (max by (namespace, pod) (kube_pod_status_phase{job="kube-state-metrics", namespace=~".*", phase=~"Pending|Unknown"}) * on (namespace, pod) group_left(owner_kind) topk by (namespace, pod) (1, max by (namespace, pod, owner_kind) (kube_pod_owner{owner_kind!="Job"}))) > 0
  for: 10m
  labels:
    severity: critical
    team: infra
  annotations:
    description: Pod {{ $labels.namespace }}/{{ $labels.pod }} has been in a non-ready state for longer than 10 minutes.
    summary: Pod has been in a non-ready state for more than 10 minutes.
PVC Warning and Critical Alert
Monitors the health and status of Persistent Volume Claims (PVCs) across your Kubernetes cluster, detecting multiple critical scenarios including storage capacity issues, binding failures, and mounting problems.
PVC alerts are vital for data-dependent applications since storage issues can cause severe service disruptions. When storage problems occur, they often lead to application crashes, data corruption, or complete service outages as pods can't access their required data. These alerts help prevent data loss scenarios by warning teams before volumes fill up.
- alert: KubePersistentVolumeFillingUp-warning
  expr: (kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes) > 0.85
  for: 10m
  labels:
    severity: warning
  annotations:
    description: The PersistentVolume claimed by {{ $labels.persistentvolumeclaim }} in Namespace {{ $labels.namespace }} is filled up more than 85%. Currently {{ $value | humanizePercentage }}.
    summary: PersistentVolume {{ $labels.persistentvolumeclaim }} is filling up.
---
- alert: KubePersistentVolumeFillingUp-critical
  expr: (kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes) > 0.95
  for: 1m
  labels:
    severity: critical
    team: infra
  annotations:
    description: The PersistentVolume claimed by {{ $labels.persistentvolumeclaim }} in Namespace {{ $labels.namespace }} is filled up more than 95%. Currently {{ $value | humanizePercentage }}.
    summary: PersistentVolume {{ $labels.persistentvolumeclaim }} is filling up.
Container is Not Ready
Monitors container startup health by tracking containers stuck in a waiting state (for example ImagePullBackOff or CrashLoopBackOff), triggering when a container has not started successfully within the configured time period.
Container waiting alerts are crucial for maintaining service reliability since they detect startup problems before they impact user traffic.
- alert: KubeContainerWaiting-override
  annotations:
    summary: Container {{ $labels.container }} is waiting for more than an hour.
    message: Pod {{ $labels.namespace }}/{{ $labels.pod }} container {{ $labels.container }} has been in waiting state for longer than 1 hour.
  expr: sum by (namespace, pod, container) (kube_pod_container_status_waiting_reason{job="kube-state-metrics", namespace=~".*"}) > 0
  for: 1h
  labels:
    severity: critical
HPA Replica Mismatch
Monitors the discrepancy between desired and actual pod counts managed by the Horizontal Pod Autoscaler, triggering when the cluster fails to maintain the targeted pod scaling state.
HPA mismatch alerts are critical for maintaining application performance and resource efficiency. When the actual pod count doesn't match HPA targets, it often signals deeper problems like insufficient cluster resources, reaching quota limits, or issues with metric collection that prevent proper autoscaling.
- alert: KubeHpaReplicasMismatch-override
  annotations:
    summary: HPA {{ $labels.hpa }} has not matched the desired number of replicas for more than 15 minutes.
    message: HPA {{ $labels.namespace }}/{{ $labels.hpa }} has not matched the desired number of replicas for longer than 15 minutes.
  expr: >-
    (kube_hpa_status_desired_replicas{job="kube-state-metrics", namespace=~".*"}
    !=
    kube_hpa_status_current_replicas{job="kube-state-metrics", namespace=~".*"})
    and
    changes(kube_hpa_status_current_replicas[15m]) == 0
  for: 15m
  labels:
    severity: critical
Job Failed Alert
Monitors the completion status of Kubernetes jobs, triggering when jobs fail to reach successful completion within their defined parameters or experience repeated failures. The alert activates when jobs terminate with non-zero exit codes, exceed their backoff limits, or fail to complete within their expected timeframe.
Job failure alerts are essential for maintaining reliable batch processing and scheduled tasks in your cluster. When jobs fail, it often indicates data processing errors, resource constraints, or configuration issues that could impact critical business operations like data backups, report generation, or system maintenance tasks.
- alert: KubeJobFailed-override
  annotations:
    summary: Job {{ $labels.job_name }} has failed to complete successfully within activeDeadlineSeconds.
    message: Job {{ $labels.namespace }}/{{ $labels.job_name }} failed to complete.
  expr: increase(kube_job_status_failed{job="kube-state-metrics", namespace=~".*"}[1h]) > 0
  for: 0m
  labels:
    severity: critical
Routing Alerts
Now that we have configured alerts for our Kubernetes environment in the sections above, let's set up receivers. For that, we need to configure routing rules so that alerts are sent to the appropriate receivers. The snippet below routes alerts labelled team: devtron to a dedicated Discord receiver; a fuller route tree with grouping options follows it.
Official doc: https://prometheus.io/docs/alerting/latest/configuration/#route-related-settings
alertmanager:
  config:
    route:
      routes:
        - match:
            team: devtron
          receiver: dev-team-discord
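In practice, the top level of the route tree also defines a default receiver and the grouping behaviour described earlier. Here is a minimal sketch in the same Helm values layout; the receiver names are assumptions and must match receivers you actually define (dev-team-discord from the route above, email-notifications from the email example below), and the timing values are illustrative, not recommendations.

alertmanager:
  config:
    route:
      receiver: email-notifications          # default receiver when no sub-route matches
      group_by: ['alertname', 'namespace']   # alerts sharing these labels are batched into one notification
      group_wait: 30s                        # wait before sending the first notification for a new group
      group_interval: 5m                     # wait before notifying about new alerts added to an existing group
      repeat_interval: 4h                    # re-send notifications for alerts that are still firing
      routes:
        - match:
            team: devtron
          receiver: dev-team-discord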
Configuring Receivers
Alertmanager receivers act as the crucial bridge between detecting issues and notifying the right teams. These receivers define the endpoints where your alerts will be delivered, allowing you to transform raw alert data into actionable notifications. Whether you need to wake up an on-call engineer through PagerDuty, notify your team via Slack, or send detailed reports via email, receivers provide the flexibility to handle various notification scenarios.
Official doc: https://prometheus.io/docs/alerting/latest/configuration/#general-receiver-related-settings
Slack Receiver / Discord Receiver
alertmanager:
  config:
    receivers:
      - name: 'null'
      - name: dev-team-watchdog
        discord_configs:
          - webhook_url: https://discord.com/api/webhooks/<webhook>
            send_resolved: true
            title: '[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .CommonLabels.alertname }}'
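Slack follows the same pattern through slack_configs. A minimal sketch, assuming a hypothetical incoming-webhook URL and channel name:

alertmanager:
  config:
    receivers:
      - name: dev-team-slack
        slack_configs:
          - api_url: https://hooks.slack.com/services/<webhook>   # hypothetical incoming webhook
            channel: '#k8s-alerts'                                # hypothetical channel name
            send_resolved: true
            title: '[{{ .Status | toUpper }}] {{ .CommonLabels.alertname }}'
            text: '{{ range .Alerts }}{{ .Annotations.description }}{{ "\n" }}{{ end }}'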
Email Receiver
receivers:
  - name: 'email-notifications'
    email_configs:
      - to: 'oncall@yourcompany.com'
        from: 'alertmanager@yourcompany.com'
        smarthost: 'smtp.yourcompany.com:587'
        auth_username: 'alertmanager@yourcompany.com'
        auth_identity: 'alertmanager@yourcompany.com'
        auth_password: 'your-smtp-password'
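PagerDuty, mentioned at the start of this section, is wired up the same way through pagerduty_configs. A minimal sketch, assuming a hypothetical Events API v2 routing key:

receivers:
  - name: 'pagerduty-oncall'
    pagerduty_configs:
      - routing_key: '<your-pagerduty-routing-key>'   # hypothetical Events API v2 integration key
        send_resolved: true
        severity: 'critical'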
Testing Your Alerts
Once you've set up your alert rules and receivers, it's important to test them to ensure they're working as expected. Here are a few ways to test your alerts:
- Simulate alert conditions: Temporarily modify your alert rules so they trigger based on current conditions (see the sketch after this list).
- Use the Alertmanager API: Send test alerts directly to Alertmanager's /api/v2/alerts endpoint.
- Alertmanager's web UI: Use the Alertmanager web interface to view and manage active alerts.
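For the first option, a low-risk approach is to add a throwaway rule whose expression always evaluates to true and confirm the notification reaches the expected receiver. A minimal sketch with a hypothetical alert name; remember to remove it once you are done:

- alert: AlertmanagerPipelineTest      # hypothetical name, delete after testing
  expr: vector(1)                      # always returns a sample, so the alert fires immediately
  for: 0m
  labels:
    severity: warning
    team: devtron                      # matches the routing example above
  annotations:
    summary: Test alert to verify Alertmanager routing and receivers.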
Conclusion
Setting up basic alerts for your Kubernetes monitoring stack is a key step in maintaining the reliability and performance of your applications. By configuring alerts for issues like application restarts and CPU throttling, you can proactively identify and address potential problems before they affect your users. Keep in mind that alerting is an evolving process. As you gain deeper insights into your application's behavior and performance patterns, you’ll want to fine-tune your alert rules and add new ones to address other critical scenarios.