Kubernetes, through cgroups, provides valuable mechanisms to keep nodes healthy when individual pods consume excessive CPU or memory. However, memory limits pose a unique challenge. Unlike CPU, which can be throttled over time, memory usage cannot be gradually restricted: when a container reaches its memory limit, the kernel's OOM killer sends it a SIGKILL and the container restarts.
In my current role, we addressed this for all our Go applications using the net/http/pprof package. By exposing a listening port with pprof enabled, we can retrieve a heap dump from any running process via its /debug/pprof/heap endpoint.
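Enabling this is a one-line change in most services. Here is a minimal sketch (the port number 6060 is arbitrary): the blank import of net/http/pprof registers the /debug/pprof/* handlers on the default mux.

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
	// Expose pprof on its own port so a heap dump can be pulled at any time,
	// for example: curl http://<pod-ip>:6060/debug/pprof/heap
	log.Fatal(http.ListenAndServe(":6060", nil))
}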
We also developed a separate application that continuously monitors memory usage. This application queries the Kubernetes metrics server to check if any container is using more than 80% of its configured memory limit. If this threshold is exceeded, it connects to the pod’s IP on a pre-configured port, retrieves the heap memory data, and stores it in S3 for further analysis using the go tool pprof.
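The core action is simple enough to sketch. Assuming the usage and limit values have already been read from the Kubernetes metrics API, the monitor fetches the heap from the pod and streams it to S3. The function below is illustrative only (names, pod IP, and bucket are placeholders), not our actual implementation.

package main

import (
	"fmt"
	"net/http"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3/s3manager"
)

// dumpHeapToS3 pulls /debug/pprof/heap from a pod and streams the result
// into an S3 object for later analysis with go tool pprof.
func dumpHeapToS3(podIP string, port int, bucket, key string) error {
	url := fmt.Sprintf("http://%s:%d/debug/pprof/heap", podIP, port)
	client := &http.Client{Timeout: 30 * time.Second}
	resp, err := client.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	uploader := s3manager.NewUploader(session.Must(session.NewSession()))
	_, err = uploader.Upload(&s3manager.UploadInput{
		Bucket: aws.String(bucket),
		Key:    aws.String(key),
		Body:   resp.Body, // stream the heap dump straight into the bucket
	})
	return err
}

func main() {
	// Illustrative trigger: in the real monitor, usage and limit come from the metrics server.
	usage, limit := int64(200*1024*1024), int64(240*1024*1024)
	if float64(usage)/float64(limit) > 0.8 {
		if err := dumpHeapToS3("10.0.0.12", 8080, "bucket-name", "example.heap"); err != nil {
			fmt.Println("heap dump failed:", err)
		}
	}
}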
This approach works well for applications with gradual memory leaks, where memory accumulation occurs slowly over time. In such cases, the monitoring service can detect high memory usage and generate a heap dump before the application receives a SIGKILL due to an OOM event.
However, we often encounter cases where memory leaks are more aggressive, with containers jumping from 50 MB to over 1 GB in just a few seconds.
To handle these cases, we implemented a different approach using oomhero, a library developed by Ricardo Maraschini. It lets a sidecar container monitor the main container's process metrics in real time, querying statistics every second instead of the 30-second intervals we had with the metrics server.
We built on this with our custom solution at https://github.com/csepulveda/oom-heap-dumper, which extends the oomhero library by adding functionality to check the app's listening ports, download the heap from each port, and store it in S3.
Compared to the previous approach, this method introduces some security considerations and may not be suitable for every pod in the system. For it to work, we need to enable shareProcessNamespace on the pod and run the oom-heap-dumper container with securityContext.privileged: true.
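To illustrate why the shared process namespace matters: with it, the sidecar can see the app's processes and read their memory usage directly from /proc. The sketch below is purely conceptual and is not oomhero's actual implementation; the target PID is a placeholder.

package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
	"time"
)

// rssKB returns the resident set size (VmRSS) of a process in kilobytes,
// read from /proc/<pid>/status.
func rssKB(pid int) (int, error) {
	data, err := os.ReadFile(fmt.Sprintf("/proc/%d/status", pid))
	if err != nil {
		return 0, err
	}
	for _, line := range strings.Split(string(data), "\n") {
		if strings.HasPrefix(line, "VmRSS:") {
			fields := strings.Fields(line)
			if len(fields) >= 2 {
				return strconv.Atoi(fields[1]) // VmRSS is reported in kB
			}
		}
	}
	return 0, fmt.Errorf("VmRSS not found for pid %d", pid)
}

func main() {
	targetPID := 7 // placeholder; a real sidecar discovers the app's PID by scanning /proc
	for {
		if kb, err := rssKB(targetPID); err == nil {
			fmt.Printf("pid %d RSS: %d kB\n", targetPID, kb)
		}
		time.Sleep(100 * time.Millisecond)
	}
}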
This updated approach allows us to capture a heap memory dump and send it to S3 for later analysis, even when memory spikes occur rapidly.
Practical Example
Consider the following code, which forces a memory leak by doubling memory usage with each iteration.
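The snippet below is only an approximation of that code (the actual source backs the ghcr.io/csepulveda/go-memory-leak image). It keeps a reference to every serialized payload, so the amount of live memory roughly doubles on each pass through the loop.

package main

import (
	"encoding/json"
	"log"
	"time"
)

type payload struct {
	Data []byte `json:"data"`
}

// leaked holds a reference to every serialized buffer, so the garbage
// collector can never reclaim them.
var leaked [][]byte

// serializeJSON allocates a buffer, serializes it, and leaks the result.
func serializeJSON(size int) {
	buf := make([]byte, size)
	out, err := json.Marshal(payload{Data: buf})
	if err != nil {
		log.Fatal(err)
	}
	leaked = append(leaked, out)
}

func main() {
	size := 1 << 20 // start around 1 MiB
	for {
		serializeJSON(size)
		log.Printf("last allocation: %d bytes, buffers held: %d", size, len(leaked))
		size *= 2 // double the allocation each iteration
		time.Sleep(time.Second)
	}
}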
When running this application in a container on our Kubernetes cluster, we observe the following results:
Deployment Configuration:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: go-memory-leak
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: go-memory-leak
  template:
    metadata:
      labels:
        app: go-memory-leak
    spec:
      containers:
        - name: app
          image: ghcr.io/csepulveda/go-memory-leak:v0.0.3
          resources:
            requests:
              memory: 64Mi
              cpu: 250m
            limits:
              memory: 240Mi
              cpu: 250m
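Assuming the manifest is saved as deployment.yaml, it can be applied and the container's memory usage observed with standard kubectl commands:

kubectl apply -f deployment.yaml
kubectl top pod -l app=go-memory-leak -n default --containers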
Memory Stats:
App Logs and OOM events:
As shown, monitoring the metrics server directly and attempting to take a memory dump when usage exceeds 80% is challenging: with each cycle doubling memory usage, we jump from 50% to nearly 100% of the limit in a single step. Changes this rapid are very hard to catch and react to through the metrics server's 30-second polling interval.
To address this, our first step is to add pprof middleware to our application.
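If the application already uses its own http.ServeMux rather than the default mux, the handlers from net/http/pprof can be mounted explicitly. This is only a sketch of the idea; the actual change made to the example app is in the comparison linked below.

package main

import (
	"log"
	"net/http"
	"net/http/pprof"
)

// registerPprof mounts the pprof handlers on an existing mux. The
// /debug/pprof/heap endpoint is served through the Index handler, which
// dispatches named profiles under /debug/pprof/.
func registerPprof(mux *http.ServeMux) {
	mux.HandleFunc("/debug/pprof/", pprof.Index)
	mux.HandleFunc("/debug/pprof/cmdline", pprof.Cmdline)
	mux.HandleFunc("/debug/pprof/profile", pprof.Profile)
	mux.HandleFunc("/debug/pprof/symbol", pprof.Symbol)
	mux.HandleFunc("/debug/pprof/trace", pprof.Trace)
}

func main() {
	mux := http.NewServeMux()
	registerPprof(mux)
	// ... application routes would be registered here ...
	log.Fatal(http.ListenAndServe(":8080", mux))
}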
Using the Sidecar to Auto-Generate the Heap File
Code Comparison:
https://github.com/csepulveda/go-memory-leak/compare/v0.0.3...v0.0.4
This enables us to retrieve the application’s memory heap, allowing us to analyze it locally. For example, if the app is running on port 8080, we can launch the pprof web interface on port 8081 to analyze the heap file:
go tool pprof -http=:8081 localhost:8080/debug/pprof/heap
Through this analysis, we can see the memory leak originates within the main function, specifically when calling serializeJSON.
For this example, replicating the memory leak behavior is straightforward. However, in cases where memory leaks occur sporadically, capturing a memory dump at the exact moment of high memory usage is essential. This is where our sidecar solution becomes invaluable.
Deployment Configuration to Add the Sidecar:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: go-memory-leak
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: go-memory-leak
  template:
    metadata:
      labels:
        app: go-memory-leak
    spec:
      shareProcessNamespace: true
      containers:
        - name: app
          image: ghcr.io/csepulveda/go-memory-leak:v0.0.4
          resources:
            requests:
              memory: 64Mi
              cpu: 250m
            limits:
              memory: 130Mi
              cpu: 250m
        - name: oom-heap-dump
          image: ghcr.io/csepulveda/oom-heap-dumper:v0.0.5
          securityContext:
            privileged: true
          env:
            - name: CRITICAL
              value: "70"
            - name: COOLDOWN
              value: "10s"
            - name: WATCH_TIME
              value: "100ms"
            - name: BUCKET
              value: "bucket-name"
            - name: AWS_REGION
              value: "us-west-1"
            - name: AWS_ACCESS_KEY_ID
              value: "xxx"
            - name: AWS_SECRET_ACCESS_KEY
              value: "yyy"
            - name: AWS_SESSION_TOKEN
              value: "zzz"
In this YAML, the keys and bucket name are masked. You can pass AWS credentials directly as environment variables, as shown, or instead use a service account that grants write permission to the bucket where the heap dumps will be stored. For more on setting up those permissions, check out my post on configuring IAM roles with service accounts: how to define iam roles on eks
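If you go the service-account route on EKS, the annotation looks roughly like this (the role ARN, account ID, and names are placeholders; the linked post covers the full setup), and the Deployment's pod spec would then reference it via serviceAccountName:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: oom-heap-dumper
  namespace: default
  annotations:
    # Placeholder role ARN; the IAM role must allow s3:PutObject on the target bucket.
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/oom-heap-dumper-s3-writer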
Once the app is running, you'll see logs from the oom-heap-dump sidecar as soon as the app's memory usage crosses the 70% threshold configured through the CRITICAL environment variable.
Retrieving and Visualizing the Heap
Download the heap from S3 and view it locally:
aws s3 cp s3://xxxx/go-memory-leak-54d6d47d7-6hl64-7-8080-2024-11-11-03-49.heap .
download: s3://xxxx/go-memory-leak-54d6d47d7-6hl64-7-8080-2024-11-11-03-49.heap to ./go-memory-leak-54d6d47d7-6hl64-7-8080-2024-11-11-03-49.heap
go tool pprof -http=:8081 go-memory-leak-54d6d47d7-6hl64-7-8080-2024-11-11-03-49.heap
Review the memory heap from the web interface provided by go tool pprof.
Conclusion
Using go tool pprof, we can manually capture heap dumps from a running application, and setting up the pprof endpoint for on-demand dumps is relatively straightforward. However, capturing a heap at the exact moment of high memory usage is the hard part. For those cases, tooling that automatically generates and stores heap dumps is invaluable, making the csepulveda/oom-heap-dumper sidecar an effective option for capturing memory dumps before an OOM kill.