Understanding Memory Leaks and How to Avoid Them in Backend Development

Personal Knowledge Sharing: What Go's garbage collector does not protect you from β€” and how I learned to find and fix memory leaks in production Go services


Introduction

Go has a garbage collector. That fact made me complacent early on. I assumed that if nothing was obviously wrong, memory was being managed for me. Then I started noticing a pattern in one of my Go microservices: the container would start at around 60MB, and after a few days of traffic it would be sitting at 400MB with no sign of ever coming back down. Restarting fixed it temporarily. The number climbed again.

The Go GC is excellent at cleaning up objects that are no longer referenced. What it cannot do is clean up memory that is still referenced β€” even if that reference is held by a goroutine nobody is waiting on, a cache nobody evicts, or a channel that nobody will ever read from again. From the GC's perspective, that memory is still in use.

This article covers the memory leak patterns I have actually hit in Go backend services, how to detect them with pprof and runtime, and the practices I put in place to avoid them going forward.




Why Go Still Gets Memory Leaks

The Go garbage collector uses a tricolor mark-and-sweep algorithm that runs concurrently with your application. It frees any memory that is unreachable from the root set (global variables, goroutine stacks, registers).

The key word is unreachable. If a reference to an object exists anywhere β€” in a slice, a map value, a channel buffer, a goroutine's stack frame β€” the GC will not touch it. Memory leaks in Go are almost always one of two things:

  1. References that are held longer than intended β€” a cache that grows without eviction, a slice that keeps a reference to a large backing array, a map that accumulates entries and never shrinks.

  2. Goroutines that are never terminated β€” a goroutine blocked on a receive from a channel that will never send is alive from the GC's perspective; everything on its stack and heap that it references is reachable.


Understanding this distinction shifts how you think about memory management in Go. The GC is not the primary safeguard β€” your structural choices are.


Pattern 1: Goroutine Leaks

This was the root cause of the gradual memory growth I described in the introduction. Goroutines are cheap to start, which makes it easy to fire them off without thinking carefully about how and when they stop.

Blocked channel receiver

Every HTTP request to this endpoint leaks one goroutine. Under modest traffic, a few thousand goroutines accumulate in memory. The fix is ensuring the channel is always closed and the goroutine is always given a way out:

Context not propagated

A goroutine that does network I/O without respecting context cancellation will block until the remote call either completes or times out at the transport layer β€” which may be much later (or never) than the point at which the caller gave up:

Every function that launches a goroutine or makes a blocking call should accept and respect context.Context. This is not just good practice β€” it is the primary mechanism for making goroutines terminable.

Worker with no shutdown signal

Background workers started in main.go or during server initialization need an explicit shutdown path. A common pattern is passing a ctx derived from a cancellable context and listening on the Done channel:

When the server receives SIGTERM, cancel() is called, ctx.Done() fires, and the worker goroutine exits. Without this, the goroutine outlives the signal handler and holds its allocations until the process exits β€” or worse, if the process never exits cleanly.


Pattern 2: Unclosed HTTP Response Bodies

This one is subtle enough to miss in code review. When using net/http to make outbound requests, the response body must be explicitly closed, even when you do not intend to read it. Failing to do so holds the underlying TCP connection open, preventing it from being returned to the connection pool:

Under sustained traffic, the connection pool exhausts. New outbound requests block waiting for a connection. Eventually, memory climbs as pending goroutines accumulate.

The correct pattern uses defer immediately after the error check:

The io.Copy(io.Discard, resp.Body) line is important for connection reuse. Closing the body without draining it forces the transport to close the underlying TCP connection rather than returning it to the pool, which means the pool still grows with new connections on each request.


Pattern 3: Slice Backing Array Retention

Go slices are a view over an underlying array. When you take a sub-slice, both the slice header and the original backing array remain allocated:
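
A small demonstration (the buffer size and token length are illustrative):

```go
package main

import "fmt"

// parseToken returns the first 8 bytes of a large buffer. The returned
// slice shares backing memory with the full buffer, so as long as the
// token is alive, all of buf stays reachable.
func parseToken(buf []byte) []byte {
	return buf[:8] // len 8, but cap == cap(buf): the whole array is retained
}

func main() {
	line := make([]byte, 1<<20) // a 1 MB log line
	token := parseToken(line)
	fmt.Println(len(token), cap(token)) // prints: 8 1048576
}
```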

This was a real problem in a log-ingestion pipeline. I was extracting short substrings from large log lines and storing them in a cache. Each cache entry appeared small but silently retained the full original allocation.

The fix is to copy the data you need into a new allocation:
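
For example (slices.Clone in Go 1.21+ does the same thing):

```go
package main

import "fmt"

// copyToken copies just the bytes we need into a fresh allocation,
// so no reference to the original buffer survives.
func copyToken(buf []byte) []byte {
	token := make([]byte, 8)
	copy(token, buf[:8])
	return token
}

func main() {
	token := copyToken(make([]byte, 1<<20))
	fmt.Println(len(token), cap(token)) // prints: 8 8
}
```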

The same applies to string slicing from a large string:
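
Substrings share the parent string's memory, so the same copy is needed; strings.Clone (Go 1.18+) makes it explicit:

```go
package main

import (
	"fmt"
	"strings"
)

// cloneSub keeps only the first n bytes of s; strings.Clone copies them,
// so the large original string can be collected.
func cloneSub(s string, n int) string {
	return strings.Clone(s[:n])
}

func main() {
	fmt.Println(cloneSub("hello world", 5)) // prints: hello
}
```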


Pattern 4: Unbounded Map Growth

Go maps do not shrink after deletion. Once a map allocates buckets for N entries, those buckets are reused but the underlying memory is not returned to the OS even if the map is emptied. If a map grows to millions of entries and then shrinks, the allocated memory stays:
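
You can see this directly (entry count and value size here are arbitrary):

```go
package main

import (
	"fmt"
	"runtime"
)

// fillAndEmpty grows a map to n entries and then deletes them all.
// len(m) goes back to 0, but the bucket memory stays allocated to the map.
func fillAndEmpty(n int) map[int][128]byte {
	m := make(map[int][128]byte)
	for i := 0; i < n; i++ {
		m[i] = [128]byte{}
	}
	for k := range m {
		delete(m, k)
	}
	return m
}

func main() {
	m := fillAndEmpty(1_000_000)
	runtime.GC()
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)
	// len is 0, but HeapInuse still reflects the empty map's buckets
	fmt.Println("len:", len(m), "heap MB:", ms.HeapInuse>>20)
}
```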

If the key space is unbounded β€” request IDs, user-generated strings, session tokens β€” the map will grow indefinitely. There are two practical approaches.

Approach 1: TTL-based eviction with a background cleaner

Approach 2: LRU with a size cap

For cases where you want bounded memory regardless of TTL, use an LRU cache (I typically reach for github.com/hashicorp/golang-lru/v2):
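
If you would rather avoid the dependency, the core of an LRU with a hard cap is small enough to sketch with the standard library's container/list:

```go
package main

import "container/list"

type lruEntry struct {
	key, value string
}

// LRU evicts the least-recently-used entry once it holds cap entries,
// so memory stays bounded regardless of key cardinality.
type LRU struct {
	cap   int
	ll    *list.List
	items map[string]*list.Element
}

func NewLRU(capacity int) *LRU {
	return &LRU{cap: capacity, ll: list.New(), items: make(map[string]*list.Element)}
}

func (l *LRU) Put(k, v string) {
	if el, ok := l.items[k]; ok {
		l.ll.MoveToFront(el)
		el.Value.(*lruEntry).value = v
		return
	}
	l.items[k] = l.ll.PushFront(&lruEntry{k, v})
	if l.ll.Len() > l.cap {
		oldest := l.ll.Back() // least recently used
		l.ll.Remove(oldest)
		delete(l.items, oldest.Value.(*lruEntry).key)
	}
}

func (l *LRU) Get(k string) (string, bool) {
	el, ok := l.items[k]
	if !ok {
		return "", false
	}
	l.ll.MoveToFront(el) // mark as recently used
	return el.Value.(*lruEntry).value, true
}

func main() {}
```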


Pattern 5: Timers and Tickers Not Stopped

time.NewTimer and time.NewTicker each allocate a runtime timer and a channel internally. Before Go 1.23, a ticker you never stopped was never garbage collected at all (and a timer not until it fired) — it kept holding memory until the process exited:

Every request that completes without stopping the ticker leaves an active runtime timer behind. The fix: always stop it with defer:

The same applies to time.AfterFunc. The timer returned must be stopped when no longer needed:


Pattern 6: defer Inside a Loop

defer lines up a call to run when the function returns, not when the loop iteration ends. Inside a tight loop, this accumulates deferred calls that all fire at once when the function exits β€” keeping every resource open for the full duration of the loop:

With thousands of files, this holds thousands of file handles open simultaneously. The fix is to extract the per-iteration work into a function (or a closure called immediately):


Pattern 7: String and []byte Conversion Under Load

In Go, converting between string and []byte copies the data (outside of a few patterns the compiler can optimise away). Under high throughput this generates significant allocator pressure — many short-lived allocations that the GC must track and collect:
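
A minimal illustration:

```go
package main

import "fmt"

// Each conversion copies the bytes: two fresh heap allocations per call
// once the payload is large enough to escape to the heap.
func roundTrip(payload []byte) []byte {
	s := string(payload) // copy #1: []byte -> string
	return []byte(s)     // copy #2: string -> []byte
}

func main() {
	fmt.Println(string(roundTrip([]byte("hello")))) // prints: hello
}
```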

Prefer strings.Builder or fmt.Sprintf for string construction in moderate-frequency paths, and investigate unsafe.String / unsafe.SliceData tricks only if profiling proves the conversion is a bottleneck. For cache keys specifically, concatenation with + is often fine because the compiler can optimise it:
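
```go
package main

import "fmt"

// A single-expression concatenation: the compiler sizes one allocation for
// the result rather than building it up incrementally.
func cacheKey(tenant, user string) string {
	return "sess:" + tenant + ":" + user
}

func main() {
	fmt.Println(cacheKey("acme", "u42")) // prints: sess:acme:u42
}
```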

Use sync.Pool to reuse allocations in hot paths that create many short-lived objects (e.g. bytes.Buffer instances used during JSON encoding):
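
A sketch of the pattern for JSON encoding (the copy-out at the end is what lets the buffer safely return to the pool):

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"sync"
)

var bufPool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

// encode reuses buffers across calls instead of allocating a fresh one
// per request.
func encode(v any) ([]byte, error) {
	buf := bufPool.Get().(*bytes.Buffer)
	defer func() {
		buf.Reset() // mandatory: otherwise old bytes leak into the next user
		bufPool.Put(buf)
	}()
	if err := json.NewEncoder(buf).Encode(v); err != nil {
		return nil, err
	}
	out := make([]byte, buf.Len()) // copy out before the buffer is recycled
	copy(out, buf.Bytes())
	return out, nil
}

func main() {
	b, _ := encode(map[string]int{"a": 1})
	fmt.Printf("%s", b)
}
```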


Detecting Memory Leaks with pprof

Go ships with net/http/pprof in the standard library. Registering the endpoint takes two lines:

Enabling the pprof endpoint

Heap profiling

To compare two snapshots and find what grew between them:
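
The commands I use, assuming pprof is served on localhost:6060 (the ten-minute window is arbitrary):

```shell
# Snapshot 1, then snapshot 2 after a window of traffic
curl -s localhost:6060/debug/pprof/heap > heap1.pb.gz
sleep 600
curl -s localhost:6060/debug/pprof/heap > heap2.pb.gz

# Diff: shows only the allocation sites that grew between the snapshots
go tool pprof -base heap1.pb.gz heap2.pb.gz
# inside the interactive prompt: top, list <func>, web
```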

Goroutine profiling

When you suspect a goroutine leak, the goroutine profile tells you exactly how many goroutines are running and what they are blocked on:
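
Again assuming the endpoint from above:

```shell
# Human-readable dump: one stack trace per unique goroutine state,
# with a count of how many goroutines share it
curl -s "localhost:6060/debug/pprof/goroutine?debug=1" | head -n 40

# Interactive analysis
go tool pprof localhost:6060/debug/pprof/goroutine
```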

A healthy long-running service should have a fairly stable goroutine count. If the count climbs steadily under traffic and does not drop when traffic stops, you have a goroutine leak.

runtime.ReadMemStats for in-process monitoring

For continuous monitoring (e.g. emitting metrics to Prometheus), poll runtime.ReadMemStats:
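
A sketch of the polling helper (the metric names are yours to choose):

```go
package main

import (
	"fmt"
	"runtime"
)

// memSnapshot gathers the two numbers most worth graphing for leak
// detection. ReadMemStats briefly stops the world, so poll it at a
// relaxed interval (e.g. every 15s), not per request.
func memSnapshot() (heapInuseBytes uint64, numGoroutine int) {
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)
	return ms.HeapInuse, runtime.NumGoroutine()
}

func main() {
	heap, n := memSnapshot()
	fmt.Printf("heap_inuse_bytes=%d goroutines=%d\n", heap, n)
}
```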

Plotting HeapInuse and NumGoroutine over time reveals leaks clearly: a leak shows up as a monotonically increasing line that does not return to baseline after a traffic reduction.


Building a Leak-Aware Service

The patterns above suggest a set of structural practices I now apply from the start of any Go backend service:

Always pass and respect context

Expose goroutine count as a metric

Alerting on abnormal goroutine growth catches leaks before they become incidents:

Use goleak in tests

go.uber.org/goleak fails a test if goroutines started by the code under test are still running when the test finishes:
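
A sketch of both usage styles (the Worker type here is hypothetical; goleak.VerifyTestMain and goleak.VerifyNone are the library's real entry points):

```go
package worker_test

import (
	"testing"

	"go.uber.org/goleak"
)

// Package-wide: every test in the package fails if it leaves goroutines behind.
func TestMain(m *testing.M) {
	goleak.VerifyTestMain(m)
}

// Per-test variant:
func TestWorkerStartStop(t *testing.T) {
	defer goleak.VerifyNone(t)

	w := StartWorker() // hypothetical worker under test
	w.Stop()           // if Stop does not reap the goroutine, goleak fails here
}
```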

goleak is the cheapest way to catch goroutine leaks β€” it runs at test time, not in production.

Cap all caches at construction

Structure worker lifecycle explicitly

Waiting on <-w.done ensures the goroutine has released all its resources before Stop() returns β€” important during graceful shutdown when the process is about to exit anyway.


What I Learned

Working through memory growth in Go backend services changed how I think about the runtime:

  1. The GC is not a safety net for leaks β€” it is a collector of unreachable memory. Your job is to make sure that memory you no longer need becomes unreachable.

  2. Every goroutine needs an exit condition at the time you start it. If you cannot describe the exact circumstances that will cause a goroutine to return, it will likely leak.

  3. Context is the shutdown mechanism. Threading context.Context through every I/O call and goroutine is not just API hygiene β€” it is what makes goroutines terminable.

  4. Unbounded growth is the most common real-world leak. Before shipping any cache or accumulator, ask: "What is the maximum size of this thing? What removes entries from it?"

  5. pprof heap diffing is the fastest way to diagnose a live leak. Two heap snapshots before and after a traffic window show exactly which allocation sites grew.

  6. goleak in tests catches goroutine leaks at development time β€” far cheaper than finding them in production at 3 AM with a container restart.

  7. The goroutine count metric is the canary. A healthy service has a stable goroutine count at steady-state traffic. A rising count is always worth investigating immediately.
