Proposal: Goroutine leak detection via garbage collection

Author(s): Georgian-Vlad Saioc (vsaioc@uber.com), Milind Chabbi (milind@uber.com)

Last updated: 14 Aug 2025

Abstract

This proposal outlines a dynamic technique for detecting goroutine leaks within Go programs. It leverages the existing marking phase of the Go garbage collector (GC) to find goroutines blocked over concurrency primitives that are not reachable in memory from goroutines that may still be runnable.

Background

Due to its concurrency features (lightweight goroutines, message passing), Go is particularly susceptible to concurrency bugs known as goroutine leaks (also known as partial deadlocks in literature 1). Unlike global deadlocks (wherein all goroutines are blocked) that halt an entire application, goroutine leaks occur whenever a goroutine is blocked indefinitely, e.g., by reading from a channel that no other goroutine has access to, but other running goroutines keep the program operational. This issue can lead to (a) severe memory leaks, and (b) performance penalties, by over-burdening the GC with the task to mark useless memory. Goroutine leaks may be notoriously difficult to debug; in some cases even their presence alone is difficult to discern, even with otherwise thorough diagnostic information, e.g., memory and goroutine profiles. This makes tooling capable of detecting their presence valuable to the Go ecosystem.

Proposal

The change involves several modifications to key points during phases of the GC cycle, as follows:

Mark root preparation: initially treat only runnable goroutines as mark roots (the regular GC treats all goroutines as roots)
Proceed to mark memory from this set of roots.
Once all reachable memory has been marked, check whether any unmarked goroutines are blocked at operations over any concurrency primitives that have been marked as a result of step 2.
Any such goroutines are considered eventually runnable, and must be treated as mark roots. Resume marking from step 2 with the new roots.
Once a fixed point over reachable memory is computed, report any goroutines that are not treated as roots as leaks; resume from step 2 one last time with leaked goroutines as mark roots to ensure that all reachable memory is marked, like in the regular GC.
Sweeping proceeds as normal.

For an additional in-depth description of the theoretical underpinnings, refer here.

Rationale

The proposal expands the developer toolset when it comes to identifying goroutine leaks, especially in long-running systems with complex non-deterministic behavior. The advantage of this approach over other goroutine leak detection techniques is that it can be leveraged, with a minimal performance cost, in regular Go systems, e.g., production services. It is also theoretically sound, i.e., there are no false positives. Its primary limitation is that its effectiveness is reduced the more heap resources are over-exposed in memory, i.e., pair-wise reachable.

Compatibility

The feature is backwards-compatible with any Go program. Changes are strictly internal, and any extensions are only accessible on an opt-in basis via additional APIs, in this case by adding a new profile type.

Implementation

A working prototype is available at go.dev/cl/688335.

In this section we discuss various aspects of the implementation.

Opting in via profiling

Goroutine leak detection behaviour is triggered on-demand via profiling. An additional profile type, "goroutineleak", is now available. Attempting to extract it will perform the following:

Queue a leak detecting GC cycle and wait for it to complete.
Extract a goroutine profile.
Filter for goroutines with a leaked status, if debug < 2; alternatively, get a full stack dump of all goroutines, if debug >=2.
Output the results.

Otherwise, the GC preserves regular behavior, with a few exceptions described in the remainder of this section.

Temporary experimental flag

In order to avoid most performance penalties, the proposal is currently only enabled via the experimental flag goleakprofiler.

Hiding pointers from the GC

It is essential for the approach that certain pointers are only conditionally traced by the GC. In the current implementation, this is achieved via maybe-traceable pointers, expressed as type maybeTraceablePtr in the runtime.

A maybe-traceable pointer value is a pair between a unsafe.Pointer and uintptr value, stored at fields .vp and .vu, respectively, within the maybeTraceablePtr type. A maybe-traceable pointer has one of three states:

Unset: both .vp and .vu are zero values. This is homologous to nil.
Traceable: both .vp and .vu are set, where both point to the same address.
Untraceable: .vu is set to the address that is referenced, but .vp is set to nil, such that the GC does not automatically trace it when scanning the object embedding the maybe-traceable pointer.

Maybe-traceable pointers are then provided with a set of methods for setting and unsetting them, that guarantee certain invariants at runtime, e.g., that if .vp and .vu are set, they point to the same address.

The use of maybe-traceable pointers is only required for *sudog objects, specifically for the .elem and .hchan fields. This prevents the GC from inadvertendly marking channels that have not yet been deemed reachable in memory via eventually runnable goroutines. This may occur because *sudog objects are globally reachable: via the list of goroutine objects (*g) at allgs, and via the treap forest of semaphore-related *sudogs at semtable.

All uses of these fields have been updated with the methods provided by the maybeTraceablePtr type. When a goroutine leak detection GC cycle starts, it sets all maybe-traceable pointers in *sudog objects as untraceable. Once the cycle concludes, it resets all the pointers to being traceable.

Soft dependency on go.dev/issue/27993

In the current implementation of the GC, there is a check for whether marking phase must be restarted due to go.dev/issue/27993. We extend that checkpoint with additional logic: (1) to find additional eventually-runnable goroutines, or (2) to mark goroutines as leaked, both of which provide another reason to restart the marking phase. Even if #27993 is resolved, the checkpoint must be preserved for goroutine leak detection.