blob: 3e9dbc5bff2a5df915dafc249822d6482fb5b81f [file] [log] [blame] [view]
# Proposal: Goroutine leak detection via garbage collection
Author(s): Georgian-Vlad Saioc (vsaioc@uber.com), Milind Chabbi (milind@uber.com)
Last updated: 14 Aug 2025
Discussion at [issue #74609](https://go.dev/issue/74609).
## Abstract
This proposal outlines a dynamic technique for detecting goroutine
leaks within Go programs. It leverages the existing marking phase
of the Go garbage collector (GC) to find goroutines blocked over
concurrency primitives that are not reachable in memory from goroutines
that may still be runnable.
## Background
Due to its concurrency features (lightweight goroutines,
message passing), Go is particularly susceptible to concurrency bugs
known as _goroutine leaks_ (also known as _partial deadlocks_ in
literature [1](https://dl.acm.org/doi/10.1145/3676641.3715990)).
Unlike global deadlocks (wherein all goroutines are blocked) that halt
an entire application, goroutine leaks occur whenever a goroutine is
blocked indefinitely, e.g., by reading from a channel that no other
goroutine has access to, but other running goroutines keep the
program operational.
This issue can lead to (_a_) severe memory leaks, and (_b_) performance
penalties, by over-burdening the GC with the task to mark useless memory.
Goroutine leaks may be notoriously difficult to debug; in some cases
even their presence alone is difficult to discern, even with otherwise
thorough diagnostic information, e.g., memory and goroutine profiles.
This makes tooling capable of detecting their presence valuable
to the Go ecosystem.
## Proposal
The change involves several modifications to key points during phases
of the GC cycle, as follows:
1. Mark root preparation: initially treat only _runnable_ goroutines
as mark roots (the regular GC treats _all_ goroutines as roots)
2. Proceed to mark memory from this set of roots.
3. Once all reachable memory has been marked, check whether any
unmarked goroutines are blocked at operations over any concurrency
primitives that have been marked as a result of step 2.
4. Any such goroutines are considered _eventually runnable_, and
must be treated as mark roots. Resume marking from step 2 with
the new roots.
5. Once a fixed point over reachable memory is computed, report any
goroutines that are not treated as roots as leaks; resume from
step 2 one last time with leaked goroutines as mark roots to ensure
that all reachable memory is marked, like in the regular GC.
6. Sweeping proceeds as normal.
For an additional in-depth description of the theoretical
underpinnings, refer [here](https://dl.acm.org/doi/10.1145/3676641.3715990).
## Rationale
The proposal expands the developer toolset when it comes to identifying
goroutine leaks, especially in long-running systems with complex
non-deterministic behavior.
The advantage of this approach over other goroutine leak detection
techniques is that it can be leveraged, with a minimal performance
cost, in regular Go systems, e.g., production services.
It is also theoretically sound, i.e., there are no false positives.
Its primary limitation is that its effectiveness is reduced the more
heap resources are over-exposed in memory, i.e., pair-wise reachable.
## Compatibility
The feature is backwards-compatible with any Go program.
Changes are strictly internal, and any extensions are only accessible
on an opt-in basis via additional APIs, in this case by adding a
new profile type.
## Implementation
A working prototype is available at [go.dev/cl/688335](https://go.dev/cl/688335).
In this section we discuss various aspects of the implementation.
### Opting in via profiling
Goroutine leak detection behaviour is
triggered on-demand via profiling.
An additional profile type, `"goroutineleak"`, is now available.
Attempting to extract it will perform the following:
1. Queue a leak detecting GC cycle and wait for it to complete.
2. Extract a goroutine profile.
3. Filter for goroutines with a leaked status, if `debug < 2`;
alternatively, get a full stack dump of all goroutines, if `debug >=2`.
4. Output the results.
Otherwise, the GC preserves regular behavior, with a few exceptions
described in the remainder of this section.
### Temporary experimental flag
In order to avoid most performance penalties,
the proposal is currently only enabled via the
experimental flag `goleakprofiler`.
### Hiding pointers from the GC
It is essential for the approach that certain pointers are only
conditionally traced by the GC.
In the current implementation, this is achieved via
**maybe-traceable pointers**, expressed as type `maybeTraceablePtr`
in the runtime.
A maybe-traceable pointer value is a pair between a
`unsafe.Pointer` and `uintptr` value, stored at fields `.vp` and `.vu`,
respectively, within the `maybeTraceablePtr` type.
A maybe-traceable pointer has one of three states:
1) **Unset:** both `.vp` and `.vu` are zero values.
This is homologous to `nil`.
2) **Traceable:** both `.vp` and `.vu` are set, where both point to the
same address.
3) **Untraceable:** `.vu` is set to the address that is referenced, but
`.vp` is set
to `nil`, such that the GC does not automatically trace it when
scanning the object embedding the maybe-traceable pointer.
Maybe-traceable pointers are then provided with a set of methods for
setting and unsetting them, that guarantee certain invariants at
runtime, e.g., that if `.vp` and `.vu` are set, they point to the
same address.
The use of maybe-traceable pointers is only required for `*sudog`
objects, specifically for the `.elem` and `.hchan` fields.
This prevents the GC from inadvertendly marking channels that have
not yet been deemed reachable in memory via eventually runnable
goroutines.
This may occur because `*sudog` objects are globally reachable: via
the list of goroutine objects (`*g`) at `allgs`, and via the treap
forest of semaphore-related `*sudog`s at `semtable`.
All uses of these fields have been updated with the methods provided
by the `maybeTraceablePtr` type.
When a goroutine leak detection GC cycle starts, it sets all
maybe-traceable pointers in `*sudog` objects as untraceable.
Once the cycle concludes, it resets all the pointers to being traceable.
### Soft dependency on [go.dev/issue/27993](https://go.dev/issue/27993)
In the current implementation of the GC, there is a check for whether
marking phase must be restarted due to
[go.dev/issue/27993](https://go.dev/issue/27993).
We extend that checkpoint with additional logic: (1) to find
additional eventually-runnable goroutines, or (2) to mark goroutines as
leaked, both of which provide another reason to restart
the marking phase.
Even if #27993 is resolved, the checkpoint must be preserved
for goroutine leak detection.