design: add design/74609-goroutine-leak-detection-gc.md

For golang/go#74609

Changes at CL 688335

Change-Id: I605c0d4aa88cd44f42300ebe476496744d93f9ce
GitHub-Last-Rev: 49564e8f7727b67d56faf8b330f266cea67a7fcd
GitHub-Pull-Request: golang/proposal#58
Reviewed-on: https://go-review.googlesource.com/c/proposal/+/689555
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
diff --git a/design/74609-goroutine-leak-detection-gc.md b/design/74609-goroutine-leak-detection-gc.md
new file mode 100644
index 0000000..3e9dbc5
--- /dev/null
+++ b/design/74609-goroutine-leak-detection-gc.md
@@ -0,0 +1,154 @@
+# Proposal: Goroutine leak detection via garbage collection
+
+Author(s): Georgian-Vlad Saioc (vsaioc@uber.com), Milind Chabbi (milind@uber.com)
+
+Last updated: 14 Aug 2025
+
+Discussion at [issue #74609](https://go.dev/issue/74609).
+
+## Abstract
+
+This proposal outlines a dynamic technique for detecting goroutine
+leaks within Go programs. It leverages the existing marking phase
+of the Go garbage collector (GC) to find goroutines blocked over
+concurrency primitives that are not reachable in memory from goroutines
+that may still be runnable.
+
+## Background
+
+Due to its concurrency features (lightweight goroutines,
+message passing), Go is particularly susceptible to concurrency bugs
+known as _goroutine leaks_ (also known as _partial deadlocks_ in
+literature [1](https://dl.acm.org/doi/10.1145/3676641.3715990)).
+Unlike global deadlocks (wherein all goroutines are blocked) that halt
+an entire application, goroutine leaks occur whenever a goroutine is
+blocked indefinitely, e.g., by reading from a channel that no other
+goroutine has access to, but other running goroutines keep the
+program operational.
+This issue can lead to (_a_) severe memory leaks, and (_b_) performance
+penalties, by over-burdening the GC with the task to mark useless memory.
+Goroutine leaks may be notoriously difficult to debug; in some cases
+even their presence alone is difficult to discern, even with otherwise
+thorough diagnostic information, e.g., memory and goroutine profiles.
+This makes tooling capable of detecting their presence valuable
+to the Go ecosystem.
+
+## Proposal
+
+The change involves several modifications to key points during phases
+of the GC cycle, as follows:
+1. Mark root preparation: initially treat only _runnable_ goroutines
+as mark roots (the regular GC treats _all_ goroutines as roots)
+2. Proceed to mark memory from this set of roots.
+3. Once all reachable memory has been marked, check whether any
+unmarked goroutines are blocked at operations over any concurrency
+primitives that have been marked as a result of step 2.
+4. Any such goroutines are considered _eventually runnable_, and
+must be treated as mark roots. Resume marking from step 2 with
+the new roots.
+5. Once a fixed point over reachable memory is computed, report any
+goroutines that are not treated as roots as leaks; resume from
+step 2 one last time with leaked goroutines as mark roots to ensure
+that all reachable memory is marked, like in the regular GC.
+6. Sweeping proceeds as normal.
+
+For an additional in-depth description of the theoretical
+underpinnings, refer [here](https://dl.acm.org/doi/10.1145/3676641.3715990).
+
+## Rationale
+
+The proposal expands the developer toolset when it comes to identifying
+goroutine leaks, especially in long-running systems with complex
+non-deterministic behavior.
+The advantage of this approach over other goroutine leak detection
+techniques is that it can be leveraged, with a minimal performance
+cost, in regular Go systems, e.g., production services.
+It is also theoretically sound, i.e., there are no false positives.
+Its primary limitation is that its effectiveness is reduced the more
+heap resources are over-exposed in memory, i.e., pair-wise reachable.
+
+## Compatibility
+
+The feature is backwards-compatible with any Go program.
+Changes are strictly internal, and any extensions are only accessible
+on an opt-in basis via additional APIs, in this case by adding a
+new profile type.
+
+## Implementation
+
+A working prototype is available at [go.dev/cl/688335](https://go.dev/cl/688335).
+
+In this section we discuss various aspects of the implementation.
+
+### Opting in via profiling
+
+Goroutine leak detection behaviour is
+triggered on-demand via profiling.
+An additional profile type, `"goroutineleak"`, is now available.
+Attempting to extract it will perform the following:
+
+1. Queue a leak detecting GC cycle and wait for it to complete.
+2. Extract a goroutine profile.
+3. Filter for goroutines with a leaked status, if `debug < 2`;
+alternatively, get a full stack dump of all goroutines, if `debug >=2`.
+4. Output the results.
+
+Otherwise, the GC preserves regular behavior, with a few exceptions
+described in the remainder of this section.
+
+### Temporary experimental flag
+In order to avoid most performance penalties,
+the proposal is currently only enabled via the
+experimental flag `goleakprofiler`.
+
+### Hiding pointers from the GC
+It is essential for the approach that certain pointers are only
+conditionally traced by the GC.
+In the current implementation, this is achieved via
+**maybe-traceable pointers**, expressed as type `maybeTraceablePtr`
+in the runtime.
+
+A maybe-traceable pointer value is a pair between a
+`unsafe.Pointer` and `uintptr` value, stored at fields `.vp` and `.vu`,
+respectively, within the `maybeTraceablePtr` type.
+A maybe-traceable pointer has one of three states:
+
+1) **Unset:** both `.vp` and `.vu` are zero values.
+This is homologous to `nil`.
+2) **Traceable:** both `.vp` and `.vu` are set, where both point to the
+same address.
+3) **Untraceable:** `.vu` is set to the address that is referenced, but
+`.vp` is set
+to `nil`, such that the GC does not automatically trace it when
+scanning the object embedding the maybe-traceable pointer.
+
+Maybe-traceable pointers are then provided with a set of methods for
+setting and unsetting them, that guarantee certain invariants at
+runtime, e.g., that if `.vp` and `.vu` are set, they point to the
+same address.
+
+The use of maybe-traceable pointers is only required for `*sudog`
+objects, specifically for the `.elem` and `.hchan` fields.
+This prevents the GC from inadvertendly marking channels that have
+not yet been deemed reachable in memory via eventually runnable
+goroutines.
+This may occur because `*sudog` objects are globally reachable: via
+the list of goroutine objects (`*g`) at `allgs`, and via the treap
+forest of semaphore-related `*sudog`s at `semtable`.
+
+All uses of these fields have been updated with the methods provided
+by the `maybeTraceablePtr` type.
+When a goroutine leak detection GC cycle starts, it sets all
+maybe-traceable pointers in `*sudog` objects as untraceable.
+Once the cycle concludes, it resets all the pointers to being traceable.
+
+### Soft dependency on [go.dev/issue/27993](https://go.dev/issue/27993)
+In the current implementation of the GC, there is a check for whether
+marking phase must be restarted due to
+[go.dev/issue/27993](https://go.dev/issue/27993).
+We extend that checkpoint with additional logic: (1) to find
+additional eventually-runnable goroutines, or (2) to mark goroutines as
+leaked, both of which provide another reason to restart
+the marking phase.
+Even if #27993 is resolved, the checkpoint must be preserved
+for goroutine leak detection.