| 1 | +# Proposal: Goroutine leak detection via garbage collection |
| 2 | + |
| 3 | +Author(s): Georgian-Vlad Saioc ( [email protected]), Milind Chabbi ( [email protected]) |
| 4 | + |
| 5 | +Last updated: 14 Aug 2025 |
| 6 | + |
| 7 | +Discussion at [issue #74609](https://go.dev/issue/74609). |
| 8 | + |
| 9 | +## Abstract |
| 10 | + |
| 11 | +This proposal outlines a dynamic technique for detecting goroutine |
| 12 | +leaks within Go programs. It leverages the existing marking phase |
| 13 | +of the Go garbage collector (GC) to find goroutines blocked on |
| 14 | +concurrency primitives that are not reachable in memory from goroutines |
| 15 | +that may still be runnable. |
| 16 | + |
| 17 | +## Background |
| 18 | + |
| 19 | +Due to its concurrency features (lightweight goroutines, |
| 20 | +message passing), Go is particularly susceptible to concurrency bugs |
| 21 | +known as _goroutine leaks_ (also known as _partial deadlocks_ in the |
| 22 | +literature [1](https://dl.acm.org/doi/10.1145/3676641.3715990)). |
| 23 | +Unlike global deadlocks (wherein all goroutines are blocked) that halt |
| 24 | +an entire application, goroutine leaks occur whenever a goroutine is |
| 25 | +blocked indefinitely, e.g., by reading from a channel that no other |
| 26 | +goroutine has access to, but other running goroutines keep the |
| 27 | +program operational. |
| 28 | +This issue can lead to (_a_) severe memory leaks, and (_b_) performance |
| 29 | +penalties, by burdening the GC with the task of marking useless memory. |
| 30 | +Goroutine leaks are notoriously difficult to debug; in some cases |
| 31 | +their mere presence is hard to discern, even with otherwise |
| 32 | +thorough diagnostic information, e.g., memory and goroutine profiles. |
| 33 | +This makes tooling capable of detecting their presence valuable |
| 34 | +to the Go ecosystem. |
| 35 | + |
| 36 | +## Proposal |
| 37 | + |
| 38 | +The change involves several modifications to key points during phases |
| 39 | +of the GC cycle, as follows: |
| 40 | +1. Mark root preparation: initially treat only _runnable_ goroutines |
| 41 | +as mark roots (the regular GC treats _all_ goroutines as roots). |
| 42 | +2. Proceed to mark memory from this set of roots. |
| 43 | +3. Once all reachable memory has been marked, check whether any |
| 44 | +unmarked goroutines are blocked at operations over any concurrency |
| 45 | +primitives that have been marked as a result of step 2. |
| 46 | +4. Any such goroutines are considered _eventually runnable_, and |
| 47 | +must be treated as mark roots. Resume marking from step 2 with |
| 48 | +the new roots. |
| 49 | +5. Once a fixed point over reachable memory is computed, report any |
| 50 | +goroutines that are not treated as roots as leaks; resume from |
| 51 | +step 2 one last time with leaked goroutines as mark roots to ensure |
| 52 | +that all reachable memory is marked, like in the regular GC. |
| 53 | +6. Sweeping proceeds as normal. |
| 54 | + |
| 55 | +For an in-depth description of the theoretical |
| 56 | +underpinnings, refer to [this paper](https://dl.acm.org/doi/10.1145/3676641.3715990). |
| 57 | + |
| 58 | +## Rationale |
| 59 | + |
| 60 | +The proposal expands the developer toolset when it comes to identifying |
| 61 | +goroutine leaks, especially in long-running systems with complex |
| 62 | +non-deterministic behavior. |
| 63 | +The advantage of this approach over other goroutine leak detection |
| 64 | +techniques is that it can be leveraged, with a minimal performance |
| 65 | +cost, in regular Go systems, e.g., production services. |
| 66 | +It is also theoretically sound, i.e., there are no false positives. |
| 67 | +Its primary limitation is that it becomes less effective the more |
| 68 | +heap resources are over-exposed in memory, i.e., pair-wise reachable. |
| 69 | + |
| 70 | +## Compatibility |
| 71 | + |
| 72 | +The feature is backwards-compatible with any Go program. |
| 73 | +Changes are strictly internal, and any extensions are only accessible |
| 74 | +on an opt-in basis via additional APIs, in this case by adding a |
| 75 | +new profile type. |
| 76 | + |
| 77 | +## Implementation |
| 78 | + |
| 79 | +A working prototype is available at [go.dev/cl/688335](https://go.dev/cl/688335). |
| 80 | + |
| 81 | +In this section we discuss various aspects of the implementation. |
| 82 | + |
| 83 | +### Opting in via profiling |
| 84 | + |
| 85 | +Goroutine leak detection is |
| 86 | +triggered on demand via profiling. |
| 87 | +An additional profile type, `"goroutineleak"`, is now available. |
| 88 | +Extracting it performs the following steps: |
| 89 | + |
| 90 | +1. Queue a leak-detecting GC cycle and wait for it to complete. |
| 91 | +2. Extract a goroutine profile. |
| 92 | +3. Filter for goroutines with a leaked status, if `debug < 2`; |
| 93 | +alternatively, get a full stack dump of all goroutines, if `debug >= 2`. |
| 94 | +4. Output the results. |
| 95 | + |
| 96 | +Otherwise, the GC preserves regular behavior, with a few exceptions |
| 97 | +described in the remainder of this section. |
| 98 | + |
| 99 | +### Temporary experimental flag |
| 100 | +In order to avoid most performance penalties, |
| 101 | +the proposal is currently only enabled via the |
| 102 | +experimental flag `goleakprofiler`. |
| 103 | + |
| 104 | +### Hiding pointers from the GC |
| 105 | +It is essential for the approach that certain pointers are only |
| 106 | +conditionally traced by the GC. |
| 107 | +In the current implementation, this is achieved via |
| 108 | +**maybe-traceable pointers**, expressed as type `maybeTraceablePtr` |
| 109 | +in the runtime. |
| 110 | + |
| 111 | +A maybe-traceable pointer value is a pair of an |
| 112 | +`unsafe.Pointer` and a `uintptr` value, stored in fields `.vp` and `.vu`, |
| 113 | +respectively, of the `maybeTraceablePtr` type. |
| 114 | +A maybe-traceable pointer is in one of three states: |
| 115 | + |
| 116 | +1) **Unset:** both `.vp` and `.vu` are zero values. |
| 117 | +This is analogous to `nil`. |
| 118 | +2) **Traceable:** both `.vp` and `.vu` are set, and both refer to the |
| 119 | +same address. |
| 120 | +3) **Untraceable:** `.vu` is set to the address that is referenced, but |
| 121 | +`.vp` is set to `nil`, |
| 122 | +such that the GC does not automatically trace it when |
| 123 | +scanning the object embedding the maybe-traceable pointer. |
| 124 | + |
| 125 | +Maybe-traceable pointers come with a set of methods for |
| 126 | +setting and unsetting them, which guarantee certain invariants at |
| 127 | +runtime, e.g., that if `.vp` and `.vu` are set, they point to the |
| 128 | +same address. |
| 129 | + |
| 130 | +The use of maybe-traceable pointers is only required for `*sudog` |
| 131 | +objects, specifically for the `.elem` and `.hchan` fields. |
| 132 | +This prevents the GC from inadvertently marking channels that have |
| 133 | +not yet been deemed reachable in memory via eventually runnable |
| 134 | +goroutines. |
| 135 | +This may occur because `*sudog` objects are globally reachable: via |
| 136 | +the list of goroutine objects (`*g`) at `allgs`, and via the treap |
| 137 | +forest of semaphore-related `*sudog`s at `semtable`. |
| 138 | + |
| 139 | +All uses of these fields have been updated to go through the methods provided |
| 140 | +by the `maybeTraceablePtr` type. |
| 141 | +When a goroutine leak detection GC cycle starts, it sets all |
| 142 | +maybe-traceable pointers in `*sudog` objects as untraceable. |
| 143 | +Once the cycle concludes, it resets all the pointers to being traceable. |
| 144 | + |
| 145 | +### Soft dependency on [go.dev/issue/27993](https://go.dev/issue/27993) |
| 146 | +In the current implementation of the GC, there is a check for whether |
| 147 | +the marking phase must be restarted due to |
| 148 | +[go.dev/issue/27993](https://go.dev/issue/27993). |
| 149 | +We extend that checkpoint with additional logic: (1) to find |
| 150 | +additional eventually-runnable goroutines, or (2) to mark goroutines as |
| 151 | +leaked, both of which provide another reason to restart |
| 152 | +the marking phase. |
| 153 | +Even if #27993 is resolved, the checkpoint must be preserved |
| 154 | +for goroutine leak detection. |