design/74299-runtime-free.md: add runtime.free and use via compiler and stdlib to reduce GC work Design doc for golang/go#74299 and related CLs. This is currently largely in sync with the latest implementation. (The largest current delta is the exact name for runtime.freeSized, which might make more sense as runtime.free.) Roughly two-thirds of the content here was originally in the CL 673695 commit message, but it was moved to this design document to make it easier to manage it separately from the CLs. Updates golang/go#74299 Change-Id: I56738000cc473fa95d848f2d078c3d160c6ed8cd
diff --git a/design/74299-runtime-free.md b/design/74299-runtime-free.md new file mode 100644 index 0000000..01735a1 --- /dev/null +++ b/design/74299-runtime-free.md
@@ -0,0 +1,511 @@ +# Directly freeing user memory to reduce GC work + +PJ Malloy ([@thepudds](https://github.com/thepudds)) \ +January 2025 + +**Discussion** at https://go.dev/issue/74299. \ +**CL stack** at https://go.dev/cl/673695. + +## Abstract + +This is a design for implementing a runtime.free within the +runtime, and then using it via automatic calls inserted by the compiler +and via a limited set of explicit calls from the standard library. + +When triggered, the goals include faster reuse of user memory, +less overall CPU usage by the GC, longer times between GC cycles +(with less overall time with the write barrier enabled), +and more cache-friendly allocations for user code. + +This hopefully complements other ongoing work like: + * [memory regions](https://go.dev/issue/70257), which can dynamically cast a wider net via a new user-facing API. + * [more stack allocations of slices](https://go.dev/cl/664299), which currently targets smaller slices such as <= 32 bytes. + * [Green Tea](https://go.dev/issue/73581), which is a general performance improvement for the GC in many cases. + +## Background + +The compiler today might be able to prove that an allocation +does not escape, but in many cases the allocation is +still placed on the heap (for example, because the size is +not statically known at compile time, or is +too large for the stack). + +Separately, in some cases, the runtime or standard library might know +a heap object is logically dead but the compiler is unable to prove that. + +More generally, for reducing allocations, there is sometimes +a chicken-and-egg problem. + +For example, [#72036](https://go.dev/issue/72036) is a WIP approach to update escape analysis so +that interface arguments do not always escape, with a primary target of +`w.Write(b)` and friends, but many interesting byte slices are of a +variable size known only at runtime. 
Currently, a make with an unknown +capacity (beyond 32 bytes) causes a slice to be heap allocated, so +solely improving escape analysis in this case has limited impact +on actual applications -- if a slice is heap allocated for two reasons, +eliminating one reason still leaves it heap allocated for the +other reason, which lessens the value of doing the work to eliminate the +first reason. + +In other words, a runtime.free delivers performance benefits on its own, +but having a runtime.free also makes other improvements more valuable, +including in escape analysis. + +There is also a reasonably rich set of follow-on targets that +likely become available with the existence of a runtime.free. + +## High-level approach + +In an attempt to help triangulate on what is possible for the runtime, +compiler, and standard library to do together to reduce how much work +the GC must do, this document proposes updates to: + +1. **The runtime**: implement the ability to free memory explicitly without +waiting for a GC cycle. Freed objects are recorded in one of two types of +new free lists, and are then immediately reusable for +the next allocation in the same span class, and hence can dramatically +reduce pressure on the GC. Challenges here include: + * doing this safely (including in the face of the concurrent GC, + conservative scanning, and time delays between when an object is observed by the GC + and when it is processed) + * doing this performantly (including without materially slowing down + the normal allocation path when there is no reuse, making it fast on the reuse path, + and being careful to not increase memory usage too much due to tracking + additional information). + +2. **The compiler**: automatically insert runtime calls to track and free slices +allocated via make when the compiler can prove it is safe to do so. 
Also implemented +is automatic insertion of calls to free the intermediate memory +from repeated calls to `append` in a loop (without freeing the final slice) when it +can prove the memory is unaliased. This is done regardless of whether +the final slice escapes (for example, even if the final slice is returned from +the function that built the slice using `append` in a loop). + +3. **The standard library**: update code to invoke a runtime.free explicitly +in a few targeted places. + +All three steps have been implemented (with some of this work in review, +some close to review, and some running but still in a WIP stage). + +A sample benchmark included below shows `strings.Builder` operating +roughly 2x faster. + +For the explicit invocation in the standard library, separate CLs update +the builtin maps to free backing map storage during growth and split, +as well as a small number of lower-level stdlib things like `strings.Builder` +(which are "low level" in the sense that they already encapsulate some low-level performance tricks, +but also in the sense that they let other higher-level pieces of the stdlib then safely benefit from their +performance). + +**Note**: It is not intended that higher-level portions of the stdlib (like net/http) would have manual +calls to a runtime.free. + +Whether we ever take step 3 (manual code updates within the standard library) +is debatable, and whether we want to do so can be determined later. However, +it is useful to do so now at least in WIP CLs to see the resulting performance. + +Additional steps that are natural follow-ons (but not yet attempted) include: + +4. Recognizing `append` in loops that start with a slice of unknown provenance, +such as a function parameter, but still being able to automatically +free intermediate memory starting with the second growslice if the slice +does not otherwise become aliased. + +5. 
Recognizing `append` in an iterator-based loop to be able to prove it is +safe to free unaliased intermediate memory, such as in the patterns +shown in `slices.Collect` and `slices.AppendSeq`. (The first round of +work manually updates those two in particular, partly to be able to +see the resulting performance, but that could wait for +the compiler to automatically recognize those two patterns without any +manual code edit to the `slices` package.) + +6. Possibly extend to freeing more than just slices. (Slices are a natural +first step, but the large majority of the work implemented is independent of whether +the heap object represents a slice.) + +## Runtime API overview + +The APIs in this section have a working initial implementation: + +```go + func freeSized(ptr unsafe.Pointer, size uintptr, noscan bool) + func freeTracked(trackedObjs *[]trackedObj) +``` + +Currently, the first API `freeSized` can be called by the +standard library, with initial targets in the map grow/split code, +`strings.Builder`, possibly places like `slices.Collect` and +`slices.AppendSeq`, and other locations where the code knows it owns a +pointer and has knowledge of when it is logically dead. + +The second API `freeTracked` is automatically inserted by the compiler. +`freeTracked` is paired with a new slice allocation function. Together, +this allows for heap-allocated objects to be tracked and later automatically +freed as a scope is exited: + +```go + func makeslicetracked64(et *_type, len64 int64, cap64 int64, trackedObjs *[]trackedObj) unsafe.Pointer +``` + +Note that `freeSized` accepts an `unsafe.Pointer` and hence keeps the pointer +alive. It therefore could be a pessimization in some cases (such +as in a long-lived function) if the caller does not call `freeSized` before +or roughly when the compiler's liveness analysis +would otherwise have determined the heap object is reclaimable by the GC. 
+ +In contrast, `freeTracked` operates on `uintptr` plus a sequence number to know +whether it is still valid to use the `uintptr`. This avoids keeping the heap object +alive longer than it otherwise would have been, which is important for performance +and also simplifies where the runtime calls can be automatically placed by the compiler. + +There are more details on these functions at the bottom of this document, +including draft API doc. + +An additional runtime API (most likely a temporary convenience) is +currently inserted by the compiler instead of using `makeslicetracked64`: + +```go +func trackSlice(sl unsafe.Pointer, trackedObjs *[]trackedObj) +``` + +## Compiler changes overview + +Additional CLs update the compiler to automatically insert free calls for slices allocated via +make when the compiler can prove it is safe to do so. + +Consider for example the following code: + +```go +func f1(size int) { + s := make([]int64, size) // heap allocated if > 32 bytes + _ = s +} +``` + +In today's compiler, the backing array for `s` is not marked as escaping, +but will be heap allocated if larger than 32 bytes. + +As part of this work, the compiler has been updated so that the +backing array for `s` is recognized as non-escaping but of unknown size, and +it is valid to free it when `f1` is done. + +The compiler automatically transforms `f1` to roughly: + +```go +func f1(size int) { + var freeablesArr [1]struct { // compiler counts how many are needed (one, in this case) + ptr uintptr + sweepgen uint32 + } + freeables := freeablesArr[:0] + defer freeTracked(&freeables) + + s := make([]int64, size) + trackSlice(unsafe.Pointer(&s), &freeables) + + _ = s +} +``` + +The `freeablesArr` is stored on the stack and automatically sized by the compiler based +on the number of target sites. The `defer` is inserted in a way that makes it an open-coded defer +and hence has low performance impact. + +See the API doc below for details on `freeTracked`. 
+ +The transformed code just above shows what is currently implemented, but two items seen +there might change as the implementation is polished: + * `freeTracked` currently takes a slice for convenience, but we could make that tighter. + * the current intent is to use `makeslicetracked64` (for example, in `f1` above), + but the compiler currently inserts calls to a new `runtime.trackSlice` instead, + which initially seemed slightly easier to implement on the compiler side. + The current plan is to have the compiler use `makeslicetracked64` instead. + +If there are multiple target sites within a function, such as in `f2` here: + +```go +func f2(size1 int, size2 int, x bool) { + s1 := make([]int64, size1) // heap allocated if > 32 bytes + if x { + s2 := make([]int64, size2) // heap allocated if > 32 bytes + _ = s2 + } + _ = s1 +} +``` + +The compiler counts the max number of sites (not sensitive to conditionals), +and sizes `freeablesArr` appropriately (a two-element array in this case), +with a single call to `freeTracked` per function: + +```go +func f2(size1 int, size2 int, x bool) { + var freeablesArr [2]struct { // compiler counts how many are needed (two, in this case) + ptr uintptr + sweepgen uint32 + } + freeables := freeablesArr[:0] + defer freeTracked(&freeables) // only a single freeTracked call, regardless of how many tracked sites + + s1 := make([]int64, size1) + trackSlice(unsafe.Pointer(&s1), &freeables) + if x { + s2 := make([]int64, size2) + trackSlice(unsafe.Pointer(&s2), &freeables) + + _ = s2 + } + _ = s1 +} +``` + +The max number of automatically inserted call sites is bounded +based on the count of `make` calls in a given function. + +Inside `for` loops or when a `goto` looks like it is +being used in an unstructured loop, the compiler +currently skips automatically inserting these calls. 
+ +However, a modest extension (and the current plan) is to have the compiler instead insert +a separate runtime entry point in loops to allow heap-allocated slice memory +to be reclaimed on each loop iteration that returns +back to the same allocation spot when the compiler +can prove it is safe to do so, without requiring unbounded or +dynamic space in the trackedObj slice. (The compiler can already +prove it is safe and the runtime entry point is prototyped.) + +## Example standard library target for runtime.free + +Using strings.Builder as an example target -- this is already a somewhat +low-level part of the stdlib that helps encapsulate some +performance-related trickery, including using unsafe and +bytealg.MakeNoZero. The strings.Builder already owns its in-progress +buffer that is being built up, until it returns what is usually the final string. +The existing internal comment says: + +```go + // External users should never get direct access to this buffer, since + // the slice at some point will be converted to a string using unsafe, + // also data between len(buf) and cap(buf) might be uninitialized. + buf []byte +``` + +If we update strings.Builder to explicitly call runtime.freeSized on +that buf after a resize operation (but without freeing the usually final +incarnation of buf that will be returned to the user as a string), we +can see some nice benchmark results on the existing strings benchmarks +that call Builder.Write N times and then call Builder.String. 
Here, the +(uncommon) case of a single Builder.Write is not helped (given it never +resizes after the first alloc if there is only one Write), but the impact grows +such that it is up to ~2x faster as there are more resize operations due +to more strings.Builder.Write calls: + +``` + │ disabled.out │ new-free-20.txt │ + │ sec/op │ sec/op vs base │ +BuildString_Builder/1Write_36Bytes_NoGrow-4 55.82n ± 2% 55.86n ± 2% ~ (p=0.794 n=20) +BuildString_Builder/2Write_36Bytes_NoGrow-4 125.2n ± 2% 115.4n ± 1% -7.86% (p=0.000 n=20) +BuildString_Builder/3Write_36Bytes_NoGrow-4 224.0n ± 1% 188.2n ± 2% -16.00% (p=0.000 n=20) +BuildString_Builder/5Write_36Bytes_NoGrow-4 239.1n ± 9% 205.1n ± 1% -14.20% (p=0.000 n=20) +BuildString_Builder/8Write_36Bytes_NoGrow-4 422.8n ± 3% 325.4n ± 1% -23.04% (p=0.000 n=20) +BuildString_Builder/10Write_36Bytes_NoGrow-4 436.9n ± 2% 342.3n ± 1% -21.64% (p=0.000 n=20) +BuildString_Builder/100Write_36Bytes_NoGrow-4 4.403µ ± 1% 2.381µ ± 2% -45.91% (p=0.000 n=20) +BuildString_Builder/1000Write_36Bytes_NoGrow-4 48.28µ ± 2% 21.38µ ± 2% -55.71% (p=0.000 n=20) +``` + +An alternative to strings.Builder calling runtime.freeSized itself would +be to provide an alternate runtime entry point for appends where the +backing array is known to be logically dead after a resize, +and the runtime therefore knows it can do the free itself. Such a +function could automatically be inserted by the compiler in many cases +of appends in a loop, such as in the patterns shown by slices.Collect +and slices.AppendSeq, though for more complex cases such as +strings.Builder, it is unlikely that the compiler would be able +to prove it is safe to do so automatically; for example, for +strings.Builder, the manual edit must track whether Builder.String has already been +called. (I suspect it is not common to call Builder.String +multiple times on a single instance, but the API allows it.) 
+ +Whether we ever use a runtime.free with strings.Builder is debatable +(there are alternatives for strings.Builder, such as a segmented +buffer and/or a sync.Pool, as used within cmd/compile). We are mainly using it here as a +small-ish concrete example. + +There are other potential optimizations for strings.Builder, +like allocating an initial buf (e.g., 64 bytes) +on the first Write, faster growth than the current 1.25x for larger +slices, and "right sizing" the final string, all of which could use +runtime.freeSized if they end up being worthwhile; an +earlier cut did implement those optimizations, and the intent is to +measure again with the more recent implementation. + +## Normal allocation performance + +For the cases where a normal allocation is happening without any reuse, +some initial micro-benchmarks suggest the impact of these changes could +be small to negligible (at least with GOAMD64=v3): + +``` +goos: linux +goarch: amd64 +pkg: runtime +cpu: AMD EPYC 7B13 + │ base-512M-v3.bench │ ps16-512M-goamd64-v3.bench │ + │ sec/op │ sec/op vs base │ +Malloc8-16 11.01n ± 1% 10.94n ± 1% -0.68% (p=0.038 n=20) +Malloc16-16 17.15n ± 1% 17.05n ± 0% -0.55% (p=0.007 n=20) +Malloc32-16 18.65n ± 1% 18.42n ± 0% -1.26% (p=0.000 n=20) +MallocTypeInfo8-16 18.63n ± 0% 18.36n ± 0% -1.45% (p=0.000 n=20) +MallocTypeInfo16-16 22.32n ± 0% 22.65n ± 0% +1.50% (p=0.000 n=20) +MallocTypeInfo32-16 23.37n ± 0% 23.89n ± 0% +2.23% (p=0.000 n=20) +geomean 18.02n 18.01n -0.05% +``` + +## runtime API + +This section lists the runtime APIs that are either automatically +inserted by the compiler or can be called from a targeted set of +locations in the standard library. 
+ +Note: the term "logically dead" here is distinct from the compiler's liveness +analysis, and the runtime free and re-use implementations must be (and +attempt to be) robust in sequences such as a first alloc of a given +pointer, GC observes and queues the pointer, runtime.free, GC dequeues +pointer for scanning, runtime re-uses the same pointer, all within a +single GC phase. (TODO: more precise term than "logically dead"?) + +```go +// freeSized records that a heap object is reusable and available for +// immediate reuse in a subsequent mallocgc allocation, without +// needing to wait for the GC cycle to progress. +// +// The information is recorded in a free list stored in the +// current P's mcache. The caller must pass in the user size +// and whether the object has pointers, which allows a faster free +// operation. +// +// freeSized must be called by the effective owner of ptr who knows +// the pointer is logically dead, with no possible aliases that might +// be used past that moment. The intended callers are +// a limited number of places in the standard library (e.g., strings.Builder) +// or the runtime (e.g., maps grow/split code, append-handling +// code). In the future, the compiler could also insert direct freeSized calls, +// though currently a related API freeTracked serves that purpose. +// +// freeSized is a no-op if called on a stack pointer. It must not be called +// on memory in the data or bss sections, which is partially enforced. +// +// In addition, the caller must know that ptr's object has no specials, such +// as might have been created by a call to SetFinalizer or AddCleanup. +// (Internally, the runtime deals appropriately with internally-created +// specials, such as specials for memory profiling.) +// +// If the size of ptr's object is less than or equal to 16 bytes or +// greater than 32 KiB, freeSized is currently a no-op. It must only +// be called in alloc-safe places. 
+func freeSized(p unsafe.Pointer, size uintptr, noscan bool) bool + +// makeslicetracked64 is like the existing makeslice64, but the caller +// also passes in a pointer to a trackedObj slice that allows the runtime +// to track the heap-allocated backing array and possibly later +// free the tracked objects via freeTracked. The contents of +// the trackedObj slice are opaque to the caller, but the caller +// provides the space for the trackedObj slice. +// +// Currently, the only intended caller is code inserted by +// the compiler, which only does so when escape analysis can +// prove the associated pointer has a known scope -- currently, +// when it can prove the pointer does not outlive the current +// function. This effectively creates a scope for the +// associated heap pointers. The compiler also arranges for +// freeTracked to be called on function exit, so that +// the heap objects are automatically freed when leaving the +// scope. The compiler also arranges for the trackedObj slice's +// backing array to be placed on the user's normal goroutine +// stack to provide storage for the tracking information, +// with that storage matching the lifetime of the tracked +// heap objects (that is, both match the lifetime of +// the function call). +func makeslicetracked64(et *_type, len64 int64, cap64 int64, trackedObjs *[]trackedObj) unsafe.Pointer + +// freeTracked marks as reusable the objects allocated by +// calls to makeslicetracked64 and tracked in the trackedObjs +// slice, unless the GC has already possibly reclaimed +// those objects. +// +// The trackedObj slice does not keep the associated pointers +// alive, and freeTracked becomes a no-op if the GC cycle +// has progressed such that the GC might have already reclaimed +// the tracked pointers before freeTracked was called. +// This might happen for example if a function call lasts +// multiple GC cycles. This is currently decided per tracked +// object. 
+// +// The compiler statically counts the maximum number of +// possible makeslicetracked64 calls in a function +// in order to statically know how much user +// stack space is needed for the trackedObj slice. +func freeTracked(trackedObjs *[]trackedObj) + +// trackedObj is used to track a heap-allocated object that +// has a proven scope and can potentially be freed by freeTracked. +// +// trackedObj does not keep the tracked object alive. +// +// trackedObj is opaque outside the allocation and free code. +// The common case is the sweepgen matches the current sweepgen +// but if there is a mismatch, the runtime does not attempt +// to reuse the tracked object. +// +// TODO: there is a plan to allow a trackedObj to have specials, +// including user-created via SetFinalizer or AddCleanup, but +// not yet implemented. (In short -- update runtime.addspecial +// to update a bitmap of objIndexes that have specials so +// that the fast path for allocation & free does not need to +// walk an mspan's specials linked list, and then a trackedObj with +// specials will not be reused. An alternative might be to +// sweep the span and process the specials after reaching +// the end of an mspan in the mcache if there are reusable +// objects with specials, but that might be harder or perhaps +// infeasible. More generally, sweeping sooner than the +// normal GC but not in the allocation/free fast path (perhaps +// in a goroutine managed by the runtime), might be worth +// exploring for other reasons than just specials). +type trackedObj struct { + ptr uintptr // tracked object + sweepgen uint32 // TODO: perhaps use more bits; mheap_.sweepgen is 32 bits. 
+ spc spanClass // 1 byte; potential optimization (not used yet) +} +``` + +## Other possible runtime APIs + +### runtime.free without size, noscan + +An earlier version of this work also implemented an API that just took an +`unsafe.Pointer`, without any size or noscan parameters, like: + +```go +func free(ptr unsafe.Pointer) +``` + +However, this was more expensive than the more explicit `freeSized` API, +primarily due to needing to call `spanOf`, `findObject`, or similar. +It might make sense to re-introduce this simpler API, +though currently the best guess is that it is not worthwhile. +(Early measurements suggested that passing the size and noscan parameters +like in `freeSized` allowed eliminating roughly 60% of the cost of a free operation, +though the implementation has changed and improved since then, and it +might be worth re-measuring.) + +If we do not use that signature, we likely should use the name `runtime.free` for +the API described as `runtime.freeSized` in this document. + +### runtime.freeSlice + +Separately, we will likely also use a `runtime.freeSlice`. + +In the current implementation, it is primarily used as a test function, +but we are currently leaning towards using it in some spots in the standard library +as a convenience for callers over `runtime.freeSized`.