design/74299-runtime-freegc.md: add runtime.freegc and use via compiler to reduce GC work

Design doc for golang/go#74299 and related CLs.

The implementation begins with the stack at CL 673695.

This document is currently largely in sync with the latest
implementation.

Updates golang/go#74299

Change-Id: I56738000cc473fa95d848f2d078c3d160c6ed8cd
Reviewed-on: https://go-review.googlesource.com/c/proposal/+/700255
Reviewed-by: Michael Knyszek <mknyszek@google.com>
diff --git a/design/74299-runtime-freegc.md b/design/74299-runtime-freegc.md
new file mode 100644
index 0000000..d3ca8f6
--- /dev/null
+++ b/design/74299-runtime-freegc.md
@@ -0,0 +1,726 @@
+# Directly freeing user memory to reduce GC work
+
+PJ Malloy ([@thepudds](https://github.com/thepudds)) \
+January 2025
+
+**Discussion** at https://go.dev/issue/74299. \
+**CL stack** at https://go.dev/cl/673695.
+
+## Introduction
+
+This document covers a design and initial implementation of a mechanism
+that eagerly returns user memory to the Go runtime for reuse, without
+needing to wait for the GC cycle to progress. The mechanism is implemented
+in a new runtime function, `runtime.freegc`, which is invoked
+by the compiler in cases where it can prove the memory is no longer used.
+
+A call to `runtime.freegc` records memory as dead and tracks it for
+later reuse. The allocator is updated to be able to safely
+reuse these heap objects, and care is taken so that `runtime.freegc` is able
+to coexist and cooperate with the concurrently running garbage collector.
+
+Updates to the compiler include automatically inserting calls
+to `runtime.freegc`, for example when the compiler proves that heap-allocated memory
+does not live past a function return and therefore may be safely freed
+on function exit.
+
+The compiler is also updated to do a form of alias analysis so
+that even escaping heap memory can be immediately freed once
+it is dead in some cases, such as the intermediate memory for appends
+of a non-aliased slice in a loop.
+
+When enabled, `runtime.freegc` allows more eager reuse of user memory,
+reducing the allocation rate from the GC's perspective, and thus
+reducing the frequency of GC cycles, total GC CPU usage, and how often
+write barriers are enabled. It can also produce a more cache-friendly
+allocation pattern for user code.
+
+This design composes well with the compiler's existing ability to
+stack allocate objects of statically known size by filling in gaps that
+stack allocation cannot currently handle. It also composes well with
+possible future improvements like:
+ * [memory regions](https://go.dev/issue/70257), which can
+   dynamically cast a wider net via a new user-facing API.
+ * [stack allocations of dynamically-sized small slices](https://go.dev/cl/664299), which
+   currently targets slices <= 32 bytes.
+ * [escape analysis improvements for interface arguments](https://go.dev/issue/72036),
+   which for example in `w.Write(b)` would allow `b` to be stack allocated, but which
+   would be limited to only helping statically-sized slices unless there is an
+   ability to hook into something like `runtime.freegc`. Other improvements also
+   become more broadly applicable and hence more worthwhile if `runtime.freegc` exists.
+
+There is also a reasonably rich group of direct follow-on targets for `runtime.freegc`
+that likely become available, and it becomes another tool in the toolbox for
+the compiler and runtime.
+
+### Hand-written free lists
+
+Today, users can implement their own free lists and achieve some of the memory reuse
+benefits described here. However, `runtime.freegc` is implemented at a low level of
+the runtime and has several advantages: it can reuse memory across types
+more safely and easily, hook more directly into what the GC is doing, place
+pointers inside the heap memory of types that do not have pointers
+(to more efficiently track reusable objects), and even change the GC shape of
+reusable objects, which is something even unsafe user code cannot do.
+
+Having the compiler insert the calls also makes the reuse both safer and more automated
+than hand-written code using a custom free list.
+
+And of course, the goal is to speed up safe, idiomatic Go code without requiring
+a variety of ad-hoc, user-written free lists.
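+
+For comparison, a typical hand-written approach today might look roughly like
+the following sketch (using `sync.Pool`; purely illustrative and not part of
+this proposal):
+
+```go
+// A common hand-rolled reuse pattern today: pool buffers so that the GC
+// sees fewer allocations. It only works for one type in one package, and
+// correctness (no use after Put) is entirely the user's responsibility.
+var bufPool = sync.Pool{
+	New: func() any { return new(bytes.Buffer) },
+}
+
+func process(data []byte) string {
+	buf := bufPool.Get().(*bytes.Buffer)
+	defer func() {
+		buf.Reset()
+		bufPool.Put(buf)
+	}()
+	buf.Write(data)
+	// ... more writes while processing ...
+	return buf.String()
+}
+```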
+
+## High-level approach
+
+In an attempt to help triangulate on what is possible for the runtime
+and compiler to do together to reduce how much work the GC must do,
+this document proposes updates to:
+
+1. **The runtime**: implement `runtime.freegc` and associated runtime entry-points.
+Freed objects are recorded in one of two types of new free lists, and
+are then immediately reusable for the next allocation in the same span class
+via updates to the allocator. Challenges here include:
+    * Safety, including in the face of the concurrent GC,
+      conservative scanning, and time delays between an object being observed by the GC
+      to when it is processed.
+    * Performance, specifically without materially slowing down
+      the normal allocation path when there is no reuse, making it fast on the reuse path,
+      and being careful to not increase memory usage too much due to tracking
+      additional information.
+
+2. **The compiler**: automatically insert runtime calls to track and free slices
+allocated via `make` when the compiler can prove it is safe to do so. Also implemented
+is automatically inserting calls to free the intermediate memory
+from repeated calls to `append` in a loop (without freeing the final slice) when it
+can prove the memory is unaliased. This is done regardless of whether
+the final slice escapes (for example, even if the final slice is returned from
+the function that built the slice using `append` in a loop).
+
+A sample benchmark included below shows `strings.Builder` operating
+roughly 2x faster.
+
+An experimental modification to `reflect` to use `runtime.freegc`,
+combined with using that modified `reflect` with `json/v2`, gave reported memory
+allocation reductions of -43.7%, -32.9%, -21.9%, -22.0%, -1.0%
+for the 5 official real-world unmarshalling benchmarks from
+`go-json-experiment/jsonbench` by the authors of `json/v2`, covering
+the CanadaGeometry through TwitterStatus datasets.
+
+A separate CL updates the builtin maps so that, during growth and split
+operations, they free logically dead table memory that is not held by
+map iteration.
+
+Note: there is no current intent to modify the standard library to have
+explicit calls to `runtime.freegc`, and of course such an ability
+would never be exposed to end-user code. The `reflect` and other standard
+library modifications were experiments that were helpful while designing
+and implementing the runtime side prior to some of the compiler work.
+
+Additional steps that are natural follow-ons (but not yet attempted) include:
+
+3. Recognizing `append` in loops that start with a slice of unknown provenance,
+such as a function parameter, while still automatically
+freeing intermediate memory starting with the second slice growth if the slice
+does not otherwise become aliased.
+
+4. Recognizing `append` in an iterator-based loop to be able to prove it is
+safe to free unaliased intermediate memory, such as in the patterns
+shown in `slices.Collect`. (**Update**: this is now implemented but not yet in review.)
+
+5. Possibly extending to freeing more than just slices. Slices are a natural
+first step, but the large majority of the work implemented is independent of whether
+the heap object represents a slice. For example, on function exit, we might be able to
+free the table memory for non-escaping maps that do not have memory
+held by iteration (see the sketch below).
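+
+A hypothetical target for that last item (illustrative only; `countUnique` is
+not taken from the CLs):
+
+```go
+func countUnique(words []string) int {
+	// seen does not escape countUnique, and no map iterator holds its
+	// table memory, so under this follow-on the tables could be freed
+	// automatically when the function returns.
+	seen := make(map[string]struct{})
+	for _, w := range words {
+		seen[w] = struct{}{}
+	}
+	return len(seen)
+}
+```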
+
+## Runtime API overview
+
+There are currently four primary runtime entry-points with working implementations,
+plus some additional variations on these not listed here. More details,
+including draft API doc, appear towards the end of this document.
+
+`freegc` is the primary entry-point for freeing user memory. It is emitted directly
+by the compiler and also underlies the implementation of the other `free*` runtime
+entry-points.
+
+```go
+func freegc(ptr unsafe.Pointer, size uintptr, noscan bool) bool
+```
+
+`growsliceNoAlias` is used during an `append`. It is like the existing `growslice`,
+but only for the case where we know there are no aliases for
+the backing memory of the slice, and hence we can free the old backing
+memory as the slice is updated with new, larger backing memory. Importantly, this
+works even if the slice escapes, such as if the resulting slice is returned
+from a function. `growsliceNoAlias` does both the grow and the free.
+
+```go
+func growsliceNoAlias(oldPtr *any, newLen, oldCap, num int, et *byte) (ary []any)
+```
+
+`makeslicetracked64` (and variations) allocates a heap object and records
+or "tracks" the allocated object's pointer. `makeslicetracked64` is like
+the existing `makeslice64`, but fills in a `trackedObj` slice entry for later
+use by `freegcTracked`. The compiler is responsible for reserving stack space
+for a properly sized backing array for the `trackedObj` slice based on how many objects
+will be tracked, which gives the runtime the space it needs to write down
+its tracking information.
+
+```go
+func makeslicetracked64(et *_type, len64 int64, cap64 int64, trackedObjs *[]trackedObj) unsafe.Pointer
+```
+
+`freegcTracked` is automatically inserted by the compiler in cases where we know the
+lifetime of a heap-allocated object is tied to a certain scope (usually a function),
+which allows for tracked heap objects to be automatically freed as a scope is exited.
+`freegcTracked` must be paired with functions like `makeslicetracked64`.
+
+```go
+func freegcTracked(trackedObjs *[]trackedObj)
+```
+
+An additional runtime API was temporarily used by some CLs as a
+stand-in for `makeslicetracked64` for convenience, but there was no
+plan to use it beyond helping to bootstrap some of the work:
+
+```go
+// TEMPORARY API.
+func trackSlice(sl unsafe.Pointer, trackedObjs *[]trackedObj)
+```
+
+### Pointer liveness and `freegc` vs. `freegcTracked`
+
+This section summarizes some differences between `freegc` and `freegcTracked`
+and attempts to explain why `freegcTracked` exists at all.
+
+First, note that `freegc` accepts an `unsafe.Pointer` and hence keeps the pointer
+alive. It therefore could be a pessimization in some cases
+(such as a long-lived function or function that straddles GC transitions)
+if the caller does not call `freegc` before or roughly when
+the liveness analysis of the compiler would otherwise have determined
+the heap object is reclaimable by the GC.
+
+For certain allocations or in certain loops, we directly insert `freegc`
+via the compiler almost exactly at the point where the memory goes dead (or
+similarly insert a layer on top of `freegc` like `growsliceNoAlias`).
+`freegc` says that the memory is immediately available for reuse,
+which is especially valuable in a loop or when reusing memory is otherwise
+useful without waiting to exit a function.
+
+However, we would be giving up on many valuable targets if we did not also
+take advantage of escape analysis results (based on its current implementation and
+based on some escape analysis modifications as part of this work).
+
+Escape analysis today provides primarily function-scoped lifetime results.
+If we merely added `freegc` to function exit points to free objects with function-scoped
+lifetimes, that would likely be a pessimization if the object otherwise would have
+been dead before the function exit.
+
+This is part of the reason `freegcTracked` exists and takes a different approach
+than `freegc`. Our tracking of an allocation does not keep the tracked memory alive,
+therefore avoiding the liveness pessimization described above when used at scope
+exit points. Another benefit of `freegcTracked` is that it frees multiple
+objects at scope exit, which means fewer calls into the runtime and less emitted
+code (including compared to if we were to instead have one `freegc`
+per object at scope exit).
+
+In addition, a given pointer today can go dead at multiple places in a function,
+which would also add complexity if we were to directly use `freegc` automatically
+wherever the pointer goes dead, in contrast to `freegcTracked` effectively being
+a single cleanup operation as the function exits.
+
+Another difference is `freegcTracked` is essentially a two-step process -- first
+there must be a tracked allocation, and later there is a `freegcTracked` call
+that operates on those tracked allocation(s). In contrast, `freegc` can
+operate on any heap allocation as long as we know it is safe to do so.
+
+Summarizing `freegc` vs. `freegcTracked`:
+
+```
+  freegc                                    freegcTracked
+  -------------------------------------     ---------------------------
+  frees single heap object                  can free multiple heap objects
+
+  lower-level primitive                     implemented on top of freegc
+
+  operates on any heap allocation           paired with tracked allocation function
+
+  call keeps pointer alive                  tracking does not keep pointer alive
+
+  primarily intended for use the moment     primarily intended for use at
+  a pointer is otherwise dead               scope exit
+```
+
+The initial implementation of tracking avoided keeping the pointer alive
+by storing a `uintptr` on the stack along with a sequence number (which is
+incremented as the GC cycle progresses), to later know whether it is
+still valid to use the stored `uintptr` to free its heap object as the scope is exited.
+
+While reviewing this work, Keith Randall recently suggested an
+improvement to the implementation of `freegcTracked` where the runtime zeros out
+the tracked pointers on the stack as the GC cycle progresses, which is somewhat
+similar in effect to a weak pointer. The GC-related portions of the runtime
+would take a more active role now in the tracking, rather than the `freegcTracked` code
+inferring what has happened based on the sequence number. This new approach might
+touch more pieces of the code base, but it is better overall, and the intent is to now
+implement this improvement. The general characteristics of `freegcTracked`
+described above still apply (except now there is no sequence number), and most
+of the associated implementation is unaffected.
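+
+To illustrate the zeroing idea, here is a conceptual sketch (not the runtime
+code; `freeTrackedObject` and the exact control flow are assumptions made for
+illustration):
+
+```go
+// Conceptual sketch of the zeroing approach. The compiler reserves
+// trackedObj slots on the caller's stack; as the GC cycle progresses past
+// the point where eager freeing would still be safe, the GC-related parts
+// of the runtime zero the stored pointers.
+type trackedObj struct {
+	ptr uintptr // tracked heap object; zeroed by the runtime when unsafe to free
+}
+
+// freegcTracked (sketch): free whatever is still safely freeable.
+func freegcTracked(trackedObjs *[]trackedObj) {
+	for i := range *trackedObjs {
+		t := &(*trackedObjs)[i]
+		if t.ptr == 0 {
+			// The GC may have already reclaimed (or be reclaiming) this
+			// object; let the normal GC handle it.
+			continue
+		}
+		// Hand the object back to the allocator for immediate reuse.
+		// (freeTrackedObject is a stand-in; size/noscan lookups omitted.)
+		freeTrackedObject(t.ptr)
+		t.ptr = 0
+	}
+}
+```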
+
+There are more details on these functions at the bottom of this document,
+including draft API doc.
+
+## Compiler changes overview
+
+### Analyzing aliases
+
+The changes described in this subsection do not rely on `runtime.freegcTracked`,
+but do rely on `runtime.freegc` and `runtime.growsliceNoAlias`.
+
+As part of this work, the compiler has been updated to prove whether certain variables
+are aliased or not, and to also examine `append` calls to see if it can prove
+that the slice being appended to is not aliased.
+
+If so, the compiler automatically inserts a call to the new
+runtime function `runtime.growsliceNoAlias` (briefly described above),
+which frees the old backing memory after slice growth is complete
+and the old storage is logically dead.
+
+This works with multiple `append` calls for the same slice,
+including inside loops, and the final slice memory can be
+escaping, such as returned from the function. (In that case,
+the final slice value is of course not freed.)
+
+An example target is:
+
+```go
+    func f1(input []int) []int {
+        var s []int
+        for _, x := range input {
+            s = append(s, g(x))
+        }
+        return s
+    }
+```
+
+After proving that the slice `s` cannot be aliased in that loop,
+the compiler converts that roughly to:
+
+```go
+    func f1(input []int) []int {
+        var s []int
+        for _, x := range input {
+            // s cannot be aliased here.
+            // The old backing array for s is freed inside growsliceNoAlias after growth.
+            // The final slice is not freed. It is returned to the caller of f1.
+            // Note: this is simplified from the actual signature for growsliceNoAlias.
+            s = growsliceNoAlias(s, g(x))
+        }
+        return s
+    }
+```
+
+Note that proving a slice is unaliased requires care in the presence
+of loops (or gotos that form unstructured loops). In this next example,
+we do not treat `s` as entirely unaliased because it is aliased on
+the second pass through the loop:
+
+```go
+var alias []int
+
+func f2() {
+        var s []int
+        for i := range 10 {
+                s = append(s, i)
+                alias = s
+        }
+}
+```
+
+For some additional details on the overall approach, see the new
+`src/cmd/compile/internal/escape/alias.go` file in
+[CL 710015](https://go.dev/cl/710015) that is part of this work (especially
+the `(*aliasAnalysis).analyze` method).
+
+By tracking aliasing and renamings across function inlinings (along with
+some support for recognizing whether a `goto` forms an unstructured loop),
+the compiler is also able to automatically free intermediate memory from appends
+used with iterators, such as in:
+
+```go
+    s := slices.Collect(seq)
+```
+
+or
+
+```go
+    var s2 []int
+    s2 = slices.AppendSeq(s2, seqA)
+    s2 = slices.AppendSeq(s2, seqB)
+    s2 = slices.AppendSeq(s2, seqC)
+```
+
+(An earlier version of this work was only able to handle these iterator cases
+via manual edits to the `slices` package; that is no longer needed).
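+
+For reference, `slices.Collect` reduces to an `append` loop over the sequence,
+roughly as follows (paraphrased; not the exact standard library source):
+
+```go
+func Collect[E any](seq iter.Seq[E]) []E {
+	var s []E
+	for e := range seq {
+		// Each growth leaves the previous backing array dead and unaliased,
+		// making it a target for growsliceNoAlias once inlining exposes
+		// this loop at the call site.
+		s = append(s, e)
+	}
+	return s
+}
+```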
+
+### Analyzing `make`
+
+The changes described in this subsection do rely on `runtime.freegcTracked`.
+
+Additional CLs update the compiler to automatically insert free calls for slices allocated via
+`make` when the compiler can prove it is safe to do so.
+
+Consider for example the following code:
+
+```go
+func f1(size int) {
+	s := make([]int64, size) // heap allocated if > 32 bytes
+	_ = s
+}
+```
+
+In today's compiler, the backing array for `s` is not marked as escaping,
+but will be heap allocated if larger than 32 bytes.
+
+As part of this work, the compiler has been updated so that the
+backing array for `s` is recognized as non-escaping but of unknown size,
+and therefore valid to free when `f1` returns.
+
+The compiler automatically transforms `f1` to roughly:
+
+```go
+func f1(size int) {
+	var freeablesArr [1]struct { // compiler counts how many are needed (one, in this case)
+		ptr      uintptr
+		// Possibly 1-2 additional fields in future.
+	}
+	freeables := freeablesArr[:0]
+	defer freegcTracked(&freeables)
+
+	s := make([]int64, size)
+	trackSlice(unsafe.Pointer(&s), &freeables) // NOTE: only temporarily two runtime calls
+
+	_ = s
+}
+```
+
+TODO: remove the `trackSlice` from this example, which was only a temporary API used to help
+bootstrap this work.
+
+TODO: mention `make`-like uses of `append`, such as `append([]byte(nil), make([]byte, n)...)`.
+
+The `freeablesArr` is stored on the stack and automatically sized by the compiler based
+on the number of target sites. The `defer` is currently inserted such that it becomes
+an open-coded `defer` and hence has low performance impact, though the plan is to switch
+to inserting the calls without relying on `defer`.
+
+For more details on `freegcTracked`, see the high-level description above or the API doc below.
+
+The transformed code just above shows what is currently implemented, but two items seen
+there might change as the implementation is polished:
+ * `freegcTracked` currently takes a slice for convenience, but we could make that tighter.
+ * the current intent is to use `makeslicetracked64` (for example, in `f1` above),
+ but the compiler initially inserted calls to a new `runtime.trackSlice` instead,
+ which seemed slightly easier to implement on the compiler side.
+ The plan is to have the compiler use `makeslicetracked64` instead, as sketched below.
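+
+Under that plan, `f1` would be transformed roughly as follows (a sketch only;
+the slice-header construction is written in pseudo-Go, with `int64Type`
+standing in for the element's `*_type`):
+
+```go
+func f1(size int) {
+	var freeablesArr [1]trackedObj // sized by the compiler (one tracked site)
+	freeables := freeablesArr[:0]
+	defer freegcTracked(&freeables)
+
+	// A single runtime call both allocates the backing array and records it
+	// in freeables; the compiler builds the slice header from the returned
+	// pointer.
+	p := makeslicetracked64(int64Type, int64(size), int64(size), &freeables)
+	s := unsafe.Slice((*int64)(p), size)
+
+	_ = s
+}
+```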
+
+If there are multiple target sites within a function, such as in `f2` here:
+
+```go
+func f2(size1 int, size2 int, x bool) {
+	s1 := make([]int64, size1) // heap allocated if > 32 bytes
+	if x {
+		s2 := make([]int64, size2) // heap allocated if > 32 bytes
+		_ = s2
+	}
+	_ = s1
+}
+```
+
+The compiler counts the max number of sites (not sensitive to conditionals),
+and sizes `freeablesArr` appropriately (a two-element array in this case),
+with a single call to `freegcTracked` per function:
+
+```go
+func f2(size1 int, size2 int, x bool) {
+	var freeablesArr [2]struct { // compiler counts how many are needed (two, in this case)
+		ptr      uintptr
+		// Possibly 1-2 additional fields in future.
+	}
+	freeables := freeablesArr[:0]
+	defer freegcTracked(&freeables) // only a single freegcTracked call, regardless of how many tracked sites
+
+	s1 := make([]int64, size1)
+	trackSlice(unsafe.Pointer(&s1), &freeables)
+	if x {
+		s2 := make([]int64, size2)
+		trackSlice(unsafe.Pointer(&s2), &freeables) // NOTE: only temporarily two runtime calls
+
+		_ = s2
+	}
+	_ = s1
+}
+```
+
+TODO: remove the `trackSlice` from this example, which was only a temporary API used to help
+bootstrap this work.
+
+The max number of automatically inserted call sites is bounded
+based on the count of `make` calls in a given function.
+
+Inside `for` loops or when a `goto` looks like it is
+being used in an unstructured loop, the compiler
+currently skips automatically inserting these calls.
+
+However, a modest extension (and the current plan) is to have the compiler
+instead use a different runtime entry point for freeing in loops. The idea is
+to allow heap-allocated slice memory to be reclaimed on each loop iteration
+when control returns to the same allocation site, and when it is safe to do so.
+Because it is the same allocation site, this does not require unbounded
+or dynamic space in the trackedObj slice. (The compiler can already prove
+it is safe and the runtime entry point is prototyped.)
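+
+A hypothetical target for that extension (illustrative only; `process` is a
+placeholder):
+
+```go
+func f3(inputs [][]byte) {
+	for _, in := range inputs {
+		// buf may be heap allocated when len(in) is large. It is dead by
+		// the end of each iteration, so when control returns to this
+		// allocation site, the previous iteration's buf could be freed
+		// for immediate reuse.
+		buf := make([]byte, len(in))
+		process(buf, in)
+	}
+}
+```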
+
+## `strings.Builder`
+
+As an early experiment, we updated `strings.Builder` to have manual calls
+to `runtime.freegc`, though there is no intent to actually manually
+modify `strings.Builder`.
+
+`strings.Builder` is a somewhat
+low-level part of the stdlib that already helps encapsulate some
+performance-related trickery, including using `unsafe` and
+`internal/bytealg.MakeNoZero`. A `strings.Builder` already owns its in-progress
+buffer as it is being built up, until it returns what is usually the final string.
+The existing internal comment says:
+
+```go
+    // External users should never get direct access to this buffer, since
+    // the slice at some point will be converted to a string using unsafe,
+    // also data between len(buf) and cap(buf) might be uninitialized.
+    buf []byte
+```
+
+If we update `strings.Builder` to explicitly call `runtime.freegc` on
+that `buf` after a resize operation (but without freeing the usually final
+incarnation of `buf` that will be returned to the user as a string via `Builder.String`), we
+can see some nice benchmark results on the existing strings benchmarks
+that call `strings.(*Builder).Write` N times and then call `String`. Here, the
+(uncommon) case of a single `Write` is not helped (it never
+resizes after the first allocation if there is only one `Write`), but the impact grows
+such that it is up to ~2x faster as there are more resize operations due
+to more `Write` calls:
+
+```
+                                               │     disabled.out │      new-free-20.txt                │
+                                               │      sec/op      │   sec/op     vs base                │
+BuildString_Builder/1Write_36Bytes_NoGrow-4           55.82n ± 2%   55.86n ± 2%        ~ (p=0.794 n=20)
+BuildString_Builder/2Write_36Bytes_NoGrow-4           125.2n ± 2%   115.4n ± 1%   -7.86% (p=0.000 n=20)
+BuildString_Builder/3Write_36Bytes_NoGrow-4           224.0n ± 1%   188.2n ± 2%  -16.00% (p=0.000 n=20)
+BuildString_Builder/5Write_36Bytes_NoGrow-4           239.1n ± 9%   205.1n ± 1%  -14.20% (p=0.000 n=20)
+BuildString_Builder/8Write_36Bytes_NoGrow-4           422.8n ± 3%   325.4n ± 1%  -23.04% (p=0.000 n=20)
+BuildString_Builder/10Write_36Bytes_NoGrow-4          436.9n ± 2%   342.3n ± 1%  -21.64% (p=0.000 n=20)
+BuildString_Builder/100Write_36Bytes_NoGrow-4         4.403µ ± 1%   2.381µ ± 2%  -45.91% (p=0.000 n=20)
+BuildString_Builder/1000Write_36Bytes_NoGrow-4        48.28µ ± 2%   21.38µ ± 2%  -55.71% (p=0.000 n=20)
+```
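+
+For concreteness, the hand-edited prototype amounts to roughly the following
+change to the internal `grow` method (a sketch only; `runtime_freegc` stands
+in for a linkname-style hook into `runtime.freegc` and is not a real exported
+API):
+
+```go
+// grow copies the buffer to a new, larger buffer (as it does today), and
+// then eagerly frees the old backing array, which is no longer referenced.
+func (b *Builder) grow(n int) {
+	old := b.buf
+	buf := bytealg.MakeNoZero(2*cap(b.buf) + n)[:len(b.buf)]
+	copy(buf, b.buf)
+	b.buf = buf
+	if cap(old) != 0 {
+		// old's backing array is dead and unaliased here. Hand it back to
+		// the allocator for immediate reuse; noscan is true because the
+		// bytes contain no pointers.
+		runtime_freegc(unsafe.Pointer(unsafe.SliceData(old)), uintptr(cap(old)), true)
+	}
+}
+```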
+
+There is no plan to hand insert calls to `runtime.freegc` within `strings.Builder`
+(in part because there are alternatives that could be used within
+`strings.Builder`, such as a segmented buffer and/or a `sync.Pool`
+as used within `cmd/compile`). We are mainly using it here as a
+small-ish concrete example.
+
+In addition, since the hand-edited `strings.Builder` prototype was done,
+the compiler work has progressed further, and I have some hope for
+the compiler to be able to automatically free intermediate memory
+for many common uses of `strings.Builder` via `growsliceNoAlias`
+(with roughly 2 of 3 steps towards that goal currently implemented;
+`strings.Builder` is more complex than an ordinary use of `append`
+for a few reasons).
+
+There are other potential optimizations for `strings.Builder`,
+like allocating an initial buf (e.g., 64 bytes)
+on the first `Write`, faster growth than the current 1.25x growth for larger
+slices, and "right-sizing" the final string; an earlier cut did
+implement those optimizations.
+
+## Normal allocation performance
+
+For the cases where a normal allocation is happening without any reuse,
+some initial micro-benchmarks suggest the impact of these changes could
+be small to negligible (at least with GOAMD64=v3):
+
+```
+goos: linux
+goarch: amd64
+pkg: runtime
+cpu: AMD EPYC 7B13
+                        │ base-512M-v3.bench │   ps16-512M-goamd64-v3.bench     │
+                        │        sec/op      │ sec/op     vs base               │
+Malloc8-16                      11.01n ± 1%   10.94n ± 1%  -0.68% (p=0.038 n=20)
+Malloc16-16                     17.15n ± 1%   17.05n ± 0%  -0.55% (p=0.007 n=20)
+Malloc32-16                     18.65n ± 1%   18.42n ± 0%  -1.26% (p=0.000 n=20)
+MallocTypeInfo8-16              18.63n ± 0%   18.36n ± 0%  -1.45% (p=0.000 n=20)
+MallocTypeInfo16-16             22.32n ± 0%   22.65n ± 0%  +1.50% (p=0.000 n=20)
+MallocTypeInfo32-16             23.37n ± 0%   23.89n ± 0%  +2.23% (p=0.000 n=20)
+geomean                         18.02n        18.01n       -0.05%
+```
+
+## runtime API
+
+This section lists the runtime APIs that are either automatically
+inserted by the compiler or can be called from a targeted set of
+locations in the standard library.
+
+Note: the term "logically dead" here is distinct from the compiler's liveness
+analysis, and the runtime free and reuse implementations must be (and
+attempt to be) robust against sequences such as: a first allocation of a given
+pointer; the GC observes and queues the pointer; `runtime.freegc` is called;
+the GC dequeues the pointer for scanning; the runtime reuses the same pointer,
+all within a single GC phase. (TODO: more precise term than "logically dead"?)
+
+```go
+// freegc records that a heap object is reusable and available for
+// immediate reuse in a subsequent mallocgc allocation, without
+// needing to wait for the GC cycle to progress.
+//
+// The information is recorded in a free list stored in the
+// current P's mcache. The caller must pass in the user size
+// and whether the object has pointers, which allows a faster free
+// operation.
+//
+// freegc must be called by the effective owner of ptr who knows
+// the pointer is logically dead, with no possible aliases that might
+// be used past that moment. In other words, ptr must be the
+// last and only pointer to its referent.
+//
+// The intended caller is the compiler.
+//
+// ptr must point to a heap object or into the current g's stack,
+// in which case freegc is a no-op. In particular, ptr must not point
+// to memory in the data or bss sections, which is partially enforced.
+// For objects with a malloc header, ptr should point mallocHeaderSize bytes
+// past the base; otherwise, ptr should point to the base of the heap object.
+// In other words, ptr should be the same pointer that was returned by mallocgc.
+//
+// In addition, the caller must know that ptr's object has no specials, such
+// as might have been created by a call to SetFinalizer or AddCleanup.
+// (Internally, the runtime deals appropriately with internally-created
+// specials, such as specials for memory profiling).
+//
+// If the size of ptr's object is less than 16 bytes or greater than
+// 32KiB - gc.MallocHeaderSize bytes, freegc is currently a no-op.
+//
+// freegc must only be called in alloc-safe places.
+//
+// Note that freegc accepts an unsafe.Pointer and hence keeps the pointer
+// alive. It therefore could be a pessimization in some cases (such
+// as a long-lived function) if the caller does not call freegc before
+// or roughly when the liveness analysis of the compiler
+// would otherwise have determined ptr's object is reclaimable by the GC.
+func freegc(ptr unsafe.Pointer, size uintptr, noscan bool) bool
+
+// growsliceNoAlias is like growslice but only for the case where
+// we know that oldPtr is not aliased.
+//
+// In other words, the caller must know that there are no other references
+// to the backing memory of the slice being grown aside from the slice header
+// that will be updated with new backing memory when growsliceNoAlias
+// returns, and therefore oldPtr must be the only pointer to its referent
+// aside from the slice header updated by the returned slice.
+func growsliceNoAlias(oldPtr unsafe.Pointer, newLen, oldCap, num int, et *_type) slice
+
+// makeslicetracked64 is like the existing makeslice64, but the caller
+// also passes in a pointer to a trackedObj slice that allows the runtime
+// to track the heap-allocated backing array and possibly later
+// free the tracked objects via freegcTracked. The contents of
+// the trackedObj slice are opaque to the caller, but the caller
+// provides the space for the trackedObj slice.
+//
+// Currently, the only intended caller is code inserted by
+// the compiler, which only does so when escape analysis can
+// prove the associated pointer has a known scope -- currently,
+// when it can prove the pointer does not outlive the current
+// function. This effectively creates a scope for the
+// associated heap pointers. The compiler also arranges for
+// freegcTracked to be called on function exit, so that
+// the heap objects are automatically freed when leaving the
+// scope. The compiler also arranges for the trackedObj slice's
+// backing array to be placed on the user's normal goroutine
+// stack to provide storage for the tracking information,
+// with that storage matching the lifetime of the tracked
+// heap objects (that is, both match the lifetime of
+// the function call).
+func makeslicetracked64(et *_type, len64 int64, cap64 int64, trackedObjs *[]trackedObj) unsafe.Pointer
+
+// freegcTracked marks as reusable the objects allocated by
+// calls to makeslicetracked64 and tracked in the trackedObjs
+// slice, unless the GC has already possibly reclaimed
+// those objects.
+//
+// The trackedObj slice does not keep the associated pointers
+// alive past the start of a mark cycle, and
+// freegcTracked becomes a no-op if the GC cycle
+// has progressed such that the GC might have already reclaimed
+// the tracked pointers before freegcTracked was called.
+// This might happen for example if a function call lasts
+// multiple GC cycles.
+//
+// The compiler statically counts the maximum number of
+// possible makeslicetracked64 calls in a function
+// in order to statically know how much user
+// stack space is needed for the trackedObj slice.
+func freegcTracked(trackedObjs *[]trackedObj)
+
+// trackedObj is used to track a heap-allocated object that
+// has a proven scope and can potentially be freed by freegcTracked.
+//
+// trackedObj does not keep the tracked object alive.
+//
+// trackedObj is opaque outside the allocation and free code.
+//
+// TODO: update description to describe runtime zeroing.
+//
+// TODO: there is a plan to allow a trackedObj to have specials,
+// including user-created via SetFinalizer or AddCleanup, but
+// not yet implemented. (In short -- update runtime.addspecial
+// to update a bitmap of objIndexes that have specials so
+// that the fast path for allocation & free does not need to
+// walk an mspan's specials linked list, and then a trackedObj with
+// specials will not be reused. An alternative might be to
+// sweep the span and process the specials after reaching
+// the end of an mspan in the mcache if there are reusable
+// objects with specials, but that might be harder or perhaps
+// infeasible. More generally, sweeping sooner than the
+// normal GC but not in the allocation/free fast path (perhaps
+// in a goroutine managed by the runtime), might be worth
+// exploring for other reasons than just specials).
+type trackedObj struct {
+	ptr      uintptr   // tracked object
+	// Future: possibly 1-2 additional metadata fields to speed up free.
+}
+```
+
+## Other possible runtime APIs
+
+### runtime.free without size, noscan
+
+An earlier version of this work also implemented an API that just took an
+`unsafe.Pointer`, without any size or noscan parameters, like:
+
+```go
+func free(ptr unsafe.Pointer)
+```
+
+However, this was more expensive, primarily due to needing to call
+`spanOf`, `findObject` or similar. Early measurements suggested that
+passing the size and noscan parameters like in `freegc` allowed
+eliminating roughly 60% of the cost of a free operation.
+
+(A small historical note: the more explicit API taking size and noscan parameters
+was originally named `freeSized` but was later renamed to `freegc`; the older `freeSized` name
+might still be visible in some earlier prototype CLs.)
+
+### runtime.freegcSlice
+
+Separately, we will likely also use a `runtime.freegcSlice`. It is a
+small wrapper over `runtime.freegc` that accepts a slice. (Initially it was
+just a test function, but it ended up being more convenient in some cases,
+including in some of the early prototype compiler work.)
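+
+One plausible shape for it, purely as an illustration (the actual signature
+may differ, for example to support arbitrary element types):
+
+```go
+// freegcSlice frees the backing array of a byte slice via freegc.
+// Sketch only: the real helper may take an element *_type rather than
+// being limited to noscan byte slices.
+func freegcSlice(s []byte) {
+	if cap(s) == 0 {
+		return
+	}
+	// A []byte backing array contains no pointers, so noscan is true.
+	freegc(unsafe.Pointer(unsafe.SliceData(s)), uintptr(cap(s)), true)
+}
+```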