design: add generics implementation design docs

Change-Id: I33c636fce2e96677a9c81b15fa574ac136c22402
Reviewed-on: https://go-review.googlesource.com/c/proposal/+/255797
Reviewed-by: Keith Randall <khr@golang.org>
diff --git a/design/generics-implementation-dictionaries.md b/design/generics-implementation-dictionaries.md
new file mode 100644
index 0000000..cd81102
--- /dev/null
+++ b/design/generics-implementation-dictionaries.md
@@ -0,0 +1,331 @@
+# Generics implementation - Dictionaries
+
+This document describes a method to implement the [Go generics proposal](https://go.googlesource.com/proposal/+/refs/heads/master/design/go2draft-type-parameters.md) using compile-time instantiated dictionaries. Dictionaries will be stenciled per instantiation of a generic function.
+
+When we generate code for a generic function, we will generate only a single chunk of assembly for that function. It will take as an argument a _dictionary_, which is a set of information describing the concrete types that the parameterized types take on. It includes the concrete types themselves, of course, but also derived information as we will see below.
+
+A counter-proposal would generate a different chunk of assembly for each instantiation and not require dictionaries - see the [Generics Implementation - Stenciling](https://go.googlesource.com/proposal/+/refs/heads/master/design/generics-implementation-stenciling.md) proposal.
+
+The most important feature of a dictionary is that it is compile-time computeable. All dictionaries will reside in the read-only data section, and will be passed around by reference. Anything they reference (types, other dictionaries, etc.) must also be in the read-only or code data sections.
+
+A running example of a generic function:
+
+
+```
+func f [T1, T2 any](x int, y T1) T2 {
+    ...
+}
+```
+
+
+With a callsite:
+
+
+```
+f[int, float64](7, 3.5)
+```
+
+
+The implementation of f will have an additional argument which is the pointer to the dictionary structure. We could put the additional argument first or last, or in its own register. Reserving a register for it seems overkill. Putting it first is similar to how receivers are passed (speaking of which, would the receiver or the dictionary come first?). Putting it last means less argument shuffling in the case where wrappers are required (not sure where those might be yet).
+
+The dictionary will contain lots of fields of various types, depending on what the implementation of f needs. Note that we must look inside the implementation of f to determine what is required. This means that the compiler will need to summarize what fields are necessary in the dictionary of the function, so that callers can compute that information and put it in the dictionary when it knows what the instantiated types are. (Note this implies that we can’t instantiate a generic function knowing only its signature - we need its implementation at compile time also. So implemented-in-assembly and cross-shared-object-boundary instantiations are not possible.)
+
+The dictionary will contain the following items:
+
+
+## Instantiated types
+
+The first thing that the dictionary will contain is a reference to the `runtime._type` for each parameterized type.
+
+
+```
+type dictionary struct {
+    T1 *runtime._type
+    T2 *runtime._type
+    ...
+}
+```
+
+
+We should probably include these values unconditionally, even if the implementation doesn’t need them (for printing in tracebacks, for example).
+
+
+## Derived types
+
+The code in f may declare new types which are derived from the generic parameter types. For instance:
+
+
+```
+    type X struct { x int; y T1 }
+    m := map[string]T1{}
+```
+
+
+The dictionary needs to contain a `*runtime._type` for each of the types mentioned in the body of f which are derived from the generic parameter types.
+
+
+```
+type dictionary struct {
+    ...
+    D1 *runtime._type // struct { x int; y T1 }
+    D2 *runtime._type // map[string]T1
+    ...
+}
+```
+
+
+How will the caller know what derived types the body of f needs? This is a very important question, and will be discussed at length later (see the proto-dictionary section). For now, just assume that there will be summary information for each function which lets the callsite know what derived types are needed.
+
+
+## Subdictionaries
+
+If f calls other functions, it needs a dictionary for those calls. For example,
+
+
+```
+func g[T](g T) { ... }
+```
+
+
+Then in f,
+
+
+```
+    g[T1](y)
+```
+
+
+The call to g needs a dictionary. At the callsite to g from f, f has no way to know what dictionary it should use, because the type parameterizing the instantiation of g is a generic type. So the caller of f must provide that dictionary.
+
+
+```
+type dictionary struct {
+    ...
+    S1 *dictionary // SubDictionary for call to g
+    S2 *dictionary // SubDictionary for some other call
+    ...
+}
+```
+
+
+
+## Helper methods
+
+The dictionary should contain methods that operate on the generic types. For instance, if f has the code:
+
+
+```
+    y2 := y + 1
+    if y2 > y { … }
+```
+
+
+(assuming here that `T1` has a type list that allows `+` and `>`), then the dictionary must contain methods that implement `+` and `>`.
+
+
+```
+type dictionary struct {
+    ...
+    plus func(z, x, y *T1)      // does *z = *x+*y
+    greater func(x, y *T1) bool // computes *x>*y
+    ...
+}
+```
+
+
+There’s some choice available here as to what methods to include. For `new(T1)` we could include in the dictionary a method that returns a `*T1`, or we could call `runtime.newobject` directly with the `T1` field of the dictionary. Similarly for many other tasks (`+`, `>`, ...), we could use runtime helpers instead of dictionary methods, passing the appropriate `*runtime._type` arguments so the runtime could switch on the type and do the appropriate computation.
+
+
+## Stack layout
+
+`f` needs to allocate stack space for any temporary variables it needs. Some of those variables would be of generic type, so `f` doesn’t know how big they are. It is up to the dictionary to tell it that.
+
+
+```
+type dictionary struct {
+    ...
+    frameSize uintptr
+    ...
+}
+```
+
+
+The caller knows the types of all the temporary variables, so it can compute the stack size required. (Note that this depends a lot on the implementation of `f`. I’ll get to that later.)
+
+Stack scanning and copying also need to know about the stack objects in `f`. The dictionary can provide that information
+
+
+```
+type dictionary struct {
+    ...
+    stackObjects []stackObject
+    ...
+}
+type stackObject struct {
+    offset uintptr
+    typ *runtime._type
+}
+```
+
+
+All values of generic type, as well as derived types that have bare generic types in them (e.g. `struct {x int; y T1}` or `[4]T1`, but not reference types with generic bases, like `[]T1` or `map[T1]T2`), must be stack objects. `f` will operate on such types using pointers to these stack objects. Preamble code in `f` will set up local variables as pointers to each of the stack objects (along with zeroing locals and return values?). All accesses to generic typed values will be by reference from these pointers.
+
+There might also be non-generic stack objects in `f`. Maybe we list them separately, or combine the lists in the dictionary (so we only have to look in one place for the list).
+
+The outargs section also needs some help. Marshaling arguments to a call, and copying back results from a call, are challenging because the offsets from SP are not known to `f` (if any argument has a type with a bare generic type in it). We could marshal/unmarshal arguments one at a time, keeping track of an argument pointer while doing so. If `f` calls `h`:
+
+
+```
+func f[T1, T2 any](x int, y T1, h func(x T1, y int, z T2) int) T2 {
+    var z T2
+    ....
+    r := h(y, x, z)
+}
+```
+
+
+then we would compile that to:
+
+
+```
+argPtr = SP
+memmove(argPtr, &y, dictionary.T1.size)
+argPtr += T1.size
+argPtr = roundUp(argPtr, alignof(int))
+*(*int)argPtr = x
+argPtr += sizeof(int)
+memmove(argPtr, &z, dictionary.T2.size)
+argPtr += T2.size
+call h
+argPtr = roundUp(argPtr, 8) // alignment of return value start
+r = *(*int)argPtr
+```
+
+
+Another option is to include in the dictionary the offset needed for every argument/return value of every function call in `f`. That would make life simpler, but it’s a lot of information. Something like:
+
+
+```
+memmove(SP + dictionary.callsite1.arg1offset, &y, dictionary.T1.size)
+*(*int)(SP + dictionary.callsite1.arg2offset) = x
+memmove(SP + dictionary.callsite1.arg3offset, &z, dictionary.T2.size)
+call h
+r = *(*int)(SP + dictionary.callsite1.ret1offset)
+```
+
+
+We could share information for identically-shaped callsites. And maybe keep the offset arrays as separate global symbols and keep references to them in the dictionary (one more indirection for each argument marshal, but may use less space).
+
+We need to reserve enough space in the outargs section for all the marshaled arguments, across all callsites. The `frameSize` field of the dictionary should include this space.
+
+Another option in this space is to change the calling convention to pass all values of generic type by pointer. This will simplify the layout problem for the arguments sections. But it requires implementing wrapper functions for calls from generic code to non-generic code or vice-versa. It is not entirely clear what the rules would be for where those wrappers get generated and instantiated.
+
+The outargs plan will get extra complicated when we move to a [register-based calling convention](https://github.com/golang/go/issues/40724). Possibly calls out from generic code will remain on ABI0.
+
+
+## Pointer maps
+
+Each stack frame needs to tell the runtime where all of its pointers are.
+
+Because all arguments and local variables of generic type will be stack objects, we don’t need special pointer maps for them. Each variable of generic type will be referenced by a local pointer variable, and those local pointer variables will have their liveness tracked as usual (similar to how moved-to-heap variables are handled today).
+
+Arguments of generic type will be stack objects, but that leaves the problem of how to scan those arguments before the function starts - we need a pointer map at function entry, before stack objects can be set up.
+
+For the function entry problem, we can add a pointer bitmap to the dictionary. This will be used when the function needs to call morestack, or when the function is used in a `go` or `defer` statement and hasn’t started running yet.
+
+
+```
+type dictionary struct {
+    ...
+    argPointerMap bitMap // arg size and ptr/nonptr bitmap
+    ...
+}
+```
+
+
+We may be able to derive the pointer bitmap from the list of stack objects, if that list made it easy to distinguish arguments (args+returns are distinguished because the former are positive offsets from FP and the latter are negative offsets from FP. Distinguishing args vs returns might also be doable using a retoffset field in funcdata).
+
+
+## End of Dictionary
+
+
+```
+type dictionary struct {
+    ...
+    // That's all?
+}
+```
+
+
+
+## The Proto-Dictionary
+
+There’s a lot of information we need to record about a function `f`, so that callers of `f` can assemble an appropriate dictionary. We’ll call this information a proto-dictionary. Each entry in the proto-dictionary is conceptually a function from the concrete types used to instantiate the generic function, to the contents of the dictionary entry. At each callsite at compile time, the proto-dictionary is evaluated with the concrete type parameters to produce a real dictionary. (Or, if the callsite uses some generic types as type arguments, partially evaluate the proto-dictionary to produce a new proto-dictionary that represents some sub-dictionary of a higher-level dictionary.) There are two main features of the proto-dictionary. The first is that the functions described above must be computable at compile time. The second is that the proto-dictionary must be serializable, as we need to write it to an object file and read it back from an object file (for cases where the call to the generic function is in a different package than the generic function being called).
+
+The proto-dictionary includes information for all the sections listed above:
+
+
+
+*   Derived types. Each derived type is a “skeleton” type with slots to put some of `f`’s type parameters.
+*   Any sub-proto-dictionaries for callsites in `f`. (Note: any callsites in `f` which use only concrete type parameters do not need to be in the dictionary of `f`, because they can be generated at that callsite. Only callsites in `f` which use one or more of `f`’s type parameters need to be a subdictionary of `f`’s dictionary.)
+*   Helper methods, if needed, for all types+operations that need them.
+*   Stack layout information. The proto-dictionary needs a list of all of the stack objects and their types (which could be one of the derived types, above), and all callsites and their types (maybe one representative for each arg/result shape). Converting from a proto-dictionary to a dictionary would involve listing all the stack objects and their types, computing all the outargs section offsets, and adding up all the pieces of the frame to come up with an overall frame size.
+*   Pointer maps. The proto-dictionary needs a list of argument/return values and their types, so that it can compute the argument layout and derive a pointer bitmap from that. We also need liveness bits for each argument/return value at each safepoint, so we can compute pointer maps once we know the argument/return value types.
+
+
+## Closures
+
+Suppose f creates a closure?
+
+
+```
+func f[T any](x T, y T) {
+    c := func() {
+        x = y
+    }
+    c()
+}
+```
+
+
+We need to pass a dictionary to the anonymous function as part of the closure, so it knows how to do things like copy values of generic type. When building the dictionary for `f`, one of the subdictionaries needs to be the dictionary required for the anonymous function, which `f` can then use as part of constructing the closure.
+
+
+## Generic Types
+
+This document has so far just considered generic functions. But we also need to handle generic types. These should be straightforward to stencil just like we do for derived types within functions.
+
+
+## Deduplication
+
+We should name the dictionaries appropriately, so deduplication happens automatically. For instance, two different packages instantiating `f` using the same concrete types should use the same dictionary in the final binary. Deduplication should work fine using just names as is done currently in the compiler for, e.g., `runtime._type` structures.
+
+Then the worst case space usage is one dictionary per instantiation. Note that some subdictionaries might be equivalent to a top-level dictionary for that same function.
+
+
+## Other Issues
+
+Recursion - can a dictionary ever reference itself? How do we build it, and the corresponding proto-dictionaries, then? I haven’t wrapped my head around the cases in which this could come up.
+
+Computing the proto-dictionary for a function probably requires compiling the function, so we know how many of each temporary of generic type is required. For other global passes like escape analysis, we don’t actually need the compiled code to compute the summary. An early pass over the source code is enough. It’s possible we could avoid the need to compile the function if we were ok with being conservative about the locals needed.
+
+Dictionary layout. Because the dictionary is completely compile time and read only, it does not need to adhere to any particular structure. It’s just an array of bytes. The compiler assigns meanings to the fields as needed, including any ordering or packing. We would, of course, keep some sort of ordering just for our sanity.
+
+The exception to the above is that the runtime needs access to a few of the dictionary fields. Those include the stackObjects slice and the pointer maps. So we should put these fields first in the dictionary. We also need a way for the runtime to get access to the dictionary itself, which could be done by always making it the first argument, and storing it in a known place (or reading out the stack object slice and pointer maps, and storing those in a known place in the stack frame).
+
+Register calling convention: argument stack objects? Use ABI0? If arguments come in registers, and those registers contain data for a generic type, it could be complicated to get that data into a memory location so we can take the address of it.
+
+Mentioned above, for the same reason we might want to use ABI0 for calling out (at least if any argument type is generic).
+
+TODO: calling methods on generic types
+
+
+## Risks
+
+
+
+*   Is it all worth it? Are we wasting so much space on these dictionaries that we might as well just stencil the code?
+
+    At the very least, the dictionaries won’t take up code space. We’re in effect trading data cache misses for instruction cache misses (and associated instruction cache things, like branch predictor entries).
+
+*   How much slower would this dictionary implementation be, than just stenciling everything? This design is pretty careful to produce no more allocations than a fully stenciled implementation, but there are a lot of other costs to operating in a generic fashion which are harder to measure. For example, if the concrete type is an `int`, a fully stenciled implementation can do `x = y` with a single load and store (or even just a register-register copy), whereas the dictionary implementation must call `memmove`.
diff --git a/design/generics-implementation-stenciling.md b/design/generics-implementation-stenciling.md
new file mode 100644
index 0000000..75a6d30
--- /dev/null
+++ b/design/generics-implementation-stenciling.md
@@ -0,0 +1,147 @@
+# Generics implementation - Stenciling
+
+This document describes a method to implement the [Go generics proposal](https://go.googlesource.com/proposal/+/refs/heads/master/design/go2draft-type-parameters.md). This method generates multiple implementations of a generic function by instantiating the body of that generic function for each set of types with which it is instantiated. By “implementation” here, I mean a blob of assembly code and associated descriptors (pcln tables, etc.). This proposal stands in opposition to the [Generics Implementation - Dictionaries](https://go.googlesource.com/proposal/+/refs/heads/master/design/generics-implementation-dictionaries.md) proposal, where we generate a single implementation of a generic function that handles all possible instantiated types.
+
+Suppose we have the following generic function
+
+
+```
+func f[T1, T2 any](x int, y T1) T2 {
+    ...
+}
+```
+
+
+And we have two call sites of `f`
+
+
+```
+var a float64 = f[int, float64](7, 8.0)
+var b struct{f int} = f[complex128, struct{f int}](3, 1+1i)
+```
+
+
+Then we generate two versions of `f`, compiling each one into its own implementation:
+
+
+```
+func f1(x int, y int) float64 {
+    ... identical bodies ...
+}
+func f2(x int, y complex128) struct{f int} {
+    ... identical bodies ...
+}
+```
+
+
+This design doc walks through the details of how this compilation strategy would work.
+
+
+## Naming
+
+The binary will now have potentially multiple instances of a function in it. How will those functions be named? In the example above, just calling them `f1` and `f2` won’t work.
+
+At least for the linker, we need names that will unambiguously indicate which implementation we’re talking about. This doc proposes to decorate the function name with each type parameter name, in brackets:
+
+
+```
+f[int, float64]
+f[complex, struct{f int}]
+```
+
+
+The exact syntax doesn’t really matter, but it must be unambiguous. The type names will be formatted using cmd/compile/internal/types/Type.ShortString (as is used elsewhere, e.g. for type descriptors).
+
+Should we show these names to users? For panic tracebacks, it is probably ok, since knowing the type parameters could be useful (as are regular parameters, which we also show). But what about CPU profiles? Should we unify profiles of differently-instantiated versions of the same function? Or keep them separate?
+
+
+## Instantiation
+
+Because we don’t know what types `f` will be instantiated with when we compile `f` itself (but see the section on type lists below), we can’t generate the implementations at the definition of `f`. We must generate the implementations at the callsites of `f`.
+
+At each callsite of `f` where type parameters are provided, we must generate a new implementation of `f`. To generate an implementation, the compiler must have access to the body of `f`. To facilitate that, the object file must contain the body of any generic functions, so that the compiler can compile them again, with possibly different type parameters, during compilation of the calling package (this mechanism already exists for inlining, so there is maybe not much work to do here).
+
+ \
+It isn’t obvious at some callsites what the concrete type parameters are. For instance, consider `g`:
+
+
+```
+func [T any] g(x T) float64 {
+    return f[T, float64](5, x)
+}
+```
+
+
+The callsite of `f` in `g` doesn’t know what all of its type parameters to `f` are. We won’t be able to generate an implementation of `f` until the point where `g` is instantiated. So implementing a function at a callsite might require recursively implementing callees at callsites in its body. (TODO: where would the caller of `g` get the body of `f` from? Is that also in the object file somewhere?)
+
+How do we generate implementations in cases of general recursion?
+
+
+```
+func r1[X, Y, Z any]() {
+    r2[X, Y, Z]()
+}
+func r2[X, Y, Z any]() {
+    r1[Y, Z, X]()
+}
+r1[int8, int16, int32]()
+```
+
+
+What implementations does this generate? I think the answer is clear, but we need to make sure our build system comes up with that answer without hanging the compiler in an infinite recursion. We’d need to record the existence of an instantiation (which we probably want to do anyway, to avoid generating `f[int, float64]` twice in one compilation unit) before generating code for that instantiation.
+
+
+## Type lists
+
+If a generic function has a type parameter that has a type constraint which contains a type list, then we could implement that function at its definition, with each element of that type list. Then we wouldn’t have to generate an implementation at each call site. This strategy is fragile, though. Type lists are understood as listing underlying types (under the generics proposal as of this writing), so the set of possible instantiating types is still infinite. But maybe we generate an instantiation for each unnamed type (and see the deduplication section for when it could be reused for a named type with the same underlying type).
+
+
+## Deduplication
+
+The compiler will be responsible for generating only one implementation for a given particular instantiation (function + concrete types used for instantiation). For instance, if you do:
+
+
+```
+f[int, float64](3, 5)
+f[int, float64](4, 6)
+```
+
+
+If both of these calls are in the same package, the compiler can detect the duplication and generate the implementation `f[int, float64]` only once.
+
+If the two calls to `f` are in different packages, though, then things aren’t so simple. The two compiler invocations will not know about each other. The linker will be responsible for deduplicating implementations resulting from instantiating the same function multiple times. In the example above, the linker will end up seeing two `f[int, float64]` symbols, each one generated by a different compiler invocation. The functions will be marked as DUPOK so the linker will be able to throw one of them away. (Note: due to the relaxed semantics of Go’s function equality, the deduplication is not required; it is just an optimization.)
+
+Note that the build system already deals with deduplicating code. For example, the generated equality and hash functions are deduplicated for similar reasons.
+
+
+## Risks
+
+There are two main risks with this implementation, which are related.
+
+
+
+1. This strategy requires more compile time, as we end up compiling the same instantiation multiple times in multiple packages.
+2. This strategy requires more code space, as there will be one copy of `f` for each distinct set of type parameters it is instantiated with. This can lead to large binaries and possibly poor performance (icache misses, branch mispredictions, etc.).
+
+For the first point, there are some possible mitigations. We could enlist the go tool to keep track of the implementations that a particular compiler run generated (recorded in the export data somewhere), and pass the name of those implementations along to subsequent compiler invocations. Those subsequent invocations could avoid generating that same implementation again. This mitigation wouldn’t work for compilations that were started in parallel from the go tool, however. Another option is to have the compiler report back to the go tool the implementations it needs. The go tool can then deduplicate that list and invoke the compiler again to actually generate code for that deduped list. This mitigation would add complexity, and possibly compile time, because we’d end up calling the compiler multiple times for each package. In any case, we can’t reduce the number of compilations beyond that needed for each unique instantiation, which still might be a lot. Which leads to point 2...
+
+For the second point, we could try to deduplicate implementations which have different instantiated types, but different in ways that don’t matter to the generated code. For instance, if we have
+
+
+```
+type myInt int
+f[int, bool](3, 4)
+f[myInt, bool](5, 6)
+```
+
+
+Do we really need multiple implementations of `f`? It might be required, if for example `f` assigns its second argument to an `interface{}`-typed variable. But maybe `f` only depends on the underlying type of its second argument (adds values of that type together and then compares them, say), in which case the implementations could share code.
+
+I suspect there will be lots of cases where sharing is possible, if the underlying types are indistinguishable w.r.t. the garbage collector (same size and ptr/nonptr layout). We’d need to detect the tricky cases somehow, maybe using summary information about what properties of each generic parameter a function uses (including things it calls with those same parameters, which makes it tricky when recursion is involved).
+
+If we deduplicate in this fashion, it complicates naming. How do we name the two implementations of `f` shown above? (They would be named `f[int, bool]` and `f[myInt, bool]` by default.) Which do we pick? Or do we name it `f[underlying[int], bool]`? Or can we give one implementation multiple names? Which name do we show in backtraces, debuggers, profilers?
+
+Another option here is to have the linker do content-based deduplication. Only if the assembly code of two functions is identical, will the implementations be merged. (In fact, this might be a desirable feature independent of generics.) This strategy nicely sidesteps the problem of how to decide whether two implementations can share the same code - we compile both and see. (Comparing assembly for identical behavior is nontrivial, as we would need to recursively compare any symbols referenced by relocations, but the linker already has some ability to do this.)
+
+Idea: can we generate a content hash for each implementation, so the linker can dedup implementations without even loading the implementation into memory?
+