| # Proposal: Scaling the Go page allocator |
| |
| Author(s): Michael Knyszek, Austin Clements |
| |
| Last updated: 2019-10-18 |
| |
| ## Abstract |
| |
| The Go runtime's page allocator (i.e. `(*mheap).alloc`) has scalability |
| problems. |
| In applications with a high rate of heap allocation and a high GOMAXPROCS, small |
| regressions in the allocator can quickly become big problems. |
| |
| Based on ideas from Austin about making P-specific land-grabs to reduce lock |
| contention, and with evidence that most span allocations are one page in size |
| and are for small objects (<=32 KiB in size), I propose we: |
| 1. Remove the concept of spans for free memory and track free memory with a |
| bitmap. |
| 1. Allow a P to cache free pages for uncontended allocation. |
| Point (1) simplifies the allocator, reduces some constant overheads, and more |
| importantly enables (2), which tackles lock contention directly. |
| |
| ## Background |
| |
| The Go runtime's page allocator (i.e. `(*mheap).alloc`) has serious scalability |
| issues. |
| These were discovered when working through |
| [golang/go#28479](https://github.com/golang/go/issues/28479) and |
| [kubernetes/kubernetes#75833](https://github.com/kubernetes/kubernetes/issues/75833#issuecomment-477758829) |
| which were both filed during or after the Go 1.12 release. |
The common thread between these scenarios is a high rate of allocation and a
high level of parallelism (in the Go world, a relatively high GOMAXPROCS
value, such as 32).
| |
| As it turned out, adding some extra work for a small subset of allocations in Go |
| 1.12 and removing a fast-path data structure in the page heap caused significant |
| regressions in both throughput and tail latency. |
| The fundamental issue is the heap lock: all operations in the page heap |
| (`mheap`) are protected by the heap lock (`mheap.lock`). |
| A high allocation rate combined with a high degree of parallelism leads to |
| significant contention on this lock, even though page heap allocations are |
| relatively cheap and infrequent. |
| For instance, if the most popular allocation size is ~1 KiB, as seen with the |
| Kubernetes scalability test, then the runtime accesses the page heap every 10th |
| allocation or so. |
| |
| The proof that this is really a scalability issue in the design and not an |
| implementation bug in Go 1.12 is that we were seeing barging behavior on this |
| lock in Go 1.11, which indicates that the heap lock was already in a collapsed |
| state before the regressions in Go 1.12 were even introduced. |
| |
| ## Proposal |
| |
| I believe we can significantly improve the scalability of the page allocator if |
| we eliminate as much lock contention in the heap as possible. |
| We can achieve this in two ways: |
1. Make the allocator faster.
The less time spent with the lock held, the better the assumptions underlying
e.g. a futex hold up.
| 1. Come up with a design that avoids grabbing the lock at all in the common |
| case. |
| |
So, what is this common case? We currently have span allocation data for a
couple of large Go applications which reveals that an incredibly high fraction
of allocations are for small object spans.
| First, we have data from Kubernetes' 12-hour load test, which indicates that |
| 99.89% of all span allocations are for small object spans, with 93% being from |
| the first 50 size classes, inclusive. |
| Next, data from a large Google internal service shows that 95% of its span |
| allocations are for small object spans, even though this application is known to |
| make very large allocations relatively frequently. |
| 94% of all of this application's span allocations are from the first 50 size |
| classes, inclusive. |
| |
| Thus, I propose we: |
| * Track free pages in a bitmap that spans the heap's address space. |
| * Allow a P to cache a set of bits from the bitmap. |
| |
The goal is for most (80%+) small object span allocations to complete quickly
and without a lock.
(1) makes the allocator significantly more cache-friendly and predictable, and
enables (2), which helps us avoid grabbing the lock in the common case and
allows us to allocate a small number of pages very quickly.
| |
| Note that this proposal maintains the current first-fit allocation policy and |
| highest-address-first scavenging policy. |
| |
| ### Tracking free memory with bitmaps |
| |
| With a first-fit policy, allocation of one page (the common case) amounts to |
| finding the first free page in the heap. |
| One promising idea here is to use a bitmap because modern microarchitectures are |
| really good at iterating over bits. |
| Each bit in the bitmap represents a single runtime page (8 KiB as of this |
| writing), where 1 means in-use and 0 means free. |
| "In-use" in the context of the new page allocator is now synonymous with "owned |
| by a span". |
| The concept of a free span isn't useful here. |
| |
| I propose that the bitmap be divided up into shards (called chunks) which are |
| small enough to be quick to iterate over. |
512-bit shards would each represent 4 MiB (with 8 KiB pages) and fit in a
single 64-byte cache line.
| These shards could live in one global bitmap which is mapped in as needed, or to |
| reduce the virtual memory footprint, we could use a sparse-array style structure |
| like `(*mheap).arenas`. |
| Picking a chunk size which is independent of arena size simplifies the |
| implementation because arena sizes are platform-dependent. |
| |
| Simply iterating over the whole bitmap to find a free page is still fairly |
| inefficient, especially for dense heaps. |
| We want to be able to quickly skip over completely in-use sections of the heap. |
| Thus, I propose we attach summary information to each chunk such that it's much |
| faster to filter out chunks which couldn't possibly satisfy the allocation. |
| |
| What should this summary information contain? I propose we augment each chunk |
| with three fields: `start, max, end uintptr`. |
| `start` represents the number of contiguous 0 bits at the start of this bitmap |
| shard. |
| Similarly, `end` represents the number of contiguous 0 bits at the end of the |
| bitmap shard. |
| Finally, `max` represents the largest contiguous section of 0 bits in the bitmap |
| shard. |
| |
The diagram below illustrates an example summary for a bitmap chunk.
The arrow indicates which direction addresses go (lower to higher).
The bitmap contains 3 zero bits at its lowest edge and 7 zero bits at its
highest edge, so `start` is 3 and `end` is 7.
Within the bitmap, the longest run of zero bits is 10, which `max` reflects.
| |
| ![Diagram of a summary.](35112/summary-diagram.png) |
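
As an illustration, below is a minimal Go sketch of computing such a summary
with `math/bits`, using an unpacked `summary` type and a single 64-bit word
rather than a full 512-bit chunk for brevity.
The names and the bit order (low-order bits at lower addresses) are assumptions
of this sketch, not the final implementation.

```go
package pagealloc

import "math/bits"

// summary is an unpacked representation of the proposed per-chunk
// summary; the real proposal packs these fields into one uint64.
type summary struct {
	start, max, end uint
}

// summarize computes the summary of a 64-bit slice of the bitmap,
// where a 1 bit means in-use and a 0 bit means free, and where the
// low-order bit corresponds to the lowest address.
func summarize(bitmap uint64) summary {
	if bitmap == 0 {
		return summary{64, 64, 64} // completely free
	}
	s := summary{
		start: uint(bits.TrailingZeros64(bitmap)), // free pages at the low end
		end:   uint(bits.LeadingZeros64(bitmap)),  // free pages at the high end
	}
	// Longest run of 0 bits: repeatedly AND the free mask with itself
	// shifted right; each pass shortens every run of 1s (free pages)
	// by one, so the number of passes until empty is the longest run.
	for free := ^bitmap; free != 0; free = free & (free >> 1) {
		s.max++
	}
	return s
}
```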
| |
With these three fields, we can determine whether we'll be able to find a
sufficiently large contiguous free section of memory in a given chunk or
contiguous set of chunks with a simple state machine.
Computing this summary information for a chunk is trickier to make fast, and
effectively amounts to a combination of a table to get per-byte summaries and a
state machine to merge them until we have a summary which represents the whole
chunk.
The state machine for `start` and `end` is mostly trivial.
`max` is only a little more complex: knowing `start`, `max`, and `end` for two
adjacent summaries, we can merge them by taking the maximum of each summary's
`max` value and the sum of the first summary's `end` value and the second's
`start` value, since a free run may straddle the boundary between them.
I propose we update these summary values eagerly as spans are allocated and
freed.
For large allocations that span multiple chunks, we can zero out summary
information very quickly, and we really only need to do the full computation of
summary information for the ends of the allocation.
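
A sketch of the merge step under the same assumptions as the snippet above
(unpacked fields, `n` pages per region), again illustrative rather than the
final implementation:

```go
// merge combines the summaries of two adjacent regions of n pages
// each, with lo at lower addresses and hi at higher addresses.
func merge(lo, hi summary, n uint) summary {
	// The largest run is either within one region or straddles the
	// boundary: lo's trailing free pages plus hi's leading free pages.
	max := lo.max
	if hi.max > max {
		max = hi.max
	}
	if straddle := lo.end + hi.start; straddle > max {
		max = straddle
	}
	out := summary{start: lo.start, max: max, end: hi.end}
	// If an entire region is free, the start and end runs extend into
	// the neighboring region.
	if lo.start == n {
		out.start = n + hi.start
	}
	if hi.end == n {
		out.end = lo.end + n
	}
	return out
}
```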
| |
| There's a problem in this design so far wherein subsequent allocations may end |
| up treading the same path over and over. |
| Unfortunately, this retreading behavior's time complexity is `O(heap * allocs)`. |
| We propose a simple solution to this problem: maintain a hint address. |
| A hint address represents an address before which there are definitely no free |
| pages in the heap. |
There may not be free pages for some distance after it, which is why it is just
a hint, but we know for a fact that we can prune from the search everything
before that address.
| In the steady-state, as we allocate from the lower addresses in the heap, we can |
| bump the hint forward with every search, effectively eliminating the search |
| space until new memory is freed. |
| Most allocations are expected to allocate not far from the hint. |
| |
| There's still an inherent performance problem with this design: larger |
| allocations may require iterating over the whole heap, even with the hint |
| address. |
| This issue arises from the fact that we now have an allocation algorithm with a |
| time complexity linear in the size of the heap. |
| Modern microarchitectures are good, but not quite good enough to just go with |
| this. |
| |
| Therefore, I propose we take this notion of a summary-per-chunk and extend it: |
| we can build a tree around this, wherein a given entry at some level of the |
| radix tree represents the merge of some number of summaries in the next level. |
| The leaf level in this case contains the per-chunk summaries, while each entry |
| in the previous levels may reflect 8 chunks, and so on. |
| |
This tree would be constructed from a finite number of arrays of summaries,
with levels closer to the root being smaller than later levels, since each of
their entries reflects a larger portion of the address space.
More specifically, we avoid having an "explicit" pointer-based structure (think
"implicit" vs. "explicit" when it comes to min-heap structures: the former tends
to be an array, while the latter tends to be pointer-based).
| |
| Below is a diagram of the complete proposed structure. |
| |
| ![Diagram of the proposed radix tree.](35112/radix-tree-diagram.png) |
| |
| The bottom two boxes are the arenas and summaries representing the full address |
| space. |
| Each red line represents a summary, and each set of dotted lines from a summary |
| into the next layer reflects which part of that next layer that summary refers |
| to. |
| |
In essence, because this tree reflects our address space, it is in fact a radix
tree over addresses.
By right shifting a memory address by different amounts, we can find the exact
summary which contains that address at each level.
| |
| On allocation, this tree may be searched by looking at `start`, `max`, and `end` |
| at each level: if we see that `max` is large enough, we continue searching in |
| the next, more granular, level. |
| If `max` is too small, then we look to see if there's free space spanning two |
| adjacent summaries' memory regions by looking at the first's `end` value and the |
| second's `start` value. |
Larger allocations are more likely to cross larger boundaries of the address
space, and are therefore more likely to be satisfied by levels in the tree
which are closer to the root.
| Note that if the heap has been exhausted, then we will simply iterate over the |
| root level, find all zeros, and return. |
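
To make the search concrete, below is a simplified sketch of scanning a single
level, reusing the unpacked `summary` type from the earlier sketch.
`entryPages` is the number of pages each entry at this level represents, and
the descent into the next, more granular level is omitted.

```go
// findAtLevel scans one level of the tree for npages contiguous free
// pages, returning the index of the entry where a fitting run begins.
// Simplified sketch: the real search would descend into the next
// level at the returned index rather than stop.
func findAtLevel(level []summary, entryPages, npages uint) (int, bool) {
	begin, run := 0, uint(0) // candidate free run crossing entry boundaries
	for i, s := range level {
		// Does the run carried in from previous entries, extended by
		// the free pages at the start of this entry, now fit?
		if run > 0 && run+s.start >= npages {
			return begin, true
		}
		// Otherwise, a fitting run may be contained entirely within
		// this entry's region.
		if s.max >= npages {
			return i, true
		}
		switch {
		case s.end == entryPages: // region is completely free
			if run == 0 {
				begin = i
			}
			run += entryPages
		case s.end > 0: // a new candidate run begins at the high end
			begin, run = i, s.end
		default:
			run = 0
		}
	}
	return 0, false // exhausted; the caller falls back to heap growth
}
```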
| |
| #### Implementation details |
| |
| A number of details were omitted from the previous section for brevity, but |
| these details are key for an efficient implementation. |
| |
| Firstly, note that `start, max, end uintptr` is an awkward structure in size, |
| requiring either 12 bytes or 24 bytes to store naively, neither of which fits a |
| small multiple of a cache line comfortably. |
| To make this structure more cache-friendly, we can pack them tightly into |
| 64-bits if we constrain the height of the radix tree. |
The packing is straightforward: we may dedicate 21 bits to each of these three
numbers and pack them into 63 bits.
A small quirk with this scheme is that each of `start`, `max`, and `end` is a
count, so we need to represent zero as well as the maximum value (`2^21`),
which at first glance requires an extra bit per field.
Luckily, when `max == 2^21`, it follows that `start == max && max == end`.
We may use the last remaining bit to represent this case.
| A summary representing a completely full region is also conveniently `uint64(0)` |
| in this representation, which enables us to very quickly skip over parts of the |
| address space we don't care about with just one load and branch. |
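
A sketch of one possible encoding follows; the constant and function names are
hypothetical, and the real encoding may differ.

```go
const (
	logMaxPackedValue = 21
	maxPackedValue    = 1 << logMaxPackedValue // 2^21 pages = 16 GiB of 8 KiB pages
	fieldMask         = maxPackedValue - 1
)

// pack packs start, max, and end into 63 bits, reserving the top bit
// for the completely-free case, in which all three counts are 2^21
// and would not otherwise fit in 21 bits each.
func pack(start, max, end uint64) uint64 {
	if max == maxPackedValue {
		return uint64(1) << 63
	}
	return start | max<<logMaxPackedValue | end<<(2*logMaxPackedValue)
}

// unpack reverses pack. Note that a completely full region packs to
// uint64(0), so a single load-and-compare suffices to skip it.
func unpack(s uint64) (start, max, end uint64) {
	if s>>63 != 0 {
		return maxPackedValue, maxPackedValue, maxPackedValue
	}
	return s & fieldMask,
		(s >> logMaxPackedValue) & fieldMask,
		(s >> (2 * logMaxPackedValue)) & fieldMask
}
```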
| |
| As mentioned before, a consequence of this packing is that we need to place a |
| restriction on our structure: each entry in the root level of the radix tree may |
| only represent at most `2^21` 8 KiB pages, or 16 GiB, because we cannot |
| represent any more in a single summary. |
| From this constraint, it follows that the root level will always be |
| `2^(heapAddrBits - 21 - log2(pageSize))` in size in entries. |
| Should we need to support much larger heaps, we may easily remedy this by |
| representing a summary as `start, max, end uint32`, though at the cost of cache |
| line alignment and 1.5x metadata overhead. |
| We may also consider packing the three values into two `uint64` values, though |
| at the cost of twice as much metadata overhead. |
| Note that this concern is irrelevant on 32-bit architectures: we can easily |
| represent the whole address space with a tree and 21 bits per summary field. |
| Unfortunately, we cannot pack it more tightly on 32-bit architectures since at |
| least 14 bits are required per summary field. |
| |
| Now that we've limited the size of the root level, we need to pick the sizes of |
| the subsequent levels. |
| Each entry in the root level must reflect some number of entries in the |
| following level, which gives us our fanout. |
| In order to stay cache-friendly, I propose trying to keep the fanout close to |
| the size of an L1 cache line or some multiple thereof. |
| 64 bytes per line is generally a safe bet, and our summaries are 8 bytes wide, |
| so that gives us a fanout of 8. |
| |
| Taking all this into account, for a 48-bit address space (such as how we treat |
| `linux/amd64` in the runtime), I propose the following 5-level array structure: |
| * Level 0: `16384` entries (fanout = 1, root) |
| * Level 1: `16384*8` entries (fanout = 8) |
| * Level 2: `16384*8*8` entries (fanout = 8) |
| * Level 3: `16384*8*8*8` entries (fanout = 8) |
| * Level 4: `16384*8*8*8*8` entries (fanout = 8, leaves) |
| |
| Note that level 4 has `2^48 bytes / (512 * 8 KiB)` entries, which is exactly the |
| number of chunks in a 48-bit address space. |
| Each entry at this level represents a single chunk. |
Similarly, since a chunk represents 512 (2^9) pages, each entry in the root
level summarizes a region of `2^21` contiguous pages, as intended.
| This scheme can be trivially applied to any system with a larger address space, |
| since we just increase the size of the root level. |
| For a 64-bit address space, the root level can get up to 8 GiB in size, but |
| that's mostly virtual address space which is fairly cheap since we'll only |
| commit to what we use (see below). |
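
Expressed as hypothetical Go constants for the 48-bit case above (only the
arithmetic is meaningful; the real names and layout may differ):

```go
const (
	pageShift     = 13 // 8 KiB pages
	summaryLevels = 5
	// Each root entry covers 2^21 pages (16 GiB), so a 48-bit address
	// space needs 2^(48-21-13) = 16384 root entries.
	rootEntries = 1 << (48 - 21 - pageShift)
)

// levelShift[i] is how far an address is shifted right to obtain its
// summary index at level i (level 0 is the root). Each successive
// level covers 8x less address space per entry.
var levelShift = [summaryLevels]uint{
	pageShift + 21, // 16 GiB per entry
	pageShift + 18, // 2 GiB per entry
	pageShift + 15, // 256 MiB per entry
	pageShift + 12, // 32 MiB per entry
	pageShift + 9,  // 4 MiB per entry, i.e. one 512-page chunk
}

// summaryIndex returns the index of the summary that covers addr at
// the given level.
func summaryIndex(addr uintptr, level int) int {
	return int(addr >> levelShift[level])
}
```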
| |
| For most heaps, `2^21` contiguous pages or 16 GiB per entry in the root level is |
| good enough. |
| If we limited ourselves to 8 entries in the root, we would still be able to |
| gracefully support up to 128 GiB (and likely double that, thanks to |
| prefetchers). |
Some Go applications may have larger heaps, but as mentioned before we can
always change the structure of a summary away from packing into 64 bits and
then add an additional level to the tree, at the expense of some additional
metadata overhead.
| |
Overall this uses anywhere from ~128 KiB to hundreds of MiB of address space on
systems with smaller address spaces (~600 MiB for a 48-bit address space,
~128 KiB for a 32-bit address space).
| For a full 64-bit address space, this layout requires ~37 TiB of reserved |
| memory. |
| |
| At first glance, this seems like an enormous amount, but in reality that's an |
| extremely small fraction (~0.00022%) of the full address space. |
| Furthermore, this address space is very cheap since we'll only commit what we |
| use, and to reduce the size of core dumps and eliminate issues with overcommit |
| we will map the space as `PROT_NONE` (only `MEM_RESERVE` on Windows) and map it |
| as read/write explicitly when we grow the heap (an infrequent operation). |
| |
| There are only two known adverse effects of this large mapping on Linux: |
| 1. `ulimit -v`, which restricts even `PROT_NONE` mappings. |
| 1. Programs like top, when they report virtual memory footprint, include |
| `PROT_NONE` mappings. |
| In the grand scheme of things, these are relatively minor consequences. |
| The former is not used often, and in cases where it is, it's used as an |
| inaccurate proxy for limiting a process's physical memory use. |
| The latter is mostly cosmetic, though perhaps some monitoring system uses it as |
| a proxy for memory use, and will likely result in some harmless questions. |
| |
| ### Allow a P to cache bits from the bitmap |
| |
| I propose adding a free page cache to each P. |
| The page cache, in essence, is a base address marking the beginning of a 64-page |
| aligned chunk, and a 64-bit bitmap representing free pages in that chunk. |
With 8 KiB pages, this means each P can hold onto at most 512 KiB of memory.
| |
The allocation algorithm would thus consist of a P first checking its own cache.
If the cache is empty, the P would then go into the bitmap, find the first
64-bit aligned chunk of bits containing a free page, and cache it, noting the
base address of those 64 bits.
It then allocates out of its own cache if able.
If it's unable to satisfy the allocation from these bits, then it goes back and
starts searching for contiguous bits, falling back on heap growth if it fails.
If the allocation request is more than 16 pages in size, then we don't even
bother checking the cache.
The probability that `N` consecutive free pages will be available in the page
cache decreases exponentially as `N` approaches 64, and 16 strikes a good
balance between being opportunistic and being wasteful.
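
Below is a minimal sketch of such a cache.
The names are hypothetical, and here a 1 bit in the cached bitmap means "free",
i.e. the inverse of the heap bitmap's convention.

```go
const (
	pageSize   = 8192
	cachePages = 64 // pages per cache: one uint64 of bitmap
)

// pageCache is a per-P cache of free pages: the base address of a
// 64-page aligned region plus a bitmap of which of its pages are free.
type pageCache struct {
	base  uintptr // base address of the 64-page aligned region
	cache uint64  // 1 bits are free pages; bit i is base + i*pageSize
}

// alloc allocates npages contiguous pages from the cache, returning
// 0 on failure. Requests above the proposal's 16-page threshold
// should bypass the cache entirely.
func (c *pageCache) alloc(npages uint) uintptr {
	run := uint64(1)<<npages - 1
	for i := uint(0); i+npages <= cachePages; i++ {
		if c.cache&(run<<i) == run<<i {
			c.cache &^= run << i // mark the pages as allocated
			return c.base + uintptr(i)*pageSize
		}
	}
	return 0
}
```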
| |
Note that finding and caching the first 64-bit chunk containing a free page is
an equivalent operation to allocating one page out of the heap: fundamentally
we're looking for the first free page we can find in both cases.
| This means that we can and should optimize for this case, since we expect that |
| it will be extremely common. |
| Note also that we can always update the hint address in this case, making all |
| subsequent allocations (large and small) faster. |
| |
Finally, there's a little hiccup in doing this: acquiring an `mspan` object
currently requires acquiring the heap lock, since these objects are just taken
out of a locked SLAB allocator.
| This means that even if we can perform the allocation uncontended we still need |
| the heap lock to get one of these objects. |
| We can solve this problem by adding a pool of `mspan` objects to each P, similar |
| to the `sudog` cache. |
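
A sketch of what such a pool could look like; the `mspan` placeholder, the pool
size, and the names are assumptions, and refilling or draining against the
locked central allocator is left to the slow path.

```go
type mspan struct{ /* fields elided */ }

// spanCache is a small per-P stack of spare mspan objects, in the
// spirit of the existing sudog cache.
type spanCache struct {
	buf [16]*mspan
	n   int
}

// get returns a cached mspan, or nil if the cache is empty, in which
// case the caller falls back to the locked allocator to refill.
func (c *spanCache) get() *mspan {
	if c.n == 0 {
		return nil
	}
	c.n--
	s := c.buf[c.n]
	c.buf[c.n] = nil // don't retain the pointer
	return s
}

// put returns an mspan to the cache, reporting whether it fit.
func (c *spanCache) put(s *mspan) bool {
	if c.n == len(c.buf) {
		return false // caller returns it to the central pool
	}
	c.buf[c.n] = s
	c.n++
	return true
}
```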
| |
| ### Scavenging |
| |
| With the elimination of free spans, scavenging must work a little differently as |
| well. |
| The primary bit of information we're concerned with here is the `scavenged` |
| field currently on each span. |
| I propose we add a `scavenged` bitmap to each `heapArena` which mirrors the |
| allocation bitmap, and represents whether that page has been scavenged or not. |
| Allocating any pages would unconditionally clear these bits to avoid adding |
| extra work to the allocation path. |
| |
| The scavenger's job is now conceptually much simpler. |
It takes the bitwise-OR of the allocation bitmap and the `scavenged` bitmap; a
clear bit in the result marks a page which is both free and unscavenged, i.e.
"scavengable".
| It then scavenges any contiguous free pages it finds in a single syscall, |
| marking the appropriate bits in the `scavenged` bitmap. |
| Like the allocator, it would have a hint address to avoid walking over the same |
| parts of the heap repeatedly. |
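
For one 64-bit word of each bitmap, this amounts to the following sketch (with
1 meaning in-use in the allocation bitmap and 1 meaning already-scavenged in
the `scavenged` bitmap):

```go
// scavengable returns the pages that are both free and unscavenged:
// a set bit in the result marks a page that may be returned to the OS.
func scavengable(alloc, scavenged uint64) uint64 {
	return ^(alloc | scavenged)
}
```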
| |
| Because this new algorithm effectively requires iterating over the heap |
| backwards, there's a slight concern with how much time it could take, |
| specifically if it does the scavenge operation with the heap lock held like |
| today. |
| Instead, I propose that the scavenger iterate over the heap without the lock, |
| checking the free and scavenged bitmaps optimistically. |
If it finds what appears to be a valid set of contiguous scavengable bits,
it'll acquire the heap lock, verify their validity, and scavenge.
| |
| We're still scavenging with the heap lock held as before, but scaling the |
| scavenger is outside the scope of this document (though we certainly have ideas |
| there). |
| |
| #### Huge-page Awareness |
| |
| Another piece of the scavenging puzzle is how to deal with the fact that the |
| current scavenging policy is huge-page aware. |
| There are two dimensions to this huge-page awareness: the runtime counts the |
| number of free and unscavenged huge pages for pacing purposes, and the runtime |
| scavenges those huge pages first. |
| |
| For the first part, the scavenger currently uses an explicit ratio calculated |
| whenever the GC trigger is updated to determine the rate at which it should |
| scavenge, and it uses the number of free and unscavenged huge pages to determine |
| this ratio. |
| |
Instead, I propose that the scavenger release memory one page at a time while
avoiding breaking huge pages, and that it time how long releasing each page
takes.
| Given a 1% maximum time spent scavenging for the background scavenger, we may |
| then determine the amount of time to sleep, thus effectively letting the |
| scavenger set its own rate. |
| In some ways this self-pacing is more accurate because we no longer have to make |
| order-of-magnitude assumptions about how long it takes to scavenge. |
| Also, it represents a significant simplification of the scavenger from an |
| engineering perspective; there's much less state we need to keep around in |
| general. |
| |
| The downside to this self-pacing idea is that we must measure time spent |
| sleeping and time spent scavenging, which may be funky in the face of OS-related |
| context switches and other external anomalies (e.g. someone puts their laptop in |
| sleep mode). |
| We can deal with such anomalies by setting bounds on how high or low our |
| measurements are allowed to go. |
| Furthermore, I propose we manage an EWMA which we feed into the time spent |
| sleeping to account for scheduling overheads and try to drive the actual time |
| spent scavenging to 1% of the time the goroutine is awake (the same pace as |
| before). |
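
The pacing arithmetic might look something like the sketch below.
The 1% target comes from this proposal, while the names, the EWMA weight, the
clamping bound, and the exact feedback mechanism are assumptions of the sketch.

```go
package pagealloc

import "time"

const (
	scavengeGoalFraction = 0.01 // scavenge ~1% of the time we're awake
	ewmaWeight           = 0.5  // hypothetical smoothing factor
)

// overheadEWMA tracks the ratio of time actually slept to time we
// asked to sleep, smoothing out scheduling overheads.
var overheadEWMA = 1.0

// nextSleep returns how long to ask to sleep after spending critTime
// scavenging, so that critTime/(critTime+sleep) ≈ 1% of awake time.
// slept and requested describe the previous cycle's sleep.
func nextSleep(critTime, slept, requested time.Duration) time.Duration {
	if requested > 0 {
		ratio := float64(slept) / float64(requested)
		// Clamp the measurement to defend against anomalies like a
		// suspended machine inflating the apparent sleep time.
		if ratio > 10 {
			ratio = 10
		}
		overheadEWMA = ewmaWeight*ratio + (1-ewmaWeight)*overheadEWMA
	}
	// Solve critTime/(critTime+sleep) = scavengeGoalFraction for
	// sleep, then compensate for the measured oversleep ratio.
	sleep := float64(critTime) * (1/scavengeGoalFraction - 1)
	return time.Duration(sleep / overheadEWMA)
}
```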
| |
As far as scavenging huge pages first goes, I propose we just ignore this aspect
of the current scavenger for simplicity's sake.
| In the original scavenging proposal, the purpose of scavenging huge pages first |
| was for throughput: we would get the biggest bang for our buck as soon as |
| possible, so huge pages don't get "stuck" behind small pages. |
| There's a question as to whether this actually matters in practice: conventional |
| wisdom suggests a first-fit policy tends to cause large free fragments to |
| congregate at higher addresses. |
| By analyzing and simulating scavenging over samples of real Go heaps, I think |
| this wisdom mostly holds true. |
| |
| The graphs below show a simulation of scavenging these heaps using both |
| policies, counting how much of the free heap is scavenged at each moment in |
| time. |
| Ignore the simulated time; the trend is more important. |
| |
| ![Graph of scavenge throughput.](35112/scavenge-throughput-graph.png) |
| |
| With the exception of two applications, the rest all seem to have their free and |
| unscavenged huge pages at higher addresses, so the simpler policy leads to a |
| similar rate of releasing memory. |
| The simulation is based on heap snapshots at the end of program execution, so |
| it's a little non-representative since large, long-lived objects, or clusters of |
| objects, could have gotten freed just before measurement. |
| This misrepresentation actually acts in our favor, however, since it suggests an |
| even smaller frequency of huge pages appearing in the middle of the heap. |
| |
| ## Rationale |
| |
| The purpose of this proposal is to help the memory allocator scale. |
To reiterate: it's currently very easy to put the heap lock into a collapsed
state.
Every page-level allocation must acquire the heap lock, and with 1 KiB objects
we're already hitting the page heap on every 10th allocation.
| |
| To give you an idea of what kinds of timings are involved with page-level |
| allocations, I took a trace from a 12-hour load test from Kubernetes when I was |
| diagnosing |
| [kubernetes/kubernetes#75833](https://github.com/kubernetes/kubernetes/issues/75833#issuecomment-477758829). |
92% of all span allocations were for the first 50 size classes (i.e. up to and
including 8 KiB objects).
| Each of those, on average, spent 4.0µs in the critical section with the heap |
| locked, minus any time spent scavenging. |
| The mode of this latency was between 3 and 4µs, with the runner-up being between |
| 2 and 3µs. |
| These numbers were taken with the load test built using Go 1.12.4 and from a |
| `linux/amd64` GCE instance. |
| Note that these numbers do not include the time it takes to acquire or release |
| the heap lock; it is only the time in the critical section. |
| |
| I implemented a prototype of this proposal which lives outside of the Go |
| runtime, and optimized it over the course of a few days. |
| I then took heap samples from large, end-to-end benchmarks to get realistic heap |
| layouts for benchmarking the prototype. |
| |
| The prototype benchmark then started with these heap samples and allocated out |
| of them until the heap was exhausted. |
| Without the P cache, allocations took only about 680 ns on average on a similar |
| GCE instance to the Kubernetes case, pretty much regardless of heap size. |
| This number scaled gracefully relative to allocation size as well. |
| To be totally clear, this time includes finding space, marking the space and |
| updating summaries. |
| It does not include clearing scavenge bits. |
| |
| With the P cache included, that number dropped to 20 ns on average. |
| The comparison with the P cache isn't an apples-to-apples comparison since it |
| should include heap lock/unlock time on the slow path (and the k8s numbers |
| should too). |
However, I believe this only strengthens our case: with the P cache, in theory,
the lock will be acquired less frequently, so an apples-to-apples comparison
would be even more favorable to the P cache.
| |
| All of this doesn't even include the cost savings when freeing memory. |
| While I do not have numbers regarding the cost of freeing, I do know that the |
| free case in the current implementation is a significant source of lock |
| contention ([golang/go#28479](https://github.com/golang/go/issues/28479)). |
| Each free currently requires a treap insertion and maybe one or two removals for |
| coalescing. |
| |
| In comparison, freeing memory in this new allocator is faster than allocation |
| (without the cache): we know exactly which bits in the bitmap to clear from the |
| address, and can quickly index into the arenas array to update them as well as |
| their summaries. |
| While updating the summaries still takes time, we can do even better by freeing |
| many pages within the same arena at once, amortizing the cost of this update. |
| In fact, the fast page sweeper that Austin added in Go 1.12 already iterates |
| over the heap from lowest to highest address, freeing completely free spans. |
It would be straightforward to batch free operations within the same heap arena
to achieve this cost amortization.
| |
In sum, this new page allocator design has the potential not only to solve our
immediate scalability problem, but also to give us more headroom for future
optimizations compared to the current treap-based allocator, for which a number
of caching strategies have been designed and/or attempted.
| |
| ### Fragmentation Concerns |
| |
| The biggest way fragmentation could worsen with this design is as a result of |
| the P cache. |
The P cache makes it so that allocation isn't quite exactly a serialized
single-threaded first-fit, and a P may hold onto pages which another P may need
more.
| |
| In practice, given an in-tree prototype, we've seen that this fragmentation |
| scales with the number of P's, and we believe this to be a reasonable trade-off: |
| more processors generally require more memory to take advantage of parallelism. |
| |
| ### Prior Art |
| |
| As far as prior art is concerned, there hasn't been much work with bitmap |
| allocators for languages which have a GC. |
Consider Go against other managed languages with a GC: Go's GC sits at a fairly
unique point in the design space for GCs because it is a non-moving collector.
| Most allocators in other managed languages (e.g. Java) tend to be bump |
| allocators, since they tend to have moving and compacting GCs. |
| Other managed languages also tend to have many more allocations of smaller size |
| compared to Go, so the "slow path" of allocating pages is usually just grabbing |
| a fixed-size block out of a shared pool, which can be made to be quite fast. |
| Go relies on being able to allocate blocks of different sizes to reduce |
| fragmentation in its current allocator design. |
| |
However, when considering non-GC'd languages, e.g. C/C++, there has been some
notable work using bitmaps in memory allocators.
| In particular, |
| [DieHard](https://github.com/emeryberger/DieHard/blob/master/docs/pldi2006-diehard.pdf), |
| [DieHarder](https://www.usenix.org/legacy/event/woot/tech/final_files/Novark.pdf), |
| and [Mesh](https://arxiv.org/pdf/1902.04738.pdf). |
DieHard and DieHarder in particular implement an efficient amortized `O(1)`
bitmap-based allocator, though that was not their primary contribution.
| Mesh uses bitmaps for managing which slots are free in a span, like the Go |
| allocator. |
We were not aware of DieHard(er) at the time of writing this proposal; note
that they use bitmaps to track individual objects rather than pages.
There are also a few niche cases where bitmap allocators are used, such as
[GPU-accelerated
| allocation](https://arxiv.org/abs/1810.11765) and [real-time |
| applications](http://www.gii.upv.es/tlsf/). |
| |
| A good point of comparison for Go's current page allocator is TCMalloc, and in |
| many ways Go's memory allocator is based on TCMalloc. |
| However, there are some key differences that arise as a result of Go's GC. |
| Notably, TCMalloc manages its per-CPU caches as arrays of pointers, rather than |
| through spans directly like Go does. |
| The reason for this, as far as I can tell, is because when a free occurs in |
| TCMalloc, that object is immediately available for re-use, whereas with Go, |
| object lifetimes are effectively rounded up to a GC cycle. |
| As a result of this global, bulk (de)allocation behavior resulting in the lack |
| of short-term re-use, I suspect Go tends to ask the page allocator for memory |
more often than TCMalloc does.
| This bulk (de)allocation behavior would thus help explain why page allocator |
| scalability hasn't been such a big issue for TCMalloc (again, as far as I'm |
| aware). |
| |
| In sum, Go sits in a unique point in the memory management design space. |
| The bitmap allocator fits this point in the design space well: bulk allocations |
| and frees can be grouped together to amortize the cost of updating the summaries |
| thanks to the GC. |
| Furthermore, since we don't move objects in the heap, we retain the flexibility |
| of dealing with fragments efficiently through the radix tree. |
| |
| ### Considered Alternatives |
| |
| #### Cache spans in a P |
| |
| One considered alternative is to keep the current span structure, and instead |
| try to cache the spans themselves on a P, splitting them on each allocation |
| without acquiring the heap lock. |
| |
| While this seems like a good idea in principle, one big limitation is that you |
| can only cache contiguous regions of free memory. |
| Suppose many heap fragments tend to just be one page in size: one ends up having |
| to go back to the page allocator every single time anyway. |
| While it is true that one might only cache one page from the heap in the |
| proposed design, this case is fairly rare in practice, since it picks up any |
| available memory it can find in a given 64-page aligned region. |
| |
| The proposed design also tends to have nicer properties: the treap structure |
| scales logarithmically (probabilistically) with respect to the number of free |
| heap fragments, but even this property doesn't scale too well to very large |
| heaps; one might have to chase down 20 pointers in a 20 GiB heap just for an |
| allocation, not to mention the additional removals required. |
| Small heaps may see allocations as fast as 100 ns, whereas large heaps may see |
| page allocation latencies of 4 µs or more on the same hardware. |
| On the other hand, the proposed design has a very consistent performance profile |
| since the radix tree is effectively perfectly balanced. |
| |
| Furthermore, this idea of caching spans only helps the allocation case. |
| In most cases the source of contention is not only allocation but also freeing, |
| since we always have to do a treap insertion (and maybe one or two removals) on |
| the free path. |
| In this proposal, the free path is much more efficient in general (no complex |
| operations required, just clearing memory), even though it still requires |
| acquiring the heap lock. |
| |
| Finally, caching spans doesn't really offer much headroom in terms of future |
| optimization, whereas switching to a bitmap allocator allows us to make a |
| variety of additional optimizations because the design space is mostly |
| unexplored. |
| |
| ## Compatibility |
| |
| This proposal changes no public APIs in either syntax or semantics, and is |
| therefore Go 1 backwards compatible. |
| |
| ## Implementation |
| |
| Michael Knyszek will implement this proposal. |
| |
| The implementation will proceed as follows: |
| 1. Change the scavenger to be self-paced to facilitate an easier transition. |
| 1. Graft the prototype (without the P cache) into the runtime. |
| * The plan is to do this as a few large changes which are purely additive |
| and with tests. |
| * The two allocators will live side-by-side, and we'll flip between the two |
| in a single small change. |
| 1. Delete unnecessary code from the old allocator. |
| 1. Create a pool of `mspan` objects for each P. |
| 1. Add a page cache to each P. |
| |