# Proposal: Scaling the Go page allocator
Author(s): Michael Knyszek, Austin Clements
Last updated: 2019-10-18
## Abstract
The Go runtime's page allocator (i.e. `(*mheap).alloc`) has scalability
problems.
In applications with a high rate of heap allocation and a high GOMAXPROCS, small
regressions in the allocator can quickly become big problems.
Based on ideas from Austin about making P-specific land-grabs to reduce lock
contention, and with evidence that most span allocations are one page in size
and are for small objects (<=32 KiB in size), I propose we:
1. Remove the concept of spans for free memory and track free memory with a
bitmap.
1. Allow a P to cache free pages for uncontended allocation.
Point (1) simplifies the allocator, reduces some constant overheads, and more
importantly enables (2), which tackles lock contention directly.
## Background
The Go runtime's page allocator (i.e. `(*mheap).alloc`) has serious scalability
issues.
These were discovered when working through
[golang/go#28479](https://github.com/golang/go/issues/28479) and
[kubernetes/kubernetes#75833](https://github.com/kubernetes/kubernetes/issues/75833#issuecomment-477758829)
which were both filed during or after the Go 1.12 release.
The common thread between each of these scenarios is a high rate of allocation
and a high level of parallelism (in the Go world, a relatively high GOMAXPROCS
value, such as 32).
As it turned out, adding some extra work for a small subset of allocations in Go
1.12 and removing a fast-path data structure in the page heap caused significant
regressions in both throughput and tail latency.
The fundamental issue is the heap lock: all operations in the page heap
(`mheap`) are protected by the heap lock (`mheap.lock`).
A high allocation rate combined with a high degree of parallelism leads to
significant contention on this lock, even though page heap allocations are
relatively cheap and infrequent.
For instance, if the most popular allocation size is ~1 KiB, as seen with the
Kubernetes scalability test, then the runtime accesses the page heap every 10th
allocation or so.
The proof that this is really a scalability issue in the design and not an
implementation bug in Go 1.12 is that we were seeing barging behavior on this
lock in Go 1.11, which indicates that the heap lock was already in a collapsed
state before the regressions in Go 1.12 were even introduced.
## Proposal
I believe we can significantly improve the scalability of the page allocator if
we eliminate as much lock contention in the heap as possible.
We can achieve this in two ways:
1. Make the allocator faster.
The less time spent with the lock held, the better the assumptions underlying
e.g. a futex hold up.
1. Come up with a design that avoids grabbing the lock at all in the common
case.
So, what is this common case? We currently have span allocation data for a
couple large Go applications which reveal that an incredibly high fraction of
allocations are for small object spans.
First, we have data from Kubernetes' 12-hour load test, which indicates that
99.89% of all span allocations are for small object spans, with 93% being from
the first 50 size classes, inclusive.
Next, data from a large Google internal service shows that 95% of its span
allocations are for small object spans, even though this application is known to
make very large allocations relatively frequently.
94% of all of this application's span allocations are from the first 50 size
classes, inclusive.
Thus, I propose we:
* Track free pages in a bitmap that spans the heap's address space.
* Allow a P to cache a set of bits from the bitmap.
The goal is to have most (80%+) small object span allocations allocate quickly,
and without a lock.
(1) makes the allocator significantly more cache-friendly and predictable, and
it enables (2), which helps us avoid grabbing the lock in the common case and
allows us to allocate a small number of pages very quickly.
Note that this proposal maintains the current first-fit allocation policy and
highest-address-first scavenging policy.
### Tracking free memory with bitmaps
With a first-fit policy, allocation of one page (the common case) amounts to
finding the first free page in the heap.
One promising idea here is to use a bitmap because modern microarchitectures are
really good at iterating over bits.
Each bit in the bitmap represents a single runtime page (8 KiB as of this
writing), where 1 means in-use and 0 means free.
"In-use" in the context of the new page allocator is now synonymous with "owned
by a span".
The concept of a free span isn't useful here.
I propose that the bitmap be divided up into shards (called chunks) which are
small enough to be quick to iterate over.
512-bit shards would each represent 4 MiB (with 8 KiB pages) and fit in roughly
2 cache lines.
These shards could live in one global bitmap which is mapped in as needed, or to
reduce the virtual memory footprint, we could use a sparse-array style structure
like `(*mheap).arenas`.
Picking a chunk size which is independent of arena size simplifies the
implementation because arena sizes are platform-dependent.
Simply iterating over the whole bitmap to find a free page is still fairly
inefficient, especially for dense heaps.
We want to be able to quickly skip over completely in-use sections of the heap.
Thus, I propose we attach summary information to each chunk such that it's much
faster to filter out chunks which couldn't possibly satisfy the allocation.
What should this summary information contain? I propose we augment each chunk
with three fields: `start, max, end uintptr`.
`start` represents the number of contiguous 0 bits at the start of this bitmap
shard.
Similarly, `end` represents the number of contiguous 0 bits at the end of the
bitmap shard.
Finally, `max` represents the largest contiguous section of 0 bits in the bitmap
shard.
The diagram below illustrates an example summary for a bitmap chunk.
The arrow indicates which direction addresses go (lower to higher).
The bitmap contains 3 zero bits at its lowest edge and 7 zero bits at its
highest edge.
Within the bitmap, the largest run is 10 contiguous zero bits, which `max` reflects.
![Diagram of a summary.](35112/summary-diagram.png)
With these three fields, we can determine whether we'll be able to find a
sufficiently large contiguous free section of memory in a given chunk or
contiguous set of chunks with a simple state machine.
Computing this summary information for a chunk is less trivial to make fast,
and effectively amounts to a combination of a table to get per-byte summaries
and a state machine to merge them until we have a summary which represents the
whole chunk.
The state machine for `start` and `end` is mostly trivial.
`max` is only a little more complex: by knowing `start`, `max`, and `end` for
adjacent summaries, we can merge the summaries by picking the maximum of each
summary's `max` value and the sum of their `start` and `end` values.
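To make the merge rule concrete, below is a minimal, unpacked sketch of a chunk
summary and the merge just described. The field layout, names, and the
`chunkPages` constant are illustrative assumptions, not the final (packed)
representation, which is discussed in the implementation details further down.

```go
const chunkPages = 512 // pages per bitmap chunk, as proposed above

// summary describes the free (zero) bits of some aligned region of the bitmap.
type summary struct {
	start, max, end uint // counts of contiguous free pages
}

// merge combines the summaries of two adjacent regions, a at lower addresses
// and b immediately above it, each covering `pages` pages.
func merge(a, b summary, pages uint) summary {
	m := summary{start: a.start, end: b.end}
	if a.start == pages {
		// a is completely free, so the free run at the start of the
		// merged region continues into b.
		m.start = pages + b.start
	}
	if b.end == pages {
		// b is completely free, so the free run at the end of the
		// merged region reaches back into a.
		m.end = pages + a.end
	}
	// The largest free run is either entirely within a, entirely within b,
	// or straddles the boundary between them.
	m.max = a.end + b.start
	if a.max > m.max {
		m.max = a.max
	}
	if b.max > m.max {
		m.max = b.max
	}
	return m
}
```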
I propose we update these summary values eagerly as spans are allocated and
freed.
For large allocations that span multiple chunks, we can zero out the summaries
of the chunks in the interior of the allocation very quickly, and we really
only need to do the full computation of summary information for the two ends of
the allocation.
There's a problem in this design so far wherein subsequent allocations may end
up treading the same path over and over.
Unfortunately, this retreading behavior makes the allocator's overall time
complexity `O(heap * allocs)`.
We propose a simple solution to this problem: maintain a hint address.
A hint address represents an address before which there are definitely no free
pages in the heap.
There may not be free pages for some distance after it, which is why it is just
a hint, but we know for a fact we can prune everything before that address from
the search.
In the steady-state, as we allocate from the lower addresses in the heap, we can
bump the hint forward with every search, effectively shrinking the search space
until new memory is freed.
Most allocations are expected to allocate not far from the hint.
There's still an inherent performance problem with this design: larger
allocations may require iterating over the whole heap, even with the hint
address.
This issue arises from the fact that we now have an allocation algorithm with a
time complexity linear in the size of the heap.
Modern microarchitectures are good, but not quite good enough to just go with
this.
Therefore, I propose we take this notion of a summary-per-chunk and extend it:
we can build a tree around this, wherein a given entry at some level of the
radix tree represents the merge of some number of summaries in the next level.
The leaf level in this case contains the per-chunk summaries, while each entry
in the level above reflects 8 chunks, the level above that 64 chunks, and so on.
This tree would be constructed from a finite number of arrays of summaries,
with layers closer to the root being smaller in size than the layers below
them, since each of their entries reflects a larger portion of the address
space.
More specifically, we avoid having an "explicit" pointer-based structure (think
"implicit" vs. "explicit" when it comes to min-heap structures: the former tends
to be an array, while the latter tends to be pointer-based).
Below is a diagram of the complete proposed structure.
![Diagram of the proposed radix tree.](35112/radix-tree-diagram.png)
The bottom two boxes are the arenas and summaries representing the full address
space.
Each red line represents a summary, and each set of dotted lines from a summary
into the next layer reflects which part of that next layer that summary refers
to.
In essence, because this tree reflects our address space, it is in fact a radix
tree over addresses.
By right shifting a memory address by different amounts, we can find the exact
summary which contains that address in each level.
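As a small, concrete example of this indexing, using the 8 KiB page size and
512-page chunks stated above (so each leaf summary covers 4 MiB), the index of
the leaf-level summary for an address is just a shift. The function name and
constants are illustrative assumptions.

```go
// leafSummaryIndex returns the index into the leaf level of the radix tree
// for the summary covering addr. 13 = log2(8 KiB page size) and
// 9 = log2(512 pages per chunk), so each leaf summary covers 2^22 bytes (4 MiB).
func leafSummaryIndex(addr uintptr) uintptr {
	return addr >> (13 + 9)
}
```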
On allocation, this tree may be searched by looking at `start`, `max`, and `end`
at each level: if we see that `max` is large enough, we continue searching in
the next, more granular, level.
If `max` is too small, then we look to see if there's free space spanning two
adjacent summaries' memory regions by looking at the first's `end` value and the
second's `start` value.
Larger allocations are therefore more likely to cross larger boundaries of the
address space, and so are more likely to get satisfied by levels in the tree
which are closer to the root.
Note that if the heap has been exhausted, then we will simply iterate over the
root level, find all zeros, and return.
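The sketch below shows the shape of this search at a single level, reusing the
`summary` type from the earlier sketch; `regionPages` is the number of pages
each summary at that level covers. A real implementation would repeat this per
level, descend when a candidate is found, and finally search the leaf chunk's
bitmap for the exact page; the hint address above would simply determine where
the scan starts. Names and structure are illustrative assumptions.

```go
// findIn scans one level's summaries for npages contiguous free pages,
// returning the index of the summary in which such a run ends, or false if
// no run that large exists at this level.
func findIn(sums []summary, regionPages, npages uint) (int, bool) {
	run := uint(0) // free pages accumulated across summary boundaries
	for i, s := range sums {
		if run+s.start >= npages {
			// A run spanning the tail of earlier regions and the
			// head of this one is large enough.
			return i, true
		}
		if s.max >= npages {
			// The run fits entirely within this summary's region.
			return i, true
		}
		if s.start == regionPages {
			run += regionPages // region is completely free
		} else {
			run = s.end // a run can only continue from the tail
		}
	}
	return 0, false
}
```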
#### Implementation details
A number of details were omitted from the previous section for brevity, but
these details are key for an efficient implementation.
Firstly, note that `start, max, end uintptr` is an awkward structure in size,
requiring either 12 bytes or 24 bytes to store naively, neither of which fits a
small multiple of a cache line comfortably.
To make this structure more cache-friendly, we can pack them tightly into
64-bits if we constrain the height of the radix tree.
The packing is straightforward: we may dedicate 21 bits to each of these three
numbers and pack them into 63 bits.
A small quirk with this scheme is that each of `start`, `max`, and `end` is a
count, and so we need to represent zero as well as the maximum value (`2^21`),
which at first glance requires an extra bit per field.
Luckily, when `max == 2^21`, it must be that `start == max && max == end`, so we
may use the last remaining bit to represent this case.
A summary representing a completely full region is also conveniently `uint64(0)`
in this representation, which enables us to very quickly skip over parts of the
address space we don't care about with just one load and branch.
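A sketch of this packing under the 21-bit-per-field scheme follows; the type
and constant names are illustrative assumptions.

```go
const (
	logMaxPackedPages = 21
	maxPackedPages    = 1 << logMaxPackedPages // 2^21 pages, i.e. a 16 GiB region
)

type packedSummary uint64

// pack encodes start, max, and end (each at most 2^21) into a single uint64:
// 21 bits per field, with the top bit flagging the completely-free case,
// since the value 2^21 itself doesn't fit in 21 bits. A completely full
// region packs to 0.
func pack(start, max, end uint) packedSummary {
	if max == maxPackedPages {
		// Then start == max == end, so one flag bit is enough.
		return packedSummary(1) << 63
	}
	return packedSummary(start) |
		packedSummary(max)<<logMaxPackedPages |
		packedSummary(end)<<(2*logMaxPackedPages)
}

// unpack recovers the three counts from the packed representation.
func (p packedSummary) unpack() (start, max, end uint) {
	if p&(1<<63) != 0 {
		return maxPackedPages, maxPackedPages, maxPackedPages
	}
	const mask = maxPackedPages - 1
	return uint(p) & mask,
		uint(p>>logMaxPackedPages) & mask,
		uint(p>>(2*logMaxPackedPages)) & mask
}
```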
As mentioned before, a consequence of this packing is that we need to place a
restriction on our structure: each entry in the root level of the radix tree may
only represent at most `2^21` 8 KiB pages, or 16 GiB, because we cannot
represent any more in a single summary.
From this constraint, it follows that the root level will always contain
`2^(heapAddrBits - 21 - log2(pageSize))` entries.
Should we need to support much larger heaps, we may easily remedy this by
representing a summary as `start, max, end uint32`, though at the cost of cache
line alignment and 1.5x metadata overhead.
We may also consider packing the three values into two `uint64` values, though
at the cost of twice as much metadata overhead.
Note that this concern is irrelevant on 32-bit architectures: we can easily
represent the whole address space with a tree and 21 bits per summary field.
Unfortunately, we cannot pack it more tightly on 32-bit architectures since at
least 14 bits are required per summary field.
Now that we've limited the size of the root level, we need to pick the sizes of
the subsequent levels.
Each entry in the root level must reflect some number of entries in the
following level, which gives us our fanout.
In order to stay cache-friendly, I propose trying to keep the fanout close to
the size of an L1 cache line or some multiple thereof.
64 bytes per line is generally a safe bet, and our summaries are 8 bytes wide,
so that gives us a fanout of 8.
Taking all this into account, for a 48-bit address space (such as how we treat
`linux/amd64` in the runtime), I propose the following 5-level array structure:
* Level 0: `16384` entries (fanout = 1, root)
* Level 1: `16384*8` entries (fanout = 8)
* Level 2: `16384*8*8` entries (fanout = 8)
* Level 3: `16384*8*8*8` entries (fanout = 8)
* Level 4: `16384*8*8*8*8` entries (fanout = 8, leaves)
Note that level 4 has `2^48 bytes / (512 * 8 KiB)` entries, which is exactly the
number of chunks in a 48-bit address space.
Each entry at this level represents a single chunk.
Similarly, since a chunk represents 512, or 2^9 pages, each entry in the root
level summarizes a region of `2^21` contiguous pages, as intended.
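As a quick check of these numbers for `linux/amd64` (48-bit address space,
8 KiB pages, 512-page chunks), the level sizes follow directly from the
constants; everything below restates the derivation above, with illustrative
constant names.

```go
const (
	heapAddrBits  = 48
	logPageSize   = 13 // 8 KiB pages
	logChunkPages = 9  // 512 pages per chunk
	levelFanout   = 8  // fanout between adjacent levels

	// Root level: one entry per 2^21 pages (16 GiB), so 2^14 = 16384 entries.
	rootEntries = 1 << (heapAddrBits - 21 - logPageSize)
	// Leaf level: one entry per chunk, so 2^26 = 16384*8*8*8*8 entries.
	leafEntries = 1 << (heapAddrBits - logPageSize - logChunkPages)
)
```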
This scheme can be trivially applied to any system with a larger address space,
since we just increase the size of the root level.
For a 64-bit address space, the root level can get up to 8 GiB in size, but
that's mostly virtual address space which is fairly cheap since we'll only
commit to what we use (see below).
For most heaps, `2^21` contiguous pages or 16 GiB per entry in the root level is
good enough.
If we limited ourselves to 8 entries in the root, we would still be able to
gracefully support up to 128 GiB (and likely double that, thanks to
prefetchers).
Some Go applications may have larger heaps, but as mentioned before we can
always change the structure of a summary away from packing into 64 bits and
then add an additional level to the tree, at the expense of some additional
metadata overhead.
Overall this uses between hundreds of KiB and hundreds of MiB of address space
on systems with smaller address spaces (~600 MiB for a 48-bit address space,
~128 KiB for a 32-bit address space).
For a full 64-bit address space, this layout requires ~37 TiB of reserved
memory.
At first glance, this seems like an enormous amount, but in reality that's an
extremely small fraction (~0.00022%) of the full address space.
Furthermore, this address space is very cheap since we'll only commit what we
use, and to reduce the size of core dumps and eliminate issues with overcommit
we will map the space as `PROT_NONE` (only `MEM_RESERVE` on Windows) and map it
as read/write explicitly when we grow the heap (an infrequent operation).
There are only two known adverse effects of this large mapping on Linux:
1. `ulimit -v`, which restricts even `PROT_NONE` mappings.
1. Programs like top, when they report virtual memory footprint, include
`PROT_NONE` mappings.
In the grand scheme of things, these are relatively minor consequences.
The former is not used often, and in cases where it is, it's used as an
inaccurate proxy for limiting a process's physical memory use.
The latter is mostly cosmetic, though perhaps some monitoring system uses it as
a proxy for memory use, and will likely result in some harmless questions.
### Allow a P to cache bits from the bitmap
I propose adding a free page cache to each P.
The page cache, in essence, is a base address marking the beginning of a
64-page aligned region, and a 64-bit bitmap representing free pages in that
region.
With 8 KiB pages, each P can thus hold onto at most 512 KiB of memory.
The allocation algorithm would thus consist of a P first checking its own cache.
If the cache is empty, the P would then go into the bitmap and cache the first
64-bit aligned group of bits it sees containing at least one free page, noting
the base address of those 64 bits.
It then allocates out of its own cache if able.
If it's unable to satisfy the allocation from these bits, then it goes back and
starts searching the bitmap for contiguous free bits, falling back on heap
growth if that fails.
If the allocation request is more than 16 pages in size, then we don't even
bother checking the cache.
The probability that `N` consecutive free pages will be available in the page
cache decreases exponentially as `N` approaches 64, and 16 strikes a good
balance between being opportunistic and being wasteful.
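A minimal sketch of such a per-P cache and allocation from it follows. Note the
polarity assumed here: in this cache bitmap a 1 bit means the page is free
(cached), which is inverted relative to the heap bitmap described earlier. All
names are illustrative assumptions.

```go
const pageSize = 8 << 10 // 8 KiB

// pageCache is a per-P cache of up to 64 pages.
type pageCache struct {
	base  uintptr // base address of a 64-page aligned region
	cache uint64  // bitmap of cached free pages; bit i covers base + i*pageSize
}

// alloc returns the address of npages contiguous free pages taken from the
// cache, or 0 if the cache can't satisfy the request.
func (c *pageCache) alloc(npages uint) uintptr {
	if c.cache == 0 || npages > 16 {
		// Empty cache, or a request large enough that it isn't worth
		// checking the cache at all (see above).
		return 0
	}
	run, start := uint(0), uint(0)
	for i := uint(0); i < 64; i++ {
		if c.cache&(uint64(1)<<i) == 0 {
			run = 0
			continue
		}
		if run == 0 {
			start = i
		}
		run++
		if run == npages {
			mask := (uint64(1)<<npages - 1) << start
			c.cache &^= mask // mark these pages as allocated
			return c.base + uintptr(start)*pageSize
		}
	}
	return 0
}
```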
Note that caching the first 64-bit group containing a free page is an
equivalent operation to allocating one page out of the heap: fundamentally
we're looking for the first free page we can find in both cases.
This means that we can and should optimize for this case, since we expect that
it will be extremely common.
Note also that we can always update the hint address in this case, making all
subsequent allocations (large and small) faster.
Finally, there's a little hiccup in doing this: acquiring an `mspan` object
currently requires acquiring the heap lock, since these objects are just taken
out of a locked SLAB allocator.
This means that even if we can perform the allocation uncontended we still need
the heap lock to get one of these objects.
We can solve this problem by adding a pool of `mspan` objects to each P, similar
to the `sudog` cache.
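A rough sketch of such a pool, modeled on the existing per-P `sudog` cache,
is below; `allocMSpanLocked` stands in for the existing locked SLAB path, and
all names here are illustrative assumptions.

```go
// mspan stands in for the runtime's span descriptor.
type mspan struct{ /* span fields elided */ }

// mspanCache is a small per-P stack of spare mspan objects so that the
// uncontended allocation path never needs the heap lock just to obtain a
// span descriptor.
type mspanCache struct {
	buf []*mspan
}

// get returns a spare mspan, falling back to the heap-locked allocator only
// when the local cache is empty.
func (c *mspanCache) get() *mspan {
	if n := len(c.buf); n > 0 {
		s := c.buf[n-1]
		c.buf = c.buf[:n-1]
		return s
	}
	// Slow path: take the heap lock and allocate (stubbed out here).
	return allocMSpanLocked()
}

// allocMSpanLocked stands in for the existing locked SLAB allocation.
func allocMSpanLocked() *mspan { return new(mspan) }
```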
### Scavenging
With the elimination of free spans, scavenging must work a little differently as
well.
The primary bit of information we're concerned with here is the `scavenged`
field currently on each span.
I propose we add a `scavenged` bitmap to each `heapArena` which mirrors the
allocation bitmap, and represents whether that page has been scavenged or not.
Allocating any pages would unconditionally clear these bits to avoid adding
extra work to the allocation path.
The scavenger's job is now conceptually much simpler.
It takes bits from both the allocation bitmap as well as the `scavenged` bitmap
and performs a bitwise-OR operation on the two; the zero bits of the result
mark the pages which are "scavengable", i.e. free and not yet scavenged.
It then scavenges any contiguous free pages it finds in a single syscall,
marking the appropriate bits in the `scavenged` bitmap.
Like the allocator, it would have a hint address to avoid walking over the same
parts of the heap repeatedly.
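In bitmap terms, the combination described above is just one logical operation
per 64-page word; a tiny sketch (the function name is an assumption):

```go
// scavengable returns a bitmap of pages worth scavenging: pages that are
// neither in use (allocation bit set) nor already scavenged (scavenged bit
// set), i.e. the complement of the OR described above.
func scavengable(alloc, scavenged uint64) uint64 {
	return ^(alloc | scavenged)
}
```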
Because this new algorithm effectively requires iterating over the heap
backwards, there's a slight concern with how much time it could take,
specifically if it does the scavenge operation with the heap lock held like
today.
Instead, I propose that the scavenger iterate over the heap without the lock,
checking the free and scavenged bitmaps optimistically.
If it finds what appears to be a valid set of contiguous scavengable bits,
it'll acquire the heap lock, verify their validity, and scavenge.
We're still scavenging with the heap lock held as before, but scaling the
scavenger is outside the scope of this document (though we certainly have ideas
there).
#### Huge-page Awareness
Another piece of the scavenging puzzle is how to deal with the fact that the
current scavenging policy is huge-page aware.
There are two dimensions to this huge-page awareness: the runtime counts the
number of free and unscavenged huge pages for pacing purposes, and the runtime
scavenges those huge pages first.
For the first part, the scavenger currently uses an explicit ratio calculated
whenever the GC trigger is updated to determine the rate at which it should
scavenge, and it uses the number of free and unscavenged huge pages to determine
this ratio.
Instead, I propose that the scavenger release memory one page at a time while
avoiding breaking up huge pages, timing how long releasing each page takes.
Given a 1% maximum time spent scavenging for the background scavenger, we may
then determine the amount of time to sleep, thus effectively letting the
scavenger set its own rate.
In some ways this self-pacing is more accurate because we no longer have to make
order-of-magnitude assumptions about how long it takes to scavenge.
Also, it represents a significant simplification of the scavenger from an
engineering perspective; there's much less state we need to keep around in
general.
The downside to this self-pacing idea is that we must measure time spent
sleeping and time spent scavenging, which may be funky in the face of OS-related
context switches and other external anomalies (e.g. someone puts their laptop in
sleep mode).
We can deal with such anomalies by setting bounds on how high or low our
measurements are allowed to go.
Furthermore, I propose we manage an EWMA which we feed into the time spent
sleeping to account for scheduling overheads and try to drive the actual time
spent scavenging to 1% of the time the goroutine is awake (the same pace as
before).
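To make the pacing concrete, a rough sketch of the sleep computation under a 1%
budget follows. The EWMA adjustment shown is only one plausible formulation of
what is described above, and all names and signatures are assumptions.

```go
import "time"

const scavengePercent = 1 // target: spend ~1% of the scavenger's awake time scavenging

// scavengeSleep returns how long the background scavenger should sleep after
// spending critTime releasing memory, so that scavenging work stays at roughly
// scavengePercent of its awake time. fracAchieved is an EWMA of
// (intended sleep time / actual elapsed sleep time), used to compensate for
// scheduling overheads and OS anomalies.
func scavengeSleep(critTime time.Duration, fracAchieved float64) time.Duration {
	// If critTime should be 1% of the awake period, the sleep must be
	// roughly (100/1 - 1) = 99x the time spent scavenging.
	sleep := time.Duration(float64(critTime) * (100/float64(scavengePercent) - 1))
	// Ask for proportionally more sleep if past sleeps came up short.
	if fracAchieved > 0 {
		sleep = time.Duration(float64(sleep) / fracAchieved)
	}
	return sleep
}
```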
As far as scavenging huge pages first goes, I propose we just ignore this aspect
of the current scavenger for simplicity's sake.
In the original scavenging proposal, the purpose of scavenging huge pages first
was for throughput: we would get the biggest bang for our buck as soon as
possible, so huge pages don't get "stuck" behind small pages.
There's a question as to whether this actually matters in practice: conventional
wisdom suggests a first-fit policy tends to cause large free fragments to
congregate at higher addresses.
By analyzing and simulating scavenging over samples of real Go heaps, I think
this wisdom mostly holds true.
The graphs below show a simulation of scavenging these heaps using both
policies, counting how much of the free heap is scavenged at each moment in
time.
Ignore the simulated time; the trend is more important.
![Graph of scavenge throughput.](35112/scavenge-throughput-graph.png)
With the exception of two applications, the rest all seem to have their free and
unscavenged huge pages at higher addresses, so the simpler policy leads to a
similar rate of releasing memory.
The simulation is based on heap snapshots at the end of program execution, so
it's a little non-representative since large, long-lived objects, or clusters of
objects, could have gotten freed just before measurement.
This misrepresentation actually acts in our favor, however, since it suggests an
even smaller frequency of huge pages appearing in the middle of the heap.
## Rationale
The purpose of this proposal is to help the memory allocator scale.
To reiterate: it's currently very easy to put the heap lock into a collapsed state.
Every page-level allocation must acquire the heap lock, and with 1 KiB objects
we're already hitting the page heap on every 10th allocation.
To give you an idea of what kinds of timings are involved with page-level
allocations, I took a trace from a 12-hour load test from Kubernetes when I was
diagnosing
[kubernetes/kubernetes#75833](https://github.com/kubernetes/kubernetes/issues/75833#issuecomment-477758829).
92% of all span allocations were for the first 50 size classes (i.e. up to and
including 8 KiB objects).
Each of those, on average, spent 4.0µs in the critical section with the heap
locked, minus any time spent scavenging.
The mode of this latency was between 3 and 4µs, with the runner-up being between
2 and 3µs.
These numbers were taken with the load test built using Go 1.12.4 and from a
`linux/amd64` GCE instance.
Note that these numbers do not include the time it takes to acquire or release
the heap lock; it is only the time in the critical section.
I implemented a prototype of this proposal which lives outside of the Go
runtime, and optimized it over the course of a few days.
I then took heap samples from large, end-to-end benchmarks to get realistic heap
layouts for benchmarking the prototype.
The prototype benchmark then started with these heap samples and allocated out
of them until the heap was exhausted.
Without the P cache, allocations took only about 680 ns on average on a similar
GCE instance to the Kubernetes case, pretty much regardless of heap size.
This number scaled gracefully relative to allocation size as well.
To be totally clear, this time includes finding space, marking the space and
updating summaries.
It does not include clearing scavenge bits.
With the P cache included, that number dropped to 20 ns on average.
The comparison with the P cache isn't an apples-to-apples comparison since it
should include heap lock/unlock time on the slow path (and the k8s numbers
should too).
However I believe this only strengthens our case: with the P cache, in theory,
the lock will be acquired less frequently, so an apples-to-apples comparison
would be even more favorable to the P cache.
All of this doesn't even include the cost savings when freeing memory.
While I do not have numbers regarding the cost of freeing, I do know that the
free case in the current implementation is a significant source of lock
contention ([golang/go#28479](https://github.com/golang/go/issues/28479)).
Each free currently requires a treap insertion and maybe one or two removals for
coalescing.
In comparison, freeing memory in this new allocator is faster than allocation
(without the cache): we know exactly which bits in the bitmap to clear from the
address, and can quickly index into the arenas array to update them as well as
their summaries.
While updating the summaries still takes time, we can do even better by freeing
many pages within the same arena at once, amortizing the cost of this update.
In fact, the fast page sweeper that Austin added in Go 1.12 already iterates
over the heap from lowest to highest address, freeing completely free spans.
It would be straightforward to batch free operations within the same heap arena
to achieve this cost amortization.
In sum, this new page allocator design not only has the potential to solve our
immediate scalability problem, but also gives us more headroom for future
optimizations compared to the current treap-based allocator, for which a number
of caching strategies have already been designed and/or attempted.
### Fragmentation Concerns
The biggest way fragmentation could worsen with this design is as a result of
the P cache.
The P cache makes it so that allocation isn't quite a serialized,
single-threaded first-fit, and a P may hold onto pages which another P needs
more urgently.
In practice, given an in-tree prototype, we've seen that this fragmentation
scales with the number of P's, and we believe this to be a reasonable trade-off:
more processors generally require more memory to take advantage of parallelism.
### Prior Art
As far as prior art is concerned, there hasn't been much work with bitmap
allocators for languages which have a GC.
Consider Go against other managed languages with a GC: Go's GC sits at a
fairly unique point in the design space for GCs because it is a non-moving
collector.
Most allocators in other managed languages (e.g. Java) tend to be bump
allocators, since they tend to have moving and compacting GCs.
Other managed languages also tend to have many more allocations of smaller size
compared to Go, so the "slow path" of allocating pages is usually just grabbing
a fixed-size block out of a shared pool, which can be made to be quite fast.
Go relies on being able to allocate blocks of different sizes to reduce
fragmentation in its current allocator design.
However, when considering non-GC'd languages, e.g. C/C++, there has been some
notable work using bitmaps in memory allocators.
In particular,
[DieHard](https://github.com/emeryberger/DieHard/blob/master/docs/pldi2006-diehard.pdf),
[DieHarder](https://www.usenix.org/legacy/event/woot/tech/final_files/Novark.pdf),
and [Mesh](https://arxiv.org/pdf/1902.04738.pdf).
DieHard and DieHarder in particular implement an efficient amortized `O(1)`
bitmap-based allocator, though that was not their primary contribution.
Mesh uses bitmaps for managing which slots are free in a span, like the Go
allocator.
We were not aware of DieHard(er) at the time of writing this proposal, though
they use bitmaps to track individual objects instead of pages.
There are also a few niche cases where they are used such as [GPU-accelerated
allocation](https://arxiv.org/abs/1810.11765) and [real-time
applications](http://www.gii.upv.es/tlsf/).
A good point of comparison for Go's current page allocator is TCMalloc, and in
many ways Go's memory allocator is based on TCMalloc.
However, there are some key differences that arise as a result of Go's GC.
Notably, TCMalloc manages its per-CPU caches as arrays of pointers, rather than
through spans directly like Go does.
The reason for this, as far as I can tell, is because when a free occurs in
TCMalloc, that object is immediately available for re-use, whereas with Go,
object lifetimes are effectively rounded up to a GC cycle.
As a result of this global, bulk (de)allocation behavior and the resulting lack
of short-term re-use, I suspect Go tends to ask the page allocator for memory
more often than TCMalloc does.
This bulk (de)allocation behavior would thus help explain why page allocator
scalability hasn't been such a big issue for TCMalloc (again, as far as I'm
aware).
In sum, Go sits in a unique point in the memory management design space.
The bitmap allocator fits this point in the design space well: bulk allocations
and frees can be grouped together to amortize the cost of updating the summaries
thanks to the GC.
Furthermore, since we don't move objects in the heap, we retain the flexibility
of dealing with fragments efficiently through the radix tree.
### Considered Alternatives
#### Cache spans in a P
One considered alternative is to keep the current span structure, and instead
try to cache the spans themselves on a P, splitting them on each allocation
without acquiring the heap lock.
While this seems like a good idea in principle, one big limitation is that you
can only cache contiguous regions of free memory.
Suppose many heap fragments tend to just be one page in size: one ends up having
to go back to the page allocator every single time anyway.
While it is true that one might only cache one page from the heap in the
proposed design, this case is fairly rare in practice, since it picks up any
available memory it can find in a given 64-page aligned region.
The proposed design also tends to have nicer properties: the treap structure
scales logarithmically (probabilistically) with respect to the number of free
heap fragments, but even this property doesn't scale too well to very large
heaps; one might have to chase down 20 pointers in a 20 GiB heap just for an
allocation, not to mention the additional removals required.
Small heaps may see allocations as fast as 100 ns, whereas large heaps may see
page allocation latencies of 4 µs or more on the same hardware.
On the other hand, the proposed design has a very consistent performance profile
since the radix tree is effectively perfectly balanced.
Furthermore, this idea of caching spans only helps the allocation case.
In most cases the source of contention is not only allocation but also freeing,
since we always have to do a treap insertion (and maybe one or two removals) on
the free path.
In this proposal, the free path is much more efficient in general (no complex
operations required, just clearing memory), even though it still requires
acquiring the heap lock.
Finally, caching spans doesn't really offer much headroom in terms of future
optimization, whereas switching to a bitmap allocator allows us to make a
variety of additional optimizations because the design space is mostly
unexplored.
## Compatibility
This proposal changes no public APIs in either syntax or semantics, and is
therefore Go 1 backwards compatible.
## Implementation
Michael Knyszek will implement this proposal.
The implementation will proceed as follows:
1. Change the scavenger to be self-paced to facilitate an easier transition.
1. Graft the prototype (without the P cache) into the runtime.
* The plan is to do this as a few large changes which are purely additive
and with tests.
* The two allocators will live side-by-side, and we'll flip between the two
in a single small change.
1. Delete unnecessary code from the old allocator.
1. Create a pool of `mspan` objects for each P.
1. Add a page cache to each P.