Proposal: Smarter Scavenging

Author(s): Michael Knyszek <mknyszek@google.com> Last Updated: 2019-05-09

Motivation & Purpose

Out-of-memory errors (OOMs) have long been a pain-point for Go applications. A class of these errors come from the same underlying cause: a temporary spike in memory causes the Go runtime to grow the heap, but it takes a very long time (on the order of minutes) to return that unneeded memory back to the system.

The system can end up killing the application in many situations, such as if the system has no swap space or if system monitors count this space against your application. In addition, if this additional space is counted against your application, you end up paying more for memory when you don’t really need it.

The Go runtime does have internal mechanisms to help deal with this, but they don’t react to changes in the application promptly enough. The way users solve this problem today is through a runtime library function called debug.FreeOSMemory. debug.FreeOSMemory performs a GC and subsequently returns all unallocated memory back to the underlying system. However, this solution is very heavyweight:

  • Returning all free memory back to the underlying system at once is expensive, and can lead to latency spikes as it holds the heap lock through the whole process.
  • It’s an invasive solution: you need to modify your code to call it when you need it.
  • Reusing free chunks of memory becomes more expensive. On UNIX-y systems that means an extra page fault (which is surprisingly expensive on some systems).

The purpose of this document is to propose we replace the existing mechanisms in the runtime with something stronger that responds promptly to the memory requirements of Go applications, ensuring the application is only charged for as much as it needs to remain performant.

Background

Scavenging

Dynamic memory allocators typically obtain memory from the operating system by requesting for it to be mapped into their virtual address space. Sometimes this space ends up unused, and modern operating systems provide a way to tell the OS that certain virtual memory address regions won’t be used without unmapping them. This means the physical memory backing those regions may be taken back by the OS and used elsewhere. We in the Go runtime refer to this technique as “scavenging”.
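To make the mechanism concrete, here is a minimal, Linux-only sketch that maps some anonymous memory and then “scavenges” part of it with madvise, using the syscall package directly. The runtime does this through its own internal wrappers rather than the syscall package, so this standalone program is purely illustrative.

package main

import (
    "fmt"
    "syscall"
)

func main() {
    const size = 1 << 20 // map 1 MiB of anonymous memory

    mem, err := syscall.Mmap(-1, 0, size,
        syscall.PROT_READ|syscall.PROT_WRITE,
        syscall.MAP_ANON|syscall.MAP_PRIVATE)
    if err != nil {
        panic(err)
    }
    mem[0] = 1 // touch a page so it is backed by a physical page

    // "Scavenge" the region: the virtual mapping stays valid, but the OS is
    // now free to reclaim the physical pages backing it. The next access
    // will fault in a fresh, zeroed page.
    if err := syscall.Madvise(mem, syscall.MADV_DONTNEED); err != nil {
        panic(err)
    }
    fmt.Println(mem[0]) // prints 0: the old contents are gone
}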

Scavenging is especially useful in dealing with page-level external fragmentation, since we can give these fragments back to the OS, reducing the process’s resident set size (RSS), that is, the amount of the application’s address space that is backed by physical memory.

Go 1.11

As of Go 1.11, the only scavenging process in the Go runtime was a periodic scavenger which ran every 2.5 minutes. This scavenger combed over all the free spans in the heap and scavenged them if they had been unused for at least 5 minutes. When the runtime coalesced spans, it would track how much of the new span was scavenged.

While this simple technique is surprisingly effective for long-running applications, the peak RSS of an application can end up wildly exaggerated in many circumstances, even though the application’s peak in-use memory is significantly smaller. The periodic scavenger just does not react quickly enough to changes in the application’s memory usage.

Go 1.12

As of Go 1.12, in addition to the periodic scavenger, the Go runtime also performs heap-growth scavenging. On each heap growth, up to N bytes of the largest spans are scavenged, where N is the number of bytes the heap grew by. The idea here is to “pay back” the cost of a heap growth. This technique helped to reduce the peak RSS of some applications.

Note on Span Coalescing Rules

As part of the Go 1.12 release, the span coalescing rules were changed such that scavenged and unscavenged spans no longer coalesce.

Earlier in the Go 1.12 cycle a choice was made to coalesce the two different kinds of spans by scavenging them, but this turned out to be far too aggressive in practice since most spans would become scavenged over time. This policy was especially costly if scavenging was particularly expensive on a given platform.

In addition to avoiding the problem above, there’s a key reason not to merge across this boundary: if most spans end up scavenged over time, then we do not have the fine-grained control over memory that we need to create good policies and mechanisms for scavenging memory.

Prior Art

For scavenging, we look to C/C++ allocators, which have a much richer history of scavenging memory than allocators for managed languages. For example, the HotSpot VM only recently started scavenging memory, and even then its policies are very conservative, returning memory only during periods of low application activity. The Shenandoah collector has had this functionality for a little while, but as far as I can tell it just does the same thing Go 1.11 did.

For the purposes of this document, we will focus our comparisons on jemalloc, which appears to me to be the state-of-the-art in scavenging.

Goals

The goal in scavenging smarter is two-fold:

  • Reduce the average and peak RSS of Go applications.
  • Minimize the CPU impact of keeping the RSS low.

The two goals are in tension with one another. On the one hand, you want to keep the RSS of the application as close to its in-use memory usage as possible. On the other hand, doing so is expensive in terms of CPU time, requiring syscalls and incurring page faults. If we’re too aggressive and scavenge every free piece of memory we have, then we effectively incur a hard page fault (or invoke a syscall) on every span allocation, and we make a syscall on every span free.

The ideal scenario, in my view, is that the RSS of the application “tracks” the peak in-use memory over time.

  • We should keep the RSS close to the actual in-use heap, but leave enough of a buffer such that the application has a pool of unscavenged memory to allocate from.
  • We should try to smooth over fast and transient changes in heap size.

The goal of this proposal is to improve the Go runtime’s scavenging mechanisms such that it exhibits the behavior shown above. Compared with today’s implementation, this behavior should reduce the average overall RSS of most Go applications with minimal impact on performance.

Proposal

Three questions represent the key policy decisions that describe a memory scavenging system.

  1. How much memory should we retain (not scavenge)?
  2. At what rate is memory scavenged?
  3. Which memory should we scavenge?

I propose that for the Go runtime, we:

  1. Retain some constant times the heap goal.
  2. Scavenge at a rate based on a predefined fixed mutator overhead (1%), madvise performance characteristics, and the number of free huge pages.
  3. Scavenge the unscavenged spans with the highest base addresses first.

Additionally, I propose we change the span allocation policy to prefer unscavenged spans over scavenged spans, and to be first-fit rather than best-fit.

Rationale

How much memory should we retain?

As part of our goal to keep the program’s reported RSS to a minimum, we ideally want to scavenge as many pages as it takes to track the program’s in-use memory.

However, there’s a performance trade-off in tracking the program’s in-use memory too closely. For example, if the heap very suddenly shrinks but then grows again, there's a significant cost in terms of syscalls and page faults incurred. On the other hand, if we scavenge too passively, then the program’s reported RSS may be inflated significantly.

This question is difficult to answer in general, because allocators generally cannot predict the future behavior of the application. jemalloc avoids this question entirely, relying solely on having a good (albeit complicated) answer to the “rate” question (see next section).

But, as Austin mentioned in golang/go#16930, Go has an advantage over C/C++ allocators in this respect. The Go runtime knows that before the next GC, the heap will grow to the heap goal.

This suggests that between GCs there may be some body of free memory that one can drop with relatively few consequences. Thus, I propose the following heuristic: retain C * next_gc bytes of memory, where C is a constant factor whose purpose is to leave some headroom, reducing the probability that the application needs to allocate from scavenged memory in the steady state. The proposed value for C is 1.1 (indicating 10% additional overhead), but it may be adjusted as we see fit.
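To make the heuristic concrete, here is a rough sketch of the retention calculation. The names (retainFactor, retainExcess, heapRetained) are illustrative assumptions for this document, not actual runtime identifiers.

package scavenge

// retainFactor is C: how much headroom to keep over the heap goal.
const retainFactor = 1.1

// retainExcess reports how many bytes the scavenger should aim to return to
// the OS, given the heap goal for this GC cycle (nextGC) and the number of
// bytes of heap memory currently backed by physical pages (heapRetained).
func retainExcess(nextGC, heapRetained uint64) uint64 {
    goal := uint64(retainFactor * float64(nextGC))
    if heapRetained <= goal {
        return 0 // at or below the retention goal; nothing to scavenge
    }
    return heapRetained - goal
}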

At what rate is memory scavenged?

In order to have the application’s RSS track the amount of heap space it’s actually using over time, we want to be able to grow and shrink the RSS at a rate proportional to how the in-use memory of the application is growing and shrinking, with smoothing.

When it comes to growth, that problem is generally solved. The application may cause the heap to grow, and the allocator will map new, unscavenged memory in response. Or, similarly, the application may allocate out of scavenged memory.

On the flip side, figuring out the rate at which to shrink the RSS is harder. Ideally we would shrink it as soon as possible, but unfortunately doing so could cause latency issues.

jemalloc solves this by having its memory “decay” according to a sigmoid-like curve.

Each contiguous extent of allocable memory decays according to a globally-defined tunable rate, and how many of them end up available for scavenging is governed by a sigmoid-like function.

The result of this policy is that the heap shrinks in a sigmoidal fashion: slowly at first, smoothing out noise in in-use memory, but at some point committing and scavenging lots of memory at once. While this strategy works well in general, it’s still prone to making bad decisions in certain cases, and relies on the developer to tune the decay rate for the application. Furthermore, I believe that this design by jemalloc was a direct result of not knowing anything about the future state of the heap.

But, as I mentioned earlier, the Go runtime does know that the heap will grow to the heap goal, so we know at what point we should stop (i.e. set the rate to 0). What we don’t know is how fast we should get to that point.

To answer that question, I propose we set an overhead goal; let’s say we steal no more than 1% of the mutator’s time doing syscalls to perform scavenging tasks. Based on experiments, it takes roughly 10µs to scavenge one physical page; 1% of each millisecond of mutator time is 10µs, so we can derive an average scavenging rate of 1 page per millisecond.

1 physical page per millisecond, however, is a slow rate. On linux/amd64 this corresponds to 4 KiB/ms. Consider the extreme (but not unrealistic) case of a heap which spikes to 200 GiB in size during initialization but quickly settles back to its steady-state of 50 GiB. Reclaiming that 150 GiB would take about 11 hours at that rate.

But on some platforms there’s a saving grace: Linux and FreeBSD have a feature called transparent huge pages, meaning the OS backs contiguous and aligned blocks of 512 physical pages (on amd64) with a single huge page in the machine’s page table. Scavenging a single huge page has about the same cost as scavenging a single regular physical page.

If we consider the earlier example again, those 150 GiB are likely to be covered by huge pages (and if not, we can ensure that they are with some very infrequent madvise syscalls when coalescing spans). That means most of the time we can scavenge that space at a rate of 2 MiB/ms, so reclaiming the 150 GiB takes about 77 seconds, which is reasonable for a heap of that size.

Thus, I propose we pick a rate that is based on the number of free, unscavenged huge pages in the heap, a fixed mutator overhead target, and the performance characteristics of madvise. This rate will be updated after any change to the heap goal, so at minimum at the end of every GC.
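The following sketch shows how such a rate might be derived from those three inputs. The constants and names are illustrative assumptions for this document, not the runtime’s actual pacing code.

package scavenge

import "time"

const (
    mutatorOverhead = 0.01                  // steal at most 1% of mutator time
    costPerMadvise  = 10 * time.Microsecond // measured cost of one scavenging syscall
    pageSize        = 4 << 10               // 4 KiB physical page (amd64)
    hugePageSize    = 2 << 20               // 2 MiB huge page (linux/amd64)
)

// bytesPerSecond estimates how quickly the background scavenger can release
// memory while staying within the overhead budget. A 1% budget at 10µs per
// call allows ~1000 calls per second; each call releases either a whole huge
// page (~2 GiB/s) or a single physical page (~4 MiB/s).
func bytesPerSecond(freeHugePages uint64) float64 {
    callsPerSecond := mutatorOverhead / costPerMadvise.Seconds()
    if freeHugePages > 0 {
        return callsPerSecond * hugePageSize
    }
    return callsPerSecond * pageSize
}

In this sketch, dividing the excess from the retention heuristic above by this rate would give the time the scavenger needs to catch up within the current cycle.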

To keep allocation latencies at about their current levels, the scavenger will run asynchronously, much like the Go runtime’s background sweeper. Another advantage of running asynchronously is that even if the application goes idle the scavenger will continue its work until it is done because the rate is defined for that GC cycle.

Unlike the background sweeper, which does not have an explicit pacing mechanism, the background scavenger will batch more work if it finds itself falling behind the pacing goal. The scavenger will need to acquire the heap lock, which could impact allocation latencies anyway, but as long as it’s paced properly the critical section should be fairly short (usually around 10µs).

One issue with having the scavenger be fully asynchronous is that the application could actively create more work for the scavenger to do. There are two ways this could happen:

  • An allocation causes the heap to grow.
  • An allocation is satisfied using scavenged memory.

The former case is already eliminated by heap-growth scavenging. The latter case may be eliminated by scavenging memory when we allocate from scavenged memory, which as of Go 1.12 we also already do.

The additional scavenging during allocation could prove expensive in terms of latency, given the costs associated with the madvise syscall. In the steady-state this situation should be avoided, but this is a known risk of this design.

Which memory should we scavenge?

At first, this question appears to be a lot like trying to pick an eviction policy for caches or a page replacement policy. The naive answer is thus to favor least-recently used memory, since there’s a cost to allocating scavenged memory (much like a cache miss). Indeed, this is the route which jemalloc takes.

However unlike cache eviction policies or page replacement policies, which cannot make any assumptions about memory accesses, scavenging policy is deeply tied to allocation policy. Fundamentally, we want to scavenge the memory that the allocator is least likely to pick in the future.

For a best-fit allocation policy, one idea (the current one) is to pick the largest contiguous chunks of memory first for scavenging.

This scavenging policy does well, picking the spans least likely to be reused, under the assumption that most allocations are small. If most allocations are small, then smaller contiguous free spaces will be used up first, and larger ones may be scavenged with little consequence. Consider also that even though the cost of scavenging memory is generally proportional to how many physical pages are scavenged at once, scavenging still has fixed costs that may be amortized by picking larger spans first. In essence, by making fewer madvise syscalls, we pay the cost of the syscall itself less often. In the cases where most span allocations aren’t small, however, we’ll make the same number of madvise syscalls but incur many more page faults.

Thus, I propose a more robust alternative: change the Go runtime’s span allocation policy to be first-fit, rather than best-fit. Address-ordered first-fit allocation policies generally perform as well as best-fit in practice when it comes to fragmentation [Johnstone98], a claim which I verified holds true for the Go runtime by simulating a large span allocation trace.

Furthermore, I propose we then scavenge the spans with the highest base address first. The advantage of a first-fit allocation policy here is that we know something about which chunks of memory will actually be chosen, which leads us to a sensible scavenging policy.

First-fit allocation paired with scavenging the “last” spans has a clear preference for taking spans which are less likely to be used, even if the assumption that most allocations are small does not hold. This policy is therefore more robust than the current one and should incur fewer page faults overall.
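To illustrate the ordering concretely (the real implementation would walk the address-ordered treap described in the Appendix from its highest-addressed node, not sort a slice), here is a simplified sketch; the types and names are hypothetical.

package scavenge

import "sort"

// span is a simplified stand-in for the runtime's free span metadata.
type span struct {
    base      uintptr // base address of the free span
    npages    uintptr // length in 8 KiB runtime pages
    scavenged bool
}

const runtimePageSize = 8 << 10 // 8 KiB

// scavengeHighestFirst releases free, unscavenged spans starting from the
// highest base address, stopping once at least want bytes have been returned.
func scavengeHighestFirst(free []*span, want uintptr) uintptr {
    sort.Slice(free, func(i, j int) bool { return free[i].base > free[j].base })
    var released uintptr
    for _, s := range free {
        if released >= want {
            break
        }
        if s.scavenged {
            continue
        }
        // A real implementation would madvise the span's memory range here.
        s.scavenged = true
        released += s.npages * runtimePageSize
    }
    return released
}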

There’s still the more general question of how performant this policy will be. First and foremost, efficient implementations of first-fit allocation exist (see Appendix). Secondly, a valid concern with this new policy is that it no longer amortizes the fixed costs of scavenging, because it may choose smaller spans to scavenge, thereby making more syscalls to scavenge the same amount of memory.

In the case where most span allocations are small, a first-fit allocation policy actually works to our advantage, since it tends to aggregate smaller fragments at lower addresses and larger fragments at higher addresses [Wilson95]. In this case I expect performance to be on par with best-fit allocation and largest-spans-first scavenging. Where this assumption does not hold, it’s true that this new policy may end up making more syscalls. However, the sum total of the marginal costs of scavenging generally outweighs the fixed costs.

The one exception here is huge pages, which have very small marginal costs. In order to maintain coherence with our pacing, which is based on the number of free, unscavenged huge pages, I further propose we prefer to scavenge huge pages before smaller free heap fragments. By doing this we’re able to get the best throughput possible in case huge pages are “stuck” behind lots of small fragments. In general this likely shouldn’t be the case, but it helps protect against that happening. To achieve this, we must also keep track of where we have free, unscavenged huge pages, but this may be done fairly easily: we just need to extend the runtime treap implementation to support searching for huge pages.
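As a sketch of what that treap extension might look like (mirroring the maxPages augmentation described in the Appendix; all types and names here are hypothetical, not runtime identifiers):

package scavenge

// hspan is a simplified stand-in for a free span; freeHugePages counts the
// aligned, free, unscavenged huge pages within it.
type hspan struct {
    freeHugePages uintptr
}

// hugeTreapNode augments an address-ordered treap node with the total number
// of free, unscavenged huge pages in its subtree.
type hugeTreapNode struct {
    left, right *hugeTreapNode
    span        *hspan
    hugePages   uintptr // sum over this node's span and both subtrees
}

// findHugePage returns a span containing at least one free, unscavenged huge
// page, preferring higher addresses (right subtrees first), or nil if the
// tree contains none. The subtree counts guarantee every descent succeeds.
func findHugePage(t *hugeTreapNode) *hspan {
    for t != nil && t.hugePages > 0 {
        switch {
        case t.right != nil && t.right.hugePages > 0:
            t = t.right
        case t.span.freeHugePages > 0:
            return t.span
        default:
            t = t.left
        }
    }
    return nil
}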

Although this policy diverges slightly from the highest-address-first ordering, for each huge page release we get a much larger “bang for our buck”. Furthermore, this policy is only slightly divergent since larger free fragments in an address-ordered allocation policy tend to be found at higher addresses anyway.

With all of this put together, I suspect any performance degradation will be minimal.

Implementation

Michael Knyszek will implement this functionality. The rough plan is as follows:

  1. Add treap tests.
  2. Modify and augment the treap implementation to efficiently implement first-fit allocation.
    • This step will simultaneously change the policy to pick higher addresses first without any additional work.
  3. Track the number of free unscavenged huge pages.
  4. Track spans which have free unscavenged huge pages explicitly.
  5. Remove the existing periodic scavenger.
  6. Add a background scavenger goroutine.

Other Considerations

Heap Lock Contention. Currently the scavenging process happens with the heap lock held. With each scavenging operation taking on the order of 10µs, this can add up fast and block progress. The way jemalloc combats this is to give up the heap lock while actually making any scavenging-related syscalls. Unfortunately this comes with the caveat that any spans currently being scavenged are not available for allocation, which could cause more heap growths and discourage reuse of existing virtual address space. Also, on many modern operating systems a process’s memory map is protected by a single coarse-grained read-write lock, and writers typically need to queue behind readers. Since scavenging operations are usually readers of this lock and heap growth is a writer, letting go of the heap lock may not help as much as one would hope.

Appendix: Implementing a First-fit Data Structure

We can efficiently find the lowest-address available chunk of memory that also satisfies the allocation request by modifying and augmenting any existing balanced binary tree.

For brevity we’ll focus just on the runtime’s treap implementation. The technique shown here is similar to that found in [Rezaei00]. The recipe for transforming our best-fit treap into a first-fit treap consists of the following steps:

First, modify the existing treap implementation to sort by a span’s base address.

Next, attach a new field to each binary tree node called maxPages. This field represents the maximum size in 8 KiB pages of a span in the subtree rooted at that node.

For a leaf node, maxPages is always equal to the node’s span’s length. This invariant is maintained every time the tree changes. For most balanced trees, the tree may change in one of three ways: insertion, removal, and tree rotations.

Tree rotations are simple: one only needs to update the two rotated nodes by checking their span’s size and comparing it with maxPages of their left and right subtrees, taking the maximum (effectively just recomputing maxPages non-recursively).
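A sketch of this maintenance follows; the types are simplified (the runtime’s treap node carries more state), and a parent pointer is assumed for the walk-ups described next.

package treap

// span is a simplified placeholder for the runtime's span type.
type span struct {
    pages uintptr // length of the span in 8 KiB pages
}

// treapNode is an address-ordered treap node augmented with maxPages.
type treapNode struct {
    parent, left, right *treapNode
    span                *span
    maxPages            uintptr // largest span, in pages, in this subtree
}

// updateMaxPages recomputes maxPages for this node alone from its own span
// and its children's maxPages. After a rotation, only the two rotated nodes
// need this, applied bottom-up: first the node that moved down, then the
// node that moved up.
func (t *treapNode) updateMaxPages() {
    m := t.span.pages
    if t.left != nil && t.left.maxPages > m {
        m = t.left.maxPages
    }
    if t.right != nil && t.right.maxPages > m {
        m = t.right.maxPages
    }
    t.maxPages = m
}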

A newly-inserted node in a treap is always a leaf, so that case is handled. Once we insert it, however, any number of subtrees from the parent may now have a different maxPages, so we start from the newly-inserted node and walk up the tree, updating maxPages. Once we reach a point where maxPages does not change, we may stop. Then we may rotate the leaf into place. At most, we travel the height of the tree in this update, but usually we’ll travel less.

On removal, a treap uses rotations to make the node to-be-deleted a leaf. Once the node becomes a leaf, we remove it, and then update its ancestors starting from its new parent. We may stop, as before, when maxPages is no longer affected by the change.
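Both the insertion and removal walk-ups can then share one helper, building on the updateMaxPages sketch above (again assuming a parent pointer on each node, which is an assumption of this sketch rather than a statement about the runtime treap):

// updateAncestors restores the maxPages invariant after a change at or below
// node n, walking toward the root and stopping as soon as a node's maxPages
// is unaffected, since nothing above it can change either.
func updateAncestors(n *treapNode) {
    for n != nil {
        old := n.maxPages
        n.updateMaxPages()
        if n.maxPages == old {
            break
        }
        n = n.parent
    }
}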

Finally, we modify the algorithm which finds a suitable span to use for allocation, or returns nil if one is not found. To find the first-fit span in the tree we leverage maxPages in the following algorithm (in pseudo-Go pseudocode):

1  func Find(root, pages):
2    t = root
3    for t != nil:
4      if t.left != nil and t.left.maxPages >= pages:
5        t = t.left
6      else if t.span.pages >= pages:
7        return t.span
8      else if t.right != nil and t.right.maxPages >= pages:
9        t = t.right
10     else:
11       return nil

By only going down paths where we’re sure there’s at least one span that can satisfy the allocation, we ensure that the algorithm always returns a span of at least pages in size.

Because we prefer going left if possible (line 4) over taking the current node’s span (line 6) over going right if possible (line 8), we ensure that we allocate the node with the lowest base address.

The case where we cannot go left, cannot take the current node, and cannot go right (line 10) should only be possible at the root if maxPages is managed properly. That is, just by looking at the root, one can tell whether an allocation request is satisfiable.

References

Johnstone, Mark S., and Paul R. Wilson. “The Memory Fragmentation Problem: Solved?” Proceedings of the First International Symposium on Memory Management - ISMM 98, 1998. doi:10.1145/286860.286864.

M. Rezaei and K. M. Kavi, “A new implementation technique for memory management,” Proceedings of the IEEE SoutheastCon 2000. ‘Preparing for The New Millennium’ (Cat.No.00CH37105), Nashville, TN, USA, 2000, pp. 332-339. doi:10.1109/SECON.2000.845587

Wilson, Paul R., Mark S. Johnstone, Michael Neely, and David Boles. “Dynamic Storage Allocation: A Survey and Critical Review.” Memory Management Lecture Notes in Computer Science, 1995, 1-116. doi:10.1007/3-540-60368-9_19.