| # Proposal: Smarter Scavenging |
| |
| Author(s): Michael Knyszek \<mknyszek@google.com\> Last Updated: 2019-02-20 |
| |
| ## Motivation & Purpose |
| |
| Out-of-memory errors (OOMs) have long been a pain-point for Go applications. |
| A class of these errors come from the same underlying cause: a temporary spike |
| in memory causes the Go runtime to grow the heap, but it takes a very long time |
| (on the order of minutes) to return that unneeded memory back to the system. |
| |
| The system can end up killing the application in many situations, such as if |
| the system has no swap space or if system monitors count this space against |
| your application. |
| In addition, if this additional space is counted against your application, you |
| end up paying more for memory when you don’t really need it. |
| |
| The Go runtime does have internal mechanisms to help deal with this, but they |
| don’t react to changes in the application promptly enough. |
| The way users solve this problem today is through a runtime library function |
| called `debug.FreeOSMemory`. |
| `debug.FreeOSMemory` performs a GC and subsequently returns all unallocated |
| memory back to the underlying system. |
| However, this solution is very heavyweight: |
| * Returning all free memory back to the underlying system at once is expensive, |
| and can lead to latency spikes as it holds the heap lock through the whole |
| process. |
| * It’s an invasive solution: you need to modify your code to call it when you |
| need it. |
| * Reusing free chunks of memory becomes more expensive. On UNIX-y systems that |
| means an extra page fault (which is surprisingly expensive on some systems). |
| |
| The purpose of this document is to propose we replace the existing mechanisms in |
| the runtime with something stronger that responds promptly to the memory |
| requirements of Go applications, ensuring the application is only charged for as |
| much as it needs to remain performant. |
| |
| ## Background |
| |
| ### Scavenging |
| |
Dynamic memory allocators typically obtain memory from the operating system by
requesting that it be mapped into their virtual address space.
| Sometimes this space ends up unused, and modern operating systems provide a way |
| to tell the OS that certain virtual memory address regions won’t be used |
| without unmapping them. This means the physical memory backing those regions |
| may be taken back by the OS and used elsewhere. |
| We in the Go runtime refer to this technique as “scavenging”. |
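
For example, on Linux this hint is delivered via `madvise`. The sketch below
is not what the runtime literally does (it has its own internal routines for
this, e.g. `sysUnused`); it only illustrates the mechanism, assuming the
`golang.org/x/sys/unix` package.

```go
package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

func main() {
	// Map 1 MiB of anonymous memory, roughly analogous to a heap arena.
	mem, err := unix.Mmap(-1, 0, 1<<20,
		unix.PROT_READ|unix.PROT_WRITE,
		unix.MAP_ANON|unix.MAP_PRIVATE)
	if err != nil {
		panic(err)
	}
	// Touch the memory so it is backed by physical pages (counts toward RSS).
	for i := range mem {
		mem[i] = 1
	}
	// "Scavenge": tell the OS the physical pages may be reclaimed while
	// keeping the virtual mapping intact. MADV_FREE is a cheaper alternative
	// on newer kernels.
	if err := unix.Madvise(mem, unix.MADV_DONTNEED); err != nil {
		panic(err)
	}
	fmt.Println("region scavenged; the next access will fault pages back in")
}
```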
| |
Scavenging is especially useful in dealing with page-level external
fragmentation, since we can give these fragments back to the OS, reducing the
process’s resident set size (RSS), that is, the amount of the application’s
address space that is backed by physical memory.
| |
| ### Go 1.11 |
| |
As of Go 1.11, the only scavenging process in the Go runtime was a periodic
scavenger, which ran every 2.5 minutes.
This scavenger combed over all the free spans in the heap and scavenged them
if they had been unused for at least 5 minutes.
When the runtime coalesced spans, it would track how much of the new span was
scavenged.
| |
| While this simple technique is surprisingly effective for long-running |
| applications, the peak RSS of an application can end up wildly exaggerated in |
| many circumstances, even though the application’s peak in-use memory is |
| significantly smaller. |
| The periodic scavenger just does not react quickly enough to changes in the |
| application’s memory usage. |
| |
| ### Go 1.12 |
| |
| As of Go 1.12, in addition to the periodic scavenger, the Go runtime |
| also performs heap-growth scavenging. |
On each heap growth, up to N bytes of the largest free spans are scavenged,
where N is the number of bytes the heap grew by.
| The idea here is to “pay back” the cost of a heap growth. |
| This technique helped to reduce the peak RSS of some applications. |
| |
| #### Note on Span Coalescing Rules |
| |
As part of the Go 1.12 release, the span coalescing rules were changed such
that scavenged and unscavenged spans no longer coalesce.
| |
| Earlier in the Go 1.12 cycle a choice was made to coalesce the two different |
| kinds of spans by scavenging them, but this turned out to be far too aggressive |
| in practice since most spans would become scavenged over time. |
| This policy was especially costly if scavenging was particularly expensive on a |
| given platform. |
| |
In addition to avoiding the problem above, there’s a key reason not to merge
across this boundary: if most spans end up scavenged over time, then we do not
have the fine-grained control over memory that we need to create good policies
and mechanisms for scavenging memory.
| |
| ### Prior Art |
| |
| For scavenging, we look to C/C++ allocators which have a much richer history of |
| scavenging memory than allocators for managed languages. For example, the |
| HotSpot VM just started scavenging memory, and even then its policies are very |
| conservative, [only returning memory during low application |
| activity](https://openjdk.java.net/jeps/346). |
| The [Shenandoah |
| collector](https://mail.openjdk.java.net/pipermail/hotspot-gc-dev/2018-June/022203.html) |
| has had this functionality for a little while, but it just does the same thing |
| Go 1.11 did, as far as I can tell. |
| |
| For the purposes of this document, we will focus our comparisons on |
| [jemalloc](https://jemalloc.net), which appears to me to be the state-of-the-art |
| in scavenging. |
| |
| ## Goals |
| |
| The goal in scavenging smarter is two-fold: |
| * Reduce the average and peak RSS of Go applications. |
| * Minimize the CPU impact of keeping the RSS low. |
| |
The two goals are in tension with each other. On the one hand, you want to
keep the RSS of the application as close to its in-use memory usage as
possible.
On the other hand, doing so is expensive in terms of CPU time, since it means
making more syscalls and handling more page faults.
If we’re too aggressive and scavenge every free space we have, then every span
allocation effectively incurs a hard page fault (or invokes a syscall), and we
make a syscall on every span free.
| |
| The ideal scenario, in my view, is that the RSS of the application “tracks” the |
| peak in-use memory over time. |
| * We should keep the RSS close to the actual in-use heap, but leave enough of a |
| buffer such that the application has a pool of unscavenged memory to allocate |
| from. |
| * We should try to smooth over fast and transient changes in heap size. |
| |
The goal of this proposal is to improve the Go runtime’s scavenging mechanisms
such that they exhibit the behavior described above.
| Compared with today’s implementation, this behavior should reduce the average |
| overall RSS of most Go applications with minimal impact on performance. |
| |
| ## Proposal |
| |
| Three questions represent the key policy decisions that describe a memory |
| scavenging system. |
| 1. At what rate is memory scavenged? |
| 1. How much memory should we retain (not scavenge)? |
| 1. Which memory should we scavenge? |
| |
| I propose that for the Go runtime, we: |
| 1. Scavenge at a rate proportional to the rate at which the application is |
| allocating memory. |
| 1. Retain some constant times the peak heap goal over the last `N` GCs. |
| 1. Scavenge the unscavenged spans with the highest base addresses first. |
| |
| Additionally, I propose we change the span allocation policy to prefer |
| unscavenged spans over scavenged spans, and to be first-fit rather than |
| best-fit. |
| |
| ## Rationale |
| |
| ### How much memory should we retain? |
| |
| As part of our goal to keep the program’s reported RSS to a minimum, we ideally |
| want to scavenge as many pages as it takes to track the program’s in-use memory. |
| |
| However, there’s a performance trade-off in tracking the program’s in-use memory |
| too closely. |
| For example, if the heap very suddenly shrinks but then grows again, there's a |
| significant cost in terms of syscalls and page faults incurred. |
| On the other hand, if we scavenge too passively, then the program’s reported |
| RSS may be inflated significantly. |
| |
This question is difficult to answer because allocators generally cannot
predict the future behavior of the application.
jemalloc avoids this question entirely, relying solely on having a good (albeit
complicated) answer to the “rate” question (see the next section).
| |
| But, as Austin mentioned in |
| [golang/go#16930](https://github.com/golang/go/issues/16930), Go has an |
| advantage over C/C++ allocators in this respect. |
| The Go runtime knows that before the next GC, the heap will grow to the heap |
| goal. |
| |
| This suggests that between GCs there may be some body of free memory that one |
| can drop with relatively few consequences. |
| Thus, I propose the following heuristic, borrowed from #16930: retain |
| `C*max(heap goal, max(heap goal over the last N GCs))` bytes of memory, and |
| scavenge the rest. |
| For a full rationale of the formula, see |
| [golang/go#16930](https://github.com/golang/go/issues/16930). |
| `C` is the "steady state variance factor" mentioned in #16930. |
`C` also provides a pool of unscavenged memory beyond that guaranteed by the
heap goal which the application may allocate from, increasing the probability
that a given allocation will be satisfied by unscavenged memory and thus not
incur a page fault on access.
| The initial proposed values for `C` and `N` are 1.125 (9/8) and 16, |
| respectively. |
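
A minimal sketch of this heuristic, using illustrative names
(`recentHeapGoals`, `retainGoal`) rather than the runtime’s, might look like
this:

```go
package scavenge

const (
	retainFactor = 1.125 // C, the steady state variance factor
	goalWindow   = 16    // N, the number of recent GC cycles to consider
)

// recentHeapGoals holds the heap goal for each of the last goalWindow GC
// cycles, including the current one.
var recentHeapGoals [goalWindow]uint64

// retainGoal returns how many bytes of free heap memory should be kept
// unscavenged; anything above this is a candidate for scavenging.
func retainGoal() uint64 {
	maxGoal := uint64(0)
	for _, goal := range recentHeapGoals {
		if goal > maxGoal {
			maxGoal = goal
		}
	}
	return uint64(retainFactor * float64(maxGoal))
}
```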
| |
| ### At what rate is memory scavenged? |
| |
| In order to have the application’s RSS track the amount of heap space it’s |
| actually using over time, we want to be able to grow and shrink the RSS at a |
| rate proportional to how the in-use memory of the application is growing and |
| shrinking, with smoothing. |
| |
| When it comes to growth, that problem is generally solved. |
| The application may cause the heap to grow, and the allocator will map new, |
| unscavenged memory in response. |
| Or, similarly, the application may allocate out of scavenged memory. |
| |
| On the flip side, figuring out the rate at which to shrink the RSS is harder. |
| Ideally the rate is “as soon as possible”, but unfortunately this could result |
| in latency issues. |
| |
jemalloc solves this by having its memory “decay” over time: each contiguous
extent of allocable memory decays at a globally-defined, tunable rate, and the
amount of memory that ends up available for scavenging at any given time is
governed by a sigmoid-like function.
| |
| The result of this policy is that the heap shrinks in sigmoidal fashion: |
| carefully turning down to smooth out noise in in-use memory but at some point |
| committing and scavenging lots of memory at once. |
| While this strategy works well in general, it’s still prone to making bad |
| decisions in certain cases, and relies on the developer to tune the decay rate |
| for the application. |
| Furthermore, I believe that this design by jemalloc was a direct result of not |
| knowing anything about the future state of the heap. |
| |
| As mentioned earlier, the Go runtime does know that the heap will grow to the |
| heap goal. |
| |
| Thus, I propose a *proportional scavenging policy*, in the same vein as the |
| runtime’s proportional sweeping implementation. |
| Because of how Go’s GC is paced, we know that the heap will grow to the heap |
| goal in the future and we can measure how quickly it’s approaching that goal by |
| seeing how quickly it’s allocating. |
| Between GCs, I propose that the scavenger do its best to scavenge down to the |
| scavenge goal by the time the next GC comes in. |
| |
| The proportional scavenger will run asynchronously, much like the Go runtime’s |
| background sweeper, but will be more aggressive, batching more scavenging work |
| if it finds itself falling behind. |
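
The following sketch shows one way the pacing computation might look, in the
spirit of proportional sweeping. All names are illustrative assumptions, not
the runtime’s actual implementation.

```go
package scavenge

// scavengeRate returns the rate, in bytes per second, at which the background
// scavenger should work so that retained free memory reaches goal by the time
// the heap is expected to reach heapGoal.
func scavengeRate(retained, goal, heapLive, heapGoal uint64, allocRate float64) float64 {
	if retained <= goal || heapLive >= heapGoal || allocRate <= 0 {
		return 0 // nothing to scavenge, or no basis for an estimate
	}
	// Estimate the time remaining until the next GC from how quickly the
	// application is allocating toward the heap goal.
	secondsLeft := float64(heapGoal-heapLive) / allocRate
	// Spread the remaining scavenging work evenly over that time; the
	// background scavenger batches more work per wake-up if it falls behind.
	return float64(retained-goal) / secondsLeft
}
```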
| |
One issue with this design arises when the application goes idle: since the
pacing is tied to the allocation rate, the scavenger would have no impetus to
make progress.
In that case, the scavenger will do at least one unit of work (scavenge one
span) on each wake-up to ensure it makes progress as long as there's work to
be done.
| |
| Another issue with having the scavenger be fully asynchronous is that the |
| application could actively create more work for the scavenger to do. |
| There are two ways this could happen: |
| * An allocation causes the heap to grow. |
| * An allocation is satisfied using scavenged memory. |

The former case is already eliminated by heap-growth scavenging.
The latter case may be eliminated by scavenging memory when we allocate from
scavenged memory, which we already do as of Go 1.12.
| |
| The additional scavenging during allocation could prove expensive, given the |
| costs associated with the madvise syscall. |
I believe we can dramatically reduce the number of times this is necessary by
reusing unscavenged memory before scavenged memory when allocating.
| Thus, where currently we try to find the best-fit span across both scavenged |
| and unscavenged spans, I propose we *prefer unscavenged to scavenged spans |
| during allocation*. |
| |
The benefit of this policy is that unscavenged spans are now significantly
more likely to be reused.
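
A minimal sketch of this allocation preference, assuming two address-ordered
first-fit structures (one for unscavenged spans, one for scavenged spans) with
hypothetical types and methods:

```go
package scavenge

// span describes a contiguous run of free memory, in 8 KiB runtime pages.
type span struct {
	base      uintptr // base address
	pages     uintptr // length in pages
	scavenged bool    // whether its physical pages were returned to the OS
}

// spanTreap stands in for the address-ordered, first-fit treap described in
// the Appendix.
type spanTreap struct{ /* elided */ }

// find returns (and removes) the lowest-addressed span of at least the given
// number of pages, or nil if none exists. (Elided.)
func (t *spanTreap) find(pages uintptr) *span { return nil }

// allocSpan satisfies a span allocation, preferring unscavenged memory so
// that scavenged spans stay scavenged for as long as possible.
func allocSpan(free, scav *spanTreap, pages uintptr) *span {
	if s := free.find(pages); s != nil {
		return s // cheap: already backed by physical pages
	}
	if s := scav.find(pages); s != nil {
		return s // more expensive: first accesses will incur page faults
	}
	return nil // caller must grow the heap
}
```

Keeping the two kinds of spans in separate structures also follows naturally
from the coalescing rule above, since scavenged and unscavenged spans never
merge.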
| |
| ### Which memory should we scavenge? |
| |
| At first, this question appears to be a lot like trying to pick an eviction |
| policy for caches or a page replacement policy. |
| The naive answer is thus to favor least-recently used memory, since there’s a |
| cost to allocating scavenged memory (much like a cache miss). |
| Indeed, this is the route which jemalloc takes. |
| |
| However unlike cache eviction policies or page replacement policies, which |
| cannot make any assumptions about memory accesses, scavenging policy is deeply |
| tied to allocation policy. |
| Fundamentally, we want to scavenge the memory that the allocator is least |
| likely to pick in the future. |
| |
| For a best-fit allocation policy, one idea (the current one) is to pick the |
| largest contiguous chunks of memory first for scavenging. |
| |
This scavenging policy does well and picks the spans least likely to be
reused, assuming that most allocations are small.
| If most allocations are small, then smaller contiguous free spaces will be used |
| up first, and larger ones may be scavenged with little consequence. |
| Consider also that even though the cost of scavenging memory is generally |
| proportional to how many physical pages are scavenged at once, scavenging |
| memory still has fixed costs that may be amortized by picking larger spans |
| first. |
| In essence, by making fewer madvise syscalls, we pay the cost of the syscall |
| itself less often. |
| In the cases where most span allocations aren’t small, however, we’ll be making |
| the same number of madvise syscalls but we will incur many more page faults. |
| |
| Thus, I propose a more robust alternative: *change the Go runtime’s span |
| allocation policy to be first-fit, rather than best-fit*. |
| Address-ordered first-fit allocation policies generally perform as well as |
| best-fit in practice when it comes to fragmentation [Johnstone98], a claim |
| which I verified holds true for the Go runtime by simulating a large span |
| allocation trace. |
| |
| Furthermore, I propose we then *scavenge the spans with the highest base address |
| first*. |
| The advantage of a first-fit allocation policy here is that we know something |
| about which chunks of memory will actually be chosen, which leads us to a |
| sensible scavenging policy. |
| |
| First-fit allocation paired with scavenging the "last" spans has a clear |
| preference for taking spans which are less likely to be used, even if the |
| assumption that most allocations are small does not hold. |
This policy is therefore more robust than the current one and should incur
fewer page faults overall.
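
Continuing with the hypothetical `span` type from the allocation sketch above,
the proposed scavenging order might look as follows; a sorted slice stands in
for the address-ordered treap, and `sysUnused` is a stand-in for the
platform’s madvise-like call.

```go
package scavenge

const pageSize = 8 << 10 // 8 KiB runtime pages

// scavengeDownTo releases free, unscavenged spans back to the OS, highest
// base address first, until the retained free memory drops to goal. The free
// slice must be sorted by base address; sysUnused performs the actual
// madvise-like call. It returns the number of bytes released.
func scavengeDownTo(free []*span, retained, goal uint64, sysUnused func(*span)) uint64 {
	released := uint64(0)
	for i := len(free) - 1; i >= 0 && retained > goal; i-- {
		s := free[i]
		if s.scavenged {
			continue // already returned to the OS
		}
		sysUnused(s) // e.g. madvise over the span's pages
		s.scavenged = true
		n := uint64(s.pages) * pageSize
		released += n
		if n >= retained {
			retained = 0
		} else {
			retained -= n
		}
	}
	return released
}
```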
| |
| There’s still the more general question of how performant this policy will be. |
First and foremost, efficient implementations of first-fit allocation exist
(see the Appendix).
| Secondly, a valid concern with this new policy is that it no longer amortizes |
| the fixed costs of scavenging because it may choose smaller spans to scavenge, |
| thereby making more syscalls to scavenge the same amount of memory. |
| |
| In the case where most span allocations are small, a first-fit allocation |
| policy actually works to our advantage since it tends to aggregate smaller |
| fragments at lower addresses and larger fragments at higher addresses |
| [Wilson95]. |
In this case I expect performance to be on par with best-fit allocation and
largest-spans-first scavenging. Where this assumption does not hold, the new
policy may indeed end up making more syscalls.
However, the sum of the marginal costs of scavenging generally outweighs the
fixed costs.
The one exception here is huge pages, which have very small marginal costs,
but it's unclear how good a job we or anyone else is doing at keeping huge
pages intact, and that question demands more research that is outside the
scope of this design.
| Overall, I suspect any performance degradation will be minimal. |
| |
| ## Implementation |
| |
| Michael Knyszek will implement this functionality. |
| The rough plan will be as follows: |
| 1. Remove the existing periodic scavenger. |
| 1. Track the last N heap goals, as in the prompt scavenging proposal. |
| 1. Add a background goroutine which performs proportional scavenging. |
| 1. Modify and augment the treap implementation to efficiently implement first-fit |
| allocation. |
| * This step will simultaneously change the policy to pick higher addresses |
| first without any additional work. |
| * Add tests to ensure the augmented treap works as intended. |
| |
| ## Other Considerations |
| |
| *Heap Lock Contention.* |
| Currently the scavenging process happens with the heap lock held. |
| With each scavenging operation taking on the order of 10µs, this can add up |
| fast and block progress. |
| The way jemalloc combats this is to give up the heap lock when actually making |
| any scavenging-related syscalls. |
| Unfortunately this comes with the caveat that any spans currently being |
| scavenged are not available for allocation, which could cause more heap growths |
| and discourage reuse of existing virtual address space. |
Also, on many modern operating systems a process’s memory map is protected by
a single coarse-grained read-write lock, and writers typically need to queue
behind readers.
Since scavengers are usually readers of this lock and heap growth is a writer
on it, letting go of the heap lock may not help as much as we’d hope.
| |
| ## Appendix: Implementing a First-fit Data Structure |
| |
| We can efficiently find the lowest-address available chunk of memory that also |
| satisfies the allocation request by modifying and augmenting any existing |
| balanced binary tree. |
| |
| For brevity we’ll focus just on the treap implementation in the runtime here. |
| The technique shown here is similar to that found in [Rezaei00]. |
The recipe for transforming our best-fit treap into a first-fit treap consists
of the following steps:
| |
| First, modify the existing treap implementation to sort by a span’s base |
| address. |
| |
Next, attach a new field called maxPages to each binary tree node.
This field represents the maximum size, in 8 KiB pages, of any span in the
subtree rooted at that node.
| |
For a leaf node, maxPages is always equal to the length of the node’s span.
This invariant is maintained every time the tree changes. For most balanced
trees, the tree may change in one of three ways: insertion, removal, and tree
rotation.
| |
| Tree rotations are simple: one only needs to update the two rotated nodes by |
| checking their span’s size and comparing it with maxPages of their left and |
| right subtrees, taking the maximum (effectively just recomputing maxPages |
| non-recursively). |
| |
A newly-inserted node in a treap is always a leaf, so that case is handled by
the leaf invariant above.
Once we insert it, however, any number of its ancestors may now have a stale
maxPages, so we start from the newly-inserted node and walk up the tree,
updating maxPages as we go.
Once we reach a node whose maxPages does not change, we may stop.
Then we may rotate the leaf into place. At most, we travel the height of the
tree in this update, but usually we’ll travel less.
| |
On removal, a treap uses rotations to make the node to-be-deleted a leaf. Once
the node becomes a leaf, we remove it, and then update its ancestors starting
from its former parent.
We may stop, as before, once maxPages is no longer affected by the change.
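
A minimal sketch of maintaining maxPages, using illustrative types rather than
the runtime’s actual treap:

```go
package scavenge

// treapNode is an illustrative, address-ordered treap node augmented with
// maxPages.
type treapNode struct {
	parent, left, right *treapNode
	spanPages           uintptr // size of this node's span in 8 KiB pages
	maxPages            uintptr // maximum span size in the subtree rooted here
}

// recomputeMaxPages recalculates maxPages for a single node from its own span
// and its children, exactly as done during a rotation.
func (n *treapNode) recomputeMaxPages() {
	m := n.spanPages
	if n.left != nil && n.left.maxPages > m {
		m = n.left.maxPages
	}
	if n.right != nil && n.right.maxPages > m {
		m = n.right.maxPages
	}
	n.maxPages = m
}

// updateAncestors walks up the tree from n after an insertion or removal,
// fixing maxPages until it reaches a node whose value does not change.
func updateAncestors(n *treapNode) {
	for n != nil {
		old := n.maxPages
		n.recomputeMaxPages()
		if n.maxPages == old {
			break // ancestors beyond this point are unaffected
		}
		n = n.parent
	}
}
```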
| |
| Finally, we modify the algorithm which finds a suitable span to use for |
| allocation, or returns nil if one is not found. |
| To find the first-fit span in the tree we leverage maxPages in the following |
| algorithm (in pseudo-Go pseudocode): |
| |
| ``` |
 1 func Find(root, pages):
 2     t = root
 3     for t != nil:
 4         if t.left != nil and t.left.maxPages >= pages:
 5             t = t.left
 6         else if t.span.pages >= pages:
 7             return t.span
 8         else if t.right != nil and t.right.maxPages >= pages:
 9             t = t.right
10         else:
11             return nil
| ``` |
| |
| By only going down paths where we’re sure there’s at least one span that can |
| satisfy the allocation, we ensure that the algorithm always returns a span of |
| at least `pages` in size. |
| |
| Because we prefer going left if possible (line 4) over taking the current |
| node’s span (line 6) over going right if possible (line 8), we ensure that we |
| allocate the node with the lowest base address. |
| |
| The case where we cannot go left, cannot take the current node, and cannot go |
| right (line 10) should only be possible at the root if maxPages is managed |
| properly. |
| That is, just by looking at the root, one can tell whether an allocation |
| request is satisfiable. |
| |
| ## References |
| |
| Johnstone, Mark S., and Paul R. Wilson. "The Memory Fragmentation Problem: |
| Solved?" Proceedings of the First International Symposium on Memory Management - |
| ISMM 98, 1998. doi:10.1145/286860.286864. |
| |
| M. Rezaei and K. M. Kavi, "A new implementation technique for memory |
| management," Proceedings of the IEEE SoutheastCon 2000. 'Preparing for The New |
| Millennium' (Cat.No.00CH37105), Nashville, TN, USA, 2000, pp. 332-339. |
| doi:10.1109/SECON.2000.845587 |
| |
| Wilson, Paul R., Mark S. Johnstone, Michael Neely, and David Boles. "Dynamic |
| Storage Allocation: A Survey and Critical Review." Memory Management Lecture |
| Notes in Computer Science, 1995, 1-116. doi:10.1007/3-540-60368-9_19. |