| # Proposal: Smarter Scavenging |
| |
| Author(s): Michael Knyszek \<mknyszek@google.com\> Last Updated: 2019-02-20 |
| |
| ## Motivation & Purpose |
| |
| Out-of-memory errors (OOMs) have long been a pain-point for Go applications. |
| A class of these errors come from the same underlying cause: a temporary spike |
| in memory causes the Go runtime to grow the heap, but it takes a very long time |
| (on the order of minutes) to return that unneeded memory back to the system. |
| |
| The system can end up killing the application in many situations, such as if |
| the system has no swap space or if system monitors count this space against |
| your application. |
| In addition, if this additional space is counted against your application, you |
| end up paying more for memory when you don’t really need it. |
| |
| The Go runtime does have internal mechanisms to help deal with this, but they |
| don’t react to changes in the application promptly enough. |
| The way users solve this problem today is through a runtime library function |
| called `debug.FreeOSMemory`. |
| `debug.FreeOSMemory` performs a GC and subsequently returns all unallocated |
| memory back to the underlying system. |
| However, this solution is very heavyweight: |
| * Returning all free memory back to the underlying system at once is expensive, |
| and can lead to latency spikes as it holds the heap lock through the whole |
| process. |
| * It’s an invasive solution: you need to modify your code to call it when you |
| need it. |
| * Reusing free chunks of memory becomes more expensive. On UNIX-y systems that |
| means an extra page fault (which is surprisingly expensive on some systems). |
| |
| The purpose of this document is to propose we replace the existing mechanisms in |
| the runtime with something stronger that responds promptly to the memory |
| requirements of Go applications, ensuring the application is only charged for as |
| much as it needs to remain performant. |
| |
| ## Background |
| |
| ### Scavenging |
| |
Dynamic memory allocators typically obtain memory from the operating system by
requesting that it be mapped into their virtual address space.
| Sometimes this space ends up unused, and modern operating systems provide a way |
| to tell the OS that certain virtual memory address regions won’t be used |
| without unmapping them. This means the physical memory backing those regions |
| may be taken back by the OS and used elsewhere. |
| We in the Go runtime refer to this technique as “scavenging”. |
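
For example, on Linux this hint is delivered via `madvise`. The sketch below
is not what the runtime literally does (it has its own internal routines for
this, e.g. `sysUnused`); it only illustrates the mechanism, assuming the
`golang.org/x/sys/unix` package.

```go
package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

func main() {
	// Map 1 MiB of anonymous memory, roughly analogous to a heap arena.
	mem, err := unix.Mmap(-1, 0, 1<<20,
		unix.PROT_READ|unix.PROT_WRITE,
		unix.MAP_ANON|unix.MAP_PRIVATE)
	if err != nil {
		panic(err)
	}
	// Touch the memory so it is backed by physical pages (counts toward RSS).
	for i := range mem {
		mem[i] = 1
	}
	// "Scavenge": tell the OS the physical pages may be reclaimed while
	// keeping the virtual mapping intact. MADV_FREE is a cheaper alternative
	// on newer kernels.
	if err := unix.Madvise(mem, unix.MADV_DONTNEED); err != nil {
		panic(err)
	}
	fmt.Println("region scavenged; the next access will fault pages back in")
}
```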
| |
Scavenging is especially useful in dealing with page-level external
fragmentation, since we can give these fragments back to the OS, reducing the
process’s resident set size (RSS), that is, the amount of the application’s
address space that is backed by physical memory.
| |
| ### Go 1.11 |
| |
As of Go 1.11, the only scavenging process in the Go runtime was a periodic
scavenger, which ran every 2.5 minutes.
This scavenger combed over all the free spans in the heap and scavenged them
if they had been unused for at least 5 minutes.
When the runtime coalesced spans, it would track how much of the new span was
scavenged.
| |
| While this simple technique is surprisingly effective for long-running |
| applications, the peak RSS of an application can end up wildly exaggerated in |
| many circumstances, even though the application’s peak in-use memory is |
| significantly smaller. |
| The periodic scavenger just does not react quickly enough to changes in the |
| application’s memory usage. |
| |
| ### Go 1.12 |
| |
| As of Go 1.12, in addition to the periodic scavenger, the Go runtime |
| also performs heap-growth scavenging. |
On each heap growth, up to N bytes of the largest free spans are scavenged,
where N is the number of bytes the heap grew by.
| The idea here is to “pay back” the cost of a heap growth. |
| This technique helped to reduce the peak RSS of some applications. |
| |
| #### Note on Span Coalescing Rules |
| |
As part of the Go 1.12 release, the span coalescing rules were changed such
that scavenged and unscavenged spans no longer coalesce.
| |
| Earlier in the Go 1.12 cycle a choice was made to coalesce the two different |
| kinds of spans by scavenging them, but this turned out to be far too aggressive |
| in practice since most spans would become scavenged over time. |
| This policy was especially costly if scavenging was particularly expensive on a |
| given platform. |
| |
In addition to avoiding the problem above, there’s a key reason not to merge
across this boundary: if most spans end up scavenged over time, then we do not
have the fine-grained control over memory that we need to create good policies
and mechanisms for scavenging memory.
| |
| ### Prior Art |
| |
| For scavenging, we look to C/C++ allocators which have a much richer history of |
| scavenging memory than allocators for managed languages. For example, the |
| HotSpot VM just started scavenging memory, and even then its policies are very |
| conservative, [only returning memory during low application |
| activity](https://openjdk.java.net/jeps/346). |
| The [Shenandoah |
| collector](https://mail.openjdk.java.net/pipermail/hotspot-gc-dev/2018-June/022203.html) |
| has had this functionality for a little while, but it just does the same thing |
| Go 1.11 did, as far as I can tell. |
| |
| For the purposes of this document, we will focus our comparisons on |
| [jemalloc](https://jemalloc.net), which appears to me to be the state-of-the-art |
| in scavenging. |
| |
| ## Goals |
| |
| The goal in scavenging smarter is two-fold: |
| * Reduce the average and peak RSS of Go applications. |
| * Minimize the CPU impact of keeping the RSS low. |
| |
The two goals are in tension with each other. On the one hand, you want to
keep the RSS of the application as close to its in-use memory usage as
possible.
On the other hand, doing so is expensive in terms of CPU time, since it means
making more syscalls and handling more page faults.
If we’re too aggressive and scavenge every free space we have, then every span
allocation effectively incurs a hard page fault (or invokes a syscall), and we
make a syscall on every span free.
| |
| The ideal scenario, in my view, is that the RSS of the application “tracks” the |
| peak in-use memory over time. |
| * We should keep the RSS close to the actual in-use heap, but leave enough of a |
| buffer such that the application has a pool of unscavenged memory to allocate |
| from. |
| * We should try to smooth over fast and transient changes in heap size. |
| |
The goal of this proposal is to improve the Go runtime’s scavenging mechanisms
such that they exhibit the behavior described above.
| Compared with today’s implementation, this behavior should reduce the average |
| overall RSS of most Go applications with minimal impact on performance. |
| |
| ## Proposal |
| |
| Three questions represent the key policy decisions that describe a memory |
| scavenging system. |
| 1. At what rate is memory scavenged? |
| 1. How much memory should we retain (not scavenge)? |
| 1. Which memory should we scavenge? |
| |
| I propose that for the Go runtime, we: |
| 1. Scavenge at a rate proportional to the rate at which the application is |
| allocating memory. |
| 1. Retain some constant times the peak heap goal over the last `N` GCs. |
| 1. Scavenge the unscavenged spans with the highest base addresses first. |
| |
| Additionally, I propose we change the span allocation policy to prefer |
| unscavenged spans over scavenged spans, and to be first-fit rather than |
| best-fit. |
| |
| ## Rationale |
| |
| ### How much memory should we retain? |
| |
| As part of our goal to keep the program’s reported RSS to a minimum, we ideally |
| want to scavenge as many pages as it takes to track the program’s in-use memory. |
| |
| However, there’s a performance trade-off in tracking the program’s in-use memory |
| too closely. |
| For example, if the heap very suddenly shrinks but then grows again, there's a |
| significant cost in terms of syscalls and page faults incurred. |
| On the other hand, if we scavenge too passively, then the program’s reported |
| RSS may be inflated significantly. |
| |
This question is difficult to answer because allocators generally cannot
predict the future behavior of the application.
jemalloc avoids this question entirely, relying solely on having a good (albeit
complicated) answer to the “rate” question (see the next section).
| |
| But, as Austin mentioned in |
| [golang/go#16930](https://github.com/golang/go/issues/16930), Go has an |
| advantage over C/C++ allocators in this respect. |
| The Go runtime knows that before the next GC, the heap will grow to the heap |
| goal. |
| |
| This suggests that between GCs there may be some body of free memory that one |
| can drop with relatively few consequences. |
| Thus, I propose the following heuristic, borrowed from #16930: retain |
| `C*max(heap goal, max(heap goal over the last N GCs))` bytes of memory, and |
| scavenge the rest. |
| For a full rationale of the formula, see |
| [golang/go#16930](https://github.com/golang/go/issues/16930). |
| `C` is the "steady state variance factor" mentioned in #16930. |
`C` also provides a pool of unscavenged memory beyond that guaranteed by the
heap goal which the application may allocate from, increasing the probability
that a given allocation will be satisfied by unscavenged memory and thus not
incur a page fault on access.
| The initial proposed values for `C` and `N` are 1.125 (9/8) and 16, |
| respectively. |
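
A minimal sketch of this heuristic, using illustrative names
(`recentHeapGoals`, `retainGoal`) rather than the runtime’s, might look like
this:

```go
package scavenge

const (
	retainFactor = 1.125 // C, the steady state variance factor
	goalWindow   = 16    // N, the number of recent GC cycles to consider
)

// recentHeapGoals holds the heap goal for each of the last goalWindow GC
// cycles, including the current one.
var recentHeapGoals [goalWindow]uint64

// retainGoal returns how many bytes of free heap memory should be kept
// unscavenged; anything above this is a candidate for scavenging.
func retainGoal() uint64 {
	maxGoal := uint64(0)
	for _, goal := range recentHeapGoals {
		if goal > maxGoal {
			maxGoal = goal
		}
	}
	return uint64(retainFactor * float64(maxGoal))
}
```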
| |
| ### At what rate is memory scavenged? |
| |
| In order to have the application’s RSS track the amount of heap space it’s |
| actually using over time, we want to be able to grow and shrink the RSS at a |
| rate proportional to how the in-use memory of the application is growing and |
| shrinking, with smoothing. |
| |
| When it comes to growth, that problem is generally solved. |
| The application may cause the heap to grow, and the allocator will map new, |
| unscavenged memory in response. |
| Or, similarly, the application may allocate out of scavenged memory. |
| |
| On the flip side, figuring out the rate at which to shrink the RSS is harder. |
| Ideally the rate is “as soon as possible”, but unfortunately this could result |
| in latency issues. |
| |
jemalloc solves this by having its memory “decay” over time: each contiguous
extent of allocable memory decays at a globally-defined, tunable rate, and the
amount of memory that ends up available for scavenging at any given time is
governed by a sigmoid-like function.
| |
| The result of this policy is that the heap shrinks in sigmoidal fashion: |
| carefully turning down to smooth out noise in in-use memory but at some point |
| committing and scavenging lots of memory at once. |
| While this strategy works well in general, it’s still prone to making bad |
| decisions in certain cases, and relies on the developer to tune the decay rate |
| for the application. |
| Furthermore, I believe that this design by jemalloc was a direct result of not |
| knowing anything about the future state of the heap. |
| |
| As mentioned earlier, the Go runtime does know that the heap will grow to the |
| heap goal. |
| |
| Thus, I propose a *proportional scavenging policy*, in the same vein as the |
| runtime’s proportional sweeping implementation. |
| Because of how Go’s GC is paced, we know that the heap will grow to the heap |
| goal in the future and we can measure how quickly it’s approaching that goal by |
| seeing how quickly it’s allocating. |
| Between GCs, I propose that the scavenger do its best to scavenge down to the |
| scavenge goal by the time the next GC comes in. |
| |
| The proportional scavenger will run asynchronously, much like the Go runtime’s |
| background sweeper, but will be more aggressive, batching more scavenging work |
| if it finds itself falling behind. |
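
The following sketch shows one way the pacing computation might look, in the
spirit of proportional sweeping. All names are illustrative assumptions, not
the runtime’s actual implementation.

```go
package scavenge

// scavengeRate returns the rate, in bytes per second, at which the background
// scavenger should work so that retained free memory reaches goal by the time
// the heap is expected to reach heapGoal.
func scavengeRate(retained, goal, heapLive, heapGoal uint64, allocRate float64) float64 {
	if retained <= goal || heapLive >= heapGoal || allocRate <= 0 {
		return 0 // nothing to scavenge, or no basis for an estimate
	}
	// Estimate the time remaining until the next GC from how quickly the
	// application is allocating toward the heap goal.
	secondsLeft := float64(heapGoal-heapLive) / allocRate
	// Spread the remaining scavenging work evenly over that time; the
	// background scavenger batches more work per wake-up if it falls behind.
	return float64(retained-goal) / secondsLeft
}
```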
| |
One issue with this design arises when the application goes idle: since the
pacing is tied to the allocation rate, the scavenger would have no impetus to
make progress.
In that case, the scavenger will do at least one unit of work (scavenge one
span) on each wake-up to ensure it makes progress as long as there's work to
be done.
| |
| Another issue with having the scavenger be fully asynchronous is that the |
| application could actively create more work for the scavenger to do. |
| There are two ways this could happen: |
| * An allocation causes the heap to grow. |
| * An allocation is satisfied using scavenged memory. |

The former case is already eliminated by heap-growth scavenging.
The latter case may be eliminated by scavenging memory when we allocate from
scavenged memory, which we already do as of Go 1.12.
| |
| The additional scavenging during allocation could prove expensive, given the |
| costs associated with the madvise syscall. |
I believe we can dramatically reduce the number of times this is necessary by
reusing unscavenged memory before scavenged memory when allocating.
| Thus, where currently we try to find the best-fit span across both scavenged |
| and unscavenged spans, I propose we *prefer unscavenged to scavenged spans |
| during allocation*. |
| |
The benefit of this policy is that unscavenged spans are now significantly
more likely to be reused.
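
A minimal sketch of this allocation preference, assuming two address-ordered
first-fit structures (one for unscavenged spans, one for scavenged spans) with
hypothetical types and methods:

```go
package scavenge

// span describes a contiguous run of free memory, in 8 KiB runtime pages.
type span struct {
	base      uintptr // base address
	pages     uintptr // length in pages
	scavenged bool    // whether its physical pages were returned to the OS
}

// spanTreap stands in for the address-ordered, first-fit treap described in
// the Appendix.
type spanTreap struct{ /* elided */ }

// find returns (and removes) the lowest-addressed span of at least the given
// number of pages, or nil if none exists. (Elided.)
func (t *spanTreap) find(pages uintptr) *span { return nil }

// allocSpan satisfies a span allocation, preferring unscavenged memory so
// that scavenged spans stay scavenged for as long as possible.
func allocSpan(free, scav *spanTreap, pages uintptr) *span {
	if s := free.find(pages); s != nil {
		return s // cheap: already backed by physical pages
	}
	if s := scav.find(pages); s != nil {
		return s // more expensive: first accesses will incur page faults
	}
	return nil // caller must grow the heap
}
```

Keeping the two kinds of spans in separate structures also follows naturally
from the coalescing rule above, since scavenged and unscavenged spans never
merge.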
| |
| ### Which memory should we scavenge? |
| |
| At first, this question appears to be a lot like trying to pick an eviction |
| policy for caches or a page replacement policy. |
| The naive answer is thus to favor least-recently used memory, since there’s a |
| cost to allocating scavenged memory (much like a cache miss). |
| Indeed, this is the route which jemalloc takes. |
| |
| However unlike cache eviction policies or page replacement policies, which |
| cannot make any assumptions about memory accesses, scavenging policy is deeply |
| tied to allocation policy. |
| Fundamentally, we want to scavenge the memory that the allocator is least |
| likely to pick in the future. |
| |
| For a best-fit allocation policy, one idea (the current one) is to pick the |
| largest contiguous chunks of memory first for scavenging. |
| |
This scavenging policy does well and picks the spans least likely to be
reused, assuming that most allocations are small.
| If most allocations are small, then smaller contiguous free spaces will be used |
| up first, and larger ones may be scavenged with little consequence. |
| Consider also that even though the cost of scavenging memory is generally |
| proportional to how many physical pages are scavenged at once, scavenging |
| memory still has fixed costs that may be amortized by picking larger spans |
| first. |
| In essence, by making fewer madvise syscalls, we pay the cost of the syscall |
| itself less often. |
| In the cases where most span allocations aren’t small, however, we’ll be making |
| the same number of madvise syscalls but we will incur many more page faults. |
| |
| Thus, I propose a more robust alternative: *change the Go runtime’s span |
| allocation policy to be first-fit, rather than best-fit*. |
| Address-ordered first-fit allocation policies generally perform as well as |
| best-fit in practice when it comes to fragmentation [Johnstone98], a claim |
| which I verified holds true for the Go runtime by simulating a large span |
| allocation trace. |
| |
| Furthermore, I propose we then *scavenge the spans with the highest base address |
| first*. |
| The advantage of a first-fit allocation policy here is that we know something |
| about which chunks of memory will actually be chosen, which leads us to a |
| sensible scavenging policy. |
| |
| First-fit allocation paired with scavenging the "last" spans has a clear |
| preference for taking spans which are less likely to be used, even if the |
| assumption that most allocations are small does not hold. |
This policy is therefore more robust than the current one and should incur
fewer page faults overall.
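
Continuing with the hypothetical `span` type from the allocation sketch above,
the proposed scavenging order might look as follows; a sorted slice stands in
for the address-ordered treap, and `sysUnused` is a stand-in for the
platform’s madvise-like call.

```go
package scavenge

const pageSize = 8 << 10 // 8 KiB runtime pages

// scavengeDownTo releases free, unscavenged spans back to the OS, highest
// base address first, until the retained free memory drops to goal. The free
// slice must be sorted by base address; sysUnused performs the actual
// madvise-like call. It returns the number of bytes released.
func scavengeDownTo(free []*span, retained, goal uint64, sysUnused func(*span)) uint64 {
	released := uint64(0)
	for i := len(free) - 1; i >= 0 && retained > goal; i-- {
		s := free[i]
		if s.scavenged {
			continue // already returned to the OS
		}
		sysUnused(s) // e.g. madvise over the span's pages
		s.scavenged = true
		n := uint64(s.pages) * pageSize
		released += n
		if n >= retained {
			retained = 0
		} else {
			retained -= n
		}
	}
	return released
}
```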
| |
| There’s still the more general question of how performant this policy will be. |
First and foremost, efficient implementations of first-fit allocation exist
(see the Appendix).
| Secondly, a valid concern with this new policy is that it no longer amortizes |
| the fixed costs of scavenging because it may choose smaller spans to scavenge, |
| thereby making more syscalls to scavenge the same amount of memory. |
| |
| In the case where most span allocations are small, a first-fit allocation |
| policy actually works to our advantage since it tends to aggregate smaller |
| fragments at lower addresses and larger fragments at higher addresses |
| [Wilson95]. |
In this case I expect performance to be on par with best-fit allocation and
largest-spans-first scavenging. Where this assumption does not hold, the new
policy may indeed end up making more syscalls.
However, the sum of the marginal costs of scavenging generally outweighs the
fixed costs.
The one exception here is huge pages, which have very small marginal costs,
but it's unclear how good a job we or anyone else is doing at keeping huge
pages intact, and that question demands more research that is outside the
scope of this design.
| Overall, I suspect any performance degradation will be minimal. |
| |
| ## Implementation |
| |
| Michael Knyszek will implement this functionality. |
| The rough plan will be as follows: |
| 1. Remove the existing periodic scavenger. |
| 1. Track the last N heap goals, as in the prompt scavenging proposal. |
| 1. Add a background goroutine which performs proportional scavenging. |
| 1. Modify and augment the treap implementation to efficiently implement first-fit |
| allocation. |
| * This step will simultaneously change the policy to pick higher addresses |
| first without any additional work. |
| * Add tests to ensure the augmented treap works as intended. |
| |
| ## Other Considerations |
| |
| *Heap Lock Contention.* |
| Currently the scavenging process happens with the heap lock held. |
| With each scavenging operation taking on the order of 10µs, this can add up |
| fast and block progress. |
| The way jemalloc combats this is to give up the heap lock when actually making |
| any scavenging-related syscalls. |
| Unfortunately this comes with the caveat that any spans currently being |
| scavenged are not available for allocation, which could cause more heap growths |
| and discourage reuse of existing virtual address space. |
Also, on many modern operating systems a process’s memory map is protected by
a single coarse-grained read-write lock, and writers typically need to queue
behind readers.
Since scavengers are usually readers of this lock and heap growth is a writer
on it, letting go of the heap lock may not help as much as we’d hope.
| |
| ## Appendix: Implementing a First-fit Data Structure |
| |
| We can efficiently find the lowest-address available chunk of memory that also |
| satisfies the allocation request by modifying and augmenting any existing |
| balanced binary tree. |
| |
| For brevity we’ll focus just on the treap implementation in the runtime here. |
| The technique shown here is similar to that found in [Rezaei00]. |
The recipe for transforming our best-fit treap into a first-fit treap consists
of the following steps:
| |
| First, modify the existing treap implementation to sort by a span’s base |
| address. |
| |
Next, attach a new field called maxPages to each binary tree node.
This field represents the maximum size, in 8 KiB pages, of any span in the
subtree rooted at that node.
| |
For a leaf node, maxPages is always equal to the length of the node’s span.
This invariant is maintained every time the tree changes. For most balanced
trees, the tree may change in one of three ways: insertion, removal, and tree
rotation.
| |
| Tree rotations are simple: one only needs to update the two rotated nodes by |
| checking their span’s size and comparing it with maxPages of their left and |
| right subtrees, taking the maximum (effectively just recomputing maxPages |
| non-recursively). |
| |
A newly-inserted node in a treap is always a leaf, so that case is handled by
the leaf invariant above.
Once we insert it, however, any number of its ancestors may now have a stale
maxPages, so we start from the newly-inserted node and walk up the tree,
updating maxPages as we go.
Once we reach a node whose maxPages does not change, we may stop.
Then we may rotate the leaf into place. At most, we travel the height of the
tree in this update, but usually we’ll travel less.
| |
On removal, a treap uses rotations to make the node to-be-deleted a leaf. Once
the node becomes a leaf, we remove it, and then update its ancestors starting
from its former parent.
We may stop, as before, once maxPages is no longer affected by the change.
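
A minimal sketch of maintaining maxPages, using illustrative types rather than
the runtime’s actual treap:

```go
package scavenge

// treapNode is an illustrative, address-ordered treap node augmented with
// maxPages.
type treapNode struct {
	parent, left, right *treapNode
	spanPages           uintptr // size of this node's span in 8 KiB pages
	maxPages            uintptr // maximum span size in the subtree rooted here
}

// recomputeMaxPages recalculates maxPages for a single node from its own span
// and its children, exactly as done during a rotation.
func (n *treapNode) recomputeMaxPages() {
	m := n.spanPages
	if n.left != nil && n.left.maxPages > m {
		m = n.left.maxPages
	}
	if n.right != nil && n.right.maxPages > m {
		m = n.right.maxPages
	}
	n.maxPages = m
}

// updateAncestors walks up the tree from n after an insertion or removal,
// fixing maxPages until it reaches a node whose value does not change.
func updateAncestors(n *treapNode) {
	for n != nil {
		old := n.maxPages
		n.recomputeMaxPages()
		if n.maxPages == old {
			break // ancestors beyond this point are unaffected
		}
		n = n.parent
	}
}
```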
| |
| Finally, we modify the algorithm which finds a suitable span to use for |
| allocation, or returns nil if one is not found. |
| To find the first-fit span in the tree we leverage maxPages in the following |
| algorithm (in pseudo-Go pseudocode): |
| |
| ``` |
 1 func Find(root, pages):
 2     t = root
 3     for t != nil:
 4         if t.left != nil and t.left.maxPages >= pages:
 5             t = t.left
 6         else if t.span.pages >= pages:
 7             return t.span
 8         else if t.right != nil and t.right.maxPages >= pages:
 9             t = t.right
10         else:
11             return nil
| ``` |
| |
| By only going down paths where we’re sure there’s at least one span that can |
| satisfy the allocation, we ensure that the algorithm always returns a span of |
| at least `pages` in size. |
| |
| Because we prefer going left if possible (line 4) over taking the current |
| node’s span (line 6) over going right if possible (line 8), we ensure that we |
| allocate the node with the lowest base address. |
| |
| The case where we cannot go left, cannot take the current node, and cannot go |
| right (line 10) should only be possible at the root if maxPages is managed |
| properly. |
| That is, just by looking at the root, one can tell whether an allocation |
| request is satisfiable. |
| |
| ## References |
| |
| Johnstone, Mark S., and Paul R. Wilson. "The Memory Fragmentation Problem: |
| Solved?" Proceedings of the First International Symposium on Memory Management - |
| ISMM 98, 1998. doi:10.1145/286860.286864. |
| |
| M. Rezaei and K. M. Kavi, "A new implementation technique for memory |
| management," Proceedings of the IEEE SoutheastCon 2000. 'Preparing for The New |
| Millennium' (Cat.No.00CH37105), Nashville, TN, USA, 2000, pp. 332-339. |
| doi:10.1109/SECON.2000.845587 |
| |
| Wilson, Paul R., Mark S. Johnstone, Michael Neely, and David Boles. "Dynamic |
| Storage Allocation: A Survey and Critical Review." Memory Management Lecture |
| Notes in Computer Science, 1995, 1-116. doi:10.1007/3-540-60368-9_19. |