Proposal: Improve scalability of runtime.lock2

Author(s): Rhys Hiltner

Last updated: 2024-10-16

Discussion at https://go.dev/issue/68578.

Abstract

Improve multi-core scalability of the runtime's internal mutex implementation by minimizing wakeups of waiting threads.

Avoiding wakeups of threads that are waiting for the lock allows those threads to sleep for longer. That reduces the number of concurrent threads that are attempting to read the mutex's state word. Fewer reads of that cache line mean less cache coherency traffic within the processor when a thread needs to make an update. Fast updates (to acquire and release the lock) even when many threads need the lock mean better scalability.

This is not an API change, so it is not part of the formal proposal process.

Background

One of the simplest mutex designs is a single bit that is “0” when unlocked or “1” when locked. To acquire the lock, a thread attempts to swap in a “1”, looping until the result it gets is “0”. To unlock, the thread swaps in a “0”.
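As a concrete (and deliberately naive) illustration, such a spinlock can be written with Go's sync/atomic package. This sketch is for exposition only; it is unrelated to the runtime's actual implementation.

```go
package spinlock

import "sync/atomic"

// spinLock is the single-bit design described above: for illustration
// only, not how runtime.mutex works.
type spinLock struct {
	state uint32 // 0 = unlocked, 1 = locked
}

func (l *spinLock) lock() {
	// Swap in a 1 until the previous value was 0.
	for atomic.SwapUint32(&l.state, 1) != 0 {
		// Busy-wait: keep rereading and swapping the state word.
	}
}

func (l *spinLock) unlock() {
	// Swap in a 0 to release the lock.
	atomic.SwapUint32(&l.state, 0)
}
```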

The performance of such a spinlock is poor in at least two ways. First, threads that are trying to acquire an already-held lock waste their own on-CPU time. Second, each spinning thread occupies hardware resources that must keep a local copy of the mutex state word in cache.

Having the state word in cache for read access requires it not be writeable by any other processors. Writing to that memory location requires the hardware to invalidate all cached copies of that memory, one in each processor that had loaded it for reading. The hardware-internal communication necessary to implement those guarantees has a cost, which appears as a slowdown when writing to that memory location.

Go's current mutex design is several steps more advanced than the simple spinlock, but under certain conditions its performance can degrade in a similar way. First, when runtime.lock2 is unable to immediately obtain the mutex it will pause for a moment before retrying, primarily using hardware-level delay instructions (such as PAUSE on 386 and amd64). Then, if it's unable to acquire the mutex after several retries it will ask the OS to put it to sleep until another thread requests a wakeup. On Linux, we use the futex syscall to sleep directly on the mutex address, implemented in src/runtime/lock_futex.go. On many other platforms (including Windows and macOS), the waiting threads form a LIFO stack with the mutex state word as a pointer to the top of the stack, implemented in src/runtime/lock_sema.go.

When the futex syscall is available, the OS maintains a list of waiting threads and will choose which it wakes. Otherwise, the Go runtime maintains that list and names a specific thread when it asks the OS to do a wakeup. To avoid a futex syscall when there's no contention, we split the “locked” state into two variants: 1 meaning “locked with no contention” and 2 meaning “locked, and a thread may be asleep”. (With the semaphore-based implementation, the Go runtime can, and must, know for itself whether a thread is asleep.) Go's mutex implementation has those three logical states (unlocked, locked, locked-with-sleepers) on all multi-threaded platforms. For the purposes of the Go runtime, I'm calling this design “tristate”.
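For intuition, here is a minimal user-space sketch of that tristate futex design. It is not the runtime's code: futexSleep and futexWake are placeholders for the futex(2) wrappers in src/runtime/lock_futex.go, and the real runtime.lock2 also spins and yields before sleeping.

```go
package tristate

import "sync/atomic"

// User-space sketch of the tristate design; futexSleep and futexWake are
// placeholders for the futex(2) wrappers used in src/runtime/lock_futex.go.
const (
	unlocked = 0 // no owner
	locked   = 1 // owned, and no waiter is known to be asleep
	sleeping = 2 // owned, and a waiter may be asleep on the futex
)

func futexSleep(addr *uint32, val uint32) { /* placeholder: futex(addr, FUTEX_WAIT, val, ...) */ }
func futexWake(addr *uint32)              { /* placeholder: futex(addr, FUTEX_WAKE, 1, ...) */ }

type mutex struct{ key uint32 }

func (m *mutex) lock() {
	// Fast path: an unlocked mutex is taken with a single swap.
	if atomic.SwapUint32(&m.key, locked) == unlocked {
		return
	}
	// Slow path: announce that a waiter may sleep, then sleep until the
	// holder wakes us. Taking the lock here leaves the conservative
	// "sleeping" state behind, which at worst causes a spurious wakeup.
	// (The real runtime.lock2 also spins and yields before sleeping.)
	for {
		if atomic.SwapUint32(&m.key, sleeping) == unlocked {
			return
		}
		futexSleep(&m.key, sleeping)
	}
}

func (m *mutex) unlock() {
	if atomic.SwapUint32(&m.key, unlocked) == sleeping {
		futexWake(&m.key) // a waiter may be asleep; wake exactly one
	}
}
```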

After releasing the mutex, runtime.unlock2 will wake a thread whenever one is sleeping. It does not consider whether one of the waiting threads is already awake. If a waiting thread is already awake, it's not necessary to wake another.

Waking additional threads results in higher concurrent demand for the mutex state word's cache line. Every thread that is awake and spinning in a loop to reload the state word leads to more cache coherency traffic within the processor, and to slower writes to that cache line.

Consider the case where many threads all need to use the same mutex many times in a row. Furthermore, consider that the critical section is short relative to the time it takes a thread to give up on spinning and go (back) to sleep. At the end of each critical section, the thread that is releasing the mutex will see that a waiting thread is asleep, and will wake it. It takes a relatively long time for a thread to decide to go to sleep, and there's a relatively short time until the next runtime.unlock2 call will wake it. Many threads will be awake, all reloading the state word in a loop, all slowing down updates to its value.

Without a limit on the number of threads that can spin on the state word, higher demand for a mutex value degrades its performance.

See also https://go.dev/issue/68578.

Proposal

Expand the mutex state word to include a new flag, “spinning”. Use the “spinning” bit to communicate whether one of the waiting threads is awake and looping while trying to acquire the lock. Threads mutually exclude each other from the “spinning” state, but they won't block while attempting to acquire the bit. Only the thread that owns the “spinning” bit is allowed to reload the state word in a loop. It releases the “spinning” bit before going to sleep. The other waiting threads go directly to sleep. The thread that unlocks a mutex can avoid waking a thread if it sees that one is already awake and spinning. For the purposes of the Go runtime, I'm calling this design “spinbit”.
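The following is a simplified sketch of that idea in terms of sync/atomic, using a 32-bit state word and placeholder park/unpark functions. It illustrates the design only; it is not the code in the CLs discussed below, and details such as managing the sleeper list are omitted.

```go
package spinbit

import "sync/atomic"

// Simplified sketch of the “spinbit” idea; park/unpark are placeholders
// for the OS sleep/wake mechanism.
const (
	lockedFlag   uint32 = 1 << 0 // the mutex is held
	sleepingFlag uint32 = 1 << 1 // one or more waiters may be asleep
	spinningFlag uint32 = 1 << 2 // exactly one waiter is awake and spinning
)

func park(state *atomic.Uint32)   { /* placeholder: sleep until woken */ }
func unpark(state *atomic.Uint32) { /* placeholder: wake one sleeping waiter */ }

func lock(state *atomic.Uint32) {
	for {
		old := state.Load()
		if old&lockedFlag == 0 {
			// Unlocked: try to take it, preserving the other flags.
			if state.CompareAndSwap(old, old|lockedFlag) {
				return
			}
			continue
		}
		// The lock is held. Try to become the single spinning waiter.
		if old&spinningFlag == 0 && state.CompareAndSwap(old, old|spinningFlag) {
			for i := 0; i < 30; i++ { // arbitrary spin budget for the sketch
				cur := state.Load()
				if cur&lockedFlag == 0 &&
					state.CompareAndSwap(cur, (cur|lockedFlag)&^spinningFlag) {
					return // acquired the lock and gave up the spin token together
				}
			}
			state.And(^spinningFlag) // spin budget exhausted; release the spin token
		}
		// Another waiter already owns the spin token, or our budget ran
		// out: record that a sleeper may exist and go to sleep.
		// (Clearing sleepingFlag again is omitted from this sketch.)
		state.Or(sleepingFlag)
		park(state)
	}
}

func unlock(state *atomic.Uint32) {
	old := state.And(^lockedFlag) // release the lock, keeping the other flags
	if old&sleepingFlag != 0 && old&spinningFlag == 0 {
		unpark(state) // no waiter is awake and spinning, so wake one sleeper
	}
}
```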

Futex-based option, https://go.dev/cl/601597

I've prepared https://go.dev/cl/601597, which implements the “spinbit” design for GOOS=linux and GOARCH=amd64. I've prepared a matching TLA+ model to check for lost wakeups. (When relying on the futex syscall to maintain the list of sleeping Ms, it's easy to write lost-wakeup bugs.)

It uses atomic Xchg8 operations on two different bytes of the mutex state word. The low byte records whether the mutex is locked, and whether one or more waiting Ms may be asleep. The “spinning” flag is in a separate byte and so can be independently manipulated with atomic Xchg8 operations. The two bytes are within a single uintptr field (runtime.mutex.key). When the spinning M attempts to acquire the lock, it can do a CAS on the entire state word, setting the “locked” flag and clearing the “spinning” flag in a single operation.
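For illustration, the flags described here could be laid out as in the sketch below. The exact values are those of the CL and may differ; the point is only that the “locked”/“sleeping” bits and the “spinning” bit live in separately exchangeable bytes.

```go
// Illustrative layout only; see CL 601597 for the real definitions.
const (
	mutexLocked   = 0x001 // low byte, bit 0: the mutex is held
	mutexSleeping = 0x002 // low byte, bit 1: waiting Ms may be asleep
	mutexSpinning = 0x100 // second byte: one waiter is awake and spinning
)
```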

Cross-OS option, https://go.dev/cl/620435

I've also prepared https://go.dev/cl/620435 which unifies the lock_sema.go and lock_futex.go implementations and so supports all GOOS values for which Go supports multiple threads. (It uses futex to implement the runtime.sema{create,sleep,wakeup} functions for lock_futex.go platforms.) Go's development branch now includes Xchg8 support for GOARCH=amd64,arm64,ppc64,ppc64le, and so that CL supports all of those architectures.

The fast paths for runtime.lock2 and runtime.unlock2 use Xchg8 operations to interact with the “locked” flag. The lowest byte of the state word is dedicated to use with those Xchg8 operations. Most of the upper bytes hold a partial pointer to an M. (The runtime.m data structure is large enough to allow reconstructing the low bits from the partial pointer, with special handling for the non-heap-allocated runtime.m0 value.) Beyond the 8 bits needed for use with Xchg8, a few more low bits are available for use as flags. One of those bits holds the “spinning” flag, which is manipulated with pointer-length Load and CAS operations.

When Ms go to sleep they form a LIFO stack linked via runtime.m.nextwaitm pointers, as lock_sema.go does today. The list of waiting Ms is a multi-producer, single-consumer stack. Each M can add itself, but inspecting or removing Ms requires exclusive access. Today, lock_sema.go's runtime.unlock2 uses the mutex itself to control that ownership. That puts any use of the sleeping M list in the critical path of the mutex.
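Here is a sketch of that multi-producer, single-consumer stack, with a hypothetical m type standing in for runtime.m and a separate atomic pointer standing in for the pointer bits packed into the state word. The exclusivity that pop relies on is provided by the try-lock described next.

```go
package waitstack

import "sync/atomic"

// Hypothetical stand-ins: m mirrors the runtime.m.nextwaitm link, and the
// atomic pointer stands in for the pointer bits packed into runtime.mutex.key.
type m struct {
	nextwaitm *m // next-older sleeper on the LIFO stack
}

// push adds mp to the top of the stack. Any number of Ms may do this
// concurrently (multi-producer); no lock is required.
func push(top *atomic.Pointer[m], mp *m) {
	for {
		old := top.Load()
		mp.nextwaitm = old
		if top.CompareAndSwap(old, mp) {
			return
		}
	}
}

// pop removes the most recently pushed sleeper. Only one M may do this at
// a time (single-consumer); the proposal enforces that with the “stack”
// try-lock described below.
func pop(top *atomic.Pointer[m]) *m {
	for {
		mp := top.Load()
		if mp == nil {
			return nil
		}
		// With a single consumer, mp cannot disappear underneath us, but
		// new Ms may still be pushed on top, hence the CompareAndSwap.
		if top.CompareAndSwap(mp, mp.nextwaitm) {
			return mp
		}
	}
}
```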

My proposal uses another bit of the state word as a try-lock to control inspecting and removing Ms from the list. This allows additional list-management code without slowing the critical path of a busy mutex, and use of efficient Xchg8 operations in the fast paths. We'll need access to the list in order to attribute contention delay to the right critical section in the mutex profile. Access to the list will also let us periodically wake an M even when it's not strictly necessary, to combat tail latency that may be introduced by the reduction in wakeups.

Here's the full layout of the runtime.mutex.key state word:

- Bit 0 holds the “locked” flag, the primary purpose of the mutex.
- Bit 1 is the “sleeping” flag, and is set when the upper bits point to an M.
- Bits 2 through 7 are unused, since they're lost with every Xchg8 operation.
- Bit 8 holds the “spinning” try-lock, allowing the holder to reload the state word in a loop.
- Bit 9 holds the “stack” try-lock, allowing the holder to inspect and remove sleeping Ms from the LIFO stack.
- Bits 10 and higher of the state word hold bits 10 and higher of a pointer to the M at the top of the LIFO stack of sleeping waiters.
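Expressed as constants, that layout might look like the following sketch (the names and exact spellings are illustrative, not the CL's source):

```go
// Sketch of the runtime.mutex.key layout described above (illustrative).
const (
	mutexLocked      uintptr = 1 << 0 // bit 0: the mutex is held
	mutexSleeping    uintptr = 1 << 1 // bit 1: the upper bits point to a sleeping M
	// Bits 2 through 7 stay unused: every Xchg8 on the low byte overwrites them.
	mutexSpinning    uintptr = 1 << 8 // bit 8: try-lock for the right to spin on the state word
	mutexStackLocked uintptr = 1 << 9 // bit 9: try-lock for inspecting/removing sleeping Ms

	// The low ten bits are flags (or lost to Xchg8); bits 10 and up hold the
	// matching bits of the pointer to the M at the top of the sleeper stack.
	mutexMMask uintptr = ^uintptr(0x3FF)
)
```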

Rationale

The status quo is a runtime.lock2 implementation that experiences congestion collapse under high contention on machines with many hardware threads. Addressing that will require fewer threads loading the same cache line in a loop.

The literature presents several options for scalable / non-collapsing mutexes. Some require additional memory for each mutex, in proportion to the number of threads that may seek to acquire the lock. Some require threads to store a reference to a value that they will use to release each lock they hold. Go includes a runtime.mutex as part of every chan, and in some applications those values are the ones with the most contention. Coupled with select, there's no limit to the number of mutexes that an M can hold. That means neither of those forms of increased memory footprint is acceptable.

The performance of fully uncontended runtime.lock2/runtime.unlock2 pairs is also important to the project. That limits the use of many of the locking algorithms proposed in the literature, particularly those with FIFO queue handoff behavior. On my test hardware (a linux/amd64 machine with an i7-13700H, and a darwin/arm64 M1), a runtime.mutex value with zero or moderate contention can support 50,000,000 uses per second on any single thread, or can move between active threads 10,000,000 times per second, or can move between inactive threads (with sleep mediated by the OS) about 40,000 to 500,000 times per second (depending on the OS). Some amount of capture or barging, rather than queueing, is required to maintain the level of throughput that Go users have come to expect.

Keeping the size of runtime.mutex values as they are today but allowing threads to sleep with fewer interruptions seems like fulfilling the goal of the original design. The main disadvantage I know of is the risk of increased tail latency: A small set of threads may be able to capture a contended mutex, passing it back and forth among themselves while the other threads sleep indefinitely. That's already a risk of the current lock_sema.go implementation, but the high volume of wakeups means threads are unlikely to sleep for long, and the list of sleeping threads may regularly dip to zero.

The “cross-OS” option has an edge here: with it, the Go runtime maintains an explicit list of sleeping Ms and so can do targeted wakes or even direct handoffs to reduce starvation.

Compatibility

There is no change in exported APIs.

Implementation

I've prepared two options for the Go 1.24 release cycle. One relies on the futex syscall and the Xchg8 operation, and so initially supports GOOS=linux and GOARCH=amd64: https://go.dev/cl/601597. The other relies on only the Xchg8 operation and works with any GOOS value that supports threads: https://go.dev/cl/620435. Both are controlled by GOEXPERIMENT=spinbitmutex, enabled by default on supported platforms.

Open issues (if applicable)

I appreciate feedback on the balance between simplicity, performance at zero or low contention, performance under extreme contention, both the performance and maintenance burden for non-first-class ports, and the accuracy of contention profiles.