blob: 85f9e93133321aadfa2719e0bce70dc511b92de1 [file] [view]
# Proposal: Improve scalability of runtime.lock2
Author(s): Rhys Hiltner
Last updated: 2024-10-16
Discussion at https://go.dev/issue/68578.
## Abstract
Improve multi-core scalability of the runtime's internal mutex implementation
by minimizing wakeups of waiting threads.
Avoiding wakeups of threads that are waiting for the lock allows those threads
to sleep for longer.
That reduces the number of concurrent threads that are attempting to read the
mutex's state word.
Fewer reads of that cache line mean less cache coherency traffic within the
processor when a thread needs to make an update.
Fast updates (to acquire and release the lock) even when many threads need the
lock means better scalability.
This is not an API change, so is not part of the formal proposal process.
## Background
One of the simplest mutex designs is a single bit that is "0" when unlocked or
"1" when locked.
To acquire the lock, a thread attempts to swap in a "1",
looping until the result it gets is "0".
To unlock, the thread swaps in a "0".
The performance of such a spinlock is poor in at least two ways.
First, threads that are trying to acquire an already-held lock waste their own
on-CPU time.
Second, those software threads execute on hardware resources that need a local
copy of the mutex state word in cache.
Having the state word in cache for read access requires it not be writeable by
any other processors.
Writing to that memory location requires the hardware to invalidate all cached
copies of that memory, one in each processor that had loaded it for reading.
The hardware-internal communication necessary to implement those guarantees
has a cost, which appears as a slowdown when writing to that memory location.
Go's current mutex design is several steps more advanced than the simple
spinlock, but under certain conditions its performance can degrade in a similar way.
First, when `runtime.lock2` is unable to immediately obtain the mutex it will
pause for a moment before retrying, primarily using hardware-level delay
instructions (such as `PAUSE` on 386 and amd64).
Then, if it's unable to acquire the mutex after several retries it will ask
the OS to put it to sleep until another thread requests a wakeup.
On Linux, we use the `futex` syscall to sleep directly on the mutex address,
implemented in src/runtime/lock_futex.go.
On many other platforms (including Windows and macOS),the waiting threads
form a LIFO stack with the mutex state word as a pointer to the top of the
stack, implemented in src/runtime/lock_sema.go.
When the `futex` syscall is available,
the OS maintains a list of waiting threads and will choose which it wakes.
Otherwise, the Go runtime maintains that list and names a specific thread
when it asks the OS to do a wakeup.
To avoid a `futex` syscall when there's no contention,
we split the "locked" state into two variants:
1 meaning "locked with no contention" and
2 meaning "locked, and a thread may be asleep".
(With the semaphore-based implementation,
the Go runtime can--and must--know for itself whether a thread is asleep.)
Go's mutex implementation has those three logical states
(unlocked, locked, locked-with-sleepers) on all multi-threaded platforms.
For the purposes of the Go runtime, I'm calling this design "tristate".
After releasing the mutex,
`runtime.unlock2` will wake a thread whenever one is sleeping.
It does not consider whether one of the waiting threads is already awake.
If a waiting thread is already awake, it's not necessary to wake another.
Waking additional threads results in higher concurrent demand for the mutex
state word's cache line.
Every thread that is awake and spinning in a loop to reload the state word
leads to more cache coherency traffic within the processor,
and to slower writes to that cache line.
Consider the case where many threads all need to use the same mutex many times
in a row.
Furthermore, consider that the critical section is short relative to the time
it takes a thread to give up on spinning and go (back) to sleep.
At the end of each critical section, the thread that is releasing the mutex
will see that a waiting thread is asleep, and will wake it.
It takes a relatively long time for a thread to decide to go to sleep,
and there's a relatively short time until the next `runtime.unlock2` call will
wake it.
Many threads will be awake, all reloading the state word in a loop,
all slowing down updates to its value.
Without a limit on the number of threads that can spin on the state word,
higher demand for a mutex value degrades its performance.
See also https://go.dev/issue/68578.
## Proposal
Expand the mutex state word to include a new flag, "spinning".
Use the "spinning" bit to communicate whether one of the waiting threads is
awake and looping while trying to acquire the lock.
Threads mutually exclude each other from the "spinning" state,
but they won't block while attempting to acquire the bit.
Only the thread that owns the "spinning" bit is allowed to reload the state
word in a loop.
It releases the "spinning" bit before going to sleep.
The other waiting threads go directly to sleep.
The thread that unlocks a mutex can avoid waking a thread if it sees that one
is already awake and spinning.
For the purposes of the Go runtime, I'm calling this design "spinbit".
### futex-based option, https://go.dev/cl/601597
I've prepared https://go.dev/cl/601597,
which implements the "spinbit" design for GOOS=linux and GOARCH=amd64.
I've prepared a matching [TLA+ model](./68578/spinbit.tla)
to check for lost wakeups.
(When relying on the `futex` syscall to maintain the list of sleeping Ms,
it's easy to write lost-wakeup bugs.)
It uses an atomic `Xchg8` operation on two different bytes of the mutex state
word.
The low byte records whether the mutex is locked,
and whether one or more waiting Ms may be asleep.
The "spinning" flag is in a separate byte and so can be independently
manipulated with atomic `Xchg8` operations.
The two bytes are within a single uintptr field (`runtime.mutex.key`).
When the spinning M attempts to acquire the lock,
it can do a CAS on the entire state word,
setting the "locked" flag and clearing the "spinning" flag
in a single operation.
### Cross-OS option, https://go.dev/cl/620435
I've also prepared https://go.dev/cl/620435 which unifies the lock_sema.go and
lock_futex.go implementations and so supports all GOOS values for which Go
supports multiple threads.
(It uses `futex` to implement the `runtime.sema{create,sleep,wakeup}`
functions for lock_futex.go platforms.)
Go's development branch now includes `Xchg8` support for
GOARCH=amd64,arm64,ppc64,ppc64le,
and so that CL supports all of those architectures.
The fast path for `runtime.lock2` and `runtime.unlock2` use `Xchg8` operations
to interact with the "locked" flag.
The lowest byte of the state word is dedicated to use with those `Xchg8`
operations.
Most of the upper bytes hold a partial pointer to an M.
(The `runtime.m` datastructure is large enough to allow reconstructing the low
bits from the partial pointer,
with special handling for the non-heap-allocated `runtime.m0` value.)
Beyond the 8 bits needed for use with `Xchg8`,
a few more low bits are available for use as flags.
One of those bits holds the "spinning" flag,
which is manipulated with pointer-length `Load` and `CAS` operations.
When Ms go to sleep they form a LIFO stack linked via `runtime.m.nextwaitm`
pointers, as lock_sema.go does today.
The list of waiting Ms is a multi-producer, single-consumer stack.
Each M can add itself,
but inspecting or removing Ms requires exclusive access.
Today, lock_sema.go's `runtime.unlock2` uses the mutex itself to control that
ownership.
That puts any use of the sleeping M list in the critical path of the mutex.
My proposal uses another bit of the state word as a try-lock to control
inspecting and removing Ms from the list.
This allows additional list-management code without slowing the critical path
of a busy mutex, and use of efficient `Xchg8` operations in the fast paths.
We'll need access to the list in order to attribute contention delay to the
right critical section in the [mutex profile](https://go.dev/issue/66999).
Access to the list will also let us periodically wake an M even when it's not
strictly necessary, to combat tail latency that may be introduced by the
reduction in wakeups.
Here's the full layout of the `runtime.mutex.key` state word:
Bit 0 holds the "locked" flag, the primary purpose of the mutex.
Bit 1 is the "sleeping" flag, and is set when the upper bits point to an M.
Bits 2 through 7 are unused, since they're lost with every `Xchg8` operation.
Bit 8 holds the "spinning" try-lock, allowing the holder to reload the state
word in a loop.
Bit 9 holds the "stack" try-lock, allowing the holder to inspect and remove
sleeping Ms from the LIFO stack.
Bits 10 and higher of the state word hold bits 10 and higher of a pointer to
the M at the top of the LIFO stack of sleeping waiters.
## Rationale
The status quo is a `runtime.lock2` implementation that experiences congestion
collapse under high contention on machines with many hardware threads.
Addressing that will require fewer threads loading the same cache line in a
loop.
The literature presents several options for scalable / non-collapsing mutexes.
Some require an additional memory footprint for each mutex in proportion to
the number of threads that may seek to acquire the lock.
Some require threads to store a reference to a value that they will use to
release each lock they hold.
Go includes a `runtime.mutex` as part of every `chan`, and in some
applications those values are the ones with the most contention.
Coupled with `select`, there's no limit to the number of mutexes that an M can
hold.
That means neither of those forms of increased memory footprint is acceptable.
The performance of fully uncontended `runtime.lock2`/`runtime.unlock2` pairs
is also important to the project.
That limits the use of many of the literature's proposed locking algorithms,
if they include FIFO queue handoff behavior.
On my test hardware
(a linux/amd64 machine with i7-13700H, and a darwin/arm64 M1),
a `runtime.mutex` value with zero or moderate contention can support
50,000,000 uses per second on any threads,
or can move between active threads 10,000,000 times per second,
or can move between inactive threads (with sleep mediated by the OS)
about 40,000 to 500,000 times per second (depending on the OS).
Some amount of capture or barging, rather than queueing, is required to
maintain the level of throughput that Go users have come to expect.
Keeping the size of `runtime.mutex` values as they are today but allowing
threads to sleep with fewer interruptions seems like fulfilling the goal of
the original design.
The main disadvantage I know of is the risk of increased tail latency:
A small set of threads may be able to capture a contended mutex,
passing it back and forth among themselves while the other threads sleep
indefinitely.
That's already a risk of the current lock_sema.go implementation,
but the high volume of wakeups means threads are unlikely to sleep for long,
and the list of sleeping threads may regularly dip to zero.
The "cross-OS" option has an edge here:
with it, the Go runtime maintains an explicit list of sleeping Ms and so can do
targeted wakes or even direct handoffs to reduce starvation.
## Compatibility
There is no change in exported APIs.
## Implementation
I've prepared two options for the Go 1.24 release cycle.
One relies on the `futex` syscall and the `Xchg8` operation, and so initially
supports GOOS=linux and GOARCH=amd64: https://go.dev/cl/601597.
The other relies on only the `Xchg8` operation and works with any GOOS value
that supports threads: https://go.dev/cl/620435.
Both are controlled by `GOEXPERIMENT=spinbitmutex`,
enabled by default on supported platforms.
## Open issues (if applicable)
I appreciate feedback on the balance between simplicity,
performance at zero or low contention,
performance under extreme contention,
both the performance and maintenance burden for non-first-class ports,
and the accuracy of contention profiles.