design/68578-mutex-spinbit.md - proposal - Git at Google

 # Proposal: Improve scalability of runtime.lock2

 Author(s): Rhys Hiltner

 Last updated: 2024-10-16

 Discussion at https://go.dev/issue/68578.

 ## Abstract

 Improve multi-core scalability of the runtime's internal mutex implementation
 by minimizing wakeups of waiting threads.

 Avoiding wakeups of threads that are waiting for the lock allows those threads
 to sleep for longer.
 That reduces the number of concurrent threads that are attempting to read the
 mutex's state word.
 Fewer reads of that cache line mean less cache coherency traffic within the
 processor when a thread needs to make an update.
 Fast updates (to acquire and release the lock) even when many threads need the
 lock means better scalability.

 This is not an API change, so is not part of the formal proposal process.

 ## Background

 One of the simplest mutex designs is a single bit that is "0" when unlocked or
 "1" when locked.
 To acquire the lock, a thread attempts to swap in a "1",
 looping until the result it gets is "0".
 To unlock, the thread swaps in a "0".

 The performance of such a spinlock is poor in at least two ways.
 First, threads that are trying to acquire an already-held lock waste their own
 on-CPU time.
 Second, those software threads execute on hardware resources that need a local
 copy of the mutex state word in cache.

 Having the state word in cache for read access requires it not be writeable by
 any other processors.
 Writing to that memory location requires the hardware to invalidate all cached
 copies of that memory, one in each processor that had loaded it for reading.
 The hardware-internal communication necessary to implement those guarantees
 has a cost, which appears as a slowdown when writing to that memory location.

 Go's current mutex design is several steps more advanced than the simple
 spinlock, but under certain conditions its performance can degrade in a similar way.
 First, when `runtime.lock2` is unable to immediately obtain the mutex it will
 pause for a moment before retrying, primarily using hardware-level delay
 instructions (such as `PAUSE` on 386 and amd64).
 Then, if it's unable to acquire the mutex after several retries it will ask
 the OS to put it to sleep until another thread requests a wakeup.
 On Linux, we use the `futex` syscall to sleep directly on the mutex address,
 implemented in src/runtime/lock_futex.go.
 On many other platforms (including Windows and macOS),the waiting threads
 form a LIFO stack with the mutex state word as a pointer to the top of the
 stack, implemented in src/runtime/lock_sema.go.

 When the `futex` syscall is available,
 the OS maintains a list of waiting threads and will choose which it wakes.
 Otherwise, the Go runtime maintains that list and names a specific thread
 when it asks the OS to do a wakeup.
 To avoid a `futex` syscall when there's no contention,
 we split the "locked" state into two variants:
 1 meaning "locked with no contention" and
 2 meaning "locked, and a thread may be asleep".
 (With the semaphore-based implementation,
 the Go runtime can--and must--know for itself whether a thread is asleep.)
 Go's mutex implementation has those three logical states
 (unlocked, locked, locked-with-sleepers) on all multi-threaded platforms.
 For the purposes of the Go runtime, I'm calling this design "tristate".

 After releasing the mutex,
 `runtime.unlock2` will wake a thread whenever one is sleeping.
 It does not consider whether one of the waiting threads is already awake.
 If a waiting thread is already awake, it's not necessary to wake another.

 Waking additional threads results in higher concurrent demand for the mutex
 state word's cache line.
 Every thread that is awake and spinning in a loop to reload the state word
 leads to more cache coherency traffic within the processor,
 and to slower writes to that cache line.

 Consider the case where many threads all need to use the same mutex many times
 in a row.
 Furthermore, consider that the critical section is short relative to the time
 it takes a thread to give up on spinning and go (back) to sleep.
 At the end of each critical section, the thread that is releasing the mutex
 will see that a waiting thread is asleep, and will wake it.
 It takes a relatively long time for a thread to decide to go to sleep,
 and there's a relatively short time until the next `runtime.unlock2` call will
 wake it.
 Many threads will be awake, all reloading the state word in a loop,
 all slowing down updates to its value.

 Without a limit on the number of threads that can spin on the state word,
 higher demand for a mutex value degrades its performance.

 See also https://go.dev/issue/68578.

 ## Proposal

 Expand the mutex state word to include a new flag, "spinning".
 Use the "spinning" bit to communicate whether one of the waiting threads is
 awake and looping while trying to acquire the lock.
 Threads mutually exclude each other from the "spinning" state,
 but they won't block while attempting to acquire the bit.
 Only the thread that owns the "spinning" bit is allowed to reload the state
 word in a loop.
 It releases the "spinning" bit before going to sleep.
 The other waiting threads go directly to sleep.
 The thread that unlocks a mutex can avoid waking a thread if it sees that one
 is already awake and spinning.
 For the purposes of the Go runtime, I'm calling this design "spinbit".

 ### futex-based option, https://go.dev/cl/601597

 I've prepared https://go.dev/cl/601597,
 which implements the "spinbit" design for GOOS=linux and GOARCH=amd64.
 I've prepared a matching [TLA+ model](./68578/spinbit.tla)
 to check for lost wakeups.
 (When relying on the `futex` syscall to maintain the list of sleeping Ms,
 it's easy to write lost-wakeup bugs.)

 It uses an atomic `Xchg8` operation on two different bytes of the mutex state
 word.
 The low byte records whether the mutex is locked,
 and whether one or more waiting Ms may be asleep.
 The "spinning" flag is in a separate byte and so can be independently
 manipulated with atomic `Xchg8` operations.
 The two bytes are within a single uintptr field (`runtime.mutex.key`).
 When the spinning M attempts to acquire the lock,
 it can do a CAS on the entire state word,
 setting the "locked" flag and clearing the "spinning" flag
 in a single operation.

 ### Cross-OS option, https://go.dev/cl/620435

 I've also prepared https://go.dev/cl/620435 which unifies the lock_sema.go and
 lock_futex.go implementations and so supports all GOOS values for which Go
 supports multiple threads.
 (It uses `futex` to implement the `runtime.sema{create,sleep,wakeup}`
 functions for lock_futex.go platforms.)
 Go's development branch now includes `Xchg8` support for
 GOARCH=amd64,arm64,ppc64,ppc64le,
 and so that CL supports all of those architectures.

 The fast path for `runtime.lock2` and `runtime.unlock2` use `Xchg8` operations
 to interact with the "locked" flag.
 The lowest byte of the state word is dedicated to use with those `Xchg8`
 operations.
 Most of the upper bytes hold a partial pointer to an M.
 (The `runtime.m` datastructure is large enough to allow reconstructing the low
 bits from the partial pointer,
 with special handling for the non-heap-allocated `runtime.m0` value.)
 Beyond the 8 bits needed for use with `Xchg8`,
 a few more low bits are available for use as flags.
 One of those bits holds the "spinning" flag,
 which is manipulated with pointer-length `Load` and `CAS` operations.

 When Ms go to sleep they form a LIFO stack linked via `runtime.m.nextwaitm`
 pointers, as lock_sema.go does today.
 The list of waiting Ms is a multi-producer, single-consumer stack.
 Each M can add itself,
 but inspecting or removing Ms requires exclusive access.
 Today, lock_sema.go's `runtime.unlock2` uses the mutex itself to control that
 ownership.
 That puts any use of the sleeping M list in the critical path of the mutex.

 My proposal uses another bit of the state word as a try-lock to control
 inspecting and removing Ms from the list.
 This allows additional list-management code without slowing the critical path
 of a busy mutex, and use of efficient `Xchg8` operations in the fast paths.
 We'll need access to the list in order to attribute contention delay to the
 right critical section in the [mutex profile](https://go.dev/issue/66999).
 Access to the list will also let us periodically wake an M even when it's not
 strictly necessary, to combat tail latency that may be introduced by the
 reduction in wakeups.

 Here's the full layout of the `runtime.mutex.key` state word:
 Bit 0 holds the "locked" flag, the primary purpose of the mutex.
 Bit 1 is the "sleeping" flag, and is set when the upper bits point to an M.
 Bits 2 through 7 are unused, since they're lost with every `Xchg8` operation.
 Bit 8 holds the "spinning" try-lock, allowing the holder to reload the state
 word in a loop.
 Bit 9 holds the "stack" try-lock, allowing the holder to inspect and remove
 sleeping Ms from the LIFO stack.
 Bits 10 and higher of the state word hold bits 10 and higher of a pointer to
 the M at the top of the LIFO stack of sleeping waiters.

 ## Rationale

 The status quo is a `runtime.lock2` implementation that experiences congestion
 collapse under high contention on machines with many hardware threads.
 Addressing that will require fewer threads loading the same cache line in a
 loop.

 The literature presents several options for scalable / non-collapsing mutexes.
 Some require an additional memory footprint for each mutex in proportion to
 the number of threads that may seek to acquire the lock.
 Some require threads to store a reference to a value that they will use to
 release each lock they hold.
 Go includes a `runtime.mutex` as part of every `chan`, and in some
 applications those values are the ones with the most contention.
 Coupled with `select`, there's no limit to the number of mutexes that an M can
 hold.
 That means neither of those forms of increased memory footprint is acceptable.

 The performance of fully uncontended `runtime.lock2`/`runtime.unlock2` pairs
 is also important to the project.
 That limits the use of many of the literature's proposed locking algorithms,
 if they include FIFO queue handoff behavior.
 On my test hardware
 (a linux/amd64 machine with i7-13700H, and a darwin/arm64 M1),
 a `runtime.mutex` value with zero or moderate contention can support
 50,000,000 uses per second on any threads,
 or can move between active threads 10,000,000 times per second,
 or can move between inactive threads (with sleep mediated by the OS)
 about 40,000 to 500,000 times per second (depending on the OS).
 Some amount of capture or barging, rather than queueing, is required to
 maintain the level of throughput that Go users have come to expect.

 Keeping the size of `runtime.mutex` values as they are today but allowing
 threads to sleep with fewer interruptions seems like fulfilling the goal of
 the original design.
 The main disadvantage I know of is the risk of increased tail latency:
 A small set of threads may be able to capture a contended mutex,
 passing it back and forth among themselves while the other threads sleep
 indefinitely.
 That's already a risk of the current lock_sema.go implementation,
 but the high volume of wakeups means threads are unlikely to sleep for long,
 and the list of sleeping threads may regularly dip to zero.

 The "cross-OS" option has an edge here:
 with it, the Go runtime maintains an explicit list of sleeping Ms and so can do
 targeted wakes or even direct handoffs to reduce starvation.

 ## Compatibility

 There is no change in exported APIs.

 ## Implementation

 I've prepared two options for the Go 1.24 release cycle.
 One relies on the `futex` syscall and the `Xchg8` operation, and so initially
 supports GOOS=linux and GOARCH=amd64: https://go.dev/cl/601597.
 The other relies on only the `Xchg8` operation and works with any GOOS value
 that supports threads: https://go.dev/cl/620435.
 Both are controlled by `GOEXPERIMENT=spinbitmutex`,
 enabled by default on supported platforms.

 ## Open issues (if applicable)

 I appreciate feedback on the balance between simplicity,
 performance at zero or low contention,
 performance under extreme contention,
 both the performance and maintenance burden for non-first-class ports,
 and the accuracy of contention profiles.
	# Proposal: Improve scalability of runtime.lock2

	Author(s): Rhys Hiltner

	Last updated: 2024-10-16

	Discussion at https://go.dev/issue/68578.

	## Abstract

	Improve multi-core scalability of the runtime's internal mutex implementation
	by minimizing wakeups of waiting threads.

	Avoiding wakeups of threads that are waiting for the lock allows those threads
	to sleep for longer.
	That reduces the number of concurrent threads that are attempting to read the
	mutex's state word.
	Fewer reads of that cache line mean less cache coherency traffic within the
	processor when a thread needs to make an update.
	Fast updates (to acquire and release the lock) even when many threads need the
	lock means better scalability.

	This is not an API change, so is not part of the formal proposal process.

	## Background

	One of the simplest mutex designs is a single bit that is "0" when unlocked or
	"1" when locked.
	To acquire the lock, a thread attempts to swap in a "1",
	looping until the result it gets is "0".
	To unlock, the thread swaps in a "0".

	The performance of such a spinlock is poor in at least two ways.
	First, threads that are trying to acquire an already-held lock waste their own
	on-CPU time.
	Second, those software threads execute on hardware resources that need a local
	copy of the mutex state word in cache.

	Having the state word in cache for read access requires it not be writeable by
	any other processors.
	Writing to that memory location requires the hardware to invalidate all cached
	copies of that memory, one in each processor that had loaded it for reading.
	The hardware-internal communication necessary to implement those guarantees
	has a cost, which appears as a slowdown when writing to that memory location.

	Go's current mutex design is several steps more advanced than the simple
	spinlock, but under certain conditions its performance can degrade in a similar way.
	First, when `runtime.lock2` is unable to immediately obtain the mutex it will
	pause for a moment before retrying, primarily using hardware-level delay
	instructions (such as `PAUSE` on 386 and amd64).
	Then, if it's unable to acquire the mutex after several retries it will ask
	the OS to put it to sleep until another thread requests a wakeup.
	On Linux, we use the `futex` syscall to sleep directly on the mutex address,
	implemented in src/runtime/lock_futex.go.
	On many other platforms (including Windows and macOS),the waiting threads
	form a LIFO stack with the mutex state word as a pointer to the top of the
	stack, implemented in src/runtime/lock_sema.go.

	When the `futex` syscall is available,
	the OS maintains a list of waiting threads and will choose which it wakes.
	Otherwise, the Go runtime maintains that list and names a specific thread
	when it asks the OS to do a wakeup.
	To avoid a `futex` syscall when there's no contention,
	we split the "locked" state into two variants:
	1 meaning "locked with no contention" and
	2 meaning "locked, and a thread may be asleep".
	(With the semaphore-based implementation,
	the Go runtime can--and must--know for itself whether a thread is asleep.)
	Go's mutex implementation has those three logical states
	(unlocked, locked, locked-with-sleepers) on all multi-threaded platforms.
	For the purposes of the Go runtime, I'm calling this design "tristate".

	After releasing the mutex,
	`runtime.unlock2` will wake a thread whenever one is sleeping.
	It does not consider whether one of the waiting threads is already awake.
	If a waiting thread is already awake, it's not necessary to wake another.

	Waking additional threads results in higher concurrent demand for the mutex
	state word's cache line.
	Every thread that is awake and spinning in a loop to reload the state word
	leads to more cache coherency traffic within the processor,
	and to slower writes to that cache line.

	Consider the case where many threads all need to use the same mutex many times
	in a row.
	Furthermore, consider that the critical section is short relative to the time
	it takes a thread to give up on spinning and go (back) to sleep.
	At the end of each critical section, the thread that is releasing the mutex
	will see that a waiting thread is asleep, and will wake it.
	It takes a relatively long time for a thread to decide to go to sleep,
	and there's a relatively short time until the next `runtime.unlock2` call will
	wake it.
	Many threads will be awake, all reloading the state word in a loop,
	all slowing down updates to its value.

	Without a limit on the number of threads that can spin on the state word,
	higher demand for a mutex value degrades its performance.

	See also https://go.dev/issue/68578.

	## Proposal

	Expand the mutex state word to include a new flag, "spinning".
	Use the "spinning" bit to communicate whether one of the waiting threads is
	awake and looping while trying to acquire the lock.
	Threads mutually exclude each other from the "spinning" state,
	but they won't block while attempting to acquire the bit.
	Only the thread that owns the "spinning" bit is allowed to reload the state
	word in a loop.
	It releases the "spinning" bit before going to sleep.
	The other waiting threads go directly to sleep.
	The thread that unlocks a mutex can avoid waking a thread if it sees that one
	is already awake and spinning.
	For the purposes of the Go runtime, I'm calling this design "spinbit".

	### futex-based option, https://go.dev/cl/601597

	I've prepared https://go.dev/cl/601597,
	which implements the "spinbit" design for GOOS=linux and GOARCH=amd64.
	I've prepared a matching [TLA+ model](./68578/spinbit.tla)
	to check for lost wakeups.
	(When relying on the `futex` syscall to maintain the list of sleeping Ms,
	it's easy to write lost-wakeup bugs.)

	It uses an atomic `Xchg8` operation on two different bytes of the mutex state
	word.
	The low byte records whether the mutex is locked,
	and whether one or more waiting Ms may be asleep.
	The "spinning" flag is in a separate byte and so can be independently
	manipulated with atomic `Xchg8` operations.
	The two bytes are within a single uintptr field (`runtime.mutex.key`).
	When the spinning M attempts to acquire the lock,
	it can do a CAS on the entire state word,
	setting the "locked" flag and clearing the "spinning" flag
	in a single operation.

	### Cross-OS option, https://go.dev/cl/620435

	I've also prepared https://go.dev/cl/620435 which unifies the lock_sema.go and
	lock_futex.go implementations and so supports all GOOS values for which Go
	supports multiple threads.
	(It uses `futex` to implement the `runtime.sema{create,sleep,wakeup}`
	functions for lock_futex.go platforms.)
	Go's development branch now includes `Xchg8` support for
	GOARCH=amd64,arm64,ppc64,ppc64le,
	and so that CL supports all of those architectures.

	The fast path for `runtime.lock2` and `runtime.unlock2` use `Xchg8` operations
	to interact with the "locked" flag.
	The lowest byte of the state word is dedicated to use with those `Xchg8`
	operations.
	Most of the upper bytes hold a partial pointer to an M.
	(The `runtime.m` datastructure is large enough to allow reconstructing the low
	bits from the partial pointer,
	with special handling for the non-heap-allocated `runtime.m0` value.)
	Beyond the 8 bits needed for use with `Xchg8`,
	a few more low bits are available for use as flags.
	One of those bits holds the "spinning" flag,
	which is manipulated with pointer-length `Load` and `CAS` operations.

	When Ms go to sleep they form a LIFO stack linked via `runtime.m.nextwaitm`
	pointers, as lock_sema.go does today.
	The list of waiting Ms is a multi-producer, single-consumer stack.
	Each M can add itself,
	but inspecting or removing Ms requires exclusive access.
	Today, lock_sema.go's `runtime.unlock2` uses the mutex itself to control that
	ownership.
	That puts any use of the sleeping M list in the critical path of the mutex.

	My proposal uses another bit of the state word as a try-lock to control
	inspecting and removing Ms from the list.
	This allows additional list-management code without slowing the critical path
	of a busy mutex, and use of efficient `Xchg8` operations in the fast paths.
	We'll need access to the list in order to attribute contention delay to the
	right critical section in the [mutex profile](https://go.dev/issue/66999).
	Access to the list will also let us periodically wake an M even when it's not
	strictly necessary, to combat tail latency that may be introduced by the
	reduction in wakeups.

	Here's the full layout of the `runtime.mutex.key` state word:
	Bit 0 holds the "locked" flag, the primary purpose of the mutex.
	Bit 1 is the "sleeping" flag, and is set when the upper bits point to an M.
	Bits 2 through 7 are unused, since they're lost with every `Xchg8` operation.
	Bit 8 holds the "spinning" try-lock, allowing the holder to reload the state
	word in a loop.
	Bit 9 holds the "stack" try-lock, allowing the holder to inspect and remove
	sleeping Ms from the LIFO stack.
	Bits 10 and higher of the state word hold bits 10 and higher of a pointer to
	the M at the top of the LIFO stack of sleeping waiters.

	## Rationale

	The status quo is a `runtime.lock2` implementation that experiences congestion
	collapse under high contention on machines with many hardware threads.
	Addressing that will require fewer threads loading the same cache line in a
	loop.

	The literature presents several options for scalable / non-collapsing mutexes.
	Some require an additional memory footprint for each mutex in proportion to
	the number of threads that may seek to acquire the lock.
	Some require threads to store a reference to a value that they will use to
	release each lock they hold.
	Go includes a `runtime.mutex` as part of every `chan`, and in some
	applications those values are the ones with the most contention.
	Coupled with `select`, there's no limit to the number of mutexes that an M can
	hold.
	That means neither of those forms of increased memory footprint is acceptable.

	The performance of fully uncontended `runtime.lock2`/`runtime.unlock2` pairs
	is also important to the project.
	That limits the use of many of the literature's proposed locking algorithms,
	if they include FIFO queue handoff behavior.
	On my test hardware
	(a linux/amd64 machine with i7-13700H, and a darwin/arm64 M1),
	a `runtime.mutex` value with zero or moderate contention can support
	50,000,000 uses per second on any threads,
	or can move between active threads 10,000,000 times per second,
	or can move between inactive threads (with sleep mediated by the OS)
	about 40,000 to 500,000 times per second (depending on the OS).
	Some amount of capture or barging, rather than queueing, is required to
	maintain the level of throughput that Go users have come to expect.

	Keeping the size of `runtime.mutex` values as they are today but allowing
	threads to sleep with fewer interruptions seems like fulfilling the goal of
	the original design.
	The main disadvantage I know of is the risk of increased tail latency:
	A small set of threads may be able to capture a contended mutex,
	passing it back and forth among themselves while the other threads sleep
	indefinitely.
	That's already a risk of the current lock_sema.go implementation,
	but the high volume of wakeups means threads are unlikely to sleep for long,
	and the list of sleeping threads may regularly dip to zero.

	The "cross-OS" option has an edge here:
	with it, the Go runtime maintains an explicit list of sleeping Ms and so can do
	targeted wakes or even direct handoffs to reduce starvation.

	## Compatibility

	There is no change in exported APIs.

	## Implementation

	I've prepared two options for the Go 1.24 release cycle.
	One relies on the `futex` syscall and the `Xchg8` operation, and so initially
	supports GOOS=linux and GOARCH=amd64: https://go.dev/cl/601597.
	The other relies on only the `Xchg8` operation and works with any GOOS value
	that supports threads: https://go.dev/cl/620435.
	Both are controlled by `GOEXPERIMENT=spinbitmutex`,
	enabled by default on supported platforms.

	## Open issues (if applicable)

	I appreciate feedback on the balance between simplicity,
	performance at zero or low contention,
	performance under extreme contention,
	both the performance and maintenance burden for non-first-class ports,
	and the accuracy of contention profiles.