design/68578-mutex-spinbit.md: describe protocol

Add a design doc describing the general approach of the "spinbit" mutex
protocol, and the details of the two drafts that implement it.

Based on futex, for linux/amd64: https://go.dev/cl/601597
For all GOOS values and four architectures: https://go.dev/cl/620435

For golang/go#68578

Change-Id: Ie9665085c9b8cf1741deeb431acfa12fba550b63
Reviewed-on: https://go-review.googlesource.com/c/proposal/+/617618
Auto-Submit: Rhys Hiltner <rhys.hiltner@gmail.com>
Reviewed-by: Rhys Hiltner <rhys.hiltner@gmail.com>
Commit-Queue: Rhys Hiltner <rhys.hiltner@gmail.com>
diff --git a/design/68578-mutex-spinbit.md b/design/68578-mutex-spinbit.md
new file mode 100644
index 0000000..85f9e93
--- /dev/null
+++ b/design/68578-mutex-spinbit.md
@@ -0,0 +1,258 @@
+# Proposal: Improve scalability of runtime.lock2
+
+Author(s): Rhys Hiltner
+
+Last updated: 2024-10-16
+
+Discussion at https://go.dev/issue/68578.
+
+## Abstract
+
+Improve multi-core scalability of the runtime's internal mutex implementation
+by minimizing wakeups of waiting threads.
+
+Avoiding wakeups of threads that are waiting for the lock allows those threads
+to sleep for longer.
+That reduces the number of concurrent threads that are attempting to read the
+mutex's state word.
+Fewer reads of that cache line mean less cache coherency traffic within the
+processor when a thread needs to make an update.
+Fast updates (to acquire and release the lock) even when many threads need the
+lock means better scalability.
+
+This is not an API change, so is not part of the formal proposal process.
+
+## Background
+
+One of the simplest mutex designs is a single bit that is "0" when unlocked or
+"1" when locked.
+To acquire the lock, a thread attempts to swap in a "1",
+looping until the result it gets is "0".
+To unlock, the thread swaps in a "0".
+
+The performance of such a spinlock is poor in at least two ways.
+First, threads that are trying to acquire an already-held lock waste their own
+on-CPU time.
+Second, those software threads execute on hardware resources that need a local
+copy of the mutex state word in cache.
+
+Having the state word in cache for read access requires it not be writeable by
+any other processors.
+Writing to that memory location requires the hardware to invalidate all cached
+copies of that memory, one in each processor that had loaded it for reading.
+The hardware-internal communication necessary to implement those guarantees
+has a cost, which appears as a slowdown when writing to that memory location.
+
+Go's current mutex design is several steps more advanced than the simple
+spinlock, but under certain conditions its performance can degrade in a similar way.
+First, when `runtime.lock2` is unable to immediately obtain the mutex it will
+pause for a moment before retrying, primarily using hardware-level delay
+instructions (such as `PAUSE` on 386 and amd64).
+Then, if it's unable to acquire the mutex after several retries it will ask
+the OS to put it to sleep until another thread requests a wakeup.
+On Linux, we use the `futex` syscall to sleep directly on the mutex address,
+implemented in src/runtime/lock_futex.go.
+On many other platforms (including Windows and macOS),the waiting threads
+form a LIFO stack with the mutex state word as a pointer to the top of the
+stack, implemented in src/runtime/lock_sema.go.
+
+When the `futex` syscall is available,
+the OS maintains a list of waiting threads and will choose which it wakes.
+Otherwise, the Go runtime maintains that list and names a specific thread
+when it asks the OS to do a wakeup.
+To avoid a `futex` syscall when there's no contention,
+we split the "locked" state into two variants:
+1 meaning "locked with no contention" and
+2 meaning "locked, and a thread may be asleep".
+(With the semaphore-based implementation,
+the Go runtime can--and must--know for itself whether a thread is asleep.)
+Go's mutex implementation has those three logical states
+(unlocked, locked, locked-with-sleepers) on all multi-threaded platforms.
+For the purposes of the Go runtime, I'm calling this design "tristate".
+
+After releasing the mutex,
+`runtime.unlock2` will wake a thread whenever one is sleeping.
+It does not consider whether one of the waiting threads is already awake.
+If a waiting thread is already awake, it's not necessary to wake another.
+
+Waking additional threads results in higher concurrent demand for the mutex
+state word's cache line.
+Every thread that is awake and spinning in a loop to reload the state word
+leads to more cache coherency traffic within the processor,
+and to slower writes to that cache line.
+
+Consider the case where many threads all need to use the same mutex many times
+in a row.
+Furthermore, consider that the critical section is short relative to the time
+it takes a thread to give up on spinning and go (back) to sleep.
+At the end of each critical section, the thread that is releasing the mutex
+will see that a waiting thread is asleep, and will wake it.
+It takes a relatively long time for a thread to decide to go to sleep,
+and there's a relatively short time until the next `runtime.unlock2` call will
+wake it.
+Many threads will be awake, all reloading the state word in a loop,
+all slowing down updates to its value.
+
+Without a limit on the number of threads that can spin on the state word,
+higher demand for a mutex value degrades its performance.
+
+See also https://go.dev/issue/68578.
+
+## Proposal
+
+Expand the mutex state word to include a new flag, "spinning".
+Use the "spinning" bit to communicate whether one of the waiting threads is
+awake and looping while trying to acquire the lock.
+Threads mutually exclude each other from the "spinning" state,
+but they won't block while attempting to acquire the bit.
+Only the thread that owns the "spinning" bit is allowed to reload the state
+word in a loop.
+It releases the "spinning" bit before going to sleep.
+The other waiting threads go directly to sleep.
+The thread that unlocks a mutex can avoid waking a thread if it sees that one
+is already awake and spinning.
+For the purposes of the Go runtime, I'm calling this design "spinbit".
+
+### futex-based option, https://go.dev/cl/601597
+
+I've prepared https://go.dev/cl/601597,
+which implements the "spinbit" design for GOOS=linux and GOARCH=amd64.
+I've prepared a matching [TLA+ model](./68578/spinbit.tla)
+to check for lost wakeups.
+(When relying on the `futex` syscall to maintain the list of sleeping Ms,
+it's easy to write lost-wakeup bugs.)
+
+It uses an atomic `Xchg8` operation on two different bytes of the mutex state
+word.
+The low byte records whether the mutex is locked,
+and whether one or more waiting Ms may be asleep.
+The "spinning" flag is in a separate byte and so can be independently
+manipulated with atomic `Xchg8` operations.
+The two bytes are within a single uintptr field (`runtime.mutex.key`).
+When the spinning M attempts to acquire the lock,
+it can do a CAS on the entire state word,
+setting the "locked" flag and clearing the "spinning" flag
+in a single operation.
+
+### Cross-OS option, https://go.dev/cl/620435
+
+I've also prepared https://go.dev/cl/620435 which unifies the lock_sema.go and
+lock_futex.go implementations and so supports all GOOS values for which Go
+supports multiple threads.
+(It uses `futex` to implement the `runtime.sema{create,sleep,wakeup}`
+functions for lock_futex.go platforms.)
+Go's development branch now includes `Xchg8` support for
+GOARCH=amd64,arm64,ppc64,ppc64le,
+and so that CL supports all of those architectures.
+
+The fast path for `runtime.lock2` and `runtime.unlock2` use `Xchg8` operations
+to interact with the "locked" flag.
+The lowest byte of the state word is dedicated to use with those `Xchg8`
+operations.
+Most of the upper bytes hold a partial pointer to an M.
+(The `runtime.m` datastructure is large enough to allow reconstructing the low
+bits from the partial pointer,
+with special handling for the non-heap-allocated `runtime.m0` value.)
+Beyond the 8 bits needed for use with `Xchg8`,
+a few more low bits are available for use as flags.
+One of those bits holds the "spinning" flag,
+which is manipulated with pointer-length `Load` and `CAS` operations.
+
+When Ms go to sleep they form a LIFO stack linked via `runtime.m.nextwaitm`
+pointers, as lock_sema.go does today.
+The list of waiting Ms is a multi-producer, single-consumer stack.
+Each M can add itself,
+but inspecting or removing Ms requires exclusive access.
+Today, lock_sema.go's `runtime.unlock2` uses the mutex itself to control that
+ownership.
+That puts any use of the sleeping M list in the critical path of the mutex.
+
+My proposal uses another bit of the state word as a try-lock to control
+inspecting and removing Ms from the list.
+This allows additional list-management code without slowing the critical path
+of a busy mutex, and use of efficient `Xchg8` operations in the fast paths.
+We'll need access to the list in order to attribute contention delay to the
+right critical section in the [mutex profile](https://go.dev/issue/66999).
+Access to the list will also let us periodically wake an M even when it's not
+strictly necessary, to combat tail latency that may be introduced by the
+reduction in wakeups.
+
+Here's the full layout of the `runtime.mutex.key` state word:
+Bit 0 holds the "locked" flag, the primary purpose of the mutex.
+Bit 1 is the "sleeping" flag, and is set when the upper bits point to an M.
+Bits 2 through 7 are unused, since they're lost with every `Xchg8` operation.
+Bit 8 holds the "spinning" try-lock, allowing the holder to reload the state
+word in a loop.
+Bit 9 holds the "stack" try-lock, allowing the holder to inspect and remove
+sleeping Ms from the LIFO stack.
+Bits 10 and higher of the state word hold bits 10 and higher of a pointer to
+the M at the top of the LIFO stack of sleeping waiters.
+
+## Rationale
+
+The status quo is a `runtime.lock2` implementation that experiences congestion
+collapse under high contention on machines with many hardware threads.
+Addressing that will require fewer threads loading the same cache line in a
+loop.
+
+The literature presents several options for scalable / non-collapsing mutexes.
+Some require an additional memory footprint for each mutex in proportion to
+the number of threads that may seek to acquire the lock.
+Some require threads to store a reference to a value that they will use to
+release each lock they hold.
+Go includes a `runtime.mutex` as part of every `chan`, and in some
+applications those values are the ones with the most contention.
+Coupled with `select`, there's no limit to the number of mutexes that an M can
+hold.
+That means neither of those forms of increased memory footprint is acceptable.
+
+The performance of fully uncontended `runtime.lock2`/`runtime.unlock2` pairs
+is also important to the project.
+That limits the use of many of the literature's proposed locking algorithms,
+if they include FIFO queue handoff behavior.
+On my test hardware
+(a linux/amd64 machine with i7-13700H, and a darwin/arm64 M1),
+a `runtime.mutex` value with zero or moderate contention can support
+50,000,000 uses per second on any threads,
+or can move between active threads 10,000,000 times per second,
+or can move between inactive threads (with sleep mediated by the OS)
+about 40,000 to 500,000 times per second (depending on the OS).
+Some amount of capture or barging, rather than queueing, is required to
+maintain the level of throughput that Go users have come to expect.
+
+Keeping the size of `runtime.mutex` values as they are today but allowing
+threads to sleep with fewer interruptions seems like fulfilling the goal of
+the original design.
+The main disadvantage I know of is the risk of increased tail latency:
+A small set of threads may be able to capture a contended mutex,
+passing it back and forth among themselves while the other threads sleep
+indefinitely.
+That's already a risk of the current lock_sema.go implementation,
+but the high volume of wakeups means threads are unlikely to sleep for long,
+and the list of sleeping threads may regularly dip to zero.
+
+The "cross-OS" option has an edge here:
+with it, the Go runtime maintains an explicit list of sleeping Ms and so can do
+targeted wakes or even direct handoffs to reduce starvation.
+
+## Compatibility
+
+There is no change in exported APIs.
+
+## Implementation
+
+I've prepared two options for the Go 1.24 release cycle.
+One relies on the `futex` syscall and the `Xchg8` operation, and so initially
+supports GOOS=linux and GOARCH=amd64: https://go.dev/cl/601597.
+The other relies on only the `Xchg8` operation and works with any GOOS value
+that supports threads: https://go.dev/cl/620435.
+Both are controlled by `GOEXPERIMENT=spinbitmutex`,
+enabled by default on supported platforms.
+
+## Open issues (if applicable)
+
+I appreciate feedback on the balance between simplicity,
+performance at zero or low contention,
+performance under extreme contention,
+both the performance and maintenance burden for non-first-class ports,
+and the accuracy of contention profiles.
diff --git a/design/68578/spinbit.cfg b/design/68578/spinbit.cfg
new file mode 100644
index 0000000..d6dc6eb
--- /dev/null
+++ b/design/68578/spinbit.cfg
@@ -0,0 +1,12 @@
+SPECIFICATION Spec
+
+INVARIANT TypeInvariant
+PROPERTY NoLostWakeups
+PROPERTY HaveAcquisitions
+
+CONSTANT
+  NULL = NULL
+  NumThreads = 3
+  NumAcquires = 1
+  NumSpins = 0
+  WakeAny = TRUE
diff --git a/design/68578/spinbit.tla b/design/68578/spinbit.tla
new file mode 100644
index 0000000..3023b20
--- /dev/null
+++ b/design/68578/spinbit.tla
@@ -0,0 +1,533 @@
+------------------------------ MODULE spinbit ------------------------------
+EXTENDS TLC, Integers, Sequences, FiniteSets
+
+CONSTANT NumThreads
+CONSTANT NumAcquires
+CONSTANT NumSpins \* performance tunable "spin"
+CONSTANT WakeAny \* reduce state space: wake any of the sleepers
+CONSTANT NULL \* [ model value ]
+
+ASSUME NumThreads >= 1
+ASSUME NumAcquires >= 0
+ASSUME NumSpins >= 0
+ASSUME WakeAny \in BOOLEAN
+
+Threads == 1..NumThreads
+
+(* --algorithm spinbit
+
+variables
+
+    \* logical owner
+    owner = NULL;
+
+    \* lock state, stored in l.key word
+    l_key_locked = FALSE; \* bit 0
+    l_key_sleeping = FALSE; \* bit 1
+    l_key_spinning = FALSE; \* bit 63
+
+    \* local variables
+    v_locked = [t \in Threads |-> FALSE];
+    v_sleeping = [t \in Threads |-> FALSE];
+    v_spinning = [t \in Threads |-> FALSE];
+    v8_locked = [t \in Threads |-> FALSE];
+    v8_sleeping = [t \in Threads |-> FALSE];
+    weSpin = [t \in Threads |-> FALSE];
+    i = [t \in Threads |-> 0];
+
+    \* OS state
+    sleepers = {};
+
+    \*
+    acquisitions = 0;
+
+    \* performance tunables
+    spin = NumSpins;
+
+define
+
+    TypeInvariant ==
+        /\ owner \in Threads \union {NULL}
+        /\ l_key_locked \in BOOLEAN
+        /\ l_key_sleeping \in BOOLEAN
+        /\ l_key_spinning \in BOOLEAN
+        /\ \A t \in Threads: v_locked[t] \in BOOLEAN
+        /\ \A t \in Threads: v_sleeping[t] \in BOOLEAN
+        /\ \A t \in Threads: v_spinning[t] \in BOOLEAN
+        /\ \A t \in Threads: v8_locked[t] \in BOOLEAN
+        /\ \A t \in Threads: v8_sleeping[t] \in BOOLEAN
+        /\ \A t \in Threads: weSpin[t] \in BOOLEAN
+        /\ \A t \in Threads: i[t] \in Nat
+        /\ sleepers \subseteq Threads
+        /\ acquisitions \in Nat
+
+    NoLostWakeups ==
+        (Cardinality(sleepers) > 0) ~> (Cardinality(sleepers) = 0)
+
+    HaveAcquisitions ==
+        <>[](acquisitions >= NumAcquires)
+
+end define;
+
+\* This is a lock, so if a process dies while holding it the rest of the simulation
+\* won't be able to make progress. We need "fair process" so the threads don't die.
+fair process thread \in Threads
+begin
+    NonCriticalSection:
+        if acquisitions >= NumAcquires then
+            goto Done;
+        end if;
+
+    SpeculativeGrab:
+        \* v8 = atomic.Xchg8(k8, mutexLocked)
+        v8_locked[self] := l_key_locked;
+        v8_sleeping[self] := l_key_sleeping;
+        l_key_locked := TRUE;
+        l_key_sleeping := FALSE;
+    SpeculativeGrabCheck:
+        if v8_locked[self] = FALSE then
+            if v8_sleeping[self] = TRUE then
+                \* atomic.Or8(k8, mutexSleeping)
+                l_key_sleeping := TRUE;
+            end if;
+            goto Locked;
+        else
+            \* var v uintptr = mutexLocked
+            v_locked[self] := TRUE;
+            v_sleeping[self] := FALSE;
+            v_spinning[self] := FALSE;
+            weSpin[self] := FALSE;
+        end if;
+
+    EnterSlowPath:
+        i[self] := 0;
+
+    TryAcquire:
+        if v_locked[self] = TRUE then
+            goto SetSleepBit
+        else
+            if weSpin[self] then
+                \* next := (v &^ mutexSpinning) | mutexLocked | mutexSleeping
+                \* if atomic.Casuintptr(&l.key, v, next) {
+                if v_locked[self] = l_key_locked
+                    /\ v_sleeping[self] = l_key_sleeping
+                    /\ v_spinning[self] = l_key_spinning then
+                    l_key_locked := TRUE;
+                    l_key_sleeping := TRUE;
+                    l_key_spinning := FALSE;
+                    goto Locked;
+                end if;
+            else
+                \* prev8 := atomic.Xchg8(k8, mutexLocked|mutexSleeping)
+                \* if prev8&mutexLocked == 0 {
+                if l_key_locked = FALSE then
+                    l_key_locked := TRUE;
+                    l_key_sleeping := TRUE;
+                    goto Locked;
+                else
+                    l_key_locked := TRUE;
+                    l_key_sleeping := TRUE;
+                end if;
+            end if;
+        end if;
+    TryAcquireLoop:
+        \* v = atomic.Loaduintptr(&l.key)
+        v_locked[self] := l_key_locked;
+        v_sleeping[self] := l_key_sleeping;
+        v_spinning[self] := l_key_spinning;
+        goto TryAcquire;
+
+    SetSleepBit:
+        \* atomic.Or8(k8, mutexSleeping)
+        if v_sleeping[self] = FALSE then
+            l_key_sleeping := TRUE;
+            v_sleeping[self] := TRUE;
+        end if;
+
+    SetSpinBit:
+        \* if !weSpin && atomic.Xchg8(key8Upper(&l.key), mutexSpinning>>((goarch.PtrSize-1)*8)) == 0 {
+        if weSpin[self] = FALSE then
+            weSpin[self] := (l_key_spinning = FALSE);
+            l_key_spinning := TRUE;
+        end if;
+
+    DoSpin:
+        if weSpin[self] = TRUE then
+            if i[self] < spin then
+                \* procyield(active_spin_cnt)
+                \* v = atomic.Loaduintptr(&l.key)
+                v_locked[self] := l_key_locked;
+                v_sleeping[self] := l_key_sleeping;
+                v_spinning[self] := l_key_spinning;
+                goto TryAcquire;
+            else
+                weSpin[self] := FALSE;
+                \* atomic.Xchg8(key8Upper(&l.key), 0)
+                l_key_spinning := FALSE;
+            end if;
+        end if;
+
+    Sleep:
+        \* v = atomic.Loaduintptr(&l.key)
+        v_locked[self] := l_key_locked;
+        v_sleeping[self] := l_key_sleeping;
+        v_spinning[self] := l_key_spinning;
+    Futex:
+        if v_locked[self] = FALSE then
+            goto TryAcquire;
+        else
+            \* futexsleep(k32, uint32(v|mutexSleeping), -1)
+            if l_key_locked = v_locked[self]
+                /\ l_key_sleeping = TRUE then
+                sleepers := sleepers \union {self};
+            end if;
+        end if;
+
+    Sleeping:
+        await self \notin sleepers;
+        i[self] := 0;
+        \* v = atomic.Loaduintptr(&l.key)
+        v_locked[self] := l_key_locked;
+        v_sleeping[self] := l_key_sleeping;
+        v_spinning[self] := l_key_spinning;
+        goto TryAcquire;
+
+    Locked:
+        owner := self;
+    CriticalSection:
+        acquisitions := acquisitions + 1;
+    Unlock:
+        owner := NULL;
+        \* prev8 := atomic.Xchg8(key8(&l.key), 0)
+        v8_locked[self] := l_key_locked;
+        v8_sleeping[self] := l_key_sleeping;
+        l_key_locked := FALSE;
+        l_key_sleeping := FALSE;
+        if v8_sleeping[self] = TRUE then
+            goto Wakeup;
+        else
+            goto NonCriticalSection;
+        end if;
+
+    Wakeup:
+        \* v := atomic.Loaduintptr(&l.key)
+        v_locked[self] := l_key_locked;
+        v_sleeping[self] := l_key_sleeping;
+        v_spinning[self] := l_key_spinning;
+    Wake2:
+        if v_spinning[self] = FALSE then
+            if Cardinality(sleepers) > 0 then
+                if WakeAny then
+                    with other = CHOOSE other \in sleepers: TRUE do
+                        sleepers := sleepers \ {other};
+                    end with;
+                else
+                    with other \in sleepers do
+                        sleepers := sleepers \ {other};
+                    end with;
+                end if;
+            end if;
+        end if;
+        goto NonCriticalSection;
+
+end process;
+
+end algorithm; *)
+\* BEGIN TRANSLATION (chksum(pcal) = "a993096" /\ chksum(tla) = "41bbe278")
+VARIABLES owner, l_key_locked, l_key_sleeping, l_key_spinning, v_locked, 
+          v_sleeping, v_spinning, v8_locked, v8_sleeping, weSpin, i, sleepers, 
+          acquisitions, spin, pc
+
+(* define statement *)
+TypeInvariant ==
+    /\ owner \in Threads \union {NULL}
+    /\ l_key_locked \in BOOLEAN
+    /\ l_key_sleeping \in BOOLEAN
+    /\ l_key_spinning \in BOOLEAN
+    /\ \A t \in Threads: v_locked[t] \in BOOLEAN
+    /\ \A t \in Threads: v_sleeping[t] \in BOOLEAN
+    /\ \A t \in Threads: v_spinning[t] \in BOOLEAN
+    /\ \A t \in Threads: v8_locked[t] \in BOOLEAN
+    /\ \A t \in Threads: v8_sleeping[t] \in BOOLEAN
+    /\ \A t \in Threads: weSpin[t] \in BOOLEAN
+    /\ \A t \in Threads: i[t] \in Nat
+    /\ sleepers \subseteq Threads
+    /\ acquisitions \in Nat
+
+NoLostWakeups ==
+    (Cardinality(sleepers) > 0) ~> (Cardinality(sleepers) = 0)
+
+HaveAcquisitions ==
+    <>[](acquisitions >= NumAcquires)
+
+
+vars == << owner, l_key_locked, l_key_sleeping, l_key_spinning, v_locked, 
+           v_sleeping, v_spinning, v8_locked, v8_sleeping, weSpin, i, 
+           sleepers, acquisitions, spin, pc >>
+
+ProcSet == (Threads)
+
+Init == (* Global variables *)
+        /\ owner = NULL
+        /\ l_key_locked = FALSE
+        /\ l_key_sleeping = FALSE
+        /\ l_key_spinning = FALSE
+        /\ v_locked = [t \in Threads |-> FALSE]
+        /\ v_sleeping = [t \in Threads |-> FALSE]
+        /\ v_spinning = [t \in Threads |-> FALSE]
+        /\ v8_locked = [t \in Threads |-> FALSE]
+        /\ v8_sleeping = [t \in Threads |-> FALSE]
+        /\ weSpin = [t \in Threads |-> FALSE]
+        /\ i = [t \in Threads |-> 0]
+        /\ sleepers = {}
+        /\ acquisitions = 0
+        /\ spin = NumSpins
+        /\ pc = [self \in ProcSet |-> "NonCriticalSection"]
+
+NonCriticalSection(self) == /\ pc[self] = "NonCriticalSection"
+                            /\ IF acquisitions >= NumAcquires
+                                  THEN /\ pc' = [pc EXCEPT ![self] = "Done"]
+                                  ELSE /\ pc' = [pc EXCEPT ![self] = "SpeculativeGrab"]
+                            /\ UNCHANGED << owner, l_key_locked, 
+                                            l_key_sleeping, l_key_spinning, 
+                                            v_locked, v_sleeping, v_spinning, 
+                                            v8_locked, v8_sleeping, weSpin, i, 
+                                            sleepers, acquisitions, spin >>
+
+SpeculativeGrab(self) == /\ pc[self] = "SpeculativeGrab"
+                         /\ v8_locked' = [v8_locked EXCEPT ![self] = l_key_locked]
+                         /\ v8_sleeping' = [v8_sleeping EXCEPT ![self] = l_key_sleeping]
+                         /\ l_key_locked' = TRUE
+                         /\ l_key_sleeping' = FALSE
+                         /\ pc' = [pc EXCEPT ![self] = "SpeculativeGrabCheck"]
+                         /\ UNCHANGED << owner, l_key_spinning, v_locked, 
+                                         v_sleeping, v_spinning, weSpin, i, 
+                                         sleepers, acquisitions, spin >>
+
+SpeculativeGrabCheck(self) == /\ pc[self] = "SpeculativeGrabCheck"
+                              /\ IF v8_locked[self] = FALSE
+                                    THEN /\ IF v8_sleeping[self] = TRUE
+                                               THEN /\ l_key_sleeping' = TRUE
+                                               ELSE /\ TRUE
+                                                    /\ UNCHANGED l_key_sleeping
+                                         /\ pc' = [pc EXCEPT ![self] = "Locked"]
+                                         /\ UNCHANGED << v_locked, v_sleeping, 
+                                                         v_spinning, weSpin >>
+                                    ELSE /\ v_locked' = [v_locked EXCEPT ![self] = TRUE]
+                                         /\ v_sleeping' = [v_sleeping EXCEPT ![self] = FALSE]
+                                         /\ v_spinning' = [v_spinning EXCEPT ![self] = FALSE]
+                                         /\ weSpin' = [weSpin EXCEPT ![self] = FALSE]
+                                         /\ pc' = [pc EXCEPT ![self] = "EnterSlowPath"]
+                                         /\ UNCHANGED l_key_sleeping
+                              /\ UNCHANGED << owner, l_key_locked, 
+                                              l_key_spinning, v8_locked, 
+                                              v8_sleeping, i, sleepers, 
+                                              acquisitions, spin >>
+
+EnterSlowPath(self) == /\ pc[self] = "EnterSlowPath"
+                       /\ i' = [i EXCEPT ![self] = 0]
+                       /\ pc' = [pc EXCEPT ![self] = "TryAcquire"]
+                       /\ UNCHANGED << owner, l_key_locked, l_key_sleeping, 
+                                       l_key_spinning, v_locked, v_sleeping, 
+                                       v_spinning, v8_locked, v8_sleeping, 
+                                       weSpin, sleepers, acquisitions, spin >>
+
+TryAcquire(self) == /\ pc[self] = "TryAcquire"
+                    /\ IF v_locked[self] = TRUE
+                          THEN /\ pc' = [pc EXCEPT ![self] = "SetSleepBit"]
+                               /\ UNCHANGED << l_key_locked, l_key_sleeping, 
+                                               l_key_spinning >>
+                          ELSE /\ IF weSpin[self]
+                                     THEN /\ IF v_locked[self] = l_key_locked
+                                                 /\ v_sleeping[self] = l_key_sleeping
+                                                 /\ v_spinning[self] = l_key_spinning
+                                                THEN /\ l_key_locked' = TRUE
+                                                     /\ l_key_sleeping' = TRUE
+                                                     /\ l_key_spinning' = FALSE
+                                                     /\ pc' = [pc EXCEPT ![self] = "Locked"]
+                                                ELSE /\ pc' = [pc EXCEPT ![self] = "TryAcquireLoop"]
+                                                     /\ UNCHANGED << l_key_locked, 
+                                                                     l_key_sleeping, 
+                                                                     l_key_spinning >>
+                                     ELSE /\ IF l_key_locked = FALSE
+                                                THEN /\ l_key_locked' = TRUE
+                                                     /\ l_key_sleeping' = TRUE
+                                                     /\ pc' = [pc EXCEPT ![self] = "Locked"]
+                                                ELSE /\ l_key_locked' = TRUE
+                                                     /\ l_key_sleeping' = TRUE
+                                                     /\ pc' = [pc EXCEPT ![self] = "TryAcquireLoop"]
+                                          /\ UNCHANGED l_key_spinning
+                    /\ UNCHANGED << owner, v_locked, v_sleeping, v_spinning, 
+                                    v8_locked, v8_sleeping, weSpin, i, 
+                                    sleepers, acquisitions, spin >>
+
+TryAcquireLoop(self) == /\ pc[self] = "TryAcquireLoop"
+                        /\ v_locked' = [v_locked EXCEPT ![self] = l_key_locked]
+                        /\ v_sleeping' = [v_sleeping EXCEPT ![self] = l_key_sleeping]
+                        /\ v_spinning' = [v_spinning EXCEPT ![self] = l_key_spinning]
+                        /\ pc' = [pc EXCEPT ![self] = "TryAcquire"]
+                        /\ UNCHANGED << owner, l_key_locked, l_key_sleeping, 
+                                        l_key_spinning, v8_locked, v8_sleeping, 
+                                        weSpin, i, sleepers, acquisitions, 
+                                        spin >>
+
+SetSleepBit(self) == /\ pc[self] = "SetSleepBit"
+                     /\ IF v_sleeping[self] = FALSE
+                           THEN /\ l_key_sleeping' = TRUE
+                                /\ v_sleeping' = [v_sleeping EXCEPT ![self] = TRUE]
+                           ELSE /\ TRUE
+                                /\ UNCHANGED << l_key_sleeping, v_sleeping >>
+                     /\ pc' = [pc EXCEPT ![self] = "SetSpinBit"]
+                     /\ UNCHANGED << owner, l_key_locked, l_key_spinning, 
+                                     v_locked, v_spinning, v8_locked, 
+                                     v8_sleeping, weSpin, i, sleepers, 
+                                     acquisitions, spin >>
+
+SetSpinBit(self) == /\ pc[self] = "SetSpinBit"
+                    /\ IF weSpin[self] = FALSE
+                          THEN /\ weSpin' = [weSpin EXCEPT ![self] = (l_key_spinning = FALSE)]
+                               /\ l_key_spinning' = TRUE
+                          ELSE /\ TRUE
+                               /\ UNCHANGED << l_key_spinning, weSpin >>
+                    /\ pc' = [pc EXCEPT ![self] = "DoSpin"]
+                    /\ UNCHANGED << owner, l_key_locked, l_key_sleeping, 
+                                    v_locked, v_sleeping, v_spinning, 
+                                    v8_locked, v8_sleeping, i, sleepers, 
+                                    acquisitions, spin >>
+
+DoSpin(self) == /\ pc[self] = "DoSpin"
+                /\ IF weSpin[self] = TRUE
+                      THEN /\ IF i[self] < spin
+                                 THEN /\ v_locked' = [v_locked EXCEPT ![self] = l_key_locked]
+                                      /\ v_sleeping' = [v_sleeping EXCEPT ![self] = l_key_sleeping]
+                                      /\ v_spinning' = [v_spinning EXCEPT ![self] = l_key_spinning]
+                                      /\ pc' = [pc EXCEPT ![self] = "TryAcquire"]
+                                      /\ UNCHANGED << l_key_spinning, weSpin >>
+                                 ELSE /\ weSpin' = [weSpin EXCEPT ![self] = FALSE]
+                                      /\ l_key_spinning' = FALSE
+                                      /\ pc' = [pc EXCEPT ![self] = "Sleep"]
+                                      /\ UNCHANGED << v_locked, v_sleeping, 
+                                                      v_spinning >>
+                      ELSE /\ pc' = [pc EXCEPT ![self] = "Sleep"]
+                           /\ UNCHANGED << l_key_spinning, v_locked, 
+                                           v_sleeping, v_spinning, weSpin >>
+                /\ UNCHANGED << owner, l_key_locked, l_key_sleeping, v8_locked, 
+                                v8_sleeping, i, sleepers, acquisitions, spin >>
+
+Sleep(self) == /\ pc[self] = "Sleep"
+               /\ v_locked' = [v_locked EXCEPT ![self] = l_key_locked]
+               /\ v_sleeping' = [v_sleeping EXCEPT ![self] = l_key_sleeping]
+               /\ v_spinning' = [v_spinning EXCEPT ![self] = l_key_spinning]
+               /\ pc' = [pc EXCEPT ![self] = "Futex"]
+               /\ UNCHANGED << owner, l_key_locked, l_key_sleeping, 
+                               l_key_spinning, v8_locked, v8_sleeping, weSpin, 
+                               i, sleepers, acquisitions, spin >>
+
+Futex(self) == /\ pc[self] = "Futex"
+               /\ IF v_locked[self] = FALSE
+                     THEN /\ pc' = [pc EXCEPT ![self] = "TryAcquire"]
+                          /\ UNCHANGED sleepers
+                     ELSE /\ IF l_key_locked = v_locked[self]
+                                 /\ l_key_sleeping = TRUE
+                                THEN /\ sleepers' = (sleepers \union {self})
+                                ELSE /\ TRUE
+                                     /\ UNCHANGED sleepers
+                          /\ pc' = [pc EXCEPT ![self] = "Sleeping"]
+               /\ UNCHANGED << owner, l_key_locked, l_key_sleeping, 
+                               l_key_spinning, v_locked, v_sleeping, 
+                               v_spinning, v8_locked, v8_sleeping, weSpin, i, 
+                               acquisitions, spin >>
+
+Sleeping(self) == /\ pc[self] = "Sleeping"
+                  /\ self \notin sleepers
+                  /\ i' = [i EXCEPT ![self] = 0]
+                  /\ v_locked' = [v_locked EXCEPT ![self] = l_key_locked]
+                  /\ v_sleeping' = [v_sleeping EXCEPT ![self] = l_key_sleeping]
+                  /\ v_spinning' = [v_spinning EXCEPT ![self] = l_key_spinning]
+                  /\ pc' = [pc EXCEPT ![self] = "TryAcquire"]
+                  /\ UNCHANGED << owner, l_key_locked, l_key_sleeping, 
+                                  l_key_spinning, v8_locked, v8_sleeping, 
+                                  weSpin, sleepers, acquisitions, spin >>
+
+Locked(self) == /\ pc[self] = "Locked"
+                /\ owner' = self
+                /\ pc' = [pc EXCEPT ![self] = "CriticalSection"]
+                /\ UNCHANGED << l_key_locked, l_key_sleeping, l_key_spinning, 
+                                v_locked, v_sleeping, v_spinning, v8_locked, 
+                                v8_sleeping, weSpin, i, sleepers, acquisitions, 
+                                spin >>
+
+CriticalSection(self) == /\ pc[self] = "CriticalSection"
+                         /\ acquisitions' = acquisitions + 1
+                         /\ pc' = [pc EXCEPT ![self] = "Unlock"]
+                         /\ UNCHANGED << owner, l_key_locked, l_key_sleeping, 
+                                         l_key_spinning, v_locked, v_sleeping, 
+                                         v_spinning, v8_locked, v8_sleeping, 
+                                         weSpin, i, sleepers, spin >>
+
+Unlock(self) == /\ pc[self] = "Unlock"
+                /\ owner' = NULL
+                /\ v8_locked' = [v8_locked EXCEPT ![self] = l_key_locked]
+                /\ v8_sleeping' = [v8_sleeping EXCEPT ![self] = l_key_sleeping]
+                /\ l_key_locked' = FALSE
+                /\ l_key_sleeping' = FALSE
+                /\ IF v8_sleeping'[self] = TRUE
+                      THEN /\ pc' = [pc EXCEPT ![self] = "Wakeup"]
+                      ELSE /\ pc' = [pc EXCEPT ![self] = "NonCriticalSection"]
+                /\ UNCHANGED << l_key_spinning, v_locked, v_sleeping, 
+                                v_spinning, weSpin, i, sleepers, acquisitions, 
+                                spin >>
+
+Wakeup(self) == /\ pc[self] = "Wakeup"
+                /\ v_locked' = [v_locked EXCEPT ![self] = l_key_locked]
+                /\ v_sleeping' = [v_sleeping EXCEPT ![self] = l_key_sleeping]
+                /\ v_spinning' = [v_spinning EXCEPT ![self] = l_key_spinning]
+                /\ pc' = [pc EXCEPT ![self] = "Wake2"]
+                /\ UNCHANGED << owner, l_key_locked, l_key_sleeping, 
+                                l_key_spinning, v8_locked, v8_sleeping, weSpin, 
+                                i, sleepers, acquisitions, spin >>
+
+Wake2(self) == /\ pc[self] = "Wake2"
+               /\ IF v_spinning[self] = FALSE
+                     THEN /\ IF Cardinality(sleepers) > 0
+                                THEN /\ IF WakeAny
+                                           THEN /\ LET other == CHOOSE other \in sleepers: TRUE IN
+                                                     sleepers' = sleepers \ {other}
+                                           ELSE /\ \E other \in sleepers:
+                                                     sleepers' = sleepers \ {other}
+                                ELSE /\ TRUE
+                                     /\ UNCHANGED sleepers
+                     ELSE /\ TRUE
+                          /\ UNCHANGED sleepers
+               /\ pc' = [pc EXCEPT ![self] = "NonCriticalSection"]
+               /\ UNCHANGED << owner, l_key_locked, l_key_sleeping, 
+                               l_key_spinning, v_locked, v_sleeping, 
+                               v_spinning, v8_locked, v8_sleeping, weSpin, i, 
+                               acquisitions, spin >>
+
+thread(self) == NonCriticalSection(self) \/ SpeculativeGrab(self)
+                   \/ SpeculativeGrabCheck(self) \/ EnterSlowPath(self)
+                   \/ TryAcquire(self) \/ TryAcquireLoop(self)
+                   \/ SetSleepBit(self) \/ SetSpinBit(self) \/ DoSpin(self)
+                   \/ Sleep(self) \/ Futex(self) \/ Sleeping(self)
+                   \/ Locked(self) \/ CriticalSection(self) \/ Unlock(self)
+                   \/ Wakeup(self) \/ Wake2(self)
+
+(* Allow infinite stuttering to prevent deadlock on termination. *)
+Terminating == /\ \A self \in ProcSet: pc[self] = "Done"
+               /\ UNCHANGED vars
+
+Next == (\E self \in Threads: thread(self))
+           \/ Terminating
+
+Spec == /\ Init /\ [][Next]_vars
+        /\ \A self \in Threads : WF_vars(thread(self))
+
+Termination == <>(\A self \in ProcSet: pc[self] = "Done")
+
+\* END TRANSLATION 
+
+=============================================================================
+\* Modification History
+\* Last modified Tue Oct 01 13:18:42 PDT 2024 by rhysh
+\* Created Tue Sep 17 12:10:43 PDT 2024 by rhysh