design: add 18802-percpu-sharded.md

For golang/go#18802.

Change-Id: I10678de235500dc9a9a217e4cf114731c8bd1162
GitHub-Last-Rev: 4cec2dbfd52a4fe71747399a693e4a26b946f98f
GitHub-Pull-Request: golang/proposal#13
Reviewed-on: https://go-review.googlesource.com/118135
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
diff --git a/design/18802-percpu-sharded.md b/design/18802-percpu-sharded.md
new file mode 100644
index 0000000..69347e4
--- /dev/null
+++ b/design/18802-percpu-sharded.md
@@ -0,0 +1,652 @@
+# Proposal: percpu.Sharded, an API for reducing cache contention
+
+Discussion at https://golang.org/issue/18802
+
+## Abstract
+
+As it stands, Go programs do not have a good way to avoid contention when
+combining highly concurrent code with shared memory locations that frequently
+require mutation. This proposal describes a new package and type to satisfy this
+need.
+
+## Background
+
+There are several scenarios in which a Go program might want to have shared data
+that is mutated frequently. We will briefly discuss some of these scenarios,
+so that we can evaluate the proposed API against these concrete use-cases.
+
+### Counters
+
+In RPC servers or in scientific computing code, there is often a need for global
+counters. For instance, in an RPC server, a global counter might count the
+number of requests received by the server, or the number of bytes received by
+the server. Go makes it easy to write RPC servers which are inherently
+concurrent, often processing each connection and each request on a concurrent
+goroutine. This means that in the context of a multicore machine, several
+goroutines can be incrementing the same global counter in parallel. Using an API
+like `atomic.AddInt64` will ensure that such a counter is lock-free, but
+parallel goroutines will be contending for the same cache line, so this counter
+will not experience linear scalability as the number of cores is increased.
+(Indeed, throughput may even decrease as cores are added, due to the increased
+core-to-core communication.)
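+
+For concreteness, here is a minimal illustration of the contended pattern
+described above (the names are illustrative):
+
+```go
+var requestCount int64 // a single memory location shared by all cores
+
+func handleRequest() {
+  // Lock-free, but every core writes to the same memory location, so the
+  // cache line holding requestCount ping-pongs between cores as parallelism
+  // increases.
+  atomic.AddInt64(&requestCount, 1)
+}
+```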
+
+It's probably also worth noting that there are other similar use-cases in this
+space (e.g. types that record distributions rather than just sums, max-value
+trackers, etc).
+
+### Read-write locks
+
+It is common in programs to have data that is read frequently, but written very
+rarely. In these cases, a common synchronization primitive is `sync.RWMutex`,
+which offers an `RLock`/`RUnlock` API for readers. When no writers are
+interacting with an `RWMutex`, an arbitrary number of goroutines can use the
+read-side of the `RWMutex` without blocking.
+
+However, in order to correctly pair calls to `RLock` with calls to `RUnlock`,
+`RWMutex` internally maintains a counter, which is incremented during `RLock`
+and decremented during `RUnlock`. There is also other shared mutable state that
+is atomically updated inside the `RWMutex` during these calls (and during
+calls to `Lock` and `Unlock`). For reasons similar to the previous example, if
+many goroutines are acquiring and releasing read-locks concurrently and the
+program is running on a multicore machine, then it is likely that the
+performance of `RWMutex.RLock`/`RWMutex.RUnlock` will not scale linearly with
+the number of cores given to the program.
+
+### RPC clients
+
+In programs that make many RPC calls in parallel, there can be
+contention on shared mutable state stored inside the RPC or HTTP clients. For
+instance, an RPC client might support connecting to a pool of servers and
+implement a configurable load balancing policy to select a connection to use
+for a given RPC; these load balancing policies often need to maintain state for
+each connection in the pool of managed connections. For instance, an
+implementation of the "Least-Loaded" policy needs to maintain a counter of
+active requests per connection, such that a new request can select the
+connection with the least number of active requests. In scenarios where a client
+is performing a large number of requests in parallel (perhaps enqueueing many
+RPCs before finally waiting on them at a later point in the program), then
+contention on this internal state can start to affect the rate at which the
+requests can be dispatched.
+
+### Order-independent accumulators
+
+In data-processing pipelines, code running in a particular stage may want to
+'batch' its output, such that it only sends data downstream in N-element
+batches, rather than sending single elements through the pipeline at a time. In
+the single-goroutine case where the element type is 'byte', the familiar type
+`bufio.Writer` implements this pattern. Indeed, one option for the general
+data-processing pipeline is to have a single goroutine
+run every stage in the pipeline from end-to-end, and then instantiate a small
+number of parallel pipeline instances. This strategy effectively handles
+pipelines composed solely of stages dominated by CPU-time. However, if a
+pipeline contains any IO (e.g. initially reading the input from a distributed
+file system, making RPCs, or writing the result back to a distributed file
+system), then this setup will not be efficient, as a single stall in IO will
+take out a significant chunk of your throughput.
+
+To mitigate this problem, IO-bound stages need to run many goroutines. Indeed,
+a clever framework (like Apache Beam) can detect these sorts of situations
+dynamically, by measuring the rate of stage input compared to the rate of stage
+output; it can even reactively increase or decrease the "concurrency level" of
+a stage in response to these measurements. In Beam's case, it might do this by
+dynamically changing the number of threads-per-binary, or number of
+workers-per-stage.
+
+When stages have varying concurrency levels, but are connected to each other in
+a pipeline structure, it is important to place a concurrency-safe abstraction
+between the stages to buffer elements waiting to be processed by the next stage.
+Ideally, this structure would minimize the contention experienced by the caller.
+
+## The Proposed API
+
+To solve these problems, we propose an API with a single new type
+`percpu.Sharded`. Here is an outline of the proposed API.
+
+```go
+// Package percpu provides low-level utilities for avoiding contention on
+// multicore machines.
+package percpu
+
+// A Sharded is a container of homogeneously typed values.
+//
+// On a best effort basis, the runtime will strongly associate a given value
+// with a CPU core. That is to say, retrieving a value twice on the same CPU
+// core will return the same value with high probability. Note that the runtime
+// cannot guarantee this fact, and clients must assume that retrieved values
+// can be shared between concurrently executing goroutines.
+//
+// Once a value is placed in a Sharded, the Sharded will retain a reference to
+// this value permanently. Clients can control the maximum number of distinct
+// values created using the SetMaxShards API.
+//
+// A Sharded must not be copied after first use.
+//
+// All methods are safe to call from multiple goroutines.
+type Sharded struct {
+  // contains unexported fields
+}
+
+// SetMaxShards sets a limit on the maximum number of elements stored in the
+// Sharded.
+//
+// It does not apply retroactively; any elements already created will remain
+// inside the Sharded.
+//
+// If maxShards is less than 1, SetMaxShards panics.
+func (s *Sharded) SetMaxShards(maxShards int)
+
+// GetOrCreate retrieves a value roughly associated with the current CPU. If
+// there is no such value, then createFn is called to create a value, which is
+// stored in the Sharded and returned.
+//
+// All calls to createFn are serialized; this means that one must complete
+// before the next one is started.
+//
+// If createFn returns nil, GetOrCreate panics.
+//
+// If createFn is called with a ShardInfo.ShardIndex equal to X, no future call
+// to GetOrCreate will call createFn again with a ShardInfo.ShardIndex equal to
+// X.
+func (s *Sharded) GetOrCreate(createFn func(ShardInfo) interface{}) interface{}
+
+// Get retrieves any preexisting value associated with the current CPU. If
+// there is no such value, nil is returned.
+func (s *Sharded) Get() interface{}
+
+// Do iterates over a snapshot of all elements stored in the Sharded, and calls
+// fn once for each element.
+//
+// If more elements are created during the iteration itself, they may be
+// visible to the iteration, but this is not guaranteed. For stronger
+// guarantees, see DoLocked.
+func (s *Sharded) Do(fn func(interface{}))
+
+// DoLocked iterates over all the elements stored in the Sharded, and calls fn
+// once for each element.
+//
+// DoLocked will observe a consistent snapshot of the elements in the Sharded;
+// any previous creations will complete before the iteration begins, and all
+// subsequent creations will wait until the iteration ends.
+func (s *Sharded) DoLocked(fn func(interface{}))
+
+// ShardInfo contains information about a CPU core.
+type ShardInfo struct {
+  // ShardIndex is strictly less than the maximum number of shards (see
+  // SetMaxShards).
+  ShardIndex int
+}
+```
+
+## Evaluating the use-cases
+
+Here, we evaluate the proposed API in light of the use-cases described above.
+
+### Counters
+
+A counter API can fairly easily be built on top of `percpu.Sharded`.
+Specifically, it would offer two methods `IncrementBy(int64)`, and `Sum() int64`.
+The former would only allow positive increments (if required, clients can build
+negative increments by composing two counters, one for additions and one for
+subtractions).
+
+The implementation of `IncrementBy` would call `GetOrCreate`, passing a
+function that returned an `*int64`. To avoid false sharing between cache lines,
+it would probably return it as an interior pointer into a struct with
+appropriate padding. Once the pointer is retrieved from `GetOrCreate`, the
+function would then use `atomic.AddInt64` on that pointer with the value passed
+to `IncrementBy`.
+
+The implementation of `Sum` would call `Do` to retrieve a snapshot of all
+previously created values, then sum up their values using `atomic.LoadInt64`.
+
+If the application is managing many long-lived counters, then one possible
+optimization would be to implement the `Counter` type in terms of a `counterBatch`
+(which logically encapsulates `N` independent counters). This can drastically
+limit the padding required to avoid false sharing between cache lines.
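+
+As a rough illustration, a minimal sketch of such a `Counter` (without the
+`counterBatch` optimization, and returning the padded struct rather than an
+interior pointer, for simplicity) might look as follows; the padding size is
+an assumption:
+
+```go
+// paddedInt64 is padded out to a full cache line (64 bytes is assumed here)
+// to avoid false sharing between values created for different shards.
+type paddedInt64 struct {
+  n int64
+  _ [56]byte
+}
+
+type Counter struct {
+  shards percpu.Sharded
+}
+
+func (c *Counter) IncrementBy(n int64) {
+  p := c.shards.GetOrCreate(func(percpu.ShardInfo) interface{} {
+    return new(paddedInt64)
+  }).(*paddedInt64)
+  atomic.AddInt64(&p.n, n)
+}
+
+func (c *Counter) Sum() int64 {
+  var sum int64
+  c.shards.Do(func(v interface{}) {
+    sum += atomic.LoadInt64(&v.(*paddedInt64).n)
+  })
+  return sum
+}
+```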
+
+### Read-write locks
+
+It is a little tricky to implement a drop-in replacement for `sync.RWMutex` on
+top of `percpu.Sharded`. Naively, one could imagine a sharded lock composed of
+many internal `sync.RWMutex` instances. Calling `RLock()` on the aggregate lock
+would grab a single `sync.RWMutex` instance using `GetOrCreate` and then call
+`RLock()` on that instance. Unfortunately, because there is no state passed
+between `RLock()` and `RUnlock()` (something we should probably consider fixing
+for Go 2), we cannot implement `RUnlock()` efficiently: the goroutine might
+since have migrated to a different shard, and therefore we've lost the
+association to the instance chosen by the original `RLock()`.
+
+That said, since such a sharded lock would be considerably more memory-hungry
+than a normal `sync.RWMutex`, callers should only replace truly contended
+mutexes with a sharded lock, so requiring them to make minor API changes should
+not be too onerous (particularly for mutexes, which should always be private
+implementation details, and therefore not cross API boundaries). In particular,
+one could have `RLock()` on the sharded lock return a `RLockHandle` type, which
+has a `RUnlock()` method. This `RLockHandle` could keep an internal pointer to
+the `sync.RWMutex` that was initially chosen, and it can then `RUnlock()` that
+specific instance.
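+
+A minimal sketch of this handle-based shape, built on the proposed API (the
+type names are illustrative; the write-lock side is omitted, since it needs
+additional care around shards created after a write lock is taken):
+
+```go
+type ShardedRWMutex struct {
+  shards percpu.Sharded
+}
+
+type RLockHandle struct {
+  mu *sync.RWMutex
+}
+
+func (s *ShardedRWMutex) RLock() RLockHandle {
+  mu := s.shards.GetOrCreate(func(percpu.ShardInfo) interface{} {
+    return new(sync.RWMutex)
+  }).(*sync.RWMutex)
+  mu.RLock()
+  return RLockHandle{mu: mu}
+}
+
+// RUnlock releases the specific shard that RLock acquired, even if the
+// goroutine has since migrated to a different CPU.
+func (h RLockHandle) RUnlock() {
+  h.mu.RUnlock()
+}
+```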
+
+It's worth noting that it's also possible to drastically change the standard
+library's `sync.RWMutex` implementation itself to be scalable by default using
+`percpu.Sharded`; this is why the implementation sketch below is careful not to
+use the `sync` package, to avoid circular dependencies. See Facebook's
+[SharedMutex](https://github.com/facebook/folly/blob/a440441d2c6ba08b91ce3a320a61cf0f120fe7f3/folly/SharedMutex.h#L148)
+class to get a sense of how this could be done. However, that requires
+significant research and deserves a proposal of its own.
+
+### RPC clients
+
+It's straightforward to use `percpu.Sharded` to implement a sharded RPC client.
+
+This is a case where it's likely that the default implementation will continue to
+be unsharded, and a program will need to explicitly say something like
+`grpc.Dial("some-server", grpc.ShardedClient(4))` (where the "4" might come from
+an application flag). This kind of client-controllable sharding is one place
+where the `SetMaxShards` API can be useful.
+
+### Order-independent accumulators
+
+This can be implemented using `percpu.Sharded`. For instance, a writer would
+call `GetOrCreate` to retrieve a shard-local buffer, acquire a lock, and insert
+the element into the buffer. If the buffer became full, the writer would flush
+it downstream.
+
+A watchdog goroutine could walk the buffers periodically using `Do`, and
+flush partially-full buffers to ensure that elements are flushed fairly
+promptly. If it finds no elements to flush, it could start incrementing a
+counter of "useless" scans, and stop scanning after it reaches a threshold. If a
+writer is enqueuing the first element in a buffer, and it sees the counter over
+the threshold, it could reset the counter, and wake the watchdog.
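+
+A minimal sketch of the writer side (the `flush` function and batch size are
+assumptions, and the watchdog is omitted):
+
+```go
+type shardedBuffer struct {
+  shards percpu.Sharded
+  flush  func([]interface{}) // sends a full batch downstream
+}
+
+// buffer is the per-shard state; retrieved values may be shared between
+// goroutines, so each buffer carries its own lock.
+type buffer struct {
+  mu    sync.Mutex
+  elems []interface{}
+}
+
+const batchSize = 128 // assumed
+
+func (b *shardedBuffer) Add(v interface{}) {
+  buf := b.shards.GetOrCreate(func(percpu.ShardInfo) interface{} {
+    return new(buffer)
+  }).(*buffer)
+
+  buf.mu.Lock()
+  buf.elems = append(buf.elems, v)
+  var full []interface{}
+  if len(buf.elems) >= batchSize {
+    full = buf.elems
+    buf.elems = nil
+  }
+  buf.mu.Unlock()
+
+  // Flush outside the lock so a slow downstream does not block other writers
+  // on the same shard.
+  if full != nil {
+    b.flush(full)
+  }
+}
+```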
+
+## Sketch of Implementation
+
+What follows is a rough sketch of an implementation of `percpu.Sharded`. This is
+to show that this is implementable, and to give some context to the discussion
+of performance below.
+
+First, a sketch of an implementation for `percpu.sharder`, an internal helper
+type for `percpu.Sharded`.
+
+```go
+const (
+  defaultUserDefinedMaxShards = 32
+)
+
+type sharder struct {
+  maxShards int32
+}
+
+func (s *sharder) SetMaxShards(maxShards int) {
+  if maxShards < 1 {
+    panic("maxShards < 1")
+  }
+  atomic.StoreInt32(&s.maxShards, roundDownToPowerOf2(int32(maxShards)))
+}
+
+func (s *sharder) userDefinedMaxShards() int32 {
+  max := atomic.LoadInt32(&s.maxShards)
+  if max == 0 {
+    return defaultUserDefinedMaxShards
+  }
+  return max
+}
+
+func (s *sharder) shardInfo() ShardInfo {
+  shardId := runtime_getShardIndex()
+
+  // If we're in race mode, then all bets are off. Half the time, randomize the
+  // shardId completely, the rest of the time, use shardId 0.
+  //
+  // If we're in a test but not in race mode, then we want an implementation
+  // that keeps cache contention to a minimum so benchmarks work properly, but
+  // we still want to flush out any assumption of a stable mapping to shardId.
+  // So half the time, we double the id. This catches fewer problems than what
+  // we get in race mode, but it should still catch one class of issue (clients
+  // assuming that two sequential calls to Get() will return the same value).
+  if raceEnabled {
+    rnd := runtime_fastRand()
+    if rnd%2 == 0 {
+      shardId = 0
+    } else {
+      shardId += rnd / 2
+    }
+  } else if testing {
+    if runtime_fastRand()%2 == 0 {
+      shardId *= 2
+    }
+  }
+
+  shardId &= runtimeDefinedMaxShards() - 1
+  shardId &= s.userDefinedMaxShards() - 1
+
+  return ShardInfo{ShardIndex: int(shardId)}
+}
+
+func runtimeDefinedMaxShards() int32 {
+  max := runtime_getMaxShards()
+  if (testing || raceEnabled) && max < 4 {
+    max = 4
+  }
+  return max
+}
+
+// Implemented in the runtime, should effectively be
+// roundUpToPowerOf2(min(GOMAXPROCS, NumCPU)).
+// (maybe caching that periodically in the P).
+func runtime_getMaxShards() int32 {
+  return 4
+}
+
+// Implemented in the runtime, should effectively be the result of the getcpu
+// syscall, or similar. The returned index should be densified if possible
+// (i.e. if the binary is locked to cores 2 and 4, it should return 0 and 1
+// respectively, not 2 and 4).
+//
+// Densification can be best-effort, and done with a process-wide mapping table
+// maintained by sysmon periodically.
+//
+// Does not have to be bounded by runtime_getMaxShards(), or indeed by
+// anything.
+func runtime_getShardIndex() int32 {
+  return 0
+}
+
+// Implemented in the runtime. Only technically needs an implementation for
+// raceEnabled and tests. Should be scalable (e.g. using a per-P seed and
+// state).
+func runtime_fastRand() int32 {
+  return 0
+}
+```
+
+Next, a sketch of `percpu.Sharded` itself.
+
+```go
+type Sharded struct {
+  sharder
+
+  lock uintptr
+  data atomic.Value // *shardedData
+  typ  unsafe.Pointer
+}
+
+func (s *Sharded) loadData() *shardedData {
+  // The comma-ok form returns nil (rather than panicking) before the first
+  // Store.
+  d, _ := s.data.Load().(*shardedData)
+  return d
+}
+
+func (s *Sharded) getFastPath(createFn func(ShardInfo) interface{}) (out interface{}) {
+  shardInfo := s.shardInfo()
+
+  curData := s.loadData()
+  if curData == nil || shardInfo.ShardIndex >= len(curData.elems) {
+    if createFn == nil {
+      return nil
+    }
+    return s.getSlowPath(shardInfo, createFn)
+  }
+
+  existing := curData.load(shardInfo.ShardIndex)
+  if existing == nil {
+    if createFn == nil {
+      return nil
+    }
+    return s.getSlowPath(shardInfo, createFn)
+  }
+
+  outp := (*ifaceWords)(unsafe.Pointer(&out))
+  outp.typ = s.typ
+  outp.data = existing
+  return
+}
+
+func (s *Sharded) getSlowPath(shardInfo ShardInfo, createFn func(ShardInfo) interface{}) (out interface{}) {
+  runtime_lock(&s.lock)
+  defer runtime_unlock(&s.lock)
+
+  curData := s.loadData()
+  if curData == nil || shardInfo.ShardIndex >= len(curData.elems) {
+    curData = allocShardedData(curData, shardInfo)
+    s.data.Store(curData)
+  }
+
+  existing := curData.load(shardInfo.ShardIndex)
+  if existing != nil {
+    outp := (*ifaceWords)(unsafe.Pointer(&out))
+    outp.typ = s.typ
+    outp.data = existing
+    return
+  }
+
+  newElem := createFn(shardInfo)
+  if newElem == nil {
+    panic("createFn returned nil value")
+  }
+
+  newElemP := *(*ifaceWords)(unsafe.Pointer(&newElem))
+
+  // If this is the first call to createFn, then stash the type-pointer for
+  // later verification.
+  //
+// Otherwise, verify it's the same as the previous.
+  if s.typ == nil {
+    s.typ = newElemP.typ
+  } else if s.typ != newElemP.typ {
+    panic("percpu: GetOrCreate was called with function that returned inconsistently typed value")
+  }
+
+  // Store back the new value.
+  curData.store(shardInfo.ShardIndex, newElemP.data)
+
+  // Return it.
+  outp := (*ifaceWords)(unsafe.Pointer(&out))
+  outp.typ = s.typ
+  outp.data = newElemP.data
+  return
+}
+
+func (s *Sharded) GetOrCreate(createFn func(ShardInfo) interface{}) interface{} {
+  if createFn == nil {
+    panic("createFn nil")
+  }
+  return s.getFastPath(createFn)
+}
+
+func (s *Sharded) Get() interface{} {
+  return s.getFastPath(nil)
+}
+
+func (s *Sharded) Do(fn func(interface{})) {
+  curData := s.loadData()
+  if curData == nil {
+    return
+  }
+
+  for i := range curData.elems {
+    elem := curData.load(i)
+    if elem == nil {
+      continue
+    }
+
+    var next interface{}
+    nextP := (*ifaceWords)(unsafe.Pointer(&next))
+    nextP.typ = s.typ
+    nextP.data = elem
+
+    fn(next)
+  }
+}
+
+func (s *Sharded) DoLocked(fn func(interface{})) {
+  runtime_lock(&s.lock)
+  defer runtime_unlock(&s.lock)
+  s.Do(fn)
+}
+
+type shardedData struct {
+  elems []unsafe.Pointer
+}
+```
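+
+The sketch above leans on a few helpers that are not spelled out. For
+concreteness, here is one plausible shape for them: the `ifaceWords` layout
+mirrors the unexported helper in `sync/atomic`, and `allocShardedData` is
+assumed to be called with `s.lock` held.
+
+```go
+// ifaceWords mirrors the memory layout of an interface value (analogous to
+// the unexported type used inside sync/atomic).
+type ifaceWords struct {
+  typ  unsafe.Pointer
+  data unsafe.Pointer
+}
+
+func (d *shardedData) load(i int) unsafe.Pointer {
+  return atomic.LoadPointer(&d.elems[i])
+}
+
+func (d *shardedData) store(i int, p unsafe.Pointer) {
+  atomic.StorePointer(&d.elems[i], p)
+}
+
+// allocShardedData returns a shardedData large enough to hold the given shard
+// index, copying over any existing elements. It is only called with the
+// Sharded's lock held, so it never races with store.
+func allocShardedData(old *shardedData, info ShardInfo) *shardedData {
+  n := 1
+  for n <= info.ShardIndex {
+    n *= 2
+  }
+  d := &shardedData{elems: make([]unsafe.Pointer, n)}
+  if old != nil {
+    copy(d.elems, old.elems)
+  }
+  return d
+}
+```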
+
+## Performance
+
+### `percpu.sharder`
+
+As presented, calling `shardInfo` on a `percpu.sharder` makes two calls to the
+runtime, and does a single atomic load.
+
+However, both of the calls to the runtime can tolerate stale values.
+So, an obvious avenue of optimization is to squeeze these two pieces of
+information (effectively "current shard", and "max shards") into a single word,
+and cache it on the `P` when first calling the `shardInfo` API. To accommodate
+changes in the underlying values, a `P` can store a timestamp when it last
+computed these values, and clear the cache when the value is older than `X` and
+the `P` is in the process of switching goroutines.
+
+This means that effectively, `shardInfo` will consist of 2 atomic loads, and a
+little bit of math on the resulting values.
+
+### `percpu.Sharded`
+
+In the get-for-current-shard path, `percpu.Sharded` will call `shardInfo`, and
+then perform 2 atomic loads (to retrieve the list of elements and to retrieve
+the specific element for the current shard, respectively). If either of these
+loads finds no value, it might fall back to a much slower slow path. In the fast path,
+there's no allocation.
+
+In the get-all path, `percpu.Sharded` will perform a single atomic load followed
+by `O(n)` atomic loads, proportional to the number of elements stored in the
+`percpu.Sharded`. It will not allocate.
+
+
+# Discussion
+
+## Garbage collection of stale values in `percpu.Sharded`
+
+With the given API, if `GOMAXPROCS` (or the set of CPUs assigned to the
+program) is temporarily increased and then decreased to its original value, a
+`percpu.Sharded` might have allocated additional elements to satisfy the
+additional CPUs. These additional elements would not be eligible for garbage
+collection, as the `percpu.Sharded` would retain an internal reference.
+
+First, it's worth noting that we cannot unilaterally shrink the number of
+elements stored in a `percpu.Sharded`, because this might affect program
+correctness. For instance, this could result in counters losing values, or in
+breaking the invariants of sharded locks.
+
+The presented solution just sidesteps this problem by defaulting to a fairly low
+value of `MaxShards`. This can be overridden by the user with explicit action
+(though the runtime has the freedom to bound the number more strictly than the
+user's value, e.g. to limit the size of internal data-structures to reasonable
+levels).
+
+One thing to keep in mind is that clients who require garbage collection of stale
+values can build this on top of `percpu.Sharded`. For instance, one could
+imagine a design where clients would maintain a counter recording each use. A
+watchdog goroutine can then scan the elements and if a particular value has not
+been used for some period of time, swap in a `nil` pointer, and then gracefully
+tear down the value (potentially transferring the logical data it encapsulates
+to other elements in the `percpu.Sharded`).
+
+Requiring clients to implement their own GC in this way seems rather inelegant,
+but on the other hand, it's unclear to me how to generically solve this problem
+without knowledge of the client use-case. One could imagine some sort of
+reference-counting design, but again, without knowing the semantics of the
+use-case, it's hard to know if it's safe to clear the reference to the value.
+
+Also, for a performance-oriented type, like `percpu.Sharded`, it seems
+unfortunate to add unnecessary synchronization to the fast path of the type (and
+I don't see how to implement something in this direction without adding
+synchronization).
+
+## Why is `ShardInfo` a struct and not just an int?
+
+This is mostly to retain the ability to extend the API in a compatible manner.
+One concrete avenue is to add additional details to allow clients to optimize
+their code for the NUMA architecture of the machine. For instance, for a sharded
+buffering scheme (i.e. the "Order-independent accumulator" above), it might make
+sense to have multiple levels of buffering in play, with another level at the
+NUMA-node layer.
+
+## Does `ShardInfo.ShardIndex` identify the CPU, or the `P` executing the goroutine?
+
+This is left unspecified, but the name of the package seems to imply the former.
+In practice, I think we want a combination.
+
+That is to say, we would prefer that a program running on a 2-core machine with
+`GOMAXPROCS` set to 100 should use 2 shards, not 100. On the other hand, we
+would also prefer that a program running on a 100-core machine with `GOMAXPROCS`
+set to 2 should also use 2 shards, not 100.
+
+This ideal state should be achievable on systems that provide reasonable APIs to
+retrieve the id of the current CPU.
+
+That said, any implementation effort would likely start with a simple portable
+implementation which uses the id of the local `P`. This will allow us to get a
+sense of the performance of the type, and to serve as a fallback implementation
+for platforms where the necessary APIs are either not available, or require
+privileged execution.
+
+## Is it a good idea for `percpu.Sharded` to behave differently during tests?
+
+This is a good question; I am not certain of the answer here. I am confident
+that during race mode, we should definitely randomize the behaviour of
+`percpu.Sharded` significantly (and the implementation sketch above does that).
+However, for tests, the answer seems less clear to me.
+
+As presented, the implementation sketch above randomizes the shard index by
+flipping randomly between two indices for every CPU. That seems like it will
+catch bugs where the client assumes that sequential calls to `Get`/`GetOrCreate`
+will return the same values. That amount of randomness seems warranted to me,
+though I'd
+understand if folks would prefer to avoid it in favor of keeping non-test code
+and test code behaving identically.
+
+On a more mundane note: I'm not entirely sure whether this is implementable with
+zero cost. One fairly efficient strategy would be an internal package that
+exposes an "IsTesting bool", which is set by the `testing` package and read by
+the `percpu` package. But ideally, this could be optimized away at compile time;
+I don't believe we have any mechanism to do this now.
+
+## Should we expose `ShardInfo.ShardIndex` at all?
+
+I think so. Even if we don't, clients can retrieve an effectively equivalent
+value by just incrementing an atomic integer inside the `createFn` passed to
+`GetOrCreate`. For pre-allocated use-cases (e.g. see the Facebook `SharedMutex`
+linked above), it seems important to let clients index into pre-allocated
+memory.
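+
+For reference, the workaround of deriving an index without `ShardInfo` would
+look roughly like this (the names are illustrative):
+
+```go
+type shardState struct {
+  id int // a dense 0, 1, 2, ... numbering assigned at creation time
+}
+
+var nextShard int64
+
+func getState(s *percpu.Sharded) *shardState {
+  return s.GetOrCreate(func(percpu.ShardInfo) interface{} {
+    // Calls to createFn are serialized, so this assigns each created value a
+    // distinct index even without ShardInfo.ShardIndex.
+    id := atomic.AddInt64(&nextShard, 1) - 1
+    return &shardState{id: int(id)}
+  }).(*shardState)
+}
+```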
+
+## Should we expose both Get and GetOrCreate?
+
+We could define `GetOrCreate` to behave like `Get` if the passed `createFn` is
+nil. This is less API (and might be more efficient, until mid-stack inlining
+works), but seems less semantically clean to me. It seems better to just have
+clients say what they want explicitly.
+
+## Should we expose both Do and DoLocked?
+
+If we had to choose one of those, then I would say we should expose `Do`. This
+is because it is the higher performance, minimal-synchronization version, and
+`DoLocked` can be implemented on top. That said, I do think we should just
+provide both. The implementation is simple, and implementing it on top feels
+odd.
+
+Of the 4 use-cases presented above, 2 would probably use `Do` (counters and
+order-independent accumulators), and 2 would probably use `DoLocked` (read-write
+locks, and RPC clients (for the latter, probably just for implementing
+`Close()`)).
+
+## Naming
+
+I'm not particularly wedded to any of the names in the API sketch above, so I'm
+happy to see it changed to whatever people prefer.
+
+# Backwards compatibility
+
+The API presented above is straightforward to implement without any runtime
+support; in particular, this could be implemented as a thin wrapper around a
+`sync.Once`. This will not effectively reduce contention, but it would still be
+a correct implementation. It's probably a good idea to implement such a shim,
+and put it in the `x/sync` repo, with appropriate build tags and type-aliases to
+allow clients to immediately start using the new type.
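+
+As a rough illustration, such a fallback shim could be as small as the
+following (a sketch only, using a mutex rather than `sync.Once` for brevity;
+every CPU shares the single value, so it is correct but does not reduce
+contention):
+
+```go
+// A contention-unaware fallback: all CPUs share one value.
+type Sharded struct {
+  mu  sync.Mutex
+  val interface{}
+}
+
+type ShardInfo struct{ ShardIndex int }
+
+func (s *Sharded) SetMaxShards(maxShards int) {
+  if maxShards < 1 {
+    panic("maxShards < 1")
+  }
+}
+
+func (s *Sharded) GetOrCreate(createFn func(ShardInfo) interface{}) interface{} {
+  s.mu.Lock()
+  defer s.mu.Unlock()
+  if s.val == nil {
+    s.val = createFn(ShardInfo{ShardIndex: 0})
+  }
+  return s.val
+}
+
+func (s *Sharded) Get() interface{} {
+  s.mu.Lock()
+  defer s.mu.Unlock()
+  return s.val
+}
+
+func (s *Sharded) Do(fn func(interface{})) {
+  if v := s.Get(); v != nil {
+    fn(v)
+  }
+}
+
+func (s *Sharded) DoLocked(fn func(interface{})) {
+  s.Do(fn)
+}
+```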