# Proposal: percpu.Sharded, an API for reducing cache contention
Discussion at https://golang.org/issue/18802
## Abstract
As it stands, Go programs do not have a good way to avoid contention when
combining highly concurrent code with shared memory locations that frequently
require mutation. This proposal describes a new package and type to satisfy this
need.
## Background
There are several scenarios in which a Go program might want to have shared data
that is mutated frequently. We will briefly discuss some of these scenarios,
such that we can evaluate the proposed API against these concrete use-cases.
### Counters
In RPC servers or in scientific computing code, there is often a need for global
counters. For instance, in an RPC server, a global counter might count the
number of requests received by the server, or the number of bytes received by
the server. Go makes it easy to write RPC servers which are inherently
concurrent, often processing each connection and each request on a concurrent
goroutine. This means that in the context of a multicore machine, several
goroutines can be incrementing the same global counter in parallel. Using an API
like `atomic.AddInt64` will ensure that such a counter is lock-free, but
parallel goroutines will be contending for the same cache line, so this counter
will not experience linear scalability as the number of cores is increased.
(Indeed, one might even see scalability decrease as the number of cores
increases, due to the increased core-to-core communication.)
It's probably also worth noting that there are other similar use-cases in this
space (e.g. types that record distributions rather than just sums, max-value
trackers, etc).
### Read-write locks
It is common in programs to have data that is read frequently, but written very
rarely. In these cases, a common synchronization primitive is `sync.RWMutex`,
which offers an `RLock`/`RUnlock` API for readers. When no writers are
interacting with an `RWMutex`, an arbitrary number of goroutines can use the
read-side of the `RWMutex` without blocking.
However, in order to correctly pair calls to `RLock` with calls to `RUnlock`,
`RWMutex` internally maintains a counter, which is incremented during `RLock`
and decremented during `RUnlock`. There is also other shared mutable state that
is atomically updated inside the `RWMutex` during these calls (and during
calls to `Lock` and `Unlock`). For reasons similar to the previous example, if
many goroutines are acquiring and releasing read-locks concurrently and the
program is running on a multicore machine, then it is likely that the
performance of `RWMutex.RLock`/`RWMutex.RUnlock` will not scale linearly with
the number of cores given to the program.
### RPC clients
In programs that make many RPC calls in parallel, there can be
contention on shared mutable state stored inside the RPC or HTTP clients. For
instance, an RPC client might support connecting to a pool of servers, and
implement a configurable load-balancing policy to select a connection to use
for a given RPC; these load balancing policies often need to maintain state for
each connection in the pool of managed connections. For instance, an
implementation of the "Least-Loaded" policy needs to maintain a counter of
active requests per connection, such that a new request can select the
connection with the least number of active requests. In scenarios where a client
is performing a large number of requests in parallel (perhaps enqueueing many
RPCs before finally waiting on them at a later point in the program),
contention on this internal state can start to affect the rate at which the
requests can be dispatched.
### Order-independent accumulators
In data-processing pipelines, code running in a particular stage may want to
'batch' its output, such that it only sends data downstream in N-element
batches, rather than sending single elements through the pipeline at a time. In
the single-goroutine case, where the element type is `byte`, the familiar
type `bufio.Writer` implements this pattern. Indeed, one option for the general
data-processing pipeline is to have a single goroutine
run every stage in the pipeline from end-to-end, and then instantiate a small
number of parallel pipeline instances. This strategy effectively handles
pipelines composed solely of stages dominated by CPU-time. However, if a
pipeline contains any IO (e.g. initially reading the input from a distributed
file system, making RPCs, or writing the result back to a distributed file
system), then this setup will not be efficient, as a single stall in IO will
take out a significant chunk of your throughput.
To mitigate this problem, IO-bound stages need to run many goroutines. Indeed,
a clever framework (like Apache Beam) can detect these sorts of situations
dynamically, by measuring the rate of stage input compared to the rate of stage
output; they can even reactively increase or decrease the "concurrency level" of
a stage in response to these measurements. In Beam's case, it might do this by
dynamically changing the number of threads-per-binary, or number of
workers-per-stage.
When stages have varying concurrency levels, but are connected to each other in
a pipeline structure, it is important to place a concurrency-safe abstraction
between the stages to buffer elements waiting to be processed by the next stage.
Ideally, this structure would minimize the contention experienced by the caller.
## The Proposed API
To solve these problems, we propose an API with a single new type
`percpu.Sharded`. Here is an outline of the proposed API.
```go
// Package percpu provides low-level utilities for avoiding contention on
// multicore machines.
package percpu
// A Sharded is a container of homogeneously typed values.
//
// On a best effort basis, the runtime will strongly associate a given value
// with a CPU core. That is to say, retrieving a value twice on the same CPU
// core will return the same value with high probability. Note that the runtime
// cannot guarantee this fact, and clients must assume that retrieved values
// can be shared between concurrently executing goroutines.
//
// Once a value is placed in a Sharded, the Sharded will retain a reference to
// this value permanently. Clients can control the maximum number of distinct
// values created using the SetMaxShards API.
//
// A Sharded must not be copied after first use.
//
// All methods are safe to call from multiple goroutines.
type Sharded struct {
// contains unexported fields
}
// SetMaxShards sets a limit on the maximum number of elements stored in the
// Sharded.
//
// It does not apply retroactively; any elements already created will remain
// inside the Sharded.
//
// If maxShards is less than 1, Sharded will panic.
func (s *Sharded) SetMaxShards(maxShards int)
// GetOrCreate retrieves a value roughly associated with the current CPU. If
// there is no such value, then createFn is called to create a value, which is
// stored in the Sharded and returned.
//
// All calls to createFn are serialized; this means that one must complete
// before the next one is started.
//
// createFn should not return nil, or Sharded will panic.
//
// If createFn is called with a ShardInfo.ShardIndex equal to X, no future call
// to GetOrCreate will call createFn again with a ShardInfo.ShardIndex equal to
// X.
func (s *Sharded) GetOrCreate(createFn func(ShardInfo) interface{}) interface{}
// Get retrieves any preexisting value associated with the current CPU. If
// there is no such value, nil is returned.
func (s *Sharded) Get() interface{}
// Do iterates over a snapshot of all elements stored in the Sharded, and calls
// fn once for each element.
//
// If more elements are created during the iteration itself, they may be
// visible to the iteration, but this is not guaranteed. For stronger
// guarantees, see DoLocked.
func (s *Sharded) Do(fn func(interface{}))
// DoLocked iterates over all the elements stored in the Sharded, and calls fn
// once for each element.
//
// DoLocked will observe a consistent snapshot of the elements in the Sharded;
// any previous creations will complete before the iteration begins, and all
// subsequent creations will wait until the iteration ends.
func (s *Sharded) DoLocked(fn func(interface{}))
// ShardInfo contains information about a CPU core.
type ShardInfo struct {
// ShardIndex is strictly less than the value passed to any prior call to SetMaxShards.
ShardIndex int
}
```
## Evaluating the use-cases
Here, we evaluate the proposed API in light of the use-cases described above.
### Counters
A counter API can be fairly easily built on top of `percpu.Sharded`.
Specifically, it would offer two methods: `IncrementBy(int64)` and `Sum() int64`.
The former would only allow positive increments (if required, clients can build
negative increments by composing two counters, one for additions and one for
subtractions).
The implementation of `IncrementBy` would call `GetOrCreate`, passing a
function that returned an `*int64`. To avoid false sharing between cache lines,
it would probably return it as an interior pointer into a struct with
appropriate padding. Once the pointer is retrieved from `GetOrCreate`, the
function would then use `atomic.AddInt64` on that pointer with the value passed
to `IncrementBy`.
The implementation of `Sum` would call `Do` to retrieve a snapshot of all
previously created values, then sum up their values using `atomic.LoadInt64`.
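As a concrete illustration, here is a minimal sketch of such a counter,
assuming the proposed `percpu` package and `sync/atomic`; the `Counter` and
`paddedInt64` names and the 64-byte padding are illustrative assumptions, not
part of the proposal.

```go
// A sketch of a contention-avoiding counter built on percpu.Sharded.
type Counter struct {
	shards percpu.Sharded
}

// paddedInt64 pads each shard's value out to a (presumed) 64-byte cache line
// to avoid false sharing between shards.
type paddedInt64 struct {
	v int64
	_ [56]byte
}

// IncrementBy adds n (which should be non-negative) to the counter.
func (c *Counter) IncrementBy(n int64) {
	p := c.shards.GetOrCreate(func(percpu.ShardInfo) interface{} {
		return new(paddedInt64)
	}).(*paddedInt64)
	atomic.AddInt64(&p.v, n)
}

// Sum returns the (possibly slightly stale) total across all shards.
func (c *Counter) Sum() int64 {
	var sum int64
	c.shards.Do(func(v interface{}) {
		sum += atomic.LoadInt64(&v.(*paddedInt64).v)
	})
	return sum
}
```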
If the application is managing many long-lived counters, then one possible
optimization would be to implement the `Counter` type in terms of a
`counterBatch` (which logically encapsulates `N` independent counters). This can
drastically limit the padding required to avoid false sharing between cache lines.
### Read-write locks
It is a little tricky to implement a drop-in replacement for `sync.RWMutex` on
top of `percpu.Sharded`. Naively, one could imagine a sharded lock composed of
many internal `sync.RWMutex` instances. Calling `RLock()` on the aggregate lock
would grab a single `sync.RWMutex` instance using `GetOrCreate` and then call
`RLock()` on that instance. Unfortunately, because there is no state passed
between `RLock()` and `RUnlock()` (something we should probably consider fixing
for Go 2), we cannot implement `RUnlock()` efficiently: the calling goroutine
may have migrated to a different shard in the meantime, so we have lost track
of which `sync.RWMutex` instance the original `RLock()` used.
That said, since such a sharded lock would be considerably more memory-hungry
than a normal `sync.RWMutex`, callers should only replace truly contended
mutexes with a sharded lock, so requiring them to make minor API changes should
not be too onerous (particularly for mutexes, which should always be private
implementation details, and therefore not cross API boundaries). In particular,
one could have `RLock()` on the sharded lock return a `RLockHandle` type, which
has a `RUnlock()` method. This `RLockHandle` could keep an internal pointer to
the `sync.RWMutex` that was initially chosen, and it can then `RUnlock()` that
specific instance.
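Here is a minimal sketch of that read-side API, assuming the proposed `percpu`
package and the standard `sync` package; the `ShardedRWMutex` and `RLockHandle`
names are hypothetical, and the write-lock path (which must also handle shards
created after a writer arrives) is omitted.

```go
// A sketch of the read side of a sharded read-write lock.
type ShardedRWMutex struct {
	shards percpu.Sharded
}

// RLockHandle remembers which underlying sync.RWMutex was read-locked, so
// that RUnlock releases the same instance even if the goroutine has since
// moved to a different CPU.
type RLockHandle struct {
	mu *sync.RWMutex
}

func (s *ShardedRWMutex) RLock() RLockHandle {
	mu := s.shards.GetOrCreate(func(percpu.ShardInfo) interface{} {
		return new(sync.RWMutex)
	}).(*sync.RWMutex)
	mu.RLock()
	return RLockHandle{mu: mu}
}

func (h RLockHandle) RUnlock() {
	h.mu.RUnlock()
}
```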
It's worth noting that it's also possible to drastically change the standard
library's `sync.RWMutex` implementation itself to be scalable by default using
`percpu.Sharded`; this is why the implementation sketch below is careful not to
use the `sync` package, so as to avoid circular dependencies. See Facebook's
[SharedMutex](https://github.com/facebook/folly/blob/a440441d2c6ba08b91ce3a320a61cf0f120fe7f3/folly/SharedMutex.h#L148)
class to get a sense of how this could be done. However, that requires
significant research and deserves a proposal of its own.
### RPC clients
It's straightforward to use `percpu.Sharded` to implement a sharded RPC client.
This is a case where it's likely that the default implementation will continue to
be unsharded, and a program will need to explicitly say something like
`grpc.Dial("some-server", grpc.ShardedClient(4))` (where the "4" might come from
an application flag). This kind of client-controllable sharding is one place
where the `SetMaxShards` API can be useful.
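A sketch of how such an option could be plumbed through follows; all names here
(`shardedClient`, `rpcClient`, and so on) are hypothetical stand-ins rather
than an existing RPC library API.

```go
// rpcClient stands in for a real per-connection client object.
type rpcClient struct{}

func (c *rpcClient) call(method string, req, resp interface{}) error {
	return nil // placeholder
}

// shardedClient keeps one underlying client per shard.
type shardedClient struct {
	shards percpu.Sharded
	dial   func() *rpcClient // creates one underlying client
}

// newShardedClient bounds the number of underlying clients with SetMaxShards;
// maxShards would come from an option like the hypothetical
// grpc.ShardedClient(4) mentioned above.
func newShardedClient(maxShards int, dial func() *rpcClient) *shardedClient {
	c := &shardedClient{dial: dial}
	c.shards.SetMaxShards(maxShards)
	return c
}

func (c *shardedClient) Call(method string, req, resp interface{}) error {
	// Requests issued on the same CPU tend to reuse the same underlying
	// client, so its internal load-balancing state is rarely contended.
	sub := c.shards.GetOrCreate(func(percpu.ShardInfo) interface{} {
		return c.dial()
	}).(*rpcClient)
	return sub.call(method, req, resp)
}
```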
### Order-independent accumulators
This can be implemented using `percpu.Sharded`. For instance, a writer would
call `GetOrCreate` to retrieve a shard-local buffer, acquire a lock on that
buffer, and insert the element into it. If the buffer becomes full, the writer
would flush it downstream.
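A minimal sketch of that write path follows; the names, the batch size, and the
downstream channel are illustrative assumptions.

```go
// A sketch of a sharded accumulator.
type Accumulator struct {
	shards percpu.Sharded
	out    chan<- []interface{} // downstream stage
}

type buffer struct {
	mu    sync.Mutex
	elems []interface{}
}

const batchSize = 128 // illustrative

// Add appends elem to a shard-local buffer, flushing downstream once the
// buffer reaches batchSize elements.
func (a *Accumulator) Add(elem interface{}) {
	b := a.shards.GetOrCreate(func(percpu.ShardInfo) interface{} {
		return &buffer{elems: make([]interface{}, 0, batchSize)}
	}).(*buffer)

	b.mu.Lock()
	b.elems = append(b.elems, elem)
	var flush []interface{}
	if len(b.elems) >= batchSize {
		flush = b.elems
		b.elems = make([]interface{}, 0, batchSize)
	}
	b.mu.Unlock()

	if flush != nil {
		a.out <- flush // send downstream outside the buffer's lock
	}
}
```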
A watchdog goroutine could walk the buffers periodically using `Do`, and
flush partially-full buffers to ensure that elements are flushed fairly
promptly. If it finds no elements to flush, it could start incrementing a
counter of "useless" scans, and stop scanning after it reaches a threshold. If a
writer is enqueuing the first element in a buffer, and it sees the counter over
the threshold, it could reset the counter, and wake the watchdog.
## Sketch of Implementation
What follows is a rough sketch of an implementation of `percpu.Sharded`. It is
intended to show that the API is implementable, and to give some context to the discussion
of performance below.
First, a sketch of an implementation for `percpu.sharder`, an internal helper
type for `percpu.Sharded`.
```go
const (
defaultUserDefinedMaxShards = 32
)
type sharder struct {
maxShards int32
}
func (s *sharder) SetMaxShards(maxShards int) {
if maxShards < 1 {
panic("maxShards < 1")
}
atomic.StoreInt32(&s.maxShards, roundDownToPowerOf2(int32(maxShards)))
}
func (s *sharder) userDefinedMaxShards() int32 {
	v := atomic.LoadInt32(&s.maxShards)
	if v == 0 {
		return defaultUserDefinedMaxShards
	}
	return v
}
func (s *sharder) shardInfo() ShardInfo {
shardId := runtime_getShardIndex()
// If we're in race mode, then all bets are off. Half the time, randomize the
// shardId completely, the rest of the time, use shardId 0.
//
// If we're in a test but not in race mode, then we want an implementation
// that keeps cache contention to a minimum so benchmarks work properly, but
// we still want to flush out any assumption of a stable mapping to shardId.
// So half the time, we double the id. This catches fewer problems than what
// we get in race mode, but it should still catch one class of issue (clients
// assuming that two sequential calls to Get() will return the same value).
if raceEnabled {
rnd := runtime_fastRand()
if rnd%2 == 0 {
shardId = 0
} else {
shardId += rnd / 2
}
} else if testing {
if runtime_fastRand()%2 == 0 {
shardId *= 2
}
}
	shardId &= runtimeDefinedMaxShards() - 1
	shardId &= s.userDefinedMaxShards() - 1
	return ShardInfo{ShardIndex: int(shardId)}
}
func runtimeDefinedMaxShards() int32 {
max := runtime_getMaxShards()
if (testing || raceEnabled) && max < 4 {
max = 4
}
return max
}
// Implemented in the runtime, should effectively be
// roundUpToPowerOf2(min(GOMAXPROCS, NumCPU)).
// (maybe caching that periodically in the P).
func runtime_getMaxShards() int32 {
return 4
}
// Implemented in the runtime, should effectively be the result of the getcpu
// syscall, or similar. The returned index should be densified if possible
// (i.e. if the binary is locked to cores 2 and 4, it should return 0 and 1
// respectively, not 2 and 4).
//
// Densification can be best-effort, and done with a process-wide mapping table
// maintained by sysmon periodically.
//
// Does not have to be bounded by runtime_getMaxShards(), or indeed by
// anything.
func runtime_getShardIndex() int32 {
return 0
}
// Implemented in the runtime. Only technically needs an implementation for
// raceEnabled and tests. Should be scalable (e.g. using a per-P seed and
// state).
func runtime_fastRand() int32 {
return 0
}
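
// roundDownToPowerOf2 is the helper used by SetMaxShards above, shown here
// only to keep the sketch self-contained. Keeping both shard limits at powers
// of two is what makes the mask-with-(n-1) trick in shardInfo valid.
func roundDownToPowerOf2(v int32) int32 {
	// Repeatedly clear the lowest set bit until only the highest remains.
	for v&(v-1) != 0 {
		v &= v - 1
	}
	return v
}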
```
Next, a sketch of `percpu.Sharded` itself.
```go
type Sharded struct {
sharder
lock uintptr
data atomic.Value // *shardedData
typ unsafe.Pointer
}
func (s *Sharded) loadData() *shardedData {
	d, _ := s.data.Load().(*shardedData)
	return d
}
func (s *Sharded) getFastPath(createFn func(ShardInfo) interface{}) (out interface{}) {
shardInfo := s.shardInfo()
curData := s.loadData()
if curData == nil || shardInfo.ShardIndex >= len(curData.elems) {
if createFn == nil {
return nil
}
return s.getSlowPath(shardInfo, createFn)
}
existing := curData.load(shardInfo.ShardIndex)
if existing == nil {
if createFn == nil {
return nil
}
return s.getSlowPath(shardInfo, createFn)
}
outp := (*ifaceWords)(unsafe.Pointer(&out))
outp.typ = s.typ
outp.data = existing
return
}
func (s *Sharded) getSlowPath(shardInfo ShardInfo, createFn func(ShardInfo) interface{}) (out interface{}) {
runtime_lock(&s.lock)
defer runtime_unlock(&s.lock)
curData := s.loadData()
if curData == nil || shardInfo.ShardIndex >= len(curData.elems) {
curData = allocShardedData(curData, shardInfo)
s.data.Store(curData)
}
existing := curData.load(shardInfo.ShardIndex)
if existing != nil {
outp := (*ifaceWords)(unsafe.Pointer(&out))
outp.typ = s.typ
outp.data = existing
return
}
newElem := createFn(shardInfo)
if newElem == nil {
panic("createFn returned nil value")
}
newElemP := *(*ifaceWords)(unsafe.Pointer(&newElem))
// If this is the first call to createFn, then stash the type-pointer for
// later verification.
//
// Otherwise, verify it's the same as the previous one.
if s.typ == nil {
s.typ = newElemP.typ
} else if s.typ != newElemP.typ {
panic("percpu: GetOrCreate was called with function that returned inconsistently typed value")
}
	// Store back the new value.
	curData.store(shardInfo.ShardIndex, newElemP.data)
	// Return it.
	outp := (*ifaceWords)(unsafe.Pointer(&out))
	outp.typ = s.typ
	outp.data = newElemP.data
	return
}
func (s *Sharded) GetOrCreate(createFn func(ShardInfo) interface{}) interface{} {
if createFn == nil {
panic("createFn nil")
}
return s.getFastPath(createFn)
}
func (s *Sharded) Get() interface{} {
return s.getFastPath(nil)
}
func (s *Sharded) Do(fn func(interface{})) {
curData := s.loadData()
if curData == nil {
		return
}
for i := range curData.elems {
elem := curData.load(i)
if elem == nil {
continue
}
var next interface{}
nextP := (*ifaceWords)(unsafe.Pointer(&next))
nextP.typ = s.typ
		nextP.data = elem
fn(next)
}
}
func (s *Sharded) DoLocked(fn func(interface{})) {
runtime_lock(&s.lock)
defer runtime_unlock(&s.lock)
s.Do(fn)
}
type shardedData struct {
elems []unsafe.Pointer
}
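
// The remaining helpers used above are sketched here only for completeness.

// ifaceWords mirrors the layout of an empty interface value, in the same way
// the sync/atomic implementation does.
type ifaceWords struct {
	typ  unsafe.Pointer
	data unsafe.Pointer
}

func (d *shardedData) load(i int) unsafe.Pointer {
	return atomic.LoadPointer(&d.elems[i])
}

func (d *shardedData) store(i int, p unsafe.Pointer) {
	atomic.StorePointer(&d.elems[i], p)
}

// allocShardedData returns a new element slice large enough to hold
// shardInfo.ShardIndex, copying over any existing elements. The sizing policy
// is elided; a real implementation might grow to the runtime's current
// maximum shard count instead.
func allocShardedData(prev *shardedData, shardInfo ShardInfo) *shardedData {
	d := &shardedData{elems: make([]unsafe.Pointer, shardInfo.ShardIndex+1)}
	if prev != nil {
		copy(d.elems, prev.elems)
	}
	return d
}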
```
## Performance
### `percpu.sharder`
As presented, calling `shardInfo` on a `percpu.sharder` makes two calls to the
runtime, and does a single atomic load.
However, both of the calls to the runtime could be satisfied with slightly stale values.
So, an obvious avenue of optimization is to squeeze these two pieces of
information (effectively "current shard", and "max shards") into a single word,
and cache it on the `P` when first calling the `shardInfo` API. To accommodate
changes in the underlying values, a `P` can store a timestamp when it last
computed these values, and clear the cache when the value is older than `X` and
the `P` is in the process of switching goroutines.
This means that effectively, `shardInfo` will consist of 2 atomic loads, and a
little bit of math on the resulting values.
### `percpu.Sharded`
In the get-for-current-shard path, `percpu.Sharded` will call `shardInfo`, and
then perform 2 atomic loads (to retrieve the list of elements and to retrieve
the specific element for the current shard, respectively). If either of these
loads finds no value, it falls back to a much slower slow path. In the fast path,
there's no allocation.
In the get-all path, `percpu.Sharded` will perform a single atomic load,
followed by `O(n)` atomic loads proportional to the number of elements stored
in the `percpu.Sharded`. It will not allocate.
# Discussion
## Garbage collection of stale values in `percpu.Sharded`
With the given API, if `GOMAXPROCS` (or the set of CPUs assigned to the
program) is temporarily increased and then decreased to its original value, a
`percpu.Sharded` might have allocated additional elements to satisfy the
additional CPUs. These additional elements would not be eligible for garbage
collection, as the `percpu.Sharded` would retain an internal reference.
First, it's worth noting that we cannot unilaterally shrink the number of
elements stored in a `percpu.Sharded`, because this might affect program
correctness. For instance, this could result in counters losing values, or in
breaking the invariants of sharded locks.
The presented solution just sidesteps this problem by defaulting to a fairly low
value of `MaxShards`. This can be overridden by the user with explicit action
(though the runtime has the freedom to bound the number more strictly than the
user's value, e.g. to limit the size of internal data-structures to reasonable
levels).
One thing to keep in mind: clients who require garbage collection of stale
values can build this on top of `percpu.Sharded`. For instance, one could
imagine a design where clients would maintain a counter recording each use. A
watchdog goroutine can then scan the elements and if a particular value has not
been used for some period of time, swap in a `nil` pointer, and then gracefully
tear down the value (potentially transferring the logical data it encapsulates
to other elements in the `percpu.Sharded`).
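A very rough sketch of that pattern follows; all names here are illustrative,
and the coordination needed to tear a value down gracefully while other
goroutines may still be using it is elided.

```go
// expensiveThing is a placeholder for whatever per-shard resource the client
// wants reclaimed when idle.
type expensiveThing struct{}

// slot is what the client stores in the Sharded. The slot itself is retained
// by the Sharded forever, but the payload becomes garbage-collectable once
// cleared.
type slot struct {
	lastUsed int64 // unix nanoseconds of the most recent use, updated atomically
	mu       sync.Mutex
	payload  *expensiveThing // nil once reclaimed; recreated lazily on next use
}

// watchdog periodically scans all slots and tears down payloads that have
// been idle for longer than maxIdle.
func watchdog(s *percpu.Sharded, maxIdle time.Duration) {
	for range time.Tick(maxIdle) {
		now := time.Now().UnixNano()
		s.Do(func(v interface{}) {
			sl := v.(*slot)
			if now-atomic.LoadInt64(&sl.lastUsed) < int64(maxIdle) {
				return // recently used; keep it
			}
			sl.mu.Lock()
			sl.payload = nil // graceful teardown/handoff elided
			sl.mu.Unlock()
		})
	}
}
```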
Requiring clients to implement their own GC in this way seems somewhat
inelegant, but on the other hand, it's unclear to me how to solve this problem
generically without knowledge of the client use-case. One could imagine some
sort of reference-counting design, but again, without knowing the semantics of
the use-case, it's hard to know whether it is safe to clear the reference to
the value.
Also, for a performance-oriented type like `percpu.Sharded`, it seems
unfortunate to add unnecessary synchronization to the fast path of the type (and
I don't see how to implement something in this direction without adding
synchronization).
## Why is `ShardInfo` a struct and not just an int?
This is mostly to retain the ability to extend the API in a compatible manner.
One concrete avenue is to add additional details to allow clients to optimize
their code for the NUMA architecture of the machine. For instance, for a sharded
buffering scheme (i.e. the "Order-independent accumulator" above), it might make
sense to have multiple levels of buffering in play, with another level at the
NUMA-node layer.
## Is `ShardInfo.ShardIndex` returning an id for the CPU, or the `P` executing the goroutine?
This is left unspecified, but the name of the package seems to imply the former.
In practice, I think we want a combination.
That is to say, we would prefer that a program running on a 2-core machine with
`GOMAXPROCS` set to 100 should use 2 shards, not 100. On the other hand, we
would also prefer that a program running on a 100-core machine with `GOMAXPROCS`
set to 2 should also use 2 shards, not 100.
This ideal state should be achievable on systems that provide reasonable APIs to
retrieve the id of the current CPU.
That said, any implementation effort would likely start with a simple portable
implementation which uses the id of the local `P`. This will allow us to get a
sense of the performance of the type, and to serve as a fallback implementation
for platforms where the necessary APIs are either not available, or require
privileged execution.
## Is it a good idea for `percpu.Sharded` to behave differently during tests?
This is a good question; I am not certain of the answer here. I am confident
that during race mode, we should definitely randomize the behaviour of
`percpu.Sharded` significantly (and the implementation sketch above does that).
However, for tests, the answer seems less clear to me.
As presented, the implementation sketch above randomizes the shard index by
flipping randomly between two values for each CPU. That seems like it will catch bugs
where the client assumes that sequential calls to `Get`/`GetOrCreate` will return
the same values. That amount of randomness seems warranted to me, though I'd
understand if folks would prefer to avoid it in favor of keeping non-test code
and test code behaving identically.
On a more mundane note: I'm not entirely sure if this is implementable with
zero cost. One fairly efficient strategy would be an internal package that
exposes an "IsTesting bool", which is set by the `testing` package and read by
the `percpu` package. But ideally, this could be optimized away at compile time;
I don't believe we have any mechanism to do this now.
## Should we expose `ShardInfo.ShardIndex` at all?
I think so. Even if we don't, clients can retrieve an effectively equivalent
value by just incrementing an atomic integer inside the `createFn` passed to
`GetOrCreate`. For pre-allocated use-cases (e.g. see the Facebook `SharedMutex`
linked above), it seems important to let clients index into pre-allocated
memory.
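For illustration, here is a sketch of that preallocated pattern; the names,
shard count, and padding are illustrative assumptions.

```go
// A sketch of indexing into preallocated, cache-line-padded memory using
// ShardInfo.ShardIndex.
const maxShards = 32

type paddedSlot struct {
	n int64
	_ [56]byte // assume 64-byte cache lines
}

type preallocated struct {
	shards percpu.Sharded
	slots  [maxShards]paddedSlot // one contiguous allocation, no per-shard boxing
}

func newPreallocated() *preallocated {
	p := &preallocated{}
	// ShardIndex is guaranteed to stay below this limit, so it can be used
	// directly as an index into slots.
	p.shards.SetMaxShards(maxShards)
	return p
}

func (p *preallocated) slot() *paddedSlot {
	return p.shards.GetOrCreate(func(info percpu.ShardInfo) interface{} {
		return &p.slots[info.ShardIndex]
	}).(*paddedSlot)
}
```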
## Should we expose both Get and GetOrCreate?
We could define `GetOrCreate` to behave like `Get` if the passed `createFn` is
nil. This is less API (and might be more efficient, until mid-stack inlining
works), but seems less semantically clean to me. It seems better to just have
clients say what they want explicitly.
## Should we expose both Do and DoLocked?
If we had to choose one of those, then I would say we should expose `Do`. This
is because it is the higher performance, minimal-synchronization version, and
`DoLocked` can be implemented on top. That said, I do think we should just
provide both. The implementation is simple, and implementing it on top feels
odd.
Of the 4 use-cases presented above, 2 would probably use `Do` (counters and
order-independent accumulators), and 2 would probably use `DoLocked` (read-write
locks, and RPC clients (for the latter, probably just for implementing
`Close()`)).
## Naming
I'm not particularly wedded to any of the names in the API sketch above, so I'm
happy to see it changed to whatever people prefer.
# Backwards compatibility
The API presented above is straightforward to implement without any runtime
support; in particular, this could be implemented as a thin wrapper around a
`sync.Once`. This will not effectively reduce contention, but it would still be
a correct implementation. It's probably a good idea to implement such a shim,
and put it in the `x/sync` repo, with appropriate build tags and type-aliases to
allow clients to immediately start using the new type.
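For illustration, here is a minimal single-shard shim with the same surface; it
is a sketch that uses a `sync.Mutex` (rather than a bare `sync.Once`) so that
`DoLocked` can also serialize against creations. It is correct but provides no
contention reduction.

```go
// A single-shard fallback implementation of the Sharded API.
type Sharded struct {
	mu  sync.Mutex
	val interface{}
}

func (s *Sharded) SetMaxShards(maxShards int) {
	if maxShards < 1 {
		panic("maxShards < 1")
	}
	// Only one shard is ever used, so there is nothing else to do.
}

func (s *Sharded) GetOrCreate(createFn func(ShardInfo) interface{}) interface{} {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.val == nil {
		v := createFn(ShardInfo{ShardIndex: 0})
		if v == nil {
			panic("createFn returned nil value")
		}
		s.val = v
	}
	return s.val
}

func (s *Sharded) Get() interface{} {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.val
}

func (s *Sharded) Do(fn func(interface{})) {
	if v := s.Get(); v != nil {
		fn(v)
	}
}

func (s *Sharded) DoLocked(fn func(interface{})) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.val != nil {
		fn(s.val)
	}
}
```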