design: add unbounded queue design doc.
Change-Id: I96a0b6c6b818fe0330cb211345c9bef5983bf3b7
Reviewed-on: https://go-review.googlesource.com/138576
Reviewed-by: Ian Lance Taylor <iant@golang.org>
diff --git a/design/27935-unbounded-queue-package.md b/design/27935-unbounded-queue-package.md
new file mode 100644
index 0000000..7e50cda
--- /dev/null
+++ b/design/27935-unbounded-queue-package.md
@@ -0,0 +1,994 @@
+# Proposal: Built in support for high performance unbounded queue
+
+Author: Christian Petrin.
+
+Last updated: September 29, 2018
+
+Discussion at https://golang.org/issue/NNNNN.
+
+## Abstract
+I propose to add a new package, "container/queue", to the standard library to
+support an in-memory, unbounded, general purpose queue implementation.
+
+[Queues](https://en.wikipedia.org/wiki/Queue_(abstract_data_type)) in computer
+science is a very old, well established and well known concept, yet Go doesn't
+provide a specialized, safe to use, performant and issue free unbounded queue
+implementation.
+
+Buffered channels provide an excellent option to be used as a queue, but
+buffered channels are bounded and so doesn't scale to support very large data
+sets. The same applies for the standard [ring
+package](https://github.com/golang/go/tree/master/src/container/ring).
+
+The standard [list
+package](https://github.com/golang/go/tree/master/src/container/list) can be
+used as the underlying data structure for building unbounded queues, but the
+performance yielded by this linked list based implementation is not optimal.
+
+Implementing a queue using dynamic slices as suggested
+[here](https://stackoverflow.com/a/26863706) is a feasible approach, but the
+performance yielded by this implementation can be abysmal in some high load
+scenarios. The memory footprint can also easily skyrocket in some scenarios.
+
+
+## Background
+Queues that grows dynamically has many uses. As an example, I'm working on a
+logging system called [CloudLogger](https://github.com/cloud-logger/docs) that
+sends all logged data to external logging management systems, such as
+[Stackdriver](https://cloud.google.com/stackdriver/) and
+[Cloudwatch](https://aws.amazon.com/cloudwatch/). External logging systems
+typically [rate limit](https://en.wikipedia.org/wiki/Rate_limiting) how much
+data their service will accept for a given account and time frame. So in a
+scenario where the hosting application is logging more data than the logging
+management system will accept at a given moment, CloudLogger has to queue the
+extra logs and send them to the logging management system at a pace the system
+will accept. As there's no telling how much data will have to be queued as it
+depends on the current traffic, an unbounded, dynamically growing queue is the
+ideal data structure to be used. Buffered channels in this scenario is not ideal
+as they have a limit on how much data they will accept, and once that limit has
+been reached, the producers (routines adding to the channel) start to block,
+making the adding to the channel operation an "eventual" synchronous process. A
+fully asynchronous operation in this scenario is highly desirable as logging
+data should not slow down significantly the hosting application.
+
+Above problem is a problem that, potentially, every system that calls another
+system faces. And in the cloud and microservices era, this is an extremely
+common scenario.
+
+Due to the lack of support for built in unbounded queues in Go, Go engineers are
+left to either:
+1) Research and use external packages, or
+2) Build their own queue implementation
+
+Both approaches are riddled with pitfalls.
+
+Using external packages, specially in enterprise level software, requires a lot
+of care as using external, potentially untested and hard to understand code can
+have unwanted consequences. This problem is made much worse by the fact that,
+currently, there's no well established and disseminated open source Go queue
+implementation according to [this stackoverflow
+discussion](https://stackoverflow.com/questions/2818852/is-there-a-queue-implementation),
+[this github search for Go
+queues](https://github.com/search?l=Go&q=go+queue&type=Repositories) and
+[Awesome Go](https://awesome-go.com/).
+
+Building a queue, on the other hand, might sound like a compelling argument, but
+building efficient, high performant, bug free unbounded queue is a hard job that
+requires a pretty solid computer science foundation as well a good deal of time
+to research different design approaches, test different implementations, make
+sure the code is bug and memory leak free, etc.
+
+In the end what Go engineers have been doing up to this point is building their
+own queues, which are for the most part inefficient and can have disastrous, yet
+hidden performance and memory issues. As examples of poorly designed and/or
+implemented queues, the approaches suggested
+[here](https://stackoverflow.com/a/26863706) and
+[here](https://stackoverflow.com/a/11757161) (among many others), requires
+linear copy of the internal slice for resizing purposes. Some implementations
+also has memory issues such as an ever expanding internal slice and memory
+leaks.
+
+
+## Proposal
+I propose to add a new package, "container/queue", to the standard library to
+support in-memory unbounded queues. The [proposed queue
+implementation](https://github.com/christianrpetrin/queue-tests/blob/master/queueimpl7/queueimpl7.go)
+offers [excellent performance and very low memory
+consumption](https://github.com/christianrpetrin/queue-tests/blob/master/bench_queue.md)
+when comparing it to three promising open source implementations
+([gammazero](https://github.com/gammazero/deque),
+[phf](https://github.com/phf/go-queue) and
+[juju](https://github.com/juju/utils/tree/master/deque)); to use Go channels as
+queue; the standard list package as a queue as well as six other experimental
+queue implementations.
+
+The [proposed queue
+implementation](https://github.com/christianrpetrin/queue-tests/blob/master/queueimpl7/queueimpl7.go)
+offers the most balanced approach to performance given different loads, being
+significantly faster and still uses less memory than every other queue
+implementation in the
+[tests](https://github.com/christianrpetrin/queue-tests/blob/master/benchmark_test.go).
+
+The closest data structure Go has to offer for building dynamically growing
+queues for large data sets is the [standard list
+package](https://github.com/golang/go/tree/master/src/container/list). When
+comparing the proposed solution to [using the list package as an unbounded
+queue](https://github.com/christianrpetrin/queue-tests/blob/master/benchmark_test.go)
+(refer to "BenchmarkList"), the proposed solution is consistently faster than
+using the list package as a queue as well as displaying a much lower memory
+footprint.
+
+
+### Reasoning
+There's [two well accepted
+approaches](https://en.wikipedia.org/wiki/Queue_(abstract_data_type)#Queue_implementation)
+to implementing queues when in comes to the queue underlying data structure:
+
+1) Using linked list
+2) Using array
+
+Linked list as the underlying data structure for an unbounded queue has the
+advantage of scaling efficiently when the underlying data structure needs to
+grow to accomodate more values. This is due to the fact that the existing
+elements doesn't need to be repositioned or copied around when the queue needs
+to grow.
+
+However, there's a few concerns with this approach:
+1) The use of prev/next pointers for each value requires a good deal of extra
+ memory
+2) Due to the fact that each "node" in the linked list can be allocated far away
+ from the previous one, navigating through the list can be slow due to its bad
+ [memory
+ locality](https://www.cs.cornell.edu/courses/cs3110/2012sp/lectures/lec25-locality/lec25.html)
+ properties
+3) Adding new values always require new memory allocations and pointers being
+ set, hindering performance
+
+On the other hand, using a slice as the underlying data structure for unbounded
+queues has the advantage of very good [memory
+locality](https://www.cs.cornell.edu/courses/cs3110/2012sp/lectures/lec25-locality/lec25.html)
+(values are stored next to each other with predictive indices) making retrieval
+of values faster when comparing to linked lists. Also an "alloc more than needed
+right now" approach can easily be implemented with slices.
+
+Also when the slice needs to expand, extra large new slices can be allocated
+with a single operation, thus not requiring memory allocations for each add
+operation.
+
+However, when the slice needs to exand to accomodate new values, the well
+adopted strategy is to allocate a new, larger slice, copy over all elements from
+the previous slice into the new one and use the new one to add the new elements.
+The problem with this approach is the obvious need to copy all the values from
+the older, small slice, into the new one, yielding a poor performance when the
+amount of values that need copying are fairly large. In other words, this
+approach doesn't scale well with large data sets.
+
+Having said that, there's a third, newer approach to implementing unbounded
+queues: use fixed size linked slices as the underlying data structure.
+
+The fixed sized linked slices approach is a hybrid between the first two,
+providing good memory locality arrays have alongside the efficient growing
+mechanism linked lists offer.
+
+
+## Rationale
+### Research
+[A first implementation](https://github.com/cloud-spin/queue) of the new design
+was built.
+
+The benchmark tests showed the new design was very promising, so I decided to
+research about other possible queue designs and implementations with the goal to
+improve the first design and implementation.
+
+As part of the research to identify the best possible queue designs and
+implementations, I implemented and probed below queue implementations.
+
+- [queueimpl1](https://github.com/christianrpetrin/queue-tests/tree/master/queueimpl1/queueimpl1.go):
+ custom queue implementation that stores the values in a simple slice. Pop
+ removes the first slice element. This is a slice based implementation that
+ tests [this](https://stackoverflow.com/a/26863706) suggestion.
+- [queueimpl2](https://github.com/christianrpetrin/queue-tests/tree/master/queueimpl2/queueimpl2.go):
+ custom queue implementation that stores the values in a simple slice. Pop
+ moves the current position to next one instead of removing the first element.
+ This is a slice based implementation similarly to queueimpl1, but differs in
+ the fact that it uses pointers to point to the current first element in the
+ queue instead of removing the first element.
+- [queueimpl3](https://github.com/christianrpetrin/queue-tests/tree/master/queueimpl3/queueimpl3.go):
+ custom queue implementation that stores the values in linked slices. This
+ implementation tests the queue performance when controlling the length and
+ current positions in the slices using the builtin len and append functions.
+- [queueimpl4](https://github.com/christianrpetrin/queue-tests/tree/master/queueimpl4/queueimpl4.go):
+ custom queue implementation that stores the values in linked arrays. This
+ implementation tests the queue performance when controlling the length and
+ current positions in the arrays using simple local variables instead of the
+ built in len and append functions (i.e. it uses arrays instead of slices).
+- [queueimpl5](https://github.com/christianrpetrin/queue-tests/tree/master/queueimpl5/queueimpl5.go):
+ custom queue implementation that stores the values in linked slices. This
+ implementation tests the queue performance when storing the "next" pointer as
+ part of the values slice instead of having it as a separate "next" field. The
+ next element is stored in the last position of the internal slices, which is a
+ reserved position.
+- [queueimpl6](https://github.com/christianrpetrin/queue-tests/tree/master/queueimpl6/queueimpl6.go):
+ custom queue implementation that stores the values in linked slices. This
+ implementation tests the queue performance when performing lazy creation of
+ the first slice as well as starting with an slice of size 1 and doubling its
+ size up to 128, everytime a new linked slice needs to be created.
+- [queueimpl7](https://github.com/christianrpetrin/queue-tests/tree/master/queueimpl7/queueimpl7.go):
+ custom queue implementation that stores the values in linked slices. This
+ implementation tests the queue performance when performing lazy creation of
+ the internal slice as well as starting with a 1-sized slice, allowing it to
+ grow up to 16 by using the built in append function. Subsequent slices are
+ created with 128 fixed size.
+
+Also as part of the research, I investigated and probed below open source queue
+implementations as well.
+- [gammazero](https://github.com/gammazero/deque): the deque implemented in this
+ package is a slice, ring based queue implementation.
+- [phf](https://github.com/phf/go-queue): this is also a slice, ring based queue
+ implementation. Interesting to note the author did a pretty good job
+ researching and probing other queue implementations as well.
+- [juju](https://github.com/juju/utils/tree/master/deque): the deque implemented
+ in this package uses a linked list based approach, similarly to other
+ experimental implementations in this package such as
+ [queueimpl3](https://github.com/christianrpetrin/queue-tests/tree/master/queueimpl3/queueimpl3.go).
+ The biggest difference between this implementation and the other expirimental
+ ones is the fact that this queue uses the standard list package as the linked
+ list. The standard list package implements a doubly linked list, while the
+ experimental implementations implements their own singly linked list.
+
+The [standard list
+package](https://github.com/golang/go/blob/master/src/container/list/list.go) as
+well as buffered channels were probed as well.
+
+
+### Benchmark Results
+
+Initialization time only<br/> Performance<br/>
+![ns/op](https://github.com/christianrpetrin/queue-tests/blob/master/images/queue-0-items-perf.jpg?raw=true
+"Benchmark tests") <br/>
+
+Memory<br/>
+![B/op](https://github.com/christianrpetrin/queue-tests/blob/master/images/queue-0-items-mem.jpg?raw=true
+"Benchmark tests")
+
+Add and remove 1k items<br/> Performance<br/>
+![ns/op](https://github.com/christianrpetrin/queue-tests/blob/master/images/queue-1k-items-perf.jpg?raw=true
+"Benchmark tests")
+
+Memory<br/>
+![B/op](https://github.com/christianrpetrin/queue-tests/blob/master/images/queue-1k-items-mem.jpg?raw=true
+"Benchmark tests") <br/>
+
+Add and remove 100k items<br/> Performance<br/>
+![ns/op](https://github.com/christianrpetrin/queue-tests/blob/master/images/queue-100k-items-perf.jpg?raw=true
+"Benchmark tests")
+
+Memory<br/>
+![B/op](https://github.com/christianrpetrin/queue-tests/blob/master/images/queue-100k-items-mem.jpg?raw=true
+"Benchmark tests") <br/>
+
+Aggregated Results<br/> Performance<br/>
+![ns/op](https://github.com/christianrpetrin/queue-tests/blob/master/images/queue-line-perf.jpg?raw=true
+"Benchmark tests")
+
+Memory<br/>
+![B/op](https://github.com/christianrpetrin/queue-tests/blob/master/images/queue-line-mem.jpg?raw=true
+"Benchmark tests") <br/>
+
+[Detailed, curated results can be found
+here](https://docs.google.com/spreadsheets/d/e/2PACX-1vRnCm7v51Eo5nq66NsGi8aQI6gL14XYJWqaeRJ78ZIWq1pRCtEZfsLD2FcI-gIpUhhTPnkzqDte_SDB/pubhtml?gid=668319604&single=true)
+
+[Aggregated, curated results can be found
+here](https://docs.google.com/spreadsheets/d/e/2PACX-1vRnCm7v51Eo5nq66NsGi8aQI6gL14XYJWqaeRJ78ZIWq1pRCtEZfsLD2FcI-gIpUhhTPnkzqDte_SDB/pubhtml?gid=582031751&single=true)
+<br/>
+
+Given above results, [queueimpl7](queueimpl7/queueimpl7.go), henceforth just
+"impl7", proved to be the most balanced implementation, being either faster or
+very competitive in all tests scenarios from a performance and memory
+perspective. Refer [here](https://github.com/christianrpetrin/queue-tests) for
+more details about the tests.
+
+[The benchmark tests can be found
+here.](https://github.com/christianrpetrin/queue-tests/blob/master/benchmark_test.go).
+
+
+#### Impl7 Design and Implementation
+[Impl7](https://github.com/christianrpetrin/queue-tests/tree/master/queueimpl7/queueimpl7.go)
+was the result of the observation that some slice based implementations such as
+[queueimpl1](https://github.com/christianrpetrin/queue-tests/tree/master/queueimpl1/queueimpl1.go)
+and
+[queueimpl2](https://github.com/christianrpetrin/queue-tests/tree/master/queueimpl2/queueimpl2.go)
+offers phenomenal performance when the queue is used with small data sets.
+
+For instance, comparing impl3 (very simple linked slice implementation) with
+impl1 (very simple slice based implementation), the results at adding 0 (init
+time only), 1 and 10 items are very favorable for impl1, from a performance and
+memory perspective.
+
+```
+benchstat rawresults/bench-impl1.txt rawresults/bench-impl3.txt
+name old time/op new time/op delta
+/0-4 6.83ns ± 3% 472.53ns ± 7% +6821.99% (p=0.000 n=20+17)
+/1-4 48.1ns ± 6% 492.4ns ± 5% +924.66% (p=0.000 n=20+20)
+/10-4 532ns ± 5% 695ns ± 8% +30.57% (p=0.000 n=20+20)
+/100-4 3.19µs ± 2% 2.50µs ± 4% -21.69% (p=0.000 n=18+19)
+/1000-4 24.5µs ± 3% 23.6µs ± 2% -3.33% (p=0.000 n=19+19)
+/10000-4 322µs ± 4% 238µs ± 1% -26.02% (p=0.000 n=19+18)
+/100000-4 15.8ms ±10% 3.3ms ±13% -79.32% (p=0.000 n=20+20)
+
+name old alloc/op new alloc/op delta
+/0-4 0.00B 2080.00B ± 0% +Inf% (p=0.000 n=20+20)
+/1-4 16.0B ± 0% 2080.0B ± 0% +12900.00% (p=0.000 n=20+20)
+/10-4 568B ± 0% 2152B ± 0% +278.87% (p=0.000 n=20+20)
+/100-4 4.36kB ± 0% 2.87kB ± 0% -34.13% (p=0.000 n=20+20)
+/1000-4 40.7kB ± 0% 24.6kB ± 0% -39.54% (p=0.000 n=20+20)
+/10000-4 746kB ± 0% 244kB ± 0% -67.27% (p=0.000 n=20+20)
+/100000-4 10.0MB ± 0% 2.4MB ± 0% -75.85% (p=0.000 n=15+20)
+
+name old allocs/op new allocs/op delta
+/0-4 0.00 2.00 ± 0% +Inf% (p=0.000 n=20+20)
+/1-4 1.00 ± 0% 2.00 ± 0% +100.00% (p=0.000 n=20+20)
+/10-4 14.0 ± 0% 11.0 ± 0% -21.43% (p=0.000 n=20+20)
+/100-4 108 ± 0% 101 ± 0% -6.48% (p=0.000 n=20+20)
+/1000-4 1.01k ± 0% 1.01k ± 0% +0.50% (p=0.000 n=20+20)
+/10000-4 10.0k ± 0% 10.2k ± 0% +1.35% (p=0.000 n=20+20)
+/100000-4 100k ± 0% 102k ± 0% +1.53% (p=0.000 n=20+20)
+```
+
+Impl7 is a bybrid experiment between using a simple slice based queue
+implementation for small data sets and the fixed sized linked slice approach for
+large data sets, which is an approach that scales really well, offering really
+good performance for small and large data sets.
+
+The implentation starts by lazily creating the first slice to hold the first
+values added to the queue.
+
+```go
+const (
+ // firstSliceSize holds the size of the first slice.
+ firstSliceSize = 1
+
+ // maxFirstSliceSize holds the maximum size of the first slice.
+ maxFirstSliceSize = 16
+
+ // maxInternalSliceSize holds the maximum size of each internal slice.
+ maxInternalSliceSize = 128
+)
+
+...
+
+// Push adds a value to the queue.
+// The complexity is amortized O(1).
+func (q *Queueimpl7) Push(v interface{}) {
+ if q.head == nil {
+ h := newNode(firstSliceSize) // Returns a 1-sized slice.
+ q.head = h
+ q.tail = h
+ q.lastSliceSize = maxFirstSliceSize
+ } else if len(q.tail.v) >= q.lastSliceSize {
+ n := newNode(maxInternalSliceSize) // Returns a 128-sized slice.
+ q.tail.n = n
+ q.tail = n
+ q.lastSliceSize = maxInternalSliceSize
+ }
+
+ q.tail.v = append(q.tail.v, v)
+ q.len++
+}
+
+...
+
+// newNode returns an initialized node.
+func newNode(capacity int) *Node {
+ return &Node{
+ v: make([]interface{}, 0, capacity),
+ }
+}
+```
+
+The very first created slice is created with capacity 1. The implementation
+allows the builtin append function to dynamically resize the slice up to 16
+(maxFirstSliceSize) positions. After that it reverts to creating fixed sized 128
+position slices, which offers the best performance for data sets above 16 items.
+
+16 items was chosen as this seems to provide the best balanced performance for
+small and large data sets according to the [array size benchmark
+tests](https://github.com/christianrpetrin/queue-tests/blob/master/bench_slice_size.md).
+Above 16 items, growing the slice means allocating a new, larger one and copying
+all 16 elements from the previous slice into the new one. The append function
+phenomenal performance can only compensate for the added copying of elements if
+the data set is very small, no more than 8 items in our tests. For above 8
+items, the fixed sized slice approach is consistently faster and uses less
+memory, where 128 sized slices are allocated and linked together when the data
+structure needs to scale to accomodate new values.
+
+Why 16? Why not 15 or 14?
+
+The builtin append function, as of "go1.11 darwin/amd64", seems to double the
+slice size every time is needs to allocate a new one.
+
+```go
+ts := make([]int, 0, 1)
+
+ts = append(ts, 1)
+fmt.Println(cap(ts)) // Slice has 1 item; output: 1
+
+ts = append(ts, 1)
+fmt.Println(cap(ts)) // Slice has 2 items; output: 2
+
+ts = append(ts, 1)
+fmt.Println(cap(ts)) // Slice has 3 items; output: 4
+
+ts = append(ts, 1)
+ts = append(ts, 1)
+fmt.Println(cap(ts)) // Slice has 5 items; output: 8
+
+ts = append(ts, 1)
+ts = append(ts, 1)
+ts = append(ts, 1)
+ts = append(ts, 1)
+fmt.Println(cap(ts)) // Slice has 9 items; output: 16
+```
+
+Since the append function will resize the slice from 8 to 16 positions, it makes
+sense to use all 16 already allocated positions before switching to the fixed
+sized slices approach.
+
+
+#### Impl7 Benchmark Results
+Below compares impl7 with a few selected implementations.
+
+The tests name are formatted given below.
+- Benchmark/N-4: benchmark a queue implementation where N denotes the number of
+ items added and removed to/from the queue; 4 means the number of CPU cores in
+ the host machine.
+
+Examples:
+- Benchmark/0-4: benchmark the queue by creating a new instance of it. This only
+ test initialization time.
+- Benchmark/100-4: benchmark the queue by creating a new instance of it and
+ adding and removing 100 items to/from the queue.
+
+---
+
+Standard list used as a FIFO queue vs impl7.
+```
+benchstat rawresults/bench-list.txt rawresults/bench-impl7.txt
+name old time/op new time/op delta
+/0-4 34.9ns ± 1% 1.2ns ± 3% -96.64% (p=0.000 n=19+20)
+/1-4 77.0ns ± 1% 68.3ns ± 1% -11.21% (p=0.000 n=20+20)
+/10-4 574ns ± 0% 578ns ± 0% +0.59% (p=0.000 n=18+20)
+/100-4 5.94µs ± 1% 3.07µs ± 0% -48.28% (p=0.000 n=19+18)
+/1000-4 56.0µs ± 1% 25.8µs ± 1% -53.92% (p=0.000 n=20+20)
+/10000-4 618µs ± 1% 260µs ± 1% -57.99% (p=0.000 n=20+18)
+/100000-4 13.1ms ± 6% 3.1ms ± 3% -76.50% (p=0.000 n=20+20)
+
+name old alloc/op new alloc/op delta
+/0-4 48.0B ± 0% 0.0B -100.00% (p=0.000 n=20+20)
+/1-4 96.0B ± 0% 48.0B ± 0% -50.00% (p=0.000 n=20+20)
+/10-4 600B ± 0% 600B ± 0% ~ (all equal)
+/100-4 5.64kB ± 0% 3.40kB ± 0% -39.72% (p=0.000 n=20+20)
+/1000-4 56.0kB ± 0% 25.2kB ± 0% -55.10% (p=0.000 n=20+20)
+/10000-4 560kB ± 0% 243kB ± 0% -56.65% (p=0.000 n=20+20)
+/100000-4 5.60MB ± 0% 2.43MB ± 0% -56.66% (p=0.000 n=18+20)
+
+name old allocs/op new allocs/op delta
+/0-4 1.00 ± 0% 0.00 -100.00% (p=0.000 n=20+20)
+/1-4 2.00 ± 0% 2.00 ± 0% ~ (all equal)
+/10-4 20.0 ± 0% 15.0 ± 0% -25.00% (p=0.000 n=20+20)
+/100-4 200 ± 0% 107 ± 0% -46.50% (p=0.000 n=20+20)
+/1000-4 2.00k ± 0% 1.02k ± 0% -48.95% (p=0.000 n=20+20)
+/10000-4 20.0k ± 0% 10.2k ± 0% -49.20% (p=0.000 n=20+20)
+/100000-4 200k ± 0% 102k ± 0% -49.22% (p=0.000 n=20+20)
+```
+Impl7 is:
+- Up to ~29x faster (1.2ns vs 34.9ns) than list package for init time (0 items)
+- Up to ~4x faster (3.1ms vs 13.1ms) than list package for 10k items
+- Uses ~1/2 memory (2.43MB vs 5.60MB) than list package for 10k items
+
+---
+
+Impl1 (simple slice based queue implementaion) vs impl7.
+```
+benchstat rawresults/bench-impl1.txt rawresults/bench-impl7.txt
+name old time/op new time/op delta
+/0-4 6.83ns ± 3% 1.18ns ± 3% -82.79% (p=0.000 n=20+20)
+/1-4 48.1ns ± 6% 68.3ns ± 1% +42.23% (p=0.000 n=20+20)
+/10-4 532ns ± 5% 578ns ± 0% +8.55% (p=0.000 n=20+20)
+/100-4 3.19µs ± 2% 3.07µs ± 0% -3.74% (p=0.000 n=18+18)
+/1000-4 24.5µs ± 3% 25.8µs ± 1% +5.51% (p=0.000 n=19+20)
+/10000-4 322µs ± 4% 260µs ± 1% -19.23% (p=0.000 n=19+18)
+/100000-4 15.8ms ±10% 3.1ms ± 3% -80.60% (p=0.000 n=20+20)
+
+name old alloc/op new alloc/op delta
+/0-4 0.00B 0.00B ~ (all equal)
+/1-4 16.0B ± 0% 48.0B ± 0% +200.00% (p=0.000 n=20+20)
+/10-4 568B ± 0% 600B ± 0% +5.63% (p=0.000 n=20+20)
+/100-4 4.36kB ± 0% 3.40kB ± 0% -22.02% (p=0.000 n=20+20)
+/1000-4 40.7kB ± 0% 25.2kB ± 0% -38.25% (p=0.000 n=20+20)
+/10000-4 746kB ± 0% 243kB ± 0% -67.47% (p=0.000 n=20+20)
+/100000-4 10.0MB ± 0% 2.4MB ± 0% -75.84% (p=0.000 n=15+20)
+
+name old allocs/op new allocs/op delta
+/0-4 0.00 0.00 ~ (all equal)
+/1-4 1.00 ± 0% 2.00 ± 0% +100.00% (p=0.000 n=20+20)
+/10-4 14.0 ± 0% 15.0 ± 0% +7.14% (p=0.000 n=20+20)
+/100-4 108 ± 0% 107 ± 0% -0.93% (p=0.000 n=20+20)
+/1000-4 1.01k ± 0% 1.02k ± 0% +1.09% (p=0.000 n=20+20)
+/10000-4 10.0k ± 0% 10.2k ± 0% +1.39% (p=0.000 n=20+20)
+/100000-4 100k ± 0% 102k ± 0% +1.54% (p=0.000 n=20+20)
+```
+Impl7 is:
+- Up to ~5x faster (1.18ns vs 6.83ns) than impl1 for init time (0 items)
+- Up to ~5x faster (3.1ms vs 15.8ms) than impl1 for 10k items
+- Uses ~1/4 memory (2.4MB vs 10MB) than impl1 for 10k items
+
+It's important to note that the performance and memory gains for impl7 is
+exponential like the larger the data set is due to the fact slice based
+implementations doesn't scale well, paying a higher and higher price,
+performance and memory wise, everytime it needs to scale to accomadate an ever
+expanding data set.
+
+---
+
+Gammazero (slice, ring based FIFO queue implementation) vs impl7.
+```
+benchstat rawresults/bench-gammazero.txt rawresults/bench-impl7.txt
+name old time/op new time/op delta
+/0-4 1.45ns ± 0% 1.18ns ± 3% -18.97% (p=0.000 n=14+20)
+/1-4 103ns ± 2% 68ns ± 1% -33.83% (p=0.000 n=17+20)
+/10-4 337ns ± 1% 578ns ± 0% +71.39% (p=0.000 n=16+20)
+/100-4 4.16µs ± 2% 3.07µs ± 0% -26.12% (p=0.000 n=20+18)
+/1000-4 35.6µs ± 2% 25.8µs ± 1% -27.59% (p=0.000 n=19+20)
+/10000-4 342µs ± 2% 260µs ± 1% -23.96% (p=0.000 n=20+18)
+/100000-4 12.0ms ±10% 3.1ms ± 3% -74.46% (p=0.000 n=20+20)
+
+name old alloc/op new alloc/op delta
+/0-4 0.00B 0.00B ~ (all equal)
+/1-4 256B ± 0% 48B ± 0% -81.25% (p=0.000 n=20+20)
+/10-4 328B ± 0% 600B ± 0% +82.93% (p=0.000 n=20+20)
+/100-4 6.42kB ± 0% 3.40kB ± 0% -47.07% (p=0.000 n=20+20)
+/1000-4 56.6kB ± 0% 25.2kB ± 0% -55.57% (p=0.000 n=20+20)
+/10000-4 473kB ± 0% 243kB ± 0% -48.64% (p=0.000 n=20+20)
+/100000-4 7.09MB ± 0% 2.43MB ± 0% -65.77% (p=0.000 n=20+20)
+
+name old allocs/op new allocs/op delta
+/0-4 0.00 0.00 ~ (all equal)
+/1-4 1.00 ± 0% 2.00 ± 0% +100.00% (p=0.000 n=20+20)
+/10-4 10.0 ± 0% 15.0 ± 0% +50.00% (p=0.000 n=20+20)
+/100-4 106 ± 0% 107 ± 0% +0.94% (p=0.000 n=20+20)
+/1000-4 1.01k ± 0% 1.02k ± 0% +0.89% (p=0.000 n=20+20)
+/10000-4 10.0k ± 0% 10.2k ± 0% +1.43% (p=0.000 n=20+20)
+/100000-4 100k ± 0% 102k ± 0% +1.54% (p=0.000 n=20+20)
+```
+Impl7 is:
+- Up to ~18% faster (1.18ns vs 1.45ns) than gammazero for init time (0 items)
+- Up to ~3x faster (3.1ms vs 12.0ms) than gammazero for 10k items
+- Uses ~1/2 memory (2.43MB vs 7.09MB) than gammazero for 10k items
+
+---
+
+Buffered channel vs impl7.
+```
+benchstat rawresults/bench-channel.txt rawresults/bench-impl7.txt
+name old time/op new time/op delta
+/0-4 30.2ns ± 1% 1.2ns ± 3% -96.12% (p=0.000 n=19+20)
+/1-4 87.6ns ± 1% 68.3ns ± 1% -22.00% (p=0.000 n=19+20)
+/10-4 704ns ± 1% 578ns ± 0% -17.90% (p=0.000 n=20+20)
+/100-4 6.78µs ± 1% 3.07µs ± 0% -54.70% (p=0.000 n=20+18)
+/1000-4 67.3µs ± 1% 25.8µs ± 1% -61.65% (p=0.000 n=20+20)
+/10000-4 672µs ± 1% 260µs ± 1% -61.36% (p=0.000 n=19+18)
+/100000-4 6.76ms ± 1% 3.07ms ± 3% -54.61% (p=0.000 n=19+20)
+
+name old alloc/op new alloc/op delta
+/0-4 96.0B ± 0% 0.0B -100.00% (p=0.000 n=20+20)
+/1-4 112B ± 0% 48B ± 0% -57.14% (p=0.000 n=20+20)
+/10-4 248B ± 0% 600B ± 0% +141.94% (p=0.000 n=20+20)
+/100-4 1.69kB ± 0% 3.40kB ± 0% +101.42% (p=0.000 n=20+20)
+/1000-4 16.2kB ± 0% 25.2kB ± 0% +55.46% (p=0.000 n=20+20)
+/10000-4 162kB ± 0% 243kB ± 0% +49.93% (p=0.000 n=20+20)
+/100000-4 1.60MB ± 0% 2.43MB ± 0% +51.43% (p=0.000 n=16+20)
+
+name old allocs/op new allocs/op delta
+/0-4 1.00 ± 0% 0.00 -100.00% (p=0.000 n=20+20)
+/1-4 1.00 ± 0% 2.00 ± 0% +100.00% (p=0.000 n=20+20)
+/10-4 10.0 ± 0% 15.0 ± 0% +50.00% (p=0.000 n=20+20)
+/100-4 100 ± 0% 107 ± 0% +7.00% (p=0.000 n=20+20)
+/1000-4 1.00k ± 0% 1.02k ± 0% +2.10% (p=0.000 n=20+20)
+/10000-4 10.0k ± 0% 10.2k ± 0% +1.61% (p=0.000 n=20+20)
+/100000-4 100k ± 0% 102k ± 0% +1.57% (p=0.000 n=20+20)
+```
+Impl7 is:
+- Up to ~25x faster (1.2ns vs 30.2ns) than channels for init time (0 items)
+- Up to ~2x faster (3.07ms vs 6.76ms) than channels for 10k items
+- Uses ~50% MORE memory (2.43MB vs 1.60MB) than channels for 10k items
+
+Above is not really a fair comparison as standard buffered channels doesn't
+scale (at all) and they are meant for routine synchronization. Nonetheless, they
+can and make for an excellent FIFO queue option. Still, impl7 is consistenly
+faster than channels across the board, but uses considerably more memory than
+channels.
+
+---
+
+Given its excellent performance under all scenarios, the hybrid approach impl7
+seems to be the ideal candidate for a high performance, low memory footprint
+general purpose FIFO queue.
+
+For above reasons, I propose to port impl7 to the standard library.
+
+All raw benchmark results can be found
+[here](https://github.com/christianrpetrin/queue-tests/tree/master/rawresults).
+
+
+### Internal Slice Size
+[Impl7](https://github.com/christianrpetrin/queue-tests/tree/master/queueimpl7/queueimpl7.go)
+uses linked slices as its underlying data structure.
+
+The size of the internal slice does influence performance and memory consumption
+significantly.
+
+According to the [internal slice size bench
+tests](https://github.com/christianrpetrin/queue-tests/blob/master/queueimpl7/benchmark_test.go),
+larger internal slice sizes yields better performance and lower memory
+footprint. However, the gains diminishes dramatically as the slice size
+increases.
+
+Below are a few interesting results from the benchmark tests.
+
+```
+BenchmarkMaxSubsequentSliceSize/1-4 20000 76836 ns/op 53967 B/op 2752 allocs/op
+BenchmarkMaxSubsequentSliceSize/2-4 30000 59811 ns/op 40015 B/op 1880 allocs/op
+BenchmarkMaxSubsequentSliceSize/4-4 30000 42925 ns/op 33039 B/op 1444 allocs/op
+BenchmarkMaxSubsequentSliceSize/8-4 50000 36946 ns/op 29551 B/op 1226 allocs/op
+BenchmarkMaxSubsequentSliceSize/16-4 50000 30597 ns/op 27951 B/op 1118 allocs/op
+BenchmarkMaxSubsequentSliceSize/32-4 50000 28273 ns/op 27343 B/op 1064 allocs/op
+BenchmarkMaxSubsequentSliceSize/64-4 50000 26969 ns/op 26895 B/op 1036 allocs/op
+BenchmarkMaxSubsequentSliceSize/128-4 50000 27316 ns/op 26671 B/op 1022 allocs/op
+BenchmarkMaxSubsequentSliceSize/256-4 50000 26221 ns/op 28623 B/op 1016 allocs/op
+BenchmarkMaxSubsequentSliceSize/512-4 50000 25882 ns/op 28559 B/op 1012 allocs/op
+BenchmarkMaxSubsequentSliceSize/1024-4 50000 25674 ns/op 28527 B/op 1010 allocs/op
+```
+
+Given the fact that larger internal slices also means potentially more unused
+memory in some scenarios, 128 seems to be the perfect balance between
+performance and worst case scenario for memory footprint.
+
+Full results can be found
+[here](https://github.com/christianrpetrin/queue-tests/blob/master/bench_slice_size.md).
+
+
+### API
+[Impl7](https://github.com/christianrpetrin/queue-tests/tree/master/queueimpl7/queueimpl7.go)
+implements below API methods.
+
+| Operation | Method |
+| --- | --- |
+| Add | func (q *Queueimpl7) Push(v interface{}) |
+| Remove | func (q *Queueimpl7) Pop() (interface{}, bool) |
+| Size | func (q *Queueimpl7) Len() int |
+| Return First | func (q *Queueimpl7) Front() (interface{}, bool) |
+
+As nil values are considered valid queue values, similarly to the map data
+structure, "Front" and "Pop" returns a second bool parameter to indicate whether
+the returned value is valid and whether the queue is empty or not.
+
+The reasonale for above method names and signatures are the need to keep
+compatibility with existing Go data structures such as the
+[list](https://github.com/golang/go/blob/master/src/container/list/list.go),
+[ring](https://github.com/golang/go/blob/master/src/container/ring/ring.go) and
+[heap](https://github.com/golang/go/blob/master/src/container/heap/heap.go)
+packages.
+
+Below are the method names used by the existing list, ring and heap Go data
+structures, as well as the new proposed queue.
+
+| Operation | list | ring | heap | queue |
+| --- | --- | --- | --- | --- |
+| Add | PushFront/PushBack | Link | Push | Push |
+| Remove | Remove | Unlink | Pop | Pop |
+| Size | Len | Len | - | Len |
+| Return First | Front | - | - | Front |
+
+For comparison purposes, below are the method names for
+[C++](http://www.cplusplus.com/reference/queue/queue/),
+[Java](https://docs.oracle.com/javase/7/docs/api/java/util/Queue.html) and
+[C#](https://docs.microsoft.com/en-us/dotnet/api/system.collections.generic.queue-1?view=netframework-4.7.2)
+for their queue implementation.
+
+| Operation | C++ | Java | C# |
+| --- | --- | --- | --- |
+| Add | push | add/offer | Enqueue |
+| Remove | pop | remove/poll | Dequeue |
+| Size | size | - | Count |
+| Return First | front | peek | Peek |
+
+
+### Range Support
+Just like the current container data strucutures such as the list, ring and
+heap, Impl7 doesn't support the range keyword for navigation.
+
+The API offers two ways to iterate over the queue items.
+
+Either use "Pop" to retrieve the first current element and the second bool
+parameter to check for an empty queue.
+```go
+for v, ok := q.Pop(); ok; v, ok = q.Pop() {
+ // Do something with v
+}
+```
+
+Or use "Len" and "Pop" to check for an empty queue and retrieve the first
+current element.
+```go
+for q.Len() > 0 {
+ v, _ := q.Pop()
+ // Do something with v
+}
+```
+
+### Data Type
+Just like the current container data strucutures such as the
+[list](https://github.com/golang/go/blob/master/src/container/list/list.go),
+[ring](https://github.com/golang/go/blob/master/src/container/ring/ring.go) and
+[heap](https://github.com/golang/go/tree/master/src/container/heap), Impl7 only
+supported data type is "interface{}", making it usable by virtually any Go
+types.
+
+It is possible to implement support for specialized data types such as int,
+float, bool, etc, but that would require duplicating the Push/Pop methods to
+accept the different data types, much like
+strconv.ParseBool/ParseFloat/ParseInt/etc. However, with the impending release
+of generics, we should probrably wait as generics would solve this problem
+nicely.
+
+
+
+### Safe for Concurrent Use
+[Impl7](https://github.com/christianrpetrin/queue-tests/tree/master/queueimpl7/queueimpl7.go)
+is not safe for concurrent use by default. The rationale for this decision is
+below.
+
+1) Not all users will need a safe for concurrent use queue implementation
+2) Executing routine synchronization is expensive, causing performance to drop
+ very significantly
+3) Getting impl7 to be safe for concurrent use is actually very simple
+
+Below is an example of a safe for concurrent use queue implementation that uses
+impl7 as its underlying queue.
+
+```go
+package tests
+
+import (
+ "fmt"
+ "sync"
+ "testing"
+
+ "github.com/christianrpetrin/queue-tests/queueimpl7"
+)
+
+type SafeQueue struct {
+ q Queueimpl7
+ m sync.Mutex
+}
+
+func (s *SafeQueue) Len() int {
+ s.m.Lock()
+ defer s.m.Unlock()
+
+ return s.q.Len()
+}
+
+func (s *SafeQueue) Push(v interface{}) {
+ s.m.Lock()
+ defer s.m.Unlock()
+
+ s.q.Push(v)
+}
+
+func (s *SafeQueue) Pop() (interface{}, bool) {
+ s.m.Lock()
+ defer s.m.Unlock()
+
+ return s.q.Pop()
+}
+
+func (s *SafeQueue) Front() (interface{}, bool) {
+ s.m.Lock()
+ defer s.m.Unlock()
+
+ return s.q.Front()
+}
+
+func TestSafeQueue(t *testing.T) {
+ var q SafeQueue
+
+ q.Push(1)
+ q.Push(2)
+
+ for v, ok := q.Pop(); ok; v, ok = q.Pop() {
+ fmt.Println(v)
+ }
+
+ // Output:
+ // 1
+ // 2
+}
+```
+
+### Drawbacks
+The biggest drawback of the proposed implementation is the potentially extra
+allocated but not used memory in its head and tail slices.
+
+This scenario realizes when exactly 17 items are added to the queue, causing the
+creation of a full sized internal slice of 128 positions. Initially only the
+first element in this new slice is used to store the added value. All the other
+127 elements are already allocated, but not used.
+
+```go
+// Assuming a 128 internal sized slice.
+q := queueimpl7.New()
+
+// Push 16 items to fill the first dynamic slice (sized 16).
+for i := 1; i <= 16; i++ {
+ q.Push(i)
+}
+// Push 1 extra item that causes the creation of a new 128 sized slice to store this value.
+q.Push(17)
+
+// Pops the first 16 items to release the first slice (sized 16).
+for i := 1; i <= 16; i++ {
+ q.Pop()
+}
+
+// As unsafe.Sizeof (https://golang.org/pkg/unsafe/#Sizeof) doesn't consider the length of slices,
+// we need to manually calculate the memory used by the internal slices.
+var internalSliceType interface{}
+fmt.Println(fmt.Sprintf("%d bytes", unsafe.Sizeof(q)+(unsafe.Sizeof(internalSliceType) /* bytes per slice position */ *127 /* head slice unused positions */)))
+
+// Output for a 64bit system (Intel(R) Core(TM) i5-7267U CPU @ 3.10GHz): 2040 bytes
+```
+
+The worst case scenario realizes when exactly 145 items are added to the queue
+and 143 items are removed. This causes the queue struct to hold a 128-sized
+slice as its head slice, but only the last element is actually used. Similarly,
+the queue struct will hold a separate 128-sized slice as its tail slice, but
+only the first position in that slice is being used.
+
+```go
+// Assuming a 128 internal sized slice.
+q := queueimpl7.New()
+
+// Push 16 items to fill the first dynamic slice (sized 16).
+for i := 1; i <= 16; i++ {
+ q.Push(i)
+}
+
+// Push an additional 128 items to fill the first full sized slice (sized 128).
+for i := 1; i <= 128; i++ {
+ q.Push(i)
+}
+
+// Push 1 extra item that causes the creation of a new 128 sized slice to store this value,
+// adding a total of 145 items to the queue.
+q.Push(1)
+
+// Pops the first 143 items to release the first dynamic slice (sized 16) and
+// 127 items from the first full sized slice (sized 128).
+for i := 1; i <= 143; i++ {
+ q.Pop()
+}
+
+// As unsafe.Sizeof (https://golang.org/pkg/unsafe/#Sizeof) doesn't consider the length of slices,
+// we need to manually calculate the memory used by the internal slices.
+var internalSliceType interface{}
+fmt.Println(fmt.Sprintf("%d bytes", unsafe.Sizeof(q)+(unsafe.Sizeof(internalSliceType) /* bytes per slice position */ *(127 /* head slice unused positions */ +127 /* tail slice unused positions */))))
+
+// Output for a 64bit system (Intel(R) Core(TM) i5-7267U CPU @ 3.10GHz): 4072 bytes
+```
+
+Above code were run on Go version "go1.11 darwin/amd64".
+
+
+### Alternative Solutions
+Buffered channels provide an excellent option to be used as a simple FIFO queue.
+However, there's a few drawbacks when using buffered channels as queues.
+
+1) Channels block the consumers once its internal limit (initialized length) has
+ been reached
+2) Channels can't be used as data store to store sequential data as it require
+ immediate consumers
+3) Channels are meant for routine synchronization, not as a data structure, so
+ they are safe for concurrent use, which is not a feature that is always
+ desired in a queue
+
+The use of the standard list package as an unbounded queue seems to be the
+perfect candidate for large data sets. However, its performance is dramatically
+slower than impl7 and displays a much bigger memory footprint.
+
+Similarly, the use of dynamic slices to implement queues doesn't offer very good
+performance or memory footprint either, as confirmed by the [benchmark test
+results](https://github.com/christianrpetrin/queue-tests/blob/master/bench_queue.md).
+
+
+## Compatibility
+This is a new package; existing code continues to compile, in keeping with the
+[compatibility guidelines](https://golang.org/doc/go1compat).
+
+
+## Implementation
+A complete implementation can be found
+[here](https://github.com/christianrpetrin/queue-tests/blob/master/queueimpl7).
+Only a few changes such as package name and comments would need to be changed,
+besides other changes the Go community feels are necessary. I'll be porting the
+code to the standard library by submitting PR, addressing comments, merging,
+etc.
+
+The goal, subject to proposal approval, is to have the new package checked in
+for the Go 1.12 release.
+
+
+## Open issues
+If this queue proposal is accepted, it would make sense to research and
+implement similar approach for a stack implementation as well.
+
+Should this be a deque (double-ended queue) implementation instead? The deque
+could be used as a stack as well, but it would make more sense to have a queue
+and stack implementations (like most mainstream languages have) instead of a
+deque that can be used as a stack (confusing). Stack is a very important
+computer science data structure as well and so I believe Go should have a
+specialized implementation for it as well (given the specialized implementation
+offers real value to the users and not just a "nice" named interface and
+methods).
+
+Should "Pop" and "Front" return only the value instead of the value and a second
+bool parameter (which indicates whether the queue is empty or not)? The
+implication of the change is adding nil values wouldn't be valid anymore so
+"Pop" and "Front" would return nil when the queue is empty. Panic should be
+avoided in libraries.
+
+The memory footprint for a 128 sized internal slice causes, in the worst case
+scenario, a 4072 bytes of memory allocated but not used. Switching to 64 means
+roughly half the memory would be used with a slight ~2.89% performance drop
+(252813ns vs 260137ns). The extra memory footprint is not worth the extra
+performance gain is a very good point to make. Should we change this value to 64
+or maybe make it configurable?
+
+Should we also provide a safe for concurrent use implementation? A specialized
+implementation that would rely on atomic operations to update its internal
+indices and length could offer a much better performance when comparing to a
+similar implementation that relies on mutexes. Maybe a linked channel approach
+would be the best solution for a safe for concurrent use, yet very performant
+solution. I'll be testing this approach in the near future.
+
+With the impending release of generics, should we wait to release the new queue
+package once the new generics framework is released?
+
+Should we implement support for the range keyword for the new queue? It could be
+done in a generic way so other data structures could also benefit from this
+feature. For now, IMO, this is a topic for another proposal/discussion.
+
+
+## Summary
+I propose to add a new package, "container/queue", to the standard library to
+support an in-memory, unbounded, general purpose queue implementation.
+
+I feel strongly this proposal should be accepted due to below reasons.
+
+1) The proposed solution was well researched and probed, being dramatically and
+ consistently faster than 6 other experimental queue implementations as well 3
+ promising open source queue implementations as well the standard list package
+ and buffered channels; it still consumes considerably less memory than every
+ other queue implementation tested, except for buffered channels
+2) The proposed solution uses a new, unique approach to building queues, yet its
+ [implementation](https://github.com/christianrpetrin/queue-tests/blob/master/queueimpl7/queueimpl7.go)
+ is clean and extremely simple. Both main methods, "Push" and "Pop", are
+ composed of only 13 and 14 lines of code, respectively. The proposed
+ implementation also have proper tests with 100% test coverage. I feel
+ strongly the proposed solution will require minimal maintenance moving
+ forward
+3) I'll implement any changes the Go community feels are needed for the proposed
+ solution to be worth of the standard library and the Go community
+
+Design document with more details at https://golang.org/issue/NNNNN.