| # Proposal: Built in support for high performance unbounded queue |
| |
| Author: Christian Petrin. |
| |
| Last updated: October 2, 2018 |
| |
| Discussion at: https://github.com/golang/go/issues/27935 |
| |
| Design document at https://github.com/golang/proposal/blob/master/design/27935-unbounded-queue-package.md |
| |
| |
| ## Abstract |
| I propose to add a new package, "container/queue", to the standard library to |
| support an in-memory, unbounded, general purpose queue implementation. |
| |
| [Queues](https://en.wikipedia.org/wiki/Queue_(abstract_data_type)) in computer |
| science is a very old, well established and well known concept, yet Go doesn't |
| provide a specialized, safe to use, performant and issue free unbounded queue |
| implementation. |
| |
| Buffered channels provide an excellent option to be used as a queue, but |
| buffered channels are bounded and so doesn't scale to support very large data |
| sets. The same applies for the standard [ring |
| package](https://github.com/golang/go/tree/master/src/container/ring). |
| |
| The standard [list |
| package](https://github.com/golang/go/tree/master/src/container/list) can be |
| used as the underlying data structure for building unbounded queues, but the |
| performance yielded by this linked list based implementation [is not |
| optimal](https://github.com/christianrpetrin/queue-tests/blob/master/bench_queue.md). |
| |
| Implementing a queue using slices as suggested |
| [here](https://stackoverflow.com/a/26863706) is a feasible approach, but the |
| performance yielded by this implementation can be abysmal in some [high load |
| scenarios](https://github.com/christianrpetrin/queue-tests/blob/master/bench_queue.md). |
| |
| ## Background |
| Queues that grows dynamically has many uses. As an example, I'm working on a |
| logging system called [CloudLogger](https://github.com/cloud-logger/docs) that |
| sends all logged data to external logging management systems, such as |
| [Stackdriver](https://cloud.google.com/stackdriver/) and |
| [Cloudwatch](https://aws.amazon.com/cloudwatch/). External logging systems |
| typically [rate limit](https://en.wikipedia.org/wiki/Rate_limiting) how much |
| data their service will accept for a given account and time frame. So in a |
| scenario where the hosting application is logging more data than the logging |
| management system will accept at a given moment, CloudLogger has to queue the |
| extra logs and send them to the logging management system at a pace the system |
| will accept. As there's no telling how much data will have to be queued as it |
| depends on the current traffic, an unbounded, dynamically growing queue is the |
| ideal data structure to be used. Buffered channels in this scenario is not ideal |
| as they have a limit on how much data they will accept, and once that limit has |
| been reached, the producers (routines adding to the channel) start to block, |
| making the adding to the channel operation an "eventual" synchronous process. A |
| fully asynchronous operation in this scenario is highly desirable as logging |
| data should not slow down significantly the hosting application. |
| |
| Above problem is a problem that, potentially, every system that calls another |
| system faces. And in the [cloud](https://en.wikipedia.org/wiki/Cloud_computing) |
| and [microservices](https://en.wikipedia.org/wiki/Microservices) era, this is an |
| extremely common scenario. |
| |
| Due to the lack of support for built in unbounded queues in Go, Go engineers are |
| left to either: |
| 1) Research and use external packages, or |
| 2) Build their own queue implementation |
| |
| Both approaches are riddled with pitfalls. |
| |
| Using external packages, specially in enterprise level software, requires a lot |
| of care as using external, potentially untested and hard to understand code can |
| have unwanted consequences. This problem is made much worse by the fact that, |
| currently, there's no well established and disseminated open source Go queue |
| implementation according to [this stackoverflow |
| discussion](https://stackoverflow.com/questions/2818852/is-there-a-queue-implementation), |
| [this github search for Go |
| queues](https://github.com/search?l=Go&q=go+queue&type=Repositories) and |
| [Awesome Go](https://awesome-go.com/). |
| |
| Building a queue, on the other hand, might sound like a compelling argument, but |
| building efficient, high performant, bug free unbounded queue is a hard job that |
| requires a pretty solid computer science foundation as well a good deal of time |
| to research different design approaches, test different implementations, make |
| sure the code is bug and memory leak free, etc. |
| |
| In the end what Go engineers have been doing up to this point is building their |
| own queues, which are for the most part inefficient and can have disastrous, yet |
| hidden performance and memory issues. As examples of poorly designed and/or |
| implemented queues, the approaches suggested |
| [here](https://stackoverflow.com/a/26863706) and |
| [here](https://stackoverflow.com/a/11757161) (among many others), requires |
| linear copy of the internal slice for resizing purposes. Some implementations |
| also has memory issues such as an ever expanding internal slice and memory |
| leaks. |
| |
| ## Proposal |
| I propose to add a new package, "container/queue", to the standard library to |
| support in-memory unbounded queues. The [proposed queue |
| implementation](https://github.com/christianrpetrin/queue-tests/blob/master/queueimpl7/queueimpl7.go) |
| offers [excellent performance and very low memory |
| consumption](https://github.com/christianrpetrin/queue-tests/blob/master/bench_queue.md) |
| when comparing it to three promising open source implementations |
| ([gammazero](https://github.com/gammazero/deque), |
| [phf](https://github.com/phf/go-queue) and |
| [juju](https://github.com/juju/utils/tree/master/deque)); to use Go channels as |
| queue; the standard list package as a queue as well as six other experimental |
| queue implementations. |
| |
| The [proposed queue |
| implementation](https://github.com/christianrpetrin/queue-tests/blob/master/queueimpl7/queueimpl7.go) |
| offers the most balanced approach to performance given different loads, being |
| significantly faster and still uses less memory than every other queue |
| implementation in the |
| [tests](https://github.com/christianrpetrin/queue-tests/blob/master/benchmark_test.go). |
| |
| The closest data structure Go has to offer for building dynamically growing |
| queues for large data sets is the [standard list |
| package](https://github.com/golang/go/tree/master/src/container/list). When |
| comparing the proposed solution to [using the list package as an unbounded |
| queue](https://github.com/christianrpetrin/queue-tests/blob/master/benchmark_test.go) |
| (refer to "BenchmarkList"), the proposed solution is consistently faster than |
| using the list package as a queue as well as displaying a much lower memory |
| footprint. |
| |
| |
| ### Reasoning |
| There's [two well accepted |
| approaches](https://en.wikipedia.org/wiki/Queue_(abstract_data_type)#Queue_implementation) |
| to implementing queues when in comes to the queue underlying data structure: |
| |
| 1) Using linked list |
| 2) Using array |
| |
| Linked list as the underlying data structure for an unbounded queue has the |
| advantage of scaling efficiently when the underlying data structure needs to |
| grow to accommodate more values. This is due to the fact that the existing |
| elements doesn't need to be repositioned or copied around when the queue needs |
| to grow. |
| |
| However, there's a few concerns with this approach: |
| 1) The use of prev/next pointers for each value requires a good deal of extra |
| memory |
| 2) Due to the fact that each "node" in the linked list can be allocated far away |
| from the previous one, navigating through the list can be slow due to its bad |
| [memory |
| locality](https://www.cs.cornell.edu/courses/cs3110/2012sp/lectures/lec25-locality/lec25.html) |
| properties |
| 3) Adding new values always require new memory allocations and pointers being |
| set, hindering performance |
| |
| On the other hand, using a slice as the underlying data structure for unbounded |
| queues has the advantage of very good [memory |
| locality](https://www.cs.cornell.edu/courses/cs3110/2012sp/lectures/lec25-locality/lec25.html), |
| making retrieval of values faster when comparing to linked lists. Also an "alloc |
| more than needed right now" approach can easily be implemented with slices. |
| |
| However, when the slice needs to expand to accommodate new values, a [well |
| adopted |
| strategy](https://en.wikipedia.org/wiki/Dynamic_array#Geometric_expansion_and_amortized_cost) |
| is to allocate a new, larger slice, copy over all elements from the previous |
| slice into the new one and use the new one to add the new elements. |
| |
| The problem with this approach is the obvious need to copy all the values from |
| the older, small slice, into the new one, yielding a poor performance when the |
| amount of values that need copying are fairly large. |
| |
| Another potential problem is a theoretical lower limit on how much data they can |
| hold as slices, like arrays, have to allocate its specified positions in |
| sequential memory addresses, so the maximum number of items the queue would ever |
| be able to hold is the maximum size a slice can be allocated on that particular |
| system at any given moment. Due to modern memory management techniques such as |
| [virtual memory](https://en.wikipedia.org/wiki/Virtual_memory) and |
| [paging](https://en.wikipedia.org/wiki/Paging), this is a very hard scenario to |
| corroborate thru practical testing. |
| |
| Nonetheless, this approach doesn't scale well with large data sets. |
| |
| Having said that, there's a third, newer approach to implementing unbounded |
| queues: use fixed size linked slices as the underlying data structure. |
| |
| The fixed size linked slices approach is a hybrid between the first two, |
| providing good memory locality arrays have alongside the efficient growing |
| mechanism linked lists offer. It is also not limited on the maximum size a slice |
| can be allocated, being able to hold and deal efficiently with a theoretical |
| much larger amount of data than pure slice based implementations. |
| |
| |
| ## Rationale |
| ### Research |
| [A first implementation](https://github.com/cloud-spin/queue) of the new design |
| was built. |
| |
| The benchmark tests showed the new design was very promising, so I decided to |
| research about other possible queue designs and implementations with the goal to |
| improve the first design and implementation. |
| |
| As part of the research to identify the best possible queue designs and |
| implementations, I implemented and probed below queue implementations. |
| |
| - [queueimpl1](https://github.com/christianrpetrin/queue-tests/tree/master/queueimpl1/queueimpl1.go): |
| custom queue implementation that stores the values in a simple slice. Pop |
| removes the first slice element. This is a slice based implementation that |
| tests [this](https://stackoverflow.com/a/26863706) suggestion. |
| - [queueimpl2](https://github.com/christianrpetrin/queue-tests/tree/master/queueimpl2/queueimpl2.go): |
| custom queue implementation that stores the values in a simple slice. Pop |
| moves the current position to next one instead of removing the first element. |
| This is a slice based implementation similarly to queueimpl1, but differs in |
| the fact that it uses pointers to point to the current first element in the |
| queue instead of removing the first element. |
| - [queueimpl3](https://github.com/christianrpetrin/queue-tests/tree/master/queueimpl3/queueimpl3.go): |
| custom queue implementation that stores the values in linked slices. This |
| implementation tests the queue performance when controlling the length and |
| current positions in the slices using the builtin len and append functions. |
| - [queueimpl4](https://github.com/christianrpetrin/queue-tests/tree/master/queueimpl4/queueimpl4.go): |
| custom queue implementation that stores the values in linked arrays. This |
| implementation tests the queue performance when controlling the length and |
| current positions in the arrays using simple local variables instead of the |
| built in len and append functions (i.e. it uses arrays instead of slices). |
| - [queueimpl5](https://github.com/christianrpetrin/queue-tests/tree/master/queueimpl5/queueimpl5.go): |
| custom queue implementation that stores the values in linked slices. This |
| implementation tests the queue performance when storing the "next" pointer as |
| part of the values slice instead of having it as a separate "next" field. The |
| next element is stored in the last position of the internal slices, which is a |
| reserved position. |
| - [queueimpl6](https://github.com/christianrpetrin/queue-tests/tree/master/queueimpl6/queueimpl6.go): |
| custom queue implementation that stores the values in linked slices. This |
| implementation tests the queue performance when performing lazy creation of |
| the first slice as well as starting with an slice of size 1 and doubling its |
| size up to 128, everytime a new linked slice needs to be created. |
| - [queueimpl7](https://github.com/christianrpetrin/queue-tests/tree/master/queueimpl7/queueimpl7.go): |
| custom queue implementation that stores the values in linked slices. This |
| implementation tests the queue performance when performing lazy creation of |
| the internal slice as well as starting with a 1-sized slice, allowing it to |
| grow up to 16 by using the built in append function. Subsequent slices are |
| created with 128 fixed size. |
| |
| Also as part of the research, I investigated and probed below open source queue |
| implementations as well. |
| - [phf](https://github.com/phf/go-queue): this is a slice, ring based queue |
| implementation. Interesting to note the author did a pretty good job |
| researching and probing other queue implementations as well. |
| - [gammazero](https://github.com/gammazero/deque): the deque implemented in this |
| package is also a slice, ring based queue implementation. |
| - [juju](https://github.com/juju/utils/tree/master/deque): the deque implemented |
| in this package uses a linked list based approach, similarly to other |
| experimental implementations in this package such as |
| [queueimpl3](https://github.com/christianrpetrin/queue-tests/tree/master/queueimpl3/queueimpl3.go). |
| The biggest difference between this implementation and the other experimental |
| ones is the fact that this queue uses the standard list package as the linked |
| list. The standard list package implements a doubly linked list, while the |
| experimental implementations implements their own singly linked list. |
| |
| The [standard list |
| package](https://github.com/golang/go/blob/master/src/container/list/list.go) as |
| well as buffered channels were probed as well. |
| |
| |
| ### Benchmark Results |
| |
| Initialization time only<br/> Performance<br/> |
|  <br/> |
| |
| Memory<br/> |
|  |
| |
| Add and remove 1k items<br/> Performance<br/> |
|  |
| |
| Memory<br/> |
|  <br/> |
| |
| Add and remove 100k items<br/> Performance<br/> |
|  |
| |
| Memory<br/> |
|  <br/> |
| |
| Aggregated Results<br/> Performance<br/> |
|  |
| |
| Memory<br/> |
|  <br/> |
| |
| Detailed, curated results can be found |
| [here](https://docs.google.com/spreadsheets/d/e/2PACX-1vRnCm7v51Eo5nq66NsGi8aQI6gL14XYJWqaeRJ78ZIWq1pRCtEZfsLD2FcI-gIpUhhTPnkzqDte_SDB/pubhtml?gid=668319604&single=true) |
| |
| Aggregated, curated results can be found |
| [here](https://docs.google.com/spreadsheets/d/e/2PACX-1vRnCm7v51Eo5nq66NsGi8aQI6gL14XYJWqaeRJ78ZIWq1pRCtEZfsLD2FcI-gIpUhhTPnkzqDte_SDB/pubhtml?gid=582031751&single=true) |
| <br/> |
| |
| Given above results, |
| [queueimpl7](https://github.com/christianrpetrin/queue-tests/tree/master/queueimpl7/queueimpl7.go), |
| henceforth just "impl7", proved to be the most balanced implementation, being |
| either faster or very competitive in all test scenarios from a performance and |
| memory perspective. |
| |
| Refer [here](https://github.com/christianrpetrin/queue-tests) for more details |
| about the tests. |
| |
| The benchmark tests can be found |
| [here](https://github.com/christianrpetrin/queue-tests/blob/master/benchmark_test.go). |
| |
| |
| #### Impl7 Design and Implementation |
| [Impl7](https://github.com/christianrpetrin/queue-tests/tree/master/queueimpl7/queueimpl7.go) |
| was the result of the observation that some slice based implementations such as |
| [queueimpl1](https://github.com/christianrpetrin/queue-tests/tree/master/queueimpl1/queueimpl1.go) |
| and |
| [queueimpl2](https://github.com/christianrpetrin/queue-tests/tree/master/queueimpl2/queueimpl2.go) |
| offers phenomenal performance when the queue is used with small data sets. |
| |
| For instance, comparing |
| [queueimpl3](https://github.com/christianrpetrin/queue-tests/tree/master/queueimpl3/queueimpl3.go) |
| (very simple linked slice implementation) with |
| [queueimpl1](https://github.com/christianrpetrin/queue-tests/tree/master/queueimpl1/queueimpl1.go) |
| (very simple slice based implementation), the results at adding 0 (init time |
| only), 1 and 10 items are very favorable for impl1, from a performance and |
| memory perspective. |
| |
| ``` |
| benchstat rawresults/bench-impl1.txt rawresults/bench-impl3.txt |
| name old time/op new time/op delta |
| /0-4 6.83ns ± 3% 472.53ns ± 7% +6821.99% (p=0.000 n=20+17) |
| /1-4 48.1ns ± 6% 492.4ns ± 5% +924.66% (p=0.000 n=20+20) |
| /10-4 532ns ± 5% 695ns ± 8% +30.57% (p=0.000 n=20+20) |
| /100-4 3.19µs ± 2% 2.50µs ± 4% -21.69% (p=0.000 n=18+19) |
| /1000-4 24.5µs ± 3% 23.6µs ± 2% -3.33% (p=0.000 n=19+19) |
| /10000-4 322µs ± 4% 238µs ± 1% -26.02% (p=0.000 n=19+18) |
| /100000-4 15.8ms ±10% 3.3ms ±13% -79.32% (p=0.000 n=20+20) |
| |
| name old alloc/op new alloc/op delta |
| /0-4 0.00B 2080.00B ± 0% +Inf% (p=0.000 n=20+20) |
| /1-4 16.0B ± 0% 2080.0B ± 0% +12900.00% (p=0.000 n=20+20) |
| /10-4 568B ± 0% 2152B ± 0% +278.87% (p=0.000 n=20+20) |
| /100-4 4.36kB ± 0% 2.87kB ± 0% -34.13% (p=0.000 n=20+20) |
| /1000-4 40.7kB ± 0% 24.6kB ± 0% -39.54% (p=0.000 n=20+20) |
| /10000-4 746kB ± 0% 244kB ± 0% -67.27% (p=0.000 n=20+20) |
| /100000-4 10.0MB ± 0% 2.4MB ± 0% -75.85% (p=0.000 n=15+20) |
| |
| name old allocs/op new allocs/op delta |
| /0-4 0.00 2.00 ± 0% +Inf% (p=0.000 n=20+20) |
| /1-4 1.00 ± 0% 2.00 ± 0% +100.00% (p=0.000 n=20+20) |
| /10-4 14.0 ± 0% 11.0 ± 0% -21.43% (p=0.000 n=20+20) |
| /100-4 108 ± 0% 101 ± 0% -6.48% (p=0.000 n=20+20) |
| /1000-4 1.01k ± 0% 1.01k ± 0% +0.50% (p=0.000 n=20+20) |
| /10000-4 10.0k ± 0% 10.2k ± 0% +1.35% (p=0.000 n=20+20) |
| /100000-4 100k ± 0% 102k ± 0% +1.53% (p=0.000 n=20+20) |
| ``` |
| |
| Impl7 is a hybrid experiment between using a simple slice based queue |
| implementation for small data sets and the fixed size linked slice approach for |
| large data sets, which is an approach that scales really well, offering really |
| good performance for small and large data sets. |
| |
| The implementation starts by lazily creating the first slice to hold the first |
| values added to the queue. |
| |
| ```go |
| const ( |
| // firstSliceSize holds the size of the first slice. |
| firstSliceSize = 1 |
| |
| // maxFirstSliceSize holds the maximum size of the first slice. |
| maxFirstSliceSize = 16 |
| |
| // maxInternalSliceSize holds the maximum size of each internal slice. |
| maxInternalSliceSize = 128 |
| ) |
| |
| ... |
| |
| // Push adds a value to the queue. |
| // The complexity is amortized O(1). |
| func (q *Queueimpl7) Push(v interface{}) { |
| if q.head == nil { |
| h := newNode(firstSliceSize) // Returns a 1-sized slice. |
| q.head = h |
| q.tail = h |
| q.lastSliceSize = maxFirstSliceSize |
| } else if len(q.tail.v) >= q.lastSliceSize { |
| n := newNode(maxInternalSliceSize) // Returns a 128-sized slice. |
| q.tail.n = n |
| q.tail = n |
| q.lastSliceSize = maxInternalSliceSize |
| } |
| |
| q.tail.v = append(q.tail.v, v) |
| q.len++ |
| } |
| |
| ... |
| |
| // newNode returns an initialized node. |
| func newNode(capacity int) *Node { |
| return &Node{ |
| v: make([]interface{}, 0, capacity), |
| } |
| } |
| ``` |
| |
| The very first created slice is created with capacity 1. The implementation |
| allows the builtin append function to dynamically resize the slice up to 16 |
| (maxFirstSliceSize) positions. After that it reverts to creating fixed size 128 |
| position slices, which offers the best performance for data sets above 16 items. |
| |
| 16 items was chosen as this seems to provide the best balanced performance for |
| small and large data sets according to the [array size benchmark |
| tests](https://github.com/christianrpetrin/queue-tests/blob/master/bench_slice_size.md). |
| Above 16 items, growing the slice means allocating a new, larger one and copying |
| all 16 elements from the previous slice into the new one. The append function |
| phenomenal performance can only compensate for the added copying of elements if |
| the data set is very small, no more than 8 items in the benchmark tests. For |
| above 8 items, the fixed size slice approach is consistently faster and uses |
| less memory, where 128 sized slices are allocated and linked together when the |
| data structure needs to scale to accommodate new values. |
| |
| Why 16? Why not 15 or 14? |
| |
| The builtin append function, as of "go1.11 darwin/amd64", seems to double the |
| slice size every time it needs to allocate a new one. |
| |
| ```go |
| ts := make([]int, 0, 1) |
| |
| ts = append(ts, 1) |
| fmt.Println(cap(ts)) // Slice has 1 item; output: 1 |
| |
| ts = append(ts, 1) |
| fmt.Println(cap(ts)) // Slice has 2 items; output: 2 |
| |
| ts = append(ts, 1) |
| fmt.Println(cap(ts)) // Slice has 3 items; output: 4 |
| |
| ts = append(ts, 1) |
| ts = append(ts, 1) |
| fmt.Println(cap(ts)) // Slice has 5 items; output: 8 |
| |
| ts = append(ts, 1) |
| ts = append(ts, 1) |
| ts = append(ts, 1) |
| ts = append(ts, 1) |
| fmt.Println(cap(ts)) // Slice has 9 items; output: 16 |
| ``` |
| |
| Since the append function will resize the slice from 8 to 16 positions, it makes |
| sense to use all 16 already allocated positions before switching to the fixed |
| size slices approach. |
| |
| #### Design Considerations |
| [Impl7](https://github.com/christianrpetrin/queue-tests/tree/master/queueimpl7/queueimpl7.go) |
| uses linked slices as its underlying data structure. |
| |
| The reasonale for the choice comes from two main obervations of slice based |
| queues: |
| 1) When the queue needs to expand to accomodate new values, a new, larger slice |
| needs to be allocated and used |
| 2) Allocating and managing large slices is expensive, especially in an |
| overloaded system with little avaialable physical memory |
| |
| To help clarify the scenario, below is what happens when a slice based queue |
| that already holds, say 1bi items, needs to expand to accommodate a new item. |
| |
| Slice based implementation |
| - Allocate a new, [twice the |
| size](https://en.wikipedia.org/wiki/Dynamic_array#Geometric_expansion_and_amortized_cost) |
| of the previous allocated one, say 2 billion positions slice |
| - Copy over all 1 billion items from the previous slice into the new one |
| - Add the new value into the first unused position in the new array, position |
| 1000000001. |
| |
| The same scenario for impl7 plays out like below. |
| |
| Impl7 |
| - Allocate a new 128 size slice |
| - Set the next pointer |
| - Add the value into the first position of the new array, position 0 |
| |
| Impl7 never copies data around, but slice based ones do, and if the data set is |
| large, it doesn't matter how fast the copying algorithm is. The copying has to |
| be done and will take some time. |
| |
| The decision to use linked slices was also the result of the observation that |
| slices goes to great length to provide predictive, indexed positions. A hash |
| table, for instance, absolutely need this property, but not a queue. So impl7 |
| completely gives up this property and focus on what really matters: add to end, |
| retrieve from head. No copying around and repositioning of elements is needed |
| for that. So when a slice goes to great length to provide that functionality, |
| the whole work of allocating new arrays, copying data around is all wasted work. |
| None of that is necessary. And this work costs dearly for large data sets as |
| observed in the |
| [tests](https://github.com/christianrpetrin/queue-tests/blob/master/bench_queue.md). |
| |
| |
| #### Impl7 Benchmark Results |
| Below compares impl7 with a few selected implementations. |
| |
| The tests name are formatted given below. |
| - Benchmark/N-4: benchmark a queue implementation where N denotes the number of |
| items added and removed to/from the queue; 4 means the number of CPU cores in |
| the host machine. |
| |
| Examples: |
| - Benchmark/0-4: benchmark the queue by creating a new instance of it. This only |
| test initialization time. |
| - Benchmark/100-4: benchmark the queue by creating a new instance of it and |
| adding and removing 100 items to/from the queue. |
| |
| --- |
| |
| Standard list used as a FIFO queue vs impl7. |
| ``` |
| benchstat rawresults/bench-list.txt rawresults/bench-impl7.txt |
| name old time/op new time/op delta |
| /0-4 34.9ns ± 1% 1.2ns ± 3% -96.64% (p=0.000 n=19+20) |
| /1-4 77.0ns ± 1% 68.3ns ± 1% -11.21% (p=0.000 n=20+20) |
| /10-4 574ns ± 0% 578ns ± 0% +0.59% (p=0.000 n=18+20) |
| /100-4 5.94µs ± 1% 3.07µs ± 0% -48.28% (p=0.000 n=19+18) |
| /1000-4 56.0µs ± 1% 25.8µs ± 1% -53.92% (p=0.000 n=20+20) |
| /10000-4 618µs ± 1% 260µs ± 1% -57.99% (p=0.000 n=20+18) |
| /100000-4 13.1ms ± 6% 3.1ms ± 3% -76.50% (p=0.000 n=20+20) |
| |
| name old alloc/op new alloc/op delta |
| /0-4 48.0B ± 0% 0.0B -100.00% (p=0.000 n=20+20) |
| /1-4 96.0B ± 0% 48.0B ± 0% -50.00% (p=0.000 n=20+20) |
| /10-4 600B ± 0% 600B ± 0% ~ (all equal) |
| /100-4 5.64kB ± 0% 3.40kB ± 0% -39.72% (p=0.000 n=20+20) |
| /1000-4 56.0kB ± 0% 25.2kB ± 0% -55.10% (p=0.000 n=20+20) |
| /10000-4 560kB ± 0% 243kB ± 0% -56.65% (p=0.000 n=20+20) |
| /100000-4 5.60MB ± 0% 2.43MB ± 0% -56.66% (p=0.000 n=18+20) |
| |
| name old allocs/op new allocs/op delta |
| /0-4 1.00 ± 0% 0.00 -100.00% (p=0.000 n=20+20) |
| /1-4 2.00 ± 0% 2.00 ± 0% ~ (all equal) |
| /10-4 20.0 ± 0% 15.0 ± 0% -25.00% (p=0.000 n=20+20) |
| /100-4 200 ± 0% 107 ± 0% -46.50% (p=0.000 n=20+20) |
| /1000-4 2.00k ± 0% 1.02k ± 0% -48.95% (p=0.000 n=20+20) |
| /10000-4 20.0k ± 0% 10.2k ± 0% -49.20% (p=0.000 n=20+20) |
| /100000-4 200k ± 0% 102k ± 0% -49.22% (p=0.000 n=20+20) |
| ``` |
| Impl7 is: |
| - Up to ~29x faster (1.2ns vs 34.9ns) than list package for init time (0 items) |
| - Up to ~4x faster (3.1ms vs 13.1ms) than list package for 100k items |
| - Uses ~1/2 memory (2.43MB vs 5.60MB) than list package for 100k items |
| |
| --- |
| |
| [impl1](https://github.com/christianrpetrin/queue-tests/tree/master/queueimpl1/queueimpl1.go) |
| (simple slice based queue implementaion) vs impl7. |
| ``` |
| benchstat rawresults/bench-impl1.txt rawresults/bench-impl7.txt |
| name old time/op new time/op delta |
| /0-4 6.83ns ± 3% 1.18ns ± 3% -82.79% (p=0.000 n=20+20) |
| /1-4 48.1ns ± 6% 68.3ns ± 1% +42.23% (p=0.000 n=20+20) |
| /10-4 532ns ± 5% 578ns ± 0% +8.55% (p=0.000 n=20+20) |
| /100-4 3.19µs ± 2% 3.07µs ± 0% -3.74% (p=0.000 n=18+18) |
| /1000-4 24.5µs ± 3% 25.8µs ± 1% +5.51% (p=0.000 n=19+20) |
| /10000-4 322µs ± 4% 260µs ± 1% -19.23% (p=0.000 n=19+18) |
| /100000-4 15.8ms ±10% 3.1ms ± 3% -80.60% (p=0.000 n=20+20) |
| |
| name old alloc/op new alloc/op delta |
| /0-4 0.00B 0.00B ~ (all equal) |
| /1-4 16.0B ± 0% 48.0B ± 0% +200.00% (p=0.000 n=20+20) |
| /10-4 568B ± 0% 600B ± 0% +5.63% (p=0.000 n=20+20) |
| /100-4 4.36kB ± 0% 3.40kB ± 0% -22.02% (p=0.000 n=20+20) |
| /1000-4 40.7kB ± 0% 25.2kB ± 0% -38.25% (p=0.000 n=20+20) |
| /10000-4 746kB ± 0% 243kB ± 0% -67.47% (p=0.000 n=20+20) |
| /100000-4 10.0MB ± 0% 2.4MB ± 0% -75.84% (p=0.000 n=15+20) |
| |
| name old allocs/op new allocs/op delta |
| /0-4 0.00 0.00 ~ (all equal) |
| /1-4 1.00 ± 0% 2.00 ± 0% +100.00% (p=0.000 n=20+20) |
| /10-4 14.0 ± 0% 15.0 ± 0% +7.14% (p=0.000 n=20+20) |
| /100-4 108 ± 0% 107 ± 0% -0.93% (p=0.000 n=20+20) |
| /1000-4 1.01k ± 0% 1.02k ± 0% +1.09% (p=0.000 n=20+20) |
| /10000-4 10.0k ± 0% 10.2k ± 0% +1.39% (p=0.000 n=20+20) |
| /100000-4 100k ± 0% 102k ± 0% +1.54% (p=0.000 n=20+20) |
| ``` |
| Impl7 is: |
| - Up to ~5x faster (1.18ns vs 6.83ns) than impl1 for init time (0 items) |
| - Up to ~5x faster (3.1ms vs 15.8ms) than impl1 for 100k items |
| - Uses ~1/4 memory (2.4MB vs 10MB) than impl1 for 100k items |
| |
| It's important to note that the performance and memory gains for impl7 is |
| exponential like the larger the data set is due to the fact slice based |
| implementations doesn't scale well, [paying a higher and higher |
| price](https://en.wikipedia.org/wiki/Dynamic_array#Geometric_expansion_and_amortized_cost), |
| performance and memory wise, every time it needs to scale to accommodate an ever |
| expanding data set. |
| |
| --- |
| |
| [phf](https://github.com/phf/go-queue) (slice, ring based FIFO queue |
| implementation) vs impl7. |
| ``` |
| benchstat rawresults/bench-phf.txt rawresults/bench-impl7.txt |
| name old time/op new time/op delta |
| /0-4 28.1ns ± 1% 1.2ns ± 3% -95.83% (p=0.000 n=20+20) |
| /1-4 42.5ns ± 1% 68.3ns ± 1% +60.80% (p=0.000 n=20+20) |
| /10-4 681ns ± 1% 578ns ± 0% -15.11% (p=0.000 n=18+20) |
| /100-4 4.55µs ± 1% 3.07µs ± 0% -32.45% (p=0.000 n=19+18) |
| /1000-4 35.5µs ± 1% 25.8µs ± 1% -27.32% (p=0.000 n=18+20) |
| /10000-4 349µs ± 2% 260µs ± 1% -25.67% (p=0.000 n=20+18) |
| /100000-4 11.7ms ±11% 3.1ms ± 3% -73.77% (p=0.000 n=20+20) |
| |
| name old alloc/op new alloc/op delta |
| /0-4 16.0B ± 0% 0.0B -100.00% (p=0.000 n=20+20) |
| /1-4 16.0B ± 0% 48.0B ± 0% +200.00% (p=0.000 n=20+20) |
| /10-4 696B ± 0% 600B ± 0% -13.79% (p=0.000 n=20+20) |
| /100-4 6.79kB ± 0% 3.40kB ± 0% -49.94% (p=0.000 n=20+20) |
| /1000-4 57.0kB ± 0% 25.2kB ± 0% -55.86% (p=0.000 n=20+20) |
| /10000-4 473kB ± 0% 243kB ± 0% -48.68% (p=0.000 n=20+20) |
| /100000-4 7.09MB ± 0% 2.43MB ± 0% -65.77% (p=0.000 n=18+20) |
| |
| name old allocs/op new allocs/op delta |
| /0-4 1.00 ± 0% 0.00 -100.00% (p=0.000 n=20+20) |
| /1-4 1.00 ± 0% 2.00 ± 0% +100.00% (p=0.000 n=20+20) |
| /10-4 15.0 ± 0% 15.0 ± 0% ~ (all equal) |
| /100-4 111 ± 0% 107 ± 0% -3.60% (p=0.000 n=20+20) |
| /1000-4 1.02k ± 0% 1.02k ± 0% +0.39% (p=0.000 n=20+20) |
| /10000-4 10.0k ± 0% 10.2k ± 0% +1.38% (p=0.000 n=20+20) |
| /100000-4 100k ± 0% 102k ± 0% +1.54% (p=0.000 n=20+20) |
| ``` |
| Impl7 is: |
| - Up to ~23x faster (1.2ns vs 28.1ns) than phf for init time (0 items) |
| - Up to ~3x faster (3.1ms vs 11.7ms) than phf for 100k items |
| - Uses ~1/2 memory (2.43MB vs 7.09MB) than phf for 100k items |
| |
| --- |
| |
| Buffered channel vs impl7. |
| ``` |
| benchstat rawresults/bench-channel.txt rawresults/bench-impl7.txt |
| name old time/op new time/op delta |
| /0-4 30.2ns ± 1% 1.2ns ± 3% -96.12% (p=0.000 n=19+20) |
| /1-4 87.6ns ± 1% 68.3ns ± 1% -22.00% (p=0.000 n=19+20) |
| /10-4 704ns ± 1% 578ns ± 0% -17.90% (p=0.000 n=20+20) |
| /100-4 6.78µs ± 1% 3.07µs ± 0% -54.70% (p=0.000 n=20+18) |
| /1000-4 67.3µs ± 1% 25.8µs ± 1% -61.65% (p=0.000 n=20+20) |
| /10000-4 672µs ± 1% 260µs ± 1% -61.36% (p=0.000 n=19+18) |
| /100000-4 6.76ms ± 1% 3.07ms ± 3% -54.61% (p=0.000 n=19+20) |
| |
| name old alloc/op new alloc/op delta |
| /0-4 96.0B ± 0% 0.0B -100.00% (p=0.000 n=20+20) |
| /1-4 112B ± 0% 48B ± 0% -57.14% (p=0.000 n=20+20) |
| /10-4 248B ± 0% 600B ± 0% +141.94% (p=0.000 n=20+20) |
| /100-4 1.69kB ± 0% 3.40kB ± 0% +101.42% (p=0.000 n=20+20) |
| /1000-4 16.2kB ± 0% 25.2kB ± 0% +55.46% (p=0.000 n=20+20) |
| /10000-4 162kB ± 0% 243kB ± 0% +49.93% (p=0.000 n=20+20) |
| /100000-4 1.60MB ± 0% 2.43MB ± 0% +51.43% (p=0.000 n=16+20) |
| |
| name old allocs/op new allocs/op delta |
| /0-4 1.00 ± 0% 0.00 -100.00% (p=0.000 n=20+20) |
| /1-4 1.00 ± 0% 2.00 ± 0% +100.00% (p=0.000 n=20+20) |
| /10-4 10.0 ± 0% 15.0 ± 0% +50.00% (p=0.000 n=20+20) |
| /100-4 100 ± 0% 107 ± 0% +7.00% (p=0.000 n=20+20) |
| /1000-4 1.00k ± 0% 1.02k ± 0% +2.10% (p=0.000 n=20+20) |
| /10000-4 10.0k ± 0% 10.2k ± 0% +1.61% (p=0.000 n=20+20) |
| /100000-4 100k ± 0% 102k ± 0% +1.57% (p=0.000 n=20+20) |
| ``` |
| Impl7 is: |
| - Up to ~25x faster (1.2ns vs 30.2ns) than channels for init time (0 items) |
| - Up to ~2x faster (3.07ms vs 6.76ms) than channels for 100k items |
| - Uses ~50% MORE memory (2.43MB vs 1.60MB) than channels for 100k items |
| |
| Above is not really a fair comparison as standard buffered channels doesn't |
| scale (at all) and they are meant for routine synchronization. Nonetheless, they |
| can and make for an excellent bounded FIFO queue option. Still, impl7 is |
| consistently faster than channels across the board, but uses considerably more |
| memory than channels. |
| |
| --- |
| |
| Given its excellent performance under all scenarios, the hybrid approach impl7 |
| seems to be the ideal candidate for a high performance, low memory footprint |
| general purpose FIFO queue. |
| |
| For above reasons, I propose to port impl7 to the standard library. |
| |
| All raw benchmark results can be found |
| [here](https://github.com/christianrpetrin/queue-tests/tree/master/rawresults). |
| |
| |
| ### Internal Slice Size |
| [Impl7](https://github.com/christianrpetrin/queue-tests/tree/master/queueimpl7/queueimpl7.go) |
| uses linked slices as its underlying data structure. |
| |
| The size of the internal slice does influence performance and memory consumption |
| significantly. |
| |
| According to the [internal slice size bench |
| tests](https://github.com/christianrpetrin/queue-tests/blob/master/queueimpl7/benchmark_test.go), |
| larger internal slice sizes yields better performance and lower memory |
| footprint. However, the gains diminishes dramatically as the slice size |
| increases. |
| |
| Below are a few interesting results from the benchmark tests. |
| |
| ``` |
| BenchmarkMaxSubsequentSliceSize/1-4 20000 76836 ns/op 53967 B/op 2752 allocs/op |
| BenchmarkMaxSubsequentSliceSize/2-4 30000 59811 ns/op 40015 B/op 1880 allocs/op |
| BenchmarkMaxSubsequentSliceSize/4-4 30000 42925 ns/op 33039 B/op 1444 allocs/op |
| BenchmarkMaxSubsequentSliceSize/8-4 50000 36946 ns/op 29551 B/op 1226 allocs/op |
| BenchmarkMaxSubsequentSliceSize/16-4 50000 30597 ns/op 27951 B/op 1118 allocs/op |
| BenchmarkMaxSubsequentSliceSize/32-4 50000 28273 ns/op 27343 B/op 1064 allocs/op |
| BenchmarkMaxSubsequentSliceSize/64-4 50000 26969 ns/op 26895 B/op 1036 allocs/op |
| BenchmarkMaxSubsequentSliceSize/128-4 50000 27316 ns/op 26671 B/op 1022 allocs/op |
| BenchmarkMaxSubsequentSliceSize/256-4 50000 26221 ns/op 28623 B/op 1016 allocs/op |
| BenchmarkMaxSubsequentSliceSize/512-4 50000 25882 ns/op 28559 B/op 1012 allocs/op |
| BenchmarkMaxSubsequentSliceSize/1024-4 50000 25674 ns/op 28527 B/op 1010 allocs/op |
| ``` |
| |
| Given the fact that larger internal slices also means potentially more unused |
| memory in some scenarios, 128 seems to be the perfect balance between |
| performance and worst case scenario for memory footprint. |
| |
| Full results can be found |
| [here](https://github.com/christianrpetrin/queue-tests/blob/master/bench_slice_size.md). |
| |
| |
| ### API |
| [Impl7](https://github.com/christianrpetrin/queue-tests/tree/master/queueimpl7/queueimpl7.go) |
| implements below API methods. |
| |
| | Operation | Method | |
| | --- | --- | |
| | Add | func (q *Queueimpl7) Push(v interface{}) | |
| | Remove | func (q *Queueimpl7) Pop() (interface{}, bool) | |
| | Size | func (q *Queueimpl7) Len() int | |
| | Return First | func (q *Queueimpl7) Front() (interface{}, bool) | |
| |
| As nil values are considered valid queue values, similarly to the map data |
| structure, "Front" and "Pop" returns a second bool parameter to indicate whether |
| the returned value is valid and whether the queue is empty or not. |
| |
| The resonale for above method names and signatures are the need to keep |
| compatibility with existing Go data structures such as the |
| [list](https://github.com/golang/go/blob/master/src/container/list/list.go), |
| [ring](https://github.com/golang/go/blob/master/src/container/ring/ring.go) and |
| [heap](https://github.com/golang/go/blob/master/src/container/heap/heap.go) |
| packages. |
| |
| Below are the method names used by the existing list, ring and heap Go data |
| structures, as well as the new proposed queue. |
| |
| | Operation | list | ring | heap | queue | |
| | --- | --- | --- | --- | --- | |
| | Add | PushFront/PushBack | Link | Push | Push | |
| | Remove | Remove | Unlink | Pop | Pop | |
| | Size | Len | Len | - | Len | |
| | Return First | Front | - | - | Front | |
| |
| For comparison purposes, below are the method names for |
| [C++](http://www.cplusplus.com/reference/queue/queue/), |
| [Java](https://docs.oracle.com/javase/7/docs/api/java/util/Queue.html) and |
| [C#](https://docs.microsoft.com/en-us/dotnet/api/system.collections.generic.queue-1?view=netframework-4.7.2) |
| for their queue implementation. |
| |
| | Operation | C++ | Java | C# | |
| | --- | --- | --- | --- | |
| | Add | push | add/offer | Enqueue | |
| | Remove | pop | remove/poll | Dequeue | |
| | Size | size | - | Count | |
| | Return First | front | peek | Peek | |
| |
| |
| ### Range Support |
| Just like the current container data strucutures such as |
| [list](https://github.com/golang/go/blob/master/src/container/list/list.go), |
| [ring](https://github.com/golang/go/blob/master/src/container/ring/ring.go) and |
| [heap](https://github.com/golang/go/tree/master/src/container/heap), Impl7 |
| doesn't support the range keyword for navigation. |
| |
| The API offers two ways to iterate over the queue items. |
| |
| Either use "Pop" to retrieve the first current element and the second bool |
| parameter to check for an empty queue. |
| |
| ```go |
| for v, ok := q.Pop(); ok; v, ok = q.Pop() { |
| // Do something with v |
| } |
| ``` |
| |
| Or use "Len" and "Pop" to check for an empty queue and retrieve the first |
| current element. |
| ```go |
| for q.Len() > 0 { |
| v, _ := q.Pop() |
| // Do something with v |
| } |
| ``` |
| |
| ### Data Type |
| Just like the current container data strucutures such as the |
| [list](https://github.com/golang/go/blob/master/src/container/list/list.go), |
| [ring](https://github.com/golang/go/blob/master/src/container/ring/ring.go) and |
| [heap](https://github.com/golang/go/tree/master/src/container/heap), Impl7 only |
| supported data type is "interface{}", making it usable by virtually any Go |
| types. |
| |
| It is possible to implement support for specialized data types such as int, |
| float, bool, etc, but that would require duplicating the Push/Pop methods to |
| accept the different data types, much like |
| strconv.ParseBool/ParseFloat/ParseInt/etc. However, with the impending release |
| of generics, we should probrably wait as generics would solve this problem |
| nicely. |
| |
| |
| |
| ### Safe for Concurrent Use |
| [Impl7](https://github.com/christianrpetrin/queue-tests/tree/master/queueimpl7/queueimpl7.go) |
| is not safe for concurrent use by default. The rationale for this decision is |
| below. |
| |
| 1) Not all users will need a safe for concurrent use queue implementation |
| 2) Executing routine synchronization is expensive, causing performance to drop |
| very significantly |
| 3) Getting impl7 to be safe for concurrent use is actually very simple |
| |
| Below is an example of a safe for concurrent use queue implementation that uses |
| impl7 as its underlying queue. |
| |
| ```go |
| package tests |
| |
| import ( |
| "fmt" |
| "sync" |
| "testing" |
| |
| "github.com/christianrpetrin/queue-tests/queueimpl7" |
| ) |
| |
| type SafeQueue struct { |
| q Queueimpl7 |
| m sync.Mutex |
| } |
| |
| func (s *SafeQueue) Len() int { |
| s.m.Lock() |
| defer s.m.Unlock() |
| |
| return s.q.Len() |
| } |
| |
| func (s *SafeQueue) Push(v interface{}) { |
| s.m.Lock() |
| defer s.m.Unlock() |
| |
| s.q.Push(v) |
| } |
| |
| func (s *SafeQueue) Pop() (interface{}, bool) { |
| s.m.Lock() |
| defer s.m.Unlock() |
| |
| return s.q.Pop() |
| } |
| |
| func (s *SafeQueue) Front() (interface{}, bool) { |
| s.m.Lock() |
| defer s.m.Unlock() |
| |
| return s.q.Front() |
| } |
| |
| func TestSafeQueue(t *testing.T) { |
| var q SafeQueue |
| |
| q.Push(1) |
| q.Push(2) |
| |
| for v, ok := q.Pop(); ok; v, ok = q.Pop() { |
| fmt.Println(v) |
| } |
| |
| // Output: |
| // 1 |
| // 2 |
| } |
| ``` |
| |
| |
| ### Drawbacks |
| The biggest drawback of the proposed implementation is the potentially extra |
| allocated but not used memory in its head and tail slices. |
| |
| This scenario realizes when exactly 17 items are added to the queue, causing the |
| creation of a full sized internal slice of 128 positions. Initially only the |
| first element in this new slice is used to store the added value. All the other |
| 127 elements are already allocated, but not used. |
| |
| ```go |
| // Assuming a 128 internal sized slice. |
| q := queueimpl7.New() |
| |
| // Push 16 items to fill the first dynamic slice (sized 16). |
| for i := 1; i <= 16; i++ { |
| q.Push(i) |
| } |
| // Push 1 extra item that causes the creation of a new 128 sized slice to store this value. |
| q.Push(17) |
| |
| // Pops the first 16 items to release the first slice (sized 16). |
| for i := 1; i <= 16; i++ { |
| q.Pop() |
| } |
| |
| // As unsafe.Sizeof (https://golang.org/pkg/unsafe/#Sizeof) doesn't consider the length of slices, |
| // we need to manually calculate the memory used by the internal slices. |
| var internalSliceType interface{} |
| fmt.Println(fmt.Sprintf("%d bytes", unsafe.Sizeof(q)+(unsafe.Sizeof(internalSliceType) /* bytes per slice position */ *127 /* head slice unused positions */))) |
| |
| // Output for a 64bit system (Intel(R) Core(TM) i5-7267U CPU @ 3.10GHz): 2040 bytes |
| ``` |
| |
| The worst case scenario realizes when exactly 145 items are added to the queue |
| and 143 items are removed. This causes the queue struct to hold a 128-sized |
| slice as its head slice, but only the last element is actually used. Similarly, |
| the queue struct will hold a separate 128-sized slice as its tail slice, but |
| only the first position in that slice is being used. |
| |
| ```go |
| // Assuming a 128 internal sized slice. |
| q := queueimpl7.New() |
| |
| // Push 16 items to fill the first dynamic slice (sized 16). |
| for i := 1; i <= 16; i++ { |
| q.Push(i) |
| } |
| |
| // Push an additional 128 items to fill the first full sized slice (sized 128). |
| for i := 1; i <= 128; i++ { |
| q.Push(i) |
| } |
| |
| // Push 1 extra item that causes the creation of a new 128 sized slice to store this value, |
| // adding a total of 145 items to the queue. |
| q.Push(1) |
| |
| // Pops the first 143 items to release the first dynamic slice (sized 16) and |
| // 127 items from the first full sized slice (sized 128). |
| for i := 1; i <= 143; i++ { |
| q.Pop() |
| } |
| |
| // As unsafe.Sizeof (https://golang.org/pkg/unsafe/#Sizeof) doesn't consider the length of slices, |
| // we need to manually calculate the memory used by the internal slices. |
| var internalSliceType interface{} |
| fmt.Println(fmt.Sprintf("%d bytes", unsafe.Sizeof(q)+(unsafe.Sizeof(internalSliceType) /* bytes per slice position */ *(127 /* head slice unused positions */ +127 /* tail slice unused positions */)))) |
| |
| // Output for a 64bit system (Intel(R) Core(TM) i5-7267U CPU @ 3.10GHz): 4072 bytes |
| ``` |
| |
| Above code was run on Go version "go1.11 darwin/amd64". |
| |
| |
| ## Open Questions/Issues |
| Should this be a deque (double-ended queue) implementation instead? The deque |
| could be used as a stack as well, but it would make more sense to have a queue |
| and stack implementations (like most mainstream languages have) instead of a |
| deque that can be used as a stack (confusing). Stack is a very important |
| computer science data structure as well and so I believe Go should have a |
| specialized implementation for it as well (given the specialized implementation |
| offers real value to the users and not just a "nice" named interface and |
| methods). |
| |
| Should "Pop" and "Front" return only the value instead of the value and a second |
| bool parameter (which indicates whether the queue is empty or not)? The |
| implication of the change is adding nil values wouldn't be valid anymore so |
| "Pop" and "Front" would return nil when the queue is empty. Panic should be |
| avoided in libraries. |
| |
| The memory footprint for a 128 sized internal slice causes, in the worst case |
| scenario, a 2040 bytes of memory allocated (on a 64bit system) but not used. |
| Switching to 64 means roughly half the memory would be used with a slight ~2.89% |
| performance drop (252813ns vs 260137ns). The extra memory footprint is not worth |
| the extra performance gain is a very good point to make. Should we change this |
| value to 64 or maybe make it configurable? |
| |
| Should we also provide a safe for concurrent use implementation? A specialized |
| implementation that would rely on atomic operations to update its internal |
| indices and length could offer a much better performance when comparing to a |
| similar implementation that relies on a mutex. |
| |
| With the impending release of generics, should we wait to release the new queue |
| package once the new generics framework is released? |
| |
| Should we implement support for the range keyword for the new queue? It could be |
| done in a generic way so other data structures could also benefit from this |
| feature. For now, IMO, this is a topic for another proposal/discussion. |
| |
| |
| |
| ## Summary |
| I propose to add a new package, "container/queue", to the standard library to |
| support an in-memory, unbounded, general purpose queue implementation. |
| |
| I feel strongly this proposal should be accepted due to below reasons. |
| |
| 1) The proposed solution was well researched and probed, being dramatically and |
| consistently faster than 6 other experimental queue implementations as well 3 |
| promising open source queue implementations as well the standard list package |
| and buffered channels; it still consumes considerably less memory than every |
| other queue implementation tested, except for buffered channels |
| 2) The proposed solution uses a new, unique approach to building queues, yet its |
| [implementation](https://github.com/christianrpetrin/queue-tests/blob/master/queueimpl7/queueimpl7.go) |
| is clean and extremely simple. Both main methods, "Push" and "Pop", are |
| composed of only 16 and 19 lines of code (total), respectively. The proposed |
| implementation also have proper tests with 100% test coverage and should |
| require minimal maintenance moving forward |
| 3) I'll implement any changes the Go community feel are needed for the proposed |
| solution to be worth of the standard library and the Go community |