The following examples are obtained from the publication “GoBench: A Benchmark Suite of Real-World Go Concurrency Bugs” (doi:10.1109/CGO51591.2021.9370317).
Authors: Ting Yuan (yuanting@ict.ac.cn): State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, University of Chinese Academy of Sciences, Beijing, China; Guangwei Li (liguangwei@ict.ac.cn): State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, University of Chinese Academy of Sciences, Beijing, China; Jie Lu (lujie@ict.ac.cn): State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences; Chen Liu (liuchen17z@ict.ac.cn): State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, University of Chinese Academy of Sciences, Beijing, China; Lian Li (lianli@ict.ac.cn): State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, University of Chinese Academy of Sciences, Beijing, China; Jingling Xue (jingling@cse.unsw.edu.au): University of New South Wales, School of Computer Science and Engineering, Sydney, Australia
White paper: https://lujie.ac.cn/files/papers/GoBench.pdf
The examples have been modified in order to run the goroutine leak profiler. Buggy snippets are moved from within a unit test to separate applications. Each is then independently executed, possibly as multiple copies within the same application in order to exercise more interleavings. Concurrently, the main program sets up a waiting period (typically 1ms), followed by a goroutine leak profile request. Other modifications may involve injecting calls to runtime.Gosched(), to more reliably exercise buggy interleavings, or reductions in waiting periods when calling time.Sleep, in order to reduce overall testing time.
The resulting goroutine leak profile is analyzed to ensure that no unexpected leaks occurred, and that the expected leaks did occur. If the leak is flaky, the only purpose of the expected leak list is to protect against unexpected leaks.
The examples have also been modified to remove data races, since data races cause flaky test failures, whereas the only concern here is leaked goroutines.
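For orientation, each example's harness has roughly the following shape. This is a minimal sketch, not the actual benchmark code: buggySnippet is a placeholder for one extracted bug body, and the standard pprof goroutine profile stands in for the goroutine leak profiler of the modified runtime.

```go
package main

import (
	"os"
	"runtime/pprof"
	"time"
)

// buggySnippet stands in for one extracted bug body: it leaks a goroutine
// blocked on a channel send that has no matching receive.
func buggySnippet() {
	ch := make(chan int)
	go func() {
		ch <- 42 // no receiver: this goroutine leaks
	}()
}

func main() {
	// Run several copies to exercise more interleavings.
	for i := 0; i < 4; i++ {
		buggySnippet()
	}
	// Give the snippets time to reach their blocking operations.
	time.Sleep(1 * time.Millisecond)
	// Request a goroutine profile; leaked goroutines show up in the dump.
	pprof.Lookup("goroutine").WriteTo(os.Stdout, 1)
}
```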
The entries below document each of the corresponding leaks.
| Bug ID | Ref | Patch | Type | Sub-type |
|---|---|---|---|---|
| cockroach#10214 | pull request | patch | Resource | AB-BA leak |
This goroutine leak is caused by acquiring coalescedMu.Lock() and raftMu.Lock() in different orders. The fix is to refactor sendQueuedHeartbeats() so that cockroachdb can unlock coalescedMu before locking raftMu.
G1 G2
------------------------------------------------------------------------------------
s.sendQueuedHeartbeats() .
s.coalescedMu.Lock() [L1] .
s.sendQueuedHeartbeatsToNode() .
s.mu.replicas[0].reportUnreachable() .
s.mu.replicas[0].raftMu.Lock() [L2] .
. s.mu.replicas[0].tick()
. s.mu.replicas[0].raftMu.Lock() [L2]
. s.mu.replicas[0].tickRaftMuLocked()
. s.mu.replicas[0].mu.Lock() [L3]
. s.mu.replicas[0].maybeQuiesceLocked()
. s.mu.replicas[0].maybeCoalesceHeartbeat()
. s.coalescedMu.Lock() [L1]
--------------------------------G1,G2 leak------------------------------------------
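A minimal self-contained sketch of this AB-BA pattern (not the CockroachDB code; identifiers and the runtime.Gosched() placement are illustrative):

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"time"
)

var coalescedMu, raftMu sync.Mutex

// G1: locks coalescedMu, then raftMu.
func sendQueuedHeartbeats() {
	coalescedMu.Lock()
	defer coalescedMu.Unlock()
	runtime.Gosched() // encourage the buggy interleaving
	raftMu.Lock()
	defer raftMu.Unlock()
}

// G2: locks raftMu, then coalescedMu.
func tick() {
	raftMu.Lock()
	defer raftMu.Unlock()
	runtime.Gosched()
	coalescedMu.Lock()
	defer coalescedMu.Unlock()
}

func main() {
	go sendQueuedHeartbeats()
	go tick()
	time.Sleep(10 * time.Millisecond)
	fmt.Println("goroutines still running:", runtime.NumGoroutine())
}
```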
| Bug ID | Ref | Patch | Type | Sub-type |
|---|---|---|---|---|
| cockroach#1055 | pull request | patch | Mixed | Channel & WaitGroup |
Stop() is called and blocks at s.stop.Wait() after acquiring the lock. StartTask() is called and attempts to acquire the lock; it is then blocked. Stop() never finishes, since the task doesn't call SetStopped.
G1                    G2.0                      G2.1                      G2.2                      G3
----------------------------------------------------------------------------------------------------------------------------
s[0].stop.Add(1) [1]  .                         .                         .                         .
go func() [G2.0]      .                         .                         .                         .
s[1].stop.Add(1) [1]  .                         .                         .                         .
go func() [G2.1]      .                         .                         .                         .
s[2].stop.Add(1) [1]  .                         .                         .                         .
go func() [G2.2]      .                         .                         .                         .
go func() [G3]        .                         .                         .                         .
<-done                .                         .                         .                         .
.                     s[0].StartTask()          .                         .                         .
.                     s[0].draining == 0        .                         .                         .
.                     .                         s[1].StartTask()          .                         .
.                     .                         s[1].draining == 0        .                         .
.                     .                         .                         s[2].StartTask()          .
.                     .                         .                         s[2].draining == 0        .
.                     .                         .                         .                         s[0].Quiesce()
.                     s[0].mu.Lock() [L1[0]]    .                         .                         .
.                     .                         .                         .                         s[0].mu.Lock() [L1[0]]
.                     s[0].drain.Add(1) [1]     .                         .                         .
.                     s[0].mu.Unlock() [L1[0]]  .                         .                         .
.                     <-s[0].ShouldStop()       .                         .                         .
.                     .                         .                         .                         s[0].draining = 1
.                     .                         .                         .                         s[0].drain.Wait()
.                     .                         s[1].mu.Lock() [L1[1]]    .                         .
.                     .                         s[1].drain.Add(1) [1]     .                         .
.                     .                         s[1].mu.Unlock() [L1[1]]  .                         .
.                     .                         <-s[1].ShouldStop()       .                         .
.                     .                         .                         s[2].mu.Lock() [L1[2]]    .
.                     .                         .                         s[2].drain.Add(1) [1]     .
.                     .                         .                         s[2].mu.Unlock() [L1[2]]  .
.                     .                         .                         <-s[2].ShouldStop()       .
----------------------------------------------------G1, G2.[0..2], G3 leak-----------------------------------------------------
| Bug ID | Ref | Patch | Type | Sub-type |
|---|---|---|---|---|
| cockroach#10790 | pull request | patch | Communication | Channel & Context |
It is possible that a message from ctxDone will make beginCmds return without draining the channel ch, so that anonymous function goroutines will leak.
G1 G2 helper goroutine
-----------------------------------------------------
. . r.sendChans()
r.beginCmds() . .
. . ch1 <- true
<- ch1 . .
. . ch2 <- true
...
. cancel()
<- ch1
------------------G1 leak----------------------------
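A generic sketch of this pattern (identifiers are illustrative, not the CockroachDB code): a receiver that returns on ctx.Done() without draining the channel leaves the sender blocked forever.

```go
package main

import (
	"context"
	"fmt"
	"runtime"
	"time"
)

func beginCmds(ctx context.Context, ch <-chan bool) {
	for {
		select {
		case <-ch: // consume one value
		case <-ctx.Done():
			return // returns without draining ch: the sender leaks
		}
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	ch := make(chan bool)
	go func() { // helper goroutine: keeps sending until nobody listens
		for {
			ch <- true
		}
	}()
	go beginCmds(ctx, ch)
	time.Sleep(1 * time.Millisecond)
	cancel()
	time.Sleep(1 * time.Millisecond)
	fmt.Println("goroutines still running:", runtime.NumGoroutine())
}
```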
| Bug ID | Ref | Patch | Type | Sub-type |
|---|---|---|---|---|
| cockroach#13197 | pull request | patch | Communication | Channel & Context |
One goroutine executing (*Tx).awaitDone() blocks waiting for a signal from context.Done().
G1 G2
-------------------------------
begin()
. awaitDone()
return .
. <-tx.ctx.Done()
-----------G2 leaks------------
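A minimal sketch of the awaitDone pattern (not the database/sql or CockroachDB code; identifiers are illustrative): the watcher goroutine waits on the transaction's context, which is never cancelled.

```go
package main

import (
	"context"
	"fmt"
	"runtime"
	"time"
)

type Tx struct {
	ctx    context.Context
	cancel context.CancelFunc
}

func begin() *Tx {
	tx := &Tx{}
	// In the fixed version, Commit/Rollback would call tx.cancel.
	tx.ctx, tx.cancel = context.WithCancel(context.Background())
	go tx.awaitDone()
	return tx
}

func (tx *Tx) awaitDone() {
	<-tx.ctx.Done() // leaks unless someone calls tx.cancel()
}

func main() {
	begin() // the returned Tx is dropped; cancel is never called
	time.Sleep(1 * time.Millisecond)
	fmt.Println("goroutines still running:", runtime.NumGoroutine())
}
```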
| Bug ID | Ref | Patch | Type | Sub-type |
|---|---|---|---|---|
| cockroach#13755 | pull request | patch | Communication | Channel & Context |
The buggy code does not close the db query result (rows), so one goroutine running (*Rows).awaitDone is blocked forever, waiting for a cancel signal from the context.
G1 G2
---------------------------------------
initContextClose()
. awaitDone()
return .
. <-tx.ctx.Done()
---------------G2 leaks----------------
| Bug ID | Ref | Patch | Type | Sub-type |
|---|---|---|---|---|
| cockroach#1462 | pull request | patch | Mixed | Channel & WaitGroup |
Executing <-stopper.ShouldStop() in processEventsUntil may cause goroutines created by lt.RunWorker in lt.start to be stuck sending a message over lt.Events. The main thread is then stuck at s.stop.Wait(), since the sender goroutines cannot call s.stop.Done().
G1 G2 G3
-------------------------------------------------------------------------------------------------------
NewLocalInterceptableTransport()
lt.start()
lt.stopper.RunWorker()
s.AddWorker()
s.stop.Add(1) [1]
go func() [G2]
stopper.RunWorker() .
s.AddWorker() .
s.stop.Add(1) [2] .
go func() [G3] .
s.Stop() . .
s.Quiesce() . .
. select [default] .
. lt.Events <- interceptMessage(0) .
close(s.stopper) . .
. . select [<-stopper.ShouldStop()]
. . <<<done>>>
s.stop.Wait() .
----------------------------------------------G1,G2 leak-----------------------------------------------
| Bug ID | Ref | Patch | Type | Sub-type |
|---|---|---|---|---|
| cockroach#16167 | pull request | patch | Resource | Double Locking |
This is another example of goroutine leaks caused by recursively acquiring a RWLock. There are two lock variables (systemConfigCond and systemConfigMu) which refer to the same underlying lock. The leak involves two goroutines. The first acquires systemConfigMu.Lock(), then tries to acquire systemConfigMu.RLock(). The second acquires systemConfigMu.Lock(). If the second goroutine interleaves in between the two lock operations of the first goroutine, both goroutines will leak.
G1 G2
---------------------------------------------------------------
. e.Start()
. e.updateSystemConfig()
e.execParsed() .
e.systemConfigCond.L.Lock() [L1] .
. e.systemConfigMu.Lock() [L1]
e.systemConfigMu.RLock() [L1] .
------------------------G1,G2 leak-----------------------------
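A minimal sketch of the aliased-lock problem (identifiers are illustrative, not the CockroachDB code): a sync.Cond and a plain variable both refer to the same sync.RWMutex, so taking the write lock through one name and then the read lock through the other blocks forever.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"time"
)

var (
	systemConfigMu   sync.RWMutex
	systemConfigCond = sync.NewCond(&systemConfigMu) // same underlying lock
)

func execParsed() { // G1
	systemConfigCond.L.Lock() // write-locks systemConfigMu
	runtime.Gosched()         // let G2 queue up for the write lock
	systemConfigMu.RLock()    // read lock while the write lock is held: blocks forever
	systemConfigMu.RUnlock()
	systemConfigCond.L.Unlock()
}

func updateSystemConfig() { // G2
	systemConfigMu.Lock() // blocked until G1 unlocks, which never happens
	systemConfigMu.Unlock()
}

func main() {
	go execParsed()
	go updateSystemConfig()
	time.Sleep(10 * time.Millisecond)
	fmt.Println("goroutines still running:", runtime.NumGoroutine())
}
```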
| Bug ID | Ref | Patch | Type | Sub-type |
|---|---|---|---|---|
| cockroach#18101 | pull request | patch | Resource | Double Locking |
The context.Done() signal short-circuits the reader goroutine, but not the senders, leading them to leak.
G1 G2 helper goroutine
--------------------------------------------------------------
restore()
. splitAndScatter()
<-readyForImportCh .
<-readyForImportCh <==> readyForImportCh<-
...
. . cancel()
<<done>> . <<done>>
readyForImportCh<-
-----------------------G2 leaks--------------------------------
| Bug ID | Ref | Patch | Type | Sub-type |
|---|---|---|---|---|
| cockroach#2448 | pull request | patch | Communication | Channel |
This bug is caused by two goroutines waiting for each other to unblock their channels:
- MultiRaft sends the commit event for the Membership change
- store.processRaft takes it and begins processing
- sendEvent, but this blocks since store.processRaft isn't ready for another select. Consequently the main MultiRaft loop is waiting for that as well.
- Membership change was applied to the range, and the store now tries to execute the callback
- callbackChan, but that is consumed by the MultiRaft loop, which is currently waiting for store.processRaft to consume from the events channel, which it will only do after the callback has completed.
G1                                  G2
--------------------------------------------------------------------------
s.processRaft() st.start()
select .
. select [default]
. s.handleWriteResponse()
. s.sendEvent()
. select
<-s.multiraft.Events <----> m.Events <- event
. select [default]
. s.handleWriteResponse()
. s.sendEvent()
. select [m.Events<-, <-s.stopper.ShouldStop()]
callback() .
select [
m.callbackChan<-,
<-s.stopper.ShouldStop()
] .
------------------------------G1,G2 leak----------------------------------
| Bug ID | Ref | Patch | Type | Sub-type |
|---|---|---|---|---|
| cockroach#24808 | pull request | patch | Communication | Channel |
When we Start the Compactor, it may already have received Suggestions, leaking a write that was already blocked on the full channel.
G1
------------------------------------------------
...
compactor.ch <-
compactor.Start()
compactor.ch <-
--------------------G1 leaks--------------------
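A generic sketch of the failure mode (not the Compactor code): a second write to a full, capacity-1 channel blocks forever because no consumer is draining it yet.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

func main() {
	suggestions := make(chan struct{}, 1)
	go func() {
		suggestions <- struct{}{} // fills the buffer
		suggestions <- struct{}{} // blocks: nothing is draining the channel yet
	}()
	time.Sleep(1 * time.Millisecond)
	fmt.Println("goroutines still running:", runtime.NumGoroutine())
}
```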
| Bug ID | Ref | Patch | Type | Sub-type |
|---|---|---|---|---|
| cockroach#25456 | pull request | patch | Communication | Channel |
When CheckConsistency (in the complete code) returns an error, the queue checks whether the store is draining to decide whether the error is worth logging. This check was incorrect and would block until the store actually started draining.
G1
---------------------------------------
...
<-repl.store.Stopper().ShouldQuiesce()
---------------G1 leaks----------------
| Bug ID | Ref | Patch | Type | Sub-type |
|---|---|---|---|---|
| cockroach#35073 | pull request | patch | Communication | Channel |
Previously, the outbox could fail during startup without closing its RowChannel. This could lead to goroutine leaks in rare cases due to channel communication mismatch.
| Bug ID | Ref | Patch | Type | Sub-type |
|---|---|---|---|---|
| cockroach#35931 | pull request | patch | Communication | Channel |
Previously, if a processor that reads from multiple inputs was waiting on one input to provide more data, and the other input was full, and both inputs were connected to inbound streams, it was possible to cause goroutine leaks during flow cancellation when trying to propagate the cancellation metadata messages into the flow. The cancellation method wrote metadata messages to each inbound stream one at a time, so if the first one was full, the canceller would block and never send a cancellation message to the second stream, which was the one actually being read from.
| Bug ID | Ref | Patch | Type | Sub-type |
|---|---|---|---|---|
| cockroach#3710 | pull request | patch | Resource | RWR Deadlock |
The goroutine leak is caused by acquiring an RLock twice in a call chain: ForceRaftLogScanAndProcess() (acquires s.mu.RLock()) -> MaybeAdd() -> shouldQueue() -> getTruncatableIndexes() -> RaftStatus() (acquires s.mu.RLock()).
G1 G2
------------------------------------------------------------
store.ForceRaftLogScanAndProcess()
s.mu.RLock()
s.raftLogQueue.MaybeAdd()
bq.impl.shouldQueue()
getTruncatableIndexes()
r.store.RaftStatus()
. store.processRaft()
. s.mu.Lock()
s.mu.RLock()
----------------------G1,G2 leak-----------------------------
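A minimal sketch of the recursive-RLock pattern (identifiers are illustrative, not the CockroachDB code): the second RLock() in the call chain waits behind a pending writer, so reader and writer both leak.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"time"
)

var mu sync.RWMutex

func forceRaftLogScanAndProcess() { // G1
	mu.RLock()
	defer mu.RUnlock()
	runtime.Gosched() // let the writer queue up in between
	raftStatus()
}

func raftStatus() {
	mu.RLock() // recursive read lock: blocks behind the pending writer
	defer mu.RUnlock()
}

func processRaft() { // G2
	mu.Lock() // blocks until G1 releases its first read lock
	defer mu.Unlock()
}

func main() {
	go forceRaftLogScanAndProcess()
	go processRaft()
	time.Sleep(10 * time.Millisecond)
	fmt.Println("goroutines still running:", runtime.NumGoroutine())
}
```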
| Bug ID | Ref | Patch | Type | Sub-type |
|---|---|---|---|---|
| cockroach#584 | pull request | patch | Resource | Double Locking |
Missing call to mu.Unlock() before the break in the loop.
G1
---------------------------
g.bootstrap()
g.mu.Lock() [L1]
if g.closed { ==> break
g.manage()
g.mu.Lock() [L1]
----------G1 leaks---------
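A minimal sketch of the missing-unlock pattern (identifiers are illustrative): the loop breaks while still holding g.mu, so the next Lock() blocks forever.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"time"
)

type gossip struct {
	mu     sync.Mutex
	closed bool
}

func (g *gossip) bootstrap() {
	for {
		g.mu.Lock()
		if g.closed {
			break // bug: missing g.mu.Unlock() before the break
		}
		g.mu.Unlock()
		time.Sleep(time.Millisecond)
	}
	g.manage()
}

func (g *gossip) manage() {
	g.mu.Lock() // double lock: blocks forever
	defer g.mu.Unlock()
}

func main() {
	g := &gossip{closed: true}
	go g.bootstrap()
	time.Sleep(1 * time.Millisecond)
	fmt.Println("goroutines still running:", runtime.NumGoroutine())
}
```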
| Bug ID | Ref | Patch | Type | Sub-type |
|---|---|---|---|---|
| cockroach#6181 | pull request | patch | Resource | RWR Deadlock |
The same RWMutex may be recursively acquired for both reading and writing.
G1 G2 G3 ...
-----------------------------------------------------------------------------------------------
testRangeCacheCoalescedRquests()
initTestDescriptorDB()
pauseLookupResumeAndAssert()
return
. doLookupWithToken()
. . doLookupWithToken()
. rc.LookupRangeDescriptor() .
. . rc.LookupRangeDescriptor()
. rdc.rangeCacheMu.RLock() .
. rdc.String() .
. . rdc.rangeCacheMu.RLock()
. . fmt.Printf()
. . rdc.rangeCacheMu.RUnlock()
. . rdc.rangeCacheMu.Lock()
. rdc.rangeCacheMu.RLock() .
-----------------------------------G2,G3,... leak----------------------------------------------
| Bug ID | Ref | Patch | Type | Sub-type |
|---|---|---|---|---|
| cockroach#7504 | pull request | patch | Resource | AB-BA Deadlock |
The locks are acquired in the order leaseState then tableNameCache in Release(), but tableNameCache then leaseState in AcquireByName(), leading to an AB-BA deadlock.
G1 G2
-----------------------------------------------------
mgr.AcquireByName() mgr.Release()
m.tableNames.get(id) .
c.mu.Lock() [L2] .
. t.release(lease)
. t.mu.Lock() [L3]
. s.mu.Lock() [L1]
lease.mu.Lock() [L1] .
. t.removeLease(s)
. t.tableNameCache.remove()
. c.mu.Lock() [L2]
---------------------G1, G2 leak---------------------
| Bug ID | Ref | Patch | Type | Sub-type |
|---|---|---|---|---|
| cockroach#9935 | pull request | patch | Resource | Double Locking |
This bug is caused by acquiring l.mu.Lock() twice.
| Bug ID | Ref | Patch | Type | Sub-type |
|---|---|---|---|---|
| etcd#10492 | pull request | patch | Resource | Double locking |
A simple double locking case for lines 19, 31.
| Bug ID | Ref | Patch | Type | Sub-type |
|---|---|---|---|---|
| etcd#5509 | pull request | patch | Resource | Double locking |
r.acquire() returns holding r.client.mu.RLock() on a failure path (line 42). This causes any call to client.Close() to leak goroutines.
| Bug ID | Ref | Patch | Type | Sub-type |
|---|---|---|---|---|
| etcd#6708 | pull request | patch | Resource | Double locking |
Double locking at lines 54 and 49.
| Bug ID | Ref | Patch | Type | Sub-type |
|---|---|---|---|---|
| etcd#6857 | pull request | patch | Communication | Channel |
If the select statement chooses a different case (n.stop), a later send over n.status leaks the sending goroutine.
G1 G2 G3
-------------------------------------------
n.run() . .
. . n.Stop()
. . n.stop<-
<-n.stop . .
. . <-n.done
close(n.done) . .
return . .
. . return
. n.Status()
. n.status<-
----------------G2 leaks-------------------
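A minimal sketch of this pattern (identifiers are illustrative, not the etcd code): once run() takes the stop case and returns, nothing receives from n.status, so a later Status() call leaks.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

type node struct {
	stop   chan struct{}
	done   chan struct{}
	status chan chan string
}

func (n *node) run() {
	for {
		select {
		case ch := <-n.status:
			ch <- "ok"
		case <-n.stop:
			close(n.done)
			return // after this, n.status has no receiver
		}
	}
}

func (n *node) Stop() {
	n.stop <- struct{}{}
	<-n.done
}

func (n *node) Status() string {
	ch := make(chan string)
	n.status <- ch // leaks if run() has already returned
	return <-ch
}

func main() {
	n := &node{stop: make(chan struct{}), done: make(chan struct{}), status: make(chan chan string)}
	go n.run()
	n.Stop()
	go n.Status() // leaks on the send to n.status
	time.Sleep(1 * time.Millisecond)
	fmt.Println("goroutines still running:", runtime.NumGoroutine())
}
```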
| Bug ID | Ref | Patch | Type | Sub-type |
|---|---|---|---|---|
| etcd#6873 | pull request | patch | Mixed | Channel & Lock |
This goroutine leak involves a goroutine acquiring a lock and being blocked over a channel operation with no partner, while another tries to acquire the same lock.
G1 G2 G3
--------------------------------------------------------------
newWatchBroadcasts()
wbs.update()
wbs.updatec <-
return
. <-wbs.updatec .
. wbs.coalesce() .
. . wbs.stop()
. . wbs.mu.Lock()
. . close(wbs.updatec)
. . <-wbs.donec
. wbs.mu.Lock() .
---------------------G2,G3 leak--------------------------------
| Bug ID | Ref | Patch | Type | Sub-type |
|---|---|---|---|---|
| etcd#7492 | pull request | patch | Mixed | Channel & Lock |
This goroutine leak involves a goroutine acquiring a lock and being blocked over a channel operation with no partner, while another tries to acquire the same lock.
G2 G1
---------------------------------------------------------------
. stk.run()
ts.assignSimpleTokenToUser() .
t.simpleTokensMu.Lock() .
t.simpleTokenKeeper.addSimpleToken() .
tm.addSimpleTokenCh <- true .
. <-tm.addSimpleTokenCh
t.simpleTokensMu.Unlock() .
ts.assignSimpleTokenToUser() .
...
t.simpleTokensMu.Lock()
. <-tokenTicker.C
tm.addSimpleTokenCh <- true .
. tm.deleteTokenFunc()
. t.simpleTokensMu.Lock()
---------------------------G1,G2 leak--------------------------
| Bug ID | Ref | Patch | Type | Sub-type |
|---|---|---|---|---|
| etcd#7902 | pull request | patch | Mixed | Channel & Lock |
If the follower goroutine acquires mu.Lock() first and calls rc.release(), it will be blocked on rcNextc. Only the leader can close(nextc) to unblock the follower. However, in order to invoke rc.release(), the leader needs to acquire mu.Lock(). The fix is to remove the lock and unlock around rc.release().
G1 G2 (leader) G3 (follower)
---------------------------------------------------------------------
runElectionFunc()
doRounds()
wg.Wait()
. ...
. mu.Lock()
. rc.validate()
. rcNextc = nextc
. mu.Unlock() ...
. . mu.Lock()
. . rc.validate()
. . mu.Unlock()
. . mu.Lock()
. . rc.release()
. . <-rcNextc
. mu.Lock()
-------------------------G1,G2,G3 leak--------------------------
| Bug ID | Ref | Patch | Type | Sub-type |
|---|---|---|---|---|
| grpc#1275 | pull request | patch | Communication | Channel |
Two goroutines are involved in this leak. The main goroutine is blocked at case <-donec, waiting for the second goroutine to close the channel. The second goroutine, created by the main goroutine, is blocked when calling stream.Read(), which invokes recvBufferRead.Read(). It is blocked at case i := r.recv.get(), waiting for someone to send a message to this channel. The client.CloseStream() method called by the main goroutine should send that message, but it does not. The patch is to send out this message.
G1 G2
-----------------------------------------------------
testInflightStreamClosing()
. stream.Read()
. io.ReadFull()
. <-r.recv.get()
CloseStream()
<-donec
---------------------G1, G2 leak---------------------
| Bug ID | Ref | Patch | Type | Sub-type |
|---|---|---|---|---|
| grpc#1424 | pull request | patch | Communication | Channel |
The goroutine running cc.lbWatcher returns without draining the done channel.
G1 G2 G3
-----------------------------------------------------------------
DialContext() . .
. cc.dopts.balancer.Notify() .
. . cc.lbWatcher()
. <-doneChan
close()
---------------------------G2 leaks-------------------------------
| Bug ID | Ref | Patch | Type | Sub-type |
|---|---|---|---|---|
| grpc#1460 | pull request | patch | Mixed | Channel & Lock |
When gRPC keepalives are enabled (which isn't the case by default at this time) and PermitWithoutStream is false (the default), the client can leak goroutines when transitioning between having no active stream and having one active stream. The keepalive() goroutine is stuck at <-t.awakenKeepalive, while the main goroutine is stuck in NewStream() on t.mu.Lock().
G1 G2
--------------------------------------------
client.keepalive()
. client.NewStream()
t.mu.Lock()
<-t.awakenKeepalive
. t.mu.Lock()
---------------G1,G2 leak-------------------
| Bug ID | Ref | Patch | Type | Sub-type |
|---|---|---|---|---|
| grpc#3017 | pull request | patch | Resource | Missing unlock |
Line 65 is an execution path with a missing unlock.
G1 G2 G3
------------------------------------------------------------------------------------------------
NewSubConn([1])
ccc.mu.Lock() [L1]
sc = 1
ccc.subConnToAddr[1] = 1
go func() [G2]
<-done .
. ccc.RemoveSubConn(1)
. ccc.mu.Lock()
. addr = 1
. entry = &subConnCacheEntry_grpc3017{}
. cc.subConnCache[1] = entry
. timer = time.AfterFunc() [G3]
. entry.cancel = func()
. sc = ccc.NewSubConn([1])
. ccc.mu.Lock() [L1]
. entry.cancel()
. !timer.Stop() [true]
. entry.abortDeleting = true
. . ccc.mu.Lock()
. . <<<done>>>
. ccc.RemoveSubConn(1)
. ccc.mu.Lock() [L1]
-------------------------------------------G1, G2 leak-----------------------------------------
| Bug ID | Ref | Patch | Type | Sub-type |
|---|---|---|---|---|
| grpc#660 | pull request | patch | Communication | Channel |
The parent function could return without draining the done channel.
G1 G2 helper goroutine
-------------------------------------------------------------
doCloseLoopUnary()
. bc.stop <- true
<-bc.stop
return
. done <-
----------------------G2 leak--------------------------------
| Bug ID | Ref | Patch | Type | Sub-type |
|---|---|---|---|---|
| grpc#795 | pull request | patch | Resource | Double locking |
Line 20 is an execution path with a missing unlock.
| Bug ID | Ref | Patch | Type | Sub-type |
|---|---|---|---|---|
| grpc#862 | pull request | patch | Communication | Channel & Context |
When the return value conn is nil, cc (the ClientConn) is not closed, and the goroutine executing resetAddrConn is leaked. The patch is to close the ClientConn in a defer func().
G1 G2
---------------------------------------
DialContext()
. cc.resetAddrConn()
. resetTransport()
. <-ac.ctx.Done()
--------------G2 leak------------------
| Bug ID | Ref | Patch | Type | Sub-type |
|---|---|---|---|---|
| hugo#3251 | pull request | patch | Resource | RWR deadlock |
A goroutine can hold Lock() at line 20 then acquire RLock() at line 29. RLock() at line 29 will never be acquired because Lock() at line 20 will never be released.
G1 G2 G3
------------------------------------------------------------------------------------------
wg.Add(1) [W1: 1]
go func() [G2]
go func() [G3]
. resGetRemote()
. remoteURLLock.URLLock(url)
. l.Lock() [L1]
. l.m[url] = &sync.Mutex{} [L2]
. l.m[url].Lock() [L2]
. l.Unlock() [L1]
. . resGetRemote()
. . remoteURLLock.URLLock(url)
. . l.Lock() [L1]
. . l.m[url].Lock() [L2]
. remoteURLLock.URLUnlock(url)
. l.RLock() [L1]
...
wg.Wait() [W1]
----------------------------------------G1,G2,G3 leak--------------------------------------
| Bug ID | Ref | Patch | Type | Sub-type |
|---|---|---|---|---|
| hugo#5379 | pull request | patch | Resource | Double locking |
A goroutine first acquires contentInitMu at line 99, then acquires the same Mutex at line 66.
| Bug ID | Ref | Patch | Type | Sub-type |
|---|---|---|---|---|
| istio#16224 | pull request | patch | Mixed | Channel & Lock |
A goroutine holds a Mutex at line 91 and is then blocked at line 93. Another goroutine attempts to acquire the same Mutex at line 101 in order to further drain the same channel at line 103.
| Bug ID | Ref | Patch | Type | Sub-type |
|---|---|---|---|---|
| istio#17860 | pull request | patch | Communication | Channel |
a.statusCh can't be drained at line 70.
| Bug ID | Ref | Patch | Type | Sub-type |
|---|---|---|---|---|
| istio#18454 | pull request | patch | Communication | Channel & Context |
s.timer.Stop() at lines 56 and 61 can be called concurrently (i.e. from their entry points at line 104 and line 66). See Timer.
| Bug ID | Ref | Patch | Type | Sub-type |
|---|---|---|---|---|
| kubernetes#10182 | pull request | patch | Mixed | Channel & Lock |
Goroutine 1 is blocked on a lock held by goroutine 3, while goroutine 3 is blocked on sending message to ch, which is read by goroutine 1.
G1 G2 G3
-------------------------------------------------------------------------------
s.Start()
s.syncBatch()
. s.SetPodStatus()
. s.podStatusesLock.Lock()
<-s.podStatusChannel <===> s.podStatusChannel <- true
. s.podStatusesLock.Unlock()
. return
s.DeletePodStatus() .
. . s.podStatusesLock.Lock()
. . s.podStatusChannel <- true
s.podStatusesLock.Lock()
-----------------------------G1,G3 leak-----------------------------------------
| Bug ID | Ref | Patch | Type | Sub-type |
|---|---|---|---|---|
| kubernetes#11298 | pull request | patch | Communication | Channel & Condition Variable |
n.node used n.lock as its underlying Locker. The service loop initially locked it, and the Notify function tried to lock it before calling n.node.Signal(), leading to a goroutine leak. n.cond.Signal() at lines 59 and 81 is not guaranteed to unblock the n.cond.Wait() at line 56.
| Bug ID | Ref | Patch | Type | Sub-type |
|---|---|---|---|---|
| kubernetes#13135 | pull request | patch | Resource | AB-BA deadlock |
G1 G2 G3
----------------------------------------------------------------------------------
NewCacher()
watchCache.SetOnReplace()
watchCache.SetOnEvent()
. cacher.startCaching()
. c.Lock()
. c.reflector.ListAndWatch()
. r.syncWith()
. r.store.Replace()
. w.Lock()
. w.onReplace()
. cacher.initOnce.Do()
. cacher.Unlock()
return cacher .
. . c.watchCache.Add()
. . w.processEvent()
. . w.Lock()
. cacher.startCaching() .
. c.Lock() .
...
. c.Lock()
. w.Lock()
--------------------------------G2,G3 leak-----------------------------------------
| Bug ID | Ref | Patch | Type | Sub-type |
|---|---|---|---|---|
| kubernetes#1321 | pull request | patch | Mixed | Channel & Lock |
This is a lock-channel bug. The first goroutine invokes distribute(), which holds m.lock.Lock() while blocked sending a message to w.result. The second goroutine invokes stopWatching(), which could unblock the first goroutine by closing w.result; however, in order to close w.result, stopWatching() needs to acquire m.lock.Lock().
The fix is to introduce a second channel and receive from it in the same select statement as the send to w.result. Closing the second channel then unblocks the first goroutine without needing to hold m.lock.Lock().
G1 G2
----------------------------------------------
testMuxWatcherClose()
NewMux()
. m.loop()
. m.distribute()
. m.lock.Lock()
. w.result <- true
w := m.Watch()
w.Stop()
mw.m.stopWatching()
m.lock.Lock()
---------------G1,G2 leak---------------------
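A minimal sketch of the described fix (identifiers are illustrative, not the Kubernetes code): the sender selects on both the result channel and a per-watcher done channel, so Stop() can unblock it without taking the lock.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"time"
)

type watcher struct {
	result chan struct{}
	done   chan struct{} // added by the fix
}

type mux struct {
	lock sync.Mutex
	w    *watcher
}

func (m *mux) distribute() {
	m.lock.Lock()
	defer m.lock.Unlock()
	select {
	case m.w.result <- struct{}{}:
	case <-m.w.done: // fix: the sender can be released without m.lock
	}
}

func (m *mux) stopWatching() {
	close(m.w.done) // no need to take m.lock to unblock distribute()
}

func main() {
	m := &mux{w: &watcher{result: make(chan struct{}), done: make(chan struct{})}}
	go m.distribute()
	time.Sleep(1 * time.Millisecond)
	m.stopWatching()
	time.Sleep(1 * time.Millisecond)
	fmt.Println("goroutines still running:", runtime.NumGoroutine())
}
```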
| Bug ID | Ref | Patch | Type | Sub-type |
|---|---|---|---|---|
| kubernetes#25331 | pull request | patch | Communication | Channel & Context |
A potential goroutine leak occurs when an error has happened: wc.run blocks sending on resultChan, while Stop() has already cancelled the context and no longer drains resultChan.
G1 G2
------------------------------------
wc.run()
. wc.Stop()
. wc.errChan <-
. wc.cancel()
<-wc.errChan
wc.cancel()
wc.resultChan <-
-------------G1 leak----------------
| Bug ID | Ref | Patch | Type | Sub-type |
|---|---|---|---|---|
| kubernetes#26980 | pull request | patch | Mixed | Channel & Lock |
A goroutine holds a Mutex at line 24 and is blocked at line 35. Another goroutine is blocked at line 58 trying to acquire the same Mutex.
| Bug ID | Ref | Patch | Type | Sub-type |
|---|---|---|---|---|
| kubernetes#30872 | pull request | patch | Resource | AB-BA deadlock |
The lock is acquired both at lines 92 and 157.
| Bug ID | Ref | Patch | Type | Sub-type |
|---|---|---|---|---|
| kubernetes#38669 | pull request | patch | Communication | Channel |
There is no sender for the channel received from at line 33.
| Bug ID | Ref | Patch | Type | Sub-type |
|---|---|---|---|---|
| kubernetes#5316 | pull request | patch | Communication | Channel |
If the main goroutine selects a case that doesn't consume the channels, the anonymous goroutine will be blocked on sending to the channel.
G1 G2
--------------------------------------
finishRequest()
. fn()
time.After()
. errCh<-/ch<-
--------------G2 leaks----------------
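A minimal sketch of this pattern (not the Kubernetes code; identifiers are illustrative): the worker sends its result on an unbuffered channel, so if the timeout case wins the select, the worker leaks. Buffering the channel (or draining it) is the usual fix.

```go
package main

import (
	"errors"
	"fmt"
	"runtime"
	"time"
)

func finishRequest(timeout time.Duration, fn func() error) error {
	errCh := make(chan error) // bug: unbuffered, so the sender needs a receiver
	go func() {
		errCh <- fn() // leaks if the timeout case below was chosen
	}()
	select {
	case err := <-errCh:
		return err
	case <-time.After(timeout):
		return errors.New("timed out")
	}
}

func main() {
	finishRequest(time.Millisecond, func() error {
		time.Sleep(10 * time.Millisecond) // slower than the timeout
		return nil
	})
	time.Sleep(20 * time.Millisecond)
	fmt.Println("goroutines still running:", runtime.NumGoroutine())
}
```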
| Bug ID | Ref | Patch | Type | Sub-type |
|---|---|---|---|---|
| kubernetes#58107 | pull request | patch | Resource | RWR deadlock |
The rules for the read-write lock: concurrent read locks are allowed; a write lock has higher priority than read locks.
There are two queues (queue 1 and queue 2) involved in this bug, and the two queues are protected by the same read-write lock (rq.workerLock.RLock()). Before getting an element from queue 1 or queue 2, rq.workerLock.RLock() is acquired. If the queue is empty, cond.Wait() will be invoked. There is another goroutine (goroutine D), which will periodically invoke rq.workerLock.Lock(). Under the following situation, deadlock will happen. Queue 1 is empty, so that some goroutines hold rq.workerLock.RLock(), and block at cond.Wait(). Goroutine D is blocked when acquiring rq.workerLock.Lock(). Some goroutines try to process jobs in queue 2, but they are blocked when acquiring rq.workerLock.RLock(), since write lock has a higher priority.
The fix is to not hold rq.workerLock.RLock() while pulling data from any queue. Therefore, when a goroutine is blocked at cond.Wait(), rq.workerLock.RLock() is not held.
G3 G4 G5
--------------------------------------------------------------------
. . Sync()
rq.workerLock.RLock() . .
q.cond.Wait() . .
. . rq.workerLock.Lock()
. rq.workerLock.RLock()
. q.cond.L.Lock()
-----------------------------G3,G4,G5 leak-----------------------------
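A minimal sketch of this scenario (identifiers are illustrative, not the Kubernetes code): one worker parks in cond.Wait() while still holding the read lock, Sync() queues up for the write lock, and workers for the other queue then block behind the pending writer.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"time"
)

type relistQueue struct {
	workerLock sync.RWMutex
	cond       *sync.Cond // guards queue 1, which stays empty
}

func (rq *relistQueue) workQueue1() { // G3
	rq.workerLock.RLock() // bug: read lock held across cond.Wait
	defer rq.workerLock.RUnlock()
	rq.cond.L.Lock()
	rq.cond.Wait() // queue 1 is empty and nothing ever signals
	rq.cond.L.Unlock()
}

func (rq *relistQueue) sync() { // G5
	rq.workerLock.Lock() // waits for G3's read lock, forever
	defer rq.workerLock.Unlock()
}

func (rq *relistQueue) workQueue2() { // G4
	rq.workerLock.RLock() // waits behind the pending writer in sync()
	defer rq.workerLock.RUnlock()
}

func main() {
	rq := &relistQueue{cond: sync.NewCond(&sync.Mutex{})}
	go rq.workQueue1()
	time.Sleep(time.Millisecond)
	go rq.sync()
	time.Sleep(time.Millisecond)
	go rq.workQueue2()
	time.Sleep(time.Millisecond)
	fmt.Println("goroutines still running:", runtime.NumGoroutine())
}
```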
| Bug ID | Ref | Patch | Type | Sub-type |
|---|---|---|---|---|
| kubernetes#62464 | pull request | patch | Resource | RWR deadlock |
This is another example of a recursive read lock bug. The Go developers have noted that RLock should not be used recursively in the same goroutine.
G1 G2
--------------------------------------------------------
m.reconcileState()
m.state.GetCPUSetOrDefault()
s.RLock()
s.GetCPUSet()
. p.RemoveContainer()
. s.GetDefaultCPUSet()
. s.SetDefaultCPUSet()
. s.Lock()
s.RLock()
---------------------G1,G2 leak--------------------------
| Bug ID | Ref | Patch | Type | Sub-type |
|---|---|---|---|---|
| kubernetes#6632 | pull request | patch | Mixed | Channel & Lock |
When resetChan is full, WriteFrame holds the lock and blocks on the channel. Then monitor() fails to close the resetChan because the lock is already held by WriteFrame.
G1 G2 helper goroutine
----------------------------------------------------------------
i.monitor()
<-i.conn.closeChan
. i.WriteFrame()
. i.writeLock.Lock()
. i.resetChan <-
. . i.conn.closeChan<-
i.writeLock.Lock()
----------------------G1,G2 leak--------------------------------
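A minimal sketch of this mixed channel-and-lock pattern (identifiers are illustrative, not the Kubernetes code): WriteFrame holds the lock while blocked on the full channel, and monitor() needs the same lock before it could drain or close that channel.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"time"
)

type idleConn struct {
	writeLock sync.Mutex
	resetChan chan struct{}
}

func (i *idleConn) WriteFrame() {
	i.writeLock.Lock()
	defer i.writeLock.Unlock()
	i.resetChan <- struct{}{} // blocks: the channel is full
}

func (i *idleConn) monitor() {
	i.writeLock.Lock() // blocks: WriteFrame still holds the lock
	close(i.resetChan)
	i.writeLock.Unlock()
}

func main() {
	i := &idleConn{resetChan: make(chan struct{}, 1)}
	i.resetChan <- struct{}{} // fill the buffer
	go i.WriteFrame()
	time.Sleep(1 * time.Millisecond)
	go i.monitor()
	time.Sleep(1 * time.Millisecond)
	fmt.Println("goroutines still running:", runtime.NumGoroutine())
}
```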
| Bug ID | Ref | Patch | Type | Sub-type |
|---|---|---|---|---|
| kubernetes#70277 | pull request | patch | Communication | Channel |
wait.poller() returns a function of type WaitFunc. That function creates a goroutine, and the goroutine only quits when either the after or the done channel is closed.
The doneCh defined at line 70 is never closed.
| Bug ID | Ref | Patch | Type | Sub-type |
|---|---|---|---|---|
| moby#17176 | pull request | patch | Resource | Double locking |
devices.nrDeletedDevices takes devices.Lock() but does not release it (line 36) if there are no deleted devices. This will block other goroutines trying to acquire devices.Lock().
| Bug ID | Ref | Patch | Type | Sub-type |
|---|---|---|---|---|
| moby#21233 | pull request | patch | Communication | Channel |
This test was checking that it received every progress update that was produced. But delivery of these intermediate progress updates is not guaranteed. A new update can overwrite the previous one if the previous one hasn't been sent to the channel yet.
The call to t.Fatalf terminated the current goroutine which was consuming the channel, which caused a deadlock and eventual test timeout rather than a proper failure message.
G1 G2 G3
----------------------------------------------------------
testTransfer() . .
tm.Transfer() . .
t.Watch() . .
. WriteProgress() .
. ProgressChan<- .
. . <-progressChan
. ... ...
. return .
. <-progressChan
<-watcher.running
----------------------G1,G3 leak--------------------------
| Bug ID | Ref | Patch | Type | Sub-type |
|---|---|---|---|---|
| moby#25384 | pull request | patch | Mixed | Misuse WaitGroup |
When n=1 (where n is len(pm.plugins)), the location of group.Wait() doesn’t matter. When n > 1, group.Wait() is invoked in each iteration. Whenever group.Wait() is invoked, it waits for group.Done() to be executed n times. However, group.Done() is only executed once in one iteration.
Misuse of sync.WaitGroup
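A minimal sketch of the misuse (not the Moby code; identifiers are illustrative): Wait() inside the loop waits for all n Done() calls although only one has been issued, so the calling goroutine leaks.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"time"
)

func disablePlugins(plugins []string) {
	var group sync.WaitGroup
	group.Add(len(plugins)) // expects one Done() per plugin
	for range plugins {
		go func() {
			group.Done()
		}()
		// bug: Wait() inside the loop; after the first iteration only one
		// Done() has happened, so Wait() never returns.
		group.Wait()
	}
}

func main() {
	go disablePlugins([]string{"a", "b", "c"})
	time.Sleep(10 * time.Millisecond)
	fmt.Println("goroutines still running:", runtime.NumGoroutine())
}
```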
| Bug ID | Ref | Patch | Type | Sub-type |
|---|---|---|---|---|
| moby#27782 | pull request | patch | Communication | Channel & Condition Variable |
G1 G2 G3
-----------------------------------------------------------------------
InitializeStdio()
startLogging()
l.ReadLogs()
NewLogWatcher()
. l.readLogs()
container.Reset() .
LogDriver.Close() .
r.Close() .
close(w.closeNotifier) .
. followLogs(logWatcher)
. watchFile()
. New()
. NewEventWatcher()
. NewWatcher()
. . w.readEvents()
. . event.ignoreLinux()
. . return false
. <-logWatcher.WatchClose() .
. fileWatcher.Remove() .
. w.cv.Wait() .
. . w.Events <- event
------------------------------G2,G3 leak-------------------------------
| Bug ID | Ref | Patch | Type | Sub-type |
|---|---|---|---|---|
| moby#28462 | pull request | patch | Mixed | Channel & Lock |
One goroutine may acquire a lock and try to send a message over channel stop, while the other will try to acquire the same lock. With the wrong ordering, both goroutines will leak.
G1 G2
--------------------------------------------------------------
monitor()
handleProbeResult()
. d.StateChanged()
. c.Lock()
. d.updateHealthMonitorElseBranch()
. h.CloseMonitorChannel()
. s.stop <- struct{}{}
c.Lock()
----------------------G1,G2 leak------------------------------
| Bug ID | Ref | Patch | Type | Sub-type |
|---|---|---|---|---|
| moby#30408 | pull request | patch | Communication | Condition Variable |
Wait() at line 22 has no corresponding Signal() or Broadcast().
G1 G2
------------------------------------------
testActive()
. p.waitActive()
. p.activateWait.L.Lock()
. p.activateWait.Wait()
<-done
-----------------G1,G2 leak---------------
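A minimal sketch of this pattern (identifiers are illustrative, not the Moby code): the condition variable is never signalled, so waitActive() blocks forever.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"time"
)

type plugin struct {
	activateWait *sync.Cond
	activated    bool
}

func (p *plugin) waitActive() {
	p.activateWait.L.Lock()
	for !p.activated {
		p.activateWait.Wait() // no Signal()/Broadcast() ever arrives
	}
	p.activateWait.L.Unlock()
}

func main() {
	p := &plugin{activateWait: sync.NewCond(&sync.Mutex{})}
	go p.waitActive()
	time.Sleep(1 * time.Millisecond)
	fmt.Println("goroutines still running:", runtime.NumGoroutine())
}
```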
| Bug ID | Ref | Patch | Type | Sub-type |
|---|---|---|---|---|
| moby#33781 | pull request | patch | Communication | Channel & Context |
The goroutine created using an anonymous function is blocked sending a message over an unbuffered channel. However there exists a path in the parent goroutine where the parent function will return without draining the channel.
G1 G2 G3
----------------------------------------
monitor() .
<-time.After() .
. .
<-stop stop<-
.
cancelProbe()
return
. result<-
----------------G3 leak------------------
| Bug ID | Ref | Patch | Type | Sub-type |
|---|---|---|---|---|
| moby#36114 | pull request | patch | Resource | Double locking |
The lock for the struct svm has already been locked when calling svm.hotRemoveVHDsAtStart().
| Bug ID | Ref | Patch | Type | Sub-type |
|---|---|---|---|---|
| moby#4951 | pull request | patch | Resource | AB-BA deadlock |
The root cause and patch are clearly explained in the commit description. The global lock is devices.Lock(), and the device lock is baseInfo.lock.Lock(). It is very likely that this bug can be reproduced.
| Bug ID | Ref | Patch | Type | Sub-type |
|---|---|---|---|---|
| moby#7559 | pull request | patch | Resource | Double locking |
Line 25 is missing a call to .Unlock.
G1
---------------------------
proxy.connTrackLock.Lock()
if err != nil { continue }
proxy.connTrackLock.Lock()
-----------G1 leaks--------
| Bug ID | Ref | Patch | Type | Sub-type |
|---|---|---|---|---|
| serving#2137 | pull request | patch | Mixed | Channel & Lock |
Patch: https://github.com/knative/serving/pull/2137/files; pull request: https://github.com/knative/serving/pull/2137
G1 G2 G3
----------------------------------------------------------------------------------
b.concurrentRequests(2) . .
b.concurrentRequest() . .
r.lock.Lock() . .
. start.Done() .
start.Wait() . .
b.concurrentRequest() . .
r.lock.Lock() . .
. . start.Done()
start.Wait() . .
unlockAll(locks) . .
unlock(lc) . .
req.lock.Unlock() . .
ok := <-req.accepted . .
. b.Maybe() .
. b.activeRequests <- t .
. thunk() .
. r.lock.Lock() .
. . b.Maybe()
. . b.activeRequests <- t
----------------------------G1,G2,G3 leak-----------------------------------------
| Bug ID | Ref | Patch | Type | Sub-type |
|---|---|---|---|---|
| syncthing#4829 | pull request | patch | Resource | Double locking |
Double locking at line 17 and line 30.
G1
---------------------------
mapping.clearAddresses()
m.mut.Lock() [L2]
m.notify(...)
m.mut.RLock() [L2]
----------G1 leaks---------
| Bug ID | Ref | Patch | Type | Sub-type |
|---|---|---|---|---|
| syncthing#5795 | pull request | patch | Communication | Channel |
<-c.dispatcherLoopStopped at line 82 blocks forever because dispatcherLoop() is blocked at line 72.
G1 G2
--------------------------------------------------------------
c.Start()
go c.dispatcherLoop() [G2]
. select [<-c.inbox, <-c.closed]
c.inbox <- <================> [<-c.inbox]
<-c.dispatcherLoopStopped .
. default
. c.ccFn()/c.Close()
. close(c.closed)
. <-c.dispatcherLoopStopped
---------------------G1,G2 leak-------------------------------