| More Research Problems of Implementing Go |
| |
| Dmitry Vyukov |
| Google |
| |
| http://golang.org/ |
| |
| * About Go |
| |
| Go is an open source programming language that makes it easy to build simple, reliable, and efficient software. |
| |
| Design began in late 2007. |
| |
| - Robert Griesemer, Rob Pike, Ken Thompson |
| - Russ Cox, Ian Lance Taylor |
| |
| Became open source in November 2009. |
| |
| Developed entirely in the open; very active community. |
| Language stable as of Go 1, early 2012. |
| Work continues. |
| |
| * Motivation for Go |
| |
| .image research2/datacenter.jpg |
| |
| * Motivation for Go |
| |
| Started as an answer to software problems at Google: |
| |
| - multicore processors |
| - networked systems |
| - massive computation clusters |
| - scale: 10⁷⁺ lines of code |
| - scale: 10³⁺ programmers |
| - scale: 10⁶⁺ machines (design point) |
| |
| Deployed: parts of YouTube, dl.google.com, Blogger, Google Code, Google Fiber, ... |
| |
| * Go |
| |
| A simple but powerful and fun language. |
| |
| - start with C, remove complex parts |
| - add interfaces, concurrency |
| - also: garbage collection, closures, reflection, strings, ... |
| |
| For more background on design: |
| |
| - [[http://commandcenter.blogspot.com/2012/06/less-is-exponentially-more.html][Less is exponentially more]] |
| - [[http://go.dev/talks/2012/splash.article][Go at Google: Language Design in the Service of Software Engineering]] |
| |
| * Research and Go |
| |
| Go is designed for building production systems at Google. |
| |
| - Goal: make that job easier, faster, better. |
| - Non-goal: break new ground in programming language research |
| |
| Plenty of research questions about how to implement Go well. |
| |
| - Concurrency |
| - Scheduling |
| - Garbage collection |
| - Race and deadlock detection |
| - Testing of the implementation |
| |
| |
| |
| |
| * Concurrency |
| |
| .image research2/busy.jpg |
| |
| * Concurrency |
| |
| Go provides two important concepts: |
| |
| A goroutine is a thread of control within the program, with its own local variables and stack. Cheap, easy to create. |
| |
| A channel carries typed messages between goroutines. |
| |
| * Concurrency |
| |
| .play research2/hello.go |
| |
| * Concurrency: CSP |
| |
| Channels adopted from Hoare's Communicating Sequential Processes. |
| |
| - Orthogonal to rest of language |
| - Can keep familiar model for computation |
| - Focus on _composition_ of regular code |
| |
| Go _enables_ simple, safe concurrent programming. |
| It doesn't _forbid_ bad programming. |
| |
| Caveat: not purely memory safe; sharing is legal. |
| Passing a pointer over a channel is idiomatic. |
| |
| Experience shows this is practical. |
| |
| * Concurrency |
| |
| Sequential network address resolution, given a work list: |
| |
| .play research2/addr1.go /lookup/+1,/^}/-1 |
| |
| * Concurrency |
| |
| Concurrent network address resolution, given a work list: |
| |
| .play research2/addr2.go /lookup/+1,/^}/-1 |
| |
| * Concurrency |
| |
| Select statements: switch for communication. |
| |
| .play research2/select.go /select/,/^}/-1 |
| |
| That's select that makes efficient implementation difficult. |
| |
| * Implementing Concurrency |
| |
| Challenge: Make channel communication scale |
| |
| - start with one global channel lock |
| - per-channel locks, locked in address order for multi-channel operations |
| |
| Research question: lock-free channels? |
| |
| |
| |
| |
| * Scheduling |
| |
| .image research2/gophercomplex6.jpg |
| |
| * Scheduling |
| |
| On the one hand we have arbitrary user programs: |
| |
| - fine-grained goroutines, coarse-grained goroutines or a mix of both |
| - computational goroutines, IO-bound goroutines or a mix of both |
| - arbitrary dynamic communication patterns |
| - busy, idle, bursty programs |
| |
| No user hints! |
| |
| * Scheduling |
| |
| On the other hand we have complex hardware topology: |
| |
| - per-core caches |
| - caches shared between cores |
| - cores shared between hyper threads (HT) |
| - multiple processors with non-uniform memory access (NUMA) |
| |
| * Scheduling |
| |
| Challenge: make it all magically work efficiently |
| |
| - start with one global lock for all scheduler state |
| - distributed work-stealing scheduler with per-"processor" state |
| - integrated network poller into scheduler |
| - lock-free work queues |
| |
| * Scheduling |
| |
| Current scheduler: |
| |
| ┌─┐ ┌─┐ ┌─┐ ┌─┐ ┌─┐ |
| │ │ │ │ │ │ │ │ │ │ |
| ├─┤ ├─┤ ├─┤ ├─┤ ├─┤ Global |
| │ │ │G│ │ │ │ │ │ │ state |
| ├─┤ ├─┤ ├─┤ ├─┤ ├─┤ |
| │G│ │G│ │G│ │ │ │G│ |
| ├─┤ ├─┤ ├─┤ ├─┤ ├─┤ |
| │G│ │G│ │G│ │G│ │G│ |
| └┬┘ └┬┘ └┬┘ └┬┘ └─┘ |
| │ │ │ │ |
| ↓ ↓ ↓ ↓ |
| ┌─┬──────┐ ┌─┬──────┐ ┌─┬──────┐ ┌─┬──────┐ ┌────┐┌──────┐┌───────┐ |
| │P│mcache│ │P│mcache│ │P│mcache│ │P│mcache│ │heap││timers││netpoll│ |
| └┬┴──────┘ └┬┴──────┘ └┬┴──────┘ └┬┴──────┘ └────┘└──────┘└───────┘ |
| │ │ │ │ |
| ↓ ↓ ↓ ↓ |
| ┌─┐ ┌─┐ ┌─┐ ┌─┐ ┌─┐ ┌─┐ ┌─┐ |
| │M│ │M│ │M│ │M│ │M│ │M│ │M│ |
| └─┘ └─┘ └─┘ └─┘ └─┘ └─┘ └─┘ |
| |
| G - goroutine; P - logical processor; M - OS thread (machine) |
| |
| * Scheduling |
| |
| Want: |
| |
| - temporal locality to exploit caches |
| - spatial locality to exploit NUMA |
| - schedule mostly LIFO but ensure weak fairness |
| - allocate local memory and stacks |
| - scan local memory in GC |
| - collocate communicating goroutines |
| - distribute non-communicating goroutines |
| - distribute timers and network poller |
| - poll network on the same core where last read was issued |
| |
| |
| |
| |
| |
| * Garbage Collection |
| |
| * Garbage Collection |
| |
| Garbage collection simplifies APIs. |
| |
| - In C and C++, too much API design (and too much programming effort!) is about memory management. |
| |
| Fundamental to concurrency: too hard to track ownership otherwise. |
| |
| Fundamental to interfaces: memory management details do not bifurcate otherwise-similar APIs. |
| |
| Of course, adds cost, latency, complexity in run time system. |
| |
| * Garbage Collection |
| |
| Plenty of research about garbage collection, mostly in Java context. |
| |
| - Parallel stop-the-world |
| - CMS: concurrent mark-and-sweep, stop-the-world compaction |
| - G1: region-based incremental copying collector |
| |
| Java collectors usually: |
| |
| - are generational/incremental because allocation rate is high |
| - compact memory to support generations |
| - have pauses because concurrent compaction is tricky and slow |
| |
| * Garbage Collection |
| |
| But Go is very different! |
| |
| - User can avoid lots of allocations by embedding objects: |
| |
| type Point struct { |
| X, Y int |
| } |
| type Rectangle struct { |
| Min, Max Point |
| } |
| |
| - Less pointers. |
| - Lots of stack allocations. |
| - Interior pointers are allowed: |
| |
| p := &rect.Max |
| |
| - Hundreds of thousands of stacks (goroutines) |
| - No object headers so far |
| |
| * Implementing Garbage Collection |
| |
| Current GC: stop the world, parallel mark, start the world, concurrent sweep. |
| Concurrent mark is almost ready. |
| |
| Cannot reuse Java GC algorithms directly. |
| |
| Research question: what GC algorithm is the best fit for Go? |
| Do we need generations? Do we need compaction? What are efficient data structures that support interior pointers? |
| |
| |
| |
| |
| * Race and deadlock detection |
| |
| .image research2/race.png 160 600 |
| |
| * Race detection |
| |
| Based on ThreadSanitizer runtime, originally mainly targeted C/C++. |
| Traditional happens-before race detector based on vector clocks (devil in details!). |
| Works fine for Go, except: |
| |
| $ go run -race lots_of_goroutines.go |
| race: limit on 8192 simultaneously alive goroutines is exceeded, dying |
| |
| Research question: race detector that efficiently supports hundreds of thousands of goroutines? |
| |
| * Deadlock detection |
| |
| Deadlock on mutexes due to lock order inversion: |
| |
| // thread 1 // thread 2 |
| pthread_mutex_lock(&m1); pthread_mutex_lock(&m2); |
| pthread_mutex_lock(&m2); pthread_mutex_lock(&m1); |
| ... ... |
| pthread_mutex_unlock(&m2); pthread_mutex_unlock(&m1); |
| pthread_mutex_unlock(&m1); pthread_mutex_unlock(&m2); |
| |
| Lock order inversions are easy to detect: |
| |
| - build "M1 is locked under M2" relation. |
| - if it becomes cyclic, there is a potential deadlock. |
| - whenever a new edge is added to the graph, do DFS to find cycles. |
| |
| * Deadlock detection |
| |
| Go has channels and mutexes. Channels are semaphores. A mutex can be unlocked in |
| a different goroutine, so it is essentially a binary semaphore too. |
| |
| Deadlock example: |
| |
| // Parallel file tree walk. |
| func worker(pendingItems chan os.FileInfo) |
| for f := range pendingItems { |
| if f.IsDir() { |
| filepath.Walk(f.Name(), func(path string, info os.FileInfo, err error) error { |
| pendingItems <- info |
| }) |
| } else { |
| visit(f) |
| } |
| } |
| } |
| |
| pendingItems channel has limited capacity. All workers can block on send to pendingItems. |
| |
| * Deadlock detection |
| |
| Another deadlock example: |
| |
| var ( |
| c = make(chan T, 100) |
| mtx sync.RWMutex |
| ) |
| |
| // goroutine 1 // goroutine 2 // goroutine 3 |
| // does send // does receive // "resizes" the channel |
| mtx.RLock() mtx.RLock() mtx.Lock() |
| c <- v v := <-c tmp := make(chan T, 200) |
| mtx.RUnlock() mtx.RUnlock() copyAll(c, tmp) |
| c = tmp |
| mtx.Unlock() |
| |
| RWMutex is fair for both readers and writers: when a writer arrives, new readers are not let to enter the critical section. |
| Goroutine 1 blocks on chan send; then goroutine 3 blocks on mtx.Lock; then goroutine 2 blocks on mtx.RLock. |
| |
| * Deadlock detection |
| |
| Research question: how to detect deadlocks on semaphores? |
| |
| No known theory to date. |
| |
| |
| |
| |
| * Testing of the implementation |
| |
| .image research2/gopherswrench.jpg 240 405 |
| |
| * Testing of the implementation |
| |
| So now we have a new language with several complex implementations: |
| |
| - lexer |
| - parser |
| - transformation and optimization passes |
| - code generation |
| - linker |
| - channel and map operations |
| - garbage collector |
| - ... |
| |
| *How*do*you*test*it?* |
| |
| * Testing of the implementation |
| |
| Csmith is a tool that generates random C programs that statically and dynamically conform to the C99 standard. |
| |
| .image research2/csmith.png |
| |
| * Testing of the implementation |
| |
| Gosmith is a tool that generates random Go programs that statically and dynamically conform to the Go standard. |
| |
| Turned out to be much simpler than C: no undefined behavior all around! |
| |
| - no uninitialized variables |
| - no concurrent mutations between sequence points (x[i++] = --i) |
| - no UB during signed overflow |
| - total 191 kinds of undefined behavior and 52 kinds of unspecified behavior in C |
| |
| * Testing of the implementation |
| |
| But generates uninteresting programs from execution point of view: most of them deadlock or crash on nil deref. |
| |
| Trophies: |
| |
| - 31 bugs in gc compiler |
| - 18 bugs in gccgo compiler |
| - 5 bugs in llgo compiler |
| - 1 bug in gofmt |
| - 3 bugs in the spec |
| |
| .image research2/emoji.png |
| |
| * Testing of the implementation |
| |
| Research question: how to generate random *interesting*concurrent* Go programs? |
| |
| Must: |
| |
| - create and wait for goroutines |
| - communicate over channels |
| - protect data with mutexes (reader-writer) |
| - pass data ownership between goroutines (explicitly and implicitly) |
| |
| Must not: |
| |
| - deadlock |
| - cause data races |
| - have non-deterministic results |
| |
| |
| |
| |
| |
| * Research and Go |
| |
| Plenty of research questions about how to implement Go well. |
| |
| - Concurrency |
| - Scheduling |
| - Garbage collection |
| - Race and deadlock detection |
| - Testing of the implementation |
| - [Polymorphism] |
| - [Program translation] |
| |