content/2014/research2.slide - talks - Git at Google

 More Research Problems of Implementing Go

 Dmitry Vyukov
 Google

 http://golang.org/

 * About Go

 Go is an open source programming language that makes it easy to build simple, reliable, and efficient software.

 Design began in late 2007.

 - Robert Griesemer, Rob Pike, Ken Thompson
 - Russ Cox, Ian Lance Taylor

 Became open source in November 2009.

 Developed entirely in the open; very active community.
 Language stable as of Go 1, early 2012.
 Work continues.

 * Motivation for Go

 .image research2/datacenter.jpg

 * Motivation for Go

 Started as an answer to software problems at Google:

 - multicore processors
 - networked systems
 - massive computation clusters
 - scale: 10⁷⁺ lines of code
 - scale: 10³⁺ programmers
 - scale: 10⁶⁺ machines (design point)

 Deployed: parts of YouTube, dl.google.com, Blogger, Google Code, Google Fiber, ...

 * Go

 A simple but powerful and fun language.

 - start with C, remove complex parts
 - add interfaces, concurrency
 - also: garbage collection, closures, reflection, strings, ...

 For more background on design:

 - [[http://commandcenter.blogspot.com/2012/06/less-is-exponentially-more.html][Less is exponentially more]]
 - [[http://talks.golang.org/2012/splash.article][Go at Google: Language Design in the Service of Software Engineering]]

 * Research and Go

 Go is designed for building production systems at Google.

 - Goal: make that job easier, faster, better.
 - Non-goal: break new ground in programming language research

 Plenty of research questions about how to implement Go well.

 - Concurrency
 - Scheduling
 - Garbage collection
 - Race and deadlock detection
 - Testing of the implementation


 * Concurrency

 .image research2/busy.jpg

 * Concurrency

 Go provides two important concepts:

 A goroutine is a thread of control within the program, with its own local variables and stack. Cheap, easy to create.

 A channel carries typed messages between goroutines.

 * Concurrency

 .play research2/hello.go

 * Concurrency: CSP

 Channels adopted from Hoare's Communicating Sequential Processes.

 - Orthogonal to rest of language
 - Can keep familiar model for computation
 - Focus on _composition_ of regular code

 Go _enables_ simple, safe concurrent programming.
 It doesn't _forbid_ bad programming.

 Caveat: not purely memory safe; sharing is legal.
 Passing a pointer over a channel is idiomatic.

 Experience shows this is practical.

 * Concurrency

 Sequential network address resolution, given a work list:

 .play research2/addr1.go /lookup/+1,/^}/-1

 * Concurrency

 Concurrent network address resolution, given a work list:

 .play research2/addr2.go /lookup/+1,/^}/-1

 * Concurrency

 Select statements: switch for communication.

 .play research2/select.go /select/,/^}/-1

 That's select that makes efficient implementation difficult.

 * Implementing Concurrency

 Challenge: Make channel communication scale

 - start with one global channel lock
 - per-channel locks, locked in address order for multi-channel operations

 Research question: lock-free channels?


 * Scheduling

 .image research2/gophercomplex6.jpg

 * Scheduling

 On the one hand we have arbitrary user programs:

 - fine-grained goroutines, coarse-grained goroutines or a mix of both
 - computational goroutines, IO-bound goroutines or a mix of both
 - arbitrary dynamic communication patterns
 - busy, idle, bursty programs

 No user hints!

 * Scheduling

 On the other hand we have complex hardware topology:

 - per-core caches
 - caches shared between cores
 - cores shared between hyper threads (HT)
 - multiple processors with non-uniform memory access (NUMA)

 * Scheduling

 Challenge: make it all magically work efficiently

 - start with one global lock for all scheduler state
 - distributed work-stealing scheduler with per-"processor" state
 - integrated network poller into scheduler
 - lock-free work queues

 * Scheduling

 Current scheduler:

  ┌─┐         ┌─┐         ┌─┐         ┌─┐                  ┌─┐
  │ │         │ │         │ │         │ │                  │ │
  ├─┤         ├─┤         ├─┤         ├─┤                  ├─┤ Global
  │ │         │G│         │ │         │ │                  │ │ state
  ├─┤         ├─┤         ├─┤         ├─┤                  ├─┤
  │G│         │G│         │G│         │ │                  │G│
  ├─┤         ├─┤         ├─┤         ├─┤                  ├─┤
  │G│         │G│         │G│         │G│                  │G│
  └┬┘         └┬┘         └┬┘         └┬┘                  └─┘
   │           │           │           │
   ↓           ↓           ↓           ↓
  ┌─┬──────┐  ┌─┬──────┐  ┌─┬──────┐  ┌─┬──────┐     ┌────┐┌──────┐┌───────┐
  │P│mcache│  │P│mcache│  │P│mcache│  │P│mcache│     │heap││timers││netpoll│
  └┬┴──────┘  └┬┴──────┘  └┬┴──────┘  └┬┴──────┘     └────┘└──────┘└───────┘
   │           │           │           │
   ↓           ↓           ↓           ↓
  ┌─┐         ┌─┐         ┌─┐         ┌─┐               ┌─┐ ┌─┐ ┌─┐
  │M│         │M│         │M│         │M│               │M│ │M│ │M│
  └─┘         └─┘         └─┘         └─┘               └─┘ └─┘ └─┘

 G - goroutine; P - logical processor; M - OS thread (machine)

 * Scheduling

 Want:

 - temporal locality to exploit caches
 - spatial locality to exploit NUMA
 - schedule mostly LIFO but ensure weak fairness
 - allocate local memory and stacks
 - scan local memory in GC
 - collocate communicating goroutines
 - distribute non-communicating goroutines
 - distribute timers and network poller
 - poll network on the same core where last read was issued


 * Garbage Collection

 * Garbage Collection

 Garbage collection simplifies APIs.

 - In C and C++, too much API design (and too much programming effort!) is about memory management.

 Fundamental to concurrency: too hard to track ownership otherwise.

 Fundamental to interfaces: memory management details do not bifurcate otherwise-similar APIs.

 Of course, adds cost, latency, complexity in run time system.

 * Garbage Collection

 Plenty of research about garbage collection, mostly in Java context.

 - Parallel stop-the-world
 - CMS: concurrent mark-and-sweep, stop-the-world compaction
 - G1: region-based incremental copying collector

 Java collectors usually:

 - are generational/incremental because allocation rate is high
 - compact memory to support generations
 - have pauses because concurrent compaction is tricky and slow

 * Garbage Collection

 But Go is very different!

 - User can avoid lots of allocations by embedding objects:

 	type Point struct {
 		X, Y int
 	}
 	type Rectangle struct {
 		Min, Max Point
 	}

 - Less pointers.
 - Lots of stack allocations.
 - Interior pointers are allowed:

 	p := &rect.Max

 - Hundreds of thousands of stacks (goroutines)
 - No object headers so far

 * Implementing Garbage Collection

 Current GC: stop the world, parallel mark, start the world, concurrent sweep.
 Concurrent mark is almost ready.

 Cannot reuse Java GC algorithms directly.

 Research question: what GC algorithm is the best fit for Go?
 Do we need generations? Do we need compaction? What are efficient data structures that support interior pointers?


 * Race and deadlock detection

 .image research2/race.png 160 600

 * Race detection

 Based on ThreadSanitizer runtime, originally mainly targeted C/C++.
 Traditional happens-before race detector based on vector clocks (devil in details!).
 Works fine for Go, except:

  $ go run -race lots_of_goroutines.go
  race: limit on 8192 simultaneously alive goroutines is exceeded, dying

 Research question: race dectector that efficiently supports hundreds of thousands of goroutines?

 * Deadlock detection

 Deadlock on mutexes due to lock order inversion:

  // thread 1                       // thread 2
  pthread_mutex_lock(&m1);          pthread_mutex_lock(&m2);
  pthread_mutex_lock(&m2);          pthread_mutex_lock(&m1);
  ...                               ...
  pthread_mutex_unlock(&m2);        pthread_mutex_unlock(&m1);
  pthread_mutex_unlock(&m1);        pthread_mutex_unlock(&m2);

 Lock order inversions are easy to detect:

 - build "M1 is locked under M2" relation.
 - if it becomes cyclic, there is a potential deadlock.
 - whenever a new edge is added to the graph, do DFS to find cycles.

 * Deadlock detection

 Go has channels and mutexes. Channels are semaphores. A mutex can be unlocked in
 a different goroutine, so it is essentially a binary semaphore too.

 Deadlock example:

 	// Parallel file tree walk.
 	func worker(pendingItems chan os.FileInfo)
 		for f := range pendingItems {
 			if f.IsDir() {
 				filepath.Walk(f.Name(), func(path string, info os.FileInfo, err error) error {
 					pendingItems <- info
 				})
 			} else {
 				visit(f)
 			}
 		}
 	}

 pendingItems channel has limited capacity. All workers can block on send to pendingItems.

 * Deadlock detection

 Another deadlock example:

  var (
  	c = make(chan T, 100)
  	mtx sync.RWMutex
   )

  // goroutine 1      // goroutine 2         // goroutine 3
  // does send        // does receive        // "resizes" the channel
  mtx.RLock()         mtx.RLock()            mtx.Lock()
  c <- v              v := <-c               tmp := make(chan T, 200)
  mtx.RUnlock()       mtx.RUnlock()          copyAll(c, tmp)
                                             c = tmp
                                             mtx.Unlock()

 RWMutex is fair for both readers and writers: when a writer arrives, new readers are not let to enter the critical section.
 Goroutine 1 blocks on chan send; then goroutine 3 blocks on mtx.Lock; then goroutine 2 blocks on mtx.RLock.

 * Deadlock detection

 Research question: how to detect deadlocks on semaphores?

 No known theory to date.


 * Testing of the implementation

 .image research2/gopherswrench.jpg 240 405

 * Testing of the implementation

 So now we have a new language with several complex implementations:

 - lexer
 - parser
 - transformation and optimization passes
 - code generation
 - linker
 - channel and map operations
 - garbage collector
 - ...

 *How*do*you*test*it?*

 * Testing of the implementation

 Csmith is a tool that generates random C programs that statically and dynamically conform to the C99 standard.

 .image research2/csmith.png

 * Testing of the implementation

 Gosmith is a tool that generates random Go programs that statically and dynamically conform to the Go standard.

 Turned out to be much simpler than C: no undefined behavior all around!

 - no unitialized variables
 - no concurrent mutations between sequence points (x[i++] = --i)
 - no UB during signed overflow
 - total 191 kinds of undefined behavior and 52 kinds of unspecified behavior in C

 * Testing of the implementation

 But generates uninteresting programs from execution point of view: most of them deadlock or crash on nil deref.

 Trophies:

 - 31 bugs in gc compiler
 - 18 bugs in gccgo compiler
 - 5 bugs in llgo compiler
 - 1 bug in gofmt
 - 3 bugs in the spec

 .image research2/emoji.png

 * Testing of the implementation

 Research question: how to generate random *interesting*concurrent* Go programs?

 Must:

 - create and wait for goroutines
 - communicate over channels
 - protect data with mutexes (reader-writer)
 - pass data ownership between goroutines (explicitly and implicitly)

 Must not:

 - deadlock
 - cause data races
 - have non-deterministic results


 * Research and Go

 Plenty of research questions about how to implement Go well.

 - Concurrency
 - Scheduling
 - Garbage collection
 - Race and deadlock detection
 - Testing of the implementation
 - [Polymorphism]
 - [Program translation]
	More Research Problems of Implementing Go

	Dmitry Vyukov
	Google

	http://golang.org/

	* About Go

	Go is an open source programming language that makes it easy to build simple, reliable, and efficient software.

	Design began in late 2007.

	- Robert Griesemer, Rob Pike, Ken Thompson
	- Russ Cox, Ian Lance Taylor

	Became open source in November 2009.

	Developed entirely in the open; very active community.
	Language stable as of Go 1, early 2012.
	Work continues.

	* Motivation for Go

	.image research2/datacenter.jpg

	* Motivation for Go

	Started as an answer to software problems at Google:

	- multicore processors
	- networked systems
	- massive computation clusters
	- scale: 10⁷⁺ lines of code
	- scale: 10³⁺ programmers
	- scale: 10⁶⁺ machines (design point)

	Deployed: parts of YouTube, dl.google.com, Blogger, Google Code, Google Fiber, ...

	* Go

	A simple but powerful and fun language.

	- start with C, remove complex parts
	- add interfaces, concurrency
	- also: garbage collection, closures, reflection, strings, ...

	For more background on design:

	- [[http://commandcenter.blogspot.com/2012/06/less-is-exponentially-more.html][Less is exponentially more]]
	- [[http://talks.golang.org/2012/splash.article][Go at Google: Language Design in the Service of Software Engineering]]

	* Research and Go

	Go is designed for building production systems at Google.

	- Goal: make that job easier, faster, better.
	- Non-goal: break new ground in programming language research

	Plenty of research questions about how to implement Go well.

	- Concurrency
	- Scheduling
	- Garbage collection
	- Race and deadlock detection
	- Testing of the implementation




	* Concurrency

	.image research2/busy.jpg

	* Concurrency

	Go provides two important concepts:

	A goroutine is a thread of control within the program, with its own local variables and stack. Cheap, easy to create.

	A channel carries typed messages between goroutines.

	* Concurrency

	.play research2/hello.go

	* Concurrency: CSP

	Channels adopted from Hoare's Communicating Sequential Processes.

	- Orthogonal to rest of language
	- Can keep familiar model for computation
	- Focus on _composition_ of regular code

	Go _enables_ simple, safe concurrent programming.
	It doesn't _forbid_ bad programming.

	Caveat: not purely memory safe; sharing is legal.
	Passing a pointer over a channel is idiomatic.

	Experience shows this is practical.

	* Concurrency

	Sequential network address resolution, given a work list:

	.play research2/addr1.go /lookup/+1,/^}/-1

	* Concurrency

	Concurrent network address resolution, given a work list:

	.play research2/addr2.go /lookup/+1,/^}/-1

	* Concurrency

	Select statements: switch for communication.

	.play research2/select.go /select/,/^}/-1

	That's select that makes efficient implementation difficult.

	* Implementing Concurrency

	Challenge: Make channel communication scale

	- start with one global channel lock
	- per-channel locks, locked in address order for multi-channel operations

	Research question: lock-free channels?




	* Scheduling

	.image research2/gophercomplex6.jpg

	* Scheduling

	On the one hand we have arbitrary user programs:

	- fine-grained goroutines, coarse-grained goroutines or a mix of both
	- computational goroutines, IO-bound goroutines or a mix of both
	- arbitrary dynamic communication patterns
	- busy, idle, bursty programs

	No user hints!

	* Scheduling

	On the other hand we have complex hardware topology:

	- per-core caches
	- caches shared between cores
	- cores shared between hyper threads (HT)
	- multiple processors with non-uniform memory access (NUMA)

	* Scheduling

	Challenge: make it all magically work efficiently

	- start with one global lock for all scheduler state
	- distributed work-stealing scheduler with per-"processor" state
	- integrated network poller into scheduler
	- lock-free work queues

	* Scheduling

	Current scheduler:

	┌─┐ ┌─┐ ┌─┐ ┌─┐ ┌─┐
	│ │ │ │ │ │ │ │ │ │
	├─┤ ├─┤ ├─┤ ├─┤ ├─┤ Global
	│ │ │G│ │ │ │ │ │ │ state
	├─┤ ├─┤ ├─┤ ├─┤ ├─┤
	│G│ │G│ │G│ │ │ │G│
	├─┤ ├─┤ ├─┤ ├─┤ ├─┤
	│G│ │G│ │G│ │G│ │G│
	└┬┘ └┬┘ └┬┘ └┬┘ └─┘
	│ │ │ │
	↓ ↓ ↓ ↓
	┌─┬──────┐ ┌─┬──────┐ ┌─┬──────┐ ┌─┬──────┐ ┌────┐┌──────┐┌───────┐
	│P│mcache│ │P│mcache│ │P│mcache│ │P│mcache│ │heap││timers││netpoll│
	└┬┴──────┘ └┬┴──────┘ └┬┴──────┘ └┬┴──────┘ └────┘└──────┘└───────┘
	│ │ │ │
	↓ ↓ ↓ ↓
	┌─┐ ┌─┐ ┌─┐ ┌─┐ ┌─┐ ┌─┐ ┌─┐
	│M│ │M│ │M│ │M│ │M│ │M│ │M│
	└─┘ └─┘ └─┘ └─┘ └─┘ └─┘ └─┘

	G - goroutine; P - logical processor; M - OS thread (machine)

	* Scheduling

	Want:

	- temporal locality to exploit caches
	- spatial locality to exploit NUMA
	- schedule mostly LIFO but ensure weak fairness
	- allocate local memory and stacks
	- scan local memory in GC
	- collocate communicating goroutines
	- distribute non-communicating goroutines
	- distribute timers and network poller
	- poll network on the same core where last read was issued





	* Garbage Collection

	* Garbage Collection

	Garbage collection simplifies APIs.

	- In C and C++, too much API design (and too much programming effort!) is about memory management.

	Fundamental to concurrency: too hard to track ownership otherwise.

	Fundamental to interfaces: memory management details do not bifurcate otherwise-similar APIs.

	Of course, adds cost, latency, complexity in run time system.

	* Garbage Collection

	Plenty of research about garbage collection, mostly in Java context.

	- Parallel stop-the-world
	- CMS: concurrent mark-and-sweep, stop-the-world compaction
	- G1: region-based incremental copying collector

	Java collectors usually:

	- are generational/incremental because allocation rate is high
	- compact memory to support generations
	- have pauses because concurrent compaction is tricky and slow

	* Garbage Collection

	But Go is very different!

	- User can avoid lots of allocations by embedding objects:

	type Point struct {
	X, Y int
	}
	type Rectangle struct {
	Min, Max Point
	}

	- Less pointers.
	- Lots of stack allocations.
	- Interior pointers are allowed:

	p := &rect.Max

	- Hundreds of thousands of stacks (goroutines)
	- No object headers so far

	* Implementing Garbage Collection

	Current GC: stop the world, parallel mark, start the world, concurrent sweep.
	Concurrent mark is almost ready.

	Cannot reuse Java GC algorithms directly.

	Research question: what GC algorithm is the best fit for Go?
	Do we need generations? Do we need compaction? What are efficient data structures that support interior pointers?




	* Race and deadlock detection

	.image research2/race.png 160 600

	* Race detection

	Based on ThreadSanitizer runtime, originally mainly targeted C/C++.
	Traditional happens-before race detector based on vector clocks (devil in details!).
	Works fine for Go, except:

	$ go run -race lots_of_goroutines.go
	race: limit on 8192 simultaneously alive goroutines is exceeded, dying

	Research question: race dectector that efficiently supports hundreds of thousands of goroutines?

	* Deadlock detection

	Deadlock on mutexes due to lock order inversion:

	// thread 1 // thread 2
	pthread_mutex_lock(&m1); pthread_mutex_lock(&m2);
	pthread_mutex_lock(&m2); pthread_mutex_lock(&m1);
	... ...
	pthread_mutex_unlock(&m2); pthread_mutex_unlock(&m1);
	pthread_mutex_unlock(&m1); pthread_mutex_unlock(&m2);

	Lock order inversions are easy to detect:

	- build "M1 is locked under M2" relation.
	- if it becomes cyclic, there is a potential deadlock.
	- whenever a new edge is added to the graph, do DFS to find cycles.

	* Deadlock detection

	Go has channels and mutexes. Channels are semaphores. A mutex can be unlocked in
	a different goroutine, so it is essentially a binary semaphore too.

	Deadlock example:

	// Parallel file tree walk.
	func worker(pendingItems chan os.FileInfo)
	for f := range pendingItems {
	if f.IsDir() {
	filepath.Walk(f.Name(), func(path string, info os.FileInfo, err error) error {
	pendingItems <- info
	})
	} else {
	visit(f)
	}
	}
	}

	pendingItems channel has limited capacity. All workers can block on send to pendingItems.

	* Deadlock detection

	Another deadlock example:

	var (
	c = make(chan T, 100)
	mtx sync.RWMutex
	)

	// goroutine 1 // goroutine 2 // goroutine 3
	// does send // does receive // "resizes" the channel
	mtx.RLock() mtx.RLock() mtx.Lock()
	c <- v v := <-c tmp := make(chan T, 200)
	mtx.RUnlock() mtx.RUnlock() copyAll(c, tmp)
	c = tmp
	mtx.Unlock()

	RWMutex is fair for both readers and writers: when a writer arrives, new readers are not let to enter the critical section.
	Goroutine 1 blocks on chan send; then goroutine 3 blocks on mtx.Lock; then goroutine 2 blocks on mtx.RLock.

	* Deadlock detection

	Research question: how to detect deadlocks on semaphores?

	No known theory to date.




	* Testing of the implementation

	.image research2/gopherswrench.jpg 240 405

	* Testing of the implementation

	So now we have a new language with several complex implementations:

	- lexer
	- parser
	- transformation and optimization passes
	- code generation
	- linker
	- channel and map operations
	- garbage collector
	- ...

	Howdoyoutestit?

	* Testing of the implementation

	Csmith is a tool that generates random C programs that statically and dynamically conform to the C99 standard.

	.image research2/csmith.png

	* Testing of the implementation

	Gosmith is a tool that generates random Go programs that statically and dynamically conform to the Go standard.

	Turned out to be much simpler than C: no undefined behavior all around!

	- no unitialized variables
	- no concurrent mutations between sequence points (x[i++] = --i)
	- no UB during signed overflow
	- total 191 kinds of undefined behavior and 52 kinds of unspecified behavior in C

	* Testing of the implementation

	But generates uninteresting programs from execution point of view: most of them deadlock or crash on nil deref.

	Trophies:

	- 31 bugs in gc compiler
	- 18 bugs in gccgo compiler
	- 5 bugs in llgo compiler
	- 1 bug in gofmt
	- 3 bugs in the spec

	.image research2/emoji.png

	* Testing of the implementation

	Research question: how to generate random interestingconcurrent* Go programs?

	Must:

	- create and wait for goroutines
	- communicate over channels
	- protect data with mutexes (reader-writer)
	- pass data ownership between goroutines (explicitly and implicitly)

	Must not:

	- deadlock
	- cause data races
	- have non-deterministic results





	* Research and Go

	Plenty of research questions about how to implement Go well.

	- Concurrency
	- Scheduling
	- Garbage collection
	- Race and deadlock detection
	- Testing of the implementation
	- [Polymorphism]
	- [Program translation]