# Debugging performance issues in Go programs

-- originally written by Dmitry Vyukov

Let's assume you have a Go program and want to improve its performance. There
are several tools available that can help with this task. These tools can help
you identify various types of hotspots (CPU, IO, memory); hotspots are the
places that you need to concentrate on in order to significantly improve
performance. However, another outcome is possible -- the tools can help you
identify obvious performance defects in the program. For example, you prepare an
SQL statement before each query while you could prepare it once at program
startup. Another example is an O(N^2) algorithm that somehow slipped in where an
obvious O(N) one exists and is expected. In order to identify such cases you
need to sanity check what you see in profiles. For example, in the first case
significant time spent in SQL statement preparation would be a red flag.

It's also important to understand the various bounding factors for performance.
For example, if the program communicates via a 100 Mbps network link and already
utilizes more than 90 Mbps, there is not much you can do with the program to
improve its performance. There are similar bounding factors for disk IO, memory
consumption and computational tasks. With that in mind we can look at the
available tools.

Note: The tools can interfere with each other. For example, precise memory
profiling skews CPU profiles, goroutine blocking profiling affects the scheduler
trace, etc. Use tools in isolation to get more precise info.
## CPU Profiler

The Go runtime contains a built-in CPU profiler, which shows what functions
consume what percent of CPU time. There are three ways you can get access to
it:

1. The simplest one is the -cpuprofile flag of the 'go test'
   (http://golang.org/cmd/go/#hdr-Description_of_testing_flags) command. For
   example, the following command:

        $ go test -run=none -bench=ClientServerParallel4 -cpuprofile=cprof net/http

   will profile the given benchmark and write the CPU profile into the 'cprof'
   file. Then:

        $ go tool pprof --text http.test cprof

   will print a list of the hottest functions.

   There are several output types available; the most useful ones are:
   `--text`, `--web` and `--list`. Run `go tool pprof` to get the complete
   list. The obvious drawback of this option is that it works only for tests.

2. [net/http/pprof](http://golang.org/pkg/net/http/pprof): This is the ideal
   solution for network servers. You merely need to import `net/http/pprof`
   (see the sketch after this list), and collect profiles with:

        $ go tool pprof --text mybin http://myserver:6060/debug/pprof/profile

3. Manual profile collection. You need to import
   [runtime/pprof](http://golang.org/pkg/runtime/pprof/) and add the following
   code to the main function:

        if *flagCpuprofile != "" {
            f, err := os.Create(*flagCpuprofile)
            if err != nil {
                log.Fatal(err)
            }
            pprof.StartCPUProfile(f)
            defer pprof.StopCPUProfile()
        }

   The profile will be written to the specified file; visualize it the same way
   as in the first option.
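
Returning to the second option: importing `net/http/pprof` only registers the
profiling handlers, so the program must also be serving HTTP somewhere. A
minimal sketch, assuming the conventional localhost:6060 address (any listen
address works):

    package main

    import (
        "log"
        "net/http"
        _ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
    )

    func main() {
        // Serve the profiling endpoints. A real server would also register its
        // own handlers, or run this listener on a separate goroutine.
        log.Fatal(http.ListenAndServe("localhost:6060", nil))
    }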

Here is an example of a profile visualized with the `--web` option:
[cpu_profile.png]

You can investigate a single function with the `--list=funcname` option. For
example, the following profile shows that the time was spent in the `append`
function:

    .   .  93: func (bp *buffer) WriteRune(r rune) error {
    .   .  94:     if r < utf8.RuneSelf {
    5   5  95:         *bp = append(*bp, byte(r))
    .   .  96:         return nil
    .   .  97:     }
    .   .  98:
    .   .  99:     b := *bp
    .   . 100:     n := len(b)
    .   . 101:     for n+utf8.UTFMax > cap(b) {
    .   . 102:         b = append(b, 0)
    .   . 103:     }
    .   . 104:     w := utf8.EncodeRune(b[n:n+utf8.UTFMax], r)
    .   . 105:     *bp = b[:n+w]
    .   . 106:     return nil
    .   . 107: }

There are also 3 special entries that the profiler uses when it can't unwind
the stack: GC, System and ExternalCode. GC means time spent during garbage
collection; see the Memory Profiler and Garbage Collector Trace sections below
for optimization suggestions. System means time spent in the goroutine
scheduler, stack management code and other auxiliary runtime code. ExternalCode
means time spent in native dynamic libraries.

Here are some hints on how to interpret what you see in the profile.

If you see lots of time spent in the `runtime.mallocgc` function, the program
potentially makes an excessive number of small memory allocations. The profile
will tell you where the allocations are coming from. See the memory profiler
section for suggestions on how to optimize this case.

If lots of time is spent in channel operations, sync.Mutex code and other
synchronization primitives or the System component, the program probably
suffers from contention. Consider restructuring the program to eliminate
frequently accessed shared resources. Common techniques for this include
sharding/partitioning, local buffering/batching and copy-on-write.

If lots of time is spent in `syscall.Read/Write`, the program potentially makes
an excessive number of small reads and writes. Bufio wrappers around os.File or
net.Conn can help in this case.
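
For example, a minimal sketch of such a fix, assuming a hypothetical function
that writes many small lines to a net.Conn:

    import (
        "bufio"
        "fmt"
        "net"
    )

    // writeItems coalesces many small writes into a few large syscalls by
    // wrapping the connection in a bufio.Writer.
    func writeItems(conn net.Conn, n int) error {
        w := bufio.NewWriter(conn)
        for i := 0; i < n; i++ {
            if _, err := fmt.Fprintf(w, "item %d\n", i); err != nil {
                return err
            }
        }
        return w.Flush() // push out whatever is still buffered
    }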

If lots of time is spent in the GC component, the program either allocates too
many transient objects or the heap size is very small, so garbage collections
happen too frequently. See the Garbage Collector Trace and Memory Profiler
sections for optimization suggestions.

Note: On darwin the CPU profiler currently only [works on El Capitan or newer](https://code.google.com/p/go/issues/detail?id=6047).

Note: On Windows you need to install Cygwin, Perl and Graphviz to generate
svg/web profiles.


## Memory Profiler

The memory profiler shows what functions allocate heap memory. You can collect
it in similar ways as the CPU profile: with `go test --memprofile`
(http://golang.org/cmd/go/#hdr-Description_of_testing_flags), with
[net/http/pprof](http://golang.org/pkg/net/http/pprof) via
http://myserver:6060/debug/pprof/heap or by calling
[runtime/pprof.WriteHeapProfile](http://golang.org/pkg/runtime/pprof/#WriteHeapProfile).
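
For the last variant, a minimal sketch (the file name is an arbitrary example;
the explicit runtime.GC call is optional but gives up-to-date statistics):

    import (
        "log"
        "os"
        "runtime"
        "runtime/pprof"
    )

    func dumpHeapProfile() {
        f, err := os.Create("mem.prof") // arbitrary output file name
        if err != nil {
            log.Fatal(err)
        }
        defer f.Close()
        runtime.GC() // get up-to-date allocation statistics
        if err := pprof.WriteHeapProfile(f); err != nil {
            log.Fatal(err)
        }
    }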

You can visualize only allocations live at the time of profile collection
(the `--inuse_space` flag to `pprof`, the default), or all allocations made
since program start (the `--alloc_space` flag to `pprof`). The former is useful
for profiles collected with net/http/pprof on live applications, the latter is
useful for profiles collected at program end (otherwise you will see an almost
empty profile).

Note: the memory profiler is a sampling profiler, that is, it collects
information only about some subset of memory allocations. The probability of
sampling an object is proportional to its size. You can change the sampling
rate with the `go test --memprofilerate` flag, or by setting the
`runtime.MemProfileRate` variable at program startup. A rate of 1 will lead to
collection of information about all allocations, but it can slow down
execution. The default sampling rate is 1 sample per 512KB of allocated memory.
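
For example, a sketch of capturing every allocation (set it as early as
possible in main, before the allocations of interest happen):

    import "runtime"

    func main() {
        // Record every allocation. This is the most precise setting but can
        // noticeably slow down the program; the default is one sample per
        // 512KB of allocated memory.
        runtime.MemProfileRate = 1

        // ... rest of the program ...
    }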

You can also visualize the number of bytes allocated or the number of objects
allocated (`--inuse/alloc_space` and `--inuse/alloc_objects` flags,
respectively). The profiler tends to sample larger objects more often. But it's
important to understand that large objects affect memory consumption and GC
time, while a large number of tiny allocations affects execution speed (and GC
time to some degree as well). So it may be useful to look at both.

Objects can be persistent or transient. If you have several large persistent
objects allocated at program start, they will most likely be sampled by the
profiler (because they are large). Such objects do affect memory consumption
and GC time, but they do not affect normal execution speed (no memory
management operations happen on them). On the other hand, if you have a large
number of objects with very short lifetimes, they can be barely represented in
the profile (if you use the default `--inuse_space` mode). But they do
significantly affect execution speed, because they are constantly allocated and
freed. So, once again, it may be useful to look at both types of objects.
Generally, if you want to reduce memory consumption, look at the
`--inuse_space` profile collected during normal program operation. If you want
to improve execution speed, look at the `--alloc_objects` profile collected
after significant running time or at program end.
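
For example, assuming a binary `mybin` and a collected profile `heap.prof`
(hypothetical names), the two views can be compared with:

    $ go tool pprof --inuse_space --text mybin heap.prof
    $ go tool pprof --alloc_objects --text mybin heap.prof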

There are several flags that control reporting granularity. `--functions` makes
pprof report at function level (the default). `--lines` makes pprof report at
source line level, which is useful if hot functions allocate on different
lines. And there are also `--addresses` and `--files` for exact instruction
address and file level, respectively.

There is a useful option for the memory profile -- you can look at it right in
the browser (provided that you imported `net/http/pprof`). If you open
http://myserver:6060/debug/pprof/heap?debug=1 you will see the heap profile
along the lines of:

    heap profile: 4: 266528 [123: 11284472] @ heap/1048576
    1: 262144 [4: 376832] @ 0x28d9f 0x2a201 0x2a28a 0x2624d 0x26188 0x94ca3 0x94a0b 0x17add6 0x17ae9f 0x1069d3 0xfe911 0xf0a3e 0xf0d22 0x21a70
    # 0x2a201 cnew+0xc1 runtime/malloc.goc:718
    # 0x2a28a runtime.cnewarray+0x3a runtime/malloc.goc:731
    # 0x2624d makeslice1+0x4d runtime/slice.c:57
    # 0x26188 runtime.makeslice+0x98 runtime/slice.c:38
    # 0x94ca3 bytes.makeSlice+0x63 bytes/buffer.go:191
    # 0x94a0b bytes.(*Buffer).ReadFrom+0xcb bytes/buffer.go:163
    # 0x17add6 io/ioutil.readAll+0x156 io/ioutil/ioutil.go:32
    # 0x17ae9f io/ioutil.ReadAll+0x3f io/ioutil/ioutil.go:41
    # 0x1069d3 godoc/vfs.ReadFile+0x133 godoc/vfs/vfs.go:44
    # 0xfe911 godoc.func·023+0x471 godoc/meta.go:80
    # 0xf0a3e godoc.(*Corpus).updateMetadata+0x9e godoc/meta.go:101
    # 0xf0d22 godoc.(*Corpus).refreshMetadataLoop+0x42 godoc/meta.go:141

    2: 4096 [2: 4096] @ 0x28d9f 0x29059 0x1d252 0x1d450 0x106993 0xf1225 0xe1489 0xfbcad 0x21a70
    # 0x1d252 newdefer+0x112 runtime/panic.c:49
    # 0x1d450 runtime.deferproc+0x10 runtime/panic.c:132
    # 0x106993 godoc/vfs.ReadFile+0xf3 godoc/vfs/vfs.go:43
    # 0xf1225 godoc.(*Corpus).parseFile+0x75 godoc/parser.go:20
    # 0xe1489 godoc.(*treeBuilder).newDirTree+0x8e9 godoc/dirtrees.go:108
    # 0xfbcad godoc.func·002+0x15d godoc/dirtrees.go:100

The numbers in the beginning of each entry ("1: 262144 [4: 376832]") represent
the number of currently live objects, the amount of memory occupied by live
objects, the total number of allocations and the amount of memory occupied by
all allocations, respectively.

Optimizations are usually application-specific, but here are some common
suggestions.

1. Combine objects into larger objects. For example, replace a `*bytes.Buffer`
   struct member with `bytes.Buffer` (you can preallocate the buffer for
   writing by calling `bytes.Buffer.Grow` later). This will reduce the number
   of memory allocations (faster) and also reduce pressure on the garbage
   collector (faster garbage collections).

2. Local variables that escape from their declaration scope get promoted into
   heap allocations. The compiler generally can't prove that several variables
   have the same lifetime, so it allocates each such variable separately. So
   you can use the above advice for local variables as well. For example,
   replace:

        for k, v := range m {
            k, v := k, v // copy for capturing by the goroutine
            go func() {
                // use k and v
            }()
        }

   with:

        for k, v := range m {
            x := struct{ k, v string }{k, v} // copy for capturing by the goroutine
            go func() {
                // use x.k and x.v
            }()
        }

   This replaces two memory allocations with a single allocation. However, this
   optimization usually negatively affects code readability, so use it
   reasonably.

3. A special case of allocation combining is slice array preallocation. If you
   know the typical size of the slice, you can preallocate a backing array for
   it as follows:

        type X struct {
            buf      []byte
            bufArray [16]byte // Buf usually does not grow beyond 16 bytes.
        }

        func MakeX() *X {
            x := &X{}
            // Preinitialize buf with the backing array.
            x.buf = x.bufArray[:0]
            return x
        }

4. If possible use smaller data types. For example, use `int8` instead of
   `int`.

5. Objects that do not contain any pointers (note that strings, slices, maps
   and chans contain implicit pointers) are not scanned by the garbage
   collector. For example, a 1GB byte slice virtually does not affect garbage
   collection time. So if you remove pointers from actively used objects, it
   can positively impact garbage collection time. Some possibilities are:
   replace pointers with indices, or split an object into two parts, one of
   which does not contain pointers.

6. Use freelists to reuse transient objects and reduce the number of
   allocations. The standard library contains the
   [sync.Pool](http://tip.golang.org/pkg/sync/#Pool) type, which allows the
   same object to be reused several times between garbage collections (see the
   sketch after this list). However, be aware that, as with any manual memory
   management scheme, incorrect use of sync.Pool can lead to use-after-free
   bugs.
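
Here is a minimal sketch of such a sync.Pool-based freelist for reusable byte
buffers (the names and the processing logic are made up for illustration):

    import (
        "bytes"
        "sync"
    )

    // bufPool reuses bytes.Buffer objects between garbage collections instead
    // of allocating a fresh one for every request.
    var bufPool = sync.Pool{
        New: func() interface{} { return new(bytes.Buffer) },
    }

    func process(data []byte) {
        buf := bufPool.Get().(*bytes.Buffer)
        buf.Reset() // reuse the underlying storage
        buf.Write(data)
        // ... use buf, but do not retain it after the Put call below ...
        bufPool.Put(buf)
    }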

You can also use the Garbage Collector Trace (see below) to get some insights
into memory issues.

TODO(dvyukov): mention that stats are updated in a deferred way: "Memprof stats
are updated in a deferred way. This is required in order to present a consistent
picture in the situation when allocs are coming continuously and frees are
coming in batches afterwards. Several consecutive GCs push the update pipeline
forward. That's what you observe. So if you profile a live server then any
sample will give you a consistent snapshot. However, if the program completes
some activity and you want to collect the snapshot after this activity, then you
need to execute 2 or 3 GCs before the collection."

## Blocking Profiler

The blocking profiler shows where goroutines block waiting on synchronization
primitives (including timer channels). You can collect it in similar ways as
the CPU profile: with `go test --blockprofile`
(http://golang.org/cmd/go/#hdr-Description_of_testing_flags), with
[net/http/pprof](http://golang.org/pkg/net/http/pprof) via
http://myserver:6060/debug/pprof/block or by calling
[runtime/pprof.Lookup("block").WriteTo](http://golang.org/pkg/runtime/pprof/#Profile.WriteTo).

But there is a significant caveat -- the blocking profiler is not enabled by
default. `go test --blockprofile` will enable it for you automatically. However,
if you use `net/http/pprof` or `runtime/pprof`, you need to enable it manually
(otherwise the profile will be empty). To enable the blocking profiler, call
[runtime.SetBlockProfileRate](http://golang.org/pkg/runtime/#SetBlockProfileRate).
SetBlockProfileRate controls the fraction of goroutine blocking events that are
reported in the blocking profile. The profiler aims to sample an average of one
blocking event per the specified number of nanoseconds spent blocked. To
include every blocking event in the profile, set the rate to 1.
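
A minimal sketch of enabling it manually, e.g. early in main or in an init
function:

    import "runtime"

    func init() {
        // Report every blocking event. This is the most detailed and most
        // expensive setting; a larger rate (e.g. 1e6, i.e. roughly one sample
        // per millisecond spent blocked) is cheaper for production servers.
        runtime.SetBlockProfileRate(1)
    }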

If a function contains several blocking operations and it's not obvious which
one leads to blocking, use the `--lines` flag to pprof.

Note that not all blocking is bad. When a goroutine blocks, the underlying
worker thread simply switches to another goroutine. So blocking in the
cooperative Go environment is very different from blocking on a mutex in
non-cooperative systems (e.g. typical C++ or Java threading libraries, where
blocking leads to thread idling and expensive thread context switches). To give
you some feeling for this, let's consider some examples.

Blocking on a time.Ticker is usually OK. If a goroutine blocks on a Ticker for
10 seconds, you will see 10 seconds of blocking in the profile, which is
perfectly fine. Blocking on `sync.WaitGroup` is frequently OK. For example, if
a task takes 10 seconds, the goroutine waiting on a WaitGroup for completion
will account for 10 seconds of blocking in the profile. Blocking on sync.Cond
may or may not be OK, depending on the situation. A consumer blocking on a
channel suggests slow producers or lack of work. A producer blocking on a
channel suggests that consumers are slower, but this is frequently OK. Blocking
on a channel-based semaphore shows how many goroutines are gated on the
semaphore. Blocking on a sync.Mutex or sync.RWMutex is usually bad. You can use
the `--ignore` flag to pprof to exclude known uninteresting blocking from a
profile during visualization.

Blocking of goroutines can lead to two negative consequences:

1. The program does not scale with processors due to lack of work. The
   Scheduler Trace can help to identify this case.

2. Excessive goroutine blocking/unblocking consumes CPU time. The CPU Profiler
   can help to identify this case (look at the System component).

Here are some common suggestions that can help to reduce goroutine blocking:

1. Use sufficiently buffered channels in producer-consumer scenarios (see the
   sketch after this list). Unbuffered channels substantially limit available
   parallelism in the program.

2. Use `sync.RWMutex` instead of `sync.Mutex` for read-mostly workloads.
   Readers never block other readers in `sync.RWMutex`, even at the
   implementation level.

3. In some cases it's possible to remove mutexes entirely by using the
   copy-on-write technique. If the protected data structure is modified
   infrequently and it's feasible to make copies of it, then it can be updated
   as follows:

        type Config struct {
            Routes   map[string]net.Addr
            Backends []net.Addr
        }

        var config atomic.Value // actual type is *Config

        // Worker goroutines use this function to obtain the current config.
        // UpdateConfig must be called at least once before this func.
        func CurrentConfig() *Config {
            return config.Load().(*Config)
        }

        // Background goroutine periodically creates a new Config object
        // and sets it as current using this function.
        func UpdateConfig(cfg *Config) {
            config.Store(cfg)
        }

   This pattern prevents the writer from blocking readers during updates.

4. Partitioning is another general technique for reducing contention/blocking
   on shared mutable data structures. Below is an example of how to partition a
   hashmap:

        type Partition struct {
            sync.RWMutex
            m map[string]string
        }

        const partCount = 64
        var m [partCount]Partition

        func Find(k string) string {
            idx := hash(k) % partCount
            part := &m[idx]
            part.RLock()
            v := part.m[k]
            part.RUnlock()
            return v
        }

5. Local caching and batching of updates can help to reduce contention on
   un-partitionable data structures. Below you can see how to batch sends to a
   channel:

        const CacheSize = 16

        type Cache struct {
            buf [CacheSize]int
            pos int
        }

        func Send(c chan [CacheSize]int, cache *Cache, value int) {
            cache.buf[cache.pos] = value
            cache.pos++
            if cache.pos == CacheSize {
                c <- cache.buf
                cache.pos = 0
            }
        }

   This technique is not limited to channels. It can be used to batch updates
   to a map, batch allocations, etc.

6. Use [sync.Pool](http://tip.golang.org/pkg/sync/#Pool) for freelists instead
   of chan-based or mutex-protected freelists. sync.Pool uses smart techniques
   internally to reduce blocking.
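
As a sketch of the first suggestion, a buffered channel lets the producer run
ahead of the consumer instead of blocking on every send (the buffer size of 128
is an arbitrary example):

    func producerConsumer(items []int) {
        // The buffer lets the producer keep going while the consumer is busy;
        // with an unbuffered channel every send would block until a receive.
        ch := make(chan int, 128)

        go func() {
            for _, it := range items {
                ch <- it
            }
            close(ch)
        }()

        for it := range ch {
            _ = it // consume the item
        }
    }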

## Goroutine Profiler

The goroutine profiler simply gives you the current stacks of all live
goroutines in the process. It can be handy to debug load balancing issues (see
the Scheduler Trace section below), or to debug deadlocks. The profile makes
sense only for a running app, so go test does not expose it. You can collect
the profile with [net/http/pprof](http://golang.org/pkg/net/http/pprof) via
http://myserver:6060/debug/pprof/goroutine and visualize it to svg/pdf, or by
calling
[runtime/pprof.Lookup("goroutine").WriteTo](http://golang.org/pkg/runtime/pprof/#Profile.WriteTo).
But the most useful way is to type
http://myserver:6060/debug/pprof/goroutine?debug=2 in your browser, which will
give you symbolized stacks similar to what you see when a program crashes. Note
that goroutines in "syscall" state consume an OS thread, while other goroutines
do not (except for goroutines that called runtime.LockOSThread, which is,
unfortunately, not visible in the profile). Note that goroutines in "IO wait"
state also do not consume threads; they are parked on the non-blocking network
poller (which uses epoll/kqueue/GetQueuedCompletionStatus to unpark goroutines
later).
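
For the manual variant, a small sketch of dumping all goroutine stacks to
stderr (debug level 2 produces the panic-style symbolized stacks mentioned
above):

    import (
        "os"
        "runtime/pprof"
    )

    // dumpGoroutines writes the stacks of all live goroutines to stderr.
    func dumpGoroutines() {
        pprof.Lookup("goroutine").WriteTo(os.Stderr, 2)
    }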

## Garbage Collector Trace

Aside from the profiling tools, there is another kind of tool available --
tracers. They allow you to trace garbage collections, the memory allocator and
the goroutine scheduler state. To enable the garbage collector (GC) trace, run
the program with the `GODEBUG=gctrace=1` environment variable:

    $ GODEBUG=gctrace=1 ./myserver

Then the program will print output similar to the following during execution:

    gc9(2): 12+1+744+8 us, 2 -> 10 MB, 108615 (593983-485368) objects, 4825/3620/0 sweeps, 0(0) handoff, 6(91) steal, 16/1/0 yields
    gc10(2): 12+6769+767+3 us, 1 -> 1 MB, 4222 (593983-589761) objects, 4825/0/1898 sweeps, 0(0) handoff, 6(93) steal, 16/10/2 yields
    gc11(2): 799+3+2050+3 us, 1 -> 69 MB, 831819 (1484009-652190) objects, 4825/691/0 sweeps, 0(0) handoff, 5(105) steal, 16/1/0 yields

Let's consider the meaning of these numbers. One line per GC is printed. The
first number ("gc9") is the number of the GC (this is the 9th GC since program
start). The number in parens ("(2)") is the number of worker threads that
participated in the GC. The next 4 numbers ("12+1+744+8 us") mean
stop-the-world, sweeping, marking and waiting for worker threads to finish, in
microseconds, respectively. The next 2 numbers ("2 -> 10 MB") mean the size of
the live heap after the previous GC and the full heap size (including garbage)
before the current GC. The next 3 numbers ("108615 (593983-485368) objects")
are the total number of objects in the heap (including garbage) and the total
numbers of memory allocation and free operations. The next 3 numbers
("4825/3620/0 sweeps") characterize the sweep phase (of the previous GC): there
were 4825 memory spans in total, 3620 were swept on demand or in background,
and 0 were swept during the stop-the-world phase (the rest were unused spans).
The next 4 numbers ("0(0) handoff, 6(91) steal") characterize load balancing
during the parallel mark phase: there were 0 object handoff operations (0
objects were handed off), and 6 steal operations (91 objects were stolen). The
last 3 numbers ("16/1/0 yields") characterize the effectiveness of the parallel
mark phase: there were a total of 17 yield operations while waiting for another
thread.

The GC is a [mark-and-sweep
collector](http://www.brpreiss.com/books/opus5/html/page424.html). Total GC
time can be expressed as:

    Tgc = Tseq + Tmark + Tsweep

where `Tseq` is the time to stop user goroutines and do some preparation
(usually small); `Tmark` is the heap marking time (marking happens while all
user goroutines are stopped, and thus can significantly affect processing
latency); and `Tsweep` is the heap sweeping time (sweeping generally happens
concurrently with normal program execution, and so is not as critical for
latency).

Marking time can be approximately expressed as:

    Tmark = C1*Nlive + C2*MEMlive_ptr + C3*Nlive_ptr

where `Nlive` is the number of live objects in the heap during GC,
`MEMlive_ptr` is the amount of memory occupied by live objects with pointers,
and `Nlive_ptr` is the number of pointers in live objects.

Sweeping time can be approximately expressed as:

    Tsweep = C4*MEMtotal + C5*MEMgarbage

where `MEMtotal` is the total amount of heap memory and `MEMgarbage` is the
amount of garbage in the heap.

The next GC happens after the program has allocated an extra amount of memory
proportional to the amount already in use. The proportion is controlled by the
GOGC environment variable (100 by default). If GOGC=100 and the program is
using 4M of heap memory, the runtime will trigger a GC again when the program
gets to 8M. This keeps the GC cost in linear proportion to the allocation cost.
Adjusting GOGC changes the linear constant and also the amount of extra memory
used.

Only sweeping depends on the total size of the heap, and sweeping happens
concurrently with normal program execution. So it can make sense to set GOGC to
a higher value (200, 300, 500, etc) if you can afford the extra memory
consumption. For example, GOGC=300 can reduce garbage collection overhead by up
to 2 times while keeping latencies the same (at the cost of a 2 times larger
heap).

GC is parallel and generally scales well with hardware parallelism. So it can
make sense to set GOMAXPROCS to a higher value even for sequential programs,
just to speed up garbage collections. However, note that the number of garbage
collector threads is currently bounded by 8.

## Memory Allocator Trace

The memory allocator trace simply dumps all memory allocation and free
operations onto the console. It's enabled with the GODEBUG=allocfreetrace=1
environment variable. The output looks along the lines of:

    tracealloc(0xc208062500, 0x100, array of parse.Node)
    goroutine 16 [running]:
    runtime.mallocgc(0x100, 0x3eb7c1, 0x0)
        runtime/malloc.goc:190 +0x145 fp=0xc2080b39f8
    runtime.growslice(0x31f840, 0xc208060700, 0x8, 0x8, 0x1, 0x0, 0x0, 0x0)
        runtime/slice.goc:76 +0xbb fp=0xc2080b3a90
    text/template/parse.(*Tree).parse(0xc2080820e0, 0xc208023620, 0x0, 0x0)
        text/template/parse/parse.go:289 +0x549 fp=0xc2080b3c50
    ...

    tracefree(0xc208002d80, 0x120)
    goroutine 16 [running]:
    runtime.MSpan_Sweep(0x73b080)
        runtime/mgc0.c:1880 +0x514 fp=0xc20804b8f0
    runtime.MCentral_CacheSpan(0x69c858)
        runtime/mcentral.c:48 +0x2b5 fp=0xc20804b920
    runtime.MCache_Refill(0x737000, 0xc200000012)
        runtime/mcache.c:78 +0x119 fp=0xc20804b950
    ...

The trace contains the address of the memory block, its size and type, the
goroutine id and the stack trace. It's probably more useful for debugging, but
it can give very fine-grained info for allocation optimizations as well.

## Scheduler Trace

The scheduler trace can provide insights into the dynamic behavior of the
goroutine scheduler and allows you to debug load balancing and scalability
issues. To enable the scheduler trace, run the program with the
GODEBUG=schedtrace=1000 environment variable (the value is the output period in
ms; in this case it's once per second):

    $ GODEBUG=schedtrace=1000 ./myserver

Then the program will print output similar to the following during execution:

    SCHED 1004ms: gomaxprocs=4 idleprocs=0 threads=11 idlethreads=4 runqueue=8 [0 1 0 3]
    SCHED 2005ms: gomaxprocs=4 idleprocs=0 threads=11 idlethreads=5 runqueue=6 [1 5 4 0]
    SCHED 3008ms: gomaxprocs=4 idleprocs=0 threads=11 idlethreads=4 runqueue=10 [2 2 2 1]

The first number ("1004ms") is the time since program start. Gomaxprocs is the
current value of GOMAXPROCS. Idleprocs is the number of idling processors (the
rest are executing Go code). Threads is the total number of worker threads
created by the scheduler (threads can be in 3 states: executing Go code
(gomaxprocs-idleprocs), executing syscalls/cgocalls, or idle). Idlethreads is
the number of idling worker threads. Runqueue is the length of the global queue
of runnable goroutines. The numbers in square brackets ("[0 1 0 3]") are the
lengths of the per-processor queues of runnable goroutines. The sum of the
lengths of the global and local queues represents the total number of
goroutines available for execution.

Note: You can combine any of the tracers as
GODEBUG=gctrace=1,allocfreetrace=1,schedtrace=1000.

Note: There is also a detailed scheduler trace, which you can enable with
GODEBUG=schedtrace=1000,scheddetail=1. It prints detailed info about every
goroutine, worker thread and processor. We won't describe its format here as
it's mainly useful for scheduler developers; you can find details in
[src/pkg/runtime/proc.c](https://code.google.com/p/go/source/browse/src/pkg/runtime/proc.c).

The scheduler trace is useful when a program does not scale linearly with
GOMAXPROCS and/or does not consume 100% of CPU time. The ideal situation is
when all processors are busy executing Go code, the number of threads is
reasonable, there is plenty of work in all queues and the work is reasonably
evenly distributed:

    gomaxprocs=8 idleprocs=0 threads=40 idlethreads=5 runqueue=10 [20 20 20 20 20 20 20 20]

A bad situation is when some of the above does not hold. For example, the
following sample demonstrates a shortage of work to keep all processors busy:

    gomaxprocs=8 idleprocs=6 threads=40 idlethreads=30 runqueue=0 [0 2 0 0 0 1 0 0]

Note: use OS-provided means to measure actual CPU utilization as the ultimate
characteristic. On the Unix family of operating systems it is the top command;
on Windows it is the Task Manager.

You can use the goroutine profiler to understand where goroutines block in the
case of a work shortage. Note that load imbalance is not ultimately bad as long
as all processors are busy; it will just cause some moderate load balancing
overhead.

## Memory Statistics

The Go runtime exposes coarse-grained memory statistics via the
[runtime.ReadMemStats](http://golang.org/pkg/runtime/#ReadMemStats) function.
The statistics are also exposed via net/http/pprof at the bottom of
http://myserver:6060/debug/pprof/heap?debug=1. The statistics are described
[here](http://golang.org/pkg/runtime/#MemStats). Some of the interesting fields
are:

1. HeapAlloc - current heap size.
2. HeapSys - total heap size.
3. HeapObjects - total number of objects in the heap.
4. HeapReleased - amount of memory released to the OS; the runtime releases
   memory back to the OS if it is unused for 5 minutes, and you can force this
   process with runtime/debug.FreeOSMemory.
5. Sys - total amount of memory allocated from the OS.
6. Sys-HeapReleased - effective memory consumption of the program.
7. StackSys - memory consumed for goroutine stacks (note that some stacks are
   allocated from the heap and are not accounted for here; unfortunately there
   is no way to get the total size of stacks
   (https://code.google.com/p/go/issues/detail?id=7468)).
8. MSpanSys/MCacheSys/BuckHashSys/GCSys/OtherSys - amount of memory allocated
   by the runtime for various auxiliary purposes; they are generally not
   interesting, unless they are too high.
9. PauseNs - durations of the last garbage collections.
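
A minimal sketch of reading a few of these fields in code (the field selection
and log format are arbitrary):

    import (
        "log"
        "runtime"
    )

    func logMemStats() {
        var ms runtime.MemStats
        // ReadMemStats stops the world briefly, so don't call it in a hot loop.
        runtime.ReadMemStats(&ms)
        log.Printf("heap: alloc=%d MB sys=%d MB objects=%d released=%d MB",
            ms.HeapAlloc>>20, ms.HeapSys>>20, ms.HeapObjects, ms.HeapReleased>>20)
    }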

## Heap Dumper

The last available tool is the heap dumper; it can write the state of the whole
heap into a file for future exploration. It can be useful for identifying
memory leaks and getting insights into program memory consumption.

First, you need to write the dump using the
[runtime/debug.WriteHeapDump](http://tip.golang.org/pkg/runtime/debug/#WriteHeapDump)
function:

    f, err := os.Create("heapdump")
    if err != nil { ... }
    debug.WriteHeapDump(f.Fd())

Then you can either render it to a dot file with a graphical representation of
the heap or convert it to hprof format. To render it to a dot file:

    $ go get github.com/randall77/hprof/dumptodot
    $ dumptodot heapdump mybinary > heap.dot

and open `heap.dot` with Graphviz.

To convert it to `hprof` format:

    $ go get github.com/randall77/hprof/dumptohprof
    $ dumptohprof heapdump heap.hprof
    $ jhat heap.hprof

and navigate your browser to http://myserver:7000.

## Concluding Remarks

Optimization is an open problem; there are simple recipes that you can use to
improve performance, but sometimes optimization requires a complete
re-architecture of the program. We hope that the tools described here will be a
valuable addition to your toolbox, which you can use to at least analyze and
understand what happens. [Profiling Go
Programs](http://blog.golang.org/profiling-go-programs) is a good tutorial on
using the CPU and memory profilers to optimize a simple program.