# Debugging performance issues in Go programs

-- originally written by Dmitry Vyukov

Let's assume you have a Go program and want to improve its performance. There
are several tools available that can help with this task. These tools can help
you identify various types of hotspots (CPU, IO, memory) -- the places that you
need to concentrate on in order to significantly improve performance. However,
another outcome is possible -- the tools can help you identify obvious
performance defects in the program. For example, you prepare an SQL statement
before each query while you could prepare it once at program startup. Another
example is an O(N^2) algorithm that somehow slipped in where an obvious O(N)
one exists and is expected. In order to identify such cases you need to
sanity-check what you see in profiles. For example, in the first case
significant time spent in SQL statement preparation would be a red flag.

It's also important to understand the various bounding factors for performance.
For example, if the program communicates via a 100 Mbps network link and
already utilizes >90 Mbps, there is not much you can do with the program to
improve its performance. There are similar bounding factors for disk IO, memory
consumption and computational tasks. With that in mind we can look at the
available tools.

Note: The tools can interfere with each other. For example, precise memory
profiling skews CPU profiles, goroutine blocking profiling affects scheduler
trace, etc. Use tools in isolation to get more precise info.

## CPU Profiler

The Go runtime contains a built-in CPU profiler, which shows what functions
consume what percent of CPU time. There are 3 ways you can get access to it:

1. The simplest one is the `-cpuprofile` flag of the 'go test'
   (http://golang.org/cmd/go/#hdr-Description_of_testing_flags) command. For
   example, the following command:

        $ go test -run=none -bench=ClientServerParallel4 -cpuprofile=cprof net/http

   will profile the given benchmark and write the CPU profile into the 'cprof'
   file. Then:

        $ go tool pprof --text http.test cprof

   will print a list of the hottest functions.

   There are several output types available; the most useful ones are:
   `--text`, `--web` and `--list`. Run `go tool pprof` to get the complete
   list. The obvious drawback of this option is that it works only for tests.

2. [net/http/pprof](http://golang.org/pkg/net/http/pprof): This is the ideal
   solution for network servers. You merely need to import `net/http/pprof`,
   and collect profiles with:

        $ go tool pprof --text mybin http://myserver:6060/debug/pprof/profile
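
   For example, a minimal sketch of a server with profiling enabled (the blank
   import registers the /debug/pprof handlers on the default mux; the port is
   an assumption):

        package main

        import (
            "log"
            "net/http"
            _ "net/http/pprof" // registers /debug/pprof/* handlers
        )

        func main() {
            // The profiling handlers are served on http.DefaultServeMux.
            log.Println(http.ListenAndServe("localhost:6060", nil))
        }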
54
553. Manual profile collection. You need to import
56 [runtime/pprof](http://golang.org/pkg/runtime/pprof/) and add the following
57 code to main function:
58
59 if *flagCpuprofile != "" {
60 f, err := os.Create(*flagCpuprofile)
61 if err != nil {
62 log.Fatal(err)
63 }
64 pprof.StartCPUProfile(f)
65 defer pprof.StopCPUProfile()
66 }
67
68 The profile will be written to the specified file, visualize it the same way
69 as in the first option.
70
71Here is an example of a profile visualized with `--web` option:
72[cpu_profile.png]
73
74You can investigate a single function with `--list=funcname` option. For
75example, the following profile shows that the time was spent in the `append`
76function:
77
78 . . 93: func (bp *buffer) WriteRune(r rune) error {
79 . . 94: if r < utf8.RuneSelf {
80 5 5 95: *bp = append(*bp, byte(r))
81 . . 96: return nil
82 . . 97: }
83 . . 98:
84 . . 99: b := *bp
85 . . 100: n := len(b)
86 . . 101: for n+utf8.UTFMax > cap(b) {
87 . . 102: b = append(b, 0)
88 . . 103: }
89 . . 104: w := utf8.EncodeRune(b[n:n+utf8.UTFMax], r)
90 . . 105: *bp = b[:n+w]
91 . . 106: return nil
92 . . 107: }
93
94There are also 3 special entries that the profiler uses when it can't unwind
95stack: GC, System and ExternalCode. GC means time spent during garbage
96collection, see Memory Profiler and Garbage Collector Trace sections below for
97optimization suggestions. System means time spent in goroutine scheduler, stack
98management code and other auxiliary runtime code. ExternalCode means time spent
99in native dynamic libraries.
100
101Here are some hints with respect to how to interpret what you see in the
102profile.
103
104If you see lots of time spent in `runtime.mallocgc` function, the program
105potentially makes excessive amount of small memory allocations. The profile will
106tell you where the allocations are coming from. See the memory profiler section
107for suggestions on how to optimize this case.
108
109If lots of time is spent in channel operations, sync.Mutex code and other
110synchronization primitives or System component, the program probably suffers
111from contention. Consider to restructure program to eliminate frequently
112accessed shared resources. Common techniques for this include
113sharding/partitioning, local buffering/batching and copy-on-write technique.
114
115If lots of time is spent in `syscall.Read/Write`, the program potentially makes
116excessive amount of small reads and writes. Bufio wrappers around os.File or
117net.Conn can help in this case.
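
For illustration, a sketch of such wrapping (the address and the write pattern
are assumptions):

    conn, err := net.Dial("tcp", "example.com:80")
    if err != nil {
        log.Fatal(err)
    }
    // Coalesce many small writes into fewer, larger syscalls.
    w := bufio.NewWriter(conn)
    for i := 0; i < 1000; i++ {
        w.WriteByte('x') // buffered; no syscall per byte
    }
    if err := w.Flush(); err != nil { // one syscall flushes the buffer
        log.Fatal(err)
    }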

If lots of time is spent in the GC component, the program either allocates too
many transient objects or the heap size is very small so garbage collections
happen too frequently. See the Garbage Collector Trace and Memory Profiler
sections for optimization suggestions.

Note: On darwin the CPU profiler currently only [works on El Capitan or newer](https://code.google.com/p/go/issues/detail?id=6047).

Note: On Windows you need to install Cygwin, Perl and Graphviz to generate
svg/web profiles.


## Memory Profiler

The memory profiler shows what functions allocate heap memory. You can collect
it in similar ways as the CPU profile: with `go test --memprofile`
(http://golang.org/cmd/go/#hdr-Description_of_testing_flags), with
[net/http/pprof](http://golang.org/pkg/net/http/pprof) via
http://myserver:6060/debug/pprof/heap or by calling
[runtime/pprof.WriteHeapProfile](http://golang.org/pkg/runtime/pprof/#WriteHeapProfile).

You can visualize only allocations live at the time of profile collection
(`--inuse_space` flag to `pprof`, the default), or all allocations that have
happened since program start (`--alloc_space` flag to `pprof`). The former is
useful for profiles collected with net/http/pprof on live applications, the
latter is useful for profiles collected at program end (otherwise you will see
an almost empty profile).

Note: the memory profiler is a sampling profiler; that is, it collects
information only about some subset of memory allocations. The probability of
sampling an object is proportional to its size. You can change the sampling
rate with the go test `--memprofilerate` flag, or by setting the
`runtime.MemProfileRate` variable at program startup. A rate of 1 will lead to
collection of information about all allocations, but it can slow down
execution. The default sampling rate is 1 sample per 512KB of allocated memory.
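
For example, a sketch of recording every allocation (expensive, so better
suited to tests than to production):

    func init() {
        // A rate of 1 samples every allocation. Set this as early as
        // possible, before the allocations you care about happen.
        runtime.MemProfileRate = 1
    }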

You can also visualize the number of bytes allocated or the number of objects
allocated (`--inuse/alloc_space` and `--inuse/alloc_objects` flags,
respectively). The profiler tends to sample larger objects more often during
profiling. But it's important to understand that large objects affect memory
consumption and GC time, while a large number of tiny allocations affects
execution speed (and GC time to some degree as well). So it may be useful to
look at both.

Objects can be persistent or transient. If you have several large persistent
objects allocated at program start, they will most likely be sampled by the
profiler (because they are large). Such objects do affect memory consumption
and GC time, but they do not affect normal execution speed (no memory
management operations happen on them). On the other hand, if you have a large
number of objects with very short life durations, they can be barely
represented in the profile (if you use the default `--inuse_space` mode). But
they do significantly affect execution speed, because they are constantly
allocated and freed. So, once again, it may be useful to look at both types of
objects. Generally, if you want to reduce memory consumption, look at the
`--inuse_space` profile collected during normal program operation. If you want
to improve execution speed, look at the `--alloc_objects` profile collected
after significant running time or at program end.
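
For example, assuming a binary `mybin` and a profile file `mem.prof` (both
names are placeholders):

    $ go tool pprof --inuse_space mybin mem.prof
    $ go tool pprof --alloc_objects mybin mem.prof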

There are several flags that control reporting granularity. `--functions`
makes pprof report at the function level (default). `--lines` makes pprof
report at the source line level, which is useful if hot functions allocate on
different lines. And there are also `--addresses` and `--files` for exact
instruction address and file level, respectively.

There is a useful option for the memory profile -- you can look at it right in
the browser (provided that you imported `net/http/pprof`). If you open
http://myserver:6060/debug/pprof/heap?debug=1 you should see the heap profile
along the lines of:

    heap profile: 4: 266528 [123: 11284472] @ heap/1048576
    1: 262144 [4: 376832] @ 0x28d9f 0x2a201 0x2a28a 0x2624d 0x26188 0x94ca3 0x94a0b 0x17add6 0x17ae9f 0x1069d3 0xfe911 0xf0a3e 0xf0d22 0x21a70
    # 0x2a201 cnew+0xc1 runtime/malloc.goc:718
    # 0x2a28a runtime.cnewarray+0x3a runtime/malloc.goc:731
    # 0x2624d makeslice1+0x4d runtime/slice.c:57
    # 0x26188 runtime.makeslice+0x98 runtime/slice.c:38
    # 0x94ca3 bytes.makeSlice+0x63 bytes/buffer.go:191
    # 0x94a0b bytes.(*Buffer).ReadFrom+0xcb bytes/buffer.go:163
    # 0x17add6 io/ioutil.readAll+0x156 io/ioutil/ioutil.go:32
    # 0x17ae9f io/ioutil.ReadAll+0x3f io/ioutil/ioutil.go:41
    # 0x1069d3 godoc/vfs.ReadFile+0x133 godoc/vfs/vfs.go:44
    # 0xfe911 godoc.func·023+0x471 godoc/meta.go:80
    # 0xf0a3e godoc.(*Corpus).updateMetadata+0x9e godoc/meta.go:101
    # 0xf0d22 godoc.(*Corpus).refreshMetadataLoop+0x42 godoc/meta.go:141

    2: 4096 [2: 4096] @ 0x28d9f 0x29059 0x1d252 0x1d450 0x106993 0xf1225 0xe1489 0xfbcad 0x21a70
    # 0x1d252 newdefer+0x112 runtime/panic.c:49
    # 0x1d450 runtime.deferproc+0x10 runtime/panic.c:132
    # 0x106993 godoc/vfs.ReadFile+0xf3 godoc/vfs/vfs.go:43
    # 0xf1225 godoc.(*Corpus).parseFile+0x75 godoc/parser.go:20
    # 0xe1489 godoc.(*treeBuilder).newDirTree+0x8e9 godoc/dirtrees.go:108
    # 0xfbcad godoc.func·002+0x15d godoc/dirtrees.go:100

The numbers at the beginning of each entry ("1: 262144 [4: 376832]") represent
the number of currently live objects, the amount of memory occupied by live
objects, the total number of allocations and the amount of memory occupied by
all allocations, respectively.

Optimizations are usually application-specific, but here are some common
suggestions.

1. Combine objects into larger objects. For example, replace a `*bytes.Buffer`
   struct member with `bytes.Buffer` (you can preallocate the buffer for
   writing by calling `bytes.Buffer.Grow` later). This will reduce the number
   of memory allocations (faster) and also reduce pressure on the garbage
   collector (faster garbage collections).
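
   For illustration (the `Message` types are hypothetical):

        // Before: the buffer is a separate heap allocation.
        type MessageBefore struct {
            body *bytes.Buffer
        }

        // After: the buffer is embedded, so it is allocated in the
        // same heap block as the enclosing struct.
        type MessageAfter struct {
            body bytes.Buffer
        }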

2. Local variables that escape from their declaration scope get promoted into
   heap allocations. The compiler generally can't prove that several variables
   have the same lifetime, so it allocates each such variable separately. So
   you can use the above advice for local variables as well. For example,
   replace:

        for k, v := range m {
            k, v := k, v // copy for capturing by the goroutine
            go func() {
                // use k and v
            }()
        }

   with:

        for k, v := range m {
            x := struct{ k, v string }{k, v} // copy for capturing by the goroutine
            go func() {
                // use x.k and x.v
            }()
        }

   This replaces two memory allocations with a single allocation. However,
   this optimization usually negatively affects code readability, so use it
   reasonably.

3. A special case of allocation combining is slice array preallocation. If you
   know the typical size of the slice, you can preallocate a backing array for
   it as follows:

        type X struct {
            buf      []byte
            bufArray [16]byte // Buf usually does not grow beyond 16 bytes.
        }

        func MakeX() *X {
            x := &X{}
            // Preinitialize buf with the backing array.
            x.buf = x.bufArray[:0]
            return x
        }

4. If possible use smaller data types. For example, use `int8` instead of
   `int`.

5. Objects that do not contain any pointers (note that strings, slices, maps
   and chans contain implicit pointers) are not scanned by the garbage
   collector. For example, a 1GB byte slice virtually does not affect garbage
   collection time. So if you remove pointers from actively used objects, it
   can positively impact garbage collection time. Some possibilities are:
   replace pointers with indices, or split the object into two parts, one of
   which does not contain pointers.
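
   For illustration, a sketch of replacing pointers with indices (the `Node`
   type is hypothetical):

        // Instead of linking nodes with *Node pointers, store all nodes
        // in one slice and link them by index. []Node has no pointer
        // fields, so the GC does not scan its elements.
        type Node struct {
            Left, Right int32 // indices into nodes; -1 means no child
            Value       int64
        }

        var nodes []Node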

6. Use freelists to reuse transient objects and reduce the number of
   allocations. The standard library contains the
   [sync.Pool](http://tip.golang.org/pkg/sync/#Pool) type that allows reusing
   the same object several times in between garbage collections. However, be
   aware that, as with any manual memory management scheme, incorrect use of
   sync.Pool can lead to use-after-free bugs.
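
   For illustration, a minimal sync.Pool sketch (pooling `bytes.Buffer`
   objects here is just an example use case):

        var bufPool = sync.Pool{
            New: func() interface{} { return new(bytes.Buffer) },
        }

        func process(data []byte) {
            buf := bufPool.Get().(*bytes.Buffer)
            buf.Reset() // the pool returns used objects; reset before use
            defer bufPool.Put(buf)
            buf.Write(data)
            // ... use buf, but do not retain it after Put ...
        }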

You can also use the Garbage Collector Trace (see below) to get some insights
into memory issues.

TODO(dvyukov): mention that stats are updated in a deferred way: "Memprof stats
are updated in a deferred way. This is required in order to present a consistent
picture in the situation when allocs are coming continuously and frees are
coming in batches afterwards. Several consecutive GCs push the update pipeline
forward. That's what you observe. So if you profile a live server then any
sample will give you a consistent snapshot. However, if the program completes
some activity and you want to collect the snapshot after this activity, then you
need to execute 2 or 3 GCs before the collection."

## Blocking Profiler

The blocking profiler shows where goroutines block waiting on synchronization
primitives (including timer channels). You can collect it in similar ways as
the CPU profile: with `go test --blockprofile`
(http://golang.org/cmd/go/#hdr-Description_of_testing_flags), with
[net/http/pprof](http://golang.org/pkg/net/http/pprof) via
http://myserver:6060/debug/pprof/block or by calling
[runtime/pprof.Lookup("block").WriteTo](http://golang.org/pkg/runtime/pprof/#Profile.WriteTo).

But there is a significant caveat -- the blocking profiler is not enabled by
default. `go test --blockprofile` will enable it for you automatically.
However, if you use `net/http/pprof` or `runtime/pprof`, you need to enable it
manually (otherwise the profile will be empty). To enable the blocking profiler
call
[runtime.SetBlockProfileRate](http://golang.org/pkg/runtime/#SetBlockProfileRate).
SetBlockProfileRate controls the fraction of goroutine blocking events that are
reported in the blocking profile. The profiler aims to sample an average of one
blocking event per the specified number of nanoseconds spent blocked. To
include every blocking event in the profile, set the rate to 1.
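
For example, a sketch of enabling it at program start:

    func main() {
        // Record every blocking event. Pass a larger value (in
        // nanoseconds spent blocked) to sample less and reduce overhead.
        runtime.SetBlockProfileRate(1)
        // ... rest of the program ...
    }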

If a function contains several blocking operations and it's not obvious which
one leads to blocking, use the `--lines` flag to pprof.

Note that not all blocking is bad. When a goroutine blocks, the underlying
worker thread simply switches to another goroutine. So blocking in the
cooperative Go environment is very different from blocking on a mutex in
non-cooperative systems (e.g. typical C++ or Java threading libraries, where
blocking leads to thread idling and expensive thread context switches). To give
you some feeling, let's consider some examples.

Blocking on a time.Ticker is usually OK. If a goroutine blocks on a Ticker for
10 seconds, you will see 10 seconds of blocking in the profile, which is
perfectly fine. Blocking on `sync.WaitGroup` is frequently OK. For example, if
a task takes 10 seconds, the goroutine waiting on a WaitGroup for completion
will account for 10 seconds of blocking in the profile. Blocking on sync.Cond
may or may not be OK, depending on the situation. A consumer blocking on a
channel suggests slow producers or lack of work. A producer blocking on a
channel suggests that consumers are slower, but this is frequently OK. Blocking
on a channel-based semaphore shows how many goroutines are gated on the
semaphore. Blocking on a sync.Mutex or sync.RWMutex is usually bad. You can use
the `--ignore` flag to pprof to exclude known uninteresting blocking from a
profile during visualization.

Blocking of goroutines can lead to two negative consequences:

1. The program does not scale with processors due to lack of work. Scheduler
   Trace can help to identify this case.

2. Excessive goroutine blocking/unblocking consumes CPU time. CPU Profiler can
   help to identify this case (look at the System component).

Here are some common suggestions that can help to reduce goroutine blocking:

1. Use sufficiently buffered channels in producer-consumer scenarios.
   Unbuffered channels substantially limit available parallelism in the
   program.
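
   For example (the element type and the capacity of 128 are assumptions to
   be tuned for the workload):

        work := make(chan Request, 128) // buffered; producers rarely block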

2. Use `sync.RWMutex` instead of `sync.Mutex` for read-mostly workloads.
   Readers never block other readers in `sync.RWMutex`, even at the
   implementation level.

3. In some cases it's possible to remove mutexes entirely by using the
   copy-on-write technique. If the protected data structure is modified
   infrequently and it's feasible to make copies of it, then it can be updated
   as follows:

        type Config struct {
            Routes   map[string]net.Addr
            Backends []net.Addr
        }

        var config atomic.Value // actual type is *Config

        // Worker goroutines use this function to obtain the current config.
        // UpdateConfig must be called at least once before this func.
        func CurrentConfig() *Config {
            return config.Load().(*Config)
        }

        // A background goroutine periodically creates a new Config object
        // and sets it as current using this function.
        func UpdateConfig(cfg *Config) {
            config.Store(cfg)
        }

   This pattern prevents the writer from blocking readers during updates.

4. Partitioning is another general technique for reducing contention/blocking
   on shared mutable data structures. Below is an example of how to partition
   a hashmap:

        type Partition struct {
            sync.RWMutex
            m map[string]string
        }

        const partCount = 64
        var m [partCount]Partition

        func Find(k string) string {
            idx := hash(k) % partCount
            part := &m[idx]
            part.RLock()
            v := part.m[k]
            part.RUnlock()
            return v
        }

5. Local caching and batching of updates can help to reduce contention on
   un-partitionable data structures. Below you can see how to batch sends to a
   channel:

        const CacheSize = 16

        type Cache struct {
            buf [CacheSize]int
            pos int
        }

        func Send(c chan [CacheSize]int, cache *Cache, value int) {
            cache.buf[cache.pos] = value
            cache.pos++
            if cache.pos == CacheSize {
                c <- cache.buf
                cache.pos = 0
            }
        }

   This technique is not limited to channels. It can be used to batch updates
   to a map, batch allocations, etc.

6. Use [sync.Pool](http://tip.golang.org/pkg/sync/#Pool) for freelists instead
   of chan-based or mutex-protected freelists. sync.Pool uses smart techniques
   internally to reduce blocking.

## Goroutine Profiler

The goroutine profiler simply gives you current stacks of all live goroutines
in the process. It can be handy for debugging load balancing issues (see the
Scheduler Trace section below), or for debugging deadlocks. The profile makes
sense only for a running app, so go test does not expose it. You can collect
the profile with [net/http/pprof](http://golang.org/pkg/net/http/pprof) via
http://myserver:6060/debug/pprof/goroutine, and visualize it to svg/pdf, or by
calling
[runtime/pprof.Lookup("goroutine").WriteTo](http://golang.org/pkg/runtime/pprof/#Profile.WriteTo).
But the most useful way is to type
http://myserver:6060/debug/pprof/goroutine?debug=2 in your browser, which will
give you symbolized stacks similar to what you see when a program crashes. Note
that goroutines in "syscall" state consume an OS thread, while other goroutines
do not (except for goroutines that called runtime.LockOSThread, which is,
unfortunately, not visible in the profile). Note that goroutines in "IO wait"
state also do not consume threads; they are parked on the non-blocking network
poller (which uses epoll/kqueue/GetQueuedCompletionStatus to unpark goroutines
later).

## Garbage Collector Trace

Aside from the profiling tools, there is another kind of tool available --
tracers. They allow you to trace garbage collection, the memory allocator and
the goroutine scheduler state. To enable the garbage collector (GC) trace, run
the program with the `GODEBUG=gctrace=1` environment variable:

    $ GODEBUG=gctrace=1 ./myserver

Then the program will print output similar to the following during execution:

    gc9(2): 12+1+744+8 us, 2 -> 10 MB, 108615 (593983-485368) objects, 4825/3620/0 sweeps, 0(0) handoff, 6(91) steal, 16/1/0 yields
    gc10(2): 12+6769+767+3 us, 1 -> 1 MB, 4222 (593983-589761) objects, 4825/0/1898 sweeps, 0(0) handoff, 6(93) steal, 16/10/2 yields
    gc11(2): 799+3+2050+3 us, 1 -> 69 MB, 831819 (1484009-652190) objects, 4825/691/0 sweeps, 0(0) handoff, 5(105) steal, 16/1/0 yields

Let's consider the meaning of these numbers. One line per GC is printed. The
first number ("gc9") is the number of the GC (this is the 9th GC since program
start). The number in parens ("(2)") is the number of worker threads that
participated in the GC. The next 4 numbers ("12+1+744+8 us") mean
stop-the-world, sweeping, marking and waiting for worker threads to finish, in
microseconds, respectively. The next 2 numbers ("2 -> 10 MB") mean the size of
the live heap after the previous GC and the full heap size (including garbage)
before the current GC. The next 3 numbers ("108615 (593983-485368) objects")
are the total number of objects in the heap (including garbage) and the total
number of memory allocation and free operations. The next 3 numbers
("4825/3620/0 sweeps") characterize the sweep phase (of the previous GC): there
were 4825 memory spans in total, 3620 were swept on demand or in background, 0
were swept during the stop-the-world phase (the rest were unused spans). The
next 4 numbers ("0(0) handoff, 6(91) steal") characterize load balancing during
the parallel mark phase: there were 0 object handoff operations (0 objects were
handed off), and 6 steal operations (91 objects were stolen). The last 3
numbers ("16/1/0 yields") characterize the effectiveness of the parallel mark
phase: there was a total of 17 yield operations while waiting for another
thread.

The GC is [mark-and-sweep
type](http://www.brpreiss.com/books/opus5/html/page424.html). Total GC time can
be expressed as:

    Tgc = Tseq + Tmark + Tsweep

where Tseq is the time to stop user goroutines plus some preparation activities
(usually small); Tmark is heap marking time, marking happens when all user
goroutines are stopped, and thus can significantly affect latency of
processing; Tsweep is heap sweeping time, sweeping generally happens
concurrently with normal program execution, and so is not so critical for
latency.

Marking time can be approximately expressed as:

    Tmark = C1*Nlive + C2*MEMlive_ptr + C3*Nlive_ptr

where `Nlive` is the number of live objects in the heap during GC,
`MEMlive_ptr` is the amount of memory occupied by live objects with pointers,
and `Nlive_ptr` is the number of pointers in live objects.

Sweeping time can be approximately expressed as:

    Tsweep = C4*MEMtotal + C5*MEMgarbage

where `MEMtotal` is the total amount of heap memory and `MEMgarbage` is the
amount of garbage in the heap.

The next GC happens after the program has allocated an extra amount of memory
proportional to the amount already in use. The proportion is controlled by the
GOGC environment variable (100 by default). If GOGC=100 and the program is
using 4M of heap memory, the runtime will trigger a GC again when the program
gets to 8M. This keeps the GC cost in linear proportion to the allocation cost.
Adjusting GOGC changes the linear constant and also the amount of extra memory
used.

Only sweeping depends on the total size of the heap, and sweeping happens
concurrently with normal program execution. So it can make sense to set GOGC to
a higher value (200, 300, 500, etc.) if you can afford extra memory
consumption. For example, GOGC=300 can reduce garbage collection overhead by up
to 2 times while keeping latencies the same (at the cost of a 2 times larger
heap).
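
For example (assuming you can afford roughly a 3x larger heap):

    $ GOGC=300 ./myserver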

GC is parallel and generally scales well with hardware parallelism. So it can
make sense to set GOMAXPROCS to a higher value even for sequential programs,
just to speed up garbage collections. However, note that the number of garbage
collector threads is currently bounded by 8.

## Memory Allocator Trace

The memory allocator trace simply dumps all memory allocation and free
operations to the console. It's enabled with the GODEBUG=allocfreetrace=1
environment variable. The output looks along the lines of:

    tracealloc(0xc208062500, 0x100, array of parse.Node)
    goroutine 16 [running]:
    runtime.mallocgc(0x100, 0x3eb7c1, 0x0)
    	runtime/malloc.goc:190 +0x145 fp=0xc2080b39f8
    runtime.growslice(0x31f840, 0xc208060700, 0x8, 0x8, 0x1, 0x0, 0x0, 0x0)
    	runtime/slice.goc:76 +0xbb fp=0xc2080b3a90
    text/template/parse.(*Tree).parse(0xc2080820e0, 0xc208023620, 0x0, 0x0)
    	text/template/parse/parse.go:289 +0x549 fp=0xc2080b3c50
    ...

    tracefree(0xc208002d80, 0x120)
    goroutine 16 [running]:
    runtime.MSpan_Sweep(0x73b080)
    	runtime/mgc0.c:1880 +0x514 fp=0xc20804b8f0
    runtime.MCentral_CacheSpan(0x69c858)
    	runtime/mcentral.c:48 +0x2b5 fp=0xc20804b920
    runtime.MCache_Refill(0x737000, 0xc200000012)
    	runtime/mcache.c:78 +0x119 fp=0xc20804b950
    ...

The trace contains the address of the memory block, size, type, goroutine id
and the stack trace. It's probably more useful for debugging, but can give very
fine-grained info for allocation optimizations as well.

## Scheduler Trace

The scheduler trace can provide insights into the dynamic behavior of the
goroutine scheduler and allows debugging load balancing and scalability issues.
To enable the scheduler trace, run the program with the GODEBUG=schedtrace=1000
environment variable (the value means the period of output, in ms; in this case
it's once per second):

    $ GODEBUG=schedtrace=1000 ./myserver

Then the program will print output similar to the following during execution:

    SCHED 1004ms: gomaxprocs=4 idleprocs=0 threads=11 idlethreads=4 runqueue=8 [0 1 0 3]
    SCHED 2005ms: gomaxprocs=4 idleprocs=0 threads=11 idlethreads=5 runqueue=6 [1 5 4 0]
    SCHED 3008ms: gomaxprocs=4 idleprocs=0 threads=11 idlethreads=4 runqueue=10 [2 2 2 1]

The first number ("1004ms") is time since program start. Gomaxprocs is the
current value of GOMAXPROCS. Idleprocs is the number of idling processors (the
rest are executing Go code). Threads is the total number of worker threads
created by the scheduler (threads can be in 3 states: executing Go code
(gomaxprocs-idleprocs), executing syscalls/cgocalls, or idle). Idlethreads is
the number of idling worker threads. Runqueue is the length of the global queue
with runnable goroutines. The numbers in square brackets ("[0 1 0 3]") are the
lengths of the per-processor queues with runnable goroutines. The sum of the
lengths of the global and local queues represents the total number of
goroutines available for execution.

Note: You can combine any of the tracers as
GODEBUG=gctrace=1,allocfreetrace=1,schedtrace=1000.

Note: There is also a detailed scheduler trace, which you can enable with
GODEBUG=schedtrace=1000,scheddetail=1. It prints detailed info about every
goroutine, worker thread and processor. We won't describe its format here as
it's mainly useful for scheduler developers; you can find details in
[src/pkg/runtime/proc.c](https://code.google.com/p/go/source/browse/src/pkg/runtime/proc.c).

The scheduler trace is useful when a program does not scale linearly with
GOMAXPROCS and/or does not consume 100% of CPU time. The ideal situation is
when all processors are busy executing Go code, the number of threads is
reasonable, there is plenty of work in all queues and the work is reasonably
evenly distributed:

    gomaxprocs=8 idleprocs=0 threads=40 idlethreads=5 runqueue=10 [20 20 20 20 20 20 20 20]

A bad situation is when any of the above does not hold. For example, the
following sample demonstrates a shortage of work to keep all processors busy:

    gomaxprocs=8 idleprocs=6 threads=40 idlethreads=30 runqueue=0 [0 2 0 0 0 1 0 0]

Note: use OS-provided means to measure actual CPU utilization as the ultimate
characteristic. On the Unix family of operating systems it is the top command;
on Windows it is Task Manager.

You can use the goroutine profiler to understand where goroutines block in the
case of work shortage. Note that load imbalance is not ultimately bad as long
as all processors are busy; it will just cause some moderate load balancing
overhead.

## Memory Statistics

The Go runtime exposes coarse-grained memory statistics via the
[runtime.ReadMemStats](http://golang.org/pkg/runtime/#ReadMemStats) function.
The statistics are also exposed via net/http/pprof at the bottom of
http://myserver:6060/debug/pprof/heap?debug=1. The statistics are described
[here](http://golang.org/pkg/runtime/#MemStats). Some of the interesting fields
are:

1. HeapAlloc - current heap size.
2. HeapSys - total heap size.
3. HeapObjects - total number of objects in the heap.
4. HeapReleased - amount of memory released to the OS; the runtime releases
   memory unused for 5 minutes back to the OS, and you can force this process
   with runtime/debug.FreeOSMemory.
5. Sys - total amount of memory allocated from the OS.
6. Sys-HeapReleased - effective memory consumption of the program.
7. StackSys - memory consumed for goroutine stacks (note that some stacks are
   allocated from the heap and are not accounted for here; unfortunately there
   is no way to get the total size of stacks
   (https://code.google.com/p/go/issues/detail?id=7468)).
8. MSpanSys/MCacheSys/BuckHashSys/GCSys/OtherSys - amount of memory allocated
   by the runtime for various auxiliary purposes; they are generally not
   interesting, unless they are too high.
9. PauseNs - durations of recent garbage collections.
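
For example, a minimal sketch of reading and logging a few of these fields:

    var ms runtime.MemStats
    runtime.ReadMemStats(&ms) // briefly stops the world; avoid hot paths
    log.Printf("heap: alloc=%d MB sys=%d MB released=%d MB objects=%d",
        ms.HeapAlloc>>20, ms.HeapSys>>20, ms.HeapReleased>>20, ms.HeapObjects)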

## Heap Dumper

The last available tool is the heap dumper; it can write the state of the whole
heap into a file for future exploration. It can be useful for identifying
memory leaks and getting insights into program memory consumption.

First, you need to write the dump using the
[runtime/debug.WriteHeapDump](http://tip.golang.org/pkg/runtime/debug/#WriteHeapDump)
function:

    f, err := os.Create("heapdump")
    if err != nil { ... }
    debug.WriteHeapDump(f.Fd())

Then you can either render it to a dot file with a graphical representation of
the heap or convert it to hprof format. To render it to a dot file:

    $ go get github.com/randall77/hprof/dumptodot
    $ dumptodot heapdump mybinary > heap.dot

and open `heap.dot` with Graphviz.

To convert it to `hprof` format:

    $ go get github.com/randall77/hprof/dumptohprof
    $ dumptohprof heapdump heap.hprof
    $ jhat heap.hprof

and navigate your browser to http://myserver:7000.

## Concluding Remarks

Optimization is an open problem; there are simple recipes that you can use to
improve performance, but sometimes optimization requires a complete
re-architecture of the program. Still, we hope that the tools described here
will be a valuable addition to your toolbox, which you can use to at least
analyze and understand what happens.
[Profiling Go Programs](http://blog.golang.org/profiling-go-programs) is a good
tutorial on usage of the CPU and memory profilers to optimize a simple program.