# Debugging performance issues in Go programs

-- originally written by Dmitry Vyukov

Let's assume you have a Go program and want to improve its performance. There
are several tools available that can help with this task. These tools can help
you identify various types of hotspots (CPU, IO, memory); hotspots are the
places that you need to concentrate on in order to significantly improve
performance. However, another outcome is possible -- the tools can help you
identify obvious performance defects in the program. For example, you prepare an
SQL statement before each query while you could prepare it once at program
startup. Another example is an O(N^2) algorithm that somehow slipped in where an
obvious O(N) one exists and is expected. In order to identify such cases you
need to sanity check what you see in profiles. For example, in the first case
significant time spent in SQL statement preparation would be a red flag.

It's also important to understand the various bounding factors for performance.
For example, if the program communicates via a 100 Mbps network link and already
utilizes more than 90 Mbps, there is not much you can do with the program to
improve its performance. There are similar bounding factors for disk IO, memory
consumption and computational tasks. With that in mind we can look at the
available tools.

Note: The tools can interfere with each other. For example, precise memory
profiling skews CPU profiles, goroutine blocking profiling affects the scheduler
trace, etc. Use tools in isolation to get more precise info.
## CPU Profiler

The Go runtime contains a built-in CPU profiler, which shows what functions
consume what percent of CPU time. There are three ways you can get access to
it:

1. The simplest one is the -cpuprofile flag of the 'go test'
   (http://golang.org/cmd/go/#hdr-Description_of_testing_flags) command. For
   example, the following command:

        $ go test -run=none -bench=ClientServerParallel4 -cpuprofile=cprof net/http

   will profile the given benchmark and write the CPU profile into the 'cprof'
   file. Then:

        $ go tool pprof --text http.test cprof

   will print a list of the hottest functions.

   There are several output types available; the most useful ones are:
   `--text`, `--web` and `--list`. Run `go tool pprof` to get the complete
   list. The obvious drawback of this option is that it works only for tests.

2. [net/http/pprof](http://golang.org/pkg/net/http/pprof): This is the ideal
   solution for network servers. You merely need to import `net/http/pprof`
   (see the sketch after this list), and collect profiles with:

        $ go tool pprof --text mybin http://myserver:6060/debug/pprof/profile

3. Manual profile collection. You need to import
   [runtime/pprof](http://golang.org/pkg/runtime/pprof/) and add the following
   code to the main function:

        if *flagCpuprofile != "" {
            f, err := os.Create(*flagCpuprofile)
            if err != nil {
                log.Fatal(err)
            }
            pprof.StartCPUProfile(f)
            defer pprof.StopCPUProfile()
        }

   The profile will be written to the specified file; visualize it the same way
   as in the first option.
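
Returning to the second option: importing `net/http/pprof` only registers the
profiling handlers, so the program must also be serving HTTP somewhere. A
minimal sketch, assuming the conventional localhost:6060 address (any listen
address works):

    package main

    import (
        "log"
        "net/http"
        _ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
    )

    func main() {
        // Serve the profiling endpoints. A real server would also register its
        // own handlers, or run this listener on a separate goroutine.
        log.Fatal(http.ListenAndServe("localhost:6060", nil))
    }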

Here is an example of a profile visualized with the `--web` option:
[cpu_profile.png]

You can investigate a single function with the `--list=funcname` option. For
example, the following profile shows that the time was spent in the `append`
function:

    .   .  93: func (bp *buffer) WriteRune(r rune) error {
    .   .  94:     if r < utf8.RuneSelf {
    5   5  95:         *bp = append(*bp, byte(r))
    .   .  96:         return nil
    .   .  97:     }
    .   .  98:
    .   .  99:     b := *bp
    .   . 100:     n := len(b)
    .   . 101:     for n+utf8.UTFMax > cap(b) {
    .   . 102:         b = append(b, 0)
    .   . 103:     }
    .   . 104:     w := utf8.EncodeRune(b[n:n+utf8.UTFMax], r)
    .   . 105:     *bp = b[:n+w]
    .   . 106:     return nil
    .   . 107: }

There are also 3 special entries that the profiler uses when it can't unwind
the stack: GC, System and ExternalCode. GC means time spent during garbage
collection; see the Memory Profiler and Garbage Collector Trace sections below
for optimization suggestions. System means time spent in the goroutine
scheduler, stack management code and other auxiliary runtime code. ExternalCode
means time spent in native dynamic libraries.

Here are some hints on how to interpret what you see in the profile.

If you see lots of time spent in the `runtime.mallocgc` function, the program
potentially makes an excessive number of small memory allocations. The profile
will tell you where the allocations are coming from. See the memory profiler
section for suggestions on how to optimize this case.

If lots of time is spent in channel operations, sync.Mutex code and other
synchronization primitives or the System component, the program probably
suffers from contention. Consider restructuring the program to eliminate
frequently accessed shared resources. Common techniques for this include
sharding/partitioning, local buffering/batching and copy-on-write.

If lots of time is spent in `syscall.Read/Write`, the program potentially makes
an excessive number of small reads and writes. Bufio wrappers around os.File or
net.Conn can help in this case.
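
For example, a minimal sketch of such a fix, assuming a hypothetical function
that writes many small lines to a net.Conn:

    import (
        "bufio"
        "fmt"
        "net"
    )

    // writeItems coalesces many small writes into a few large syscalls by
    // wrapping the connection in a bufio.Writer.
    func writeItems(conn net.Conn, n int) error {
        w := bufio.NewWriter(conn)
        for i := 0; i < n; i++ {
            if _, err := fmt.Fprintf(w, "item %d\n", i); err != nil {
                return err
            }
        }
        return w.Flush() // push out whatever is still buffered
    }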

If lots of time is spent in the GC component, the program either allocates too
many transient objects or the heap size is very small, so garbage collections
happen too frequently. See the Garbage Collector Trace and Memory Profiler
sections for optimization suggestions.

Note: On darwin the CPU profiler currently only [works on El Capitan or newer](https://code.google.com/p/go/issues/detail?id=6047).

Note: On Windows you need to install Cygwin, Perl and Graphviz to generate
svg/web profiles.


## Memory Profiler

The memory profiler shows what functions allocate heap memory. You can collect
it in similar ways as the CPU profile: with `go test --memprofile`
(http://golang.org/cmd/go/#hdr-Description_of_testing_flags), with
[net/http/pprof](http://golang.org/pkg/net/http/pprof) via
http://myserver:6060/debug/pprof/heap or by calling
[runtime/pprof.WriteHeapProfile](http://golang.org/pkg/runtime/pprof/#WriteHeapProfile).
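
For the last variant, a minimal sketch (the file name is an arbitrary example;
the explicit runtime.GC call is optional but gives up-to-date statistics):

    import (
        "log"
        "os"
        "runtime"
        "runtime/pprof"
    )

    func dumpHeapProfile() {
        f, err := os.Create("mem.prof") // arbitrary output file name
        if err != nil {
            log.Fatal(err)
        }
        defer f.Close()
        runtime.GC() // get up-to-date allocation statistics
        if err := pprof.WriteHeapProfile(f); err != nil {
            log.Fatal(err)
        }
    }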

You can visualize only allocations live at the time of profile collection
(the `--inuse_space` flag to `pprof`, the default), or all allocations made
since program start (the `--alloc_space` flag to `pprof`). The former is useful
for profiles collected with net/http/pprof on live applications, the latter is
useful for profiles collected at program end (otherwise you will see an almost
empty profile).

Note: the memory profiler is a sampling profiler, that is, it collects
information only about some subset of memory allocations. The probability of
sampling an object is proportional to its size. You can change the sampling
rate with the `go test --memprofilerate` flag, or by setting the
`runtime.MemProfileRate` variable at program startup. A rate of 1 will lead to
collection of information about all allocations, but it can slow down
execution. The default sampling rate is 1 sample per 512KB of allocated memory.
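
For example, a sketch of capturing every allocation (set it as early as
possible in main, before the allocations of interest happen):

    import "runtime"

    func main() {
        // Record every allocation. This is the most precise setting but can
        // noticeably slow down the program; the default is one sample per
        // 512KB of allocated memory.
        runtime.MemProfileRate = 1

        // ... rest of the program ...
    }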

You can also visualize the number of bytes allocated or the number of objects
allocated (`--inuse/alloc_space` and `--inuse/alloc_objects` flags,
respectively). The profiler tends to sample larger objects more often. But it's
important to understand that large objects affect memory consumption and GC
time, while a large number of tiny allocations affects execution speed (and GC
time to some degree as well). So it may be useful to look at both.

Objects can be persistent or transient. If you have several large persistent
objects allocated at program start, they will most likely be sampled by the
profiler (because they are large). Such objects do affect memory consumption
and GC time, but they do not affect normal execution speed (no memory
management operations happen on them). On the other hand, if you have a large
number of objects with very short lifetimes, they can be barely represented in
the profile (if you use the default `--inuse_space` mode). But they do
significantly affect execution speed, because they are constantly allocated and
freed. So, once again, it may be useful to look at both types of objects.
Generally, if you want to reduce memory consumption, look at the
`--inuse_space` profile collected during normal program operation. If you want
to improve execution speed, look at the `--alloc_objects` profile collected
after significant running time or at program end.
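
For example, assuming a binary `mybin` and a collected profile `heap.prof`
(hypothetical names), the two views can be compared with:

    $ go tool pprof --inuse_space --text mybin heap.prof
    $ go tool pprof --alloc_objects --text mybin heap.prof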

There are several flags that control reporting granularity. `--functions` makes
pprof report at function level (the default). `--lines` makes pprof report at
source line level, which is useful if hot functions allocate on different
lines. And there are also `--addresses` and `--files` for exact instruction
address and file level, respectively.

There is a useful option for the memory profile -- you can look at it right in
the browser (provided that you imported `net/http/pprof`). If you open
http://myserver:6060/debug/pprof/heap?debug=1 you will see the heap profile
along the lines of:

    heap profile: 4: 266528 [123: 11284472] @ heap/1048576
    1: 262144 [4: 376832] @ 0x28d9f 0x2a201 0x2a28a 0x2624d 0x26188 0x94ca3 0x94a0b 0x17add6 0x17ae9f 0x1069d3 0xfe911 0xf0a3e 0xf0d22 0x21a70
    # 0x2a201 cnew+0xc1 runtime/malloc.goc:718
    # 0x2a28a runtime.cnewarray+0x3a runtime/malloc.goc:731
    # 0x2624d makeslice1+0x4d runtime/slice.c:57
    # 0x26188 runtime.makeslice+0x98 runtime/slice.c:38
    # 0x94ca3 bytes.makeSlice+0x63 bytes/buffer.go:191
    # 0x94a0b bytes.(*Buffer).ReadFrom+0xcb bytes/buffer.go:163
    # 0x17add6 io/ioutil.readAll+0x156 io/ioutil/ioutil.go:32
    # 0x17ae9f io/ioutil.ReadAll+0x3f io/ioutil/ioutil.go:41
    # 0x1069d3 godoc/vfs.ReadFile+0x133 godoc/vfs/vfs.go:44
    # 0xfe911 godoc.func·023+0x471 godoc/meta.go:80
    # 0xf0a3e godoc.(*Corpus).updateMetadata+0x9e godoc/meta.go:101
    # 0xf0d22 godoc.(*Corpus).refreshMetadataLoop+0x42 godoc/meta.go:141

    2: 4096 [2: 4096] @ 0x28d9f 0x29059 0x1d252 0x1d450 0x106993 0xf1225 0xe1489 0xfbcad 0x21a70
    # 0x1d252 newdefer+0x112 runtime/panic.c:49
    # 0x1d450 runtime.deferproc+0x10 runtime/panic.c:132
    # 0x106993 godoc/vfs.ReadFile+0xf3 godoc/vfs/vfs.go:43
    # 0xf1225 godoc.(*Corpus).parseFile+0x75 godoc/parser.go:20
    # 0xe1489 godoc.(*treeBuilder).newDirTree+0x8e9 godoc/dirtrees.go:108
    # 0xfbcad godoc.func·002+0x15d godoc/dirtrees.go:100

The numbers in the beginning of each entry ("1: 262144 [4: 376832]") represent
the number of currently live objects, the amount of memory occupied by live
objects, the total number of allocations and the amount of memory occupied by
all allocations, respectively.

Optimizations are usually application-specific, but here are some common
suggestions.

1. Combine objects into larger objects. For example, replace a `*bytes.Buffer`
   struct member with `bytes.Buffer` (you can preallocate the buffer for
   writing by calling `bytes.Buffer.Grow` later). This will reduce the number
   of memory allocations (faster) and also reduce pressure on the garbage
   collector (faster garbage collections).

2. Local variables that escape from their declaration scope get promoted into
   heap allocations. The compiler generally can't prove that several variables
   have the same lifetime, so it allocates each such variable separately. So
   you can use the above advice for local variables as well. For example,
   replace:

        for k, v := range m {
            k, v := k, v // copy for capturing by the goroutine
            go func() {
                // use k and v
            }()
        }

   with:

        for k, v := range m {
            x := struct{ k, v string }{k, v} // copy for capturing by the goroutine
            go func() {
                // use x.k and x.v
            }()
        }

   This replaces two memory allocations with a single allocation. However, this
   optimization usually negatively affects code readability, so use it
   reasonably.

3. A special case of allocation combining is slice array preallocation. If you
   know the typical size of the slice, you can preallocate a backing array for
   it as follows:

        type X struct {
            buf      []byte
            bufArray [16]byte // Buf usually does not grow beyond 16 bytes.
        }

        func MakeX() *X {
            x := &X{}
            // Preinitialize buf with the backing array.
            x.buf = x.bufArray[:0]
            return x
        }

4. If possible use smaller data types. For example, use `int8` instead of
   `int`.

5. Objects that do not contain any pointers (note that strings, slices, maps
   and chans contain implicit pointers) are not scanned by the garbage
   collector. For example, a 1GB byte slice virtually does not affect garbage
   collection time. So if you remove pointers from actively used objects, it
   can positively impact garbage collection time. Some possibilities are:
   replace pointers with indices, or split an object into two parts, one of
   which does not contain pointers.

6. Use freelists to reuse transient objects and reduce the number of
   allocations. The standard library contains the
   [sync.Pool](http://tip.golang.org/pkg/sync/#Pool) type, which allows the
   same object to be reused several times between garbage collections (see the
   sketch after this list). However, be aware that, as with any manual memory
   management scheme, incorrect use of sync.Pool can lead to use-after-free
   bugs.
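
Here is a minimal sketch of such a sync.Pool-based freelist for reusable byte
buffers (the names and the processing logic are made up for illustration):

    import (
        "bytes"
        "sync"
    )

    // bufPool reuses bytes.Buffer objects between garbage collections instead
    // of allocating a fresh one for every request.
    var bufPool = sync.Pool{
        New: func() interface{} { return new(bytes.Buffer) },
    }

    func process(data []byte) {
        buf := bufPool.Get().(*bytes.Buffer)
        buf.Reset() // reuse the underlying storage
        buf.Write(data)
        // ... use buf, but do not retain it after the Put call below ...
        bufPool.Put(buf)
    }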

You can also use the Garbage Collector Trace (see below) to get some insights
into memory issues.

TODO(dvyukov): mention that stats are updated in a deferred way: "Memprof stats
are updated in a deferred way. This is required in order to present a consistent
picture in the situation when allocs are coming continuously and frees are
coming in batches afterwards. Several consecutive GCs push the update pipeline
forward. That's what you observe. So if you profile a live server then any
sample will give you a consistent snapshot. However, if the program completes
some activity and you want to collect the snapshot after this activity, then you
need to execute 2 or 3 GCs before the collection."

## Blocking Profiler

The blocking profiler shows where goroutines block waiting on synchronization
primitives (including timer channels). You can collect it in similar ways as
the CPU profile: with `go test --blockprofile`
(http://golang.org/cmd/go/#hdr-Description_of_testing_flags), with
[net/http/pprof](http://golang.org/pkg/net/http/pprof) via
http://myserver:6060/debug/pprof/block or by calling
[runtime/pprof.Lookup("block").WriteTo](http://golang.org/pkg/runtime/pprof/#Profile.WriteTo).

But there is a significant caveat -- the blocking profiler is not enabled by
default. `go test --blockprofile` will enable it for you automatically. However,
if you use `net/http/pprof` or `runtime/pprof`, you need to enable it manually
(otherwise the profile will be empty). To enable the blocking profiler, call
[runtime.SetBlockProfileRate](http://golang.org/pkg/runtime/#SetBlockProfileRate).
SetBlockProfileRate controls the fraction of goroutine blocking events that are
reported in the blocking profile. The profiler aims to sample an average of one
blocking event per the specified number of nanoseconds spent blocked. To
include every blocking event in the profile, set the rate to 1.
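
A minimal sketch of enabling it manually, e.g. early in main or in an init
function:

    import "runtime"

    func init() {
        // Report every blocking event. This is the most detailed and most
        // expensive setting; a larger rate (e.g. 1e6, i.e. roughly one sample
        // per millisecond spent blocked) is cheaper for production servers.
        runtime.SetBlockProfileRate(1)
    }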

If a function contains several blocking operations and it's not obvious which
one leads to blocking, use the `--lines` flag to pprof.

Note that not all blocking is bad. When a goroutine blocks, the underlying
worker thread simply switches to another goroutine. So blocking in the
cooperative Go environment is very different from blocking on a mutex in
non-cooperative systems (e.g. typical C++ or Java threading libraries, where
blocking leads to thread idling and expensive thread context switches). To give
you some feeling for this, let's consider some examples.

Blocking on a time.Ticker is usually OK. If a goroutine blocks on a Ticker for
10 seconds, you will see 10 seconds of blocking in the profile, which is
perfectly fine. Blocking on `sync.WaitGroup` is frequently OK. For example, if
a task takes 10 seconds, the goroutine waiting on a WaitGroup for completion
will account for 10 seconds of blocking in the profile. Blocking on sync.Cond
may or may not be OK, depending on the situation. A consumer blocking on a
channel suggests slow producers or lack of work. A producer blocking on a
channel suggests that consumers are slower, but this is frequently OK. Blocking
on a channel-based semaphore shows how many goroutines are gated on the
semaphore. Blocking on a sync.Mutex or sync.RWMutex is usually bad. You can use
the `--ignore` flag to pprof to exclude known uninteresting blocking from a
profile during visualization.

Blocking of goroutines can lead to two negative consequences:

1. The program does not scale with processors due to lack of work. The
   Scheduler Trace can help to identify this case.

2. Excessive goroutine blocking/unblocking consumes CPU time. The CPU Profiler
   can help to identify this case (look at the System component).

Here are some common suggestions that can help to reduce goroutine blocking:

1. Use sufficiently buffered channels in producer-consumer scenarios (see the
   sketch after this list). Unbuffered channels substantially limit available
   parallelism in the program.

2. Use `sync.RWMutex` instead of `sync.Mutex` for read-mostly workloads.
   Readers never block other readers in `sync.RWMutex`, even at the
   implementation level.

3. In some cases it's possible to remove mutexes entirely by using the
   copy-on-write technique. If the protected data structure is modified
   infrequently and it's feasible to make copies of it, then it can be updated
   as follows:

        type Config struct {
            Routes   map[string]net.Addr
            Backends []net.Addr
        }

        var config atomic.Value // actual type is *Config

        // Worker goroutines use this function to obtain the current config.
        // UpdateConfig must be called at least once before this func.
        func CurrentConfig() *Config {
            return config.Load().(*Config)
        }

        // Background goroutine periodically creates a new Config object
        // and sets it as current using this function.
        func UpdateConfig(cfg *Config) {
            config.Store(cfg)
        }

   This pattern prevents the writer from blocking readers during updates.

4. Partitioning is another general technique for reducing contention/blocking
   on shared mutable data structures. Below is an example of how to partition a
   hashmap:

        type Partition struct {
            sync.RWMutex
            m map[string]string
        }

        const partCount = 64
        var m [partCount]Partition

        func Find(k string) string {
            idx := hash(k) % partCount
            part := &m[idx]
            part.RLock()
            v := part.m[k]
            part.RUnlock()
            return v
        }

5. Local caching and batching of updates can help to reduce contention on
   un-partitionable data structures. Below you can see how to batch sends to a
   channel:

        const CacheSize = 16

        type Cache struct {
            buf [CacheSize]int
            pos int
        }

        func Send(c chan [CacheSize]int, cache *Cache, value int) {
            cache.buf[cache.pos] = value
            cache.pos++
            if cache.pos == CacheSize {
                c <- cache.buf
                cache.pos = 0
            }
        }

   This technique is not limited to channels. It can be used to batch updates
   to a map, batch allocations, etc.

6. Use [sync.Pool](http://tip.golang.org/pkg/sync/#Pool) for freelists instead
   of chan-based or mutex-protected freelists. sync.Pool uses smart techniques
   internally to reduce blocking.
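
As a sketch of the first suggestion, a buffered channel lets the producer run
ahead of the consumer instead of blocking on every send (the buffer size of 128
is an arbitrary example):

    func producerConsumer(items []int) {
        // The buffer lets the producer keep going while the consumer is busy;
        // with an unbuffered channel every send would block until a receive.
        ch := make(chan int, 128)

        go func() {
            for _, it := range items {
                ch <- it
            }
            close(ch)
        }()

        for it := range ch {
            _ = it // consume the item
        }
    }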

## Goroutine Profiler

The goroutine profiler simply gives you the current stacks of all live
goroutines in the process. It can be handy to debug load balancing issues (see
the Scheduler Trace section below), or to debug deadlocks. The profile makes
sense only for a running app, so go test does not expose it. You can collect
the profile with [net/http/pprof](http://golang.org/pkg/net/http/pprof) via
http://myserver:6060/debug/pprof/goroutine and visualize it to svg/pdf, or by
calling
[runtime/pprof.Lookup("goroutine").WriteTo](http://golang.org/pkg/runtime/pprof/#Profile.WriteTo).
But the most useful way is to type
http://myserver:6060/debug/pprof/goroutine?debug=2 in your browser, which will
give you symbolized stacks similar to what you see when a program crashes. Note
that goroutines in "syscall" state consume an OS thread, while other goroutines
do not (except for goroutines that called runtime.LockOSThread, which is,
unfortunately, not visible in the profile). Note that goroutines in "IO wait"
state also do not consume threads; they are parked on the non-blocking network
poller (which uses epoll/kqueue/GetQueuedCompletionStatus to unpark goroutines
later).
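
For the manual variant, a small sketch of dumping all goroutine stacks to
stderr (debug level 2 produces the panic-style symbolized stacks mentioned
above):

    import (
        "os"
        "runtime/pprof"
    )

    // dumpGoroutines writes the stacks of all live goroutines to stderr.
    func dumpGoroutines() {
        pprof.Lookup("goroutine").WriteTo(os.Stderr, 2)
    }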

## Garbage Collector Trace

Aside from the profiling tools, there is another kind of tool available --
tracers. They allow you to trace garbage collections, the memory allocator and
the goroutine scheduler state. To enable the garbage collector (GC) trace, run
the program with the `GODEBUG=gctrace=1` environment variable:

    $ GODEBUG=gctrace=1 ./myserver

Then the program will print output similar to the following during execution:

    gc9(2): 12+1+744+8 us, 2 -> 10 MB, 108615 (593983-485368) objects, 4825/3620/0 sweeps, 0(0) handoff, 6(91) steal, 16/1/0 yields
    gc10(2): 12+6769+767+3 us, 1 -> 1 MB, 4222 (593983-589761) objects, 4825/0/1898 sweeps, 0(0) handoff, 6(93) steal, 16/10/2 yields
    gc11(2): 799+3+2050+3 us, 1 -> 69 MB, 831819 (1484009-652190) objects, 4825/691/0 sweeps, 0(0) handoff, 5(105) steal, 16/1/0 yields

Let's consider the meaning of these numbers. One line per GC is printed. The
first number ("gc9") is the number of the GC (this is the 9th GC since program
start). The number in parens ("(2)") is the number of worker threads that
participated in the GC. The next 4 numbers ("12+1+744+8 us") mean
stop-the-world, sweeping, marking and waiting for worker threads to finish, in
microseconds, respectively. The next 2 numbers ("2 -> 10 MB") mean the size of
the live heap after the previous GC and the full heap size (including garbage)
before the current GC. The next 3 numbers ("108615 (593983-485368) objects")
are the total number of objects in the heap (including garbage) and the total
numbers of memory allocation and free operations. The next 3 numbers
("4825/3620/0 sweeps") characterize the sweep phase (of the previous GC): there
were 4825 memory spans in total, 3620 were swept on demand or in background,
and 0 were swept during the stop-the-world phase (the rest were unused spans).
The next 4 numbers ("0(0) handoff, 6(91) steal") characterize load balancing
during the parallel mark phase: there were 0 object handoff operations (0
objects were handed off), and 6 steal operations (91 objects were stolen). The
last 3 numbers ("16/1/0 yields") characterize the effectiveness of the parallel
mark phase: there were a total of 17 yield operations while waiting for another
thread.

The GC is a [mark-and-sweep
collector](http://www.brpreiss.com/books/opus5/html/page424.html). Total GC
time can be expressed as:

    Tgc = Tseq + Tmark + Tsweep

where `Tseq` is the time to stop user goroutines and do some preparation
(usually small); `Tmark` is the heap marking time (marking happens while all
user goroutines are stopped, and thus can significantly affect processing
latency); and `Tsweep` is the heap sweeping time (sweeping generally happens
concurrently with normal program execution, and so is not as critical for
latency).

Marking time can be approximately expressed as:

    Tmark = C1*Nlive + C2*MEMlive_ptr + C3*Nlive_ptr

where `Nlive` is the number of live objects in the heap during GC,
`MEMlive_ptr` is the amount of memory occupied by live objects with pointers,
and `Nlive_ptr` is the number of pointers in live objects.

Sweeping time can be approximately expressed as:

    Tsweep = C4*MEMtotal + C5*MEMgarbage

where `MEMtotal` is the total amount of heap memory and `MEMgarbage` is the
amount of garbage in the heap.

The next GC happens after the program has allocated an extra amount of memory
proportional to the amount already in use. The proportion is controlled by the
GOGC environment variable (100 by default). If GOGC=100 and the program is
using 4M of heap memory, the runtime will trigger a GC again when the program
gets to 8M. This keeps the GC cost in linear proportion to the allocation cost.
Adjusting GOGC changes the linear constant and also the amount of extra memory
used.

Only sweeping depends on the total size of the heap, and sweeping happens
concurrently with normal program execution. So it can make sense to set GOGC to
a higher value (200, 300, 500, etc) if you can afford the extra memory
consumption. For example, GOGC=300 can reduce garbage collection overhead by up
to 2 times while keeping latencies the same (at the cost of a 2 times larger
heap).

GC is parallel and generally scales well with hardware parallelism. So it can
make sense to set GOMAXPROCS to a higher value even for sequential programs,
just to speed up garbage collections. However, note that the number of garbage
collector threads is currently bounded by 8.

## Memory Allocator Trace

The memory allocator trace simply dumps all memory allocation and free
operations onto the console. It's enabled with the GODEBUG=allocfreetrace=1
environment variable. The output looks along the lines of:

    tracealloc(0xc208062500, 0x100, array of parse.Node)
    goroutine 16 [running]:
    runtime.mallocgc(0x100, 0x3eb7c1, 0x0)
        runtime/malloc.goc:190 +0x145 fp=0xc2080b39f8
    runtime.growslice(0x31f840, 0xc208060700, 0x8, 0x8, 0x1, 0x0, 0x0, 0x0)
        runtime/slice.goc:76 +0xbb fp=0xc2080b3a90
    text/template/parse.(*Tree).parse(0xc2080820e0, 0xc208023620, 0x0, 0x0)
        text/template/parse/parse.go:289 +0x549 fp=0xc2080b3c50
    ...

    tracefree(0xc208002d80, 0x120)
    goroutine 16 [running]:
    runtime.MSpan_Sweep(0x73b080)
        runtime/mgc0.c:1880 +0x514 fp=0xc20804b8f0
    runtime.MCentral_CacheSpan(0x69c858)
        runtime/mcentral.c:48 +0x2b5 fp=0xc20804b920
    runtime.MCache_Refill(0x737000, 0xc200000012)
        runtime/mcache.c:78 +0x119 fp=0xc20804b950
    ...

The trace contains the address of the memory block, its size and type, the
goroutine id and the stack trace. It's probably more useful for debugging, but
it can give very fine-grained info for allocation optimizations as well.

## Scheduler Trace

The scheduler trace can provide insights into the dynamic behavior of the
goroutine scheduler and allows you to debug load balancing and scalability
issues. To enable the scheduler trace, run the program with the
GODEBUG=schedtrace=1000 environment variable (the value is the output period in
ms; in this case it's once per second):

    $ GODEBUG=schedtrace=1000 ./myserver

Then the program will print output similar to the following during execution:

    SCHED 1004ms: gomaxprocs=4 idleprocs=0 threads=11 idlethreads=4 runqueue=8 [0 1 0 3]
    SCHED 2005ms: gomaxprocs=4 idleprocs=0 threads=11 idlethreads=5 runqueue=6 [1 5 4 0]
    SCHED 3008ms: gomaxprocs=4 idleprocs=0 threads=11 idlethreads=4 runqueue=10 [2 2 2 1]

The first number ("1004ms") is the time since program start. Gomaxprocs is the
current value of GOMAXPROCS. Idleprocs is the number of idling processors (the
rest are executing Go code). Threads is the total number of worker threads
created by the scheduler (threads can be in 3 states: executing Go code
(gomaxprocs-idleprocs), executing syscalls/cgocalls, or idle). Idlethreads is
the number of idling worker threads. Runqueue is the length of the global queue
of runnable goroutines. The numbers in square brackets ("[0 1 0 3]") are the
lengths of the per-processor queues of runnable goroutines. The sum of the
lengths of the global and local queues represents the total number of
goroutines available for execution.

Note: You can combine any of the tracers as
GODEBUG=gctrace=1,allocfreetrace=1,schedtrace=1000.

Note: There is also a detailed scheduler trace, which you can enable with
GODEBUG=schedtrace=1000,scheddetail=1. It prints detailed info about every
goroutine, worker thread and processor. We won't describe its format here as
it's mainly useful for scheduler developers; you can find details in
[src/pkg/runtime/proc.c](https://code.google.com/p/go/source/browse/src/pkg/runtime/proc.c).

The scheduler trace is useful when a program does not scale linearly with
GOMAXPROCS and/or does not consume 100% of CPU time. The ideal situation is
when all processors are busy executing Go code, the number of threads is
reasonable, there is plenty of work in all queues and the work is reasonably
evenly distributed:

    gomaxprocs=8 idleprocs=0 threads=40 idlethreads=5 runqueue=10 [20 20 20 20 20 20 20 20]

A bad situation is when some of the above does not hold. For example, the
following sample demonstrates a shortage of work to keep all processors busy:

    gomaxprocs=8 idleprocs=6 threads=40 idlethreads=30 runqueue=0 [0 2 0 0 0 1 0 0]

Note: use OS-provided means to measure actual CPU utilization as the ultimate
characteristic. On the Unix family of operating systems it is the top command;
on Windows it is the Task Manager.

You can use the goroutine profiler to understand where goroutines block in the
case of a work shortage. Note that load imbalance is not ultimately bad as long
as all processors are busy; it will just cause some moderate load balancing
overhead.

## Memory Statistics

The Go runtime exposes coarse-grained memory statistics via the
[runtime.ReadMemStats](http://golang.org/pkg/runtime/#ReadMemStats) function.
The statistics are also exposed via net/http/pprof at the bottom of
http://myserver:6060/debug/pprof/heap?debug=1. The statistics are described
[here](http://golang.org/pkg/runtime/#MemStats). Some of the interesting fields
are:

1. HeapAlloc - current heap size.
2. HeapSys - total heap size.
3. HeapObjects - total number of objects in the heap.
4. HeapReleased - amount of memory released to the OS; the runtime releases
   memory back to the OS if it is unused for 5 minutes, and you can force this
   process with runtime/debug.FreeOSMemory.
5. Sys - total amount of memory allocated from the OS.
6. Sys-HeapReleased - effective memory consumption of the program.
7. StackSys - memory consumed for goroutine stacks (note that some stacks are
   allocated from the heap and are not accounted for here; unfortunately there
   is no way to get the total size of stacks
   (https://code.google.com/p/go/issues/detail?id=7468)).
8. MSpanSys/MCacheSys/BuckHashSys/GCSys/OtherSys - amount of memory allocated
   by the runtime for various auxiliary purposes; they are generally not
   interesting, unless they are too high.
9. PauseNs - durations of the last garbage collections.
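
A minimal sketch of reading a few of these fields in code (the field selection
and log format are arbitrary):

    import (
        "log"
        "runtime"
    )

    func logMemStats() {
        var ms runtime.MemStats
        // ReadMemStats stops the world briefly, so don't call it in a hot loop.
        runtime.ReadMemStats(&ms)
        log.Printf("heap: alloc=%d MB sys=%d MB objects=%d released=%d MB",
            ms.HeapAlloc>>20, ms.HeapSys>>20, ms.HeapObjects, ms.HeapReleased>>20)
    }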

## Heap Dumper

The last available tool is the heap dumper; it can write the state of the whole
heap into a file for future exploration. It can be useful for identifying
memory leaks and getting insights into program memory consumption.

First, you need to write the dump using the
[runtime/debug.WriteHeapDump](http://tip.golang.org/pkg/runtime/debug/#WriteHeapDump)
function:

    f, err := os.Create("heapdump")
    if err != nil { ... }
    debug.WriteHeapDump(f.Fd())

Then you can either render it to a dot file with a graphical representation of
the heap or convert it to hprof format. To render it to a dot file:

    $ go get github.com/randall77/hprof/dumptodot
    $ dumptodot heapdump mybinary > heap.dot

and open `heap.dot` with Graphviz.

To convert it to `hprof` format:

    $ go get github.com/randall77/hprof/dumptohprof
    $ dumptohprof heapdump heap.hprof
    $ jhat heap.hprof

and navigate your browser to http://myserver:7000.

## Concluding Remarks

Optimization is an open problem; there are simple recipes that you can use to
improve performance, but sometimes optimization requires a complete
re-architecture of the program. We hope that the tools described here will be a
valuable addition to your toolbox, which you can use to at least analyze and
understand what happens. [Profiling Go
Programs](http://blog.golang.org/profiling-go-programs) is a good tutorial on
using the CPU and memory profilers to optimize a simple program.