| # Debugging performance issues in Go programs | 
 |  | 
 | -- originally written by Dmitry Vyukov | 
 |  | 
Let's assume you have a Go program and want to improve its performance. There
are several tools available that can help with this task. They can identify
various types of hotspots (CPU, IO, memory) -- the places you need to
concentrate on in order to significantly improve performance. Another possible
outcome is that the tools help you identify obvious performance defects in the
program. For example, the program prepares an SQL statement before each query
when it could be prepared once at program startup; or an O(N^2) algorithm
slipped into a place where an obvious O(N) algorithm exists and is expected. To
identify such cases you need to sanity-check what you see in the profiles. In
the first case, for example, significant time spent in SQL statement
preparation would be the red flag.
 |  | 
It's also important to understand the various bounding factors for performance.
For example, if the program communicates over a 100 Mbps network link and
already utilizes more than 90 Mbps of it, there is not much you can do with the
program itself to improve its performance. There are similar bounding factors
for disk IO, memory consumption and computational tasks. With that in mind, we
can look at the available tools.
 |  | 
Note: the tools can interfere with each other. For example, precise memory
profiling skews CPU profiles, goroutine blocking profiling affects the
scheduler trace, etc. Use the tools in isolation to get more precise info.
 |  | 
 | ## CPU Profiler | 
 |  | 
The Go runtime contains a built-in CPU profiler, which shows what functions
consume what percent of CPU time. There are 3 ways you can get access to it:
 |  | 
1.  The simplest one is the `-cpuprofile` flag of the
    ['go test'](http://golang.org/cmd/go/#hdr-Description_of_testing_flags)
    command. For example, the following command:
 |  | 
 |         $ go test -run=none -bench=ClientServerParallel4 -cpuprofile=cprof net/http | 
 |  | 
 |     will profile the given benchmark and write CPU profile into 'cprof' file. | 
 |     Then: | 
 |  | 
 |         $ go tool pprof --text http.test cprof | 
 |  | 
 |     will print a list of the hottest functions. | 
 |  | 
    There are several output types available; the most useful ones are
    `--text`, `--web` and `--list`. Run `go tool pprof` to get the complete
    list. The obvious drawback of this option is that it works only for tests.
 |  | 
 | 2.  [net/http/pprof](http://golang.org/pkg/net/http/pprof): This is the ideal | 
 |     solution for network servers. You merely need to import `net/http/pprof`, | 
 |     and collect profiles with: | 
 |  | 
        $ go tool pprof --text mybin http://myserver:6060/debug/pprof/profile
 |  | 
3.  Manual profile collection. You need to import
    [runtime/pprof](http://golang.org/pkg/runtime/pprof/) and add the following
    code to the main function:

        var flagCpuprofile = flag.String("cpuprofile", "", "write CPU profile to file")

        // In main, after flag.Parse():
        if *flagCpuprofile != "" {
            f, err := os.Create(*flagCpuprofile)
            if err != nil {
                log.Fatal(err)
            }
            if err := pprof.StartCPUProfile(f); err != nil {
                log.Fatal(err)
            }
            defer pprof.StopCPUProfile()
        }
 |  | 
    The profile will be written to the specified file; visualize it the same
    way as in the first option.
 |  | 
Here is an example of a profile visualized with the `--web` option:
 | [cpu_profile.png] | 
 |  | 
You can investigate a single function with the `--list=funcname` option. For
example, the following profile shows that the time was spent in the `append`
function:
 |  | 
 |      .      .   93: func (bp *buffer) WriteRune(r rune) error { | 
 |      .      .   94:     if r < utf8.RuneSelf { | 
 |      5      5   95:         *bp = append(*bp, byte(r)) | 
 |      .      .   96:         return nil | 
 |      .      .   97:     } | 
 |      .      .   98: | 
 |      .      .   99:     b := *bp | 
 |      .      .  100:     n := len(b) | 
 |      .      .  101:     for n+utf8.UTFMax > cap(b) { | 
 |      .      .  102:         b = append(b, 0) | 
 |      .      .  103:     } | 
 |      .      .  104:     w := utf8.EncodeRune(b[n:n+utf8.UTFMax], r) | 
 |      .      .  105:     *bp = b[:n+w] | 
 |      .      .  106:     return nil | 
 |      .      .  107: } | 
 |  | 
There are also 3 special entries that the profiler uses when it can't unwind
the stack: GC, System and ExternalCode. GC means time spent in garbage
collection; see the Memory Profiler and Garbage Collector Trace sections below
for optimization suggestions. System means time spent in the goroutine
scheduler, stack management code and other auxiliary runtime code. ExternalCode
means time spent in native dynamic libraries.
 |  | 
 | Here are some hints with respect to how to interpret what you see in the | 
 | profile. | 
 |  | 
If you see lots of time spent in the `runtime.mallocgc` function, the program
potentially makes an excessive number of small memory allocations. The profile
will tell you where the allocations are coming from. See the Memory Profiler
section for suggestions on how to optimize this case.
 |  | 
If lots of time is spent in channel operations, `sync.Mutex` code and other
synchronization primitives, or in the System component, the program probably
suffers from contention. Consider restructuring the program to eliminate
frequently accessed shared resources. Common techniques for this include
sharding/partitioning, local buffering/batching and copy-on-write.
 |  | 
If lots of time is spent in `syscall.Read/Write`, the program potentially makes
an excessive number of small reads and writes. `bufio` wrappers around os.File
or net.Conn can help in this case, as shown in the sketch below.
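
Here is a minimal sketch of such buffering (the address and messages are made
up); without the `bufio.Writer`, each `Fprintf` would issue a separate write
syscall:

    conn, err := net.Dial("tcp", "myserver:8080")
    if err != nil {
        log.Fatal(err)
    }
    w := bufio.NewWriter(conn) // accumulates small writes in a 4KB buffer
    for i := 0; i < 1000; i++ {
        fmt.Fprintf(w, "item %d\n", i) // no syscall here
    }
    if err := w.Flush(); err != nil { // a few large writes instead of 1000
        log.Fatal(err)
    }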
 |  | 
If lots of time is spent in the GC component, the program either allocates too
many transient objects or the heap size is very small, so garbage collections
happen too frequently. See the Garbage Collector Trace and Memory Profiler
sections for optimization suggestions.
 |  | 
Note: On darwin the CPU profiler currently only [works on El Capitan or newer](https://code.google.com/p/go/issues/detail?id=6047).
 |  | 
Note: On Windows you need to install Cygwin, Perl and Graphviz to generate
svg/web profiles.
 |  | 
 |  | 
 | ## Memory Profiler | 
 |  | 
The memory profiler shows what functions allocate heap memory. You can collect
it in similar ways as the CPU profile: with
[`go test --memprofile`](http://golang.org/cmd/go/#hdr-Description_of_testing_flags),
with [net/http/pprof](http://golang.org/pkg/net/http/pprof) via
http://myserver:6060/debug/pprof/heap or by calling
[runtime/pprof.WriteHeapProfile](http://golang.org/pkg/runtime/pprof/#WriteHeapProfile).
 |  | 
You can visualize only the allocations live at the time of profile collection
(the `--inuse_space` flag to `pprof`, the default), or all allocations that
have happened since program start (the `--alloc_space` flag to `pprof`). The
former is useful for profiles collected with net/http/pprof on live
applications, the latter is useful for profiles collected at program end
(otherwise you will see an almost empty profile).
 |  | 
Note: the memory profiler is sampling, that is, it collects information only
about a subset of memory allocations. The probability of sampling an object is
proportional to its size. You can change the sampling rate with the
`go test --memprofilerate` flag, or by setting the `runtime.MemProfileRate`
variable at program startup. A rate of 1 will lead to collection of information
about all allocations, but it can slow down execution. The default sampling
rate is 1 sample per 512KB of allocated memory.
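
For example, to record information about every allocation (a minimal sketch;
the rate must be set as early as possible, before the allocations of interest
happen):

    package main

    import "runtime"

    func init() {
        // Sample every allocation. This is expensive; the default rate
        // (one sample per 512KB allocated) is much cheaper.
        runtime.MemProfileRate = 1
    }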
 |  | 
You can also choose whether to visualize the number of bytes allocated or the
number of objects allocated (`--inuse/alloc_space` and `--inuse/alloc_objects`
flags, respectively). The profiler tends to sample larger objects more often
during profiling. But it's important to understand that large objects affect
memory consumption and GC time, while a large number of tiny allocations
affects execution speed (and GC time to some degree as well). So it may be
useful to look at both.
 |  | 
Objects can be persistent or transient. If you have several large persistent
objects allocated at program start, they will most likely be sampled by the
profiler (because they are large). Such objects do affect memory consumption
and GC time, but they do not affect normal execution speed (no memory
management operations happen on them). On the other hand, if you have a large
number of objects with very short lifetimes, they can be barely represented in
the profile (if you use the default `--inuse_space` mode). But they do
significantly affect execution speed, because they are constantly allocated and
freed. So, once again, it may be useful to look at both types of objects.
Generally, if you want to reduce memory consumption, look at the
`--inuse_space` profile collected during normal program operation; if you want
to improve execution speed, look at the `--alloc_objects` profile collected
after significant running time or at program end.
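
For example (the binary and profile file names are made up):

    $ go tool pprof --inuse_space mybin heap.prof     # current memory consumption
    $ go tool pprof --alloc_objects mybin heap.prof   # allocation counts since start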
 |  | 
There are several flags that control reporting granularity. `--functions` makes
pprof report at function level (the default). `--lines` makes pprof report at
source line level, which is useful if hot functions allocate on different
lines. There are also `--addresses` and `--files` for exact instruction address
and file level, respectively.
 |  | 
There is a useful option for the memory profile -- you can look at it right in
the browser (provided that you imported `net/http/pprof`). If you open
http://myserver:6060/debug/pprof/heap?debug=1 you should see a heap profile
along the lines of:
 |  | 
 |     heap profile: 4: 266528 [123: 11284472] @ heap/1048576 | 
 |     1: 262144 [4: 376832] @ 0x28d9f 0x2a201 0x2a28a 0x2624d 0x26188 0x94ca3 0x94a0b 0x17add6 0x17ae9f 0x1069d3 0xfe911 0xf0a3e 0xf0d22 0x21a70 | 
 |     #   0x2a201 cnew+0xc1   runtime/malloc.goc:718 | 
 |     #   0x2a28a runtime.cnewarray+0x3a          runtime/malloc.goc:731 | 
 |     #   0x2624d makeslice1+0x4d             runtime/slice.c:57 | 
 |     #   0x26188 runtime.makeslice+0x98          runtime/slice.c:38 | 
 |     #   0x94ca3 bytes.makeSlice+0x63            bytes/buffer.go:191 | 
 |     #   0x94a0b bytes.(*Buffer).ReadFrom+0xcb       bytes/buffer.go:163 | 
 |     #   0x17add6    io/ioutil.readAll+0x156         io/ioutil/ioutil.go:32 | 
 |     #   0x17ae9f    io/ioutil.ReadAll+0x3f          io/ioutil/ioutil.go:41 | 
 |     #   0x1069d3    godoc/vfs.ReadFile+0x133            godoc/vfs/vfs.go:44 | 
 |     #   0xfe911 godoc.funcĀ·023+0x471            godoc/meta.go:80 | 
 |     #   0xf0a3e godoc.(*Corpus).updateMetadata+0x9e     godoc/meta.go:101 | 
 |     #   0xf0d22 godoc.(*Corpus).refreshMetadataLoop+0x42    godoc/meta.go:141 | 
 |  | 
 |     2: 4096 [2: 4096] @ 0x28d9f 0x29059 0x1d252 0x1d450 0x106993 0xf1225 0xe1489 0xfbcad 0x21a70 | 
 |     #   0x1d252 newdefer+0x112              runtime/panic.c:49 | 
 |     #   0x1d450 runtime.deferproc+0x10          runtime/panic.c:132 | 
 |     #   0x106993    godoc/vfs.ReadFile+0xf3         godoc/vfs/vfs.go:43 | 
 |     #   0xf1225 godoc.(*Corpus).parseFile+0x75      godoc/parser.go:20 | 
 |     #   0xe1489 godoc.(*treeBuilder).newDirTree+0x8e9   godoc/dirtrees.go:108 | 
 |     #   0xfbcad godoc.funcĀ·002+0x15d            godoc/dirtrees.go:100 | 
 |  | 
The numbers at the beginning of each entry ("1: 262144 [4: 376832]") represent
the number of currently live objects, the amount of memory occupied by live
objects, the total number of allocations and the amount of memory occupied by
all allocations, respectively.
 |  | 
 | Optimizations are usually application-specific, but here are some common | 
 | suggestions. | 
 |  | 
1.  Combine objects into larger objects. For example, replace a `*bytes.Buffer`
    struct member with `bytes.Buffer` (you can preallocate the buffer for
    writing by calling `bytes.Buffer.Grow` later). This will reduce the number
    of memory allocations (faster) and also reduce pressure on the garbage
    collector (faster garbage collections).
 |  | 
2.  Local variables that escape their declaration scope get promoted into heap
    allocations. The compiler generally can't prove that several variables have
    the same lifetime, so it allocates each such variable separately. So you
    can use the above advice for local variables as well. For example, replace:
 |  | 
 |         for k, v := range m { | 
 |             k, v := k, v   // copy for capturing by the goroutine | 
 |             go func() { | 
 |                 // use k and v | 
 |             }() | 
 |         } | 
 |  | 
 |     with: | 
 |  | 
 |         for k, v := range m { | 
 |             x := struct{ k, v string }{k, v}   // copy for capturing by the goroutine | 
 |             go func() { | 
 |                 // use x.k and x.v | 
 |             }() | 
 |         } | 
 |  | 
 |     This replaces two memory allocations with a single allocation. However, this | 
 |     optimization usually negatively affects code readability, so use it | 
 |     reasonably. | 
 |  | 
 | 3.  A special case of allocation combining is slice array preallocation. If you | 
 |     know a typical size of the slice, you can preallocate a backing array for it | 
 |     as follows: | 
 |  | 
 |         type X struct { | 
 |             buf      []byte | 
 |             bufArray [16]byte // Buf usually does not grow beyond 16 bytes. | 
 |         } | 
 |  | 
 |         func MakeX() *X { | 
 |             x := &X{} | 
 |             // Preinitialize buf with the backing array. | 
 |             x.buf = x.bufArray[:0] | 
 |             return x | 
 |         } | 
 |  | 
 | 4.  If possible use smaller data types. For example, use `int8` instead of | 
 |     `int`. | 
 |  | 
5.  Objects that do not contain any pointers (note that strings, slices, maps
    and chans contain implicit pointers) are not scanned by the garbage
    collector. For example, a 1GB byte slice has virtually no effect on garbage
    collection time. So removing pointers from actively used objects can
    positively impact garbage collection time. Some possibilities are: replace
    pointers with indices, or split the object into two parts, one of which
    does not contain pointers.
 |  | 
6.  Use freelists to reuse transient objects and reduce the number of
    allocations. The standard library contains the
    [sync.Pool](http://tip.golang.org/pkg/sync/#Pool) type that allows you to
    reuse the same object several times between garbage collections (see the
    sketch after this list). However, be aware that, as with any manual memory
    management scheme, incorrect use of sync.Pool can lead to use-after-free
    bugs.
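
Here is a minimal sync.Pool-based freelist for `bytes.Buffer` objects (a
sketch; the `process` function and its use of the buffer are made up):

    var bufPool = sync.Pool{
        New: func() interface{} { return new(bytes.Buffer) },
    }

    func process(data []byte) {
        buf := bufPool.Get().(*bytes.Buffer)
        buf.Reset() // the buffer may contain data from a previous use
        buf.Write(data)
        // ... use buf; do not retain any reference to it after Put ...
        bufPool.Put(buf)
    }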
 |  | 
 | You can also use the Garbage Collector Trace (see below) to get some insights | 
 | into memory issues. | 
 |  | 
Note: memory profiler stats are updated in a deferred way. This is required in
order to present a consistent picture when allocations come in continuously and
frees come in batches afterwards. Several consecutive GCs push the update
pipeline forward. So if you profile a live server, any sample will give you a
consistent snapshot. However, if the program completes some activity and you
want to collect a snapshot after that activity, you need to execute 2 or 3 GCs
(e.g. by calling runtime.GC) before the collection.
 |  | 
 | ## Blocking Profiler | 
 |  | 
The blocking profiler shows where goroutines block waiting on synchronization
primitives (including timer channels). You can collect it in similar ways as
the CPU profile: with
[`go test --blockprofile`](http://golang.org/cmd/go/#hdr-Description_of_testing_flags),
with [net/http/pprof](http://golang.org/pkg/net/http/pprof) via
http://myserver:6060/debug/pprof/block or by calling
[runtime/pprof.Lookup("block").WriteTo](http://golang.org/pkg/runtime/pprof/#Profile.WriteTo).
 |  | 
But there is a significant caveat -- the blocking profiler is not enabled by
default. `go test --blockprofile` will enable it for you automatically.
However, if you use `net/http/pprof` or `runtime/pprof`, you need to enable it
manually (otherwise the profile will be empty). To enable the blocking
profiler, call
[runtime.SetBlockProfileRate](http://golang.org/pkg/runtime/#SetBlockProfileRate).
SetBlockProfileRate controls the fraction of goroutine blocking events that are
reported in the blocking profile. The profiler aims to sample an average of one
blocking event per the specified number of nanoseconds spent blocked. To
include every blocking event in the profile, set the rate to 1.
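
For example (a minimal sketch; the rate of one sampled event per microsecond
spent blocked is an arbitrary choice):

    package main

    import "runtime"

    func main() {
        // Sample, on average, one blocking event per 1000ns spent blocked.
        // Pass 1 to record every blocking event (more overhead).
        runtime.SetBlockProfileRate(1000)
        // ... run the workload, then collect the profile via
        // net/http/pprof or runtime/pprof.Lookup("block") ...
    }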
 |  | 
If a function contains several blocking operations and it's not obvious which
one leads to blocking, use the `--lines` flag to pprof.
 |  | 
Note that not all blocking is bad. When a goroutine blocks, the underlying
worker thread simply switches to another goroutine. So blocking in the
cooperative Go environment is very different from blocking on a mutex in
non-cooperative systems (e.g. typical C++ or Java threading libraries, where
blocking leads to thread idling and expensive thread context switches). To give
you some feeling for it, let's consider some examples.
 |  | 
Blocking on a time.Ticker is usually OK. If a goroutine blocks on a Ticker for
10 seconds, you will see 10 seconds of blocking in the profile, which is
perfectly fine. Blocking on `sync.WaitGroup` is frequently OK. For example, if
a task takes 10 seconds, the goroutine waiting on a WaitGroup for completion
will account for 10 seconds of blocking in the profile. Blocking on sync.Cond
may or may not be OK, depending on the situation. A consumer blocking on a
channel suggests slow producers or lack of work. A producer blocking on a
channel suggests that consumers are slower, but this is frequently OK. Blocking
on a channel-based semaphore shows how many goroutines are gated on the
semaphore. Blocking on a sync.Mutex or sync.RWMutex is usually bad. You can use
the `--ignore` flag to pprof to exclude known uninteresting blocking from a
profile during visualization.
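
For example, to hide blocking that you have already deemed benign (the regexp
and file names are illustrative):

    $ go tool pprof --ignore='WaitGroup' --text mybin block.prof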
 |  | 
 | Blocking of goroutines can lead to two negative consequences: | 
 |  | 
1.  The program does not scale with processors due to lack of work. The
    Scheduler Trace can help to identify this case.

2.  Excessive goroutine blocking/unblocking consumes CPU time. The CPU Profiler
    can help to identify this case (look at the System component).
 |  | 
 | Here are some common suggestions that can help to reduce goroutine blocking: | 
 |  | 
 | 1.  Use sufficiently buffered channels in producer-consumer scenarios. | 
 |     Unbuffered channels substantially limit available parallelism in the | 
 |     program. | 
 |  | 
2.  Use `sync.RWMutex` instead of `sync.Mutex` for read-mostly workloads.
    Readers never block other readers in `sync.RWMutex`, even at the
    implementation level.
 |  | 
3.  In some cases it's possible to remove mutexes entirely by using the
    copy-on-write technique. If the protected data structure is modified
    infrequently and it's feasible to make copies of it, then it can be updated
    as follows:
 |  | 
 |         type Config struct { | 
 |             Routes   map[string]net.Addr | 
 |             Backends []net.Addr | 
 |         } | 
 |  | 
        var config atomic.Value // actual type is *Config

        // Worker goroutines use this function to obtain the current config.
        // UpdateConfig must be called at least once before this function.
        func CurrentConfig() *Config {
            return config.Load().(*Config)
        }

        // A background goroutine periodically creates a new Config object
        // and sets it as current using this function.
        func UpdateConfig(cfg *Config) {
            config.Store(cfg)
        }
 |  | 
 |     This pattern prevents the writer from blocking readers during update. | 
 |  | 
 | 4.  Partitioning is another general technique for reducing contention/blocking | 
 |     on shared mutable data structures. Below is an example of how to partition a | 
 |     hashmap: | 
 |  | 
 |         type Partition struct { | 
 |             sync.RWMutex | 
 |             m map[string]string | 
 |         } | 
 |  | 
 |         const partCount = 64 | 
 |         var m [partCount]Partition | 
 |  | 
        func Find(k string) string {
            // hash is any string hash function, e.g. FNV-1a.
            idx := hash(k) % partCount
            part := &m[idx]
            part.RLock()
            v := part.m[k]
            part.RUnlock()
            return v
        }
 |  | 
 | 5.  Local caching and batching of updates can help to reduce contention on | 
 |     un-partitionable data structures. Below you can see how to batch sends to a | 
 |     channel: | 
 |  | 
 |         const CacheSize = 16 | 
 |  | 
 |         type Cache struct { | 
 |             buf [CacheSize]int | 
 |             pos int | 
 |         } | 
 |  | 
 |         func Send(c chan [CacheSize]int, cache *Cache, value int) { | 
 |             cache.buf[cache.pos] = value | 
 |             cache.pos++ | 
 |             if cache.pos == CacheSize { | 
 |                 c <- cache.buf | 
 |                 cache.pos = 0 | 
 |             } | 
 |         } | 
 |  | 
 |     This technique is not limited to channels. It can be used to batch updates | 
 |     to a map, batch allocations, etc. | 
 |  | 
 | 6.  Use [sync.Pool](http://tip.golang.org/pkg/sync/#Pool) for freelists instead | 
 |     of chan-based or mutex-protected freelists. sync.Pool uses smart techniques | 
 |     internally to reduce blocking. | 
 |  | 
 | ## Goroutine Profiler | 
 |  | 
The goroutine profiler simply gives you current stacks of all live goroutines
in the process. It can be handy for debugging load balancing issues (see the
Scheduler Trace section below), or for debugging deadlocks. The profile makes
sense only for a running app, so go test does not expose it. You can collect
the profile with [net/http/pprof](http://golang.org/pkg/net/http/pprof) via
http://myserver:6060/debug/pprof/goroutine and visualize it to svg/pdf, or
collect it by calling
[runtime/pprof.Lookup("goroutine").WriteTo](http://golang.org/pkg/runtime/pprof/#Profile.WriteTo).
But the most useful way is to open
http://myserver:6060/debug/pprof/goroutine?debug=2 in your browser, which will
give you symbolized stacks similar to what you see when a program crashes. Note
that goroutines in "syscall" state consume an OS thread, while other goroutines
do not (except for goroutines that called runtime.LockOSThread, which is,
unfortunately, not visible in the profile). Note that goroutines in "IO wait"
state also do not consume threads; they are parked on the non-blocking network
poller (which uses epoll/kqueue/GetQueuedCompletionStatus to unpark goroutines
later).
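
For a program without an HTTP server, the same symbolized dump can be written
directly (a minimal sketch; debug=2 selects fully symbolized, crash-style
stacks):

    // Write current stacks of all live goroutines to stderr.
    if err := pprof.Lookup("goroutine").WriteTo(os.Stderr, 2); err != nil {
        log.Fatal(err)
    }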
 |  | 
 | ## Garbage Collector Trace | 
 |  | 
Aside from the profiling tools, there is another kind of tool available --
tracers. They allow you to trace garbage collections, the memory allocator and
the goroutine scheduler state. To enable the garbage collector (GC) trace, run
the program with the `GODEBUG=gctrace=1` environment variable:
 |  | 
 |     $ GODEBUG=gctrace=1 ./myserver | 
 |  | 
 | Then the program will print output similar to the following during execution: | 
 |  | 
 |     gc9(2): 12+1+744+8 us, 2 -> 10 MB, 108615 (593983-485368) objects, 4825/3620/0 sweeps, 0(0) handoff, 6(91) steal, 16/1/0 yields | 
 |     gc10(2): 12+6769+767+3 us, 1 -> 1 MB, 4222 (593983-589761) objects, 4825/0/1898 sweeps, 0(0) handoff, 6(93) steal, 16/10/2 yields | 
 |     gc11(2): 799+3+2050+3 us, 1 -> 69 MB, 831819 (1484009-652190) objects, 4825/691/0 sweeps, 0(0) handoff, 5(105) steal, 16/1/0 yields | 
 |  | 
Let's consider the meaning of these numbers. One line is printed per GC. The
first number ("gc9") is the GC number (this is the 9th GC since program start).
The number in parens ("(2)") is the number of worker threads that participated
in the GC. The next 4 numbers ("12+1+744+8 us") are the times for
stop-the-world, sweeping, marking and waiting for worker threads to finish, in
microseconds, respectively. The next 2 numbers ("2 -> 10 MB") are the size of
the live heap after the previous GC and the full heap size (including garbage)
before the current GC. The next 3 numbers ("108615 (593983-485368) objects")
are the total number of objects in the heap (including garbage) and the total
numbers of memory allocation and free operations. The next 3 numbers
("4825/3620/0 sweeps") characterize the sweep phase (of the previous GC): there
were a total of 4825 memory spans, 3620 were swept on demand or in background,
0 were swept during the stop-the-world phase (the rest were unused spans). The
next 4 numbers ("0(0) handoff, 6(91) steal") characterize load balancing during
the parallel mark phase: there were 0 object handoff operations (0 objects were
handed off), and 6 steal operations (91 objects were stolen). The last 3
numbers ("16/1/0 yields") characterize the effectiveness of the parallel mark
phase: there were a total of 17 yield operations while waiting for another
thread.
 |  | 
The GC is a [mark-and-sweep
collector](http://www.brpreiss.com/books/opus5/html/page424.html). Total GC
time can be expressed as:
 |  | 
 |     Tgc = Tseq + Tmark + Tsweep | 
 |  | 
where Tseq is the time to stop user goroutines plus some preparation activities
(usually small); Tmark is the heap marking time, and since marking happens when
all user goroutines are stopped, it can significantly affect latency of
processing; Tsweep is the heap sweeping time, and since sweeping generally
happens concurrently with normal program execution, it is not as critical for
latency.
 |  | 
 | Marking time can be approximately expressed as: | 
 |  | 
 |     Tmark = C1*Nlive + C2*MEMlive_ptr + C3*Nlive_ptr | 
 |  | 
 | where Nlive is the number of live objects in the heap during GC, `MEMlive_ptr` | 
 | is the amount of memory occupied by live objects with pointers, `Nlive_ptr` is | 
 | the number of pointers in live objects. | 
 |  | 
 | Sweeping time can be approximately expressed as: | 
 |  | 
 |     Tsweep = C4*MEMtotal + C5*MEMgarbage | 
 |  | 
 | where `MEMtotal` is the total amount of heap memory, `MEMgarbage` is the amount | 
 | of garbage in the heap. | 
 |  | 
The next GC happens after the program has allocated an extra amount of memory
proportional to the amount already in use. The proportion is controlled by the
GOGC environment variable (100 by default). If GOGC=100 and the program is
using 4M of heap memory, the runtime will trigger a GC again when the program
gets to 8M. This keeps GC cost in linear proportion to the allocation cost.
Adjusting GOGC changes the linear constant and also the amount of extra memory
used.
 |  | 
Only sweeping depends on the total size of the heap, and sweeping happens
concurrently with normal program execution. So it can make sense to set GOGC to
a higher value (200, 300, 500, etc) if you can afford the extra memory
consumption. For example, GOGC=300 can reduce garbage collection overhead by up
to 2 times while keeping latencies the same (at the cost of a 2 times larger
heap).
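
GOGC can also be adjusted from within the program via
[runtime/debug.SetGCPercent](http://golang.org/pkg/runtime/debug/#SetGCPercent);
a minimal sketch:

    import "runtime/debug"

    func init() {
        // Same effect as running with GOGC=300: fewer garbage
        // collections at the cost of a larger heap.
        debug.SetGCPercent(300)
    }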
 |  | 
The GC is parallel and generally scales well with hardware parallelism. So it
can make sense to set GOMAXPROCS to a higher value even for sequential
programs, just to speed up garbage collections. However, note that the number
of garbage collector threads is currently bounded by 8.
 |  | 
 | ## Memory Allocator Trace | 
 |  | 
The memory allocator trace simply dumps all memory allocation and free
operations onto the console. It's enabled with the GODEBUG=allocfreetrace=1
environment variable. The output looks along the lines of:
 |  | 
 |     tracealloc(0xc208062500, 0x100, array of parse.Node) | 
 |     goroutine 16 [running]: | 
 |     runtime.mallocgc(0x100, 0x3eb7c1, 0x0) | 
 |         runtime/malloc.goc:190 +0x145 fp=0xc2080b39f8 | 
 |     runtime.growslice(0x31f840, 0xc208060700, 0x8, 0x8, 0x1, 0x0, 0x0, 0x0) | 
 |         runtime/slice.goc:76 +0xbb fp=0xc2080b3a90 | 
 |     text/template/parse.(*Tree).parse(0xc2080820e0, 0xc208023620, 0x0, 0x0) | 
 |         text/template/parse/parse.go:289 +0x549 fp=0xc2080b3c50 | 
 |     ... | 
 |  | 
 |     tracefree(0xc208002d80, 0x120) | 
 |     goroutine 16 [running]: | 
 |     runtime.MSpan_Sweep(0x73b080) | 
 |             runtime/mgc0.c:1880 +0x514 fp=0xc20804b8f0 | 
 |     runtime.MCentral_CacheSpan(0x69c858) | 
 |             runtime/mcentral.c:48 +0x2b5 fp=0xc20804b920 | 
 |     runtime.MCache_Refill(0x737000, 0xc200000012) | 
 |             runtime/mcache.c:78 +0x119 fp=0xc20804b950 | 
 |     ... | 
 |  | 
Each trace entry contains the address of the memory block, its size, type,
goroutine id and the stack trace. The trace is probably more useful for
debugging, but it can give very fine-grained info for allocation optimizations
as well.
 |  | 
 | ## Scheduler Trace | 
 |  | 
The scheduler trace can provide insights into the dynamic behavior of the
goroutine scheduler and allows you to debug load balancing and scalability
issues. To enable the scheduler trace, run the program with the
GODEBUG=schedtrace=1000 environment variable (the value is the period of
output, in ms; in this case, once per second):
 |  | 
 |     $ GODEBUG=schedtrace=1000 ./myserver | 
 |  | 
 | Then the program will print output similar to the following during execution: | 
 |  | 
 |     SCHED 1004ms: gomaxprocs=4 idleprocs=0 threads=11 idlethreads=4 runqueue=8 [0 1 0 3] | 
 |     SCHED 2005ms: gomaxprocs=4 idleprocs=0 threads=11 idlethreads=5 runqueue=6 [1 5 4 0] | 
 |     SCHED 3008ms: gomaxprocs=4 idleprocs=0 threads=11 idlethreads=4 runqueue=10 [2 2 2 1] | 
 |  | 
 | The first number ("1004ms") is time since program start. Gomaxprocs is the | 
 | current value of GOMAXPROCS. Idleprocs is the number of idling processors (the | 
 | rest are executing Go code). Threads is the total number of worker threads | 
 | created by the scheduler (threads can be in 3 states: execute Go code | 
 | (gomaxprocs-idleprocs), execute syscalls/cgocalls or idle). Idlethreads is the | 
 | number of idling worker threads. Runqueue is the length of global queue with | 
 | runnable goroutines. The numbers in square brackets ("[0 1 0 3]") are lengths of | 
 | per-processor queues with runnable goroutines. Sum of lengths of global and | 
 | local queues represents the total number of goroutines available for execution. | 
 |  | 
 | Note: You can combine any of the tracers as | 
 | GODEBUG=gctrace=1,allocfreetrace=1,schedtrace=1000. | 
 |  | 
 | Note: There is also detailed scheduler trace, which you can enable with | 
 | GODEBUG=schedtrace=1000,scheddetail=1. It prints detailed info about every | 
 | goroutine, worker thread and processor. We won't describe its format here as | 
 | it's mainly useful for scheduler developers; but you can find details in | 
 | [src/pkg/runtime/proc.c](https://code.google.com/p/go/source/browse/src/pkg/runtime/proc.c). | 
 |  | 
The scheduler trace is useful when a program does not scale linearly with
GOMAXPROCS and/or does not consume 100% of CPU time. The ideal situation is
when all processors are busy executing Go code, the number of threads is
reasonable, there is plenty of work in all queues and the work is reasonably
evenly distributed:
 |  | 
 |     gomaxprocs=8 idleprocs=0 threads=40 idlethreads=5 runqueue=10 [20 20 20 20 20 20 20 20] | 
 |  | 
A bad situation is when some of the above does not hold. For example, the
following sample demonstrates a shortage of work to keep all processors busy:
 |  | 
 |     gomaxprocs=8 idleprocs=6 threads=40 idlethreads=30 runqueue=0 [0 2 0 0 0 1 0 0] | 
 |  | 
Note: use OS-provided means to measure actual CPU utilization as the ultimate
characteristic. On the Unix family of operating systems it is the top command;
on Windows it is the Task Manager.
 |  | 
You can use the goroutine profiler to understand where goroutines block in the
case of a work shortage. Note that load imbalance is not ultimately bad as long
as all processors are busy; it will just cause some moderate load balancing
overhead.
 |  | 
 | ## Memory Statistics | 
 |  | 
The Go runtime exposes coarse-grained memory statistics via the
[runtime.ReadMemStats](http://golang.org/pkg/runtime/#ReadMemStats) function.
The statistics are also exposed via net/http/pprof at the bottom of
http://myserver:6060/debug/pprof/heap?debug=1. The statistics are described
[here](http://golang.org/pkg/runtime/#MemStats). Some of the interesting fields
are (a sketch of reading them programmatically follows the list):
 |  | 
 | 1.  HeapAlloc - current heap size. | 
 | 2.  HeapSys - total heap size. | 
 | 3.  HeapObjects - total number of objects in the heap. | 
4.  HeapReleased - amount of memory released to the OS; the runtime releases
    memory back to the OS when it has been unused for 5 minutes, and you can
    force this process with runtime/debug.FreeOSMemory.
5.  Sys - total amount of memory allocated from the OS.
6.  Sys-HeapReleased - effective memory consumption of the program.
7.  StackSys - memory consumed for goroutine stacks (note that some stacks are
    allocated from the heap and are not accounted for here; unfortunately there
    is no way to get the total size of stacks
    (https://code.google.com/p/go/issues/detail?id=7468)).
8.  MSpanSys/MCacheSys/BuckHashSys/GCSys/OtherSys - amounts of memory allocated
    by the runtime for various auxiliary purposes; they are generally not
    interesting, unless they are too high.
9.  PauseNs - durations of recent garbage collections.
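
A minimal sketch of reading these statistics programmatically (the field choice
is illustrative):

    package main

    import (
        "fmt"
        "runtime"
    )

    func main() {
        var ms runtime.MemStats
        runtime.ReadMemStats(&ms) // briefly stops the world; avoid hot paths
        fmt.Printf("heap in use: %v MB\n", ms.HeapAlloc>>20)
        // Effective consumption: memory obtained from the OS minus
        // memory already released back to it.
        fmt.Printf("effective: %v MB\n", (ms.Sys-ms.HeapReleased)>>20)
    }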
 |  | 
 | ## Heap Dumper | 
 |  | 
The last available tool is the heap dumper; it can write the state of the whole
heap into a file for future exploration. It can be useful for identifying
memory leaks and getting insights into program memory consumption.
 |  | 
First, you need to write the dump using the
[runtime/debug.WriteHeapDump](http://tip.golang.org/pkg/runtime/debug/#WriteHeapDump)
function:

        f, err := os.Create("heapdump")
        if err != nil {
            log.Fatal(err)
        }
        debug.WriteHeapDump(f.Fd())
        f.Close()
 |  | 
Then you can either render it to a dot file with a graphical representation of
the heap or convert it to hprof format. To render it to a dot file:
 |  | 
 |     $ go get github.com/randall77/hprof/dumptodot | 
 |     $ dumptodot heapdump mybinary > heap.dot | 
 |  | 
 | and open `heap.dot` with Graphviz. | 
 |  | 
 | To convert it to `hprof` format: | 
 |  | 
 |     $ go get github.com/randall77/hprof/dumptohprof | 
 |     $ dumptohprof heapdump heap.hprof | 
 |     $ jhat heap.hprof | 
 |  | 
 | and navigate your browser to http://myserver:7000. | 
 |  | 
 | ## Concluding Remarks | 
 |  | 
Optimization is an open problem; there are simple recipes that you can use to
improve performance, but sometimes optimization requires a complete
re-architecture of the program. We hope that the tools described here will be a
valuable addition to your toolbox, and that you can use them to at least
analyze and understand what happens.
[Profiling Go Programs](http://blog.golang.org/profiling-go-programs) is a good
tutorial on using the CPU and memory profilers to optimize a simple program.