Andrew Gerrand | 2a18984 | 2011-08-17 15:53:17 +1000 | [diff] [blame] | 1 | <!-- |
| 2 | Copyright 2011 The Go Authors. All rights reserved. |
| 3 | Use of this source code is governed by a BSD-style |
| 4 | license that can be found in the LICENSE file. |
| 5 | --> |
| 6 | |
| 7 | <codewalk title="Generating arbitrary text: a Markov chain algorithm"> |
| 8 | |
| 9 | <step title="Introduction" src="doc/codewalk/markov.go:/Generating/,/line\./"> |
| 10 | This codewalk describes a program that generates random text using |
| 11 | a Markov chain algorithm. The package comment describes the algorithm |
| 12 | and the operation of the program. Please read it before continuing. |
| 13 | </step> |
| 14 | |
| 15 | <step title="Modeling Markov chains" src="doc/codewalk/markov.go:/ chain/"> |
| 16 | A chain consists of a prefix and a suffix. Each prefix is a set |
| 17 | number of words, while a suffix is a single word. |
| 18 | A prefix can have an arbitrary number of suffixes. |
| 19 | To model this data, we use a <code>map[string][]string</code>. |
| 20 | Each map key is a prefix (a <code>string</code>) and its values are |
| 21 | lists of suffixes (a slice of strings, <code>[]string</code>). |
| 22 | <br/><br/> |
| 23 | Here is the example table from the package comment |
| 24 | as modeled by this data structure: |
| 25 | <pre> |
| 26 | map[string][]string{ |
| 27 | " ": {"I"}, |
| 28 | " I": {"am"}, |
| 29 | "I am": {"a", "not"}, |
| 30 | "a free": {"man!"}, |
| 31 | "am a": {"free"}, |
| 32 | "am not": {"a"}, |
| 33 | "a number!": {"I"}, |
| 34 | "number! I": {"am"}, |
| 35 | "not a": {"number!"}, |
| 36 | }</pre> |
| 37 | While each prefix consists of multiple words, we |
| 38 | store prefixes in the map as a single <code>string</code>. |
| 39 | It would seem more natural to store the prefix as a |
| 40 | <code>[]string</code>, but we can't do this with a map because the |
| 41 | key type of a map must implement equality (and slices do not). |
| 42 | <br/><br/> |
| 43 | Therefore, in most of our code we will model prefixes as a |
| 44 | <code>[]string</code> and join the strings together with a space |
| 45 | to generate the map key: |
| 46 | <pre> |
| 47 | Prefix Map key |
| 48 | |
| 49 | []string{"", ""} " " |
| 50 | []string{"", "I"} " I" |
| 51 | []string{"I", "am"} "I am" |
| 52 | </pre> |
| 53 | </step> |
| 54 | |
| 55 | <step title="The Chain struct" src="doc/codewalk/markov.go:/type Chain/,/}/"> |
| 56 | The complete state of the chain table consists of the table itself and |
| 57 | the word length of the prefixes. The <code>Chain</code> struct stores |
| 58 | this data. |
| 59 | </step> |
| 60 | |
| 61 | <step title="The NewChain constructor function" src="doc/codewalk/markov.go:/func New/,/\n}/"> |
| 62 | The <code>Chain</code> struct has two unexported fields (those that |
| 63 | do not begin with an upper case character), and so we write a |
| 64 | <code>NewChain</code> constructor function that initializes the |
| 65 | <code>chain</code> map with <code>make</code> and sets the |
| 66 | <code>prefixLen</code> field. |
| 67 | <br/><br/> |
| 68 | This constructor function is not strictly necessary as this entire |
| 69 | program is within a single package (<code>main</code>) and therefore |
| 70 | there is little practical difference between exported and unexported |
| 71 | fields. We could just as easily write out the contents of this function |
| 72 | when we want to construct a new Chain. |
| 73 | But using these unexported fields is good practice; it clearly denotes |
| 74 | that only methods of Chain and its constructor function should access |
| 75 | those fields. Also, structuring <code>Chain</code> like this means we |
| 76 | could easily move it into its own package at some later date. |
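<br/><br/>
As a rough sketch of the constructor described in this step (reconstructed
from the prose above, so the code in <code>markov.go</code> may differ in
detail), the struct and its constructor could look like this:

```go
package main

import "fmt"

// Chain holds the chain table and the prefix length, as described above.
type Chain struct {
	chain     map[string][]string // prefix key -> list of suffixes
	prefixLen int                 // number of words in each prefix
}

// NewChain returns a Chain whose map has been initialized with make.
func NewChain(prefixLen int) *Chain {
	return &Chain{
		chain:     make(map[string][]string),
		prefixLen: prefixLen,
	}
}

func main() {
	c := NewChain(2)
	fmt.Println(len(c.chain), c.prefixLen) // prints "0 2"
}
```

Initializing the map in the constructor matters because assigning to a
key of a <code>nil</code> map panics at run time.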
| 77 | </step> |
| 78 | |
| 79 | <step title="The Prefix type" src="doc/codewalk/markov.go:/type Prefix/"> |
| 80 | Since we'll be working with prefixes often, we define a |
| 81 | <code>Prefix</code> type with the concrete type <code>[]string</code>. |
| 82 | Defining a named type allows us to be explicit when we are |
| 83 | working with a prefix instead of just a <code>[]string</code>. |
| 84 | Also, in Go we can define methods on any named type (not just structs), |
| 85 | so we can add methods that operate on <code>Prefix</code> if we need to. |
| 86 | </step> |
| 87 | |
| 88 | <step title="The String method" src="doc/codewalk/markov.go:/func[^\n]+String/,/}/"> |
| 89 | The first method we define on <code>Prefix</code> is |
| 90 | <code>String</code>. It returns a <code>string</code> representation |
| 91 | of a <code>Prefix</code> by joining the slice elements together with |
| 92 | spaces. We will use this method to generate keys when working with |
| 93 | the chain map. |
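<br/><br/>
A minimal sketch of such a method, assuming it is implemented with
<code>strings.Join</code> (the text says only that the elements are joined
with spaces), might read:

```go
package main

import (
	"fmt"
	"strings"
)

// Prefix is a sequence of words used as the left-hand side of the chain.
type Prefix []string

// String joins the words of the prefix with single spaces.
func (p Prefix) String() string {
	return strings.Join(p, " ")
}

func main() {
	fmt.Printf("%q\n", Prefix{"", ""}.String())    // prints " "
	fmt.Printf("%q\n", Prefix{"", "I"}.String())   // prints " I"
	fmt.Printf("%q\n", Prefix{"I", "am"}.String()) // prints "I am"
}
```

The three results match the map keys listed in the table of prefixes
earlier in this codewalk.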
| 94 | </step> |
| 95 | |
| 96 | <step title="Building the chain" src="doc/codewalk/markov.go:/func[^\n]+Build/,/\n}/"> |
| 97 | The <code>Build</code> method reads text from an <code>io.Reader</code> |
| 98 | and parses it into prefixes and suffixes that are stored in the |
| 99 | <code>Chain</code>. |
| 100 | <br/><br/> |
| 101 | The <code><a href="/pkg/io/#Reader">io.Reader</a></code> is an |
| 102 | interface type that is widely used by the standard library and |
| 103 | other Go code. Our code uses the |
| 104 | <code><a href="/pkg/fmt/#Fscan">fmt.Fscan</a></code> function, which |
| 105 | reads space-separated values from an <code>io.Reader</code>. |
| 106 | <br/><br/> |
| 107 | The <code>Build</code> method returns once the <code>Reader</code>'s |
| 108 | <code>Read</code> method returns <code>io.EOF</code> (end of file) |
| 109 | or some other read error occurs. |
| 110 | </step> |
| 111 | |
| 112 | <step title="Buffering the input" src="doc/codewalk/markov.go:/bufio\.NewReader/"> |
| 113 | This function does many small reads, which can be inefficient for some |
| 114 | <code>Readers</code>. For efficiency we wrap the provided |
| 115 | <code>io.Reader</code> with |
| 116 | <code><a href="/pkg/bufio/">bufio.NewReader</a></code> to create a |
| 117 | new <code>io.Reader</code> that provides buffering. |
| 118 | </step> |
| 119 | |
| 120 | <step title="The Prefix variable" src="doc/codewalk/markov.go:/make\(Prefix/"> |
| 121 | At the top of the function we make a <code>Prefix</code> slice |
| 122 | <code>p</code> using the <code>Chain</code>'s <code>prefixLen</code> |
| 123 | field as its length. |
| 124 | We'll use this variable to hold the current prefix and mutate it with |
| 125 | each new word we encounter. |
| 126 | </step> |
| 127 | |
| 128 | <step title="Scanning words" src="doc/codewalk/markov.go:/var s string/,/\n }/"> |
| 129 | In our loop we read words from the <code>Reader</code> into a |
| 130 | <code>string</code> variable <code>s</code> using |
| 131 | <code>fmt.Fscan</code>. Since <code>Fscan</code> uses white space to |
| 132 | separate each input value, each call will yield just one word |
| 133 | (including punctuation), which is exactly what we need. |
| 134 | <br/><br/> |
| 135 | <code>Fscan</code> returns an error if it encounters a read error |
| 136 | (<code>io.EOF</code>, for example) or if it can't scan the requested |
| 137 | value (in our case, a single string). In either case we just want to |
| 138 | stop scanning, so we <code>break</code> out of the loop. |
| 139 | </step> |
| 140 | |
| 141 | <step title="Adding a prefix and suffix to the chain" src="doc/codewalk/markov.go:/ key/,/key\], s\)"> |
| 142 | The word stored in <code>s</code> is a new suffix. We add the new |
| 143 | prefix/suffix combination to the <code>chain</code> map by computing |
| 144 | the map key with <code>p.String</code> and appending the suffix |
| 145 | to the slice stored under that key. |
| 146 | <br/><br/> |
| 147 | The built-in <code>append</code> function appends elements to a slice |
| 148 | and allocates new storage when necessary. When the provided slice is |
| 149 | <code>nil</code>, <code>append</code> allocates a new slice. |
| 150 | This behavior conveniently ties in with the semantics of our map: |
| 151 | retrieving an unset key returns the zero value of the value type and |
| 152 | the zero value of <code>[]string</code> is <code>nil</code>. |
| 153 | When our program encounters a new prefix (yielding a <code>nil</code> |
| 154 | value in the map) <code>append</code> will allocate a new slice. |
| 155 | <br/><br/> |
| 156 | For more information about the <code>append</code> function and slices |
| 157 | in general see the |
| 158 | <a href="/doc/articles/slices_usage_and_internals.html">Slices: usage and internals</a> article. |
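<br/><br/>
The interplay between missing map keys and <code>append</code> can be
seen in a small stand-alone sketch. The helper name <code>add</code> is
hypothetical and used here only for illustration:

```go
package main

import "fmt"

// add is a hypothetical helper that records suffix under key.
// If key is not yet present, chain[key] is a nil slice and
// append allocates a new one.
func add(chain map[string][]string, key, suffix string) {
	chain[key] = append(chain[key], suffix)
}

func main() {
	chain := make(map[string][]string)
	fmt.Println(chain["I am"] == nil) // prints "true": the key is unset

	add(chain, "I am", "a")
	add(chain, "I am", "not")
	fmt.Println(chain["I am"]) // prints "[a not]"
}
```

No explicit check for the key's presence is needed before appending.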
| 159 | </step> |
| 160 | |
| 161 | <step title="Pushing the suffix onto the prefix" src="doc/codewalk/markov.go:/p\.Shift/"> |
| 162 | Before reading the next word our algorithm requires us to drop the |
| 163 | first word from the prefix and push the current suffix onto the prefix. |
| 164 | <br/><br/> |
| 165 | When in this state |
| 166 | <pre> |
| 167 | p == Prefix{"I", "am"} |
| 168 | s == "not" </pre> |
| 169 | the new value for <code>p</code> would be |
| 170 | <pre> |
| 171 | p == Prefix{"am", "not"}</pre> |
| 172 | This operation is also required during text generation so we put |
| 173 | the code to perform this mutation of the slice inside a method on |
| 174 | <code>Prefix</code> named <code>Shift</code>. |
| 175 | </step> |
| 176 | |
| 177 | <step title="The Shift method" src="doc/codewalk/markov.go:/func[^\n]+Shift/,/\n}/"> |
| 178 | The <code>Shift</code> method uses the built-in <code>copy</code> |
| 179 | function to copy the last <code>len(p)-1</code> elements of <code>p</code> to |
| 180 | the start of the slice, effectively moving the elements |
| 181 | one index to the left (if you consider zero as the leftmost index). |
| 182 | <pre> |
| 183 | p := Prefix{"I", "am"} |
| 184 | copy(p, p[1:]) |
| 185 | // p == Prefix{"am", "am"}</pre> |
| 186 | We then assign the provided <code>word</code> to the last index |
| 187 | of the slice: |
| 188 | <pre> |
| 189 | // word == "not" |
| 190 | p[len(p)-1] = word |
| 191 | // p == Prefix{"am", "not"}</pre> |
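<br/><br/>
Put together, the two statements form a complete method. The following is
a self-contained sketch assembled from the fragments above, with a small
<code>main</code> added for demonstration:

```go
package main

import "fmt"

// Prefix is a sequence of words.
type Prefix []string

// Shift drops the first word of p and stores word in the last position.
func (p Prefix) Shift(word string) {
	copy(p, p[1:])     // move every element one index to the left
	p[len(p)-1] = word // overwrite the now-duplicated last element
}

func main() {
	p := Prefix{"I", "am"}
	p.Shift("not")
	fmt.Println(p) // prints "[am not]"
}
```

Although <code>Shift</code> has a value receiver, the caller sees the
change: a slice value refers to an underlying array, and the method
modifies the elements of that array rather than the slice header.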
| 192 | </step> |
| 193 | |
| 194 | <step title="Generating text" src="doc/codewalk/markov.go:/func[^\n]+Generate/,/\n}/"> |
| 195 | The <code>Generate</code> method is similar to <code>Build</code> |
| 196 | except that instead of reading words from a <code>Reader</code> |
| 197 | and storing them in a map, it reads words from the map and |
| 198 | appends them to a slice (<code>words</code>). |
| 199 | <br/><br/> |
| 200 | <code>Generate</code> uses a conditional for loop to generate |
| 201 | up to <code>n</code> words. |
| 202 | </step> |
| 203 | |
| 204 | <step title="Getting potential suffixes" src="doc/codewalk/markov.go:/choices/,/}\n/"> |
| 205 | At each iteration of the loop we retrieve a list of potential suffixes |
| 206 | for the current prefix. We access the <code>chain</code> map at key |
| 207 | <code>p.String()</code> and assign its contents to <code>choices</code>. |
| 208 | <br/><br/> |
| 209 | If <code>len(choices)</code> is zero we break out of the loop as there |
| 210 | are no potential suffixes for that prefix. |
| 211 | This test also works if the key isn't present in the map at all: |
| 212 | in that case, <code>choices</code> will be <code>nil</code> and the |
| 213 | length of a <code>nil</code> slice is zero. |
| 214 | </step> |
| 215 | |
| 216 | <step title="Choosing a suffix at random" src="doc/codewalk/markov.go:/next := choices/,/Shift/"> |
| 217 | To choose a suffix we use the |
| 218 | <code><a href="/pkg/math/rand/#Intn">rand.Intn</a></code> function. |
| 219 | It returns a pseudo-random integer from zero up to (but not including) the provided |
| 220 | value. Passing in <code>len(choices)</code> gives us a random index |
| 221 | into the full length of the list. |
| 222 | <br/><br/> |
| 223 | We use that index to pick our new suffix, assign it to |
| 224 | <code>next</code> and append it to the <code>words</code> slice. |
| 225 | <br/><br/> |
| 226 | Next, we <code>Shift</code> the new suffix onto the prefix just as |
| 227 | we did in the <code>Build</code> method. |
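<br/><br/>
The length check and the random choice can be sketched together. The
helper <code>pick</code> below is hypothetical; <code>Generate</code>
performs these steps inline rather than through a separate function.

```go
package main

import (
	"fmt"
	"math/rand"
)

// pick is a hypothetical helper: it returns a randomly chosen element
// of choices, or false if there is nothing to choose from.
func pick(choices []string) (string, bool) {
	if len(choices) == 0 { // also true when choices is nil
		return "", false
	}
	return choices[rand.Intn(len(choices))], true
}

func main() {
	chain := map[string][]string{"I am": {"a", "not"}}

	next, ok := pick(chain["I am"])
	fmt.Println(next, ok) // "a" or "not", followed by true

	_, ok = pick(chain["no such prefix"])
	fmt.Println(ok) // prints "false"
}
```

Because <code>rand.Intn(n)</code> never returns <code>n</code> itself, the
index is always within the bounds of the slice.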
| 228 | </step> |
| 229 | |
| 230 | <step title="Returning the generated text" src="doc/codewalk/markov.go:/Join\(words/"> |
| 231 | Before returning the generated text as a string, we use the |
| 232 | <code>strings.Join</code> function to join the elements of |
| 233 | the <code>words</code> slice together, separated by spaces. |
| 234 | </step> |
| 235 | |
| 236 | <step title="Command-line flags" src="doc/codewalk/markov.go:/Register command-line flags/,/prefixLen/"> |
| 237 | To make it easy to tweak the prefix and generated text lengths we |
| 238 | use the <code><a href="/pkg/flag/">flag</a></code> package to parse |
| 239 | command-line flags. |
| 240 | <br/><br/> |
| 241 | These calls to <code>flag.Int</code> register new flags with the |
| 242 | <code>flag</code> package. The arguments to <code>Int</code> are the |
| 243 | flag name, its default value, and a description. The <code>Int</code> |
| 244 | function returns a pointer to an integer that will contain the |
| 245 | user-supplied value (or the default value if the flag was omitted on |
| 246 | the command-line). |
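<br/><br/>
To experiment with flag registration without touching the real command
line, one option is a <code>flag.FlagSet</code>, whose <code>Int</code>
and <code>Parse</code> methods mirror the package-level functions. In the
sketch below the helper <code>parseFlags</code>, the default values, and
the usage strings are assumptions made for illustration; only the flag
names <code>words</code> and <code>prefix</code> come from this codewalk.

```go
package main

import (
	"flag"
	"fmt"
)

// parseFlags is a hypothetical helper that parses args with a private
// FlagSet. The defaults (100 and 2) are assumed for this example.
func parseFlags(args []string) (numWords, prefixLen int, err error) {
	fs := flag.NewFlagSet("markov", flag.ContinueOnError)
	w := fs.Int("words", 100, "maximum number of words to print")
	p := fs.Int("prefix", 2, "prefix length in words")
	if err := fs.Parse(args); err != nil {
		return 0, 0, err
	}
	// The pointers are only meaningful after Parse has run.
	return *w, *p, nil
}

func main() {
	w, p, err := parseFlags([]string{"-prefix=1", "-words=10"})
	fmt.Println(w, p, err) // prints "10 1 <nil>"
}
```

The pointers returned by <code>Int</code> hold the default values until
<code>Parse</code> is called, so they should be dereferenced only after
parsing.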
| 247 | </step> |
| 248 | |
| 249 | <step title="Program set up" src="doc/codewalk/markov.go:/flag.Parse/,/rand.Seed/"> |
| 250 | The <code>main</code> function begins by parsing the command-line |
| 251 | flags with <code>flag.Parse</code> and seeding the <code>rand</code> |
| 252 | package's random number generator with the current time. |
| 253 | <br/><br/> |
| 254 | If the command-line flags provided by the user are invalid the |
| 255 | <code>flag.Parse</code> function will print an informative usage |
| 256 | message and terminate the program. |
| 257 | </step> |
| 258 | |
| 259 | <step title="Creating and building a new Chain" src="doc/codewalk/markov.go:/c := NewChain/,/c\.Build/"> |
| 260 | To create the new <code>Chain</code> we call <code>NewChain</code> |
| 261 | with the value of the <code>prefix</code> flag. |
| 262 | <br/><br/> |
| 263 | To build the chain we call <code>Build</code> with |
| 264 | <code>os.Stdin</code> (which implements <code>io.Reader</code>) so |
| 265 | that it will read its input from standard input. |
| 266 | </step> |
| 267 | |
| 268 | <step title="Generating and printing text" src="doc/codewalk/markov.go:/c\.Generate/,/fmt.Println/"> |
| 269 | Finally, to generate text we call <code>Generate</code> with |
| 270 | the value of the <code>words</code> flag and assign the result |
| 271 | to the variable <code>text</code>. |
| 272 | <br/><br/> |
| 273 | Then we call <code>fmt.Println</code> to write the text to standard |
| 274 | output, followed by a newline. |
| 275 | </step> |
| 276 | |
| 277 | <step title="Using this program" src="doc/codewalk/markov.go"> |
| 278 | To use this program, first build it with the |
| 279 | <a href="/cmd/go/">go</a> command: |
| 280 | <pre> |
| 281 | $ go build markov.go</pre> |
| 282 | And then execute it while piping in some input text: |
| 283 | <pre> |
| 284 | $ echo "a man a plan a canal panama" \ |
| 285 | | ./markov -prefix=1 |
| 286 | a plan a man a plan a canal panama</pre> |
| 287 | Here's a transcript of generating some text using the Go distribution's |
| 288 | README file as source material: |
| 289 | <pre> |
| 290 | $ ./markov -words=10 < $GOROOT/README |
| 291 | This is the source code repository for the Go source |
| 292 | $ ./markov -prefix=1 -words=10 < $GOROOT/README |
| 293 | This is the go directory (the one containing this README). |
| 294 | $ ./markov -prefix=1 -words=10 < $GOROOT/README |
| 295 | This is the variable if you have just untarred a</pre> |
| 296 | </step> |
| 297 | |
| 298 | <step title="An exercise for the reader" src="doc/codewalk/markov.go"> |
| 299 | The <code>Generate</code> function does a lot of allocations when it |
| 300 | builds the <code>words</code> slice. As an exercise, modify it to |
| 301 | take an <code>io.Writer</code> to which it incrementally writes the |
| 302 | generated text with <code>fmt.Fprint</code>. |
| 303 | Aside from being more efficient this makes <code>Generate</code> |
| 304 | more symmetrical to <code>Build</code>. |
| 305 | </step> |
| 306 | |
| 307 | </codewalk> |