<!--
Copyright 2011 The Go Authors. All rights reserved.
Use of this source code is governed by a BSD-style
license that can be found in the LICENSE file.
-->

<codewalk title="Generating arbitrary text: a Markov chain algorithm">

<step title="Introduction" src="doc/codewalk/markov.go:/Generating/,/line\./">
	This codewalk describes a program that generates random text using
	a Markov chain algorithm. The package comment describes the algorithm
	and the operation of the program. Please read it before continuing.
</step>

<step title="Modeling Markov chains" src="doc/codewalk/markov.go:/ chain/">
	A chain consists of a prefix and a suffix. Each prefix is a set
	number of words, while a suffix is a single word.
	A prefix can have an arbitrary number of suffixes.
	To model this data, we use a <code>map[string][]string</code>.
	Each map key is a prefix (a <code>string</code>) and its values are
	lists of suffixes (a slice of strings, <code>[]string</code>).
	<br/><br/>
	Here is the example table from the package comment
	as modeled by this data structure:
	<pre>
map[string][]string{
	" ":          {"I"},
	" I":         {"am"},
	"I am":       {"a", "not"},
	"a free":     {"man!"},
	"am a":       {"free"},
	"am not":     {"a"},
	"a number!":  {"I"},
	"number! I":  {"am"},
	"not a":      {"number!"},
}</pre>
	While each prefix consists of multiple words, we
	store prefixes in the map as a single <code>string</code>.
	It would seem more natural to store the prefix as a
	<code>[]string</code>, but we can't do this with a map because the
	key type of a map must implement equality (and slices do not).
	<br/><br/>
	Therefore, in most of our code we will model prefixes as a
	<code>[]string</code> and join the strings together with a space
	to generate the map key:
	<pre>
Prefix               Map key

[]string{"", ""}     " "
[]string{"", "I"}    " I"
[]string{"I", "am"}  "I am"
</pre>
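	<br/><br/>
	As a minimal sketch of the mapping above, the following program joins
	prefixes into keys and stores a few entries. The helper name
	<code>key</code> is hypothetical and is not part of
	<code>markov.go</code>.

```go
package main

import (
	"fmt"
	"strings"
)

// key joins the words of a prefix with single spaces to form a map key.
// The name "key" is hypothetical and used only for this illustration.
func key(prefix []string) string {
	return strings.Join(prefix, " ")
}

func main() {
	table := map[string][]string{}

	// A []string cannot be a map key, so each prefix is joined into a string first.
	table[key([]string{"", ""})] = []string{"I"}
	table[key([]string{"", "I"})] = []string{"am"}
	table[key([]string{"I", "am"})] = []string{"a", "not"}

	fmt.Printf("%q\n", key([]string{"", "I"})) // prints " I"
	fmt.Println(table["I am"])                 // prints [a not]
}
```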
</step>

<step title="The Chain struct" src="doc/codewalk/markov.go:/type Chain/,/}/">
	The complete state of the chain table consists of the table itself and
	the word length of the prefixes. The <code>Chain</code> struct stores
	this data.
</step>

<step title="The NewChain constructor function" src="doc/codewalk/markov.go:/func New/,/\n}/">
	The <code>Chain</code> struct has two unexported fields (those that
	do not begin with an uppercase letter), and so we write a
	<code>NewChain</code> constructor function that initializes the
	<code>chain</code> map with <code>make</code> and sets the
	<code>prefixLen</code> field.
	<br/><br/>
	This constructor function is not strictly necessary, as this entire
	program is within a single package (<code>main</code>) and therefore
	there is little practical difference between exported and unexported
	fields. We could just as easily write out the contents of this function
	whenever we want to construct a new <code>Chain</code>.
	But using these unexported fields is good practice; it clearly denotes
	that only methods of <code>Chain</code> and its constructor function
	should access those fields. Also, structuring <code>Chain</code> like
	this means we could easily move it into its own package at some later date.
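	<br/><br/>
	The struct and constructor described above might look roughly like the
	following. This is a sketch reconstructed from the description (the field
	names <code>chain</code> and <code>prefixLen</code> come from the text),
	not a verbatim copy of <code>markov.go</code>.

```go
package main

import "fmt"

// Chain holds the prefix-to-suffixes table and the prefix length in words.
type Chain struct {
	chain     map[string][]string
	prefixLen int
}

// NewChain returns a Chain with an initialized map and the given prefix length.
func NewChain(prefixLen int) *Chain {
	return &Chain{chain: make(map[string][]string), prefixLen: prefixLen}
}

func main() {
	c := NewChain(2)
	// The map was created with make, so it is ready for writes;
	// assigning into a nil map would panic.
	c.chain["I am"] = append(c.chain["I am"], "a")
	fmt.Println(c.prefixLen, len(c.chain)) // prints 2 1
}
```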
</step>

<step title="The Prefix type" src="doc/codewalk/markov.go:/type Prefix/">
	Since we'll be working with prefixes often, we define a
	<code>Prefix</code> type with the concrete type <code>[]string</code>.
	Defining a named type lets us be explicit about when we are
	working with a prefix rather than just any <code>[]string</code>.
	Also, in Go we can define methods on any named type (not just structs),
	so we can add methods that operate on <code>Prefix</code> if we need to.
</step>

<step title="The String method" src="doc/codewalk/markov.go:/func[^\n]+String/,/}/">
	The first method we define on <code>Prefix</code> is
	<code>String</code>. It returns a <code>string</code> representation
	of a <code>Prefix</code> by joining the slice elements together with
	spaces. We will use this method to generate keys when working with
	the chain map.
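	<br/><br/>
	A small sketch of the named type and its <code>String</code> method,
	written from the description above. It also shows a side effect worth
	knowing: because the method has the signature <code>String() string</code>,
	the <code>fmt</code> package uses it when printing a <code>Prefix</code>.

```go
package main

import (
	"fmt"
	"strings"
)

// Prefix is a sequence of words used as a key into the chain table.
type Prefix []string

// String joins the words of the prefix with single spaces.
func (p Prefix) String() string {
	return strings.Join(p, " ")
}

func main() {
	p := Prefix{"I", "am"}
	fmt.Println(p.String()) // prints I am
	fmt.Println(p)          // fmt calls String for us; also prints I am
}
```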
</step>

<step title="Building the chain" src="doc/codewalk/markov.go:/func[^\n]+Build/,/\n}/">
	The <code>Build</code> method reads text from an <code>io.Reader</code>
	and parses it into prefixes and suffixes that are stored in the
	<code>Chain</code>.
	<br/><br/>
	The <code><a href="/pkg/io/#Reader">io.Reader</a></code> is an
	interface type that is widely used by the standard library and
	other Go code. Our code uses the
	<code><a href="/pkg/fmt/#Fscan">fmt.Fscan</a></code> function, which
	reads space-separated values from an <code>io.Reader</code>.
	<br/><br/>
	The <code>Build</code> method returns once the <code>Reader</code>'s
	<code>Read</code> method returns <code>io.EOF</code> (end of file)
	or some other read error occurs.
</step>

<step title="Buffering the input" src="doc/codewalk/markov.go:/bufio\.NewReader/">
	This function does many small reads, which can be inefficient for some
	<code>Readers</code>. For efficiency we wrap the provided
	<code>io.Reader</code> with
	<code><a href="/pkg/bufio/">bufio.NewReader</a></code> to create a
	new <code>io.Reader</code> that provides buffering.
</step>

<step title="The Prefix variable" src="doc/codewalk/markov.go:/make\(Prefix/">
	At the top of the function we make a <code>Prefix</code> slice
	<code>p</code> using the <code>Chain</code>'s <code>prefixLen</code>
	field as its length.
	We'll use this variable to hold the current prefix and mutate it with
	each new word we encounter.
</step>

<step title="Scanning words" src="doc/codewalk/markov.go:/var s string/,/\n }/">
	In our loop we read words from the <code>Reader</code> into a
	<code>string</code> variable <code>s</code> using
	<code>fmt.Fscan</code>. Since <code>Fscan</code> uses whitespace
	(including newlines) to separate input values, each call will yield
	just one word (including punctuation), which is exactly what we need.
	<br/><br/>
	<code>Fscan</code> returns an error if it encounters a read error
	(<code>io.EOF</code>, for example) or if it can't scan the requested
	value (in our case, a single string). In either case we just want to
	stop scanning, so we <code>break</code> out of the loop.
</step>

<step title="Adding a prefix and suffix to the chain" src="doc/codewalk/markov.go:/ key/,/key\], s\)">
	The word stored in <code>s</code> is a new suffix. We add the new
	prefix/suffix combination to the <code>chain</code> map by computing
	the map key with <code>p.String</code> and appending the suffix
	to the slice stored under that key.
	<br/><br/>
	The built-in <code>append</code> function appends elements to a slice
	and allocates new storage when necessary. When the provided slice is
	<code>nil</code>, <code>append</code> allocates a new slice.
	This behavior conveniently ties in with the semantics of our map:
	retrieving an unset key returns the zero value of the value type and
	the zero value of <code>[]string</code> is <code>nil</code>.
	When our program encounters a new prefix (yielding a <code>nil</code>
	value in the map) <code>append</code> will allocate a new slice.
	<br/><br/>
	For more information about the <code>append</code> function and slices
	in general see the
	<a href="/doc/articles/slices_usage_and_internals.html">Slices: usage and internals</a> article.
</step>

<step title="Pushing the suffix onto the prefix" src="doc/codewalk/markov.go:/p\.Shift/">
	Before reading the next word our algorithm requires us to drop the
	first word from the prefix and push the current suffix onto the prefix.
	<br/><br/>
	When in this state
	<pre>
p == Prefix{"I", "am"}
s == "not"</pre>
	the new value for <code>p</code> would be
	<pre>
p == Prefix{"am", "not"}</pre>
	This operation is also required during text generation so we put
	the code to perform this mutation of the slice inside a method on
	<code>Prefix</code> named <code>Shift</code>.
</step>

<step title="The Shift method" src="doc/codewalk/markov.go:/func[^\n]+Shift/,/\n}/">
	The <code>Shift</code> method uses the built-in <code>copy</code>
	function to copy the last <code>len(p)-1</code> elements of
	<code>p</code> to the start of the slice, effectively moving the elements
	one index to the left (if you consider zero as the leftmost index).
	<pre>
p := Prefix{"I", "am"}
copy(p, p[1:])
// p == Prefix{"am", "am"}</pre>
	We then assign the provided <code>word</code> to the last index
	of the slice:
	<pre>
// word == "not"
p[len(p)-1] = word
// p == Prefix{"am", "not"}</pre>
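	<br/><br/>
	Putting the two statements together gives a runnable sketch of
	<code>Shift</code>. Note that the method has a value receiver yet still
	changes the caller's prefix, because a slice value refers to a shared
	underlying array.

```go
package main

import "fmt"

// Prefix is a sequence of words.
type Prefix []string

// Shift drops the first word of p and places word in the last position.
// The length of p is unchanged.
func (p Prefix) Shift(word string) {
	copy(p, p[1:])     // move elements one index to the left
	p[len(p)-1] = word // overwrite the now-duplicated last element
}

func main() {
	p := Prefix{"I", "am"}
	p.Shift("not")
	fmt.Println(p) // prints [am not]
}
```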
</step>

<step title="Generating text" src="doc/codewalk/markov.go:/func[^\n]+Generate/,/\n}/">
	The <code>Generate</code> method is similar to <code>Build</code>
	except that instead of reading words from a <code>Reader</code>
	and storing them in a map, it reads words from the map and
	appends them to a slice (<code>words</code>).
	<br/><br/>
	<code>Generate</code> uses a conditional for loop to generate
	up to <code>n</code> words.
</step>

<step title="Getting potential suffixes" src="doc/codewalk/markov.go:/choices/,/}\n/">
	At each iteration of the loop we retrieve a list of potential suffixes
	for the current prefix. We access the <code>chain</code> map at key
	<code>p.String()</code> and assign its contents to <code>choices</code>.
	<br/><br/>
	If <code>len(choices)</code> is zero we break out of the loop as there
	are no potential suffixes for that prefix.
	This test also works if the key isn't present in the map at all:
	in that case, <code>choices</code> will be <code>nil</code> and the
	length of a <code>nil</code> slice is zero.
</step>

<step title="Choosing a suffix at random" src="doc/codewalk/markov.go:/next := choices/,/Shift/">
	To choose a suffix we use the
	<code><a href="/pkg/math/rand/#Intn">rand.Intn</a></code> function.
	It returns a non-negative pseudo-random integer up to (but not
	including) the provided value. Passing in <code>len(choices)</code>
	gives us a random index into the full length of the list.
	<br/><br/>
	We use that index to pick our new suffix, assign it to
	<code>next</code> and append it to the <code>words</code> slice.
	<br/><br/>
	Next, we <code>Shift</code> the new suffix onto the prefix just as
	we did in the <code>Build</code> method.
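	<br/><br/>
	A sketch of the selection step on its own. The helper <code>pick</code>
	is hypothetical. It must only be called with a non-empty slice, since
	<code>rand.Intn</code> panics when its argument is not positive; in
	<code>Generate</code> the preceding length check guarantees this.

```go
package main

import (
	"fmt"
	"math/rand"
)

// pick is a hypothetical helper that returns a pseudo-random element
// of choices. choices must be non-empty.
func pick(choices []string) string {
	return choices[rand.Intn(len(choices))]
}

func main() {
	choices := []string{"a", "not"}
	next := pick(choices)
	fmt.Println(next) // prints either a or not
}
```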
</step>

<step title="Returning the generated text" src="doc/codewalk/markov.go:/Join\(words/">
	Before returning the generated text as a string, we use the
	<code>strings.Join</code> function to join the elements of
	the <code>words</code> slice together, separated by spaces.
</step>

<step title="Command-line flags" src="doc/codewalk/markov.go:/Register command-line flags/,/prefixLen/">
	To make it easy to tweak the prefix and generated text lengths we
	use the <code><a href="/pkg/flag/">flag</a></code> package to parse
	command-line flags.
	<br/><br/>
	These calls to <code>flag.Int</code> register new flags with the
	<code>flag</code> package. The arguments to <code>Int</code> are the
	flag name, its default value, and a description. The <code>Int</code>
	function returns a pointer to an integer that will contain the
	user-supplied value (or the default value if the flag was omitted on
	the command line).
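	<br/><br/>
	To experiment with flag registration without touching the real command
	line, one option is a private <code>flag.FlagSet</code>, whose
	<code>Int</code> method takes the same arguments as <code>flag.Int</code>.
	In this sketch the helper <code>parseArgs</code>, the default values
	(100 and 2) and the description strings are assumptions made for
	illustration; only the flag names <code>words</code> and
	<code>prefix</code> come from this codewalk.

```go
package main

import (
	"flag"
	"fmt"
)

// parseArgs is a hypothetical helper that parses args with its own FlagSet.
// The defaults and descriptions below are assumed values for illustration.
func parseArgs(args []string) (numWords, prefixLen int, err error) {
	fs := flag.NewFlagSet("markov", flag.ContinueOnError)
	words := fs.Int("words", 100, "maximum number of words to print")
	prefix := fs.Int("prefix", 2, "prefix length in words")
	if err := fs.Parse(args); err != nil {
		return 0, 0, err
	}
	// Int returned pointers; dereference them only after Parse has run.
	return *words, *prefix, nil
}

func main() {
	n, p, err := parseArgs([]string{"-prefix=1"})
	fmt.Println(n, p, err) // prints 100 1 <nil>
}
```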
</step>

<step title="Program set up" src="doc/codewalk/markov.go:/flag.Parse/,/rand.Seed/">
	The <code>main</code> function begins by parsing the command-line
	flags with <code>flag.Parse</code> and seeding the <code>rand</code>
	package's random number generator with the current time.
	<br/><br/>
	If the command-line flags provided by the user are invalid the
	<code>flag.Parse</code> function will print an informative usage
	message and terminate the program.
</step>

<step title="Creating and building a new Chain" src="doc/codewalk/markov.go:/c := NewChain/,/c\.Build/">
	To create the new <code>Chain</code> we call <code>NewChain</code>
	with the value of the <code>prefix</code> flag.
	<br/><br/>
	To build the chain we call <code>Build</code> with
	<code>os.Stdin</code> (which implements <code>io.Reader</code>) so
	that it will read its input from standard input.
</step>

<step title="Generating and printing text" src="doc/codewalk/markov.go:/c\.Generate/,/fmt.Println/">
	Finally, to generate text we call <code>Generate</code> with
	the value of the <code>words</code> flag and assign the result
	to the variable <code>text</code>.
	<br/><br/>
	Then we call <code>fmt.Println</code> to write the text to standard
	output, followed by a newline.
</step>

<step title="Using this program" src="doc/codewalk/markov.go">
	To use this program, first build it with the
	<a href="/cmd/go/">go</a> command:
	<pre>
$ go build markov.go</pre>
	And then execute it while piping in some input text:
	<pre>
$ echo "a man a plan a canal panama" \
	| ./markov -prefix=1
a plan a man a plan a canal panama</pre>
	Here's a transcript of generating some text using the Go distribution's
	README file as source material:
	<pre>
$ ./markov -words=10 &lt; $GOROOT/README
This is the source code repository for the Go source
$ ./markov -prefix=1 -words=10 &lt; $GOROOT/README
This is the go directory (the one containing this README).
$ ./markov -prefix=1 -words=10 &lt; $GOROOT/README
This is the variable if you have just untarred a</pre>
</step>

<step title="An exercise for the reader" src="doc/codewalk/markov.go">
	The <code>Generate</code> method does a lot of allocations when it
	builds the <code>words</code> slice. As an exercise, modify it to
	take an <code>io.Writer</code> to which it incrementally writes the
	generated text with <code>fmt.Fprint</code>.
	Aside from being more efficient, this makes <code>Generate</code>
	more symmetrical to <code>Build</code>.
</step>

</codewalk>