Blame - doc/codewalk/markov.xml - go

blob: 76c448ac32a0f1a3dffba647e44290fbe4b05559

Andrew Gerrand	2a18984	2011-08-17 15:53:17 +1000	[diff] [blame]	1	<!--
				2	Copyright 2011 The Go Authors. All rights reserved.
				3	Use of this source code is governed by a BSD-style
				4	license that can be found in the LICENSE file.
				5	-->
				6
				7	<codewalk title="Generating arbitrary text: a Markov chain algorithm">
				8
				9	<step title="Introduction" src="doc/codewalk/markov.go:/Generating/,/line\./">
				10	This codewalk describes a program that generates random text using
				11	a Markov chain algorithm. The package comment describes the algorithm
				12	and the operation of the program. Please read it before continuing.
				13	</step>
				14
				15	<step title="Modeling Markov chains" src="doc/codewalk/markov.go:/ chain/">
				16	A chain consists of a prefix and a suffix. Each prefix is a set
				17	number of words, while a suffix is a single word.
				18	A prefix can have an arbitrary number of suffixes.
				19	To model this data, we use a <code>map[string][]string</code>.
				20	Each map key is a prefix (a <code>string</code>) and its values are
				21	lists of suffixes (a slice of strings, <code>[]string</code>).
				22	<br/><br/>
				23	Here is the example table from the package comment
				24	as modeled by this data structure:
				25	<pre>
				26	map[string][]string{
				27		" ": {"I"},
				28		" I": {"am"},
				29		"I am": {"a", "not"},
				30		"a free": {"man!"},
				31		"am a": {"free"},
				32		"am not": {"a"},
				33		"a number!": {"I"},
				34		"number! I": {"am"},
				35		"not a": {"number!"},
				36	}</pre>
				37	While each prefix consists of multiple words, we
				38	store prefixes in the map as a single <code>string</code>.
				39	It would seem more natural to store the prefix as a
				40	<code>[]string</code>, but we can't do this with a map because the
				41	key type of a map must implement equality (and slices do not).
				42	<br/><br/>
				43	Therefore, in most of our code we will model prefixes as a
				44	<code>[]string</code> and join the strings together with a space
				45	to generate the map key:
				46	<pre>
				47	Prefix               Map key
				48
				49	[]string{"", ""}     " "
				50	[]string{"", "I"}    " I"
				51	[]string{"I", "am"}  "I am"
				52	</pre>
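The mapping from prefix to map key can be sketched as a small program. This is only an illustration: the helper name <code>mapKey</code> is hypothetical and does not appear in the program this codewalk describes.

```go
package main

import (
	"fmt"
	"strings"
)

// mapKey joins the words of a prefix with single spaces to form a map key.
// The name mapKey is hypothetical; it only illustrates the idea.
func mapKey(prefix []string) string {
	return strings.Join(prefix, " ")
}

func main() {
	table := make(map[string][]string)
	prefix := []string{"I", "am"}
	key := mapKey(prefix)
	table[key] = append(table[key], "a", "not")
	fmt.Printf("%q -> %q\n", key, table[key])
}
```

A string key works here because strings are comparable, whereas a <code>[]string</code> key would be rejected by the compiler.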
				53	</step>
				54
				55	<step title="The Chain struct" src="doc/codewalk/markov.go:/type Chain/,/}/">
				56	The complete state of the chain table consists of the table itself and
				57	the word length of the prefixes. The <code>Chain</code> struct stores
				58	this data.
				59	</step>
				60
Oling Cat	c5ebeff	2012-10-18 08:12:44 +1100	[diff] [blame]	61	<step title="The NewChain constructor function" src="doc/codewalk/markov.go:/func New/,/\n}/">
Andrew Gerrand	2a18984	2011-08-17 15:53:17 +1000	[diff] [blame]	62	The <code>Chain</code> struct has two unexported fields (those that
				63	do not begin with an upper case character), and so we write a
				64	<code>NewChain</code> constructor function that initializes the
				65	<code>chain</code> map with <code>make</code> and sets the
				66	<code>prefixLen</code> field.
				67	<br/><br/>
				68	This constructor function is not strictly necessary as this entire
				69	program is within a single package (<code>main</code>) and therefore
				70	there is little practical difference between exported and unexported
				71	fields. We could just as easily write out the contents of this function
				72	when we want to construct a new Chain.
				73	But using these unexported fields is good practice; it clearly denotes
				74	that only methods of Chain and its constructor function should access
				75	those fields. Also, structuring <code>Chain</code> like this means we
				76	could easily move it into its own package at some later date.
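<br/><br/>
As a minimal sketch of the struct and constructor described above (the field names come from this step; the exact layout is an assumption, not a copy of the source file):

```go
package main

import "fmt"

// Chain holds the prefix-to-suffixes table and the prefix length.
// Both fields are unexported, as described in the walkthrough.
type Chain struct {
	chain     map[string][]string
	prefixLen int
}

// NewChain returns a Chain with an initialized map and the given prefix length.
func NewChain(prefixLen int) *Chain {
	return &Chain{
		chain:     make(map[string][]string),
		prefixLen: prefixLen,
	}
}

func main() {
	c := NewChain(2)
	fmt.Println(len(c.chain), c.prefixLen)
}
```

Initializing the map in the constructor matters: writing to a <code>nil</code> map panics, so a zero-value <code>Chain</code> would not be usable for building.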
				77	</step>
				78
				79	<step title="The Prefix type" src="doc/codewalk/markov.go:/type Prefix/">
				80	Since we'll be working with prefixes often, we define a
				81	<code>Prefix</code> type with the concrete type <code>[]string</code>.
				82	Defining a named type lets us be explicit about when we are
				83	working with a prefix rather than just a <code>[]string</code>.
				84	Also, in Go we can define methods on any named type (not just structs),
				85	so we can add methods that operate on <code>Prefix</code> if we need to.
				86	</step>
				87
				88	<step title="The String method" src="doc/codewalk/markov.go:/func[^\n]+String/,/}/">
				89	The first method we define on <code>Prefix</code> is
				90	<code>String</code>. It returns a <code>string</code> representation
				91	of a <code>Prefix</code> by joining the slice elements together with
				92	spaces. We will use this method to generate keys when working with
				93	the chain map.
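<br/><br/>
A minimal sketch of the named type and its <code>String</code> method, assuming the method simply delegates to <code>strings.Join</code>:

```go
package main

import (
	"fmt"
	"strings"
)

// Prefix is a named slice type, so methods can be attached to it.
type Prefix []string

// String joins the prefix words with single spaces.
func (p Prefix) String() string {
	return strings.Join(p, " ")
}

func main() {
	p := Prefix{"I", "am"}
	fmt.Printf("%q\n", p.String())
	fmt.Printf("%q\n", Prefix{"", ""}.String())
}
```

Because <code>Prefix</code> has a <code>String</code> method, it also satisfies <code>fmt.Stringer</code>, so the <code>fmt</code> printing functions will use this representation by default.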
				94	</step>
				95
				96	<step title="Building the chain" src="doc/codewalk/markov.go:/func[^\n]+Build/,/\n}/">
				97	The <code>Build</code> method reads text from an <code>io.Reader</code>
				98	and parses it into prefixes and suffixes that are stored in the
				99	<code>Chain</code>.
				100	<br/><br/>
				101	The <code><a href="/pkg/io/#Reader">io.Reader</a></code> is an
				102	interface type that is widely used by the standard library and
				103	other Go code. Our code uses the
				104	<code><a href="/pkg/fmt/#Fscan">fmt.Fscan</a></code> function, which
				105	reads space-separated values from an <code>io.Reader</code>.
				106	<br/><br/>
				107	The <code>Build</code> method returns once the <code>Reader</code>'s
Vincent Vanackere	eb1717e	2011-11-03 14:01:30 -0700	[diff] [blame]	108	<code>Read</code> method returns <code>io.EOF</code> (end of file)
Andrew Gerrand	2a18984	2011-08-17 15:53:17 +1000	[diff] [blame]	109	or some other read error occurs.
				110	</step>
				111
				112	<step title="Buffering the input" src="doc/codewalk/markov.go:/bufio\.NewReader/">
				113	This function does many small reads, which can be inefficient for some
				114	<code>Readers</code>. For efficiency we wrap the provided
				115	<code>io.Reader</code> with
				116	<code><a href="/pkg/bufio/">bufio.NewReader</a></code> to create a
				117	new <code>io.Reader</code> that provides buffering.
				118	</step>
				119
				120	<step title="The Prefix variable" src="doc/codewalk/markov.go:/make\(Prefix/">
				121	At the top of the function we make a <code>Prefix</code> slice
				122	<code>p</code> using the <code>Chain</code>'s <code>prefixLen</code>
				123	field as its length.
				124	We'll use this variable to hold the current prefix and mutate it with
				125	each new word we encounter.
				126	</step>
				127
				128	<step title="Scanning words" src="doc/codewalk/markov.go:/var s string/,/\n }/">
				129	In our loop we read words from the <code>Reader</code> into a
				130	<code>string</code> variable <code>s</code> using
				131	<code>fmt.Fscan</code>. Since <code>Fscan</code> uses space to
				132	separate each input value, each call will yield just one word
				133	(including punctuation), which is exactly what we need.
				134	<br/><br/>
				135	<code>Fscan</code> returns an error if it encounters a read error
Vincent Vanackere	eb1717e	2011-11-03 14:01:30 -0700	[diff] [blame]	136	(<code>io.EOF</code>, for example) or if it can't scan the requested
Andrew Gerrand	2a18984	2011-08-17 15:53:17 +1000	[diff] [blame]	137	value (in our case, a single string). In either case we just want to
				138	stop scanning, so we <code>break</code> out of the loop.
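<br/><br/>
The scanning loop can be tried in isolation. In this sketch the helpers <code>scanWords</code> and <code>wordsIn</code> are hypothetical names introduced for illustration; they collect the words into a slice instead of adding them to a chain.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// scanWords reads whitespace-separated words from r until Fscan
// reports an error (io.EOF at end of input, for example).
func scanWords(r io.Reader) []string {
	br := bufio.NewReader(r) // buffer the many small reads
	var words []string
	for {
		var s string
		if _, err := fmt.Fscan(br, &s); err != nil {
			break
		}
		words = append(words, s)
	}
	return words
}

// wordsIn is a convenience wrapper for scanning a string.
func wordsIn(text string) []string {
	return scanWords(strings.NewReader(text))
}

func main() {
	fmt.Printf("%q\n", wordsIn("I am not a number!\nI am a free man!"))
}
```

Note that <code>Fscan</code> treats newlines as space, so line breaks in the input do not affect how words are split.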
				139	</step>
				140
				141	<step title="Adding a prefix and suffix to the chain" src="doc/codewalk/markov.go:/ key/,/key\], s\)">
				142	The word stored in <code>s</code> is a new suffix. We add the new
				143	prefix/suffix combination to the <code>chain</code> map by computing
				144	the map key with <code>p.String</code> and appending the suffix
				145	to the slice stored under that key.
				146	<br/><br/>
				147	The built-in <code>append</code> function appends elements to a slice
				148	and allocates new storage when necessary. When the provided slice is
				149	<code>nil</code>, <code>append</code> allocates a new slice.
				150	This behavior conveniently ties in with the semantics of our map:
				151	retrieving an unset key returns the zero value of the value type and
				152	the zero value of <code>[]string</code> is <code>nil</code>.
				153	When our program encounters a new prefix (yielding a <code>nil</code>
				154	value in the map) <code>append</code> will allocate a new slice.
				155	<br/><br/>
				156	For more information about the <code>append</code> function and slices
				157	in general see the
Shenghou Ma	97b13ac	2012-03-07 08:15:47 +1100	[diff] [blame]	158	<a href="/doc/articles/slices_usage_and_internals.html">Slices: usage and internals</a> article.
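<br/><br/>
The interaction between <code>append</code> and an unset map key can be seen in a few lines. The helper <code>addSuffix</code> below is a hypothetical name used only for this sketch.

```go
package main

import "fmt"

// addSuffix records suffix under key. When key is not yet present,
// table[key] is a nil slice and append allocates a new one.
func addSuffix(table map[string][]string, key, suffix string) {
	table[key] = append(table[key], suffix)
}

func main() {
	table := make(map[string][]string)
	fmt.Println(table["I am"] == nil) // the key is unset, so the value is nil
	addSuffix(table, "I am", "a")
	addSuffix(table, "I am", "not")
	fmt.Printf("%q\n", table["I am"])
}
```

No explicit check for a missing key is needed, which keeps the body of the loop to a single statement.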
Andrew Gerrand	2a18984	2011-08-17 15:53:17 +1000	[diff] [blame]	159	</step>
				160
				161	<step title="Pushing the suffix onto the prefix" src="doc/codewalk/markov.go:/p\.Shift/">
				162	Before reading the next word our algorithm requires us to drop the
				163	first word from the prefix and push the current suffix onto the prefix.
				164	<br/><br/>
				165	When in this state
				166	<pre>
				167	p == Prefix{"I", "am"}
				168	s == "not" </pre>
				169	the new value for <code>p</code> would be
				170	<pre>
				171	p == Prefix{"am", "not"}</pre>
				172	This operation is also required during text generation so we put
				173	the code to perform this mutation of the slice inside a method on
				174	<code>Prefix</code> named <code>Shift</code>.
				175	</step>
				176
				177	<step title="The Shift method" src="doc/codewalk/markov.go:/func[^\n]+Shift/,/\n}/">
				178	The <code>Shift</code> method uses the built-in <code>copy</code>
				179	function to copy the last <code>len(p)-1</code> elements of <code>p</code> to
				180	the start of the slice, effectively moving the elements
				181	one index to the left (if you consider zero as the leftmost index).
				182	<pre>
				183	p := Prefix{"I", "am"}
Rob Pike	b91ae5c	2013-04-01 15:52:15 -0700	[diff] [blame]	184	copy(p, p[1:])
Andrew Gerrand	2a18984	2011-08-17 15:53:17 +1000	[diff] [blame]	185	// p == Prefix{"am", "am"}</pre>
				186	We then assign the provided <code>word</code> (named <code>suffix</code> in this
				187	example) to the last index of the slice:
				188	<pre>
				189	// suffix == "not"
				190	p[len(p)-1] = suffix
				191	// p == Prefix{"am", "not"}</pre>
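Putting the two statements together gives a runnable sketch of the method. It assumes the prefix has at least one element; an empty <code>Prefix</code> would cause an index-out-of-range panic.

```go
package main

import "fmt"

// Prefix is a sliding window of words.
type Prefix []string

// Shift drops the first word and places word in the last position.
// It modifies the slice's elements in place and assumes len(p) >= 1.
func (p Prefix) Shift(word string) {
	copy(p, p[1:])
	p[len(p)-1] = word
}

func main() {
	p := Prefix{"I", "am"}
	for _, w := range []string{"not", "a", "number!"} {
		p.Shift(w)
		fmt.Println(p)
	}
}
```

The method can use a value receiver because a slice value shares its underlying array with the caller, so the element writes are visible outside the method.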
				192	</step>
				193
				194	<step title="Generating text" src="doc/codewalk/markov.go:/func[^\n]+Generate/,/\n}/">
				195	The <code>Generate</code> method is similar to <code>Build</code>
				196	except that instead of reading words from a <code>Reader</code>
				197	and storing them in a map, it reads words from the map and
				198	appends them to a slice (<code>words</code>).
				199	<br/><br/>
				200	<code>Generate</code> uses a conditional for loop to generate
				201	up to <code>n</code> words.
				202	</step>
				203
				204	<step title="Getting potential suffixes" src="doc/codewalk/markov.go:/choices/,/}\n/">
				205	At each iteration of the loop we retrieve a list of potential suffixes
				206	for the current prefix. We access the <code>chain</code> map at key
				207	<code>p.String()</code> and assign its contents to <code>choices</code>.
				208	<br/><br/>
				209	If <code>len(choices)</code> is zero we break out of the loop as there
				210	are no potential suffixes for that prefix.
				211	This test also works if the key isn't present in the map at all:
				212	in that case, <code>choices</code> will be <code>nil</code> and the
				213	length of a <code>nil</code> slice is zero.
				214	</step>
				215
				216	<step title="Choosing a suffix at random" src="doc/codewalk/markov.go:/next := choices/,/Shift/">
				217	To choose a suffix we use the
Shenghou Ma	c24daa2	2012-03-30 15:00:23 +0800	[diff] [blame]	218	<code><a href="/pkg/math/rand/#Intn">rand.Intn</a></code> function.
Andrew Gerrand	2a18984	2011-08-17 15:53:17 +1000	[diff] [blame]	219	It returns a non-negative random integer less than the provided
				220	value. Passing in <code>len(choices)</code> gives us a random valid
				221	index into the list.
				222	<br/><br/>
				223	We use that index to pick our new suffix, assign it to
				224	<code>next</code> and append it to the <code>words</code> slice.
				225	<br/><br/>
				226	Next, we <code>Shift</code> the new suffix onto the prefix just as
				227	we did in the <code>Build</code> method.
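<br/><br/>
The generation loop as a whole can be sketched as a standalone function. The names <code>generate</code> and <code>exampleTable</code> are hypothetical stand-ins for the method and data described in this codewalk, and the sketch assumes a prefix length of at least one.

```go
package main

import (
	"fmt"
	"math/rand"
	"strings"
)

// exampleTable returns the example table from the package comment.
func exampleTable() map[string][]string {
	return map[string][]string{
		" ":         {"I"},
		" I":        {"am"},
		"I am":      {"a", "not"},
		"a free":    {"man!"},
		"am a":      {"free"},
		"am not":    {"a"},
		"a number!": {"I"},
		"number! I": {"am"},
		"not a":     {"number!"},
	}
}

// generate returns up to n words chosen by walking the table.
// It assumes prefixLen >= 1.
func generate(table map[string][]string, prefixLen, n int) string {
	p := make([]string, prefixLen)
	var words []string
	for i := 0; i < n; i++ {
		choices := table[strings.Join(p, " ")]
		if len(choices) == 0 {
			break // no known suffix for this prefix
		}
		next := choices[rand.Intn(len(choices))]
		words = append(words, next)
		// Shift the chosen suffix onto the prefix.
		copy(p, p[1:])
		p[len(p)-1] = next
	}
	return strings.Join(words, " ")
}

func main() {
	// Prints "I am a free man!" or "I am not a number!".
	fmt.Println(generate(exampleTable(), 2, 5))
}
```

The only random decision in this example table occurs at the prefix "I am", which has two suffixes; every other prefix has exactly one.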
				228	</step>
				229
				230	<step title="Returning the generated text" src="doc/codewalk/markov.go:/Join\(words/">
				231	Before returning the generated text as a string, we use the
				232	<code>strings.Join</code> function to join the elements of
				233	the <code>words</code> slice together, separated by spaces.
				234	</step>
				235
				236	<step title="Command-line flags" src="doc/codewalk/markov.go:/Register command-line flags/,/prefixLen/">
				237	To make it easy to tweak the prefix and generated text lengths we
				238	use the <code><a href="/pkg/flag/">flag</a></code> package to parse
				239	command-line flags.
				240	<br/><br/>
				241	These calls to <code>flag.Int</code> register new flags with the
				242	<code>flag</code> package. The arguments to <code>Int</code> are the
				243	flag name, its default value, and a description. The <code>Int</code>
				244	function returns a pointer to an integer that will contain the
				245	user-supplied value (or the default value if the flag was omitted on
				246	the command line).
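<br/><br/>
The sketch below shows the same pattern on a private <code>flag.FlagSet</code> so that it can be exercised with any argument list. The helper <code>parseFlags</code>, the default values, and the description strings are assumptions made for illustration; only the flag names <code>words</code> and <code>prefix</code> come from this codewalk.

```go
package main

import (
	"flag"
	"fmt"
)

// parseFlags registers the two integer flags on a private FlagSet and
// parses args. The defaults and descriptions here are assumed values.
func parseFlags(args []string) (int, int, error) {
	fs := flag.NewFlagSet("markov", flag.ContinueOnError)
	numWords := fs.Int("words", 100, "maximum number of words to print")
	prefixLen := fs.Int("prefix", 2, "prefix length in words")
	if err := fs.Parse(args); err != nil {
		return 0, 0, err
	}
	// Int returned pointers; dereference them after parsing.
	return *numWords, *prefixLen, nil
}

func main() {
	// A real program would pass os.Args[1:] instead of a fixed list.
	words, prefix, err := parseFlags([]string{"-prefix=1", "-words=10"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(words, prefix)
}
```

The pointers returned by <code>Int</code> hold the default values until <code>Parse</code> runs, which is why they are dereferenced only afterwards.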
				247	</step>
				248
				249	<step title="Program set up" src="doc/codewalk/markov.go:/flag.Parse/,/rand.Seed/">
				250	The <code>main</code> function begins by parsing the command-line
				251	flags with <code>flag.Parse</code> and seeding the <code>rand</code>
				252	package's random number generator with the current time.
				253	<br/><br/>
				254	If the command-line flags provided by the user are invalid the
				255	<code>flag.Parse</code> function will print an informative usage
				256	message and terminate the program.
				257	</step>
				258
				259	<step title="Creating and building a new Chain" src="doc/codewalk/markov.go:/c := NewChain/,/c\.Build/">
				260	To create the new <code>Chain</code> we call <code>NewChain</code>
				261	with the value of the <code>prefix</code> flag.
				262	<br/><br/>
				263	To build the chain we call <code>Build</code> with
				264	<code>os.Stdin</code> (which implements <code>io.Reader</code>) so
				265	that it will read its input from standard input.
				266	</step>
				267
				268	<step title="Generating and printing text" src="doc/codewalk/markov.go:/c\.Generate/,/fmt.Println/">
				269	Finally, to generate text we call <code>Generate</code> with
				270	the value of the <code>words</code> flag and assign the result
				271	to the variable <code>text</code>.
				272	<br/><br/>
				273	Then we call <code>fmt.Println</code> to write the text to standard
				274	output, followed by a newline.
				275	</step>
				276
				277	<step title="Using this program" src="doc/codewalk/markov.go">
Andrew Gerrand	2a5879d	2012-03-20 13:50:05 +1100	[diff] [blame]	278	To use this program, first build it with the
				279	<a href="/cmd/go/">go</a> command:
Andrew Gerrand	2a18984	2011-08-17 15:53:17 +1000	[diff] [blame]	280	<pre>
Andrew Gerrand	2a5879d	2012-03-20 13:50:05 +1100	[diff] [blame]	281	$ go build markov.go</pre>
Andrew Gerrand	2a18984	2011-08-17 15:53:17 +1000	[diff] [blame]	282	And then execute it while piping in some input text:
				283	<pre>
Andrew Gerrand	2a5879d	2012-03-20 13:50:05 +1100	[diff] [blame]	284	$ echo "a man a plan a canal panama" \
				285	| ./markov -prefix=1
				286	a plan a man a plan a canal panama</pre>
Andrew Gerrand	2a18984	2011-08-17 15:53:17 +1000	[diff] [blame]	287	Here's a transcript of generating some text using the Go distribution's
				288	README file as source material:
				289	<pre>
Shenghou Ma	c24daa2	2012-03-30 15:00:23 +0800	[diff] [blame]	290	$ ./markov -words=10 < $GOROOT/README
Andrew Gerrand	2a18984	2011-08-17 15:53:17 +1000	[diff] [blame]	291	This is the source code repository for the Go source
Shenghou Ma	c24daa2	2012-03-30 15:00:23 +0800	[diff] [blame]	292	$ ./markov -prefix=1 -words=10 < $GOROOT/README
Andrew Gerrand	2a18984	2011-08-17 15:53:17 +1000	[diff] [blame]	293	This is the go directory (the one containing this README).
Shenghou Ma	c24daa2	2012-03-30 15:00:23 +0800	[diff] [blame]	294	$ ./markov -prefix=1 -words=10 < $GOROOT/README
Andrew Gerrand	2a18984	2011-08-17 15:53:17 +1000	[diff] [blame]	295	This is the variable if you have just untarred a</pre>
				296	</step>
				297
				298	<step title="An exercise for the reader" src="doc/codewalk/markov.go">
				299	The <code>Generate</code> function does a lot of allocations when it
				300	builds the <code>words</code> slice. As an exercise, modify it to
				301	take an <code>io.Writer</code> to which it incrementally writes the
				302	generated text with <code>Fprint</code>.
				303	Aside from being more efficient this makes <code>Generate</code>
				304	more symmetrical to <code>Build</code>.
				305	</step>
				306
				307	</codewalk>