_content/talks/2011/lex.slide - website - Git at Google

 Lexical Scanning in Go
 GTUG Sydney
 30 Aug 2011

 Rob Pike
 r@golang.org


 * Video

 A video of this talk was recorded at the Go Sydney Meetup.

 .link https://www.youtube.com/watch?v=HxaD_trXwRE Watch the talk on YouTube


 * Structural mismatch

 Many programming problems realign one data structure to fit another structure.

 - breaking text into lines
 - "blocking" and "deblocking"
 - packet assembly and disassembly
 - parsing
 - lexing

 * Sometimes hard

 The pieces on either side have independent state, lookahead, buffers, ...
 Can be messy to do well.

 Coroutines were invented to solve this problem!
 They enable us to write the two pieces independently.

 Let's look at this topic in the context of a lexer.


 * A new template system

 Wanted to replace the old Go template package.
 It had many problems:

 - inflexible
 - inexpressive
 - code too fragile

 * A new template system

 Key change was re-engineering with a true lexer, parser, and evaluator.
 Has arbitrary text plus actions in `{{` `}}`.

 .code lex/snippets /Evaluation/,/Control.structures/

 * Today we focus on the lexer

 Must tokenize:

 - the stuff outside actions
 - action delimiters
 - identifiers
 - numeric constants
 - string constants
 - and others

 * Lex items

 Two things identify each lexed item:

 - its type
 - its value; a string is all we need

 .code lex/lex1.oldgo /item.represents/,/^}/

 * Lex type

 The type is just an integer constant.
 We use `iota` to define the values.

 .code lex/lex1.oldgo /itemType.identifies/,/type/
 .code lex/lex1.oldgo /const/,/itemEOF/

 * Lex type values (continued)

 .code lex/lex1.oldgo /itemElse/,/^\)/

 * Printing a lex item

 `Printf` has a convention making it easy to print any type: just define a `String()` method:

 .code lex/lex1.oldgo /func.*item.*String/,/^}/

 * How to tokenize?

 Many approaches available:

 - use a tool such as lex or ragel
 - use regular expressions
 - use states, actions, and a switch statement

 * Tools

 Nothing wrong with using a tool but:

 - hard to get good errors (can be very important)
 - tend to require learning another language
 - result can be large, even slow
 - often a poor fit
 - but lexing is easy to do yourself!

 * Regular expressions

 Blogged about this last week.

 - overkill
 - slow
 - can explore the state space too much
 - misuse of a dynamic engine to ask static questions

 * Let's write our own

 It's easy!

 Plus, most programming languages lex pretty much the same tokens, so once we learn how it's trivial to adapt the lexer for the next purpose.

 - an argument both for and against tools

 * State machine

 Many people will tell you to write a switch statement,
 something like this:

 .code lex/snippets /One/,/^}/

 * State machines are forgetful

 Boring and repetitive and error-prone, but anyway:

 Why switch?

 After each action, you know where you want to be;
 the new state is the result of the action.

 But we throw the info away and recompute it from the state.

 (A consequence of returning to the caller.)

 A tool can compile that out, but so can we.

 * What is a state? An action?

 State represents where we are and what we expect.

 Action represents what we do.

 Actions result in a new state.

 * State function

 Let's put them together: a state function.

 Executes an action, returns the next state—as a state function.

 Recursive definition but simple and clear.

 .code lex/lex1.oldgo /stateFn/,/type/

 * The run loop

 Our state machine is trivial:
 just run until the state goes to `nil`, representing "done".

 .code lex/snippets /run.lexes/,/^}/

 * The concurrent step

 How do we make tokens available to the client?
 Tokens can emerge at times that are inconvenient to stop to return to the caller.

 Use concurrency:
 Run the state machine as a goroutine,
 emit values on a channel.

 * The lexer type

 Here is the `lexer` type. Notice the channel of items; ignore the rest for now.

 .code lex/lex1.oldgo /lexer.holds/,/^}/

 * Starting the lexer

 A `lexer` initializes itself to lex a string and launches the state machine as a goroutine, returning the lexer itself and a channel of items.

 The API will change, don't worry about it now.

 .code lex/lex1.oldgo /func.lex/,/^}/

 * The real run routine

 Here's the real state machine run function, which runs as a goroutine.

 .code lex/lex1.oldgo /run.lexes/,/^}/

 * The token emitter

 A token is a type and a value, but (yay Go) the value can just be sliced from the input string.
 The `lexer` remembers where it is in the input and the emit routine just lobs that substring to the caller as the token's value.

 .code lex/lex1.oldgo /input.*scanned/,/pos.*position/
 .code lex/lex1.oldgo /emit.passes/,/^}/

 * Starting the machine

 As the `lexer` begins it's looking for plain text, so the initial state is the function `lexText`.
 It absorbs plain text until a "left meta" is encountered.

 .code lex/lex1.oldgo /run.lexes/,/^}/
 .code lex/lex1.oldgo /leftMeta/

 * lexText

 .code lex/lex1.oldgo /^func.lexText/,/^}/

 * lexLeftMeta

 A trivial state function.
 When we get here, we know there's a `leftMeta` in the input.

 .code lex/lex1.oldgo /^func.lexLeftMeta/,/^}/

 * lexInsideAction

 .code lex/lex1.oldgo /^func.lexInsideAction/,/itemPipe/

 * More of lexInsideAction

 This will give you the flavor.

 .code lex/lex1.oldgo /case.*"/,/lexRawQuote/
 .code lex/lex1.oldgo /case.*9/,/lexIdentifier/

 * The next function

 .code lex/lex1.oldgo /next.returns.the/,/^}/

 * Some lexing helpers

 .code lex/lex1.oldgo /ignore.skips/,/^}/
 .code lex/lex1.oldgo /backup.steps/,/^}/

 * The peek function

 .code lex/lex1.oldgo /peek.returns.but/,/^}/

 * The accept functions

 .code lex/lex1.oldgo /accept.consumes/,/^}/
 .code lex/lex1.oldgo /acceptRun.consumes/,/^}/

 * Lexing a number, including floating point

 .code lex/lex1.oldgo /^func.lexNumber/,/imaginary/

 * Lexing a number, continued

 This is more accepting than it should be, but not by much. Caller must call `Atof` to validate.

 .code lex/lex1.oldgo /Is.it.imaginary/,/^}/

 * Errors

 Easy to handle: emit the bad token and shut down the machine.

 .code lex/lex1.oldgo /error.returns/,/^}/

 * Summary

 Concurrency makes the lexer easy to design.

 Goroutines allow lexer and caller (parser) each to run at its own rate, as clean sequential code.

 Channels give us a clean way to emit tokens.

 * A problem

 Can't run a goroutine to completion during initialization.
 Forbidden by the language specification.
 (Raises awful issues about order of init, best avoided.)

 That means we can't lex & parse a template during init.

 The goroutine is a problem....

 _(Note:_This_restriction_was_lifted_in_Go_version_1_but_the_discussion_is_still_interesting.)_

 * Design vs. implementation

 …but it's not necessary anyway.

 The work is done by the design; now we just adjust the API.

 We can change the API to hide the channel, provide a function to get the next token, and rewrite the run function.

 It's easy.

 * A new API

 Hide the channel and buffer it slightly, turning it into a ring buffer.

 .code lex/r59-lex.go /lex.creates.a.new/,/^}/

 * A function for the next item

 Traditional lexer API: return next item.
 Includes the modified state machine runner.

 .code lex/r59-lex.go /nextItem.returns/,/^}/

 * That's it

 We now have a traditional API for a lexer with a simple, concurrent implementation under the covers.

 Even though the implementation is no longer truly concurrent, it still has all the advantages of concurrent design.

 We wouldn't have such a clean, efficient design if we hadn't thought about the problem in a concurrent way, without worrying about "restart".

 Model completely removes concerns about "structural mismatch".

 * Concurrency is a design approach

 Concurrency is not about parallelism.

 (Although it can enable parallelism).

 Concurrency is a way to design a program by decomposing it into independently executing pieces.

 The result can be clean, efficient, and very adaptable.

 * Conclusion

 Lexers are fun.

 Concurrency is fun.

 Go is fun.

 * For more information

 Go: [[http://golang.org]]

 New templates: http://golang.org/pkg/exp/template/

 (Next release will move them out of experimental.)
	Lexical Scanning in Go
	GTUG Sydney
	30 Aug 2011

	Rob Pike
	r@golang.org


	* Video

	A video of this talk was recorded at the Go Sydney Meetup.

	.link https://www.youtube.com/watch?v=HxaD_trXwRE Watch the talk on YouTube


	* Structural mismatch

	Many programming problems realign one data structure to fit another structure.

	- breaking text into lines
	- "blocking" and "deblocking"
	- packet assembly and disassembly
	- parsing
	- lexing

	* Sometimes hard

	The pieces on either side have independent state, lookahead, buffers, ...
	Can be messy to do well.

	Coroutines were invented to solve this problem!
	They enable us to write the two pieces independently.

	Let's look at this topic in the context of a lexer.


	* A new template system

	Wanted to replace the old Go template package.
	It had many problems:

	- inflexible
	- inexpressive
	- code too fragile

	* A new template system

	Key change was re-engineering with a true lexer, parser, and evaluator.
	Has arbitrary text plus actions in `{{` `}}`.

	.code lex/snippets /Evaluation/,/Control.structures/

	* Today we focus on the lexer

	Must tokenize:

	- the stuff outside actions
	- action delimiters
	- identifiers
	- numeric constants
	- string constants
	- and others

	* Lex items

	Two things identify each lexed item:

	- its type
	- its value; a string is all we need

	.code lex/lex1.oldgo /item.represents/,/^}/

	* Lex type

	The type is just an integer constant.
	We use `iota` to define the values.

	.code lex/lex1.oldgo /itemType.identifies/,/type/
	.code lex/lex1.oldgo /const/,/itemEOF/

	* Lex type values (continued)

	.code lex/lex1.oldgo /itemElse/,/^\)/

	* Printing a lex item

	`Printf` has a convention making it easy to print any type: just define a `String()` method:

	.code lex/lex1.oldgo /func.item.String/,/^}/

	* How to tokenize?

	Many approaches available:

	- use a tool such as lex or ragel
	- use regular expressions
	- use states, actions, and a switch statement

	* Tools

	Nothing wrong with using a tool but:

	- hard to get good errors (can be very important)
	- tend to require learning another language
	- result can be large, even slow
	- often a poor fit
	- but lexing is easy to do yourself!

	* Regular expressions

	Blogged about this last week.

	- overkill
	- slow
	- can explore the state space too much
	- misuse of a dynamic engine to ask static questions

	* Let's write our own

	It's easy!

	Plus, most programming languages lex pretty much the same tokens, so once we learn how it's trivial to adapt the lexer for the next purpose.

	- an argument both for and against tools

	* State machine

	Many people will tell you to write a switch statement,
	something like this:

	.code lex/snippets /One/,/^}/

	* State machines are forgetful

	Boring and repetitive and error-prone, but anyway:

	Why switch?

	After each action, you know where you want to be;
	the new state is the result of the action.

	But we throw the info away and recompute it from the state.

	(A consequence of returning to the caller.)

	A tool can compile that out, but so can we.

	* What is a state? An action?

	State represents where we are and what we expect.

	Action represents what we do.

	Actions result in a new state.

	* State function

	Let's put them together: a state function.

	Executes an action, returns the next state—as a state function.

	Recursive definition but simple and clear.

	.code lex/lex1.oldgo /stateFn/,/type/

	* The run loop

	Our state machine is trivial:
	just run until the state goes to `nil`, representing "done".

	.code lex/snippets /run.lexes/,/^}/

	* The concurrent step

	How do we make tokens available to the client?
	Tokens can emerge at times that are inconvenient to stop to return to the caller.

	Use concurrency:
	Run the state machine as a goroutine,
	emit values on a channel.

	* The lexer type

	Here is the `lexer` type. Notice the channel of items; ignore the rest for now.

	.code lex/lex1.oldgo /lexer.holds/,/^}/

	* Starting the lexer

	A `lexer` initializes itself to lex a string and launches the state machine as a goroutine, returning the lexer itself and a channel of items.

	The API will change, don't worry about it now.

	.code lex/lex1.oldgo /func.lex/,/^}/

	* The real run routine

	Here's the real state machine run function, which runs as a goroutine.

	.code lex/lex1.oldgo /run.lexes/,/^}/

	* The token emitter

	A token is a type and a value, but (yay Go) the value can just be sliced from the input string.
	The `lexer` remembers where it is in the input and the emit routine just lobs that substring to the caller as the token's value.

	.code lex/lex1.oldgo /input.scanned/,/pos.position/
	.code lex/lex1.oldgo /emit.passes/,/^}/

	* Starting the machine

	As the `lexer` begins it's looking for plain text, so the initial state is the function `lexText`.
	It absorbs plain text until a "left meta" is encountered.

	.code lex/lex1.oldgo /run.lexes/,/^}/
	.code lex/lex1.oldgo /leftMeta/

	* lexText

	.code lex/lex1.oldgo /^func.lexText/,/^}/

	* lexLeftMeta

	A trivial state function.
	When we get here, we know there's a `leftMeta` in the input.

	.code lex/lex1.oldgo /^func.lexLeftMeta/,/^}/

	* lexInsideAction

	.code lex/lex1.oldgo /^func.lexInsideAction/,/itemPipe/

	* More of lexInsideAction

	This will give you the flavor.

	.code lex/lex1.oldgo /case.*"/,/lexRawQuote/
	.code lex/lex1.oldgo /case.*9/,/lexIdentifier/

	* The next function

	.code lex/lex1.oldgo /next.returns.the/,/^}/

	* Some lexing helpers

	.code lex/lex1.oldgo /ignore.skips/,/^}/
	.code lex/lex1.oldgo /backup.steps/,/^}/

	* The peek function

	.code lex/lex1.oldgo /peek.returns.but/,/^}/

	* The accept functions

	.code lex/lex1.oldgo /accept.consumes/,/^}/
	.code lex/lex1.oldgo /acceptRun.consumes/,/^}/

	* Lexing a number, including floating point

	.code lex/lex1.oldgo /^func.lexNumber/,/imaginary/

	* Lexing a number, continued

	This is more accepting than it should be, but not by much. Caller must call `Atof` to validate.

	.code lex/lex1.oldgo /Is.it.imaginary/,/^}/

	* Errors

	Easy to handle: emit the bad token and shut down the machine.

	.code lex/lex1.oldgo /error.returns/,/^}/

	* Summary

	Concurrency makes the lexer easy to design.

	Goroutines allow lexer and caller (parser) each to run at its own rate, as clean sequential code.

	Channels give us a clean way to emit tokens.

	* A problem

	Can't run a goroutine to completion during initialization.
	Forbidden by the language specification.
	(Raises awful issues about order of init, best avoided.)

	That means we can't lex & parse a template during init.

	The goroutine is a problem....

	_(Note:_This_restriction_was_lifted_in_Go_version_1_but_the_discussion_is_still_interesting.)_

	* Design vs. implementation

	…but it's not necessary anyway.

	The work is done by the design; now we just adjust the API.

	We can change the API to hide the channel, provide a function to get the next token, and rewrite the run function.

	It's easy.

	* A new API

	Hide the channel and buffer it slightly, turning it into a ring buffer.

	.code lex/r59-lex.go /lex.creates.a.new/,/^}/

	* A function for the next item

	Traditional lexer API: return next item.
	Includes the modified state machine runner.

	.code lex/r59-lex.go /nextItem.returns/,/^}/

	* That's it

	We now have a traditional API for a lexer with a simple, concurrent implementation under the covers.

	Even though the implementation is no longer truly concurrent, it still has all the advantages of concurrent design.

	We wouldn't have such a clean, efficient design if we hadn't thought about the problem in a concurrent way, without worrying about "restart".

	Model completely removes concerns about "structural mismatch".

	* Concurrency is a design approach

	Concurrency is not about parallelism.

	(Although it can enable parallelism).

	Concurrency is a way to design a program by decomposing it into independently executing pieces.

	The result can be clean, efficient, and very adaptable.

	* Conclusion

	Lexers are fun.

	Concurrency is fun.

	Go is fun.

	* For more information

	Go: [[http://golang.org]]

	New templates: http://golang.org/pkg/exp/template/

	(Next release will move them out of experimental.)