Lexical Scanning in Go
GTUG Sydney
30 Aug 2011
Rob Pike
r@golang.org
* Video
A video of this talk was recorded at the Go Sydney Meetup.
.link https://www.youtube.com/watch?v=HxaD_trXwRE Watch the talk on YouTube
* Structural mismatch
Many programming problems realign one data structure to fit another structure.
- breaking text into lines
- "blocking" and "deblocking"
- packet assembly and disassembly
- parsing
- lexing
* Sometimes hard
The pieces on either side have independent state, lookahead, buffers, ...
Can be messy to do well.
Coroutines were invented to solve this problem!
They enable us to write the two pieces independently.
Let's look at this topic in the context of a lexer.
* A new template system
Wanted to replace the old Go template package.
It had many problems:
- inflexible
- inexpressive
- code too fragile
* A new template system
Key change was re-engineering with a true lexer, parser, and evaluator.
Has arbitrary text plus actions in `{{` `}}`.
.code lex/snippets /Evaluation/,/Control.structures/
* Today we focus on the lexer
Must tokenize:
- the stuff outside actions
- action delimiters
- identifiers
- numeric constants
- string constants
- and others
* Lex items
Two things identify each lexed item:
- its type
- its value; a string is all we need
.code lex/lex1.oldgo /item.represents/,/^}/
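(The `.code` files aren't inlined in this text, so each directive below is followed by a reconstruction sketch; the sketches form one illustrative program, with the package clause and imports of `fmt`, `strings`, `unicode`, and `unicode/utf8` elided.) The item type, roughly:

	// item represents a token returned from the scanner.
	type item struct {
		typ itemType // type, such as itemNumber
		val string   // value, such as "23.2"
	}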
* Lex type
The type is just an integer constant.
We use `iota` to define the values.
.code lex/lex1.oldgo /itemType.identifies/,/type/
.code lex/lex1.oldgo /const/,/itemEOF/
* Lex type values (continued)
.code lex/lex1.oldgo /itemElse/,/^\)/
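A sketch of that enumeration (abbreviated; the real list is longer):

	// itemType identifies the type of lex items.
	type itemType int

	const (
		itemError itemType = iota // error occurred; value is text of error
		itemEOF
		itemLeftMeta  // left action delimiter
		itemRightMeta // right action delimiter
		itemText      // plain text outside actions
		itemNumber    // a numeric constant
		itemPipe      // the pipe symbol '|'
	)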
* Printing a lex item
`Printf` has a convention making it easy to print any type: just define a `String()` method:
.code lex/lex1.oldgo /func.*item.*String/,/^}/
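A sketch of that method, assuming the item type above (long values get truncated when printed):

	func (i item) String() string {
		switch i.typ {
		case itemEOF:
			return "EOF"
		case itemError:
			return i.val
		}
		if len(i.val) > 10 {
			return fmt.Sprintf("%.10q...", i.val) // truncate long values
		}
		return fmt.Sprintf("%q", i.val)
	}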
* How to tokenize?
Many approaches available:
- use a tool such as lex or ragel
- use regular expressions
- use states, actions, and a switch statement
* Tools
Nothing wrong with using a tool but:
- hard to get good errors (can be very important)
- tend to require learning another language
- result can be large, even slow
- often a poor fit
- but lexing is easy to do yourself!
* Regular expressions
Blogged about this last week.
- overkill
- slow
- can explore the state space too much
- misuse of a dynamic engine to ask static questions
* Let's write our own
It's easy!
Plus, most programming languages lex pretty much the same tokens, so once we learn how, it's trivial to adapt the lexer for the next purpose.
- an argument both for and against tools
* State machine
Many people will tell you to write a switch statement,
something like this:
.code lex/snippets /One/,/^}/
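The shape is roughly this sketch (hypothetical state names, lexing details elided):

	type state int

	const (
		stateText state = iota
		stateInsideAction
		stateDone
	)

	func step(s state) state {
		switch s {
		case stateText:
			// ...absorb plain text until "{{", then:
			return stateInsideAction
		case stateInsideAction:
			// ...lex one token inside {{ }}; at "}}":
			return stateText
		}
		return stateDone
	}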
* State machines are forgetful
Boring and repetitive and error-prone, but anyway:
Why switch?
After each action, you know where you want to be;
the new state is the result of the action.
But we throw the info away and recompute it from the state.
(A consequence of returning to the caller.)
A tool can compile that out, but so can we.
* What is a state? An action?
State represents where we are and what we expect.
Action represents what we do.
Actions result in a new state.
* State function
Let's put them together: a state function.
Executes an action, returns the next state—as a state function.
Recursive definition but simple and clear.
.code lex/lex1.oldgo /stateFn/,/type/
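In sketch form (the `lexer` type it takes appears a few slides on):

	// stateFn represents the state of the scanner
	// as a function that returns the next state.
	type stateFn func(*lexer) stateFn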
* The run loop
Our state machine is trivial:
just run until the state goes to `nil`, representing "done".
.code lex/snippets /run.lexes/,/^}/
* The concurrent step
How do we make tokens available to the client?
Tokens can emerge at times when it's inconvenient to stop and return to the caller.
Use concurrency:
Run the state machine as a goroutine,
emit values on a channel.
* The lexer type
Here is the `lexer` type. Notice the channel of items; ignore the rest for now.
.code lex/lex1.oldgo /lexer.holds/,/^}/
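A sketch consistent with that description:

	// lexer holds the state of the scanner.
	type lexer struct {
		name  string    // used only for error reports
		input string    // the string being scanned
		start int       // start position of this item
		pos   int       // current position in the input
		width int       // width of last rune read from input
		items chan item // channel of scanned items
	}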
* Starting the lexer
A `lexer` initializes itself to lex a string and launches the state machine as a goroutine, returning the lexer itself and a channel of items.
The API will change, don't worry about it now.
.code lex/lex1.oldgo /func.lex/,/^}/
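Roughly:

	func lex(name, input string) (*lexer, chan item) {
		l := &lexer{
			name:  name,
			input: input,
			items: make(chan item),
		}
		go l.run() // Concurrently run state machine.
		return l, l.items
	}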
* The real run routine
Here's the real state machine run function, which runs as a goroutine.
.code lex/lex1.oldgo /run.lexes/,/^}/
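In sketch form; closing the channel tells the client no more tokens will be delivered:

	// run lexes the input by executing state functions until
	// the state is nil.
	func (l *lexer) run() {
		for state := lexText; state != nil; {
			state = state(l)
		}
		close(l.items) // No more tokens will be delivered.
	}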
* The token emitter
A token is a type and a value, but (yay Go) the value can just be sliced from the input string.
The `lexer` remembers where it is in the input and the emit routine just lobs that substring to the caller as the token's value.
.code lex/lex1.oldgo /input.*scanned/,/pos.*position/
.code lex/lex1.oldgo /emit.passes/,/^}/
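A sketch; note the value is a slice of the input, not a copy:

	// emit passes an item back to the client.
	func (l *lexer) emit(t itemType) {
		l.items <- item{t, l.input[l.start:l.pos]}
		l.start = l.pos
	}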
* Starting the machine
As the `lexer` begins, it's looking for plain text, so the initial state is the function `lexText`.
It absorbs plain text until a "left meta" is encountered.
.code lex/lex1.oldgo /run.lexes/,/^}/
.code lex/lex1.oldgo /leftMeta/
* lexText
.code lex/lex1.oldgo /^func.lexText/,/^}/
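A sketch of that state function, assuming `leftMeta` is the `{{` constant shown above; `eof` is the sentinel defined with `next`, two slides on:

	func lexText(l *lexer) stateFn {
		for {
			if strings.HasPrefix(l.input[l.pos:], leftMeta) {
				if l.pos > l.start {
					l.emit(itemText)
				}
				return lexLeftMeta // Next state.
			}
			if l.next() == eof {
				break
			}
		}
		// Correctly reached EOF.
		if l.pos > l.start {
			l.emit(itemText)
		}
		l.emit(itemEOF) // Useful to make EOF a token.
		return nil      // Stop the run loop.
	}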
* lexLeftMeta
A trivial state function.
When we get here, we know there's a `leftMeta` in the input.
.code lex/lex1.oldgo /^func.lexLeftMeta/,/^}/
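Roughly:

	func lexLeftMeta(l *lexer) stateFn {
		l.pos += len(leftMeta)
		l.emit(itemLeftMeta)
		return lexInsideAction // Now inside {{ }}.
	}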
* lexInsideAction
.code lex/lex1.oldgo /^func.lexInsideAction/,/itemPipe/
* More of lexInsideAction
This will give you the flavor.
.code lex/lex1.oldgo /case.*"/,/lexRawQuote/
.code lex/lex1.oldgo /case.*9/,/lexIdentifier/
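A compressed sketch, assuming `rightMeta` is the matching `}}` constant; the quote, raw-quote, identifier, and right-delimiter branches all follow the same pattern:

	func lexInsideAction(l *lexer) stateFn {
		// Either number, quoted string, or identifier.
		// Spaces separate and are ignored.
		for {
			if strings.HasPrefix(l.input[l.pos:], rightMeta) {
				return lexRightMeta
			}
			switch r := l.next(); {
			case r == eof || r == '\n':
				return l.errorf("unclosed action")
			case unicode.IsSpace(r):
				l.ignore()
			case r == '"':
				return lexQuote
			case '0' <= r && r <= '9':
				l.backup()
				return lexNumber
			}
		}
	}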
* The next function
.code lex/lex1.oldgo /next.returns.the/,/^}/
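A sketch in modern Go (the 2011 original used `int` where today we'd write `rune`):

	const eof = -1 // sentinel marking end of input

	// next returns the next rune in the input.
	func (l *lexer) next() rune {
		if l.pos >= len(l.input) {
			l.width = 0
			return eof
		}
		r, w := utf8.DecodeRuneInString(l.input[l.pos:])
		l.width = w
		l.pos += l.width
		return r
	}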
* Some lexing helpers
.code lex/lex1.oldgo /ignore.skips/,/^}/
.code lex/lex1.oldgo /backup.steps/,/^}/
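Sketches of both:

	// ignore skips over the pending input before this point.
	func (l *lexer) ignore() {
		l.start = l.pos
	}

	// backup steps back one rune.
	// Can be called only once per call of next.
	func (l *lexer) backup() {
		l.pos -= l.width
	}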
* The peek function
.code lex/lex1.oldgo /peek.returns.but/,/^}/
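Roughly:

	// peek returns but does not consume
	// the next rune in the input.
	func (l *lexer) peek() rune {
		r := l.next()
		l.backup()
		return r
	}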
* The accept functions
.code lex/lex1.oldgo /accept.consumes/,/^}/
.code lex/lex1.oldgo /acceptRun.consumes/,/^}/
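Sketches of both (using `strings.ContainsRune` for the set test; the original may spell it differently):

	// accept consumes the next rune
	// if it's from the valid set.
	func (l *lexer) accept(valid string) bool {
		if strings.ContainsRune(valid, l.next()) {
			return true
		}
		l.backup()
		return false
	}

	// acceptRun consumes a run of runes from the valid set.
	func (l *lexer) acceptRun(valid string) {
		for strings.ContainsRune(valid, l.next()) {
		}
		l.backup()
	}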
* Lexing a number, including floating point
.code lex/lex1.oldgo /^func.lexNumber/,/imaginary/
* Lexing a number, continued
This is more accepting than it should be, but not by much. Caller must call `Atof` to validate.
.code lex/lex1.oldgo /Is.it.imaginary/,/^}/
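The two slides combined, in sketch form:

	func lexNumber(l *lexer) stateFn {
		// Optional leading sign.
		l.accept("+-")
		// Is it hex?
		digits := "0123456789"
		if l.accept("0") && l.accept("xX") {
			digits = "0123456789abcdefABCDEF"
		}
		l.acceptRun(digits)
		if l.accept(".") {
			l.acceptRun(digits)
		}
		if l.accept("eE") {
			l.accept("+-")
			l.acceptRun("0123456789")
		}
		// Is it imaginary?
		l.accept("i")
		// Next thing mustn't be alphanumeric.
		if r := l.peek(); unicode.IsLetter(r) || unicode.IsDigit(r) {
			l.next()
			return l.errorf("bad number syntax: %q",
				l.input[l.start:l.pos])
		}
		l.emit(itemNumber)
		return lexInsideAction
	}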
* Errors
Easy to handle: emit the bad token and shut down the machine.
.code lex/lex1.oldgo /error.returns/,/^}/
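A sketch; returning `nil` as the next state stops the run loop:

	// errorf returns an error token and terminates the scan
	// by passing back a nil pointer as the next state.
	func (l *lexer) errorf(format string, args ...interface{}) stateFn {
		l.items <- item{itemError, fmt.Sprintf(format, args...)}
		return nil
	}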
* Summary
Concurrency makes the lexer easy to design.
Goroutines allow lexer and caller (parser) each to run at its own rate, as clean sequential code.
Channels give us a clean way to emit tokens.
* A problem
Can't run a goroutine to completion during initialization.
Forbidden by the language specification.
(Raises awful issues about order of init, best avoided.)
That means we can't lex & parse a template during init.
The goroutine is a problem....
_(Note:_This_restriction_was_lifted_in_Go_version_1_but_the_discussion_is_still_interesting.)_
* Design vs. implementation
…but it's not necessary anyway.
The work is done by the design; now we just adjust the API.
We can change the API to hide the channel, provide a function to get the next token, and rewrite the run function.
It's easy.
* A new API
Hide the channel and buffer it slightly, turning it into a ring buffer.
.code lex/r59-lex.go /lex.creates.a.new/,/^}/
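A sketch of the revised constructor, assuming the `lexer` gains a `state stateFn` field; note there is no goroutine launch any more:

	// lex creates a new scanner for the input string.
	func lex(name, input string) *lexer {
		l := &lexer{
			name:  name,
			input: input,
			state: lexText,
			items: make(chan item, 2), // Two items sufficient.
		}
		return l
	}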
* A function for the next item
Traditional lexer API: return next item.
Includes the modified state machine runner.
.code lex/r59-lex.go /nextItem.returns/,/^}/
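A sketch: run the machine only until a token is waiting in the buffer, then hand it over:

	// nextItem returns the next item from the input.
	func (l *lexer) nextItem() item {
		for {
			select {
			case item := <-l.items:
				return item
			default:
				l.state = l.state(l)
			}
		}
	}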
* That's it
We now have a traditional API for a lexer with a simple, concurrent implementation under the covers.
Even though the implementation is no longer truly concurrent, it still has all the advantages of concurrent design.
We wouldn't have such a clean, efficient design if we hadn't thought about the problem in a concurrent way, without worrying about "restart".
Model completely removes concerns about "structural mismatch".
* Concurrency is a design approach
Concurrency is not about parallelism.
(Although it can enable parallelism).
Concurrency is a way to design a program by decomposing it into independently executing pieces.
The result can be clean, efficient, and very adaptable.
* Conclusion
Lexers are fun.
Concurrency is fun.
Go is fun.
* For more information
Go: [[http://golang.org]]
New templates: [[http://golang.org/pkg/exp/template/]]
(Next release will move them out of experimental.)