| Lexical Scanning in Go |
| GTUG Sydney |
| 30 Aug 2011 |
| |
| Rob Pike |
| r@golang.org |
| |
| |
| * Video |
| |
| A video of this talk was recorded at the Go Sydney Meetup. |
| |
| .link https://www.youtube.com/watch?v=HxaD_trXwRE Watch the talk on YouTube |
| |
| |
| * Structural mismatch |
| |
| Many programming problems realign one data structure to fit another structure. |
| |
| - breaking text into lines |
| - "blocking" and "deblocking" |
| - packet assembly and disassembly |
| - parsing |
| - lexing |
| |
| * Sometimes hard |
| |
| The pieces on either side have independent state, lookahead, buffers, ... |
| Can be messy to do well. |
| |
| Coroutines were invented to solve this problem! |
| They enable us to write the two pieces independently. |
| |
| Let's look at this topic in the context of a lexer. |
| |
| |
| * A new template system |
| |
| Wanted to replace the old Go template package. |
| It had many problems: |
| |
| - inflexible |
| - inexpressive |
| - code too fragile |
| |
| * A new template system |
| |
| Key change was re-engineering with a true lexer, parser, and evaluator. |
| Has arbitrary text plus actions in `{{` `}}`. |
| |
| .code lex/snippets /Evaluation/,/Control.structures/ |
| |
| * Today we focus on the lexer |
| |
| Must tokenize: |
| |
| - the stuff outside actions |
| - action delimiters |
| - identifiers |
| - numeric constants |
| - string constants |
| - and others |
| |
| * Lex items |
| |
| Two things identify each lexed item: |
| |
| - its type |
| - its value; a string is all we need |
| |
| .code lex/lex1.oldgo /item.represents/,/^}/ |
| |
| * Lex type |
| |
| The type is just an integer constant. |
| We use `iota` to define the values. |
| |
| .code lex/lex1.oldgo /itemType.identifies/,/type/ |
| .code lex/lex1.oldgo /const/,/itemEOF/ |
| |
| * Lex type values (continued) |
| |
| .code lex/lex1.oldgo /itemElse/,/^\)/ |
| |
| * Printing a lex item |
| |
| `Printf` has a convention making it easy to print any type: just define a `String()` method: |
| |
| .code lex/lex1.oldgo /func.*item.*String/,/^}/ |
| |
| * How to tokenize? |
| |
| Many approaches available: |
| |
| - use a tool such as lex or ragel |
| - use regular expressions |
| - use states, actions, and a switch statement |
| |
| * Tools |
| |
| Nothing wrong with using a tool but: |
| |
| - hard to get good errors (can be very important) |
| - tend to require learning another language |
| - result can be large, even slow |
| - often a poor fit |
| - but lexing is easy to do yourself! |
| |
| * Regular expressions |
| |
| Blogged about this last week. |
| |
| - overkill |
| - slow |
| - can explore the state space too much |
| - misuse of a dynamic engine to ask static questions |
| |
| * Let's write our own |
| |
| It's easy! |
| |
| Plus, most programming languages lex pretty much the same tokens, so once we learn how it's trivial to adapt the lexer for the next purpose. |
| |
| - an argument both for and against tools |
| |
| * State machine |
| |
| Many people will tell you to write a switch statement, |
| something like this: |
| |
| .code lex/snippets /One/,/^}/ |
| |
| * State machines are forgetful |
| |
| Boring and repetitive and error-prone, but anyway: |
| |
| Why switch? |
| |
| After each action, you know where you want to be; |
| the new state is the result of the action. |
| |
| But we throw the info away and recompute it from the state. |
| |
| (A consequence of returning to the caller.) |
| |
| A tool can compile that out, but so can we. |
| |
| * What is a state? An action? |
| |
| State represents where we are and what we expect. |
| |
| Action represents what we do. |
| |
| Actions result in a new state. |
| |
| * State function |
| |
| Let's put them together: a state function. |
| |
| Executes an action, returns the next state—as a state function. |
| |
| Recursive definition but simple and clear. |
| |
| .code lex/lex1.oldgo /stateFn/,/type/ |
| |
| * The run loop |
| |
| Our state machine is trivial: |
| just run until the state goes to `nil`, representing "done". |
| |
| .code lex/snippets /run.lexes/,/^}/ |
| |
| * The concurrent step |
| |
| How do we make tokens available to the client? |
| Tokens can emerge at times that are inconvenient to stop to return to the caller. |
| |
| Use concurrency: |
| Run the state machine as a goroutine, |
| emit values on a channel. |
| |
| * The lexer type |
| |
| Here is the `lexer` type. Notice the channel of items; ignore the rest for now. |
| |
| .code lex/lex1.oldgo /lexer.holds/,/^}/ |
| |
| * Starting the lexer |
| |
| A `lexer` initializes itself to lex a string and launches the state machine as a goroutine, returning the lexer itself and a channel of items. |
| |
| The API will change, don't worry about it now. |
| |
| .code lex/lex1.oldgo /func.lex/,/^}/ |
| |
| * The real run routine |
| |
| Here's the real state machine run function, which runs as a goroutine. |
| |
| .code lex/lex1.oldgo /run.lexes/,/^}/ |
| |
| * The token emitter |
| |
| A token is a type and a value, but (yay Go) the value can just be sliced from the input string. |
| The `lexer` remembers where it is in the input and the emit routine just lobs that substring to the caller as the token's value. |
| |
| .code lex/lex1.oldgo /input.*scanned/,/pos.*position/ |
| .code lex/lex1.oldgo /emit.passes/,/^}/ |
| |
| * Starting the machine |
| |
| As the `lexer` begins it's looking for plain text, so the initial state is the function `lexText`. |
| It absorbs plain text until a "left meta" is encountered. |
| |
| .code lex/lex1.oldgo /run.lexes/,/^}/ |
| .code lex/lex1.oldgo /leftMeta/ |
| |
| * lexText |
| |
| .code lex/lex1.oldgo /^func.lexText/,/^}/ |
| |
| * lexLeftMeta |
| |
| A trivial state function. |
| When we get here, we know there's a `leftMeta` in the input. |
| |
| .code lex/lex1.oldgo /^func.lexLeftMeta/,/^}/ |
| |
| * lexInsideAction |
| |
| .code lex/lex1.oldgo /^func.lexInsideAction/,/itemPipe/ |
| |
| * More of lexInsideAction |
| |
| This will give you the flavor. |
| |
| .code lex/lex1.oldgo /case.*"/,/lexRawQuote/ |
| .code lex/lex1.oldgo /case.*9/,/lexIdentifier/ |
| |
| * The next function |
| |
| .code lex/lex1.oldgo /next.returns.the/,/^}/ |
| |
| * Some lexing helpers |
| |
| .code lex/lex1.oldgo /ignore.skips/,/^}/ |
| .code lex/lex1.oldgo /backup.steps/,/^}/ |
| |
| * The peek function |
| |
| .code lex/lex1.oldgo /peek.returns.but/,/^}/ |
| |
| * The accept functions |
| |
| .code lex/lex1.oldgo /accept.consumes/,/^}/ |
| .code lex/lex1.oldgo /acceptRun.consumes/,/^}/ |
| |
| * Lexing a number, including floating point |
| |
| .code lex/lex1.oldgo /^func.lexNumber/,/imaginary/ |
| |
| * Lexing a number, continued |
| |
| This is more accepting than it should be, but not by much. Caller must call `Atof` to validate. |
| |
| .code lex/lex1.oldgo /Is.it.imaginary/,/^}/ |
| |
| * Errors |
| |
| Easy to handle: emit the bad token and shut down the machine. |
| |
| .code lex/lex1.oldgo /error.returns/,/^}/ |
| |
| * Summary |
| |
| Concurrency makes the lexer easy to design. |
| |
| Goroutines allow lexer and caller (parser) each to run at its own rate, as clean sequential code. |
| |
| Channels give us a clean way to emit tokens. |
| |
| * A problem |
| |
| Can't run a goroutine to completion during initialization. |
| Forbidden by the language specification. |
| (Raises awful issues about order of init, best avoided.) |
| |
| That means we can't lex & parse a template during init. |
| |
| The goroutine is a problem.... |
| |
| _(Note:_This_restriction_was_lifted_in_Go_version_1_but_the_discussion_is_still_interesting.)_ |
| |
| * Design vs. implementation |
| |
| …but it's not necessary anyway. |
| |
| The work is done by the design; now we just adjust the API. |
| |
| We can change the API to hide the channel, provide a function to get the next token, and rewrite the run function. |
| |
| It's easy. |
| |
| * A new API |
| |
| Hide the channel and buffer it slightly, turning it into a ring buffer. |
| |
| .code lex/r59-lex.go /lex.creates.a.new/,/^}/ |
| |
| * A function for the next item |
| |
| Traditional lexer API: return next item. |
| Includes the modified state machine runner. |
| |
| .code lex/r59-lex.go /nextItem.returns/,/^}/ |
| |
| * That's it |
| |
| We now have a traditional API for a lexer with a simple, concurrent implementation under the covers. |
| |
| Even though the implementation is no longer truly concurrent, it still has all the advantages of concurrent design. |
| |
| We wouldn't have such a clean, efficient design if we hadn't thought about the problem in a concurrent way, without worrying about "restart". |
| |
| Model completely removes concerns about "structural mismatch". |
| |
| * Concurrency is a design approach |
| |
| Concurrency is not about parallelism. |
| |
| (Although it can enable parallelism). |
| |
| Concurrency is a way to design a program by decomposing it into independently executing pieces. |
| |
| The result can be clean, efficient, and very adaptable. |
| |
| * Conclusion |
| |
| Lexers are fun. |
| |
| Concurrency is fun. |
| |
| Go is fun. |
| |
| * For more information |
| |
| Go: [[http://golang.org]] |
| |
| New templates: http://golang.org/pkg/exp/template/ |
| |
| (Next release will move them out of experimental.) |