Stacks of Tokens
A study in interfaces

Sydney Go Meetup
15 September 2016

Rob Pike
Google
@rob_pike
[[http://golang.org/s/plusrob][+RobPikeTheHuman]]
http://golang.org/

* Background

Spoke at GopherCon this year: [[https://talks.golang.org/2016/asm.slide]]

That talk was about system design and portability.

Today's talk is about its lexer.

Spoke about lexing before: [[https://talks.golang.org/2011/lex.slide]]

That talk showed a way to use concurrency to build a lexer.

Today's talk is about interfaces.

* Lexer

A lexer, also called a scanner, breaks the input stream into distinct "tokens":

- identifiers
- numbers
- quoted strings
- operators
- miscellaneous characters such as comma and colon

Each token has a _type_ and a _value_.
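
As an illustrative sketch (not the assembler's actual types, which appear later in this talk), a token can be as little as a type tag plus the original text:

    // Illustrative only; the real lexer's types come later.
    type TokenType int

    type Token struct {
        Type TokenType // Identifier, number, string, operator, ...
        Text string    // The original text, such as "MOVQ".
    }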

* Example

    MOVQ R0, 4(R1)

The lexer turns this into the stream:

- identifier `MOVQ`
- identifier `R0`
- character `,`
- number `4`
- character `(`
- identifier `R1`
- character `)`

Spaces are ignored.

The parser then reads these tokens to parse the input into a _parse_tree_.

* Go's text/scanner package

There is a nice, efficient lexer package in the Go standard library:

- [[https://golang.org/pkg/text/scanner/][`text/scanner`]]

It can do this job just fine.
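
For instance, this self-contained sketch (the assembler configures things differently) tokenizes the example from earlier:

    package main

    import (
        "fmt"
        "strings"
        "text/scanner"
    )

    func main() {
        var s scanner.Scanner
        s.Init(strings.NewReader("MOVQ R0, 4(R1)"))
        // Scan returns tokens until the input is exhausted.
        for tok := s.Scan(); tok != scanner.EOF; tok = s.Scan() {
            fmt.Printf("%s\t%q\n", scanner.TokenString(tok), s.TokenText())
        }
    }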

But.... that is not enough for the assembler because of

* Backwards compatibility

The new Go assembler had to be totally compatible with the ones it replaced, which used YACC and were written in C. (See [[https://talks.golang.org/2015/gogo.slide]].)

Each assembler (one per architecture) contained these lines at the end of `lex.c`:

    #include "../cc/lexbody"
    #include "../cc/macbody"

This gave the assemblers the same lexer as the C compiler.
The differences between C tokens and Go tokens are minor and can be handled, but....

The C lexer brings in something problematic.

* The C preprocessor

The old assemblers had a C preprocessor built in!
An old-fashioned one, without `#if` and token pasting, but still:

    #include "file"
    #define MAXBUF 512
    #define MULDIV(a, b, c) ((a)*(b)/(c))
    #ifdef MAXBUF
    #endif

The `text/scanner` package can't handle this.
But we need to do it to be compatible.

This talk is about how to use Go's interfaces to do it.

* An observation

What is standard input? An input source.

- read the input

What is an included file? An input source.

- read the file

What is a macro invocation? An input source.

- read the macro definition

Sounds a lot like `io.Reader`.

* Token reader

We don't want bytes, we want tokens. (Why?)

Instead of

    type Reader interface {
        Read(p []byte) (n int, err error)
    }

we want something like

    type TokenReader interface {
        ReadToken() (Token, error)
    }

In practice the parser needs something different from `Read`, but the basic idea works.

We build a lexer around an interface that reads tokens.

* The observation in practice

What is standard input? An input source.

- get tokens from the input

What is an included file? An input source.

- get tokens from the file

What is a macro invocation? An input source.

- get tokens from the macro definition

Each of these implements the `TokenReader` interface.

* TokenReader

    type TokenReader interface {
        // Next returns the next token.
        Next() ScanToken
        // The following methods all refer to the most recent token returned by Next.
        // Text returns the original string representation of the token.
        Text() string
        // File reports the source file name of the token.
        File() string
        // Line reports the source line number of the token.
        Line() int
        // Col reports the source column number of the token.
        Col() int
    }

The parser calls `Next`, then can ask about the token: what it is, where it is.
`ScanToken` is just `text/scanner`'s token type with tweaks.

Note: No `Peek`. This has no lookahead.
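
To show how a client consumes the interface, here is a hypothetical helper (not the assembler's parser) that drains any `TokenReader`:

    // dump reads tokens until EOF, reporting each one and where it came from.
    // Hypothetical client code, for illustration only.
    func dump(tr TokenReader) {
        for tok := tr.Next(); tok != scanner.EOF; tok = tr.Next() {
            fmt.Printf("%s:%d:%d: %q\n", tr.File(), tr.Line(), tr.Col(), tr.Text())
        }
    }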

* Tokenizer

`Tokenizer`, the foundational `TokenReader`, turns a `text/scanner.Scanner` into a `TokenReader`.

    // A Tokenizer is a simple wrapping of text/scanner.Scanner, configured
    // for our purposes and made a TokenReader. It forms the lowest level,
    // turning text from readers into tokens.
    type Tokenizer struct {
        tok      ScanToken // Most recent token.
        s        *scanner.Scanner
        line     int
        fileName string
    }

    func NewTokenizer(name string, r io.Reader, file *os.File) *Tokenizer

Either the reader or the file may be nil.

`Tokenizer` implements `TokenReader`.

* Tokenizer.Next

To give the flavor:

    func (t *Tokenizer) Next() ScanToken {
        s := t.s
        for {
            t.tok = ScanToken(s.Scan())
            if t.tok != scanner.Comment {
                break
            }
            length := strings.Count(s.TokenText(), "\n")
            t.line += length
            histLine += length // histLine: package-level line count (definition elided).
            // For now, just discard all comments.
        }
        // Special processing for '\n' etc. elided.
        return t.tok
    }

* Tokenizer.Text

    func (t *Tokenizer) Text() string {
        switch t.tok {
        case LSH: // Special handling of odd tokens used by ARM.
            return "<<"
        case RSH:
            return ">>"
        case ARR:
            return "->"
        case ROT:
            return "@>"
        }
        return t.s.TokenText()
    }

`LSH` etc. are the reason for `ScanToken`: the set of tokens is a superset of those provided by the underlying `text/scanner` type.
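
A sketch of how that superset might be declared (the constant values here are illustrative, not the assembler's actual ones):

    type ScanToken rune

    const (
        // Illustrative values, chosen clear of any valid rune or text/scanner token.
        LSH ScanToken = -1000 - iota // << Left shift.
        RSH                          // >> Right shift.
        ARR                          // -> Arrow, used by ARM.
        ROT                          // @> Rotate, used by ARM.
    )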

* Macro definitions

It's easy to see how files work: `NewTokenizer` can do that.

What about a macro definition?

    #define A(arg) 27+(arg)

Becomes the tokens

    27 + ( arg )

When we encounter `A(x)`, we substitute the argument and get

    27 + ( x )

Store the body as a slice of tokens in a `Macro` struct:

    type Macro struct {
        name   string   // The #define name.
        args   []string // Formal arguments.
        tokens []Token  // Body of macro.
    }

* Slice

    // A Slice reads from a slice of Tokens.
    type Slice struct {
        tokens   []Token
        fileName string
        line     int
        pos      int // Initialized to -1 so the first Next advances to element 0.
    }

Implements `TokenReader`.

    func (s *Slice) Next() ScanToken {
        s.pos++
        if s.pos >= len(s.tokens) {
            return scanner.EOF
        }
        return s.tokens[s.pos].ScanToken
    }

To invoke a macro, substitute the _actuals_ for the _formals_ and make a `Slice`, as sketched below.
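
A sketch of that substitution, as a hypothetical helper (it assumes each `Token` carries its text in a `text` field, as `Slice` suggests):

    // expand is a hypothetical sketch: it replaces each formal parameter
    // with the actual argument's tokens, producing the slice for a Slice to replay.
    func expand(m *Macro, actuals map[string][]Token) []Token {
        var out []Token
        for _, tok := range m.tokens {
            if args, ok := actuals[tok.text]; ok && tok.ScanToken == scanner.Ident {
                out = append(out, args...) // Formal: splice in the actual.
                continue
            }
            out = append(out, tok) // Ordinary token: copy through.
        }
        return out
    }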

* Command-line flags

A command-line flag `-D` can define a macro before execution:

    go tool asm -D 'macro=value' file.s

That's easy!

    var DFlag MultiFlag
    flag.Var(&DFlag, "D", "predefined symbol D=identifier...")

    type MultiFlag []string // Implements flag.Value, allows multiple settings.

    predefine(DFlag)
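
`MultiFlag` needs only the two `flag.Value` methods; a minimal sketch (the assembler's version is equivalent in spirit):

    func (m *MultiFlag) String() string { return fmt.Sprint(*m) }

    func (m *MultiFlag) Set(val string) error {
        *m = append(*m, val) // Each -D on the command line adds another entry.
        return nil
    }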

* Predefined macros

    // predefine installs the macros set by the -D flag on the command line.
    func predefine(defines MultiFlag) map[string]*Macro {
        macros := make(map[string]*Macro)
        for _, name := range defines {
            value := "1"
            i := strings.IndexRune(name, '=')
            if i > 0 {
                name, value = name[:i], name[i+1:]
            }
            // Various error checks elided.
            macros[name] = &Macro{
                name:   name,
                args:   nil,             // No arguments allowed.
                tokens: Tokenize(value), // Turn the value into tokens.
            }
        }
        return macros
    }

The return value is the initial symbol table of macros.
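
`Tokenize` isn't shown here; a plausible sketch (assuming a `Token` pairs a `ScanToken` with its text) runs a `Tokenizer` over the string and collects tokens until EOF:

    // Tokenize turns a string into a slice of Tokens (sketch, not the
    // assembler's actual code).
    func Tokenize(str string) []Token {
        t := NewTokenizer("command line", strings.NewReader(str), nil)
        var tokens []Token
        for {
            tok := t.Next()
            if tok == scanner.EOF {
                break
            }
            tokens = append(tokens, Token{ScanToken: tok, text: t.Text()})
        }
        return tokens
    }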

* The big picture

We know how to:

- tokenize an input stream from text or `io.Reader`
- define a macro
- invoke a macro

But how does it fit together?
Which implementation of `TokenReader` does the parser see?

Consider:

- `#include` names a file to process next
- macro invocation names a slice of tokens to process next

It's a stack! Push new input, pop at EOF.

* Stack

    // A Stack is a stack of TokenReaders. As the top TokenReader hits EOF,
    // it resumes reading the next one down.
    type Stack struct {
        tr []TokenReader
    }

    // Push adds tr to the top (end) of the input stack. (Popping happens automatically.)
    func (s *Stack) Push(tr TokenReader) {
        s.tr = append(s.tr, tr)
    }

    func (s *Stack) Next() ScanToken {
        tos := s.tr[len(s.tr)-1]
        tok := tos.Next()
        for tok == scanner.EOF && len(s.tr) > 1 {
            // Pop the topmost item from the stack and resume with the next one down.
            s.tr = s.tr[:len(s.tr)-1]
            tok = s.Next()
        }
        return tok
    }
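
In use (a hypothetical sequence; the file variables are illustrative), an `#include` just pushes another reader, and popping is invisible to the client:

    // Hypothetical: mainFile and headerFile are open *os.File values.
    var s Stack
    s.Push(NewTokenizer("main.s", mainFile, mainFile))       // The original input.
    s.Push(NewTokenizer("header.h", headerFile, headerFile)) // Pushed by #include.
    tok := s.Next() // Tokens come from header.h until EOF; then main.s resumes.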

* The Input type

    // Input is the main input: a stack of readers and some macro definitions.
    // It also handles #include processing (by pushing onto the input stack)
    // and parses and instantiates macro definitions.
    type Input struct {
        Stack
        includes []string // Directories in which to look for #includes.
        macros   map[string]*Macro
        text     string // Text of last token returned by Next.
        ...
    }

Note the embedding of `Stack`: `Input` is a wrapping of the `Stack` implementation of `TokenReader`.
The parser uses a single instance of `Input` as its `TokenReader`.

* Example: #include processing

Some error handling elided for brevity.

    func (in *Input) include() {
        // Find and parse the file name, which is the next token, a string.
        in.Stack.Next()
        name, _ := strconv.Unquote(in.Stack.Text())
        in.expectNewline("#include") // Checks that a newline comes now.
        // Push a tokenizer for the file onto the stack.
        fd, err := os.Open(name)
        if err != nil {
            for _, dir := range in.includes {
                fd, err = os.Open(filepath.Join(dir, name))
                if err == nil {
                    break
                }
            }
        }
        in.Push(NewTokenizer(name, fd, fd))
    }

* Macro definition

Macro definition is similar but more complex because of the variety of forms.
It must deal with constants, empty values, macros with arguments, and so on.

The end result is to build a `Macro` value and install it in `Input.macros`.

* The final piece: Input.Next

Here is the implementation of a C preprocessor input stream using these types.
(Error handling mostly elided for brevity.)

    func (in *Input) Next() ScanToken {
        // If we cannot generate a token after 100 macro invocations, we're in trouble.
        for nesting := 0; nesting < 100; {
            tok := in.Stack.Next()
            switch tok {
            case '#':
                in.hash()
            case scanner.Ident:
                // Is it a macro name?
                name := in.Stack.Text()
                macro := in.macros[name]
                if macro != nil {
                    nesting++
                    in.invokeMacro(macro)
                    continue
                }
                fallthrough
            default:
                // Continued on next slide.

* Input.Next part 2

    func (in *Input) Next() ScanToken {
            // Continued from previous slide.
            default:
                if tok == scanner.EOF && len(in.ifdefStack) > 0 {
                    // We're skipping text but have run out of input with no #endif.
                    in.Error("unclosed #ifdef or #ifndef")
                }
                if in.enabled() {
                    in.text = in.Stack.Text()
                    return tok
                }
            }
        }
        in.Error("recursive macro invocation")
        return 0
    }

* Initializing and running the lexer

    // NewInput returns an Input from the given path.
    func NewInput(name string) *Input {
        return &Input{
            // include directories: look in source dir, then -I directories.
            includes: append([]string{filepath.Dir(name)}, IFlag...),
            macros:   predefine(DFlag),
        }
    }

To run, call `in.Push` to put the input file (or `os.Stdin`) on the stack.

Then the lexer runs until the `Stack` is empty.
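
Putting it together, a hypothetical driver (the name `lexAll` and the printing are illustrative, not the assembler's code):

    // lexAll opens a file, pushes it onto the Input's stack, and prints
    // every token with its position.
    func lexAll(name string) {
        in := NewInput(name)
        fd, err := os.Open(name)
        if err != nil {
            log.Fatal(err)
        }
        in.Push(NewTokenizer(name, fd, fd))
        for tok := in.Next(); tok != scanner.EOF; tok = in.Next() {
            fmt.Printf("%s:%d: %s\n", in.File(), in.Line(), in.Text())
        }
    }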

* Summary

Interfaces give programs structure.

Interfaces encourage design by composition.

- We have an interface that is implemented by a stack of itself!

Interfaces enable and enforce clean divisions between components.

- The simple idea of a `TokenReader` let us implement `#include` files, `#define` macros, command-line flags, `#ifdef`, and more, all through one interface.

And a final tip of the hat to `text/scanner` under it all.