2016: add token.slide
Slides for my talk at the Go meetup in Sydney on September 15 2016.
Change-Id: Ib34d261a3c8fc4f698e9115953191d35e5ddacb8
Reviewed-on: https://go-review.googlesource.com/29345
Reviewed-by: Andrew Gerrand <adg@golang.org>
diff --git a/2016/token.slide b/2016/token.slide
new file mode 100644
index 0000000..049f3b9
--- /dev/null
+++ b/2016/token.slide
@@ -0,0 +1,471 @@
+Stacks of Tokens
+A study in interfaces
+
+Sydney Go Meetup
+15 September 2016
+
+Rob Pike
+Google
+@rob_pike
+[[http://golang.org/s/plusrob][+RobPikeTheHuman]]
+http://golang.org/
+
+* Background
+
+Spoke at GopherCon this year: [[https://talks.golang.org/2016/asm.slide]]
+
+That talk was about system design and portability.
+
+Today's talk is about the assembler's lexer.
+
+Spoke about lexing before: [[https://talks.golang.org/2011/lex.slide]]
+
+That talk showed a way to use concurrency to build a lexer.
+
+Today's talk is about interfaces.
+
+* Lexer
+
+A lexer, also called a scanner, breaks the input stream into distinct "tokens":
+
+- identifiers
+- numbers
+- quoted strings
+- operators
+- miscellaneous characters such as comma and colon.
+
+Each token has a _type_ and a _value_.
+
+* Example
+
+ MOVQ R0, 4(R1)
+
+The lexer turns this into the stream
+
+- identifier `MOVQ`
+- identifier `R0`
+- character `,`
+- number `4`
+- character `(`
+- identifier `R1`
+- character `)`
+
+Spaces are ignored.
+
+The parser then reads these tokens to parse the input into a _parse_tree_.
+
+* Go's text/scanner package
+
+There is a nice, efficient lexer package in the Go standard library:
+
+- [[https://golang.org/pkg/text/scanner/][`text/scanner`]]
+
+It can do this job just fine.
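+
+To give the flavor, a minimal sketch (not the assembler's code) that scans the example from before:
+
+	var s scanner.Scanner
+	s.Init(strings.NewReader("MOVQ R0, 4(R1)"))
+	for tok := s.Scan(); tok != scanner.EOF; tok = s.Scan() {
+		// TokenString names the token type; TokenText is its value.
+		fmt.Printf("%s: %q\n", scanner.TokenString(tok), s.TokenText())
+	}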
+
+But.... that is not enough for the assembler because of
+
+* Backwards compatibility
+
+The new Go assembler had to be totally compatible with the ones it replaced, which used YACC and were written in C. (See [[https://talks.golang.org/2015/gogo.slide]].)
+
+Each assembler (one per architecture) contained these lines at the end of `lex.c`:
+
+ #include "../cc/lexbody"
+ #include "../cc/macbody"
+
+This gave the assemblers the same lexer as the C compiler.
+The differences between C tokens and Go tokens are minor and can be handled, but....
+
+The C lexer brings in something problematic.
+
+* The C preprocessor
+
+The old assemblers had a C preprocessor built in!
+An old-fashioned one, without `#if` and token pasting, but still:
+
+ #include "file"
+ #define MAXBUF 512
+ #define MULDIV(a, b, c) ((a)*(b)/(c))
+ #ifdef MAXBUF
+ #endif
+
+The `text/scanner` package can't handle this.
+But we need to do it to be compatible.
+
+This talk is about how to use Go's interfaces to do it.
+
+* An observation
+
+What is standard input? An input source.
+
+- read the input
+
+What is an included file? An input source.
+
+- read the file
+
+What is a macro invocation? An input source.
+
+- read the macro definition
+
+Sounds a lot like `io.Reader`.
+
+* Token reader
+
+We don't want bytes, we want tokens. (Why?)
+
+Instead of
+
+ type Reader interface {
+ Read(p []byte) (n int, err error)
+ }
+
+we want something like
+
+ type TokenReader interface {
+ ReadToken() (Token, error)
+ }
+
+In practice the parser needs something different from `Read`, but the basic idea works.
+
+We build a lexer around an interface that reads tokens.
+
+* The observation in practice
+
+What is standard input? An input source.
+
+- get tokens from the input
+
+What is an included file? An input source.
+
+- get tokens from the file
+
+What is a macro invocation? An input source.
+
+- get tokens from the macro definition
+
+Each of these implements the `TokenReader` interface.
+
+* TokenReader
+
+ type TokenReader interface {
+ // Next returns the next token.
+ Next() ScanToken
+ // The following methods all refer to the most recent token returned by Next.
+ // Text returns the original string representation of the token.
+ Text() string
+ // File reports the source file name of the token.
+ File() string
+ // Line reports the source line number of the token.
+ Line() int
+ // Col reports the source column number of the token.
+ Col() int
+ }
+
+The parser calls `Next`, then can ask about the token: what it is, where it is.
+`ScanToken` is just the token rune returned by `text/scanner`'s `Scan`, with tweaks.
+
+Note: No `Peek`. This has no lookahead.
+
+* Tokenizer
+
+`Tokenizer`, the foundational implementation, turns a `text/scanner.Scanner` into a `TokenReader`.
+
+ // A Tokenizer is a simple wrapping of text/scanner.Scanner, configured
+ // for our purposes and made a TokenReader. It forms the lowest level,
+ // turning text from readers into tokens.
+ type Tokenizer struct {
+ tok ScanToken // Most recent token.
+ s *scanner.Scanner
+ line int
+ fileName string
+ }
+
+ func NewTokenizer(name string, r io.Reader, file *os.File) *Tokenizer
+
+Either the reader or the file may be nil.
+
+`Tokenizer` implements `TokenReader`.
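+
+A sketch of what the constructor might set up, assuming the configuration the assembler needs (newlines significant, comments scanned so their lines can be counted):
+
+	func NewTokenizer(name string, r io.Reader, file *os.File) *Tokenizer {
+		var s scanner.Scanner
+		s.Init(r)
+		// Newline matters to the assembler; other space characters do not.
+		s.Whitespace = 1<<'\t' | 1<<'\r' | 1<<' '
+		// Keep comments so Next can count the newlines inside them.
+		s.Mode = scanner.ScanChars |
+			scanner.ScanFloats |
+			scanner.ScanIdents |
+			scanner.ScanInts |
+			scanner.ScanStrings |
+			scanner.ScanComments
+		return &Tokenizer{
+			s:        &s,
+			line:     1,
+			fileName: name,
+		}
+	}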
+
+* Tokenizer.Next
+
+To give the flavor:
+
+ func (t *Tokenizer) Next() ScanToken {
+ s := t.s
+ for {
+ t.tok = ScanToken(s.Scan())
+ if t.tok != scanner.Comment {
+ break
+ }
+ length := strings.Count(s.TokenText(), "\n")
+ t.line += length
+ histLine += length
+ // For now, just discard all comments.
+ }
+ // Special processing for '\n' etc. elided.
+ return t.tok
+ }
+
+* Tokenizer.Text
+
+ func (t *Tokenizer) Text() string {
+ switch t.tok {
+ case LSH: // Special handling of odd tokens used by ARM.
+ return "<<"
+ case RSH:
+ return ">>"
+ case ARR:
+ return "->"
+ case ROT:
+ return "@>"
+ }
+ return t.s.TokenText()
+ }
+
+`LSH` etc. are the reason for `ScanToken`: the set of tokens is a superset of the underlying type `text/scanner.Token`.
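+
+A sketch of one way to define those extra tokens: `Scan` returns small negative runes for its token classes, so values well below that range are free (the exact values are an implementation choice):
+
+	type ScanToken rune
+
+	const (
+		// Two-character lexemes text/scanner doesn't know about;
+		// we make up rune values for them.
+		LSH ScanToken = -1000 - iota // << Left shift.
+		RSH                          // >> Logical right shift.
+		ARR                          // -> Arithmetic right shift (ARM).
+		ROT                          // @> Rotate right (ARM).
+	)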
+
+* Macro definitions
+
+It's easy to see how files work: `NewTokenizer` can do that.
+
+What about a macro definition?
+
+ #define A(arg) 27+(arg)
+
+Becomes the tokens
+
+ 27 + ( arg )
+
+When we encounter `A(x)`, we substitute the argument and get
+
+ 27 + ( x )
+
+Use a slice of tokens and store them in a `Macro` struct.
+
+ type Macro struct {
+ name string // The #define name.
+ args []string // Formal arguments.
+ tokens []Token // Body of macro.
+ }
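+
+For reference, a `Token` need hold only the scanned token plus its original text, so a macro body can be replayed exactly as written; a minimal sketch:
+
+	type Token struct {
+		ScanToken
+		text string
+	}
+
+	func (t Token) String() string { return t.text }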
+
+* Slice
+
+ // A Slice reads from a slice of Tokens.
+ type Slice struct {
+ tokens []Token
+ fileName string
+ line int
+		pos int // Index of current token; starts at -1 so the first Next returns tokens[0].
+ }
+
+Implements `TokenReader`.
+
+ func (s *Slice) Next() ScanToken {
+ s.pos++
+ if s.pos >= len(s.tokens) {
+ return scanner.EOF
+ }
+ return s.tokens[s.pos].ScanToken
+ }
+
+To invoke a macro, substitute the _actuals_ for the _formals_ and make a `Slice`.
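+
+A sketch of the invocation step. `argsFor` (which would gather the actual arguments into a map from formal name to replacement tokens) and the `NewSlice` constructor are assumed:
+
+	func (in *Input) invokeMacro(macro *Macro) {
+		if macro.args == nil { // No formals (e.g. from -D): push the body as is.
+			in.Push(NewSlice(in.File(), in.Line(), macro.tokens))
+			return
+		}
+		actuals := in.argsFor(macro) // Assumed helper: formal name -> actual tokens.
+		var tokens []Token
+		for _, tok := range macro.tokens {
+			if tok.ScanToken == scanner.Ident {
+				if substitution, ok := actuals[tok.String()]; ok {
+					tokens = append(tokens, substitution...)
+					continue
+				}
+			}
+			tokens = append(tokens, tok)
+		}
+		in.Push(NewSlice(in.File(), in.Line(), tokens))
+	}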
+
+* Command-line flags
+
+A command-line flag `-D` can define a macro before execution:
+
+ go tool asm -D 'macro=value' file.s
+
+That's easy!
+
+ var DFlag MultiFlag
+ flag.Var(&DFlag, "D", "predefined symbol D=identifier...")
+
+ type MultiFlag []string // Implements flag.Value, allows multiple settings.
+
+ predefine(DFlag)
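+
+The two methods that make `MultiFlag` satisfy `flag.Value` are tiny; a sketch:
+
+	func (m *MultiFlag) String() string { return fmt.Sprint(*m) }
+
+	func (m *MultiFlag) Set(val string) error {
+		*m = append(*m, val) // Each -D on the command line appends another definition.
+		return nil
+	}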
+
+* Predefined macros
+
+ // predefine installs the macros set by the -D flag on the command line.
+ func predefine(defines MultiFlag) map[string]*Macro {
+ macros := make(map[string]*Macro)
+ for _, name := range defines {
+ value := "1"
+ i := strings.IndexRune(name, '=')
+ if i > 0 {
+ name, value = name[:i], name[i+1:]
+ }
+ // Various error checks elided.
+ macros[name] = &Macro{
+ name: name,
+ args: nil, // No arguments allowed.
+ tokens: Tokenize(value), // Turn the value into tokens.
+ }
+ }
+ return macros
+ }
+
+The return value is the initial symbol table of macros.
+
+* The big picture
+
+We know how to:
+
+- tokenize an input stream from text or `io.Reader`
+- define a macro
+- invoke a macro
+
+But how does it fit together?
+Which implementation of `TokenReader` does the parser see?
+
+Consider
+
+- `#include` names a file to process next
+- macro invocation names a slice of tokens to process next
+
+It's a stack! Push new input, pop at EOF.
+
+* Stack
+
+ // A Stack is a stack of TokenReaders. As the top TokenReader hits EOF,
+ // it resumes reading the next one down.
+ type Stack struct {
+ tr []TokenReader
+ }
+
+ // Push adds tr to the top (end) of the input stack. (Popping happens automatically.)
+ func (s *Stack) Push(tr TokenReader) {
+ s.tr = append(s.tr, tr)
+ }
+
+ func (s *Stack) Next() ScanToken {
+ tos := s.tr[len(s.tr)-1]
+ tok := tos.Next()
+ for tok == scanner.EOF && len(s.tr) > 1 {
+ // Pop the topmost item from the stack and resume with the next one down.
+ s.tr = s.tr[:len(s.tr)-1]
+ tok = s.Next()
+ }
+ return tok
+ }
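+
+The remaining `TokenReader` methods just delegate to the top of the stack, along these lines:
+
+	func (s *Stack) Text() string {
+		return s.tr[len(s.tr)-1].Text()
+	}
+
+	func (s *Stack) File() string {
+		return s.tr[len(s.tr)-1].File()
+	}
+
+	// Line and Col delegate the same way.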
+
+* The Input type
+
+ // Input is the main input: a stack of readers and some macro definitions.
+ // It also handles #include processing (by pushing onto the input stack)
+ // and parses and instantiates macro definitions.
+ type Input struct {
+ Stack
+ includes []string // Directories in which to look for #includes
+ macros map[string]*Macro
+ text string // Text of last token returned by Next.
+ ...
+ }
+
+Note the embedding of `Stack`: `Input` is a wrapping of the `Stack` implementation of `TokenReader`.
+The parser uses a single instance of `Input` as its `TokenReader`.
+
+* Example: #include processing
+
+Some error handling elided for brevity.
+
+ func (in *Input) include() {
+ // Find and parse file name, which is next token, a string.
+		in.Stack.Next() // Advance to the file name token (type check elided).
+ name, _ := strconv.Unquote(in.Stack.Text())
+ in.expectNewline("#include") // Checks that a newline comes now.
+ // Push tokenizer for file onto stack.
+ fd, err := os.Open(name)
+ if err != nil {
+ for _, dir := range in.includes {
+ fd, err = os.Open(filepath.Join(dir, name))
+ if err == nil {
+ break
+ }
+ }
+ }
+ in.Push(NewTokenizer(name, fd, fd))
+ }
+
+* Macro definition
+
+Macro definition is similar but more complex because of the variety of forms.
+Must deal with constants, empty values, macros with arguments, etc.
+
+The end result is to build a `Macro` value and install it in `Input.macros`.
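+
+In outline, with the parsing helpers assumed:
+
+	// define processes a #define directive: parse the name, the formal
+	// arguments (if any), and the body, then record the macro.
+	// macroName and macroDefinition are parsing helpers, elided here.
+	func (in *Input) define() {
+		name := in.macroName()
+		args, tokens := in.macroDefinition(name)
+		in.macros[name] = &Macro{
+			name:   name,
+			args:   args,
+			tokens: tokens,
+		}
+	}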
+
+* The final piece: Input.Next
+
+Here is the implementation of a CPP input stream using these types.
+(Error handling mostly elided for brevity.)
+
+ func (in *Input) Next() ScanToken {
+ // If we cannot generate a token after 100 macro invocations, we're in trouble.
+ for nesting := 0; nesting < 100; {
+ tok := in.Stack.Next()
+ switch tok {
+ case '#':
+ in.hash()
+ case scanner.Ident:
+ // Is it a macro name?
+ name := in.Stack.Text()
+ macro := in.macros[name]
+ if macro != nil {
+ nesting++
+ in.invokeMacro(macro)
+ continue
+ }
+ fallthrough
+ default:
+ // Continued on next slide.
+
+* Input.Next part 2
+
+ func (in *Input) Next() ScanToken {
+ // Continued from previous slide.
+ default:
+ if tok == scanner.EOF && len(in.ifdefStack) > 0 {
+ // We're skipping text but have run out of input with no #endif.
+ in.Error("unclosed #ifdef or #ifndef")
+ }
+ if in.enabled() {
+ in.text = in.Stack.Text()
+ return tok
+ }
+ }
+ }
+ in.Error("recursive macro invocation")
+ return 0
+ }
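+
+For context, the `in.hash()` called above dispatches on the directive name; a sketch, with the individual directive handlers elided:
+
+	func (in *Input) hash() {
+		in.Stack.Next() // Advance to the directive name.
+		switch in.Stack.Text() {
+		case "define":
+			in.define()
+		case "include":
+			in.include()
+		case "ifdef":
+			in.ifdef(true)
+		case "ifndef":
+			in.ifdef(false)
+		case "else":
+			in.else_()
+		case "endif":
+			in.endif()
+		case "undef":
+			in.undef()
+		default:
+			in.Error("unexpected token after '#': " + in.Stack.Text())
+		}
+	}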
+
+* Initializing and running the lexer
+
+ // NewInput returns an Input from the given path.
+ func NewInput(name string) *Input {
+ return &Input{
+ // include directories: look in source dir, then -I directories.
+ includes: append([]string{filepath.Dir(name)}, IFlag...),
+ macros: predefine(DFlag),
+ }
+ }
+
+To run, call `in.Push` to put the input file (or `os.Stdin`) on the stack.
+
+Then the lexer runs until the `Stack` is empty.
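+
+A sketch of a driver, assuming the file name arrives in `name`:
+
+	in := NewInput(name)
+	fd, err := os.Open(name)
+	if err != nil {
+		log.Fatalf("asm: %v", err)
+	}
+	in.Push(NewTokenizer(name, fd, fd))
+	for tok := in.Next(); tok != scanner.EOF; tok = in.Next() {
+		fmt.Println(in.Text()) // The parser would consume tokens here.
+	}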
+
+* Summary
+
+Interfaces give programs structure.
+
+Interfaces encourage design by composition.
+
+- We have an interface that is implemented by a stack of itself!
+
+Interfaces enable and enforce clean divisions between components.
+
+- The simple idea of a `TokenReader` let us implement `#include` files, `#define` macros, command-line flags, `#ifdef`, and more, all with one small interface.
+
+And a final tip of the hat to `text/scanner` under it all.