 Strings, bytes, runes and characters in Go
 23 Oct 2013
 Tags: strings, bytes, runes, characters

 Rob Pike

 * Introduction

 The [[https://blog.golang.org/slices][previous blog post]] explained how slices
 work in Go, using a number of examples to illustrate the mechanism behind
 their implementation.
 Building on that background, this post discusses strings in Go.
 At first, strings might seem too simple a topic for a blog post, but to use
 them well requires understanding not only how they work,
 but also the difference between a byte, a character, and a rune,
 the difference between Unicode and UTF-8,
 the difference between a string and a string literal,
 and other even more subtle distinctions.

 One way to approach this topic is to think of it as an answer to the frequently
 asked question, "When I index a Go string at position _n_, why don't I get the
 _nth_ character?"
 As you'll see, this question leads us to many details about how text works
 in the modern world.

 An excellent introduction to some of these issues, independent of Go,
 is Joel Spolsky's famous blog post,
 [[http://www.joelonsoftware.com/articles/Unicode.html][The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)]].
 Many of the points he raises will be echoed here.

 * What is a string?

 Let's start with some basics.

 In Go, a string is in effect a read-only slice of bytes.
 If you're at all uncertain about what a slice of bytes is or how it works,
 please read the [[https://blog.golang.org/slices][previous blog post]];
 we'll assume here that you have.

 It's important to state right up front that a string holds _arbitrary_ bytes.
 It is not required to hold Unicode text, UTF-8 text, or any other predefined format.
 As far as the content of a string is concerned, it is exactly equivalent to a
 slice of bytes.

 Here is a string literal (more about those soon) that uses the
 `\xNN` notation to define a string constant holding some peculiar byte values.
 (Of course, bytes range from hexadecimal values 00 through FF, inclusive.)

 .code strings/basic.go /const sample/

 * Printing strings

 Because some of the bytes in our sample string are not valid ASCII, not even
 valid UTF-8, printing the string directly will produce ugly output.
 The simple print statement

 .code strings/basic.go /println/,/println/

 produces this mess (whose exact appearance varies with the environment):

 	��=� ⌘

 To find out what that string really holds, we need to take it apart and examine the pieces.
 There are several ways to do this.
 The most obvious is to loop over its contents and pull out the bytes
 individually, as in this `for` loop:

 .code strings/basic.go /byte loop/,/byte loop/

 As implied up front, indexing a string accesses individual bytes, not
 characters. We'll return to that topic in detail below. For now, let's
 stick with just the bytes.
 This is the output from the byte-by-byte loop:

 	bd b2 3d bc 20 e2 8c 98

 Notice how the individual bytes match the
 hexadecimal escapes that defined the string.
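
 The loop itself lives in `strings/basic.go`, which is not reproduced here.
 As a rough sketch of the same idea, the program below rebuilds the sample
 constant from the byte values printed above (so its exact spelling is an
 assumption) and uses a hypothetical helper, `hexBytes`, to index the string
 one byte at a time.

```go
package main

import (
	"fmt"
	"strings"
)

// sample is reconstructed from the bytes in the output above; the
// original declaration in strings/basic.go may be written differently.
const sample = "\xbd\xb2\x3d\xbc\x20\xe2\x8c\x98"

// hexBytes indexes s one byte at a time and formats each byte in hex.
func hexBytes(s string) string {
	parts := make([]string, 0, len(s))
	for i := 0; i < len(s); i++ {
		parts = append(parts, fmt.Sprintf("%x", s[i]))
	}
	return strings.Join(parts, " ")
}

func main() {
	fmt.Println(len(sample))      // 8: len counts bytes
	fmt.Println(hexBytes(sample)) // bd b2 3d bc 20 e2 8c 98
}
```

 Note that `len(sample)` and `sample[i]` both work in bytes, which is why
 the loop bound and the index agree.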

 A shorter way to generate presentable output for a messy string
 is to use the `%x` (hexadecimal) format verb of `fmt.Printf`.
 It just dumps out the sequential bytes of the string as hexadecimal
 digits, two per byte.

 .code strings/basic.go /percent x/,/percent x/

 Compare its output to that above:

 	bdb23dbc20e28c98

 A nice trick is to use the "space" flag in that format, putting a
 space between the `%` and the `x`. Compare the format string
 used here to the one above,

 .code strings/basic.go /percent space x/,/percent space x/

 and notice how the bytes come
 out with spaces between, making the result a little less imposing:

 	bd b2 3d bc 20 e2 8c 98

 There's more. The `%q` (quoted) verb will escape any non-printable
 byte sequences in a string so the output is unambiguous.

 .code strings/basic.go /percent q/,/percent q/

 This technique is handy when much of the string is
 intelligible as text but there are peculiarities to root out; it produces:

 	"\xbd\xb2=\xbc ⌘"

 If we squint at that, we can see that buried in the noise is one ASCII equals sign,
 along with a regular space, and at the end appears the well-known Swedish "Place of Interest"
 symbol.
 That symbol has Unicode value U+2318, encoded as UTF-8 by the bytes
 after the space (hex value `20`): `e2` `8c` `98`.

 If we are unfamiliar with or confused by strange values in the string,
 we can use the "plus" flag to the `%q` verb. This flag causes the output to escape
 not only non-printable sequences, but also any non-ASCII bytes, all
 while interpreting UTF-8.
 The result is that it exposes the Unicode values of properly formatted UTF-8
 that represents non-ASCII data in the string:

 .code strings/basic.go /percent plus q/,/percent plus q/

 With that format, the Unicode value of the Swedish symbol shows up as a
 `\u` escape:

 	"\xbd\xb2=\xbc \u2318"

 These printing techniques are good to know when debugging
 the contents of strings, and will be handy in the discussion that follows.
 It's worth pointing out as well that all these methods behave exactly the
 same for byte slices as they do for strings.
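
 For reference, here is a small sketch that applies the four verbs discussed
 in this section to one string. The constant is rebuilt from the bytes shown
 earlier, and `dump` is a hypothetical helper introduced only for this
 illustration.

```go
package main

import "fmt"

// sample is reconstructed from the bytes printed earlier in this section.
const sample = "\xbd\xb2\x3d\xbc\x20\xe2\x8c\x98"

// dump formats s using a single fmt verb such as "%x" or "%+q".
func dump(verb, s string) string {
	return fmt.Sprintf(verb, s)
}

func main() {
	for _, verb := range []string{"%x", "% x", "%q", "%+q"} {
		fmt.Printf("%-4s %s\n", verb, dump(verb, sample))
	}
}
```

 Each line of output pairs a verb with the corresponding result shown above.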

 Here's the full set of printing options we've listed, presented as
 a complete program you can run (and edit) right in the browser:

 .play -edit strings/basic.go /package/,/^}/

 [Exercise: Modify the examples above to use a slice of bytes
 instead of a string. Hint: Use a conversion to create the slice.]

 [Exercise: Loop over the string using the `%q` format on each byte.
 What does the output tell you?]

 * UTF-8 and string literals

 As we saw, indexing a string yields its bytes, not its characters: a string is just a
 bunch of bytes.
 That means that when we store a character value in a string,
 we store its byte-at-a-time representation.
 Let's look at a more controlled example to see how that happens.

 Here's a simple program that prints a string constant with a single character
 three different ways, once as a plain string, once as an ASCII-only quoted
 string, and once as individual bytes in hexadecimal.
 To avoid any confusion, we create a "raw string", enclosed by back quotes,
 so it can contain only literal text. (Regular strings, enclosed by double
 quotes, can contain escape sequences as we showed above.)

 .play -edit strings/utf8.go /^func/,/^}/

 The output is:

 	plain string: ⌘
 	quoted string: "\u2318"
 	hex bytes: e2 8c 98

 which reminds us that the Unicode character value U+2318, the "Place
 of Interest" symbol ⌘, is represented by the bytes `e2` `8c` `98`, and
 that those bytes are the UTF-8 encoding of the hexadecimal
 value 2318.
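
 To see where those three bytes come from, the sketch below encodes U+2318
 by hand and compares the result with `utf8.EncodeRune` from the standard
 library. The helper `encode3` is hypothetical and deliberately narrow: it
 handles only code points that need exactly three bytes.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// encode3 builds the three-byte UTF-8 form of r by hand. It is only
// correct for code points from U+0800 through U+FFFF, excluding surrogates.
func encode3(r rune) [3]byte {
	return [3]byte{
		0xE0 | byte(r>>12),         // 1110xxxx carries the top 4 bits
		0x80 | (byte(r>>6) & 0x3F), // 10xxxxxx carries the middle 6 bits
		0x80 | (byte(r) & 0x3F),    // 10xxxxxx carries the low 6 bits
	}
}

func main() {
	manual := encode3(0x2318)
	fmt.Printf("% x\n", manual[:]) // e2 8c 98

	buf := make([]byte, utf8.UTFMax)
	n := utf8.EncodeRune(buf, 0x2318)
	fmt.Printf("% x\n", buf[:n]) // e2 8c 98
}
```

 In real programs there is no reason to write `encode3`; the library call
 covers every code point and reports the width for you.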

 It may be obvious or it may be subtle, depending on your familiarity with
 UTF-8, but it's worth taking a moment to explain how the UTF-8 representation
 of the string was created.
 The simple fact is: it was created when the source code was written.

 Source code in Go is _defined_ to be UTF-8 text; no other representation is
 allowed. That implies that when, in the source code, we write the text

 	`⌘`

 the text editor used to create the program places the UTF-8 encoding
 of the symbol ⌘ into the source text.
 When we print out the hexadecimal bytes, we're just dumping the
 data the editor placed in the file.

 In short, Go source code is UTF-8, so
 _the_source_code_for_the_string_literal_is_UTF-8_text_.
 If that string literal contains no escape sequences, which a raw
 string cannot, the constructed string will hold exactly the
 source text between the quotes.
 Thus by definition and
 by construction the raw string will always contain a valid UTF-8
 representation of its contents.
 Similarly, unless it contains UTF-8-breaking escapes like those
 from the previous section, a regular string literal will also always
 contain valid UTF-8.

 Some people think Go strings are always UTF-8, but they
 are not: only string literals are UTF-8.
 As we showed in the previous section, string _values_ can contain arbitrary
 bytes;
 as we showed in this one, string _literals_ always contain UTF-8 text
 as long as they have no byte-level escapes.

 To summarize, strings can contain arbitrary bytes, but when constructed
 from string literals, those bytes are (almost always) UTF-8.
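
 One way to check the distinction is to ask `utf8.ValidString` about a few
 strings. This is only a sketch; `describe` is a hypothetical helper, and the
 escaped literal reuses the byte values from the previous section.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// describe reports whether s holds well-formed UTF-8.
func describe(s string) string {
	if utf8.ValidString(s) {
		return "valid UTF-8"
	}
	return "not valid UTF-8"
}

func main() {
	raw := `⌘`                  // raw literal: exactly the UTF-8 source text
	escaped := "\xbd\xb2=\xbc ⌘" // byte-level escapes inject arbitrary bytes
	built := string([]byte{0xff, 0xfe}) // a string value built at run time

	fmt.Println(describe(raw))     // valid UTF-8
	fmt.Println(describe(escaped)) // not valid UTF-8
	fmt.Println(describe(built))   // not valid UTF-8
}
```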

 * Code points, characters, and runes

 We've been very careful so far in how we use the words "byte" and "character".
 That's partly because strings hold bytes, and partly because the idea of "character"
 is a little hard to define.
 The Unicode standard uses the term "code point" to refer to the item represented
 by a single value.
 The code point U+2318, with hexadecimal value 2318, represents the symbol ⌘.
 (For lots more information about that code point, see
 [[http://unicode.org/cldr/utility/character.jsp?a=2318][its Unicode page]].)

 To pick a more prosaic example, the Unicode code point U+0061 is the lower
 case Latin letter 'A': a.

 But what about the lower case grave-accented letter 'A', à?
 That's a character, and it's also a code point (U+00E0), but it has other
 representations.
 For example we can use the "combining" grave accent code point, U+0300,
 and attach it to the lower case letter a, U+0061, to create the same character à.
 In general, a character may be represented by a number of different
 sequences of code points, and therefore different sequences of UTF-8 bytes.
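
 The two spellings of à can be compared directly. In this sketch the names
 `precomposed` and `combining`, and the helper `describe`, are made up for
 the illustration; the code points are the ones named above.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

const (
	precomposed = "\u00e0"  // the single code point U+00E0
	combining   = "a\u0300" // U+0061 followed by the combining grave U+0300
)

// describe reports the hex bytes of s and how many code points it holds.
func describe(s string) string {
	return fmt.Sprintf("% x (%d code points)", s, utf8.RuneCountInString(s))
}

func main() {
	fmt.Println(describe(precomposed))    // c3 a0 (1 code points)
	fmt.Println(describe(combining))      // 61 cc 80 (2 code points)
	fmt.Println(precomposed == combining) // false: the bytes differ
}
```

 String comparison in Go works on bytes, so the two strings are unequal even
 though they display as the same character.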

 The concept of character in computing is therefore ambiguous, or at least
 confusing, so we use it with care.
 To make things dependable, there are _normalization_ techniques that guarantee that
 a given character is always represented by the same code points, but that
 subject takes us too far off the topic for now.
 A later blog post will explain how the Go libraries address normalization.

 "Code point" is a bit of a mouthful, so Go introduces a shorter term for the
 concept: _rune_.
 The term appears in the libraries and source code, and means exactly
 the same as "code point", with one interesting addition.

 The Go language defines the word `rune` as an alias for the type `int32`, so
 programs can be clear when an integer value represents a code point.
 Moreover, what you might think of as a character constant is called a
 _rune_constant_ in Go.
 The type and value of the expression

 	'⌘'

 is `rune` with integer value `0x2318`.
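
 A short sketch can confirm this. The helper `codePoint` is hypothetical; it
 simply wraps the `%U` verb of `fmt`.

```go
package main

import "fmt"

// codePoint formats r in the conventional U+XXXX notation.
func codePoint(r rune) string {
	return fmt.Sprintf("%U", r)
}

func main() {
	r := '⌘' // rune constant; r gets the default type rune

	fmt.Printf("%T\n", r)       // int32, because rune is an alias for int32
	fmt.Println(r == 0x2318)    // true
	fmt.Println(codePoint(r))   // U+2318
	fmt.Println(len(string(r))) // 3: the conversion encodes the rune as UTF-8
}
```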

 To summarize, here are the salient points:

 - Go source code is always UTF-8.
 - A string holds arbitrary bytes.
 - A string literal, absent byte-level escapes, always holds valid UTF-8 sequences.
 - Those sequences represent Unicode code points, called runes.
 - No guarantee is made in Go that characters in strings are normalized.

 * Range loops

 Besides the axiomatic detail that Go source code is UTF-8,
 there's really only one way that Go treats UTF-8 specially, and that is when using
 a `for` `range` loop on a string.

 We've seen what happens with a regular `for` loop.
 A `for` `range` loop, by contrast, decodes one UTF-8-encoded rune on each
 iteration.
 Each time around the loop, the index of the loop is the starting position of the
 current rune, measured in bytes, and the code point is its value.
 Here's an example using yet another handy `Printf` format, `%#U`, which shows
 the code point's Unicode value and its printed representation:

 .play -edit strings/range.go /const/,/}/

 The output shows how each code point occupies multiple bytes:

 	U+65E5 '日' starts at byte position 0
 	U+672C '本' starts at byte position 3
 	U+8A9E '語' starts at byte position 6
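
 This also bears on the question from the introduction. The sketch below
 contrasts byte indexing with one possible way to reach the _nth_ code point,
 a conversion to `[]rune`. The helpers `runeStarts` and `nthRune` are
 hypothetical, and the string is assumed from the output above.

```go
package main

import "fmt"

// runeStarts returns the byte offset at which each rune in s begins,
// as reported by a for range loop.
func runeStarts(s string) []int {
	var starts []int
	for i := range s {
		starts = append(starts, i)
	}
	return starts
}

// nthRune returns the rune at position n counted in code points, not bytes.
// It decodes the whole string, so its cost grows with len(s).
func nthRune(s string, n int) rune {
	return []rune(s)[n]
}

func main() {
	const nihongo = "日本語"

	fmt.Println(runeStarts(nihongo))        // [0 3 6]
	fmt.Printf("%#x\n", nihongo[1])         // 0x97: a byte from the middle of 日
	fmt.Printf("%c\n", nthRune(nihongo, 1)) // 本
}
```

 The conversion allocates a new slice, so for a single pass over the text a
 `for` `range` loop is usually the better fit.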

 [Exercise: Put an invalid UTF-8 byte sequence into the string. (How?)
 What happens to the iterations of the loop?]

 * Libraries

 Go's standard library provides strong support for interpreting UTF-8 text.
 If a `for` `range` loop isn't sufficient for your purposes,
 chances are the facility you need is provided by a package in the library.

 The most important such package is
 [[https://golang.org/pkg/unicode/utf8/][`unicode/utf8`]],
 which contains
 helper routines to validate, disassemble, and reassemble UTF-8 strings.
 Here is a program equivalent to the `for` `range` example above,
 but using the `DecodeRuneInString` function from that package to
 do the work.
 The return values from the function are the rune and its width in
 UTF-8-encoded bytes.

 .play -edit strings/encoding.go /const/,/}/

 Run it to see that it performs the same.
 The `for` `range` loop and `DecodeRuneInString` are defined to produce
 exactly the same iteration sequence.
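
 The program itself is in `strings/encoding.go`, which is not shown here.
 As a sketch of what such a loop can look like, the hypothetical helper
 `decodeAll` below advances through the string by the width that
 `DecodeRuneInString` returns, and `rangeAll` does the same job with a
 `for` `range` loop so the two can be compared.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// decodeAll walks s with utf8.DecodeRuneInString, advancing by the
// width of each decoded rune.
func decodeAll(s string) []string {
	var out []string
	for i, w := 0, 0; i < len(s); i += w {
		r, width := utf8.DecodeRuneInString(s[i:])
		out = append(out, fmt.Sprintf("%#U starts at byte position %d", r, i))
		w = width
	}
	return out
}

// rangeAll produces the same lines using a for range loop.
func rangeAll(s string) []string {
	var out []string
	for i, r := range s {
		out = append(out, fmt.Sprintf("%#U starts at byte position %d", r, i))
	}
	return out
}

func main() {
	const nihongo = "日本語" // assumed from the output shown earlier
	for _, line := range decodeAll(nihongo) {
		fmt.Println(line)
	}
}
```

 For invalid input `DecodeRuneInString` returns `utf8.RuneError` with a width
 of 1, so the loop always makes progress.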

 Look at the
 [[https://golang.org/pkg/unicode/utf8/][documentation]]
 for the `unicode/utf8` package to see what
 other facilities it provides.

 * Conclusion

 To answer the question posed at the beginning: Strings are built from bytes
 so indexing them yields bytes, not characters.
 A string might not even hold characters.
 In fact, the definition of "character" is ambiguous and it would
 be a mistake to try to resolve the ambiguity by defining that strings are made
 of characters.

 There's much more to say about Unicode, UTF-8, and the world of multilingual
 text processing, but it can wait for another post.
 For now, we hope you have a better understanding of how Go strings behave
 and that, although they may contain arbitrary bytes, UTF-8 is a central part
 of their design.