Strings are **not** required to be UTF-8. Go source code **is** required
to be UTF-8. There is a complex path between the two.

In short, there are three kinds of strings. They are:

1. the substring of the source that lexes into a string literal.
1. a string literal.
1. a value of type string.

Only the first is required to be UTF-8. The second is required to be
written in UTF-8, but its contents are interpreted in various ways
(escape sequences such as `\xFF`, for example) and may encode arbitrary
bytes. The third can contain any bytes at all.

Try this on:

```
var s string = "\xFF語"
```
Source substring: ` "\xFF語" `, UTF-8 encoded. The data:

```
22
5c
78
46
46
e8
aa
9e
22
```

String literal: ` \xFF語 ` (between the quotes). The data:

```
5c
78
46
46
e8
aa
9e
```
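
The bytes of the literal text can be checked by writing the same characters as a raw string literal (backquotes), which performs no escape processing, and comparing it with the interpreted form. This is a minimal sketch using only the standard library; `hexBytes` is a hypothetical helper name, not something from the original page.

```go
package main

import "fmt"

// hexBytes (hypothetical helper) formats the bytes of s as
// space-separated lowercase hex.
func hexBytes(s string) string {
	return fmt.Sprintf("% x", s)
}

func main() {
	// Raw literal: the backslash, 'x', 'F', 'F' are kept as written.
	fmt.Println(hexBytes(`\xFF語`)) // 5c 78 46 46 e8 aa 9e

	// Interpreted literal: \xFF is processed into the single byte 0xff.
	fmt.Println(hexBytes("\xFF語")) // ff e8 aa 9e
}
```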

The string value (unprintable here: this page is a UTF-8 stream, and the value is not valid UTF-8). The data:

```
ff
e8
aa
9e
```
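
That a value of type string may hold bytes that are not valid UTF-8 can be confirmed with `utf8.ValidString` from the standard `unicode/utf8` package. The sketch below is illustrative only; `describe` is a hypothetical helper introduced for this example.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// describe (hypothetical helper) reports the byte length of s and
// whether those bytes form valid UTF-8.
func describe(s string) (n int, valid bool) {
	return len(s), utf8.ValidString(s)
}

func main() {
	s := "\xFF語"
	n, valid := describe(s)
	fmt.Println(n, valid) // 4 false

	// The same value built directly from bytes, with no literal involved.
	b := string([]byte{0xff, 0xe8, 0xaa, 0x9e})
	fmt.Println(s == b) // true
}
```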

And for the record, the characters (code points):
```
<erroneous byte FF, will appear as U+FFFD if you range over the string value>
語 U+8A9E
```
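
Ranging over the string value shows the substitution described above: the invalid byte decodes as U+FFFD with a width of one byte, and the following three bytes decode as U+8A9E. This is a small sketch; `decodeRunes` is a hypothetical helper, not part of any library.

```go
package main

import "fmt"

// decodeRunes (hypothetical helper) collects the code points
// produced by ranging over s.
func decodeRunes(s string) []rune {
	var out []rune
	for _, r := range s {
		out = append(out, r)
	}
	return out
}

func main() {
	s := "\xFF語"

	// i is the byte offset at which each code point starts.
	for i, r := range s {
		fmt.Printf("byte offset %d: %U\n", i, r)
	}
	// byte offset 0: U+FFFD
	// byte offset 1: U+8A9E

	fmt.Println(len(s), len(decodeRunes(s))) // 4 2
}
```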