
 <!--{
 "Title": "Gobs of data",
 "Template": true
 }-->

 <p>
 To transmit a data structure across a network or to store it in a file, it must
 be encoded and then decoded again. There are many encodings available, of
 course: <a href="http://www.json.org/">JSON</a>,
 <a href="http://www.w3.org/XML/">XML</a>, Google's
 <a href="http://code.google.com/p/protobuf">protocol buffers</a>, and more.
 And now there's another, provided by Go's <a href="/pkg/encoding/gob/">gob</a>
 package.
 </p>

 <p>
 Why define a new encoding? It's a lot of work and redundant at that. Why not
 just use one of the existing formats? Well, for one thing, we do! Go has
 <a href="/pkg/">packages</a> supporting all the encodings just mentioned (the
 <a href="http://code.google.com/p/goprotobuf">protocol buffer package</a> is in
 a separate repository but it's one of the most frequently downloaded). And for
 many purposes, including communicating with tools and systems written in other
 languages, they're the right choice.
 </p>

 <p>
 But for a Go-specific environment, such as communicating between two servers
 written in Go, there's an opportunity to build something much easier to use and
 possibly more efficient.
 </p>

 <p>
 Gobs work with the language in a way that an externally-defined,
 language-independent encoding cannot. At the same time, there are lessons to be
 learned from the existing systems.
 </p>

 <p>
 <b>Goals</b>
 </p>

 <p>
 The gob package was designed with a number of goals in mind.
 </p>

 <p>
 First, and most obvious, it had to be very easy to use. Because Go has
 reflection, there is no need for a separate interface definition language or
 "protocol compiler". The data structure itself is all the package should need
 to figure out how to encode and decode it. On the other hand, this approach
 means that gobs will never work as well with other languages, but that's OK:
 gobs are unashamedly Go-centric.
 </p>

 <p>
 Efficiency is also important. Textual representations, exemplified by XML and
 JSON, are too slow to put at the center of an efficient communications network.
 A binary encoding is necessary.
 </p>

 <p>
 Gob streams must be self-describing. Each gob stream, read from the beginning,
 contains sufficient information that the entire stream can be parsed by an
 agent that knows nothing a priori about its contents. This property means that
 you will always be able to decode a gob stream stored in a file, even long
 after you've forgotten what data it represents.
 </p>

 <p>
 There were also some things to learn from our experiences with Google protocol
 buffers.
 </p>

 <p>
 <b>Protocol buffer misfeatures</b>
 </p>

 <p>
 Protocol buffers had a major effect on the design of gobs, but have three
 features that were deliberately avoided. (Leaving aside the property that
 protocol buffers aren't self-describing: if you don't know the data definition
 used to encode a protocol buffer, you might not be able to parse it.)
 </p>

 <p>
 First, protocol buffers only work on the data type we call a struct in Go. You
 can't encode an integer or array at the top level, only a struct with fields
 inside it. That seems a pointless restriction, at least in Go. If all you want
 to send is an array of integers, why should you have to put it into a
 struct first?
 </p>
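 <p>
 As a minimal sketch of the contrast, the following program sends a slice of
 integers through the gob package with no wrapper struct. The helper name
 <code>roundTripInts</code> is hypothetical, and a <code>bytes.Buffer</code>
 stands in for a network connection or file.
 </p>

```go
package main

import (
	"bytes"
	"encoding/gob"
	"fmt"
)

// roundTripInts (a hypothetical helper) encodes a slice of integers as a
// top-level gob value and decodes it again, with no struct around it.
func roundTripInts(in []int) ([]int, error) {
	var buf bytes.Buffer // stands in for a connection or file
	if err := gob.NewEncoder(&buf).Encode(in); err != nil {
		return nil, err
	}
	var out []int
	if err := gob.NewDecoder(&buf).Decode(&out); err != nil {
		return nil, err
	}
	return out, nil
}

func main() {
	out, err := roundTripInts([]int{1, 2, 3})
	if err != nil {
		fmt.Println("round trip failed:", err)
		return
	}
	fmt.Println(out)
}
```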

 <p>
 Next, a protocol buffer definition may specify that fields <code>T.x</code> and
 <code>T.y</code> are required to be present whenever a value of type
 <code>T</code> is encoded or decoded.  Although such required fields may seem
 like a good idea, they are costly to implement because the codec must maintain a
 separate data structure while encoding and decoding, to be able to report when
 required fields are missing.  They're also a maintenance problem. Over time, one
 may want to modify the data definition to remove a required field, but that may
 cause existing clients of the data to crash. It's better not to have them in the
 encoding at all.  (Protocol buffers also have optional fields. But if we don't
 have required fields, all fields are optional and that's that. There will be
 more to say about optional fields a little later.)
 </p>

 <p>
 The third protocol buffer misfeature is default values. If a protocol buffer
 omits the value for a "defaulted" field, then the decoded structure behaves as
 if the field were set to that value. This idea works nicely when you have
 getter and setter methods to control access to the field, but is harder to
 handle cleanly when the container is just a plain idiomatic struct. Default
 values are also tricky to implement: where does one define them, and what types
 do they have (is text UTF-8? uninterpreted bytes? how many bits in a float?)?
 Despite the apparent simplicity, there were a number of complications in their
 design and implementation for protocol buffers. We
 decided to leave them out of gobs and fall back to Go's trivial but effective
 defaulting rule: unless you set something otherwise, it has the "zero value"
 for that type - and it doesn't need to be transmitted.
 </p>

 <p>
 So gobs end up looking like a sort of generalized, simplified protocol buffer.
 How do they work?
 </p>

 <p>
 <b>Values</b>
 </p>

 <p>
 The encoded gob data isn't about <code>int8</code>s and <code>uint16</code>s.
 Instead, somewhat analogous to constants in Go, its integer values are abstract,
 sizeless numbers, either signed or unsigned. When you encode an
 <code>int8</code>, its value is transmitted as an unsized, variable-length
 integer. When you encode an <code>int64</code>, its value is also transmitted as
 an unsized, variable-length integer. (Signed and unsigned are treated
 distinctly, but the same unsized-ness applies to unsigned values too.) If both
 have the value 7, the bits sent on the wire will be identical. When the receiver
 decodes that value, it puts it into the receiver's variable, which may be of
 arbitrary integer type. Thus an encoder may send a 7 that came from an
 <code>int8</code>, but the receiver may store it in an <code>int64</code>. This
 is fine: the value is an integer and as long as it fits, everything works. (If
 it doesn't fit, an error results.) This decoupling from the size of the variable
 gives some flexibility to the encoding: we can expand the type of the integer
 variable as the software evolves, but still be able to decode old data.
 </p>
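 <p>
 The behavior described above can be checked with a short program. This is a
 sketch only; the helper names <code>wireBytes</code>, <code>sameOnWire</code>,
 <code>widen</code> and <code>narrow</code> are invented for illustration. It
 compares the bytes produced for an <code>int8</code> and an <code>int64</code>
 holding 7, widens a value on receipt, and shows that a value too large for the
 destination is reported as an error.
 </p>

```go
package main

import (
	"bytes"
	"encoding/gob"
	"fmt"
)

// wireBytes returns the gob encoding of one value on a fresh stream.
func wireBytes(v interface{}) ([]byte, error) {
	var buf bytes.Buffer
	if err := gob.NewEncoder(&buf).Encode(v); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// sameOnWire reports whether two values produce identical encoded bytes.
func sameOnWire(a, b interface{}) (bool, error) {
	x, err := wireBytes(a)
	if err != nil {
		return false, err
	}
	y, err := wireBytes(b)
	if err != nil {
		return false, err
	}
	return bytes.Equal(x, y), nil
}

// widen sends an int8 and receives it into an int64.
func widen(v int8) (int64, error) {
	data, err := wireBytes(v)
	if err != nil {
		return 0, err
	}
	var out int64
	err = gob.NewDecoder(bytes.NewReader(data)).Decode(&out)
	return out, err
}

// narrow sends an int64 and tries to receive it into an int8.
func narrow(v int64) (int8, error) {
	data, err := wireBytes(v)
	if err != nil {
		return 0, err
	}
	var out int8
	err = gob.NewDecoder(bytes.NewReader(data)).Decode(&out)
	return out, err
}

func main() {
	same, err := sameOnWire(int8(7), int64(7))
	fmt.Println("identical bytes:", same, err)

	w, err := widen(7)
	fmt.Println("int8 into int64:", w, err)

	_, err = narrow(300) // 300 does not fit in an int8
	fmt.Println("int64(300) into int8 fails:", err != nil)
}
```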

 <p>
 This flexibility also applies to pointers. Before transmission, all pointers are
 flattened. Values of type <code>int8</code>, <code>*int8</code>,
 <code>**int8</code>, <code>****int8</code>, etc. are all transmitted as an
 integer value, which may then be stored in <code>int</code> of any size, or
 <code>*int</code>, or <code>******int</code>, etc. Again, this allows for
 flexibility.
 </p>
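 <p>
 A small sketch of pointer flattening, using two hypothetical struct types
 whose field <code>N</code> differs in both pointer depth and integer size. The
 types and the helper <code>flatten</code> are assumptions made for this
 example.
 </p>

```go
package main

import (
	"bytes"
	"encoding/gob"
	"errors"
	"fmt"
)

// Sender and Receiver are hypothetical types; only the field name N matches.
type Sender struct {
	N **int8 // two levels of indirection on the sending side
}

type Receiver struct {
	N *int64 // one level, and a wider integer, on the receiving side
}

// flatten sends v through a **int8 field and reads it back via a *int64.
func flatten(v int8) (int64, error) {
	p := &v
	var buf bytes.Buffer
	if err := gob.NewEncoder(&buf).Encode(Sender{N: &p}); err != nil {
		return 0, err
	}
	var r Receiver
	if err := gob.NewDecoder(&buf).Decode(&r); err != nil {
		return 0, err
	}
	if r.N == nil {
		// A zero value is not transmitted, so N would be left nil.
		return 0, errors.New("field N was not set")
	}
	return *r.N, nil
}

func main() {
	got, err := flatten(7)
	fmt.Println(got, err)
}
```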

 <p>
 Flexibility also happens because, when decoding a struct, only those fields
 that are sent by the encoder are stored in the destination. Given the value
 </p>

 {{code "/doc/progs/gobs1.go" `/type T/` `/STOP/`}}

 <p>
 the encoding of <code>t</code> sends only the 7 and 8. Because it's zero, the
 value of <code>Y</code> isn't even sent; there's no need to send a zero value.
 </p>
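 <p>
 One rough way to observe the omission is to compare encoded sizes. The sketch
 below assumes a type <code>T</code> with three <code>int</code> fields, as the
 text suggests, and a hypothetical helper <code>encodedLen</code>. The stream
 for a value with <code>Y</code> set to zero should come out shorter than the
 stream for the same value with a nonzero <code>Y</code>.
 </p>

```go
package main

import (
	"bytes"
	"encoding/gob"
	"fmt"
)

// T is assumed to mirror the struct used in the text.
type T struct{ X, Y, Z int }

// encodedLen returns the size in bytes of a fresh gob stream holding v.
func encodedLen(v T) (int, error) {
	var buf bytes.Buffer
	if err := gob.NewEncoder(&buf).Encode(v); err != nil {
		return 0, err
	}
	return buf.Len(), nil
}

func main() {
	withZero, err := encodedLen(T{X: 7, Y: 0, Z: 8})
	if err != nil {
		fmt.Println(err)
		return
	}
	withoutZero, err := encodedLen(T{X: 7, Y: 1, Z: 8})
	if err != nil {
		fmt.Println(err)
		return
	}
	// Both streams carry the same type description; only the values differ.
	fmt.Println("Y == 0:", withZero, "bytes; Y == 1:", withoutZero, "bytes")
}
```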

 <p>
 The receiver could instead decode the value into this structure:
 </p>

 {{code "/doc/progs/gobs1.go" `/type U/` `/STOP/`}}

 <p>
 and acquire a value of <code>u</code> with only <code>X</code> set (to the
 address of an <code>int8</code> variable set to 7); the <code>Z</code> field is
 ignored - where would you put it? When decoding structs, fields are matched by
 name and compatible type, and only fields that exist in both are affected. This
 simple approach finesses the "optional field" problem: as the type
 <code>T</code> evolves by adding fields, out-of-date receivers will still
 function with the part of the type they recognize. Thus gobs provide the
 important result of optional fields - extensibility - without any additional
 mechanism or notation.
 </p>
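 <p>
 The extensibility argument can be sketched with two versions of a type. The
 names <code>PointV1</code>, <code>PointV2</code> and <code>receiveOld</code>
 are hypothetical. A newer sender has gained a <code>Label</code> field; an
 older receiver still decodes the fields it knows, and its own
 <code>Note</code> field, which the sender never had, is left untouched.
 </p>

```go
package main

import (
	"bytes"
	"encoding/gob"
	"fmt"
)

// PointV2 is the newer, sender-side type (hypothetical).
type PointV2 struct {
	X, Y  int
	Label string // added after PointV1 was deployed
}

// PointV1 is the older, receiver-side type (hypothetical).
type PointV1 struct {
	X, Y int
	Note string // exists only on the receiver
}

// receiveOld encodes a PointV2 and decodes it into an existing PointV1.
func receiveOld(sent PointV2, dst *PointV1) error {
	var buf bytes.Buffer
	if err := gob.NewEncoder(&buf).Encode(sent); err != nil {
		return err
	}
	return gob.NewDecoder(&buf).Decode(dst)
}

func main() {
	dst := PointV1{Note: "local"}
	err := receiveOld(PointV2{X: 3, Y: 4, Label: "origin"}, &dst)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	// X and Y are matched by name; Label is ignored; Note is unchanged.
	fmt.Printf("%+v\n", dst)
}
```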

 <p>
 From integers we can build all the other types: bytes, strings, arrays, slices,
 maps, even floats. Floating-point values are represented by their IEEE 754
 floating-point bit pattern, stored as an integer, which works fine as long as
 you know their type, which we always do. By the way, that integer is sent in
 byte-reversed order because common values of floating-point numbers, such as
 small integers, have a lot of zeros at the low end that we can avoid
 transmitting.
 </p>
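 <p>
 To make the byte reversal concrete, here is a sketch that follows the wire
 format as described in the gob package documentation. It is not the package's
 own code, and the function names <code>encodeUint</code> and
 <code>encodeFloat</code> are invented here. An unsigned integer below 128 is
 one byte; anything larger is a negated byte count followed by the minimal
 big-endian bytes. A float is its IEEE 754 bits, byte-reversed, sent as such an
 unsigned integer.
 </p>

```go
package main

import (
	"fmt"
	"math"
	"math/bits"
)

// encodeUint sketches the unsigned integer format: small values take a
// single byte, larger ones a negated length byte plus big-endian bytes.
func encodeUint(u uint64) []byte {
	if u < 128 {
		return []byte{byte(u)}
	}
	n := (bits.Len64(u) + 7) / 8 // number of significant bytes
	out := []byte{byte(-n)}      // negated count, e.g. 2 becomes 0xFE
	for i := n - 1; i >= 0; i-- {
		out = append(out, byte(u>>(8*uint(i))))
	}
	return out
}

// encodeFloat reverses the bytes of the IEEE 754 bit pattern so that the
// trailing zeros of common values become leading zeros, which are dropped.
func encodeFloat(f float64) []byte {
	return encodeUint(bits.ReverseBytes64(math.Float64bits(f)))
}

func main() {
	fmt.Printf("% x\n", encodeUint(7))     // 07
	fmt.Printf("% x\n", encodeUint(256))   // fe 01 00
	fmt.Printf("% x\n", encodeFloat(17.0)) // fe 31 40
}
```

 <p>
 Without the reversal, 17.0 (bit pattern <code>0x4031000000000000</code>) would
 need all eight bytes plus the length byte; reversed, it needs three bytes in
 total.
 </p>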

 <p>
 One nice feature of gobs that Go makes possible is that they allow you to define
 your own encoding by having your type satisfy the
 <a href="/pkg/encoding/gob/#GobEncoder">GobEncoder</a> and
 <a href="/pkg/encoding/gob/#GobDecoder">GobDecoder</a> interfaces, in a manner
 analogous to the <a href="/pkg/encoding/json/">JSON</a> package's
 <a href="/pkg/encoding/json/#Marshaler">Marshaler</a> and
 <a href="/pkg/encoding/json/#Unmarshaler">Unmarshaler</a> and also to the
 <a href="/pkg/fmt/#Stringer">Stringer</a> interface from
 <a href="/pkg/fmt/">package fmt</a>. This facility makes it possible to
 represent special features, enforce constraints, or hide secrets when you
 transmit data. See the <a href="/pkg/encoding/gob/">documentation</a> for
 details.
 </p>
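 <p>
 As an illustration of the "hide secrets" case, the sketch below defines a
 hypothetical <code>Account</code> type whose <code>GobEncode</code> method
 sends only the name, so the exported <code>Password</code> field never reaches
 the wire. The type, its fields and the byte layout chosen here are assumptions
 for this example.
 </p>

```go
package main

import (
	"bytes"
	"encoding/gob"
	"fmt"
)

// Account is a hypothetical type with a field that must not be transmitted.
type Account struct {
	Name     string
	Password string // exported, but deliberately left out by GobEncode
}

// GobEncode implements gob.GobEncoder; only the name is encoded.
func (a Account) GobEncode() ([]byte, error) {
	return []byte(a.Name), nil
}

// GobDecode implements gob.GobDecoder; it restores the name and nothing else.
func (a *Account) GobDecode(data []byte) error {
	a.Name = string(data)
	return nil
}

// roundTrip encodes an Account and decodes it into a fresh value.
func roundTrip(in Account) (Account, error) {
	var buf bytes.Buffer
	var out Account
	if err := gob.NewEncoder(&buf).Encode(in); err != nil {
		return out, err
	}
	err := gob.NewDecoder(&buf).Decode(&out)
	return out, err
}

func main() {
	out, err := roundTrip(Account{Name: "gopher", Password: "hunter2"})
	if err != nil {
		fmt.Println("round trip failed:", err)
		return
	}
	fmt.Printf("name=%q password=%q\n", out.Name, out.Password)
}
```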

 <p>
 <b>Types on the wire</b>
 </p>

 <p>
 The first time you send a given type, the gob package includes in the data
 stream a description of that type. In fact, what happens is that the encoder is
 used to encode, in the standard gob encoding format, an internal struct that
 describes the type and gives it a unique number. (Basic types, plus the layout
 of the type description structure, are predefined by the software for
 bootstrapping.) After the type is described, it can be referenced by its type
 number.
 </p>

 <p>
 Thus when we send our first type <code>T</code>, the gob encoder sends a
 description of <code>T</code> and tags it with a type number, say 127. All
 values, including the first, are then prefixed by that number, so a stream of
 <code>T</code> values looks like:
 </p>

 <pre>
 ("define type id" 127, definition of type T)(127, T value)(127, T value), ...
 </pre>
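 <p>
 The effect of sending the definition only once can be observed by measuring
 how many bytes each <code>Encode</code> call adds to the stream. This is a
 sketch with a hypothetical helper, <code>messageSizes</code>, and a type
 <code>T</code> assumed to have three <code>int</code> fields; the first value
 carries the type description and so should cost more than the second.
 </p>

```go
package main

import (
	"bytes"
	"encoding/gob"
	"fmt"
)

// T is assumed to mirror the struct used earlier in the text.
type T struct{ X, Y, Z int }

// messageSizes encodes the same value twice on one encoder and reports how
// many bytes each call appended to the stream.
func messageSizes() (first, second int, err error) {
	var buf bytes.Buffer
	enc := gob.NewEncoder(&buf)
	v := T{X: 7, Y: 0, Z: 8}

	if err = enc.Encode(v); err != nil {
		return 0, 0, err
	}
	first = buf.Len() // type definition plus the first value

	if err = enc.Encode(v); err != nil {
		return 0, 0, err
	}
	second = buf.Len() - first // the second value only
	return first, second, nil
}

func main() {
	first, second, err := messageSizes()
	if err != nil {
		fmt.Println("encode failed:", err)
		return
	}
	fmt.Println("first send:", first, "bytes; second send:", second, "bytes")
}
```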

 <p>
 These type numbers make it possible to describe recursive types and send values
 of those types. Thus gobs can encode types such as trees:
 </p>

 {{code "/doc/progs/gobs1.go" `/type Node/` `/STOP/`}}

 <p>
 (It's an exercise for the reader to discover how the zero-defaulting rule makes
 this work, even though gobs don't represent pointers.)
 </p>
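 <p>
 A short sketch of sending such a tree, assuming a <code>Node</code> with an
 integer value and two child pointers. The field names and the helpers
 <code>sum</code> and <code>roundTrip</code> are assumptions for this example.
 The tree must not contain cycles.
 </p>

```go
package main

import (
	"bytes"
	"encoding/gob"
	"fmt"
)

// Node is assumed to resemble the recursive type referred to in the text.
type Node struct {
	Value       int
	Left, Right *Node
}

// sum adds up every Value in the tree rooted at n.
func sum(n *Node) int {
	if n == nil {
		return 0
	}
	return n.Value + sum(n.Left) + sum(n.Right)
}

// roundTrip encodes the tree and decodes it into a newly allocated copy.
func roundTrip(root *Node) (*Node, error) {
	var buf bytes.Buffer
	if err := gob.NewEncoder(&buf).Encode(root); err != nil {
		return nil, err
	}
	var out Node
	if err := gob.NewDecoder(&buf).Decode(&out); err != nil {
		return nil, err
	}
	return &out, nil
}

func main() {
	tree := &Node{
		Value: 2,
		Left:  &Node{Value: 1},
		Right: &Node{Value: 3},
	}
	copyOfTree, err := roundTrip(tree)
	if err != nil {
		fmt.Println("round trip failed:", err)
		return
	}
	fmt.Println(sum(tree), sum(copyOfTree))
}
```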

 <p>
 With the type information, a gob stream is fully self-describing except for the
 set of bootstrap types, which is a well-defined starting point.
 </p>

 <p>
 <b>Compiling a machine</b>
 </p>

 <p>
 The first time you encode a value of a given type, the gob package builds a
 little interpreted machine specific to that data type. It uses reflection on
 the type to construct that machine, but once the machine is built it does not
 depend on reflection. The machine uses package unsafe and some trickery to
 convert the data into the encoded bytes at high speed. It could use reflection
 and avoid unsafe, but would be significantly slower. (A similar high-speed
 approach is taken by the protocol buffer support for Go, whose design was
 influenced by the implementation of gobs.) Subsequent values of the same type
 use the already-compiled machine, so they can be encoded right away.
 </p>

 <p>
 Decoding is similar but harder. When you decode a value, the gob package holds
 a byte slice representing a value of a given encoder-defined type to decode,
 plus a Go value into which to decode it. The gob package builds a machine for
 that pair: the gob type sent on the wire crossed with the Go type provided for
 decoding. Once that decoding machine is built, though, it's again a
 reflectionless engine that uses unsafe methods to get maximum speed.
 </p>

 <p>
 <b>Use</b>
 </p>

 <p>
 There's a lot going on under the hood, but the result is an efficient,
 easy-to-use encoding system for transmitting data. Here's a complete example
 showing differing encoded and decoded types. Note how easy it is to send and
 receive values; all you need to do is present values and variables to the
 <a href="/pkg/encoding/gob/">gob package</a> and it does all the work.
 </p>

 {{code "/doc/progs/gobs2.go" `/package main/` `$`}}

 <p>
 You can compile and run this example code in the
 <a href="http://play.golang.org/p/_-OJV-rwMq">Go Playground</a>.
 </p>

 <p>
 The <a href="/pkg/net/rpc/">rpc package</a> builds on gobs to turn this
 encode/decode automation into transport for method calls across the network.
 That's a subject for another article.
 </p>

 <p>
 <b>Details</b>
 </p>

 <p>
 The <a href="/pkg/encoding/gob/">gob package documentation</a>, especially the
 file <a href="/src/pkg/encoding/gob/doc.go">doc.go</a>, expands on many of the
 details described here and includes a full worked example showing how the
 encoding represents data. If you are interested in the innards of the gob
 implementation, that's a good place to start.
 </p>