# Proposal: Zip-based Go package archives
Author: Russ Cox
Last updated: February 2016
Discussion at https://golang.org/issue/14386.
## Abstract
Go package archives (the `*.a` files manipulated by `go tool pack`) use the old Unix ar archive format.
I propose to change both Go package archives and Go object files to use the more standard zip archive format.
In contrast to ar archives, zip archives admit efficient random access to individual files within the archive
and also allow decisions about compression on a per-file basis.
The result for Go will be cleaner access to the parts of a package archive
and the ability later to add compression of individual parts
as appropriate.
## Background
### Go object files and package archives
The Go toolchain stores compiled packages in archives
written in the Unix ar format used by traditional C toolchains.
Before continuing, two notes on terminology:
- An archive (a `*.a` file), such as an ar or zip file, is a file that contains other files.
To avoid confusion, this design document uses the term _archive_
for the archive file itself and reserves the term _file_ exclusively
for other kinds of files, including the files inside the archive.
- An _object file_ (a `*.o` file) holds machine code corresponding to a source file;
the linker merges multiple object files into a final executable.
Examples of object files include the ELF, Mach-O, and PE
object files used by Linux, OS X, and Windows systems, respectively.
We refer to these as _system object files_.
Go uses its own object file format, which we refer to as _Go object files_;
that format is unchanged by this proposal.
In a traditional C toolchain, an archive contains a file
named `__.SYMDEF` and then one or more object files (`.o` files)
containing compiled code; each object file corresponds to a different C or assembly source file.
The `__.SYMDEF` file is a symbol index: a mapping from symbol name (such as `printf`)
to the specific object file containing that symbol (such as `print.o`).
A traditional C linker reads the symbol index to learn which of the
object files it needs to read from the archive; it can completely ignore
the others.
Go has diverged over time from the C toolchain way of using ar archives.
A Go package archive contains
package metadata in a file named `__.PKGDEF`,
one or more Go object files,
and zero or more system object files.
The Go object files are generated by the compiler
(one for all the Go source code in the package)
and by the assembler
(one for each assembly source file).
The system object files are generated by the system C compiler
(one for each `*.c` file in the package directory, plus a few for
C source files generated by cgo),
or (less commonly) are direct copies of `*.syso` files in the package source directory.
Because the Go linker does dead code elimination at a symbol level rather than
at the object file level, a traditional C symbol index is not useful
and not included in the Go package archive.
Long before Go 1, the Go compiler read a single Go source file
and wrote a single Go object file, much like a C compiler.
Each object file contained a fragment of package metadata
contributed by that file.
After running the compiler separately on each Go source file
in a package, the build system (`make`) invoked the archiver
(`6ar`, even on non-amd64 systems) to create an archive
containing all the object files.
As part of creating the archive, the archiver copied and merged
the metadata fragments from the many Go object files into
the single `__.PKGDEF` file.
This had the effect of storing the package metadata in the archive twice,
although the different copies ended up being read by different tools.
The copy in `__.PKGDEF` was read by later compilations importing
the package, and the fragmented copy spread across the Go object files
was read by the linker (which needed to read the object files anyway)
and used to detect version skew (a common problem due to the
use of per-directory makefiles).
By the time of Go 1, the Go compiler read all the Go source files for a package
together and wrote a single Go object file.
As before, that object file contained (now complete) package metadata,
and the archiver (now `go tool pack`) extracted that metadata into the
`__.PKGDEF` file.
The package still contained two copies of the package metadata.
Equally embarrassing, most package archives
(those for Go packages with no assembly or C)
contained only a single `*.o` file, making the archiving step
a mostly unnecessary, trivial copy of the data through the file system.
Go 1.3 added a new `-pack` option to the Go compiler, directing it to
write a Go package archive containing `__.PKGDEF` and a `_go_.o` _without_
package metadata.
The go command used this option to create the initial package archive.
If the package had no assembly or C sources, there was no need for
any more work on the archive.
If the package did have assembly or C sources, those additional objects
needed to be appended to the archive, which could be done without
copying or rewriting the existing data.
Adopting `-pack` eliminated the duplicate copy of the package metadata,
and it also removed from the linker the job of detecting version skew,
since the package metadata was no longer in the object files the linker read.
The package metadata itself contains multiple sections used by different programs:
a unique build ID, needed by the go command;
the package name, needed by the compiler during import, but also needed by the linker;
detailed information about exported API, needed by the compiler during import;
and directives related to cgo, needed by the linker.
The entire metadata format is textual, with sections separated by `$$` lines.
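
To make the cost of this inner framing concrete, here is a minimal sketch of splitting line-oriented text at `$$` lines. It assumes a separator line consists of exactly `$$`, which the real format may not guarantee, and the sample text is invented for illustration; it is not actual `__.PKGDEF` content. The function name `splitSections` is hypothetical.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// splitSections splits line-oriented text into sections at each line that
// consists solely of "$$". The separator lines themselves are dropped.
// Assumption: separators are exactly "$$"; the real format may differ.
func splitSections(text string) []string {
	var sections []string
	var cur strings.Builder
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		line := sc.Text()
		if line == "$$" {
			sections = append(sections, cur.String())
			cur.Reset()
			continue
		}
		cur.WriteString(line)
		cur.WriteByte('\n')
	}
	// Whatever follows the last separator forms the final section.
	sections = append(sections, cur.String())
	return sections
}

func main() {
	// Invented sample text, not real package metadata.
	sample := "first section\n$$\nsecond section\nmore\n$$\nthird section\n"
	for i, s := range splitSections(sample) {
		fmt.Printf("section %d: %q\n", i, s)
	}
}
```

Note that a reader interested only in the last section still has to scan every earlier line to find it, which is the access problem described next.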
Today, the situation is not much different from that of Go 1.3.
There are two main problems.
First, the individual package metadata sections are difficult to
access independently, because of the use of ad-hoc framing
inside the standard ar-format framing.
The inner framing is necessary in the current system in part
because metadata is still sometimes (when not using `-pack`)
stored in Go object files,
and those object files have no outer framing.
The line-oriented nature of the inner framing is a hurdle
for converting to a more compact binary format for the export data.
In a cleaner design, the different metadata sections would be stored
in different files in the Go package archive, eliminating the inner framing.
Cleaner separation would allow different tools to access only the
data they needed without needing to process unrelated data.
The go command, the compiler, and the linker all read `__.PKGDEF`,
but only the compiler needs all of it.
Distributed build systems can also benefit from splitting a package
archive into two halves, one used by the compiler to satisfy imports
and one used by the linker to generate the final executable.
The build system can then ship just the compiler-relevant
data to machines running the compiler and just the linker-relevant
data to machines running the linker.
In particular, the compiler does not need the Go object files,
and the linker does not need the Go export data;
both savings can be large.
Second, there is no simple way to enable compression for certain
files in the Go package archive.
It could be worthwhile to compress the Go export data
and Go object files, to save disk space as well as I/O time
(not just disk I/O but potentially also network I/O,
when using network file systems or distributed build systems).
### Archive file formats
The ar archive format is simplistic: it begins with a distinguishing 8-byte header (`!<arch>\n`)
and then contains a sequence of files.
Each file has its own 60-byte header giving the file name (up to 16 bytes),
modification time, user and group IDs, permission bits, and size.
That header is followed by size bytes of data.
If size is odd, the data is followed by a single padding byte
so that file headers are always 16-bit aligned within the archive.
There is no table of contents: to find the names of all files in the archive,
one must read the entire archive (perhaps seeking past file content).
There is no compression.
Additional file entries can simply be appended to the end of an existing archive.
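
The layout just described can be sketched in Go. The following is an illustrative reader, not the code used by `go tool pack`: it builds a tiny in-memory archive with invented contents and then lists it by walking each 60-byte header in turn, seeking past the data. The names `arEntry`, `appendFile`, `sampleArchive`, and `listAr` are hypothetical.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"io"
	"strconv"
	"strings"
)

const arMagic = "!<arch>\n"

// arEntry records the name and size taken from one 60-byte file header.
type arEntry struct {
	Name string
	Size int64
}

// appendFile writes one header plus data, padding odd sizes to an even offset.
func appendFile(buf *bytes.Buffer, name string, data []byte) {
	// Fields: name(16) mtime(12) uid(6) gid(6) mode(8) size(10) end(2).
	fmt.Fprintf(buf, "%-16s%-12d%-6d%-6d%-8o%-10d`\n", name, 0, 0, 0, 0644, len(data))
	buf.Write(data)
	if len(data)%2 == 1 {
		buf.WriteByte('\n')
	}
}

// sampleArchive builds a small in-memory archive with invented contents.
func sampleArchive() *bytes.Reader {
	var buf bytes.Buffer
	buf.WriteString(arMagic)
	appendFile(&buf, "__.PKGDEF", []byte("metadata"))
	appendFile(&buf, "_go_.o", []byte("odd"))
	return bytes.NewReader(buf.Bytes())
}

// listAr visits every header in order; there is no table of contents to consult.
func listAr(r io.ReadSeeker) ([]arEntry, error) {
	magic := make([]byte, len(arMagic))
	if _, err := io.ReadFull(r, magic); err != nil {
		return nil, err
	}
	if string(magic) != arMagic {
		return nil, errors.New("not an ar archive")
	}
	var entries []arEntry
	hdr := make([]byte, 60)
	for {
		_, err := io.ReadFull(r, hdr)
		if err == io.EOF {
			return entries, nil // clean end of archive
		}
		if err != nil {
			return nil, err
		}
		name := strings.TrimSpace(string(hdr[0:16]))
		size, err := strconv.ParseInt(strings.TrimSpace(string(hdr[48:58])), 10, 64)
		if err != nil {
			return nil, err
		}
		entries = append(entries, arEntry{Name: name, Size: size})
		// Skip the file data and the padding byte that follows odd sizes.
		if _, err := r.Seek(size+(size&1), io.SeekCurrent); err != nil {
			return nil, err
		}
	}
}

func main() {
	entries, err := listAr(sampleArchive())
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, e := range entries {
		fmt.Printf("%s (%d bytes)\n", e.Name, e.Size)
	}
}
```

The number of seeks grows with the number of files in the archive, which is why only the first file can be reached in a fixed number of I/O operations.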
The zip archive format is much more capable, but only a little more complex.
A zip archive consists of a sequence of files followed by a table of contents.
Each file is stored as a header giving metadata such as the file name and data encoding,
followed by encoded file data,
followed by a file trailer.
The two standard encodings are store (raw, uncompressed)
and deflate (compressed using the same algorithm as zlib and gzip).
The table of contents at the end of the zip archive is a contiguous list of file headers
including offsets to the actual file data, making it efficient to access a particular
file in the archive.
As mentioned above, the zip format supports but does not require compression.
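
As a rough illustration of per-file encoding choices and access through the table of contents, the sketch below uses the standard `archive/zip` package to write one stored file and one deflate-compressed file, then reads a single file back by name. The file names and contents are invented, and `buildZip` and `readFile` are hypothetical helpers, not part of the proposed implementation.

```go
package main

import (
	"archive/zip"
	"bytes"
	"fmt"
	"io"
	"strings"
)

// buildZip writes two files: one stored raw, one deflate-compressed.
func buildZip() ([]byte, error) {
	var buf bytes.Buffer
	zw := zip.NewWriter(&buf)
	files := []struct {
		name   string
		method uint16
		data   string
	}{
		{"small.txt", zip.Store, "stored as-is\n"},
		{"big.txt", zip.Deflate, strings.Repeat("compress me\n", 1000)},
	}
	for _, f := range files {
		// The encoding is chosen independently for each file.
		w, err := zw.CreateHeader(&zip.FileHeader{Name: f.name, Method: f.method})
		if err != nil {
			return nil, err
		}
		if _, err := io.WriteString(w, f.data); err != nil {
			return nil, err
		}
	}
	// Close writes the table of contents at the end of the archive.
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// readFile locates one file through the table of contents and returns its data.
func readFile(archive []byte, name string) (string, error) {
	zr, err := zip.NewReader(bytes.NewReader(archive), int64(len(archive)))
	if err != nil {
		return "", err
	}
	for _, f := range zr.File {
		if f.Name != name {
			continue
		}
		rc, err := f.Open()
		if err != nil {
			return "", err
		}
		data, err := io.ReadAll(rc)
		rc.Close()
		return string(data), err
	}
	return "", fmt.Errorf("%s: not found", name)
}

func main() {
	archive, err := buildZip()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	s, err := readFile(archive, "small.txt")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("archive is %d bytes; small.txt holds %q\n", len(archive), s)
}
```

Here `zip.NewReader` reads only the table of contents up front; file data is read when `Open` is called on a specific entry.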
Appending to a zip archive is simple, although not as trivial as appending to an ar archive.
The table of contents must be saved, then new file entries are written
starting where the table of contents used to be, and then a new, expanded
table of contents is written.
Importantly, the existing files are left in place during this process,
making it about as efficient as adding to an ar format archive.
## Proposal
To address the problems described above,
I propose to change Go package archives
to use the zip format instead of the current ar format,
at the same time separating the current `__.PKGDEF` metadata file
into multiple files according to which tools process the data.
To avoid the need to preserve the current custom framing in
Go object files, I propose to stop writing Go object files at all,
except inside Go package archives.
The toolchain would still generate `*.o` files at the times it does today,
but the bytes inside those files would be identical to those inside
a Go package archive.
Although the bytes stored in the `*.a` and `*.o` files would be changing,
there would be no visible changes in the rest of the toolchain.
In particular, the file names would stay the same,
as would the commands used to manipulate and inspect archives.
The only differences would be in the encoding used within the file.
A Go package archive would be a zip-format archive containing the following files:

    _go_/header
    _go_/export
    _go_/cgo
    _go_/*.obj
    _go_/*.sysobj
The `_go_/header` file is required, must be first in the archive, must be uncompressed,
and is of bounded size.
The header content is a sequence of at most four textual metadata lines. For example:

    go object darwin amd64 devel +8b5a9bd Tue Feb 2 22:46:19 2016 -0500 X:none
    build id "4fe8e8c8bc1ea2d7c03bd08cf3025e302ff33742"
    main
    safe
The `go object` line must be first and identifies the operating system, architecture,
and toolchain version (including enabled experiments) of the package archive.
This line is today the first line of `__.PKGDEF`, and its uses remain the same:
the compiler and linker both refuse to use package archives with an unexpected
`go object` line.
The remaining lines are optional, but whichever ones are present must appear
in the order given here.
The `build id` line specifies the build ID, an opaque hash used by the build system
(typically the go command) as a version identifier, to help detect when a package
must be rebuilt.
This line is today the second line of `__.PKGDEF`, when present.
The `main` line is present if the package archive is `package main`, making it a valid
top-level input for the linker. The command `go tool link x.a` will refuse to build
a binary from `x.a` if that package's header does not have a `main` line.
The `safe` line is present if the code was compiled with `-u`, indicating that it has
been checked for safety. When the linker is invoked with `-u`, it refuses to use
any unsafe package archives during the link.
This mode is experimental and carried forward from earlier versions of Go.
The `main` and `safe` lines are today derived from the first line of the export data,
which echoes the package statement from the Go source code, followed by the
word `safe` for safe packages.
The new header omits the name of non-main packages entirely in order
to ensure that the header size is bounded no matter how long a package name
appears in the package's source code.
More header lines may be added to the end of this list in the future,
always being careful to keep the overall header size bounded.
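
A parser for the header rules above might look like the following sketch. It is one possible reading of the text, not the proposed implementation: the `Header` type and `parseHeader` function are hypothetical, and the choice to keep unrecognized trailing lines (since more lines may be added later) while rejecting known lines that appear out of order is an assumption.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// Header holds the fields of a _go_/header file as described in the text.
type Header struct {
	GoObject string   // text following "go object "
	BuildID  string   // unquoted build ID, or "" if the line is absent
	Main     bool     // "main" line present
	Safe     bool     // "safe" line present
	Extra    []string // unrecognized trailing lines (assumed to be tolerated)
}

// parseHeader checks that "go object" comes first and that the optional
// lines appear in the order: build id, main, safe.
func parseHeader(text string) (*Header, error) {
	lines := strings.Split(strings.TrimSuffix(text, "\n"), "\n")
	if !strings.HasPrefix(lines[0], "go object ") {
		return nil, errors.New("missing go object line")
	}
	h := &Header{GoObject: strings.TrimPrefix(lines[0], "go object ")}
	rest := lines[1:]
	if len(rest) > 0 && strings.HasPrefix(rest[0], "build id ") {
		id, err := strconv.Unquote(strings.TrimPrefix(rest[0], "build id "))
		if err != nil {
			return nil, fmt.Errorf("bad build id line: %v", err)
		}
		h.BuildID = id
		rest = rest[1:]
	}
	if len(rest) > 0 && rest[0] == "main" {
		h.Main = true
		rest = rest[1:]
	}
	if len(rest) > 0 && rest[0] == "safe" {
		h.Safe = true
		rest = rest[1:]
	}
	// Any known line still remaining is out of order or duplicated.
	for _, l := range rest {
		if l == "main" || l == "safe" || strings.HasPrefix(l, "build id ") {
			return nil, fmt.Errorf("header line %q out of order", l)
		}
	}
	h.Extra = rest
	return h, nil
}

func main() {
	// The example header from the text above.
	text := "go object darwin amd64 devel +8b5a9bd Tue Feb 2 22:46:19 2016 -0500 X:none\n" +
		"build id \"4fe8e8c8bc1ea2d7c03bd08cf3025e302ff33742\"\n" +
		"main\n" +
		"safe\n"
	h, err := parseHeader(text)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("build id %s, main=%v, safe=%v\n", h.BuildID, h.Main, h.Safe)
}
```

Because the optional lines have a fixed order, the parser can consume them in a single pass without lookahead.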
The `_go_/export` file is usually required (details below), must be second in the archive,
and holds a description of the package's exported API
for use by a later compilation importing the package.
The format of the export data is not specified here,
but as mentioned above part of the motivation for this design
is to make it possible to use a binary export data format
and to apply compression to it.
The export data corresponds to the top of the `__.PKGDEF` file,
excluding the initial `go object` and `build id` lines and stopping at the first `$$` line.
The `_go_/cgo` file is optional and holds cgo-related directives for the linker.
The format of these directives is not specified here.
This data corresponds to the end of the Go object file metadata,
specifically the lines between the third `$$` line and the terminating `!` line.
Each of the `_go_/*.obj` files is a traditional Go object file,
holding machine code, data, and relocations processed by the linker.
Each of the `_go_/*.sysobj` files is a system object file,
either generated during the build by the system C compiler
or copied verbatim from a `*.syso` file in the package source directory
(see the [go command documentation](https://golang.org/cmd/go/#hdr-File_types)
for more about `*.syso` files).
It is valid today and remains valid in this proposal for
multiple files within an archive to have the same name.
This simplifies the generation and combination of package files.
## Rationale
### Zip format
As discussed in the background section, the most fundamental problem
with the current archive format as used by Go is that all package metadata is
combined into the single `__.PKGDEF` file.
This is done for many reasons, all addressed by the use of zip files.
One reason for the single `__.PKGDEF` file is that there is
no efficient random access to files inside ar archives.
The first file in the archive is the only one that can be accessed
with a fixed number of disk I/O operations, and so it is often
given a distinguished role.
The zip format has a contiguous table of contents,
making it possible to access any file in the archive in a
fixed number of disk I/O operations.
This reduces the pressure to keep all important data in the first file.
It is still possible, however, to read a zip file from the beginning of the file,
without first consulting the table of contents.
The requirements that `_go_/header` be first, be uncompressed,
and be bounded in size exist precisely to make it possible to read
the package archive header by reading nothing but a prefix of the file
(say, the first kilobyte).
The requirement that `_go_/export` be second also makes it possible
for a compiler to read the header and export data without using
any disk I/O to read the table of contents.
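
The sketch below illustrates reading the header from a prefix alone. It decodes the first zip local file header by hand and never touches the table of contents. Several things here are assumptions: the helper names are hypothetical, the sample header text is invented, and the writer uses `CreateRaw` from `archive/zip` (available since Go 1.17, so not something the proposal itself could rely on) so that the sizes land in the local header rather than in a trailing data descriptor.

```go
package main

import (
	"archive/zip"
	"bytes"
	"encoding/binary"
	"errors"
	"fmt"
	"hash/crc32"
)

// buildArchive writes a zip whose first entry is an uncompressed _go_/header.
// CreateRaw records the CRC and sizes directly in the local file header.
func buildArchive(header string) ([]byte, error) {
	var buf bytes.Buffer
	zw := zip.NewWriter(&buf)
	data := []byte(header)
	w, err := zw.CreateRaw(&zip.FileHeader{
		Name:               "_go_/header",
		Method:             zip.Store,
		CRC32:              crc32.ChecksumIEEE(data),
		CompressedSize64:   uint64(len(data)),
		UncompressedSize64: uint64(len(data)),
	})
	if err != nil {
		return nil, err
	}
	if _, err := w.Write(data); err != nil {
		return nil, err
	}
	// A second, compressed entry stands in for the export data.
	ew, err := zw.Create("_go_/export")
	if err != nil {
		return nil, err
	}
	if _, err := ew.Write([]byte("export data placeholder")); err != nil {
		return nil, err
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// headerFromPrefix extracts the first file from a prefix of the archive.
func headerFromPrefix(prefix []byte) (name, data string, err error) {
	const fixed = 30 // size of the fixed part of a zip local file header
	if len(prefix) < fixed || binary.LittleEndian.Uint32(prefix) != 0x04034b50 {
		return "", "", errors.New("no local file header at start")
	}
	flags := binary.LittleEndian.Uint16(prefix[6:])
	method := binary.LittleEndian.Uint16(prefix[8:])
	size := int(binary.LittleEndian.Uint32(prefix[22:])) // uncompressed size
	nameLen := int(binary.LittleEndian.Uint16(prefix[26:]))
	extraLen := int(binary.LittleEndian.Uint16(prefix[28:]))
	if method != zip.Store || flags&0x8 != 0 {
		return "", "", errors.New("first file is compressed or has deferred sizes")
	}
	start := fixed + nameLen + extraLen
	if start+size > len(prefix) {
		return "", "", errors.New("prefix too short")
	}
	return string(prefix[fixed : fixed+nameLen]), string(prefix[start : start+size]), nil
}

func main() {
	// Invented header text for illustration.
	archive, err := buildArchive("go object linux amd64 go1.7\nmain\n")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	prefix := archive
	if len(prefix) > 1024 {
		prefix = prefix[:1024] // "say, the first kilobyte"
	}
	name, data, err := headerFromPrefix(prefix)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%s:\n%s", name, data)
}
```

The checks on the method and flag bits mirror the requirements in the text: if the header were compressed, or its size deferred to a data descriptor, a fixed-size prefix read would not be enough.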
As mentioned above, another reason for the single `__.PKGDEF` file
is that the metadata is stored not just in Go package archives but also
in Go object files, as written by `go tool compile` (without `-pack`) or `go tool asm`,
and those object files have no archive framing available.
Changing `*.o` files to reuse the Go package archive format
eliminates the need for a separate framing solution for metadata in `*.o` files.
Zip also makes it possible to make different compression decisions for
different files within an archive. This is important primarily because
we would like the option of compressing the export data and Go object files
but likely cannot compress the system object files, because reading them
requires having random access to the data.
It is also useful to be able to arrange that the header can be read
without the overhead of decompression.
We could take the current archive format and add a table of contents
and support for per-file compression methods, but why reinvent the wheel?
Zip is a standard format, and the Go standard library already supports it well.
The only other standard archive format in the Go standard library is the Unix tar format.
Tar is a marvel: it adds significant complexity to the ar format
without addressing any of the architectural problems that make ar unsuitable for our purposes.
In some circles, zip has a bad reputation.
I speculate that this is due to zip's historically strong association with MS-DOS
and historically weak support on Unix systems.
Reputation aside, the zip format is clearly documented,
is well designed, and avoids the architectural problems of ar and tar.
It is perhaps worth noting that Java .jar files also use the zip format internally,
and that seems to be working well for them.
### File names
The names of files within the package archives all begin with `_go_/`.
This is done so that Go packages are easier to distinguish from other zip files
and also so that an accidental `unzip x.a` is easier to clean up.
Distinguishing Go object files from system object files by name is new in this proposal.
Today, tools assume that `__.PKGDEF` is the only non-object file in the
package archive, and each file must be inspected to find out what kind
of object file it is (Go object or system object).
The suffixes make it possible to know both that a particular file
is an object file and what kind it is, without reading the file data.
The suffixes also isolate tools from each other, making it easier
to extend the archive with new data in new files.
For example, if some other part of the toolchain needs to add a
new file to the archive, the linker will automatically ignore it
(assuming the file name does not end in `.obj` or `.sysobj`).
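
A minimal sketch of suffix-based classification follows. The function name `classify` and the returned labels are hypothetical; only the `_go_/` prefix and the `.obj` and `.sysobj` suffixes come from the proposal.

```go
package main

import (
	"fmt"
	"strings"
)

// classify reports what kind of file a name denotes, using only the name.
// Files with unrecognized names are reported as "other" so that a tool
// such as the linker can skip them.
func classify(name string) string {
	if !strings.HasPrefix(name, "_go_/") {
		return "other"
	}
	switch {
	case strings.HasSuffix(name, ".sysobj"):
		return "system object"
	case strings.HasSuffix(name, ".obj"):
		return "go object"
	}
	return "other"
}

func main() {
	// Invented file names for illustration.
	for _, name := range []string{
		"_go_/header",
		"_go_/main.obj",
		"_go_/cgo_export.sysobj",
		"_go_/future-data",
	} {
		fmt.Printf("%-24s %s\n", name, classify(name))
	}
}
```

No file data is read at any point, which is the property the suffixes are meant to provide.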
### Compression
Go export data and Go object files can both be quite large.
I ran an experiment on a large program at Google, built with Go 1.5.
I gathered all the package archives linked into that program
corresponding to Go source code generated from protocol buffer definitions
(which can be quite large), and I ran the standard `gzip -1`
(fastest, least compression) on those files.
That resulted in a 7.5x space savings for the packages.
Clearly there are significant space improvements available
with only modest attempts at compression.
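
For readers who want to try a similar measurement, here is a small sketch using `compress/gzip` at `gzip.BestSpeed` (level 1, the counterpart of `gzip -1`). The input is synthetic and highly repetitive, so the ratio it prints says nothing about real package archives; substitute the bytes of an actual file to get a meaningful number. The helper names are hypothetical.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
	"strings"
)

// gzipFast compresses data at the fastest standard level.
func gzipFast(data []byte) ([]byte, error) {
	var buf bytes.Buffer
	zw, err := gzip.NewWriterLevel(&buf, gzip.BestSpeed)
	if err != nil {
		return nil, err
	}
	if _, err := zw.Write(data); err != nil {
		return nil, err
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// gunzip reverses gzipFast, to confirm that the round trip is lossless.
func gunzip(data []byte) ([]byte, error) {
	zr, err := gzip.NewReader(bytes.NewReader(data))
	if err != nil {
		return nil, err
	}
	defer zr.Close()
	return io.ReadAll(zr)
}

// sampleInput returns synthetic, repetitive data standing in for a real file.
func sampleInput() []byte {
	return []byte(strings.Repeat("func (m *Message) GetField() string\n", 2000))
}

func main() {
	in := sampleInput()
	out, err := gzipFast(in)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%d bytes -> %d bytes (%.1fx)\n",
		len(in), len(out), float64(len(in))/float64(len(out)))
}
```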
I ran another experiment on the main repo toward the end of the Go 1.6 cycle.
I changed the existing package archive format to force
compression of `__.PKGDEF` and all Go object files,
using Go's compress/gzip at compression level 1,
when writing them to the package archive,
and I changed all readers to know to decompress them when
reading them back out of the archive.
This resulted in a 4X space savings for packages on disk:
the `$GOROOT/pkg` tree after make.bash shrank from about 64 MB to about 16 MB.
The cost was an approximately 10% slowdown in make.bash time:
the roughly two minutes make.bash normally took on my laptop
was extended by about 10 seconds.
My experiment was not as efficient in its use of compression as it could be.
For example, the linker went to the trouble to open and decompress
the beginning of `__.PKGDEF` just to read the few bits it actually needed.
Independently, Klaus Post has been working on improving the
speed of Go's compress/flate package (used by archive/zip,
compress/gzip, and compress/zlib) at all compression levels,
as well as the efficiency of the decompressor.
He has also replaced compression level 1 by a port of
the logic from Google's Snappy (formerly Zippy) algorithm,
which was designed specifically for compression speed.
Unlike Snappy, though, his port produces DEFLATE-compatible
output, so it can be used by a compressor without requiring
a non-standard decompressor on the other side.
From the combination of a more careful separation of data within
the package archive and Klaus's work on compression speed,
I expect the slowdown in make.bash due to
compression can be reduced to under 5% (for a 4X space savings!).
Of course, if the space savings turn out not to justify the cost of compression,
it is possible to use zip with no compression at all.
The other benefits of the zip format still make this a worthwhile cleanup.
## Compatibility
The toolchain is not subject to the [compatibility guidelines](https://golang.org/doc/go1compat).
Even so, this change is intended to be invisible to any use case that does not actually open a
package archive or object file and read the raw bytes contained within.
## Implementation
The implementation proceeds in these steps:
1. Implementation of a new package `cmd/internal/pkg`
for manipulating zip-format package archives.
2. Replacement of the ar-format archives with zip-format archives,
but still containing the old files (`__.PKGDEF` followed by any number
of object files of unspecified type).
3. Implementation of the new file structure within the archives:
the separate metadata files and the forced suffixes for Go object files
and system object files.
4. Addition of compression.
Steps 1, 2, and 3 should have no performance impact on build times.
We will measure the speed of make.bash to confirm this.
These steps depend on some extensions to the archive/zip package suggested by Roger Peppe.
He has implemented these and intends to send them early in the Go 1.7 cycle.
Step 4 will have a performance impact on build times.
It must be measured to make a proper engineering decision
about whether and how much to compress.
This step depends on the compress/flate performance improvements by Klaus Post described above.
He has implemented these and intends to send them early in the Go 1.7 cycle.
I will do this work early in the Go 1.7 cycle, immediately following Roger's and Klaus's work.
I have a rough but working prototype of steps 1, 2, and 3 already.
Enabling compression in the zip writer is a few lines of code beyond that.
Part of the motivation for doing this early in Go 1.7 is to make it possible
for Robert Griesemer to gather performance data for his new binary
export data format and enable that for Go 1.7 as well.
The binary export code is currently bottlenecked by the need to escape
and unescape the data to avoid generating a terminating `\n$$` sequence.