crfs: start of a README / design doc of sorts

Updates golang/go#30829

Change-Id: I8790dfcd30e3fb4d68b6e4cb9f8baf44c45d2cd6
Reviewed-on: https://go-review.googlesource.com/c/build/+/167392
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
diff --git a/crfs/README.md b/crfs/README.md
new file mode 100644
index 0000000..d99def8
--- /dev/null
+++ b/crfs/README.md
@@ -0,0 +1,143 @@
+# CRFS: Container Registry Filesystem
+
+Discussion: https://github.com/golang/go/issues/30829
+
+## Overview
+
+**CRFS** is a read-only FUSE filesystem that lets you mount a
+container image, served directly from a container registry (such as
+[gcr.io](https://gcr.io/)), without pulling it all locally first.
+
+## Background
+
+Starting a container should be fast. Currently, however, starting a
+container in many environments requires doing a `pull` operation from
+a container registry to read the entire container image from the
+registry and write the entire container image to the local machine's
+disk. It's pretty silly (and wasteful) that a read operation becomes a
+write operation. For small containers, this problem is rarely noticed.
+For larger containers, though, the pull operation quickly becomes the
+slowest part of launching a container, especially on a cold node.
+Contrast this with launching a VM on major cloud providers: even with
+a VM image that's hundreds of gigabytes, the VM boots in seconds.
+That's because the hypervisors' block devices are reading from the
+network on demand. The cloud providers all have great internal
+networks. Why aren't we using those great internal networks to read
+our container images on demand?
+
+## Why does Go want this?
+
+Go's continuous build system tests Go on [many operating systems and
+architectures](https://build.golang.org/), using a mix of containers
+(mostly for Linux) and VMs (for other operating systems). We
+prioritize fast builds, targeting a 5 minute turnaround for pre-submit
+tests when testing new changes. For isolation and other reasons, we
+run all our containers in single-use fresh VMs. Generally our
+containers do start quickly, but some of our containers are very large
+and take a long time to start. To work around that, we've automated
+the creation of VM images where our heavy containers are pre-pulled.
+This is all a silly workaround. It'd be much better if we could just
+read the bytes over the network from the right place, without all
+the hoops.
+
+## Tar files
+
+One reason that reading the bytes directly from the source on demand
+is non-trivial is that container images are, somewhat regrettably,
+represented by *tar.gz* files, and tar files are unindexed, and gzip
+streams are not seekable. This means that trying to read 1KB out of a
+file named `/var/lib/foo/data` still involves pulling hundreds of
+gigabytes: uncompressing the stream and decoding the entire tar file
+until you find the entry you're looking for. You can't look it up by
+its path name.
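+
+To make that concrete, here's a minimal Go sketch of the only access
+pattern a plain *tar.gz* permits: decompress the stream and walk the
+entries in order until you reach the path you want. (The layer file
+name and target path are made up for illustration.)
+
+```go
+package main
+
+import (
+	"archive/tar"
+	"compress/gzip"
+	"fmt"
+	"io"
+	"log"
+	"os"
+)
+
+func main() {
+	f, err := os.Open("layer.tar.gz") // hypothetical layer
+	if err != nil {
+		log.Fatal(err)
+	}
+	defer f.Close()
+
+	zr, err := gzip.NewReader(f)
+	if err != nil {
+		log.Fatal(err)
+	}
+	tr := tar.NewReader(zr)
+
+	// Even to read 1KB of one file, we must decompress and walk every
+	// preceding entry; there's no way to seek to a path by name.
+	for {
+		hdr, err := tr.Next()
+		if err == io.EOF {
+			log.Fatal("entry not found")
+		}
+		if err != nil {
+			log.Fatal(err)
+		}
+		if hdr.Name == "var/lib/foo/data" {
+			buf := make([]byte, 1024)
+			n, _ := io.ReadFull(tr, buf)
+			fmt.Printf("read %d bytes\n", n)
+			return
+		}
+	}
+}
+```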
+
+## Introducing Stargz
+
+Fortunately, we can fix the fact that *tar.gz* files are unindexed and
+unseekable, while still keeping the file a valid *tar.gz* file, by
+taking advantage of a property of both formats: you can concatenate
+tar files together to make a valid tar file, and you can concatenate
+multiple gzip streams together to make a valid gzip stream.
+
+We introduce a format, **Stargz**, a **S**eekable
+**tar.gz** format that remains a valid tar.gz file for everything
+that's unaware of these details.
+
+In summary:
+
+* The traditional `*.tar.gz` format is: `GZIP(TAR(file1 + file2 + file3))`
+* Stargz's format is: `GZIP(TAR(file1)) + GZIP(TAR(file2)) + GZIP(TAR(file3_chunk1)) + GZIP(TAR(file3_chunk2)) + GZIP(TAR(index of earlier files in magic file))`, where the trailing ZIP-like index contains offsets for each file/chunk's GZIP header in the overall **stargz** file.
+
+This makes images a few percent larger (due to more gzip headers and
+loss of compression context between files), but it's plenty
+acceptable.
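+
+To make the layout concrete, here's a small self-contained Go sketch
+(illustrative only, not CRFS's actual writer; the index member is
+omitted for brevity) that builds a stargz-style concatenation of
+per-file gzip members plus a trailing end-of-archive member, then
+reads the whole thing back with the stock `compress/gzip` and
+`archive/tar` readers, which handle concatenated gzip members by
+default:
+
+```go
+package main
+
+import (
+	"archive/tar"
+	"bytes"
+	"compress/gzip"
+	"fmt"
+	"io"
+	"log"
+)
+
+// gzTarMember returns GZIP(TAR(one file)), deliberately skipping the
+// tar end-of-archive marker so that members can be concatenated.
+func gzTarMember(name string, data []byte) []byte {
+	var buf bytes.Buffer
+	zw := gzip.NewWriter(&buf)
+	tw := tar.NewWriter(zw)
+	hdr := &tar.Header{Name: name, Mode: 0644, Size: int64(len(data))}
+	if err := tw.WriteHeader(hdr); err != nil {
+		log.Fatal(err)
+	}
+	if _, err := tw.Write(data); err != nil {
+		log.Fatal(err)
+	}
+	tw.Flush() // pad the entry, but don't write the end-of-archive marker
+	zw.Close()
+	return buf.Bytes()
+}
+
+func main() {
+	var img bytes.Buffer
+	img.Write(gzTarMember("file1", []byte("hello")))
+	img.Write(gzTarMember("file2", []byte("world")))
+
+	// Final gzip member: just the tar end-of-archive marker.
+	var trailer bytes.Buffer
+	zw := gzip.NewWriter(&trailer)
+	tw := tar.NewWriter(zw)
+	tw.Close() // closing an empty archive writes only the marker
+	zw.Close()
+	img.Write(trailer.Bytes())
+
+	// Any ordinary gzip+tar reader sees one valid tar.gz stream.
+	zr, err := gzip.NewReader(&img)
+	if err != nil {
+		log.Fatal(err)
+	}
+	tr := tar.NewReader(zr)
+	for {
+		hdr, err := tr.Next()
+		if err == io.EOF {
+			return
+		}
+		if err != nil {
+			log.Fatal(err)
+		}
+		b, _ := io.ReadAll(tr)
+		fmt.Printf("%s: %q\n", hdr.Name, b)
+	}
+}
+```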
+
+## Converting images
+
+If you're using `docker push` to push to a registry, you can't use
+CRFS to mount the image. Maybe one day `docker push` will push
+*stargz* files (or something with similar properties) by default, but
+not yet. So for now we need to convert the stored image layers from
+*tar.gz* into *stargz*. There is a tool that does that. **TODO: examples**
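+
+Until those examples land, here's a rough Go sketch of the conversion
+concept, reusing the per-member trick from the sketch above: read the
+original *tar.gz* sequentially and re-emit each entry as its own gzip
+member, recording offsets for the index. (File names are made up; the
+real tool also writes the trailing index and end-of-archive member.)
+
+```go
+package main
+
+import (
+	"archive/tar"
+	"compress/gzip"
+	"io"
+	"log"
+	"os"
+)
+
+func main() {
+	in, err := os.Open("layer.tar.gz") // hypothetical input layer
+	if err != nil {
+		log.Fatal(err)
+	}
+	defer in.Close()
+	out, err := os.Create("layer.stargz") // hypothetical output
+	if err != nil {
+		log.Fatal(err)
+	}
+	defer out.Close()
+
+	zr, err := gzip.NewReader(in)
+	if err != nil {
+		log.Fatal(err)
+	}
+	tr := tar.NewReader(zr)
+
+	// Re-emit each entry as its own GZIP(TAR(entry)) member, noting
+	// where each member starts so the index can point at it.
+	for {
+		hdr, err := tr.Next()
+		if err == io.EOF {
+			break
+		}
+		if err != nil {
+			log.Fatal(err)
+		}
+		start, _ := out.Seek(0, io.SeekCurrent)
+		zw := gzip.NewWriter(out)
+		tw := tar.NewWriter(zw)
+		if err := tw.WriteHeader(hdr); err != nil {
+			log.Fatal(err)
+		}
+		if _, err := io.Copy(tw, tr); err != nil {
+			log.Fatal(err)
+		}
+		tw.Flush() // pad the entry; no end-of-archive marker yet
+		zw.Close()
+		log.Printf("%s: gzip member starts at byte %d", hdr.Name, start)
+	}
+	// A real converter would now append the index (in a magic file
+	// entry) and a final end-of-archive member.
+}
+```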
+
+## Operation
+
+When mounting an image, the FUSE filesystem makes a couple of Docker
+Registry HTTP API requests to the container registry to get the
+metadata for the container and all its layers.
+
+It then does HTTP Range requests to read just the **stargz** index out
+of the end of each of the layers. The index is stored similarly to how
+the ZIP format's TOC is stored, with a pointer to the index at the
+very end of the file. Generally it takes 1 HTTP request to read the
+index, but no more than 2. In any case, we're assuming a fast network
+(GCE VMs to gcr.io, or similar) with low latency to the container
+registry. Each layer needs these 1 or 2 HTTP requests, but they can
+all be done in parallel.
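+
+As a sketch of that first step, the following Go snippet fetches the
+tail of a layer blob with a single suffix Range request. (The URL is a
+placeholder; real registry blob URLs come from the manifest and
+require auth, and the exact footer/index sizes are a detail of the
+*stargz* format not shown here.)
+
+```go
+package main
+
+import (
+	"fmt"
+	"io"
+	"log"
+	"net/http"
+)
+
+func main() {
+	const blobURL = "https://example.com/v2/your-proj/img/blobs/sha256-placeholder" // hypothetical
+	req, err := http.NewRequest("GET", blobURL, nil)
+	if err != nil {
+		log.Fatal(err)
+	}
+	// Suffix range: ask for only the last 64KB of the blob, enough to
+	// cover the footer that points at the index (and often the index
+	// itself, making this the only request needed).
+	req.Header.Set("Range", "bytes=-65536")
+	resp, err := http.DefaultClient.Do(req)
+	if err != nil {
+		log.Fatal(err)
+	}
+	defer resp.Body.Close()
+	if resp.StatusCode != http.StatusPartialContent {
+		log.Fatalf("server didn't honor the Range request: %v", resp.Status)
+	}
+	tail, err := io.ReadAll(resp.Body)
+	if err != nil {
+		log.Fatal(err)
+	}
+	fmt.Printf("got %d trailing bytes; Content-Range: %s\n",
+		len(tail), resp.Header.Get("Content-Range"))
+	// Parse the footer at the end of tail to locate the index; if the
+	// index wasn't fully covered, issue one more Range request for it.
+}
+```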
+
+From that, we keep the index in memory, so `readdir`, `stat`, and
+friends are all served from memory. For reading data, the index
+contains the offset of each file's `GZIP(TAR(file data))` range of the
+overall *stargz* file. To make it possible to efficiently read a small
+amount of data from large files, there can actually be multiple
+**stargz** index entries for large files (e.g. a new gzip stream
+every 16MB of a large file).
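+
+Here's a hypothetical Go sketch of serving such a read (the
+`chunkEntry` type, its field names, and the `readChunk` helper are
+invented for illustration, not CRFS's actual API): look up the chunk's
+byte range in the in-memory index, fetch exactly that range, and
+decode it with the stock gzip and tar readers.
+
+```go
+package main
+
+import (
+	"archive/tar"
+	"compress/gzip"
+	"fmt"
+	"io"
+	"log"
+	"net/http"
+)
+
+// chunkEntry is one in-memory index record: where one file chunk's
+// GZIP(TAR(data)) member lives within the stargz blob.
+type chunkEntry struct {
+	Name    string
+	Offset  int64 // start of the gzip member in the stargz file
+	CompLen int64 // compressed length of the member
+}
+
+func readChunk(blobURL string, e chunkEntry) ([]byte, error) {
+	req, err := http.NewRequest("GET", blobURL, nil)
+	if err != nil {
+		return nil, err
+	}
+	req.Header.Set("Range",
+		fmt.Sprintf("bytes=%d-%d", e.Offset, e.Offset+e.CompLen-1))
+	resp, err := http.DefaultClient.Do(req)
+	if err != nil {
+		return nil, err
+	}
+	defer resp.Body.Close()
+	if resp.StatusCode != http.StatusPartialContent {
+		return nil, fmt.Errorf("range not honored: %v", resp.Status)
+	}
+	zr, err := gzip.NewReader(resp.Body)
+	if err != nil {
+		return nil, err
+	}
+	tr := tar.NewReader(zr)
+	if _, err := tr.Next(); err != nil { // the member holds one tar entry
+		return nil, err
+	}
+	return io.ReadAll(tr) // the chunk's uncompressed bytes
+}
+
+func main() {
+	// Made-up index entry and URL, for illustration only.
+	e := chunkEntry{Name: "var/lib/foo/data", Offset: 123456, CompLen: 4096}
+	data, err := readChunk("https://example.com/v2/img/blobs/sha256-placeholder", e)
+	if err != nil {
+		log.Fatal(err)
+	}
+	fmt.Printf("read %d bytes of %s\n", len(data), e.Name)
+}
+```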
+
+## Union/overlay filesystems
+
+CRFS can do the aufs/overlay2-ish unification of multiple read-only
+*stargz* layers, but it will stop short of trying to unify a writable
+filesystem layer atop. For that, you can just use the traditional
+Linux filesystems.
+
+## Using with Docker, without modifying Docker
+
+Ideally container runtimes would support something like this whole
+scheme natively, but in the meantime a workaround is that when
+converting an image into *stargz* format, the converter tool can also
+produce an image variant that only has metadata (environment,
+entrypoints, etc.) and no file contents. Then you can bind mount in the
+contents from the CRFS FUSE filesystem.
+
+That is, the converter tool can do:
+
+**Input**: `gcr.io/your-proj/container:v2`
+
+**Output**: `gcr.io/your-proj/container:v2meta` + `gcr.io/your-proj/container:v2stargz`
+
+What you actually run on Docker or Kubernetes then is the `v2meta`
+version, so your container host's `docker pull` or equivalent only
+pulls a few KB. The gigabytes of remaining data are read lazily via
+CRFS from the `v2stargz` layer directly from the container registry.
+
+## Status
+
+WIP. Enough parts are implemented & tested for me to realize this
+isn't crazy. I'm publishing this document first for discussion while I
+finish things up. Maybe somebody will point me to an existing
+implementation, which would be great.
+
+## Discussion
+
+See https://github.com/golang/go/issues/30829