Seeking around in an HTTP object

Imagine there’s a giant tar file on an HTTP server, and you want to know what’s inside it. You don’t know whether it has what you’re looking for, and you don’t want to download the whole thing just to find out. Is it possible to do something like “tar tvf http://example.com/giant.tar”?

This is not a theoretical problem invented just to demonstrate something in Go. In fact, I wasn’t looking to write an article at all, except that I needed to know the structure of the bulk patent downloads from the USPTO, which are hosted at Google.

So, how to go about this task? We’ve got a couple of things to check, and then we can plan a way forward. First, do the Google servers support the Range header? That’s easy enough to check with curl and an HTTP HEAD request:

$ curl -I https://storage.googleapis.com/patents/redbook/grants/2015/I20150106.tar
HTTP/1.1 200 OK
X-GUploader-UploadID: AEnB2UrryncnFxyuHvzmUvVADjSRQJXRogilQ-lE9-pFZwJwkU5jRYC27PBdIgFe4f18Xgol-YOBxi5QqGM5yhGgso2Rd_NsZQ
Expires: Sun, 17 Jan 2016 07:52:14 GMT
Date: Sun, 17 Jan 2016 06:52:14 GMT
Cache-Control: public, max-age=3600
Last-Modified: Sat, 07 Feb 2015 11:40:47 GMT
ETag: "558992439ef9046c834f9dab709e6b1a"
x-goog-generation: 1423309247028000
x-goog-metageneration: 1
x-goog-stored-content-encoding: identity
x-goog-stored-content-length: 2600867840
Content-Type: application/octet-stream
x-goog-hash: crc32c=GwtHNA==
x-goog-hash: md5=VYmSQ575BGyDT52rcJ5rGg==
x-goog-storage-class: STANDARD
Accept-Ranges: bytes
Content-Length: 2600867840
Server: UploadServer
Alternate-Protocol: 443:quic,p=1
Alt-Svc: quic=":443"; ma=604800; v="30,29,28,27,26,25"

Note the “Accept-Ranges” header in there, which says that the server will accept byte-range requests. Range headers let you implement the HTTP equivalent of the operating system’s random-access reads (i.e. the io.ReaderAt interface).

So it would theoretically be possible to pick and choose which bytes we download from the web server, in order to fetch only the parts of the tar file that hold the metadata (the table of contents).
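
To make that concrete, here is a minimal sketch (not part of the original toolchain) that asks the server for only the first 512 bytes of the file, which is exactly one tar header block. The URL is the USPTO file from the curl example above.

package main

import (
    "fmt"
    "io/ioutil"
    "log"
    "net/http"
    "strings"
)

func main() {
    url := "https://storage.googleapis.com/patents/redbook/grants/2015/I20150106.tar"

    req, err := http.NewRequest("GET", url, nil)
    if err != nil {
        log.Fatal(err)
    }
    // Byte ranges are inclusive, so 0-511 asks for the first 512 bytes:
    // exactly one tar header block.
    req.Header.Set("Range", "bytes=0-511")

    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()

    // A server that honors the Range header answers "206 Partial Content".
    fmt.Println("status:", resp.Status)

    buf, err := ioutil.ReadAll(resp.Body)
    if err != nil {
        log.Fatal(err)
    }
    if len(buf) >= 100 {
        // The first 100 bytes of a tar header hold the NUL-padded file name.
        fmt.Println("first entry:", strings.TrimRight(string(buf[:100]), "\x00"))
    }
}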

As an aside: the fact that Google is serving raw tar files here, and not tar.gz files, is what makes this possible. A tar.gz file does not have metadata at predictable offsets, because the location of file n’s metadata depends on how well or badly file n-1’s data was compressed. Why is Google serving tar files? Because the majority of the content inside them is already compressed, so compressing the tar file wouldn’t gain much. And maybe they were thinking about my exact situation… though that’s less likely.

Now we need an implementation of the tar file format that will let us replace the “read the next table-of-contents header” step with a read that fetches only the metadata, using an HTTP GET with a Range header. And that is where Go’s archive/tar package comes in!

However, we immediately find a problem. The tar.NewReader function takes an io.Reader. The problem with io.Reader is that it does not give us random access to the resource, the way io.ReaderAt does. It is defined that way because it makes the tar package more adaptable: in particular, you can hook the Go tar package directly up to the compress/gzip package and read tar.gz files, as long as you are content to read them sequentially and not jump around in them, as we wish to do.
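
That sequential style looks something like this (a sketch; "giant.tar.gz" is a placeholder for any local file): chain compress/gzip into archive/tar and walk the entries front to back.

package main

import (
    "archive/tar"
    "compress/gzip"
    "fmt"
    "io"
    "log"
    "os"
)

func main() {
    // "giant.tar.gz" is a placeholder; any local tar.gz file will do.
    f, err := os.Open("giant.tar.gz")
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    gz, err := gzip.NewReader(f)
    if err != nil {
        log.Fatal(err)
    }

    // tar.NewReader is happy with any io.Reader, but Next can only move
    // forward: every byte between headers gets read and thrown away.
    tr := tar.NewReader(gz)
    for {
        hdr, err := tr.Next()
        if err == io.EOF {
            break
        }
        if err != nil {
            log.Fatal(err)
        }
        fmt.Println(hdr.Name)
    }
}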

So what to do? Use the source, Luke! Go dig into the Next method, and look around. That’s where we’d expect it to go find the next piece of metadata. Within a few lines, we find an intriguing function call, to skipUnread. And there, we find something very interesting:

// skipUnread skips any unread bytes in the existing file entry, as well as any alignment padding.
func (tr *Reader) skipUnread() {
    nr := tr.numBytes() + tr.pad // number of bytes to skip
    tr.curr, tr.pad = nil, 0
    if sr, ok := tr.r.(io.Seeker); ok {
        if _, err := sr.Seek(nr, os.SEEK_CUR); err == nil {
            return
        }
    }
    _, tr.err = io.CopyN(ioutil.Discard, tr.r, nr)
}

The type assertion in there says, “if the io.Reader is actually capable of seeking as well, then instead of reading and discarding, we seek directly to the right place”. Eureka! We just need to hand tar.NewReader an io.Reader that also satisfies io.Seeker (in other words, an io.ReadSeeker).
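
Here is a much-simplified sketch of that idea, just to show its shape: an io.ReadSeeker whose Seek only moves an offset, and whose Read turns into an HTTP GET with a Range header at that offset. (This is illustrative only; the real package is more careful and, as noted below, also implements io.ReaderAt.)

package main

import (
    "archive/tar"
    "fmt"
    "io"
    "log"
    "net/http"
)

// httpReadSeeker is a toy io.ReadSeeker over a remote URL. Seek only moves
// an offset; Read fetches bytes at that offset with a Range request.
type httpReadSeeker struct {
    url    string
    offset int64
    size   int64 // total length, taken from Content-Length
}

func (s *httpReadSeeker) Read(p []byte) (int, error) {
    if s.offset >= s.size {
        return 0, io.EOF
    }
    req, err := http.NewRequest("GET", s.url, nil)
    if err != nil {
        return 0, err
    }
    req.Header.Set("Range",
        fmt.Sprintf("bytes=%d-%d", s.offset, s.offset+int64(len(p))-1))
    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return 0, err
    }
    defer resp.Body.Close()

    n, err := io.ReadFull(resp.Body, p)
    if err == io.EOF || err == io.ErrUnexpectedEOF {
        err = nil // a short read at the end of the file is not an error
    }
    s.offset += int64(n)
    return n, err
}

// Seek never touches the network; it only updates the offset, so the Seek
// call inside skipUnread costs us nothing.
func (s *httpReadSeeker) Seek(offset int64, whence int) (int64, error) {
    switch whence {
    case io.SeekStart:
        s.offset = offset
    case io.SeekCurrent:
        s.offset += offset
    case io.SeekEnd:
        s.offset = s.size + offset
    }
    return s.offset, nil
}

func main() {
    url := "https://storage.googleapis.com/patents/redbook/grants/2015/I20150106.tar"

    // One HEAD request gives us the total size up front.
    head, err := http.Head(url)
    if err != nil {
        log.Fatal(err)
    }
    head.Body.Close()

    rs := &httpReadSeeker{url: url, size: head.ContentLength}
    tr := tar.NewReader(rs)
    for {
        hdr, err := tr.Next()
        if err == io.EOF {
            break
        }
        if err != nil {
            log.Fatal(err)
        }
        fmt.Printf("%10d %s\n", hdr.Size, hdr.Name)
    }
}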

So, now go check out package github.com/jeffallen/seekinghttp and see how we accomplish that for real.

This package implements not only io.ReadSeeker, but also io.ReaderAt. Why? Because after investigating a remote tar file, I realized I also wanted to look at a remote zip file, and to do that I needed to call zip.NewReader, which takes both an io.ReaderAt and the size of the file. Why the size? Because the zip file format puts its table of contents at the end of the file, so the first thing zip.NewReader does is seek to the end of the file to find the TOC.
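
For reference, this is the shape of the zip API the paragraph above describes. The sketch below uses a local file, since *os.File already satisfies io.ReaderAt; a Range-backed remote implementation drops into the same two arguments. (The file name is a placeholder.)

package main

import (
    "archive/zip"
    "fmt"
    "log"
    "os"
)

func main() {
    // "archive.zip" is a placeholder name; any local zip file will do.
    f, err := os.Open("archive.zip")
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    fi, err := f.Stat()
    if err != nil {
        log.Fatal(err)
    }

    // zip.NewReader needs the size so it can jump straight to the table of
    // contents at the end of the file.
    zr, err := zip.NewReader(f, fi.Size())
    if err != nil {
        log.Fatal(err)
    }
    for _, zf := range zr.File {
        fmt.Printf("%10d %s\n", zf.UncompressedSize64, zf.Name)
    }
}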

A command-line tool to get the table of contents of tar and zip files remotely is in remote-archive-ls.
