The SRE book

I gave a Lightning Talk at SREcon16 and I was lucky enough to win the SRE book from Google while I was there.

Here are some notes of things I was thinking while reading it.

First, this is a phenomenal piece of work, one that really marks a special point in time: the dawn of the possibility of wide adoption of SRE principles. I say "possibility" because, after getting exposed to the deepest details of what makes SRE work, I think there are lots of organizations that won't be willing or able to make it work.

Even though I've been in IT for decades at this point, I'd fallen into the trap, as an outside observer of Google, of imagining that there was some magic bullet they possessed that made it possible to deliver such enormous services with such high reliability. If someone had asked me to explain that viewpoint before reading this book, I would have bashfully admitted that it was ridiculous to imagine that there are silver bullets in IT operations. Fred Brooks already taught us that.

Now that I’ve read the SRE book, I’ve figured out what the silver bullet is: it’s sweat.

Over and over while reading it, I thought to myself, “well, yeah, I knew that was the solution to the problems I was facing growing Tellme in 2001, but we just weren’t in a position to put in that work”. I’d also think while reading, “man, I can see how that would totally work, but you’d really need an immense amount of goodwill, dedication, and good leadership to make it happen; I’ve been in teams where there’s not a critical mass who sweat the details to make that work.”

So that's my 10,000-meter take-away from the SRE book: Wow, man, that looks like a lot of work. It is as if Thomas Edison came back to life and restated his maxim: "Reliability is 99.999% perspiration and 0.001% downtime".

But that's not the end of the story. The other thing I felt time and time again reading the book was a sense of longing to get back in the game. It made me ready to sweat. The book, told as it is by passionate, proud, smart people who have been sweating in the trenches, is as intoxicating as a CrossFit promotional video.

To outsiders who think they understand IT operations or software development based just on the English-language definitions of the constituent terms, SRE might look easy. But what I really liked about the SRE book was how, time and again, it talked about how the values of SRE inform the successful approach to a given problem. When a team needs to introspect on its values in order to choose a way forward, you are no longer in the realm of technology: that's about culture.

In my last job, from the first moment, my colleagues looked to me to guide the culture of the team. My title was not “tech lead”, but there were some behaviors I knew we needed to be encouraging and I knew how to model them. Reading the SRE book triggered the same instincts in me again. A lot of the info in the SRE book I already had learned in my own way, from my own experiences. But lots of the information was a new take on the old problems I knew about, and inspired me to say, “wow, yes, of course that’s the answer, I’d like to be in a team that was acting like that!”

But the fact that integrating SRE into an organization is a cultural, not technical, affair dooms it to partial, spotty uptake. There will be organizations that don't have the right kind of cultural flexibility, or the leadership able to bring people around to SRE. They will carry on with what they are doing, but they will pay the price by forgoing the benefits that Google has shown SRE can bring to an organization. Their dev teams and ops teams will forever be locked in battle, and the only action item from their postmortems will continue to be "we need more change control meetings".

I pity the fools.

A Kafkaesque Experiment

As part of my interview prep, last night I challenged myself to do the following:

  • Make a Kubernetes cluster (on Google Cloud Platform)
  • …running Dockerized Zookeeper (1) and Kafka (2)
  • …with Kafka reporting stats into Datadog
  • Send in synthetic load from a bunch of Go programs moving messages around on Kafka
  • Then run an experiment to kill the Kafka master and watch how the throughput/latencies change.

Since that's a lot of stuff I've never touched before (though I've read up on it, and it uses all the same general concepts I've worked with for 15 years), it should not be too surprising that I didn't get it done. Yet.

The surprising thing is where I got stuck. I found a nice pair of Docker containers for Zookeeper and Kafka. I got Zookeeper up and running, and I could see its name in the Kubernetes DNS. My two Kafkas were up and running, and they found the Zookeeper via service discovery. So far so good. But then something went wrong with the place where I was going to run clients from: it could not talk to either of the Kafkas via TCP; connections timed out. What's more, I couldn't be sure that both of my Kafkas were even being advertised by Kubernetes DNS.

(Shower thought after writing this: perhaps my client container was started before the Kafka one, and as a result, it didn’t have the correct container-to-container networking magic set up. It would be interesting to read up on how that works and then debug it to see if I can see the exact problem. Or it might go away the next time I start the containers, this time in the right order. But… how can order matter? This would make it very difficult to operate these things.)

Learning how to debug in the container environment is one of the hardest things. It’s like walking around in a brewery in the dark armed only with a keychain flashlight and your nose, looking for the beer leak.

I think it is time to take a break from container-ville and use small, local Kafka on my Mac to develop the synthetic load generator. That will also be interesting, because I’m hoping to be able to generate spiky, floody flows of messages using feedback from producers to consumers. It is actually something I’ve had in mind for years, and never had the right situation calling on me to finally try it out.
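To give an idea of the shape I have in mind, here is a minimal sketch of the producer side. It assumes the Shopify/sarama client, plus broker addresses and a topic name I made up for the example; the consumer side and the feedback loop that makes the flow properly spiky would come later.

// loadgen is a sketch of a bursty Kafka load generator.
package main

import (
    "fmt"
    "log"
    "time"

    "github.com/Shopify/sarama"
)

func main() {
    config := sarama.NewConfig()
    config.Producer.Return.Successes = true // required by SyncProducer

    // Hypothetical broker addresses; in the Kubernetes setup these would
    // come from service discovery.
    producer, err := sarama.NewSyncProducer([]string{"kafka-1:9092", "kafka-2:9092"}, config)
    if err != nil {
        log.Fatal("cannot connect to Kafka: ", err)
    }
    defer producer.Close()

    for burst := 0; ; burst++ {
        start := time.Now()
        for i := 0; i < 1000; i++ {
            _, _, err := producer.SendMessage(&sarama.ProducerMessage{
                Topic: "loadtest",
                Value: sarama.StringEncoder(fmt.Sprintf("burst %d msg %d", burst, i)),
            })
            if err != nil {
                log.Println("send failed:", err)
            }
        }
        // A failed-over broker shows up here as a jump in burst duration.
        log.Printf("burst %d: 1000 msgs in %v", burst, time.Since(start))
        time.Sleep(2 * time.Second) // idle gap between bursts keeps the load spiky
    }
}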

Update: Well, the load generator made for some fun hacking and learning. The final step would be to put it all together. That may come in the future, but for now I'm busy with a trip to New York.

Interview Questions I Hope I Get

I have an interview coming up, and so my “keep in shape hacking time” has been recently devoted to interview preparation. I thought I would make a post about what’s in my head, both as a way to solidify it (no better way to learn something than by teaching it) and in case this interview goes bad, so that my next prospective employer can see what I’m thinking about.

If you, my current prospective employer, are reading this, would you please not take advantage of it by removing these questions from your list? Come on, guys, give me a break. If I'm going to be transparent in my thought processes, the least you can do is throw me a bone and ask at least one of these in person!

Question 1: Make something faster.

I think this would be an interesting question. Especially if it was literally asked in these words. The interviewee asks, “make what faster?” The interviewer says, “Whatever. Tell me about making something faster. Whatever you want to tell me about.”

This would be a fun question because it would test the interviewee's ability to organize his/her thoughts, select the relevant ones, and present them. In fact, it is checking the interviewee's ability to teach, which is a required skill for someone like me (with a grey beard). You deliver the most value on a team not by pulling 80-hour weeks yourself, but by mentoring people so that the 4 people on your team manage (together) a 160-hour week, instead of the 60-80 hours of useful progress they might have managed without your (theoretically) sage advice.

My answer would be the following:

There are three possibilities for how to make something faster. First, you can not do it at all. Not doing it is an infinite speedup over doing it. How? Caching, usually. And hey, caching is easy. As easy as naming.
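To make "not doing it at all" concrete: the no-frills version is memoization, with the hard part (invalidation) conveniently left out of this sketch.

package main

import "fmt"

// expensive stands in for a slow computation we would rather not repeat.
func expensive(n int) int {
    return n * n // imagine real work here
}

var cache = map[int]int{}

// cached does the work at most once per distinct input.
func cached(n int) int {
    if v, ok := cache[n]; ok {
        return v
    }
    v := expensive(n)
    cache[n] = v
    return v
}

func main() {
    fmt.Println(cached(12)) // does the work
    fmt.Println(cached(12)) // free: an infinite speedup over doing it again
}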

Second, you can do it better, by choosing a better algorithm or representation, or by doing less work. I might retell a story Átila told me about a challenge in which the right answer could be found with less than half the data sorted (he wrote his own quicksort that let him find the answer before finishing the sort, thereby blowing other contestants out of the water; they all ran quicksort to completion, then looked for the answer in the sorted list). I also happen to have spotted a place in my prospective employer's product where I could propose to change the representation, in order to reduce the compute power necessary to parse something. It takes an arrogant little prick to use the company's own software as an example of what not to do, but depending on the feeling I have with the interviewer, I might do it.
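I don't know the details of Átila's code, but the general trick is the one usually called quickselect: partition the data the way quicksort does, then descend into only the side that can contain the answer, leaving most of the data unsorted. A sketch:

package main

import "fmt"

// quickselect returns the k-th smallest element (0-based) of a in expected
// O(n) time by partitioning like quicksort but descending into only one side.
func quickselect(a []int, k int) int {
    lo, hi := 0, len(a)-1
    for lo < hi {
        p := partition(a, lo, hi)
        switch {
        case k == p:
            return a[k]
        case k < p:
            hi = p - 1
        default:
            lo = p + 1
        }
    }
    return a[lo]
}

// partition moves everything smaller than the pivot to the left and returns
// the pivot's final index (Lomuto scheme).
func partition(a []int, lo, hi int) int {
    pivot := a[hi]
    i := lo
    for j := lo; j < hi; j++ {
        if a[j] < pivot {
            a[i], a[j] = a[j], a[i]
            i++
        }
    }
    a[i], a[hi] = a[hi], a[i]
    return i
}

func main() {
    a := []int{9, 1, 8, 2, 7, 3, 6, 4, 5}
    fmt.Println("median:", quickselect(a, len(a)/2)) // prints "median: 5"
}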

Third, you can take the solution you have now, and optimize it. To do that, you measure, profile, tweak, and then measure again. I checked some code into the Go standard library that does that, and one of those would make a nice story.
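In Go, the measuring part is pleasantly low-ceremony: a benchmark plus the built-in profiler usually shows where the time really goes. Something like this, with strconv.Atoi standing in for whatever hot path you actually care about:

package bench

import (
    "strconv"
    "testing"
)

// Run with: go test -bench=. -benchmem -cpuprofile=cpu.out
// Then dig into the profile with: go tool pprof cpu.out
func BenchmarkAtoi(b *testing.B) {
    for i := 0; i < b.N; i++ {
        if _, err := strconv.Atoi("123456789"); err != nil {
            b.Fatal(err)
        }
    }
}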

Question 2: How would you add load shedding to the Go standard library’s HTTP server?

I thought of this question while reading chapter 22 of the new SRE book from Google and O’Reilly.

There are two layers where you could implement it. The "right" way to do it is to arrange for the server to send back a properly formatted 503 error via HTTP. Another way would be to accept and close the connection immediately. Which you choose depends on who you think your clients are, and how you think they will respond to the two kinds of errors.

The "accept and close" technique should be capable of more "!QPS" (non-answered queries per second) because it can theoretically handle each reject without spawning a new goroutine. Goroutines are cheap, but they imply at least one mutex (to add the goroutine into the scheduler), some memory pressure (for the stack), and some context switches. Implementing this would be a simple matter of wrapping the net.Listener with our own, which intercepts Accept to do "accept and close" when load shedding is needed.

I just went to read the source, and the way Accept is called, it must return a valid connection and no error. So our load-shedding Accept needs to actually be a loop, where we accept and close until the logic decides it is time to finally let a connection through (i.e. “if load shedding then allow 1 in 100 connections” or whatever), and then we need to return with that connection.
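Roughly like this (a sketch only; the decision about when to shed is left as a callback, and here it never sheds):

package main

import (
    "net"
    "net/http"
)

// sheddingListener wraps a net.Listener. When shedding is on, it accepts
// and immediately closes connections instead of handing them to the server.
type sheddingListener struct {
    net.Listener
    shed func() bool // true means: reject this connection
}

func (l sheddingListener) Accept() (net.Conn, error) {
    for {
        c, err := l.Listener.Accept()
        if err != nil {
            return nil, err
        }
        if l.shed() {
            c.Close() // as cheap as it gets: no goroutine, no HTTP parsing
            continue
        }
        return c, nil // finally let one through
    }
}

func main() {
    ln, err := net.Listen("tcp", ":8080")
    if err != nil {
        panic(err)
    }
    // A real implementation would consult a load signal (queue depth,
    // latency, CPU) here instead of never shedding.
    neverShed := func() bool { return false }
    http.Serve(sheddingListener{ln, neverShed}, nil)
}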

So how to implement the other one? What you want is an http.HandlerFunc wrapper. Except: what if libraries you don’t control are registering handlers? How can you be sure your load shedding wrapper is always the first one called, so that it can send an error 503? I thought I had a solution to this, but it turns out that http.ServeMux is not an interface, but a concrete type. So it is not possible to trap http.ServeMux.Handle, and ensure that the load shedder is always on the front. This one is going to take some more thinking.
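The wrapper itself is the easy part, at least when you control what gets handed to the server; the hard part, per the above, is guaranteeing it always ends up outside everyone else's handlers. A sketch of the easy part:

package main

import "net/http"

// shed wraps any http.Handler and answers 503 instead of calling it
// while the server is overloaded.
func shed(next http.Handler, overloaded func() bool) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        if overloaded() {
            http.Error(w, "overloaded, try again later", http.StatusServiceUnavailable)
            return
        }
        next.ServeHTTP(w, r)
    })
}

func main() {
    http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
        w.Write([]byte("hello\n"))
    })
    // This only works because we control this call and can put the shedder
    // in front of the whole mux.
    neverShed := func() bool { return false }
    http.ListenAndServe(":8080", shed(http.DefaultServeMux, neverShed))
}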

Of course there's always monkey patching in Go, which could be used to make HandleFunc and friends always put a load shedder on the front of the call chain. But not in production. Please, God, not in production.

Question 3: A coding challenge I found on Glassdoor

This one is a legitimate coding challenge, and I don’t want to post either the question or the answer here, because that’s just not cool. But it did lead me to an interesting observation…

After a little workout in the coding dojo (I slayed the challenge!) I hit the showers. And in the shower, I had an idea: what if you could interpret the []byte that you get from go-fuzz as input to this code challenge? Then I could fuzz my routine, looking for ways to crash it. I quickly got it working, but ran up against the fundamental problem of fuzzing. When your input is random, for lots of problems, it is hard to check the output to know if it is the right output for the given input. You have one routine right there that could check it, but it is the routine-under-test, and thus using it would be cheating (and pointless). The bottom line, which I already knew, is that fuzzers can only find crashes, not incorrect behavior.
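For the record, hooking something up to go-fuzz looks roughly like this; since the challenge itself stays off this blog, parseInput and solve below are stand-ins of my own invention:

// Package challenge is a go-fuzz target; build it with go-fuzz-build.
package challenge

// parseInput is a stand-in: interpret the raw fuzz bytes as a challenge input.
func parseInput(data []byte) ([]int, bool) {
    if len(data) == 0 {
        return nil, false
    }
    in := make([]int, len(data))
    for i, b := range data {
        in[i] = int(b)
    }
    return in, true
}

// solve is a stand-in for the routine under test.
func solve(in []int) int {
    sum := 0
    for _, n := range in {
        sum += n
    }
    return sum
}

// Fuzz is the entry point go-fuzz expects: return 1 for interesting inputs,
// 0 for uninteresting ones.
func Fuzz(data []byte) int {
    in, ok := parseInput(data)
    if !ok {
        return 0 // malformed input: not interesting, but not a crash either
    }
    solve(in) // go-fuzz can only catch panics here, not wrong answers
    return 1
}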

git log --grep "Résumé"

For a while now, it's been clear that a useful and important piece of data about how a future colleague might work out is their open source contributions. While the conditions of open source work are often somewhat different from those of paid work, a person's manner of expressing themselves (both interpersonally, on issue trackers for example, and in code) is likely to tell you more about their personality than you can learn in the fake environment of an interview.

I am currently starting a job search for a full time job in the Lausanne, Switzerland area (including as far as Geneva and Neuchâtel) with a start date of September 1, 2016. Touching up my résumé is of course on my todo list. But it is more fun to look at and talk about code, so as a productive procrastination exercise, here’s a guided tour of my open source contributions.

I currently work with Vanadium, a new way of making secure RPC calls with interesting naming, delegation and network traversal properties. I believe wide-scale adoption of Vanadium would make IoT devices less broken, which seems like an important investment we need to be making in infrastructure right now.

I have been an occasional contributor to the Go standard library since 2011. Some examples I am particularly proud of include:

  • I profiled the bzip2 decoder and realized that the implementation was cache thrashing. The fixed code is smaller and faster.
  • I noticed that the HTTP implementation (both client and server) was generating more trash than was necessary, and prototyped several solutions. The chosen solution is in this checkin. This was, for a while, the most intensely exercised code I've written in my career; it saved one allocation per header for most headers processed by the billions of HTTP transactions that Go programs do all around the world. It was later replaced with a simpler implementation made possible by an optimization in the compiler. The same allocation-reduction technique was carried over into Go's HTTP/2 implementation too, it appears.
  • I improved the TLS Client Authentication to actually work. I needed this for a project at work.

I wrote an MQTT server in Go in order to learn the protocol for work. It is by no means the fastest, but several people have found the simplicity and idiomaticity of its implementation attractive. It ships as part of the device software for the Ninja Sphere.

  • I found an interesting change to reduce network overhead in the MQTT serializer I used. Copies = good, more syscalls and more packets = bad.

I worked on control software for a drone in Go for a while. It still crashes instead of flying, but it was an interesting way to learn about control systems, soft realtime systems, websockets, and Go on ARM.

I have written articles about Go on this blog since I started using it in 2010. Some interesting ones in particular are:

I am also active on the Golang sub-reddit, where I like answering newbie requests for code reviews. I also posed a challenge for the Golang Challenge.

My earliest open source project that’s still interesting to look at today was a more scalable version of MRTG called Cricket. Today tools like OpenTSDB and Graphite are used for the same job, it seems. Cricket still works to the best of my knowledge, but development is dead because of lack of interest from users. Cricket is written in Perl.

My earliest open source project was Digger, from Bunyip. If you can unearth archeological evidence of it from the 1996 strata of the Internet geology, you’re a better digger than I am. It was written in C.

Seeking around in an HTTP object

Imagine there's a giant tar file on an HTTP server, and you want to know what's inside it. You don't know if it's got what you are looking for, and you don't want to download the whole thing. Is it possible to do something like "tar tvf http://example.com/giant.tar"?

This is not a theoretical problem just to demonstrate something in Go. In fact, I wasn’t looking to write an article at all, except that I needed to know the structure of the bulk patent downloads from the USPTO, hosted at Google.

So, how to go about this task? We’ve got a couple things to check and then we can plan a way forward. First, do the Google servers support the Range header? That’s easy enough to check using curl and an HTTP HEAD request:

$ curl -I https://storage.googleapis.com/patents/redbook/grants/2015/I20150106.tar
HTTP/1.1 200 OK
X-GUploader-UploadID: AEnB2UrryncnFxyuHvzmUvVADjSRQJXRogilQ-lE9-pFZwJwkU5jRYC27PBdIgFe4f18Xgol-YOBxi5QqGM5yhGgso2Rd_NsZQ
Expires: Sun, 17 Jan 2016 07:52:14 GMT
Date: Sun, 17 Jan 2016 06:52:14 GMT
Cache-Control: public, max-age=3600
Last-Modified: Sat, 07 Feb 2015 11:40:47 GMT
ETag: "558992439ef9046c834f9dab709e6b1a"
x-goog-generation: 1423309247028000
x-goog-metageneration: 1
x-goog-stored-content-encoding: identity
x-goog-stored-content-length: 2600867840
Content-Type: application/octet-stream
x-goog-hash: crc32c=GwtHNA==
x-goog-hash: md5=VYmSQ575BGyDT52rcJ5rGg==
x-goog-storage-class: STANDARD
Accept-Ranges: bytes
Content-Length: 2600867840
Server: UploadServer
Alternate-Protocol: 443:quic,p=1
Alt-Svc: quic=":443"; ma=604800; v="30,29,28,27,26,25"

Note the “Accept-Ranges” header in there, which says that we can send byte ranges to it. Range headers let you implement the HTTP equivalent of the operating system’s random-access reads (i.e. the io.ReaderAt interface).

So it would theoretically be possible to pick and choose which bytes we download from the web server, in order to download only the parts of a tarfile that have the metadata (table of contents) in it.

As an aside: the fact that Google is serving raw tar files here, and not tar.gz files, makes this possible. A tar.gz file does not have metadata available at predictable places, because the location of file n's metadata depends on how well or badly file n-1's data was compressed. Why is Google serving tar files? Because the majority of the content of the files is pre-compressed, so compressing the tar file wouldn't gain much. And maybe they were thinking about my exact situation… though that's less likely.

Now we need an implementation of the TAR file format that will let us replace the “read next table of contents header” part with an implementation of read that reads only the metadata, using an HTTP GET with a Range header. And that is where Go’s archive/tar package comes in!

However, we immediately find a problem. The tar.NewReader function takes an io.Reader. The problem with the io.Reader is that it does not give us random access to the resource, the way io.ReaderAt does. It is implemented that way because it makes the tar package more adaptable. In particular, you can hook the Go tar package directly up to the compress/gzip package and read tar.gz files — as long as you are content to read them sequentially and not jump around in them, as we wish to.

So what to do? Use the source, Luke! Go dig into the Next method, and look around. That’s where we’d expect it to go find the next piece of metadata. Within a few lines, we find an intriguing function call, to skipUnread. And there, we find something very interesting:

// skipUnread skips any unread bytes in the existing file entry, as well as any alignment padding.
func (tr *Reader) skipUnread() {
    nr := tr.numBytes() + tr.pad // number of bytes to skip
    tr.curr, tr.pad = nil, 0
    if sr, ok := tr.r.(io.Seeker); ok {
        if _, err := sr.Seek(nr, os.SEEK_CUR); err == nil {
            return
        }
    }
    _, tr.err = io.CopyN(ioutil.Discard, tr.r, nr)
}

The type assertion in there says, “if the io.Reader is actually capable of seeking as well, then instead of reading and discarding, we seek directly to the right place”. Eureka! We just need to send an io.Reader into tar.NewReader that also satisfies io.Seeker (thus, it is an io.ReadSeeker).
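In rough strokes, the idea looks like this (a bare-bones sketch, not the real package code: it skips status-code checks, io.SeekEnd, and HTTP client reuse):

package main

import (
    "archive/tar"
    "fmt"
    "io"
    "log"
    "net/http"
)

// remoteFile satisfies io.ReadSeeker: Seek is pure bookkeeping, and Read
// fetches only the bytes we are positioned over, via a Range header.
type remoteFile struct {
    url    string
    offset int64
}

func (r *remoteFile) Seek(offset int64, whence int) (int64, error) {
    switch whence {
    case io.SeekStart:
        r.offset = offset
    case io.SeekCurrent:
        r.offset += offset
    default:
        return 0, fmt.Errorf("whence %d not supported in this sketch", whence)
    }
    return r.offset, nil
}

func (r *remoteFile) Read(p []byte) (int, error) {
    req, err := http.NewRequest("GET", r.url, nil)
    if err != nil {
        return 0, err
    }
    // Ask the server for exactly the bytes the caller wants.
    req.Header.Set("Range", fmt.Sprintf("bytes=%d-%d", r.offset, r.offset+int64(len(p))-1))
    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return 0, err
    }
    defer resp.Body.Close()
    n, err := io.ReadFull(resp.Body, p)
    r.offset += int64(n)
    if err == io.ErrUnexpectedEOF {
        err = io.EOF // short read: we ran off the end of the object
    }
    return n, err
}

func main() {
    r := &remoteFile{url: "https://storage.googleapis.com/patents/redbook/grants/2015/I20150106.tar"}
    tr := tar.NewReader(r)
    for {
        hdr, err := tr.Next()
        if err == io.EOF {
            break
        }
        if err != nil {
            log.Fatal(err)
        }
        fmt.Println(hdr.Name, hdr.Size)
    }
}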

So, now go check out package github.com/jeffallen/seekinghttp and see how we accomplish that.

This package implements not only io.ReadSeeker, but also io.ReaderAt. Why? Because after investigating a remote tarfile, I realized I also wanted to look at a remote zip file, and to do that I needed to call zip.NewReader, which takes both an io.ReaderAt and the size of the file. Why? Because the zipfile format has its table of contents at the end of the file, so the first thing it does is to seek to the end of the file to find the TOC.

A command-line tool to get the table of contents of tar and zip files remotely is in remote-archive-ls.

Dynamic DNS circa 2016

In the old days, if you had an ISP that changed your IP address all the time but you wanted to run a server, you used dynamic DNS, i.e. a hacky script talking to a hacky API on a hacky DNS provider.

These days, if you bring up a cloud server from time to time to work, it is likely to get a different IP address. But you might want a DNS record pointing at it so that it is convenient to talk to.

Same problem, different century.

Here’s my solution for a GCE server and a domain fronted by CloudFlare.

It has nella.org hard coded in it, so YMWDV (your mileage will definitely vary).
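The shape of it is roughly this sketch, with placeholder zone and record IDs instead of my real ones; it assumes the GCE metadata server and the CloudFlare v4 API, and it doesn't bother parsing the response body.

package main

import (
    "bytes"
    "fmt"
    "io/ioutil"
    "log"
    "net/http"
    "os"
)

// externalIP asks the GCE metadata server for this instance's public IP.
func externalIP() (string, error) {
    req, err := http.NewRequest("GET",
        "http://metadata.google.internal/computeMetadata/v1/instance/network-interfaces/0/access-configs/0/external-ip", nil)
    if err != nil {
        return "", err
    }
    req.Header.Set("Metadata-Flavor", "Google")
    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return "", err
    }
    defer resp.Body.Close()
    ip, err := ioutil.ReadAll(resp.Body)
    return string(ip), err
}

func main() {
    ip, err := externalIP()
    if err != nil {
        log.Fatal(err)
    }

    // Placeholder zone and record IDs; look yours up once via the API or dashboard.
    url := "https://api.cloudflare.com/client/v4/zones/ZONE_ID/dns_records/RECORD_ID"
    record := fmt.Sprintf(`{"type":"A","name":"dev.example.org","content":"%s","ttl":120}`, ip)

    req, err := http.NewRequest("PUT", url, bytes.NewBufferString(record))
    if err != nil {
        log.Fatal(err)
    }
    req.Header.Set("X-Auth-Email", os.Getenv("CF_EMAIL"))
    req.Header.Set("X-Auth-Key", os.Getenv("CF_API_KEY"))
    req.Header.Set("Content-Type", "application/json")

    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()
    log.Println("CloudFlare said:", resp.Status)
}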

The most important thing when go-fuzzing

The most important thing to know, when you are using go-fuzz, is that the cover metric should be increasing.

I didn't know that, and I wasted one 12-hour run of fuzzing because my fuzzing function was misbehaving in a way that made it return the same useless error for every input, no matter what. That meant that no matter what go-fuzz mutated in the input, it could not find a way to explore more code, and could not find any interesting bugs. It was trying to tell me this by not incrementing the cover metric it was reporting.

Do not do as I did. Watch that cover is going up before you leave your go-fuzz run to spend hours and hours wasting time.

Learning Swift, sans Xcode

Say you are learning Swift. And like a good fanboi, the first thing you do is update to the latest and greatest because that’s like what you do when you are a nerd.

But you live in Osh, Kyrgyzstan. You have bitchin’ FTTH from Unilink, but access outside of Kyrgyzstan is still limited by the great firewall that Putin has put up in Moscow or whatever. I don’t know, but it’s slow as hell.

So you want to learn Swift, but Xcode is out of commission because it is upgrading. Well, sort of out of commission. It gives an “I’m upgrading” message, but it is still there in /Applications/Xcode.app.

So you can use this script to call Swift as an interpreter, and then you can learn Swift in Emacs, where you should be programming anyway, YOU FOOL.

#!/bin/sh

# Run a Swift source file ($1) with the interpreter buried inside Xcode.app,
# without launching Xcode itself.

xc="/Applications/Xcode.app/Contents"

"$xc/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/swift" \
 -frontend \
 -interpret "$1" \
 -target x86_64-apple-darwin15.2.0 \
 -enable-objc-interop \
 -sdk "$xc/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.10.sdk" \
 -module-name "$(basename "$1" .swift)"
rc=$?

echo
echo "Exit code: $rc"