The SRE book

I gave a Lightning Talk at SREcon16 and I was lucky enough to win the SRE book from Google while I was there.

Here are some notes on things I was thinking while reading it.

First, this is a phenomenal piece of work that really marks a special point in time: the dawn of the possibility of wide adoption of SRE principles. I say “possibility” because, after getting exposed to the deepest details of what makes SRE work, I think that there are lots of organizations that won’t be willing or able to make it work.

Even though I’ve been in IT for decades at this point, I’d fallen into the trap, as an outside observer of Google, of imagining that there was some magic bullet they possessed that made it possible to deliver such enormous services with such high reliability. If someone had asked me to explain that viewpoint before reading this book, I would have bashfully admitted that it was ridiculous to imagine that there are silver bullets in IT operations. Fred Brooks already taught us that.

Now that I’ve read the SRE book, I’ve figured out what the silver bullet is: it’s sweat.

Over and over while reading it, I thought to myself, “well, yeah, I knew that was the solution to the problems I was facing growing Tellme in 2001, but we just weren’t in a position to put in that work”. I’d also think while reading, “man, I can see how that would totally work, but you’d really need an immense amount of goodwill, dedication, and good leadership to make it happen; I’ve been in teams where there’s not a critical mass of team members who sweat the details to make that work.”

So that’s my 10,000-meter takeaway from the SRE book: wow, man, that looks like a lot of work. It is as if Thomas Edison came back to life and restated his maxim: “Reliability is 99.999% perspiration and 0.001% downtime”.

But that’s not the end of the story. The other thing I felt time and time again reading the book was a sense of longing to get back in the game. It made me ready to sweat. The book, told as it is by passionate, proud, smart people who have been sweating in the trenches, is as intoxicating as a CrossFit promotional video.

To outsiders, who think they understand IT operations or software development based just on the English-language definitions of the constituent terms, SRE might look easy. But what I really liked about the SRE book was how, time and again, it talked about how the values of SRE inform the successful approach to a certain problem. When a team needs to introspect on its values in order to choose a way forward, you are no longer in the realm of technology: that’s about culture.

In my last job, from the first moment, my colleagues looked to me to guide the culture of the team. My title was not “tech lead”, but there were some behaviors I knew we needed to be encouraging and I knew how to model them. Reading the SRE book triggered the same instincts in me again. A lot of the info in the SRE book I already had learned in my own way, from my own experiences. But lots of the information was a new take on the old problems I knew about, and inspired me to say, “wow, yes, of course that’s the answer, I’d like to be in a team that was acting like that!”

But the fact that integrating SRE into an organization is a cultural, not technical, affair dooms it to partial, spotty uptake. There will be organizations that don’t have the right kind of cultural flexibility, or the leadership needed to bring people around to SRE. They will carry on with what they are doing, but they will pay the price by forgoing the benefits that Google has shown that SRE can bring to an organization. Their dev teams and ops teams will forever be locked in battle, and the only action item from their postmortems will continue to be “we need more change control meetings”.

I pity the fools.

A Kafkaesque Experiment

As part of my interview prep, last night I challenged myself to do the following:

  • Make a Kubernetes cluster (on Google Cloud Platform)
  • …running Dockerized Zookeeper (1) and Kafka (2)
  • …with Kafka reporting stats into Datadog
  • Send in synthetic load from a bunch of Go programs moving messages around on Kafka
  • Then run an experiment to kill the Kafka master and watch how the throughput/latencies change.

Since that’s a lot of stuff I’ve never touched before (though I’ve read up on it, and it uses all the same general concepts I’ve worked with for 15 years) it should not be too surprising that I didn’t get it done. Yet.

The surprising thing is where I got stuck. I found a nice pair of Docker containers for Zookeeper and Kafka. I got Zookeeper up and running, and I could see its name in the Kubernetes DNS. My two Kafkas were up and running, and they found the Zookeeper via service discovery. So far so good. But then something went wrong with the place where I was going to run clients from; it could not talk to either of the Kafkas via TCP, and the connections timed out. What’s more, I couldn’t be sure that both of my Kafkas were even being advertised by Kubernetes DNS.

(Shower thought after writing this: perhaps my client container was started before the Kafka one, and as a result, it didn’t have the correct container-to-container networking magic set up. It would be interesting to read up on how that works and then debug it to see if I can see the exact problem. Or it might go away the next time I start the containers, this time in the right order. But… how can order matter? This would make it very difficult to operate these things.)

Learning how to debug in the container environment is one of the hardest things. It’s like walking around in a brewery in the dark armed only with a keychain flashlight and your nose, looking for the beer leak.

I think it is time to take a break from container-ville and use small, local Kafka on my Mac to develop the synthetic load generator. That will also be interesting, because I’m hoping to be able to generate spiky, floody flows of messages using feedback from producers to consumers. It is actually something I’ve had in mind for years, and never had the right situation calling on me to finally try it out.

Update: Well the load generator was fun hacking/learning. The final step would be to put it all together. That may come in the future, but for now I’m busy with a trip to New York.

Interview Questions I Hope I Get

I have an interview coming up, and so my “keep in shape hacking time” has been recently devoted to interview preparation. I thought I would make a post about what’s in my head, both as a way to solidify it (no better way to learn something than by teaching it) and in case this interview goes bad, so that my next prospective employer can see what I’m thinking about.

If you, my current prospective employer, are reading this, would you please not take advantage of this by removing these questions from your list? Come on guys, give me a break. If I’m going to be transparent in my thought processes, the least you can do is throw me a bone and ask at least one of these in person!

Question 1: Make something faster.

I think this would be an interesting question. Especially if it was literally asked in these words. The interviewee asks, “make what faster?” The interviewer says, “Whatever. Tell me about making something faster. Whatever you want to tell me about.”

This would be a fun question because it would test the interviewee’s ability to organize his/her thoughts, select the relevant ones, and present them. In fact, it is checking the interviewee’s ability to teach, which is a required skill for someone like me (with a grey beard). You deliver the most value on a team not by pulling 80 hour weeks yourself, but by mentoring people so that the 4 people in your team manage (together) a 160 hour week, instead of the 60-80 hours of useful progress they might have managed without your (theoretically) sage advice.

My answer would be the following:

There are three possibilities for how to make something faster. First, you can not do it at all. Not doing it is an infinite speedup over doing it. How? Caching, usually. And hey, caching is easy. As easy as naming.
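As a minimal sketch of “not doing it at all”, here is a memoizing wrapper in Go. The names (Memo, NewMemo, Get, Calls) are invented for this illustration, and the sketch skips everything that makes real caching hard: invalidation, eviction, and letting concurrent callers proceed without waiting on each other.

```go
package main

import (
	"fmt"
	"sync"
)

// Memo remembers the results of an expensive function so that repeated
// requests for the same key skip the work entirely.
// (Hypothetical type, for illustration only.)
type Memo struct {
	mu    sync.Mutex
	slow  func(int) int
	seen  map[int]int
	calls int // number of times slow actually ran
}

// NewMemo wraps slow in a cache.
func NewMemo(slow func(int) int) *Memo {
	return &Memo{slow: slow, seen: make(map[int]int)}
}

// Get returns slow(k), computing it at most once per distinct k.
// The lock is held during the computation, which keeps the sketch
// simple but serializes all callers.
func (m *Memo) Get(k int) int {
	m.mu.Lock()
	defer m.mu.Unlock()
	if v, ok := m.seen[k]; ok {
		return v // cache hit: the work is not done at all
	}
	m.calls++
	v := m.slow(k)
	m.seen[k] = v
	return v
}

// Calls reports how many times the wrapped function has run.
func (m *Memo) Calls() int {
	m.mu.Lock()
	defer m.mu.Unlock()
	return m.calls
}

func main() {
	square := NewMemo(func(n int) int { return n * n })
	first, second := square.Get(12), square.Get(12)
	fmt.Println(first, second, "slow calls:", square.Calls())
}
```

The second Get for the same key never reaches the slow function, which is the whole trick. Deciding when a remembered answer has gone stale is the part that earns caching its reputation.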

Second, you can do it better, by choosing a better algorithm, representation, or by doing less work. I might retell a story Átila told me about a challenge in which the right answer could be got with less than half the data sorted (he wrote his own quicksort that let him find the answer before finishing the sort, thereby blowing other contestants out of the water; they all did quicksort to completion, then looked for the answer in the sorted list). I also happen to have spotted a place in my prospective employer’s product where I could propose to change the representation, in order to reduce the compute power necessary to parse something. It takes an arrogant little prick to use the company’s own software as an example of what not to do, but depending on the feeling I have with the interviewer, I might do it.
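The contest code itself is not shown here, so the following is only a sketch of the general idea under a stated assumption: that the “answer” was something like the k-th smallest element. The textbook technique for that is quickselect, which partitions like quicksort but only descends into the side that contains position k. Select and partition are names invented for this illustration.

```go
package main

import "fmt"

// Select returns the k-th smallest element of data (k is 0-based),
// reordering data in place. It requires 0 <= k < len(data).
// Unlike a full quicksort, it only keeps partitioning the side of the
// slice that contains index k, so most of the data is never sorted.
func Select(data []int, k int) int {
	lo, hi := 0, len(data)-1
	for lo < hi {
		p := partition(data, lo, hi)
		switch {
		case k == p:
			return data[p]
		case k < p:
			hi = p - 1 // the answer is left of the pivot
		default:
			lo = p + 1 // the answer is right of the pivot
		}
	}
	return data[lo]
}

// partition picks the middle element of data[lo..hi] as the pivot,
// moves smaller elements to its left, and returns the pivot's final index.
func partition(data []int, lo, hi int) int {
	mid := lo + (hi-lo)/2
	data[mid], data[hi] = data[hi], data[mid]
	pivot := data[hi]
	i := lo
	for j := lo; j < hi; j++ {
		if data[j] < pivot {
			data[i], data[j] = data[j], data[i]
			i++
		}
	}
	data[i], data[hi] = data[hi], data[i]
	return i
}

func main() {
	data := []int{9, 1, 8, 2, 7, 3, 6, 4, 5}
	fmt.Println("median:", Select(data, len(data)/2))
}
```

On average this does work proportional to the length of the input, where a full sort does n log n, which is the kind of gap that wins contests.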

Third, you can take the solution you have now, and optimize it. To do that, you measure, profile, tweak, and then measure again. I checked some code into the Go standard library that does that, and one of those would make a nice story.

Question 2: How would you add load shedding to the Go standard library’s HTTP server?

I thought of this question while reading chapter 22 of the new SRE book from Google and O’Reilly.

There are two layers where you could implement it. The “right” way to do it is to arrange for the server to send back a properly formatted 503 error via HTTP. Another way that you could do it would be to accept and close the connection immediately. Which you choose depends on who you think your clients are, and how you think they will respond to the two kinds of errors.

The “accept and close” technique should be capable of more “!QPS” (non-answered queries per second) because it can theoretically handle each reject without spawning a new goroutine. Goroutines are cheap, but they imply at least one mutex (to add the goroutine into the scheduler), some memory pressure (for the stack), and some context switches. Implementing this would be a simple matter of wrapping the net.Listener with our own, which intercepted Accept to do “accept and close” when load shedding is needed.

I just went to read the source, and the way Accept is called, it must return a valid connection and no error. So our load-shedding Accept needs to actually be a loop, where we accept and close until the logic decides it is time to finally let a connection through (i.e. “if load shedding then allow 1 in 100 connections” or whatever), and then we need to return with that connection.
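A minimal sketch of that accept-and-close loop, assuming a simple “let 1 in N through” policy. SheddingListener, KeepOneIn and SetShedding are names invented here, and the fakeListener exists only so the sketch runs without opening a real socket.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"sync/atomic"
)

// SheddingListener wraps a net.Listener. While shedding is switched on,
// Accept closes connections immediately and only returns one in every
// KeepOneIn of them to the caller. (Hypothetical type and policy.)
type SheddingListener struct {
	net.Listener
	KeepOneIn uint32 // while shedding, let 1 connection in KeepOneIn through
	shedding  uint32 // accessed atomically; non-zero means "shed load"
	seen      uint32 // accessed atomically; connections seen while shedding
}

// SetShedding switches load shedding on or off; safe to call from any goroutine.
func (l *SheddingListener) SetShedding(on bool) {
	var v uint32
	if on {
		v = 1
	}
	atomic.StoreUint32(&l.shedding, v)
}

// Accept loops until it has a connection that should be served.
// Errors from the underlying listener are passed through unchanged.
func (l *SheddingListener) Accept() (net.Conn, error) {
	for {
		c, err := l.Listener.Accept()
		if err != nil {
			return nil, err
		}
		if atomic.LoadUint32(&l.shedding) == 0 {
			return c, nil
		}
		n := atomic.AddUint32(&l.seen, 1)
		if l.KeepOneIn > 0 && n%l.KeepOneIn == 0 {
			return c, nil
		}
		c.Close() // shed: no goroutine is spawned, no request is read
	}
}

// fakeListener hands out a fixed number of in-memory connections.
// It is demo scaffolding, not part of the technique.
type fakeListener struct{ left int }

func (f *fakeListener) Accept() (net.Conn, error) {
	if f.left == 0 {
		return nil, errors.New("fake listener drained")
	}
	f.left--
	server, client := net.Pipe()
	client.Close() // nobody uses the far end in this demo
	return server, nil
}
func (f *fakeListener) Close() error   { return nil }
func (f *fakeListener) Addr() net.Addr { return nil }

// countAccepted drains l and reports how many connections got through.
func countAccepted(l net.Listener) int {
	n := 0
	for {
		c, err := l.Accept()
		if err != nil {
			return n
		}
		c.Close()
		n++
	}
}

func main() {
	l := &SheddingListener{Listener: &fakeListener{left: 1000}, KeepOneIn: 100}
	l.SetShedding(true)
	fmt.Println("accepted while shedding:", countAccepted(l))
}
```

In a real server the wrapper would go around the result of net.Listen and be passed to http.Serve in place of the bare listener. Whatever decides when to call SetShedding (CPU load, queue depth, and so on) is the interesting part and is left out here.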

So how to implement the other one? What you want is an http.HandlerFunc wrapper. Except: what if libraries you don’t control are registering handlers? How can you be sure your load shedding wrapper is always the first one called, so that it can send an error 503? I thought I had a solution to this, but it turns out that http.ServeMux is not an interface, but a concrete type. So it is not possible to trap http.ServeMux.Handle, and ensure that the load shedder is always on the front. This one is going to take some more thinking.
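One possible way around the ServeMux problem, sketched under the assumption that the libraries in question register on a mux you can get hold of (typically http.DefaultServeMux): wrap the mux itself rather than the individual handlers, and hand the wrapper to the server as its top-level handler. Shedder, MaxInFlight, newDemoMux and statusFor are names invented for this sketch, and “too many requests in flight” is just one stand-in for an overload signal.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// Shedder sits in front of another http.Handler (for example a whole
// ServeMux) and answers 503 once too many requests are in flight.
// (Hypothetical type; the overload test is an assumption.)
type Shedder struct {
	Next        http.Handler
	MaxInFlight int32 // requests allowed to run concurrently
	inFlight    int32 // accessed atomically
}

func (s *Shedder) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	n := atomic.AddInt32(&s.inFlight, 1)
	defer atomic.AddInt32(&s.inFlight, -1)
	if n > s.MaxInFlight {
		// Tell well-behaved clients when to come back.
		w.Header().Set("Retry-After", "1")
		http.Error(w, "overloaded, try again later", http.StatusServiceUnavailable)
		return
	}
	s.Next.ServeHTTP(w, r)
}

// newDemoMux stands in for a mux that other code has registered handlers on.
func newDemoMux() http.Handler {
	mux := http.NewServeMux()
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "hello")
	})
	return mux
}

// statusFor sends one in-memory GET / through h and returns the status code.
func statusFor(h http.Handler) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest("GET", "/", nil))
	return rec.Code
}

func main() {
	fmt.Println("with headroom:", statusFor(&Shedder{Next: newDemoMux(), MaxInFlight: 10}))
	fmt.Println("overloaded:   ", statusFor(&Shedder{Next: newDemoMux(), MaxInFlight: 0}))
}
```

In a real program the wrapper would be passed to http.ListenAndServe with Next set to http.DefaultServeMux, so every handler registered through http.Handle or http.HandleFunc ends up behind it, whoever registered it. It does nothing for a library that starts its own http.Server, and each shed request still costs a goroutine and a parsed request, which is the trade-off against accept-and-close.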

Of course there’s always monkey patching in Go, which could be used to make HandleFunc and friends always put a load shedder at the front of the call chain. But not in production. Please, God, not in production.

Question 3: A coding challenge I found on Glassdoor

This one is a legitimate coding challenge, and I don’t want to post either the question or the answer here, because that’s just not cool. But it did lead me to an interesting observation…

After a little workout in the coding dojo (I slayed the challenge!) I hit the showers. And in the shower, I had an idea: what if you could interpret the []byte that you get from go-fuzz as input to this code challenge? Then I could fuzz my routine, looking for ways to crash it. I quickly got it working, but ran up against the fundamental problem of fuzzing. When your input is random, for lots of problems, it is hard to check the output to know if it is the right output for the given input. You have one routine right there that could check it, but it is the routine-under-test, and thus using it would be cheating (and pointless). The bottom line, which I already knew, is that fuzzers can only find crashes, not incorrect behavior.
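For illustration, here is roughly what such a harness can look like. The challenge itself is deliberately not reproduced, so challenge below is a made-up stand-in (largest element of a slice), and decoding the fuzzer's bytes as little-endian 16-bit integers is an assumption about the input shape. Only the Fuzz signature and its return-value convention come from go-fuzz.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// decodeInts interprets data as a sequence of little-endian int16 values.
// A trailing odd byte is ignored.
func decodeInts(data []byte) []int {
	out := make([]int, 0, len(data)/2)
	for len(data) >= 2 {
		out = append(out, int(int16(binary.LittleEndian.Uint16(data))))
		data = data[2:]
	}
	return out
}

// challenge is a hypothetical stand-in for the routine under test.
// It returns the largest element of xs and requires len(xs) > 0.
func challenge(xs []int) int {
	best := xs[0]
	for _, x := range xs[1:] {
		if x > best {
			best = x
		}
	}
	return best
}

// Fuzz is the entry point go-fuzz looks for. Returning 1 asks the fuzzer
// to give the input higher priority; returning 0 is neutral.
func Fuzz(data []byte) int {
	xs := decodeInts(data)
	if len(xs) == 0 {
		return 0 // nothing usable decoded from this input
	}
	// A panic in here is what go-fuzz reports as a crash. The result is
	// discarded because there is no independent oracle to compare against.
	challenge(xs)
	return 1
}

func main() {
	// Outside the fuzzer, the harness can be exercised by hand.
	fmt.Println(decodeInts([]byte{1, 0, 0xff, 0xff}), Fuzz([]byte{1, 0, 0xff, 0xff}))
}
```

For an actual go-fuzz run, the Fuzz function would normally live in the package under test rather than in package main; it is in main here only so the sketch runs on its own.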

git log --grep “Résumé”

For a while now, it’s been clear that a useful and important piece of data about how a future colleague might work out is their open source contributions. While the conditions of open source work are often somewhat different than paid work, a person’s manner of expressing themselves (both interpersonally, on issue trackers for example, and in code) is likely to tell you more about their personality than you can learn in the fake environment of an interview.

I am currently starting a job search for a full time job in the Lausanne, Switzerland area (including as far as Geneva and Neuchâtel) with a start date of September 1, 2016. Touching up my résumé is of course on my todo list. But it is more fun to look at and talk about code, so as a productive procrastination exercise, here’s a guided tour of my open source contributions.

I currently work with Vanadium, a new way of making secure RPC calls with interesting naming, delegation and network traversal properties. I believe wide-scale adoption of Vanadium would make IoT devices less broken, which seems like an important investment we need to be making in infrastructure right now.

I have been an occasional contributor to the Go standard library since 2011. Some examples I am particularly proud of include:

  • I profiled the bzip2 decoder and realized that the implementation was cache thrashing. The fixed code is smaller and faster.
  • I noticed that the HTTP implementation (both client and server) was generating more garbage than was necessary, and prototyped several solutions. The chosen solution is in this checkin. This was, for a while, the most intensely exercised code I’ve written in my career; it saved one allocation per header for most headers processed by the billions of HTTP transactions that Go programs do all around the world. It was later replaced with a simpler implementation that was made possible by an optimization in the compiler. The same allocation reduction technique was carried over into Go’s HTTP/2 implementation too, it appears.
  • I improved the TLS Client Authentication to actually work. I needed this for a project at work.

I wrote an MQTT server in Go in order to learn the protocol for work. It is by no means the fastest, but several people have found the simplicity and idiomaticity of its implementation attractive. It ships as part of the device software for the Ninja Sphere.

  • I found an interesting change to reduce network overhead in the MQTT serializer I used. Copies = good, more syscalls and more packets = bad.

I worked on control software for a drone in Go for a while. It still crashes instead of flying, but it was an interesting way to learn about control systems, soft realtime systems, websockets, and Go on ARM.

I have written articles about Go on this blog since I started using it in 2010. Some interesting ones in particular are:

I am also active on the Golang sub-reddit, where I like answering newbie requests for code reviews. I also posed a challenge for the Golang Challenge.

My earliest open source project that’s still interesting to look at today was a more scalable version of MRTG called Cricket. Today tools like OpenTSDB and Graphite are used for the same job, it seems. Cricket still works to the best of my knowledge, but development is dead because of lack of interest from users. Cricket is written in Perl.

My earliest open source project was Digger, from Bunyip. If you can unearth archeological evidence of it from the 1996 strata of the Internet geology, you’re a better digger than I am. It was written in C.

Dynamic DNS circa 2016

In the old days, if you had an ISP that changed your IP address all the time but you wanted to run a server, you used dynamic DNS, i.e. a hacky script talking to a hacky API on a hacky DNS provider.

These days, if you bring up a cloud server from time to time to work, it is likely to get a different IP address. But you might want a DNS record pointing at it so that it is convenient to talk to.

Same problem, different century.

Here’s my solution for a GCE server and a domain fronted by CloudFlare.

It has nella.org hard coded in it, so YMWDV (your mileage will definitely vary).

Learning Swift, sans Xcode

Say you are learning Swift. And like a good fanboi, the first thing you do is update to the latest and greatest because that’s like what you do when you are a nerd.

But you live in Osh, Kyrgyzstan. You have bitchin’ FTTH from Unilink, but access outside of Kyrgyzstan is still limited by the great firewall that Putin has put up in Moscow or whatever. I don’t know, but it’s slow as hell.

So you want to learn Swift, but Xcode is out of commission because it is upgrading. Well, sort of out of commission. It gives an “I’m upgrading” message, but it is still there in /Applications/Xcode.app.

So you can use this script to call Swift as an interpreter, and then you can learn Swift in Emacs, where you should be programming anyway, YOU FOOL.

#!/bin/sh

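# Root of the Xcode bundle (no trailing slash, so the paths below stay clean).
xc="/Applications/Xcode.app/Contents"

# Run the Swift file named by $1 through the compiler frontend's interpreter.
"$xc/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/swift" \
 -frontend \
 -interpret "$1" \
 -target x86_64-apple-darwin15.2.0 \
 -enable-objc-interop \
 -sdk "$xc/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.10.sdk" \
 -module-name "$(basename "$1" .swift)"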
rc=$?

echo
echo "Exit code: $rc"

HTTP/2: Thanks Cloudflare and Go!

Look what happened today:

2015/12/04 11:38:07 fetching https://nella.org
2015/12/04 11:38:08 {200 OK 200 HTTP/2.0 2 0 map[Server:[cloudflare-nginx] Date:[Fri, 04 Dec 2015 05:38:08 GMT] Content-Type:[text/html] Set-Cookie:[__cfduid=d3a3ea49ee46eb6a6803e2eb7f597e26e1449207488; expires=Sat, 03-Dec-16 05:38:08 GMT; path=/; domain=.nella.org; HttpOnly] Vary:[Accept-Encoding] Cf-Ray:[24f529d18893372c-ARN]] 0xc8203bbf60 -1 [] false map[] 0xc8200be000 0xc8206cc420}

Thank you Go 1.6 and Cloudflare. You guys are bringing my website into the bright future of 2016 with no help at all from me. 🙂

Industrial-scale power storage and waste heat

There will, eventually, be a giant wind farm above my house. I say eventually because though Switzerland is not immune from NIMBYism, our court system deals efficiently enough with oppositions so that if something is allowed by law (zoning laws, eco-protection laws, etc) then it does go through. The opposition (and there’s always opposition) does a few court challenges, it goes up a couple layers, sometimes to the supreme court, and the court rather quickly says, “It’s legal, shut up. If you don’t like it, change the laws, don’t come begging us to do so.”

The opposition have two complaints. They choose which one to talk about depending on the context. If they are going for “shock and awe” they use Photoshopped pictures to show how “ugly” the windmills will be. I’m suspicious of their Photoshopping, because though I think the relative sizes are correct, I don’t think the visibility (i.e. brightness of the windmills themselves and the reduction in visibility due to natural haze) in their pictures is correct. Whatever, it’s true, they are giant industrial installations in areas previously used only for grazing and milking cows. But we should recall that raising and milking cows is itself a giant industrial operation with a twice-daily milk run with a diesel-powered truck through this scenic wonderland. Some of the milking barns even run off of polluting diesel generators…

If you tell them, “I don’t mind the windmills, they look like progress to me” (and I have!), then the opposition falls back onto their second line of defense. They say, “windmills do not produce energy when it is needed, so they can’t replace nuclear”.

So, first, that’s a straw-man. No one is talking about nuclear here; if we were we wouldn’t agree anyway because I’m pro-nuclear. I consider nuclear power to be green energy (and the founder of Greenpeace does as well). What I’m interested in is eliminating fossil fuels from electrical generation, and from transport use.

Tesla’s utility-scale batteries are the missing part of the equation. They shift energy from the peak generation time to the time when the energy is needed, making it possible for windmills and solar to contribute to the baseline load that existing dirty electric plants provide. But they have so far been tested in giant, ugly, industrial installations, which even I would not like to see here in my backyard.

So that got me thinking about how the Mollendruz windmills could be hooked up to batteries.

Batteries heat up when they are charged, and part of what’s special about Tesla’s innovations is that they integrate cooling and fire protection into the heart of their battery packs. So a utility-scale battery installation will create utility-scale waste heat. In the current utility-scale batteries, this appears to be dumped into the atmosphere. But remote heating is a mature and well respected technology in Switzerland. Wouldn’t it be interesting to put that waste heat to use heating our schools and government buildings?

I don’t know how near batteries need to be to windmills to be efficient. Windmills generate alternating current, because they are a rotating power source. And AC is easy to step up to a high voltage with a transformer, which keeps transmission losses low. So it seems that putting the batteries where the heat is needed would be ok.

But the batteries are still ugly. What can we do about that? If you are harvesting the heat from them, and they are already engineered to be installed into moving cars and houses, the utility scale batteries can probably be installed indoors. In Switzerland, when we need to put things indoors and we want the land to remain pretty, we put them underground. Near the Col du Mollendruz there’s an old military fort called Petra Felix. I wonder if there’s enough room inside of it to hold the batteries?