The SRE book

I gave a Lightning Talk at SREcon16 and I was lucky enough to win the SRE book from Google while I was there.

Here are some notes on what I was thinking while reading it.

First, this is a phenomenal piece of work that really marks a special point in time: the dawn of the possibility of wide adoption of SRE principles. I say “possibility” because, after getting exposed to the deepest details of what makes SRE work, I think that there are lots of organizations that won’t be willing or able to make it work.

Even though I’ve been in IT for decades at this point, I’d fallen into the trap, as an outside observer of Google, of imagining that there was some magic bullet they possessed that made it possible to deliver such enormous services with such high reliability. If someone had asked me to explain that viewpoint before reading this book, I would have bashfully admitted that it was ridiculous to imagine that there are silver bullets in IT operations. Fred Brooks already taught us that.

Now that I’ve read the SRE book, I’ve figured out what the silver bullet is: it’s sweat.

Over and over while reading it, I thought to myself, “well, yeah, I knew that was the solution to the problems I was facing growing Tellme in 2001, but we just weren’t in a position to put in that work”. I’d also think while reading, “man, I can see how that would totally work, but you’d really need an immense amount of goodwill, dedication, and good leadership to make it happen; I’ve been in teams where there’s not a critical mass of team members who sweat the details to make that work.”

So that’s my 10,000-meter takeaway from the SRE book: Wow, man, that looks like a lot of work. It is as if Thomas Edison came back to life and restated his maxim: “Reliability is 99.999% perspiration and 0.001% downtime”.

But that’s not the end of the story. The other thing I felt time and time again reading the book was a sense of longing to get back in the game. It made me ready to sweat. The book, told as it is by passionate, proud, smart people who have been sweating in the trenches, is as intoxicating as a CrossFit promotional video.

To outsiders who think they understand IT operations or software development based just on the English-language definitions of the constituent terms, SRE might look easy. But what I really liked about the SRE book was how, time and again, it talked about how the values of SRE inform the successful approach to a certain problem. When a team needs to introspect on its values in order to choose a way forward, you are no longer in the realm of technology: that’s about culture.

In my last job, from the first moment, my colleagues looked to me to guide the culture of the team. My title was not “tech lead”, but there were some behaviors I knew we needed to be encouraging and I knew how to model them. Reading the SRE book triggered the same instincts in me again. A lot of the info in the SRE book I already had learned in my own way, from my own experiences. But lots of the information was a new take on the old problems I knew about, and inspired me to say, “wow, yes, of course that’s the answer, I’d like to be in a team that was acting like that!”

But the fact that integrating SRE into an organization is a cultural, not technical, affair dooms it to partial, spotty uptake. There will be organizations that don’t have the right kind of cultural flexibility, or leadership that is able to bring people around to SRE. They will carry on with what they are doing, but they will pay the price by forgoing the benefits that Google has shown that SRE can bring to an organization. Their dev teams and ops teams will forever be locked in battle, and the only action item from their postmortems will continue to be “we need more change control meetings”.

I pity the fools.

A Kafkaesque Experiment

As part of my interview prep, last night I challenged myself to do the following:

  • Make a Kubernetes cluster (on Google Cloud Platform)
  • …running Dockerized Zookeeper (1) and Kafka (2)
  • …with Kafka reporting stats into Datadog
  • Send in synthetic load from a bunch of Go programs moving messages around on Kafka
  • Then run an experiment to kill the Kafka master and watch how the throughput/latencies change.

Since that’s a lot of stuff I’ve never touched before (though I’ve read up on it, and it uses all the same general concepts I’ve worked with for 15 years), it should not be too surprising that I didn’t get it done. Yet.

The surprising thing is where I got stuck. I found a nice pair of Docker containers for Zookeeper and Kafka. I got Zookeeper up and running, and I could see its name in the Kubernetes DNS. My two Kafkas were up and running, and they found the Zookeeper via service discovery. So far so good. But then something went wrong with the place where I was going to run clients from: it could not talk to either of the Kafkas via TCP, and the connections timed out. What’s more, I couldn’t be sure that both of my Kafkas were even being advertised by Kubernetes DNS.

(Shower thought after writing this: perhaps my client container was started before the Kafka one, and as a result, it didn’t have the correct container-to-container networking magic set up. It would be interesting to read up on how that works and then debug it to see if I can see the exact problem. Or it might go away the next time I start the containers, this time in the right order. But… how can order matter? This would make it very difficult to operate these things.)

Learning how to debug in the container environment is one of the hardest things. It’s like walking around in a brewery in the dark armed only with a keychain flashlight and your nose, looking for the beer leak.
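One way to bring a slightly bigger flashlight: the two symptoms above (is the name in DNS at all, and does a TCP connection to it time out) can be checked separately from inside the client container. The following is only a sketch of that idea, not what I actually ran. The service names `kafka` and `zookeeper` are hypothetical placeholders for whatever the Kubernetes services are really called, and the ports are assumed to be the Kafka and Zookeeper defaults (9092 and 2181).

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// probe resolves host, then tries a TCP connection to each resolved address.
// It returns the addresses DNS handed back and the first error encountered,
// so a DNS failure and a TCP failure show up as different errors.
func probe(host, port string, timeout time.Duration) ([]string, error) {
	addrs, err := net.LookupHost(host)
	if err != nil {
		return nil, fmt.Errorf("dns lookup for %q failed: %w", host, err)
	}
	for _, addr := range addrs {
		target := net.JoinHostPort(addr, port)
		conn, err := net.DialTimeout("tcp", target, timeout)
		if err != nil {
			return addrs, fmt.Errorf("tcp connect to %s failed: %w", target, err)
		}
		conn.Close()
	}
	return addrs, nil
}

func main() {
	// Hypothetical service names; ports are assumed to be the defaults.
	targets := map[string]string{"kafka": "9092", "zookeeper": "2181"}
	for host, port := range targets {
		addrs, err := probe(host, port, 2*time.Second)
		fmt.Printf("%s -> %v (err: %v)\n", host, addrs, err)
	}
}
```

If the lookup fails, the problem is on the service-discovery side; if the lookup succeeds but the dial times out, the name is advertised and the problem is somewhere in the network path between the containers.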

I think it is time to take a break from container-ville and use small, local Kafka on my Mac to develop the synthetic load generator. That will also be interesting, because I’m hoping to be able to generate spiky, floody flows of messages using feedback from producers to consumers. It is actually something I’ve had in mind for years, and never had the right situation calling on me to finally try it out.
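To make the feedback idea a bit more concrete, here is a minimal sketch of one possible shape for it, with no Kafka involved at all. It is an assumption about how such a generator could work, not the generator I ended up writing: the producer doubles its burst size each tick while the consumer keeps up, and collapses back to a trickle once the reported backlog crosses a limit. All names and parameters (`nextBurst`, `simulate`, `lagLimit`, `maxBurst`, `drainPerTick`) are made up for illustration, and the "topic" is just a counter.

```go
package main

import "fmt"

// nextBurst picks the size of the next burst from the backlog the consumer
// reported. While the backlog stays at or below lagLimit the burst doubles
// (capped at maxBurst); once it exceeds lagLimit the burst drops back to 1.
func nextBurst(current, backlog, lagLimit, maxBurst int) int {
	if backlog > lagLimit {
		return 1
	}
	next := current * 2
	if next > maxBurst {
		next = maxBurst
	}
	return next
}

// simulate runs the feedback loop for a number of ticks against a consumer
// that drains at most drainPerTick messages per tick. It returns the burst
// size the producer sent on each tick.
func simulate(ticks, drainPerTick, lagLimit, maxBurst int) []int {
	bursts := make([]int, 0, ticks)
	burst, backlog := 1, 0
	for i := 0; i < ticks; i++ {
		bursts = append(bursts, burst)
		backlog += burst // producer floods the topic
		if backlog > drainPerTick {
			backlog -= drainPerTick // consumer drains what it can
		} else {
			backlog = 0
		}
		burst = nextBurst(burst, backlog, lagLimit, maxBurst)
	}
	return bursts
}

func main() {
	// 12 ticks, consumer drains 4 per tick, back off above a backlog of 8.
	fmt.Println(simulate(12, 4, 8, 64))
}
```

With these particular numbers the bursts come out as 1, 2, 4, 8, 16, 1, 1, 1, 2, 4, 8, 1: a ramp, a flood, a quiet spell while the consumer catches up, and then another ramp. In a real version the backlog would come from consumer lag reported back to the producer rather than from a local counter.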

Update: Well, the load generator was fun to hack on, and I learned a lot. The final step would be to put it all together. That may come in the future, but for now I’m busy with a trip to New York.