I gave a Lightning Talk at SREcon16 and I was lucky enough to win the SRE book from Google while I was there.
Here are some notes of things I was thinking while reading it.
First, this is a phenomenal piece of work, that really marks a special point in time: the dawn of the possibility of wide adoption of SRE principles. I say, “possibility” because after getting exposed to the deepest details of what makes SRE work, I think that there are lots of organizations that won’t be willing or able to make it work.
Even though I’ve been in IT for decades at this point, I’d fallen into the trap, as an outside observer of Google, to imagine that there was some magic bullet they possessed that made it possible to deliver such enormous services with such high reliability. If someone had asked me to explain that viewpoint before reading this book, I would have bashfully admitted that it was ridiculous to imagine that there are silver bullets in IT operations. Fred Brooks already taught us that.
Now that I’ve read the SRE book, I’ve figured out what the silver bullet is: it’s sweat.
Over and over while reading it, I thought to myself, “well, yeah, I knew that was the solution to the problems I was facing growing Tellme in 2001, but we just weren’t in a position to put in that work”. I’d also think while reading, “man, I can see how that would totally work, but you’d really need an immense amount of goodwill, dedication, and good leadership to make it happen; I’ve been in teams where there’s not a critical mass of team members who sweat the details to make that work.”
So that’s my 10000 meter take-away from the SRE book: Wow, man, that looks like a lot of work. It is as if Thomas Edison came back to life and restated his maxim: “Reliability is 99.999% perspiration and 0.0001% downtime”.
But that’s not the end of the story. The other thing I felt time and time again reading the book was a sense of longing to get back in the game. It made me ready to sweat. The book, told as it is from passionate, proud, smart people who have been sweating in the trenches, is as intoxicating as a Crossfit Promotional Video.
To outsiders, who think the understand IT operations or Software Development based just on the English-language definitions of the constituent terms, SRE might look easy. But what I really liked about SRE book was how time and again through the book, it talked about how the values of SRE inform the successful approach to a certain problem. When a team needs to introspect on its values in order to choose a way forward, you are no longer in the realm of technology: that’s about culture.
In my last job, from the first moment, my colleagues looked to me to guide the culture of the team. My title was not “tech lead”, but there were some behaviors I knew we needed to be encouraging and I knew how to model them. Reading the SRE book triggered the same instincts in me again. A lot of the info in the SRE book I already had learned in my own way, from my own experiences. But lots of the information was a new take on the old problems I knew about, and inspired me to say, “wow, yes, of course that’s the answer, I’d like to be in a team that was acting like that!”
But the fact that integrating SRE into an organization is a cultural, not technical, affair dooms it to partial, spotty uptake. There will be organizations that don’t have the right kind of cultural flexibility and leadership who is able to bring people around to SRE. They will carry on with what they are doing, but they will pay the price by forgoing the benefits that Google has shown that SRE can bring to an organization. Their dev teams and ops teams will forever be locked in battle, and the only action item from their postmortems will continue to be “we need more change control meetings”.