My radical idea: How to be a Site Reliability Engineer
I want to pass on whatever I've learned about SRE. I need your help to make sure it's useful.
You can also find this on YouTube
One of the more exciting parts of my job has always been sharing the experience I’ve either gained first-hand or pieced together from blogs, research and other people’s insights. I do this quite a lot internally wherever I’m working, and occasionally try and share this with the broader community (e.g. SLOConf, SRECon, Increment and so on).
As Covid-19 hit, the opportunities to go to conferences and meet new colleagues in person were (rightfully) dramatically limited, and we waited patiently for the world to reopen. In the meantime, I started playing with video equipment. This led to videos for my org about being on-call, observability and distributed tracing. The videos seemed to help those new to these topics and became part of the standard Zalando onboarding process.
I got to do this a few times, but the videos were expensive to make: about 100 hours each. As you can imagine, there are many demands on engineers’ time and making videos was not one of them ;D Still, I felt like we were under-leveraging the innovation and capabilities of our engineering community. If we could better empower them with some of our senior community's hard-won knowledge, we could make our environment more productive, healthier and happier. It sat in the back of my mind for two years, an itch that never went away. I kept hacking on making prettier and prettier video content. While I am no YouTube sensation, I don’t think there’s much more I could do without dedicated focus.
The inflection point
My work in the intervening period has been tremendously satisfying. I believe that the peers I worked with and I substantially improved the customer experience and the quality of life of our engineering community. Recently, however, some organizational restructuring left me facing a choice, a moment of inflection. I could do what I’ve done in the years previously, or I could try and scratch that itch.
I talked to my wife, colleagues and peers, and after some challenging reflection, I decided to try and scratch that itch. I’m going to try and make an SRE course.
I start next week. I have between now (August 28th) and November 30th to deliver. My work with my previous employer has ended, leaving me with no focus except to execute this course. By December 1st, I must have found and started on a new opportunity.
There’s no other opportunity that will present itself like this one. Let’s see what we can do!
Wait, I’m doing what?! That’s right! I have no idea, either. You can follow this adventure on this mailing list.
The course aims to empower engineers whose careers sit somewhere between “software engineer” and “staff engineer” to understand how to make a reliable customer experience without sacrificing the health and well-being of the engineers who strive to make that experience work. After doing the course, I would expect learners to be able to:
Know what "Site Reliability Engineering" is, the areas that it covers and the broad approach it takes
Be able to make software “observable”
Be able to deploy software to "the cloud"
Understand how to detect and respond to critical production issues confidently and collaboratively
Understand the different operational tradeoffs (money, on-call health, developer velocity and reliability)
Know different strategies to consult with product delivery teams
To do the course, you will need some software development experience. Where the course makes specific references to a language, I’ll be using Go. We’ll cover a broad range of topics, such as how:
Do systems break?
Can we design systems?
Do we validate our software before release?
Do we release our software safely?
Can we modify our software safely at runtime with configuration?
Do we understand what our system is doing in production?
Do we respond to production incidents? Learn from them afterwards?
Can we articulate and set guidance on operational tradeoffs?
Can we work with other teams specifically on reliability topics?
The goal is to cover as much of this material as I can within 8 hours of your time (or less), making sure you get as much value as possible. So far, I intend to publish it on Udemy for the company-friendly price of 199 €.
Making this a reality
I am genuinely excited about seeing if I can make this work, but there’s still an enormous amount of challenge ahead. I’m going to need some help to make this a success.
There are a few ways you could help if you so desire:
You’ll notice that recently I’ve been posting material on this blog, on YouTube, on LinkedIn, and so forth. This is practice material: I am learning what I must do to make a compelling course. I will continue to do this over the next 12 weeks (and hopefully beyond), but this is where I need your feedback most. Things like:
Is it valuable content for you? If yes, why? If not, why not?
Are you able to follow my line of thinking? Are there ways I could make this more straightforward for you to understand?
Is the way it’s presented useful? Do you prefer written content, video content, shorts or anything else I haven’t thought of?
Is the content accurate and comprehensive? Am I missing anything?
You should comment on the material directly wherever you see it. You should also be very direct. You will be doing me a favour; the internet is an unforgiving place, and I need your perspectives early to help me evolve ♥
If you find the content interesting, let’s chat about it! Ask questions, raise comments, disagree or otherwise engage with the material.
This solves two substantial problems:
It will help me connect with you. As I make this transition, these connections are just vital for me.
I will understand you more. The more I understand you, the better I can craft helpful content.
Lastly, if you find it useful, subscribe. When the course comes out, buy it. Ultimately, this will determine whether I will make more of this material or focus on other areas.
As a bonus, I will publish a coupon code for subscribers of this mailing list that will give you a substantial discount in case your org does not give this much love to learning and development.
What’s that? Cheap stuff?! But only for subscribers. Join now!
Let’s get to it
This is an urge I’ve had for a good couple of years now. The moment has come when I have the opportunity to pursue it, and while it is terrifying, I am committed to delivering on it. The material is designed to help you engineer reliable systems or drive reliability commitments within your organization.
I won’t be able to do it alone, and I will frequently stumble along the way. Your help will be vital to making this a success. You can do this simply by giving me feedback, engaging with the material or buying in when you see the opportunity.
The project ahead feels like an enormous mountain to scale. That said, the mountain gets no smaller, and we get nowhere by thinking about how to scale it. Let’s get to it!