Suddenly, you're in charge. Now what?

Some tools and techniques I've used in the past to create a common sense of direction and purpose.

Sep 01, 2023

This one is a pretty long read. You can find it broken up into chunks on YouTube! More of a listener? I’ll upload an audio version shortly.

I’ve found myself suddenly responsible for a large group of technical people a few times. I’ve done this in preparation for CyberWeek, driving reliability projects across the checkout and in the formation of my team. I want to talk today about some of the tools and techniques I use to create a shared understanding, sense of purpose and direction in the people I am working with to deliver … something or other.

The sharpest, hardest lesson I learned was with the creation of Embedded SRE. In 2019, we hired four engineers with a broad mandate: “Improve the reliability of the transactional experience”. We’d be four people working with over 160 engineers, managers, technical experts and other contributors. We somehow had to come to terms with this and develop some semi-reasonable plan to make that area more reliable. For the first five months, we essentially spun our wheels: we did some operational work, but nothing substantial.

Fortunately, the engineering director in this domain, Andrei Gherasim, came to my rescue. He and I collaborated on a document called “Mission and Strategy 2022”. This document covered the scope, purpose, engagement model and broad deliverables my team would deliver over the next nine months. We kept reviewing it until we had it in good shape, after which we started sharing it with the engineers, directors and VPs my team would ultimately engage with.

That document changed the course of my team. Suddenly, rather than needing to negotiate each commitment individually with the stakeholders, we had a transparent contribution model and a path to choose when (and when not) to intervene. We had expectations to fulfil ourselves, and leadership within each domain could fold us into their operational plan. That strategy worked beautifully — for about a month or so. Then, a new operational challenge popped up, and we suddenly had too much to deliver on. We half-delivered a bunch of stuff.

2023 rolled around, and we did a review of that strategy. We identified what assumptions we had initially that didn’t bear out and what underlying theory we had that turned out to be wrong. It was a hard one to go through. I wrote up a new strategy that factored in our challenges in 2022. This strategy was more successful than the previous one; we made systems substantially more reliable. The process had room for improvement, but this structured approach was vital for my learning and the team's evolution.

Recently, I’ve been writing up another strategy for the course. It has helped me shape my approach, and I’ve iterated on it quite a few times while it's still on paper. Given this, I thought I’d share my general approach to disambiguating the unknown based on the last couple of years of figuring this out for my commitments. Who knows, maybe it’ll save you months of stress as it did me!

The Problem

To figure out how to write a strategy, we must first understand the problem we want to solve by writing a strategy. In his book “Good Strategy, Bad Strategy: The Difference and Why It Matters,” Richard Rumelt squares this away on page #1:

“The core of strategy work is always the same: discovering the critical factors in a situation and designing a way of coordinating and focusing actions to deal with those factors”

In my experience, an organisation (whether it’s a team, department or whole company) is organised around a small set of fundamental problems. Substack lets independent writers and podcasters publish directly to their audience and get paid through subscribers. Netflix enables people to watch the latest TV immediately on their devices. In the upcoming Site Reliability Engineering courses, I want to empower engineers to run software reliably through a reference approach to software operations.

There are usually many, many different ways to solve a problem. A large subset of these will get the job done at some point, and a smaller subset will be as efficient — if we execute that idea exclusively. To solve a problem, everyone needs to rally behind a single approach. The correct method for any organisation depends on its strengths, weaknesses, capabilities and challenges. So, the next step is to figure those out!

As a necessary caveat, I’m still learning to write “strategies” and lead people. It’s a fun adventure, but some positions might surprise you if you’re a more experienced leader. That’s fine — just correct me in the comments, and I’ll have learned more!

SWOT

In 1965, three colleagues (Robert, Otis and Arnold) designed SWOT to appraise an organisation's strengths, weaknesses, opportunities and threats. While it is a little older, it is still handy for inventorying what we need to consider when crafting our strategy. It is also a fantastic way to open our strategic planning to a larger community, sourcing perspectives from diverse people across job roles.

A SWOT analysis consists of a 2x2 grid. In this grid, we label a section for each component:

Strengths: Characteristics of the organisation that give it an advantage and can be leveraged
Weaknesses: Characteristics of the organisation that give it a disadvantage and need to be overcome
Opportunities: Something in the environment that we can exploit for an advantage
Threats: Something in the environment that poses a risk to the organisation

For solo projects, a landscape document is acceptable; for collaborative work, a whiteboard (or virtual whiteboard) to which “sticky notes” can be attached works well. I have more valuable insights when doing something other than my everyday work (such as cleaning the house or walking to the shops), so I stick it somewhere accessible and add to it over a few days.

Let’s take, for example, my current goal to enable engineers to run software reliably through a reference approach to software operations. I have some strengths, such as extensive experience running software and helping 160 teams at a large eCommerce company do the same. I have a small social media following, access to multiple engineering communities, relationships with many engineering leaders in this domain across Berlin and some experience creating content. I also have some weaknesses, such as challenges focusing on a topic for an extended time, inexperience in producing educational content, no experience in marketing or monetising this content and the isolation that comes with working solo on a project. There are opportunities to make the course material as helpful as possible through collaboration with peers in industry and at conferences, and to determine whether the problems I see running software are as substantial as I imagine. Lastly, there are threats to this project, including a limited time period and the parallel requirement to find a new job to start in December or January.
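If you prefer a text file to a whiteboard, the same inventory can be kept as a small data structure. The sketch below is only one way to do it: the `Swot` class and its `render` method are names I made up for illustration, and the entries are shortened versions of the course example above.

```python
from dataclasses import dataclass, field


@dataclass
class Swot:
    """A 2x2 SWOT inventory; each quadrant is a list of short notes."""
    strengths: list[str] = field(default_factory=list)
    weaknesses: list[str] = field(default_factory=list)
    opportunities: list[str] = field(default_factory=list)
    threats: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Return the four quadrants as plain text, one section each."""
        sections = [
            ("Strengths", self.strengths),
            ("Weaknesses", self.weaknesses),
            ("Opportunities", self.opportunities),
            ("Threats", self.threats),
        ]
        lines = []
        for title, notes in sections:
            lines.append(title)
            lines.extend(f"  - {note}" for note in notes)
        return "\n".join(lines)


# Entries paraphrased from the course example in the text.
course = Swot(
    strengths=[
        "Extensive experience running software",
        "Small social media following",
        "Relationships with engineering leaders across Berlin",
    ],
    weaknesses=[
        "Difficulty focusing on one topic for an extended time",
        "Inexperience producing educational content",
        "No marketing or monetisation experience",
        "Isolation of working solo",
    ],
    opportunities=[
        "Collaboration with peers in industry and at conferences",
    ],
    threats=[
        "Limited time period",
        "Parallel job search for a December or January start",
    ],
)

print(course.render())
```

Keeping it in a file like this makes it easy to add a note whenever one occurs to you over a few days.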

With the SWOT analysis, we have a good index of the organisational capabilities. This is great! The next step is to start thinking about the actual strategy.

💡 You can try this in your workplace. You can do it alone or with a team; take a moment and figure out what strengths, weaknesses, opportunities, and threats exist within your organisation. Once you’ve written this up, overlay this with your current strategy or the work you’ve been doing over the last three months. Are you maximising your strengths? Overcoming your weaknesses? Are there opportunities in the next weeks you can take advantage of? Challenges you should be preparing yourself for?

Story #2: Site Reliability Engineering

I’ve seen a lot of different “strategy documents”. They can vary widely in their ability to empower an organisation to solve their core problem, but one example sticks out in my mind: The formation of the Site Reliability Engineering department. I want to share this story as it was especially pivotal for my understanding, and I want you to understand the power of a good strategy.

In 2019, the organisation at which I was working did a reorganisation of central technology teams. It brought a series of teams together, including the “incident management” team, the “logging” team, the “SRE enablement” team and the “visibility” team, all under a leader called Luis Mineiro. The reorganisation made sense, as the teams were all oriented around ensuring engineers could understand and operate their software. The teams had worked together in the past but had very different ideas as to what the future for their teams should look like and little notion as to how they should work together for a shared purpose.

Enter: “The Site Reliability Engineering Strategy 2020”. Luis wrote this document in collaboration with the engineering managers of each of those teams and created a shared understanding of the vision and purpose of the department. It rallied the teams together to make their work substantially more cohesive and clarify what the department would do and what it would not. Consequently, the department was able to ship remarkable changes quickly, restructuring incident response, rebuilding the time-series system and delivering a new model for thinking about time-series data.

The question that didn’t occur to me until I created my team was why this strategy was so successful. Why did engineering managers buy in so heavily? Where did it enable decisions, and how? Unfortunately, I subsequently left the company and cannot read it to be sure!

Earlier, we outlined the fundamental problem a strategy needs to solve: discovering the critical factors in a situation and designing a way of focusing actions to deal with those factors. There are a few things we need to have to complete that design. Let’s get out a document and start making notes!

The Problem

The first is a shared understanding of the problem. It might be surprising to think about this. After all, we’re all here for the same purpose! However, people can have different perspectives on what we’re trying to achieve and why. Given this, the first thing to do is to summarise the problem we’ll be working to solve. That summary should cover the customers who have this problem, what that problem is, how it affects those customers and what the scope of the problem we’re considering is.

Let’s stick with the course example. We might start with, “We need to teach engineers site reliability engineering”. That feels like a problem, but in fact, it is only the absence of the solution we intend to build. Which engineers have this problem? How many of them are there? What is their experience? Why does this work even matter? Let’s rephrase the problem a little. “All software companies need to ensure an acceptable level of reliability, and there are effective mechanisms to define that acceptable level and make tradeoffs to ensure that it is met. However, these mechanisms are not widely known, and companies must discover them independently. Because engineers are developing these approaches in parallel, we end up with competing approaches that are inefficient or a solution that comes at the cost of a greater business outcome.”

Now, we have a clearer picture of why our work matters and a more extensive range to draw solutions from. Courses are one approach, but so might be building a product to commodify knowledge or defining a standard organisations should adopt. It also defines what we will not solve — we will not solve the definition of business requirements, analytics or other software requirements, just the reliability tradeoff.

Success Criteria

As a corollary to the problem, we need to know how we will measure progress against this problem. The measure should be something that does not limit the potential solutions and that tolerates changes in approach as we figure out a more efficient way to solve our problem.

Let’s stick with the course example. Our fundamental goal is to help organisations improve reliability, so let's try and measure that! Unfortunately, we are unlikely to maintain access to organisations' data directly, but we can certainly ask them three months after completing the course whether their reliability has improved. In addition, as we intend to teach learners, we can ask them three months after completing this material whether or not it helped.

Once we have a way of validating our strategy, we need to set up a routine to check whether or not we’re delivering what we need to. A good default is to check in on these monthly, comparing the current position with the past month and, if possible, the same position last year.
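As a rough sketch of what that monthly check-in could look like for the course example, the snippet below compares the share of follow-up survey respondents who said reliability improved against last month and the same month last year. Everything here is an assumption for illustration: the `SurveyResult` structure, the `compare` function and especially the numbers, which are made-up placeholders and not real survey data.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SurveyResult:
    """Follow-up survey answers collected in one month (hypothetical shape)."""
    month: str      # label only, e.g. "this month"
    responses: int  # how many people answered
    improved: int   # how many said reliability improved

    @property
    def improved_share(self) -> float:
        """Fraction of respondents reporting an improvement."""
        return self.improved / self.responses if self.responses else 0.0


def compare(current: SurveyResult,
            last_month: SurveyResult,
            last_year: Optional[SurveyResult] = None) -> dict:
    """Return the change in improved share, in percentage points."""
    def delta(previous: SurveyResult) -> float:
        return round((current.improved_share - previous.improved_share) * 100, 1)

    report = {"vs_last_month": delta(last_month)}
    if last_year is not None:
        report["vs_last_year"] = delta(last_year)
    return report


# Placeholder numbers, invented purely to show the comparison.
current = SurveyResult("this month", responses=40, improved=30)
previous = SurveyResult("last month", responses=50, improved=30)
year_ago = SurveyResult("same month last year", responses=20, improved=10)

report = compare(current, previous, year_ago)
print(report)
```

The point is less the code than the habit: the same question, asked the same way, on a fixed schedule.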

💡 Try this in your workplace. Write up your understanding of the problem your organisation is designed to solve and how you’re validating that it's solved. If you have a patient colleague, ask them to write theirs down in parallel. It doesn’t have to be long — 100 words is great! Then, compare. Did they overlap? What were the differences? What happens when you share it with your team?

Organisational Theories

So far, we have an analysis of our capabilities, a clear problem definition and the criteria by which we’re going to measure if we’re making progress. The next step is to write up our theory as to what might be contributing to the problem, and what might be effective interventions to try and address it. It's the basis on which we’ll make “strategic decisions” later.

Sticking with our course example, the problem we want to solve is that organisations need to maintain an amount of reliability, but there’s no common path to do that. Engineers need to discover the solutions to this problem on their own. Some theories that might contribute to this include:

Giving engineers a good operations model will improve reliability. If the engineers know how to engineer reliable systems, they’ll choose to do so.
Multiple colleagues doing the same training are more likely than one colleague to drive change. Team members collaborate on new ideas much more efficiently if they’ve learned that new idea at the same time, and can collectively discuss how to implement it.
Providing a shared reliability model means a path to shared evolution. Rather than every engineer discovering operations for themselves, we can give them all a “standard reference”. Even if they disagree, they have something explicit to disagree with.

These theories will either be correct or not. But by making them explicit, we can be clear about why we’re making the strategic decisions that we will. Additionally, we can work to validate these theories as we execute the strategy, abandoning the theories that do not match reality (or abandoning the strategy if the theories are sufficiently incorrect).

We need to review these theories for evidence; a good cadence for this is once every six months.

Capabilities & Constraints

We have our problem, measurement and organisational theories. We now have as much opportunity to be creative as we can be! Unfortunately, while it is fun to be boundlessly creative, we are all bound by the capabilities we have access to or the constraints we need to meet, the same ones from the SWOT analysis earlier.

We should document them in the strategy unless the SWOT document is otherwise especially legible. We do not have to elaborate on them. We can simply restate them from the SWOT:

Capability: An instructor with over ten years of experience in writing software and maintaining and operating that software in a production environment
Capability: A series of social media profiles that can provide snacks of this content, with the hope people will choose to purchase the whole material.
Constraint: The time to write the courses is limited, with only 12 weeks (between September 1st and November 30th) available.

This is useful to reevaluate as we evaluate the rest of our strategy in 6, 12 or 24 months. If we develop new capabilities or meet new constraints, we can factor them in and adjust our strategic choices.

Tradeoffs

Whew! We’ve done a lot of investigative work so far. We’ve done an analysis of our strengths and weaknesses, defined and aligned on the problem and how we’ll measure it and created some theories around why that problem exists. We’ve been clear about our opportunities and constraints. It’s time to make some decisions.

Let’s stick with our course example. We have a hard constraint of time. We’re thus making a broad choice as to how to spend that time. Our first tradeoff might be:

We will focus on the delivery of limited material for rapid feedback with our community instead of publishing a large release at once. We will revise and expand content that appears especially helpful to learners.

In and of itself, the tradeoff is not a decision — rather, it's something that we should consider as we’re making other decisions. Other tradeoffs might be:

We will focus on providing value by creating compelling material, rather than on the delivery of that material itself. We will work actively with third-party providers where they provide a suitable mechanism to deliver.
We will focus on a limited subset of the material if that proves especially valuable rather than aim for material completeness.

I would expect tradeoffs to be much longer lived than any decisions that we make as part of a strategy. Tradeoffs only tend to be adjusted when the underlying conditions of the organisation change, such as the time horizon on which the organisation is focused or the core problem the organisation faces.

💡 Try this in your workplace. All organisations are making tradeoffs all the time. See if you can look over several major decisions leaders have made in your organisation. Is there a common theme? Can you write down this theme and validate it with your peers? Can you figure out whether that tradeoff is by design or accidental?

Constraining Choices

Hooray! We’re about to make some strategic choices, the most exciting part of any strategy. However, this is also easily the most challenging part of a strategy to execute well. The primary challenge is not the decisions themselves but figuring out how to layer decisions in a way that gives an organisation direction but does not constrain its creativity.

Let’s think first about a team of people. The value in dividing an organisation into “teams” is that each can make a series of decisions with some independence from other teams; that’s how the organisation can get a lot done quickly. Let’s think about an organisation of roughly 100 people, organised into “two-pizza” teams of four to five people. There might be 15 delivery teams of 5 people and a manager. Those managers will need to be managed by an additional three managers (which we’ll call “department heads”), which, in turn, need another manager (“director”). We might have six senior technical team members (“staff engineers”) and one project manager.

That means we have at least 19 managers who need to make coordinated decisions across multiple layers of management, without counting any technical experts who inform these decisions or teams who are expected to own and deliver on them! There is a tradeoff between an organisation that is aligned and an organisation that is fast-moving. Our goal is to create a strategy that allows an organisation to be as aligned as necessary while moving as fast as necessary.
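The headcount is easy to check with a few lines of arithmetic. This is just a sketch using the numbers from the example organisation; the variable names are mine.

```python
# Numbers taken from the example organisation in the text.
DELIVERY_TEAMS = 15
ENGINEERS_PER_TEAM = 5
DEPARTMENT_HEADS = 3
DIRECTORS = 1
STAFF_ENGINEERS = 6
PROJECT_MANAGERS = 1

team_managers = DELIVERY_TEAMS  # one manager per delivery team
engineers = DELIVERY_TEAMS * ENGINEERS_PER_TEAM

# Everyone who has to make coordinated management decisions.
managers = team_managers + DEPARTMENT_HEADS + DIRECTORS

headcount = engineers + managers + STAFF_ENGINEERS + PROJECT_MANAGERS

print(f"engineers: {engineers}")  # engineers: 75
print(f"managers:  {managers}")   # managers:  19
print(f"headcount: {headcount}")  # headcount: 101
```

So the example lands at about 100 people, and roughly one in five of them is a manager whose decisions need to line up with everyone else's.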

The trick is to make decisions that constrain choices just enough to ensure alignment but that leave the next management layer as much room for creativity as possible. This can be extremely challenging and requires balancing freedom and accountability in the management layer.

For the example of the course, it’s a little easier — it’s just me writing these courses! As organisations get larger and larger, the balance gets harder and harder to strike. The course decisions include:

Provide multiple, small courses teaching Site Reliability Engineering. This enables us to validate our organisational theories with relatively minimal work and meets the tradeoff of delivering a limited amount of material for rapid feedback.
Provide learners connections with communities of practice. This enables us to validate the theory that multiple learners executing in parallel will be more effective change agents than a single learner — even if those learners are not in the same organisation.

That’s it! We can figure out anything more specific than that over the coming weeks.

💡 Try this in your workplace. See if you can figure out how your top-level leader balances autonomy and alignment in your management chain. At what layers are which decisions made? What are the controls that keep things aligned? Are there multiple layered strategies?

Review & FAQ

Gosh, wasn’t that a blast! We’ve now completed our strategy.

Unfortunately, the work around the strategy has only begun. A strategy is only ever solid if it prompts a shared understanding that you, as a leader, can use to drive organisational efficiency. The first step to doing this is getting it in front of people.

The first and counterintuitive step is to use yourself as the first reviewer. To start, put the strategy away. Hide the tab, put it in a drawer, or rename it to something opaque. Then, a week later, come back. Print it out, get a marker and start going through it. You’ll be surprised at how much your thinking has already evolved. Go through and correct it. Then, get a recorder, read the document aloud and play it back to yourself while going for a walk or similar. As before, you’ll notice things that you missed.

The next challenge is getting people to read it. Mostly, people will only read some way through a longer strategy! Reduce your larger document to a 1-page strategy summary so that people read the whole thing before you ask for their buy-in. Once you’ve written this, start sharing it for others to read. Share it with:

A close teammate
Your leader
Your team
Your stakeholders
Your organisation

As you share it with more and more people, you’ll start to find that common questions come up. You can answer these questions in a new section called “FAQ” at the bottom of the document. Common questions include:

Who are the customers of this strategy?
What gives you confidence in this strategy? What are the risks?

After you’ve distributed your strategy, call a meeting and invite whoever needs to approve the document. Allow your approvers one final read and ask for approval. Once you have it, you’re aligned. Time to start executing!

In Summary

Disambiguating the unknown can be extremely challenging. I struggled with it when my manager tasked me with driving the Embedded SRE team, and I first saw it solved elegantly during the creation of the Site Reliability Engineering department. What I have found since is that taking a structured approach to understanding the problem is very helpful.

SWOT allows us to inventory our organisation's capabilities. Then, we can go through and ensure we have a shared understanding of the problem and theory as to why it exists, list our capabilities and constraints, make tradeoffs and then define the constraining choices.

Once we’ve written it up, we can condense it into a one-pager and share it with our colleagues. We can then use that to walk our now clear path into the unknown.

I recommend you try it! At worst, you’ll learn a tonne.

Simple, Beautiful Software Development
