How does operational excellence help you with continuous improvement in a complex environment? We share our Operational Excellence Definition and Best Practices. This article is the first in a series on the well-architected framework and pillars.
Operational Excellence and well-architected
If you are new to the well-architected framework, look at our article on the well-architected pillars, which we call SCORP, or SCORPS now that there are six pillars. Well-architected is interesting because AWS, Google, and Azure each have their own version, and they’re all quite similar. We have had great success working through these pillars.
So, we will cover each pillar in this series, starting with operational excellence.
Operational Excellence is instrumental. It gives a frame of reference and a structure for asking better questions about our teams, systems, networks, processes, and practices. It has been hugely helpful in trying to evolve engineering practices and companies. Thousands of organizations have hardened, approved, and battle-tested Operational Excellence, giving it much credibility. So it’s not just our opinion, but good practice that works.
We like the ubiquity of Operational Excellence. Whether you’re an architect, an engineer, or a manager in one organization, it’ll make sense when you go to another.
It should be part of the continuous architecture.
Operational Excellence is not a once-a-year compliance exercise delivered as part of a well-architected review. It should be part of the continuous architecture. I always encourage people to get certified, not for a bit of paper or a free water bottle, but because you have to learn well-architected as part of certification.
Starting with operational excellence, the AWS pillar breaks down into three areas: ‘Prepare, Operate, and Evolve.’ Each area has five or six questions.
Operational Excellence Pillar – Prepare, Operate and Evolve
Operational excellence means many things to many people, but let’s start with ‘Prepare.’ It’s helpful to go into new areas and teams and ask these questions:
- Do you know who your users are?
- What is the purpose of your team?
- Do you know what your highest priority is?
Some are straightforward, fundamental questions.
Are you set up to meet the challenges you face, the business requirements you will pursue, and the needs you’re trying to meet?
Asking simple questions can be revealing.
Simple questions, such as ‘How do you determine what your priorities are?’, can be very revealing. You can have a good conversation in a safe space with the whole team involved. A typical answer is: ‘We know our priorities for this week and next, but we’re still determining what we will do for the month after.’ It’s an excellent conversation to tease out. Are you aligned with the strategic direction? Do you have a prioritization framework, or are you making it up ‘on the hoof’?
The Operational Excellence pillar needs the whole team involved in the conversation. Some questions require Management to be involved; some require the Tech Lead or an Engineer to understand the big picture and operations. We talk about consistency. This section recommends playbooks, runbooks, and standards for your processes, and it asks you to prepare for failure, because everything fails at some point.
Operational Excellence: Prepare
You must prepare for post-implementation, when you hand off to a different team or bring on new engineers. Do you have the runbooks for the operations in a particular workload? Do you have the playbooks linked to observability in your dashboard so that when things go wrong, whoever picks up the problem has a solid set of instructions to deal with it and doesn’t have to go in and unpack what you’ve built out? There’s a lot of good, solid foundational guidance. From an architecture perspective (we’re all architects), it’s table stakes for team consistency.
‘Prepare’ looks at tribal knowledge, like when you ask a question and the response is ‘Fred says.’ In other words: ‘I don’t know why we do that, but Fred says we do that.’ Another response could be: ‘Ask my manager’. But what happens when your manager isn’t there? We need leadership and empowerment within the team, and we need that knowledge written down for everyone. So, ‘Prepare’ checks team culture.
‘Prepare’ also checks simple stuff like: Do you have enough people to meet the challenges? Do you have assigned owners responsible for processes, practices, and operations? If you can get these foundations in place early, then as you evolve, go through the lifecycle, and apply the other well-architected pillars, your chance of success dramatically improves because your operational excellence pillar has set the foundation.
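As a rough illustration of linking playbooks to observability, the sketch below keeps a registry that maps each alarm to its written playbook and reports alarms that have none. Every name here (the `Playbook` fields, the alarm names, the team name, the steps) is hypothetical and not taken from any AWS service or from our own tooling; it only shows the shape of the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Playbook:
    """Written-down instructions for one alarm (all fields are illustrative)."""
    title: str
    owner: str    # team or role accountable for keeping the playbook current
    steps: tuple  # ordered remediation steps


# Hypothetical registry: alarm name -> playbook.
PLAYBOOKS = {
    "checkout-api-5xx-rate": Playbook(
        title="Checkout API returning 5xx errors",
        owner="payments-team",
        steps=(
            "Check the most recent deployment and roll back if it correlates.",
            "Inspect downstream dependency health on the dashboard.",
            "Escalate to the owning team if unresolved after 30 minutes.",
        ),
    ),
}


def alarms_without_playbook(alarm_names, playbooks):
    """Return the alarms that have no written playbook, sorted by name."""
    return sorted(name for name in alarm_names if name not in playbooks)


# Hypothetical list of alarms configured on the team's dashboard.
configured_alarms = ["checkout-api-5xx-rate", "orders-queue-depth"]
missing = alarms_without_playbook(configured_alarms, PLAYBOOKS)
print(missing)  # ['orders-queue-depth']
```

A check like this could run in a pipeline so that a new alarm without a playbook is flagged before it ever fires, rather than discovered during an incident.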
Operational Excellence: Operate
The next section of the Operational Excellence Pillar is ‘Operate.’ So you start with ‘Prepare’ and then move to ‘Operate’. We like ‘Operate’ because there’s a lot of observability. A workload is an asset, but how do you understand the health of that asset, and how do you monitor it to ensure it’s working well?
‘Operate’ is about getting the team ready for production. A particular bugbear of mine is when teams haven’t thought about how to validate in production and how to spot regression. What are the key performance indicators of the workload? When things go wrong, can they spot it, and have they thought about how to remediate or correct those sorts of things?
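One simple way to spot a regression is to compare a KPI after a release against a baseline taken before it. The sketch below is a minimal version of that idea; the function names, the 10% tolerance, and the sample latency figures are assumptions for illustration, and a real workload would likely need something more robust than a comparison of means.

```python
from statistics import mean


def relative_change(baseline, current):
    """Relative change of the current mean against the baseline mean."""
    base = mean(baseline)
    if base == 0:
        raise ValueError("baseline mean is zero; relative change is undefined")
    return (mean(current) - base) / base


def is_regression(baseline, current, tolerance=0.10, higher_is_worse=True):
    """Flag a regression when the KPI moves the wrong way by more than `tolerance`.

    higher_is_worse=True suits KPIs such as latency or error rate;
    set it to False for KPIs such as throughput or conversion.
    """
    change = relative_change(baseline, current)
    return change > tolerance if higher_is_worse else change < -tolerance


# Illustrative p95 latency samples in milliseconds, before and after a release.
before_release = [100, 102, 98]
after_release = [120, 118, 122]
print(is_regression(before_release, after_release))  # True
```

The tolerance is the part worth discussing as a team: it encodes how much movement in a KPI you are prepared to accept before someone has to look at it.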
Things do go wrong.
You go back to ‘Prepare’ again. There’s always going to be something that goes wrong, something you haven’t predicted, or an alternate path you have missed. So when those things happen, have you got the correct procedures for learning what that defect teaches so you can bake it in and toughen up your operation going forward? It’s a holistic way of thinking; you need those mechanisms to show you how your workload performs by product.
Having those information radiators and dashboards available is critical, and not just for the team. If you have proper observability, you can show the C suite which team is working on a particular capability, feature, or value stream and how it relates to our vision and strategy. That’s proper operational observability across everything, including the health of your workload and team. DORA key metrics should be part of how you operate at a sustainable pace for the team.
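The four DORA key metrics (deployment frequency, lead time for changes, change failure rate, and time to restore service) are one common way to put numbers on a dashboard like this. The sketch below computes them from a list of deployment records. The `Deployment` record, its field names, and the sample data are assumptions made up for illustration; in practice the data would come from your own pipeline and incident tooling.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    """One production deployment (hypothetical record shape)."""
    committed_at: datetime
    deployed_at: datetime
    failed: bool = False
    restored_at: Optional[datetime] = None  # set once a failed change is fixed


def dora_metrics(deployments, window_days):
    """Compute the four DORA key metrics over a reporting window."""
    if not deployments or window_days <= 0:
        raise ValueError("need at least one deployment and a positive window")
    lead_hours = [
        (d.deployed_at - d.committed_at).total_seconds() / 3600
        for d in deployments
    ]
    failures = [d for d in deployments if d.failed]
    restore_hours = [
        (d.restored_at - d.deployed_at).total_seconds() / 3600
        for d in failures
        if d.restored_at is not None
    ]
    return {
        "deployment_frequency_per_day": len(deployments) / window_days,
        "median_lead_time_hours": median(lead_hours),
        "change_failure_rate": len(failures) / len(deployments),
        "median_time_to_restore_hours": (
            median(restore_hours) if restore_hours else None
        ),
    }


# Illustrative data: two deployments in a two-day window, one of which failed.
sample = [
    Deployment(datetime(2024, 1, 1, 0), datetime(2024, 1, 1, 2)),
    Deployment(
        datetime(2024, 1, 2, 0),
        datetime(2024, 1, 2, 4),
        failed=True,
        restored_at=datetime(2024, 1, 2, 5),
    ),
]
metrics = dora_metrics(sample, window_days=2)
print(metrics)
```

Medians are used here rather than means so that one unusually slow change does not dominate the picture; that is a design choice for this sketch, not a rule.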
Operational Excellence: Evolve
The last section of the Operational Excellence Pillar is ‘Evolve.’ You go through ‘Prepare,’ ‘Operate,’ and ‘Evolve’. And it’s quite simply about how you evolve operations, which doesn’t mean cutting costs and reducing the budget!
It’s about having a continuous improvement mindset with feedback loops in place. We’re big into mapping, and evolution is a cornerstone of Wardley mapping. You need to take the signals from your systems and workloads on board and use them to evolve, improve, and get better, which is why you need observability and dashboards.
That’s the critical point. We’ve written about the SCORPS process as the driver of continuous improvement. Your operations will generate a lot of data and valuable information you can use as an engineer, manager, or architect to evolve your current setup. You should be constantly looking to learn.
There is always room for improvement.
The operational excellence pillar sets us up nicely because once you think through ‘Evolve,’ you’re evolving into the other pillars of Cost, Security, Reliability, Performance, and Sustainability. You can always save more money and make things faster, more reliable, and more secure. People think operations are ticking along and that’s okay. But there are always things you can improve.
You set up for success and put the foundational building blocks in place to increase your chances of a successful development cycle.
So that’s the operational excellence pillar from well-architected. That’s the craic. We’ll be talking some more about the pillars. There are posts on this on TheServerlessEdge.com, Twitter @ServerlessEdge, LinkedIn and Medium.