This article is part of an ongoing series looking at Amazon’s Well-Architected Framework, a breakthrough collection of best practices for cloud-native organizations. This time, we’re digging into the Reliability Pillar. We have previous episodes and articles on other pillars, like operational excellence and security.
It is a common misconception that reliable systems never encounter infrastructure or service disruptions, misconfigurations, or network issues. The reality is that reliable systems will experience all of these issues. Your system must be intentionally architected to recover from failure quickly. All engineers must design their systems for reliability. In this article, we show how serverless frees up thinking time for system design, rather than spending it on mastering non-functional components.
So, Reliability is a funny (as in peculiar) pillar. It has four sections:
- foundations,
- workload architecture,
- change management, and
- failure management.
If you’re building traditionally, this is a considerable amount of work. You could spend a long time getting this right. Disaster recovery (DR) is traditionally very complicated. But we’re serverless heads. Many of these things are only partially taken care of by AWS, but AWS makes some of them much easier to do. A lot of service quotas and constraints are baked into the foundations. From a change management perspective, you want to get into the continuous delivery mindset, and modern tools give you a lot of monitoring to support that. From a failure management perspective, serverless is ephemeral and built for retries, so those areas are slightly easier to work with.
When you look at the foundations section of the AWS Reliability Pillar, you’re looking at how to plan capacity so that you don’t over-provision or overspend but can still scale up effectively. Traditionally, you put a lot of time into the foundations section. If you are serverless, you still need to look at foundations, but it’s less intensive. You’re looking at things like DNS, how you route traffic effectively, etc.
You must still worry about quotas and your account and resource constraints. If you follow the operational excellence and security pillars and have granular accounts, hopefully few workloads are stepping on each other’s toes in the same account, so that’s less of an issue. It’s more about being aware of your quotas and the size of your workload, and making sure the higher-level quotas are big enough for the demand you’re going to have. It’s a matter of requesting a larger allocation, which is much easier than ordering another five racks or a couple of IFLs for your mainframe.
Many serverless services have relatively tight default quotas. They’re not there because of a lack of capacity. They are there to stop you from accidentally setting off a million Lambda functions or something similar. It’s usually no problem to get them raised. They are a guardrail that protects you.
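As a rough illustration of checking a quota against expected demand, the sketch below estimates the concurrency a function needs as requests per second multiplied by average duration, then compares it with a quota. The function names, the traffic figures, and the quota value of 1,000 are all assumptions for illustration; check the actual quota for your own account and region.

```python
import math


def required_concurrency(requests_per_second: float, avg_duration_seconds: float) -> int:
    """Estimate concurrent executions as arrival rate x average duration."""
    return math.ceil(requests_per_second * avg_duration_seconds)


def quota_headroom(quota: int, requests_per_second: float, avg_duration_seconds: float) -> int:
    """Return spare concurrency under the quota (negative means ask for an increase)."""
    return quota - required_concurrency(requests_per_second, avg_duration_seconds)


# Hypothetical figures: an assumed quota and two assumed traffic levels.
ASSUMED_QUOTA = 1000
normal = quota_headroom(ASSUMED_QUOTA, requests_per_second=400, avg_duration_seconds=0.5)
peak = quota_headroom(ASSUMED_QUOTA, requests_per_second=3000, avg_duration_seconds=0.5)
print(f"normal headroom: {normal}, peak headroom: {peak}")
```

With these made-up numbers, normal traffic fits comfortably, while the peak scenario comes out negative, which is the signal to request a quota increase before the demand arrives.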
You’re starting from a more mature reliability standpoint when you adopt the serverless and serviceful approach. From a change management point of view, adapting to change is built into serverless capabilities and how managed services operate, so you’re not worrying so much about it.
From a failure management point of view, much of that is baked in, especially if you’ve built an event-driven, asynchronous workload using SNS, SQS, or EventBridge. A lot of the circuit-breaker mentality, retries, and dead-letter queues come out of the box now. AWS keeps maturing those capabilities to make it easier for teams to get resilience and reliability by default.
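To make the retry and dead-letter idea concrete, here is a minimal in-memory sketch of the pattern that managed services give you out of the box. It is not how SNS, SQS, or EventBridge implement it; the function name, parameters, and handlers are hypothetical, and a plain Python list stands in for the dead-letter queue.

```python
import time
from typing import Callable, List


def process_with_retries(
    message: dict,
    handler: Callable[[dict], None],
    dead_letter_queue: List[dict],
    max_attempts: int = 3,
    base_delay_seconds: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Try the handler up to max_attempts times, then park the message in the DLQ."""
    for attempt in range(1, max_attempts + 1):
        try:
            handler(message)
            return True
        except Exception:
            if attempt < max_attempts:
                # Exponential backoff between attempts: 0.1s, 0.2s, 0.4s, ...
                sleep(base_delay_seconds * 2 ** (attempt - 1))
    # All attempts failed: keep the message for later inspection or replay.
    dead_letter_queue.append(message)
    return False


# Usage with hypothetical handlers: one recovers on the third call, one never does.
calls = {"count": 0}


def flaky_handler(message: dict) -> None:
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("transient failure")


def always_failing_handler(message: dict) -> None:
    raise RuntimeError("permanent failure")


dlq: List[dict] = []
no_sleep = lambda seconds: None  # skip real waiting in this demo
ok = process_with_retries({"id": 1}, flaky_handler, dlq, sleep=no_sleep)
failed = process_with_retries({"id": 2}, always_failing_handler, dlq, sleep=no_sleep)
print(ok, failed, dlq)
```

The point of the dead-letter list is that a message that keeps failing is set aside rather than lost or retried forever, so the rest of the flow keeps moving and someone can look at the failure later.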
Influence of Serverless on Reliability
We’ve discussed how serverless influences workload architecture and how we architect workloads. For example, we take a micro path with serverless architecture: microservices or micro front ends. It’s opinionated, so there are only so many ways to connect these things. It’s aimed at speed, cost-effectiveness, and reliability. If you use Lambda, your functions run across multiple Availability Zones (AZs) for high availability. It’s similar with DynamoDB or Aurora. AWS has already thought about a lot of what’s in the Reliability Pillar for you, and you can benefit from that. It also influences how you assemble your workload, as you’ve got to work within those constraints.
Interestingly, the Reliability Pillar has been there for several years, but AWS recently added a new section: workload architecture. One of its questions is: how do you design interactions in a distributed system to prevent failures? So there’s a specific section on how to build distributed workloads. That’s more than a nod. It’s a sign that many customers are moving towards distributed microservices and modern application stacks. There’s a lot of depth in that. To do it well, you need upfront design. You need to think about your system before you build it. You can’t code your way out of it.
We love the reference to domain-driven design. We’re big fans of applying a domain-driven approach and using techniques like event storming to break down those boundaries of domains and understand the flow of events through your system. It lends itself to breaking it into more manageable domains and chunks you can test in isolation. And you can have different characteristics for each of those domains.
As we said at the top of the session, serverless makes some of these things easier, and you can spend time on domain design because you have not used it up on retries. It’s a similar amount of effort and still challenging, but you’re putting precious time into system design instead of tuning non-functional things. It’s still hard to build these systems, but I firmly believe you get a better system at the end of the day.
Think about components and the effect of auto-scale
Using a serverless lens with the questions in the Reliability Pillar is vital because serverless auto-scales. One of the questions is: if we auto-scale this serverless component, what load or pressure does it place on something that doesn’t auto-scale? It forces you to think about protection and where the choke points are. Should you be throttling your workload? Should you be setting constraints on the scalability of your serverless workload?
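One way to picture throttling in front of a component that doesn’t auto-scale is a token bucket. The sketch below is a generic, in-process illustration rather than an AWS API; the class name, the rate, and the burst capacity are assumptions, and a fake clock is used so the demo does not depend on real time.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow `rate_per_second` calls on average, with bursts up to `capacity`."""

    def __init__(self, rate_per_second: float, capacity: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate_per_second = rate_per_second
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)
        self.last_refill = clock()

    def try_acquire(self) -> bool:
        """Return True if the call may proceed, False if it should be throttled."""
        now = self.clock()
        elapsed = now - self.last_refill
        # Refill in proportion to elapsed time, never beyond the burst capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate_per_second)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# Demo with a fake clock: 8 calls arrive at once, then 8 more one second later.
fake_now = [0.0]
bucket = TokenBucket(rate_per_second=5, capacity=5, clock=lambda: fake_now[0])

burst = [bucket.try_acquire() for _ in range(8)]
fake_now[0] = 1.0
after_refill = [bucket.try_acquire() for _ in range(8)]
print(burst.count(True), after_refill.count(True))
```

In both rounds only five of the eight calls get through, so the downstream system sees a bounded rate no matter how far the caller scales out. On AWS, a comparable cap can be placed on a function with Lambda reserved concurrency.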
We’ve seen denial-of-wallet-type questions. Do you want to scale indefinitely, when it could cost you a fortune? Those questions get to something we’re passionate about: testing for resilience, continuous resiliency, and having test days or game days that tease out where the choke points are. Where are your failure cases? Where are the downstream systems that can’t respond or can’t take the load you pass to them?
We all agree that it is easier to do this with serverless, but you must still design the setup correctly. You can get good feedback by testing for this stuff. You’ve seen AWS’s fault injection service come out and mature. It’s easy to use, and we want to see it evolve to be much more serverless-focused. It’s a lot easier to test for resiliency now, so you’re not guessing anymore. You have real automation, and you can bake it into your CI/CD pipeline.
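To show the shape of an automated resilience check, here is a small in-process sketch of fault injection. It does not use the AWS fault injection service or its API; the wrapper, the hypothetical `lookup_price` dependency, and the failure rates are assumptions for illustration. The idea is to make a dependency fail on purpose and confirm that the calling code still returns an answer.

```python
import random
from typing import Callable


class InjectedFault(Exception):
    """Raised by the injector to simulate a failing dependency."""


def with_fault_injection(func: Callable[[str], str], failure_rate: float,
                         rng: random.Random) -> Callable[[str], str]:
    """Wrap func so that a fraction of calls raise InjectedFault instead of running."""
    def wrapper(key: str) -> str:
        if rng.random() < failure_rate:
            raise InjectedFault(f"injected failure for {key}")
        return func(key)
    return wrapper


def lookup_price(key: str) -> str:
    """Hypothetical downstream dependency."""
    return f"price-for-{key}"


def get_price_with_fallback(lookup: Callable[[str], str], key: str) -> str:
    """Code under test: degrade to a default answer when the dependency fails."""
    try:
        return lookup(key)
    except Exception:
        return "price-unavailable"


def run_experiment(failure_rate: float, calls: int = 100, seed: int = 42) -> dict:
    """Run the code under test against a faulty dependency and count outcomes."""
    faulty_lookup = with_fault_injection(lookup_price, failure_rate, random.Random(seed))
    results = [get_price_with_fallback(faulty_lookup, f"item-{i}") for i in range(calls)]
    degraded = results.count("price-unavailable")
    return {"served": calls - degraded, "degraded": degraded}


# Every call should get an answer, even when the dependency always fails.
print(run_experiment(failure_rate=0.0))
print(run_experiment(failure_rate=0.3))
print(run_experiment(failure_rate=1.0))
```

A check like this can run in a pipeline as an ordinary test: the assertion is simply that the number of served plus degraded responses equals the number of calls, whatever the failure rate.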
We remember building a distributed system 20 years ago, and you had to put a lot of time and effort into doing some of the things you can now do by checking a box, e.g., the advanced testing and resilience practices you can put in.
So that’s the craic. And that’s the reliability pillar.