Mark, Mike and Dave share their thoughts on the AWS Reliability Pillar that forms part of the Well-Architected Framework. This is the third part of a series of talks. All engineers need to design their systems for reliability, but the team show how serverless can free up thinking time for system design rather than trying to master non-functional components.
We’re continuing our conversation about the AWS Well-Architected Framework. I think at the end, we should announce what our favourite pillar is! We’re going to talk about the Reliability Pillar today. We have a couple of previous episodes on other pillars, like Operational Excellence and Security. Reliability is a funny-peculiar pillar. It has 4 sections:
- foundations,
- workload architecture,
- change management, and
- failure management.
If you’re building in a traditional way, this is a huge amount of work. You could spend a long time getting this right. Disaster recovery (DR) is traditionally very complicated. But we’re serverless heads. I wouldn’t say these things are completely taken care of by AWS, but AWS makes a lot of them much easier. A lot of service quotas and constraints are baked into the foundations. From a change management perspective, you want to get into a continuous delivery mindset, and there’s a lot of monitoring built in if you use the modern tools. From a failure management perspective, serverless is ephemeral, so it’s built for retries. You can design so that those areas are slightly easier to work with. What do you guys think?
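The retry behaviour mentioned above is something serverless services largely handle for you. To make the idea concrete, here is a minimal sketch of the common "exponential backoff with full jitter" pattern those retries are typically built on — the base, cap and seed values are illustrative assumptions, not anything a specific AWS service documents:

```python
import random

# Sketch: exponential backoff with "full jitter". Each retry waits a random
# amount between 0 and min(cap, base * 2**attempt). The constants here are
# illustrative assumptions, not AWS defaults.

def backoff_delays(attempts: int, base: float = 0.1, cap: float = 5.0,
                   seed: int = 42) -> list[float]:
    """Return the sequence of jittered delays for `attempts` retries."""
    rng = random.Random(seed)  # seeded only so the sketch is reproducible
    return [rng.uniform(0, min(cap, base * 2 ** n)) for n in range(attempts)]

if __name__ == "__main__":
    print(backoff_delays(5))
```

The jitter matters: if every failed client retried on the same schedule, the retries themselves would arrive as a synchronised spike and keep the downstream dependency overloaded.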
I agree with that. When you look at the foundations section of the Reliability Pillar, you’re probably looking at how to plan: not over-provisioning or overspending, but also scaling up effectively. You put a lot of time into the foundations section, whereas in serverless you still need to look at foundations, but I think it’s less intensive. So you’re looking at things like DNS, how you write effectively, and things like that.
You still need to worry about quotas and some of the resource constraints of your account. If you follow the Operational Excellence and Security pillars and you have granular accounts, you’re hopefully not stepping on too many toes in the same account, so that’s less of an issue. It’s more about being aware of your quotas and the size of your workload, and making sure those higher-level quotas are big enough for the demand you’re going to have. It’s more about putting in the request for a bigger quota, which is a lot easier than ordering up another five racks or a couple of IFLs for your mainframe.
With a lot of serverless services, there are fairly tight quotas. But they’re not there because there’s no capacity. They’re there to stop you doing something crazy, like setting off a million different Lambdas. It’s no problem to get them extended. They exist to protect you, not because of a lack of capacity.
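The "be aware of your quotas" advice can be automated. As a hedged sketch, here is the kind of headroom check you might run: the 80% threshold is our assumption, and the quota code in the comment is shown only for illustration — verify it against the Service Quotas console for your account and region:

```python
# Sketch: decide when usage is close enough to a service quota that it's
# time to file a limit-increase request. The 80% threshold is an assumption,
# not an AWS recommendation.

def needs_quota_increase(current_usage: float, quota: float,
                         threshold: float = 0.8) -> bool:
    """Return True when usage has consumed `threshold` of the quota."""
    return current_usage >= quota * threshold

# With the real Service Quotas API you would fetch `quota` roughly like:
#   boto3.client("service-quotas").get_service_quota(
#       ServiceCode="lambda", QuotaCode="L-B99A9384")
# (quota code for Lambda concurrent executions shown as an assumption;
#  check it in your console before relying on it)

if __name__ == "__main__":
    print(needs_quota_increase(850, 1000))  # True: 85% of a 1000 limit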
You’re starting from a more mature reliability standpoint when you adopt a serverless and serviceful approach. From a change management point of view, adapting to change is built into serverless capabilities and how managed services operate, so you’re not worrying so much about it.
From a failure management point of view, a lot of that is baked in, especially if you’ve built an event-driven, asynchronous workload using SNS, SQS or EventBridge. A lot of circuit-breaker-style patterns, retries and dead-letter queues come out of the box now. AWS is increasingly maturing those capabilities to make it easier for teams to have a resilient and reliable capability by default.
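As one concrete example of those out-of-the-box failure-management hooks, SQS lets you attach a dead-letter queue through a `RedrivePolicy` attribute. A minimal sketch, where the queue names, the ARN and the `maxReceiveCount` of 5 are all illustrative assumptions:

```python
import json

# Sketch: wiring a dead-letter queue to a work queue via SQS's RedrivePolicy.
# The ARN and maxReceiveCount below are illustrative assumptions.

def redrive_policy(dlq_arn: str, max_receives: int = 5) -> str:
    """SQS expects the RedrivePolicy attribute as a JSON-encoded string."""
    return json.dumps({
        "deadLetterTargetArn": dlq_arn,
        "maxReceiveCount": max_receives,  # failed receives before a message moves to the DLQ
    })

# With boto3 you would attach it roughly like this:
#   sqs = boto3.client("sqs")
#   sqs.create_queue(
#       QueueName="orders-queue",  # hypothetical queue name
#       Attributes={"RedrivePolicy": redrive_policy(dlq_arn)},
#   )

if __name__ == "__main__":
    print(redrive_policy("arn:aws:sqs:eu-west-1:123456789012:orders-dlq"))
```

Once a message has been received and not deleted `maxReceiveCount` times, SQS parks it on the DLQ for inspection instead of retrying forever — the "default resilient capability" discussed above.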
Influence of Serverless on Reliability
We’ve done talks on how serverless influences workload architecture. And here, serverless influences how we would architect workloads. For example, with serverless architecture we tend to take a micro path: microservices or micro-frontends. It’s opinionated, so there are only so many ways to connect these things. It’s aimed at speed, cost-effectiveness and reliability. If you are using Lambda, it runs across multiple Availability Zones out of the box, in terms of the HA side of things. It’s the same with DynamoDB or Aurora. There’s a lot of stuff in the AWS Reliability Pillar that’s thought about for you, and you can benefit from that. And that influences how you actually assemble your workload in terms of the workload architecture, as you’ve got to work within those constraints.
What’s interesting about the Reliability Pillar and Well-Architected is that they have been around for eight years, but they’ve only recently added this new section, workload architecture. One of the questions is: how do you design interactions in a distributed system to prevent failures? So there’s a specific section on how to design distributed workloads. That’s more than a nod. It’s proof that a lot of customers are moving towards distributed, microservice-based, modern application stacks. There’s a lot of depth in that. To do it well, you need an upfront design. You need to think about your system before you build it. You can’t code your way out of it.
Domain driven design
I love the reference to domain driven design. We’re big fans of applying a domain driven approach and using techniques like event storming, to really break down those boundaries of domains and understand the flow of events through your system. It lends itself to breaking it into more manageable domains and chunks that you can test in isolation. And you can have different characteristics for each of those domains.
Like we said at the top of the session, serverless makes some of these things easier, but the time you don’t spend on retries is put into domain design. It’s an equal amount of effort and it’s still challenging, but you’re putting precious time into system design, as opposed to tuning a non-functional thing. It’s still hard to build these systems, but I firmly believe you get a better system at the end of the day.
Think about components and the effect of auto-scaling
Using a serverless lens with the questions in the Reliability Pillar is important because serverless auto-scales. Some of the questions look at: if we auto-scale this serverless component, what sort of load or pressure is put on something that doesn’t auto-scale? It forces you to think about protection, and where the choke points are. Should you be throttling your workload? Should you be setting constraints on the scalability of your serverless workload?
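One way to put a constraint on that scalability is a Lambda reserved-concurrency cap sized to what the downstream system can absorb. A hedged sketch using Little's law, where the function name and the throughput/duration numbers are illustrative assumptions:

```python
import math

# Sketch: size a Lambda reserved-concurrency cap so an auto-scaling function
# cannot overwhelm a downstream system with a fixed throughput ceiling.
# The numbers below are illustrative assumptions, not service defaults.

def safe_concurrency(downstream_tps: float, avg_duration_s: float) -> int:
    """Little's law: concurrent requests ~= arrival rate * duration."""
    return max(1, math.floor(downstream_tps * avg_duration_s))

# Applying the cap with boto3 would look roughly like:
#   boto3.client("lambda").put_function_concurrency(
#       FunctionName="orders-handler",  # hypothetical function name
#       ReservedConcurrentExecutions=safe_concurrency(100, 0.25),
#   )

if __name__ == "__main__":
    print(safe_concurrency(100, 0.25))  # 25 concurrent executions
```

With the cap in place, Lambda throttles invocations beyond the limit instead of letting the auto-scaling tier flood the choke point behind it.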
I’ve seen the denial-of-wallet type questions. Do you really want to scale infinitely when it could cost you a fortune? Those questions get to something we’re very passionate about, which is testing for resilience and continuous resiliency, and having test days or game days that tease out where the choke points are. Where are your failure cases? Where are the downstream systems that can’t respond or can’t take the load as you pass it to them?
All of us agree that it’s easier to do this with serverless, and it’s important that the setup is designed properly. But you can get good feedback by testing for this stuff. And you’ve seen the maturity of the AWS Fault Injection Service that has come out. It’s easy to use, and I’m hoping to see it evolve and mature to be much more serverless-focused as well. It’s a lot easier to test for resiliency now. You’re not guessing anymore. You have real automation, and you’ve baked that into your CI/CD pipeline as well.
I remember building distributed systems 20 years ago, and you had to put a lot of time and effort into doing some of the things you can now do by checking a box, e.g. the advanced testing and resilience practices you can put in.
So that’s the craic. And that’s the reliability pillar.
Transcribed by https://otter.ai