AWS Resilience Hub helps you to define, validate, and track the resilience of your applications on AWS. We were delighted to attend AWS Resiliency Day to learn about the tools and strategies available to build resilience into your workloads.
Serverless is becoming a regular thing. Many more people are discussing event-driven architecture (EDA) and well-architected workloads. There’s a maturing of the ideas we cover in The Value Flywheel Effect book, and a lot of interest around modernization. We’re in the cloud. Now, how do we take it to the next level? For us, that is Serverless, EDA, and Well-Architected. A lot of people talk about resilience. As Werner Vogels says:
Everything fails, all the time.
Werner Vogels, CTO, Amazon Web Services.
So, how do you make sure your workloads are resilient?
In the run-up to re:Invent, there are announcements on Serverless, EDA, and Well-Architected services and capabilities. Building a well-architected, resilient, and reliable workload on AWS has always been challenging. The tools and capabilities are coming online to make it easier for teams. In the past, you needed to acquire a lot of knowledge to understand and deliver a resilient and reliable workload. The guidance is getting better at providing guardrails.
AWS Resiliency Day
Adrian Cockcroft has talked about continuous resilience for years. Adrian is always way ahead of the curve. When he started talking about continuous resilience, it was hard to do. It is still hard, but there are good tools available now. We were at an AWS Resiliency Day in Belfast, with discussions on chaos testing, Resilience Hub, and Correction of Errors.
It was a great event. The speakers were fantastic, and the content was good. We covered the Well-Architected Framework and resilience modeling capabilities; the Resilience Hub tool looks promising. The Chaos Engineering practices we’ve been advocating for, with AWS Fault Injection Simulator (FIS), were also covered. We looked at the theory around disaster recovery and business continuity planning. And one thing I liked was Correction of Errors: incident analysis and post-incident response. How do you incorporate these things into your working methods to ensure they don’t happen again? It was a good day, and you should go if you can.
Resiliency in the Drawing Office for the Titanic
AWS held their Resiliency Day at the Titanic Hotel, in the drawing office that the designers of the Titanic used. It is a big room with loads of windows, including a large skylight, dating from over 100 years ago. It’s where the architects sat and designed the ships built by Harland and Wolff. We were looking at Titanic artifacts and talking about resilience. A lot of reminders of why this stuff’s important!
Speakers later in the day referenced the setting. The story of Titanic was a gift that kept on giving. One of the things we discussed is that it is never just one thing: delivering a reliable and resilient system takes more than one thing. One of our daughters is researching Titanic for her school project. She is learning that it wasn’t just the fact that Titanic hit an iceberg.
The ship was on fire when it left the harbor. The guy with the keys to the binocular cupboard didn’t make the trip, so the lookouts couldn’t see the iceberg in time. They didn’t have enough lifeboats because of time pressure to set sail. Why were they going too fast? Because the coal was on fire. They were sailing at 22 knots instead of a more cautious speed. And there were other time pressures from White Star Line to beat the transatlantic speed record. It’s not just one thing. There are lots of different things that culminate in disaster. The small things that seem innocuous or inconsequential in isolation build up and end in tragedy.
Take a proactive approach to resilience.
Planning resilience for a workload or app must be proactive. Did White Star Line try to simulate any of these things with Titanic or practice their reactions? How would they detect when issues happened? Technology helps with resilience, but a lot of it is just technique, process, and practice. How do you plan for unforeseen circumstances? Adrian Cockcroft discusses his Netflix experience with chaos engineering, chaos testing, Chaos Monkey, and Chaos Gorilla. It’s an exciting way of thinking about things. What happens if we lose this part of the system? What would we expect to see? How would we recover? They didn’t run scenarios with the Titanic, and maybe they had too much confidence in the ‘unsinkable’ ship.
With software, we’re in an enviable position: we can run experiments and test hypotheses for relatively little money. With ships, it would be hard to do that in the physical world. What happens if we hit an iceberg while the bunkers are on fire? You’re not going to simulate that at scale. In the cloud, we can, by running experiments, injecting faults, and simulating what happens. We have engineering and well-architected practices for testing and gaining confidence in your system. What scenarios are you adding to example mapping sessions to cover the what-ifs?
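As a minimal sketch of what that can look like in practice (our illustration, not something demonstrated on the day), the snippet below kicks off an AWS Fault Injection Simulator experiment from an existing experiment template and waits for it to finish. The Region and template ID are placeholders; the template itself defines which faults to inject, which resources to target, and the alarm-based stop conditions that keep the blast radius under control.

```python
import time
import uuid
import boto3

# Placeholders: use your own Region and the ID of an FIS experiment template
# you have already defined (actions, targets, and stop conditions).
REGION = "eu-west-1"
EXPERIMENT_TEMPLATE_ID = "EXT_REPLACE_ME"

fis = boto3.client("fis", region_name=REGION)

# Start the chaos experiment. The client token makes the call idempotent.
experiment = fis.start_experiment(
    clientToken=str(uuid.uuid4()),
    experimentTemplateId=EXPERIMENT_TEMPLATE_ID,
)
experiment_id = experiment["experiment"]["id"]
print(f"Started FIS experiment {experiment_id}")

# Poll until the experiment reaches a terminal state, then report it.
while True:
    state = fis.get_experiment(id=experiment_id)["experiment"]["state"]
    if state["status"] in ("completed", "stopped", "failed"):
        print(f"Experiment finished: {state['status']}")
        break
    time.sleep(15)
```

The interesting part is not the script; it’s the conversation about what you expect to happen while the experiment runs, and what the dashboards and alarms actually show.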
Correction of errors could have applied to Titanic.
Correction of Errors is a good example. Titanic is no different from a modern project because there is a human element: immediate pressures on the company, competition with other companies, people being arrogant and vicious, not knowing what they don’t know, being in a hurry, or not having enough money. All of the human elements come into play when you do something big.
When you do post-incident analysis, it isn’t about someone forgetting to do something; it’s about which part of the system allowed that to happen. Let’s not put a lock on the binocular cupboard, because losing the key will happen again. Correction of Errors is not about human error; it’s about what in the system design allowed the error to occur. Humans are humans, and stuff is going to happen. You must make it hard to do the wrong thing. The pressures were the same then: White Star was a company with everyday stresses.
Even when you know the complexities of your system, there’s only so much you can do up front with those practices. Things still go wrong. But when things do go wrong, information is invaluable. By being proactive, setting RTOs and RPOs, and running chaos tests, you have that information to hand when something happens. It highlights weaknesses and allows you to tighten things up. From an engineering perspective, it’s exciting, because coming up with scenarios is a creative exercise.
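To make that concrete, here is a small, self-contained sketch (our own example, with made-up target numbers) of recording RTO and RPO targets per disruption type and checking a game day’s measured recovery against them.

```python
from dataclasses import dataclass

@dataclass
class RecoveryObjective:
    """Recovery targets for one disruption type (values are illustrative)."""
    rto_seconds: int  # Recovery Time Objective: how quickly service must be restored
    rpo_seconds: int  # Recovery Point Objective: how much data loss is tolerable

# Hypothetical targets for a critical workload.
TARGETS = {
    "az_failure": RecoveryObjective(rto_seconds=300, rpo_seconds=60),
    "region_failure": RecoveryObjective(rto_seconds=3600, rpo_seconds=900),
}

def check_game_day_result(disruption: str, measured_rto: int, measured_rpo: int) -> bool:
    """Flag whether a chaos test or game day met the recovery targets."""
    target = TARGETS[disruption]
    ok = measured_rto <= target.rto_seconds and measured_rpo <= target.rpo_seconds
    print(
        f"{disruption}: RTO {measured_rto}s (target {target.rto_seconds}s), "
        f"RPO {measured_rpo}s (target {target.rpo_seconds}s) -> {'PASS' if ok else 'FAIL'}"
    )
    return ok

# Example: a simulated AZ failure recovered in 8 minutes with 30 seconds of data loss.
check_game_day_result("az_failure", measured_rto=480, measured_rpo=30)
```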
Failure is a learning experience.
A psychologically safe environment is critical. Failure is a learning experience, and it’s something you can use to improve. You must put mechanisms in place to get people into the right headspace. Don’t expect exploratory testing or chaos engineering to just happen; you must make space in your plan for exploratory testing, chaos engineering, or a game day. Put mechanisms in place to have the conversations and get everybody involved.
It might sound like it’s optional to do this stuff. But when your system goes down, you’re losing money every hour. And you realize that it wasn’t optional.
We see a proliferation of serverless adoption and EDA, which go hand in hand with distributed microservice architectures. Resiliency and planning for disaster recovery are critical. It’s still emerging, and it’s a leapfrog thing. You may not have started yet, but now is a good time to start. If you had started ten years ago, you would have had to figure it all out yourself, as with serverless or cloud. If you’re in early, you have to knock the corners off and figure out how things work. It’s not a disaster to start late, because you can use the latest and greatest. It’s like standing on the shoulders of giants.
So if you jump into Resilience Hub and Well-Architected, there’s a whole load of stuff you can use out of the box. As we have discussed, it’s never one thing. It reminds us of the book ‘The Perfect Storm’ and the boat that left Gloucester and headed into the big storm of 1991. The storm was not the only thing that caused the ship to sink; it was a whole bunch of things coming together, a perfect storm of events that ended in tragedy. That’s what you have to be resilient against.
Disaster Recovery Strategies
During the AWS Resiliency Day, we reviewed disaster recovery strategies from ‘Backup and Restore’ to ‘Pilot Light’, ‘Warm Standby’, and ‘Multi-Site Active-Active.’ When you embrace a Serverless First, Well-Architected mindset, your workloads are intrinsically further along the disaster recovery spectrum than a traditional workload. You can bake in resiliency and reliability when you embrace serverless managed services with Multi-AZ and Multi-Region capabilities built in. You still need to plan for this stuff, but you will be further down the road than somebody trying to do it all themselves.
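As one hedged illustration of what ‘baked in’ means (the table name and Regions are placeholders, and this assumes the table already meets the prerequisites for global tables, such as streams being enabled): a DynamoDB table is already replicated across Availability Zones within its Region, and a single API call can add a replica in a second Region for a multi-Region, active-active posture.

```python
import boto3

# Placeholders: a DynamoDB table is already Multi-AZ within its home Region.
TABLE_NAME = "orders"
HOME_REGION = "eu-west-1"
REPLICA_REGION = "eu-west-2"

dynamodb = boto3.client("dynamodb", region_name=HOME_REGION)

# Add a replica in a second Region, turning the table into a global table
# (2019.11.21 version) for a multi-Region, active-active posture.
dynamodb.update_table(
    TableName=TABLE_NAME,
    ReplicaUpdates=[
        {"Create": {"RegionName": REPLICA_REGION}},
    ],
)
```

Compare that to replicating a self-managed database across Regions yourself, and you can see why a serverless, managed-service approach starts further along the spectrum.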
When considering engineering excellence, you should be doing this as part of quality. Chaos tests build confidence, especially when you’re building event-driven architectures that integrate several components or services in different ways. What happens if one of these things goes down? You must simulate those things and game day it. Go in and take one or two of them out, see how the team reacts, and see how resilient the solution really is. We do this a lot. Every time you do it, you learn. You add to your arsenal and experience.
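As an example of ‘taking a component out’ in an event-driven workload during a game day (our sketch; the mapping UUID and Region are placeholders): temporarily disable the SQS-to-Lambda event source mapping for a consumer, watch messages back up in the queue and alarms fire, then re-enable it and confirm the backlog drains without loss.

```python
import time
import boto3

lambda_client = boto3.client("lambda", region_name="eu-west-1")

# Placeholder: the UUID of the SQS event source mapping feeding the consumer Lambda.
MAPPING_UUID = "00000000-0000-0000-0000-000000000000"

def simulate_consumer_outage(duration_seconds: int = 300) -> None:
    """Game-day fault: pause a consumer by disabling its event source mapping."""
    # Take the consumer 'out': messages accumulate in the SQS queue.
    lambda_client.update_event_source_mapping(UUID=MAPPING_UUID, Enabled=False)
    print("Consumer disabled; watch queue depth and alarms during the outage window.")

    time.sleep(duration_seconds)

    # Recover, then verify the backlog drains and nothing lands in the DLQ.
    lambda_client.update_event_source_mapping(UUID=MAPPING_UUID, Enabled=True)
    print("Consumer re-enabled; confirm the backlog drains with no message loss.")

simulate_consumer_outage()
```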
When I started working, I wrote a test to detect if someone pulled a network cable out of the machine, which had dual network cables. The second test unplugged the device while it was running. It was a telco system! This stuff is not new, but it’s evolving rapidly.
So that’s the craic. Like and subscribe to Serverless Craic. Look at The Serverless Edge blog and like or follow us @ServerlessEdge on X.