Multi-Account, Multi-Region: Is It Worth It?
The effort involved in designing, developing, and operating a cloud native application across multiple AWS accounts or Regions is substantial. You will need an in-depth understanding of all the implementation details. This means the answer to the question of whether it’s worth adopting a multi-account, multi-Region strategy for disaster prevention and recovery has to be: it depends. It will depend on your use case, your team’s ability to build and support this architecture, and, to a lesser extent, the geographical location of your business and users.
In assessing this approach, you should consider the likelihood of a disaster, the time to recover, and the potential impact on your business and users during this time. These aspects must then be traded off against the overhead of operating across cloud accounts and physical Regions.
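One way to make this trade-off concrete is a rough expected-cost comparison. The following is a minimal sketch, not a prescribed method: the `Strategy` fields, the `expected_annual_cost` helper, and every figure are hypothetical placeholders that you would replace with your own estimates of disaster likelihood, recovery time, business impact, and operating overhead.

```python
from dataclasses import dataclass


@dataclass
class Strategy:
    """Hypothetical inputs for one deployment strategy (all figures are illustrative)."""
    name: str
    disasters_per_year: float   # assumed likelihood of a Region-level disaster per year
    recovery_hours: float       # assumed time to recover from each disaster
    impact_per_hour: float      # assumed business cost of one hour of outage
    operating_overhead: float   # assumed extra yearly cost of running this strategy


def expected_annual_cost(s: Strategy) -> float:
    """Expected yearly outage cost plus the yearly overhead of the strategy."""
    outage_cost = s.disasters_per_year * s.recovery_hours * s.impact_per_hour
    return outage_cost + s.operating_overhead


# Made-up numbers: one disaster every two years, costing 10,000 per hour of outage.
single = Strategy("single-Region", 0.5, 8.0, 10_000.0, 0.0)
multi = Strategy("multi-Region", 0.5, 0.5, 10_000.0, 60_000.0)

for s in (single, multi):
    print(f"{s.name}: {expected_annual_cost(s):,.0f}")
# single-Region: 40,000
# multi-Region: 62,500
```

With these invented inputs the single-Region strategy comes out cheaper, because the overhead of operating across Regions outweighs the reduction in outage cost. Raising the hourly impact or the disaster likelihood reverses the result, which is why the answer depends on your own use case.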
The AWS Post-Event Summaries page states that when a service outage incident “has broad and significant customer impact that results in the failure of a significant percentage of control plane API calls, impacts a significant percentage of a service’s infrastructure, resources or APIs or is the result of total power failure or significant network failure, AWS is committed to providing a public Post-Event Summary (PES) following the closure of the issue.”
You can also view the previous 12 months of service and Region health data via the AWS Health Dashboard.
Perhaps in the future cross-account and cross-Region application development and operation will be abstracted away from engineers. Until then, the benefits of this strategy will always have to be weighed against the overhead involved.
Summary
In this chapter, you have learned about your role in operating your serverless application at scale in production. While AWS is responsible for the availability and resiliency of the managed services in your architecture, you are responsible for the configuration and usage of these services. It is crucial that you are aware of the service limits in place and know how to monitor your usage so that it stays within these limits.
You have also seen how the observability of your system is key to understanding its behavior, especially considering the distributed nature of serverless, event-driven architectures. Just like your testing strategy, your observability strategy must be concentrated around your critical paths. You should adopt critical health dashboards and capability-based alerts to enable your team to immediately detect issues with your serverless application, and tracing should be preferred to logs to support your team’s debugging efforts when errors occur.
Finally, you saw that fault tolerance is a key attribute of serverless and how you can begin to leverage AWS to introduce automated recovery from failure to your microservices.