When Things Go Wrong – Operating Serverless

The sheer volume of variables and the emergent complexity involved in operating a production serverless workload mean that things will go wrong. This does not mean you should be complacent and stop trying to minimize the number of things that could go wrong, of course. You should optimize your development practices, testing strategy, delivery pipelines, and observability culture so that you eliminate bugs before they reach production where possible, and catch and fix them rapidly when they do slip through.

Alongside testing, delivery, and observability, the fourth quadrant of the serverless square of balance (first presented in Chapter 7) is recovery. Recovering from failures in production involves making your services and workflows fault tolerant. Fortunately, fault tolerance is usually a key feature of serverless managed services on AWS, and it’s one that you must leverage to get the most from serverless.

Accepting Failure and Budgeting for Errors

You were introduced to the application of service level objectives for targeting alerts earlier in this chapter. The other aspect of SLOs is the concept of an error budget. Error budgets specify a threshold for the volume of errors that are permitted to occur in a particular feature, service, or product. For example, a service exposed to users via an API endpoint could have an error budget of 5% for a month. Once 5xx errors account for 5% of all responses returned from that API endpoint during the month, the error budget is completely used up; any further errors take the service over budget.
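The arithmetic behind the 5% example can be sketched as follows. This is a minimal illustration, assuming the total response count and the 5xx count for the month are already available (for instance, from your API's metrics); the function name and the sample figures are hypothetical.

```python
def error_budget_consumed(total_responses: int,
                          error_responses: int,
                          budget_ratio: float = 0.05) -> float:
    """Return the fraction of the error budget used so far.

    1.0 means the budget is fully used; values above 1.0 mean it is exceeded.
    budget_ratio is the permitted share of failed responses (0.05 == 5%).
    """
    if not 0 < budget_ratio <= 1:
        raise ValueError("budget_ratio must be in the range (0, 1]")
    if total_responses <= 0:
        # No traffic yet, so none of the budget has been spent.
        return 0.0
    # Number of failed responses the budget allows for this volume of traffic.
    allowed_errors = budget_ratio * total_responses
    return error_responses / allowed_errors


# Hypothetical month: 1,000,000 responses, of which 20,000 were 5xx errors.
consumed = error_budget_consumed(1_000_000, 20_000)
print(f"Error budget consumed: {consumed:.0%}")
```

With these sample figures the error rate is 2%, so roughly 40% of a 5% budget has been spent and 60% remains for the rest of the month.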

Error budgets can be used to give your engineering team the permission to release bugs into production. As you’ve seen in Chapters 6 and 7, the balance between stability and delivery speed is crucial to sustaining a resilient and useful application in the long term. Your engineers must have the ability to deliver code safely without being slowed down by excessive test suites. In this way, error budgets are a limiting force and can be used to establish reasonable thresholds for shipping bugs to users.

A surplus in your error budget gives you the confidence to go ahead and deliver new features or improvements to your users rather than trying to squash bugs. Conversely, if you are regularly exceeding your error budget, this is a clear indication that you need to shift your efforts to improving stability and performance.
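One way to make this decision explicit is a small helper that looks at recent budget consumption. The sketch below is illustrative only: it assumes you record, for each period, the fraction of the error budget consumed (1.0 meaning fully used), and it interprets "regularly exceeding" as a hypothetical threshold of two or more breaches in the periods supplied. Both the function name and the threshold are assumptions, not something prescribed by the text.

```python
def recommended_focus(budget_consumption: list[float],
                      breach_limit: int = 2) -> str:
    """Suggest where engineering effort should go.

    budget_consumption holds, per period, the fraction of the error budget
    used (values above 1.0 mean the budget was exceeded in that period).
    breach_limit is a hypothetical definition of "regularly exceeding".
    """
    breaches = sum(1 for used in budget_consumption if used > 1.0)
    if breaches >= breach_limit:
        return "stability"  # Budget regularly exceeded: fix reliability first.
    return "features"       # Budget in surplus: keep delivering.


# Hypothetical history for the last four months.
print(recommended_focus([0.4, 0.7, 0.9, 0.6]))   # features
print(recommended_focus([1.2, 0.8, 1.5, 1.1]))   # stability
```

The threshold is a team decision; the point is that the budget, rather than intuition, drives the choice between shipping and stabilizing.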
