Everything Fails All the Time: Fault Tolerance and Recovery – Operating Serverless


As you design your serverless architecture and build your functions, workflows, and microservices, you should always code for failure. Coding for failure requires that you keep in mind the mechanisms and strategies available to you for recovering from failures. These range from a try/catch in a Lambda function to retry and replication configuration in your infrastructure code. By coding for failure, you ensure that the execution and operation of your code in a production environment will be tolerant of bugs.

Coding for failure involves writing clear, maintainable, and debuggable code. In serverless, this includes careful separation of concerns between your business logic and AWS managed services, keeping Lambda functions and Step Functions workflows simple, and leveraging service integrations. For advice on how to write better Lambda functions and infrastructure code, see Chapter 6.

There are two broad categories of failures:

Transient faults

These faults can be automatically retried and usually succeed in time. Examples include third-party service downtime and network connection issues.

Permanent faults

A transient fault becomes a permanent fault after the retries are exhausted. These faults are routed to dead letter queues for manual inspection or to trigger automated reprocessing at a later time or after a known issue is resolved. Permanent faults can also include processes and requests that cannot be retried, such as synchronous processes like customer payments once the customer is no longer actively sending a request. Lastly, permanent faults also incorporate data loss or destruction of cloud resources where recovery is possible through a restoration process, using a backup or replica.
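To make the relationship between the two categories concrete, the following is a minimal sketch, not a definitive implementation. It retries a transient fault with exponential backoff and jitter, and once the retries are exhausted it treats the fault as permanent and records the event for later inspection. The names TransientError, process_with_retries, and dead_letters are hypothetical; the in-memory list stands in for a real dead letter queue such as SQS.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for faults worth retrying (e.g. a timeout)."""


def process_with_retries(event, handler, dead_letters, max_attempts=3,
                         base_delay=0.1, max_delay=2.0, sleep=time.sleep):
    """Run handler(event), retrying transient faults and dead-lettering the rest.

    dead_letters is a plain list standing in for a real dead letter queue.
    Returns the handler's result, or None if the event was dead-lettered.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return handler(event)
        except TransientError as exc:
            if attempt == max_attempts:
                # Retries exhausted: the transient fault is now permanent.
                dead_letters.append(
                    {"event": event, "error": str(exc), "attempts": attempt}
                )
                return None
            # Exponential backoff with jitter: wait a random time between
            # zero and a cap that doubles with every failed attempt.
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, cap))
        except Exception as exc:
            # Any other error is assumed to be non-retryable, so it is
            # dead-lettered immediately without further attempts.
            dead_letters.append(
                {"event": event, "error": str(exc), "attempts": attempt}
            )
            return None
```

Passing the sleep function as a parameter is a design choice that keeps the sketch testable: a test can supply a no-op instead of actually waiting between attempts.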

Most AWS managed services and SDK libraries offer built-in mechanisms to help your application recover from failures. The following list gives you some examples:

AWS SDK

You can configure the maximum number of retries to attempt and the backoff rate for each AWS SDK client your application uses. See Marc Brooker’s AWS blog post for more details about exponential backoff and jitter.

Step Functions

When using direct integrations with managed services, retries and backoff can be configured in the same way as when using the AWS SDK in your application code. See the Step Functions documentation for details.

EventBridge

Each EventBridge rule should be configured with an SQS dead letter queue to store undeliverable events for automated or manual retry at a later time. You should also consider attaching an event archive to allow for replaying events across a certain period of time. Events can also be sent across Regions, from sources in one AWS Region to destinations in another, to help synchronize data in cross-Region data stores. See “Multi-Account, Multi-Region: Is It Worth It?” on page 373 for more details on when to consider adopting multi-Region architectures.

DynamoDB

Tables can be replicated across AWS Regions using global tables, and point-in-time recovery can be used to restore tables after accidental operations are performed. Enabling the deletion protection setting will prevent inadvertent deletion of a table via the AWS Console or CloudFormation.
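As an illustration of the built-in mechanisms listed above, here is a sketch of a Step Functions task state that uses a direct DynamoDB integration with retry and catch configuration, written as a Python dictionary that can be serialized to Amazon States Language JSON. The table name, the input paths, and the SendToDeadLetterQueue state are assumptions made for this example; retry_delays is a hypothetical helper that only shows how the interval and backoff rate combine.

```python
import json


def retry_delays(interval_seconds, backoff_rate, max_attempts):
    """Return the wait in seconds before each retry.

    The first retry waits interval_seconds; each later retry multiplies
    the previous wait by backoff_rate.
    """
    return [interval_seconds * backoff_rate ** i for i in range(max_attempts)]


# Hypothetical state: the table name, input paths, and the name of the
# fallback state are illustrative, not taken from a real workflow.
save_order_state = {
    "Type": "Task",
    "Resource": "arn:aws:states:::dynamodb:putItem",
    "Parameters": {
        "TableName": "orders",
        "Item": {"orderId": {"S.$": "$.orderId"}},
    },
    "Retry": [
        {
            # Retry failed task executions up to three times.
            "ErrorEquals": ["States.TaskFailed"],
            "IntervalSeconds": 2,
            "MaxAttempts": 3,
            "BackoffRate": 2.0,
        }
    ],
    "Catch": [
        {
            # After retries are exhausted, keep the error details in the
            # state input and hand over to a fallback state.
            "ErrorEquals": ["States.ALL"],
            "ResultPath": "$.error",
            "Next": "SendToDeadLetterQueue",
        }
    ],
    "End": True,
}

retrier = save_order_state["Retry"][0]
delays = retry_delays(
    retrier["IntervalSeconds"], retrier["BackoffRate"], retrier["MaxAttempts"]
)
print(delays)  # [2.0, 4.0, 8.0]
print(json.dumps(save_order_state, indent=2))
```

With these values, the workflow would wait roughly 2, 4, and 8 seconds before its three retries, and only then follow the Catch route. Check the retry fields against the Step Functions documentation before relying on them in your own infrastructure code.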

No matter how fault tolerant your application is, persistent errors will always occur with any application of significant complexity and business criticality. If you cannot prevent these errors with testing or recovery, you must rely on your ability to observe your system and to effectively identify (and remediate) the root cause. Next, you will be i
