Observing the Health of Critical Paths
Monitoring the performance of a distributed, serverless application at a given point in time can be challenging. The sheer number of parts operating independently across services, stacks, regions, and accounts can be overwhelming. Rather than trying to monitor everything, focus on the most critical parts of your application: the parts that must be working at all times. Dig into anomalies on noncritical paths over time (refer to Chapter 7 for guidance on identifying your critical and noncritical paths).
Be mindful that a complex software system is never completely healthy. Distributed systems are unpredictable and prone to partial failures. That is the trade-off you make for the benefits. Sometimes this can even be viewed as a positive aspect. After all, you will always prefer part of your system to fail rather than all of it!
The documentation for each AWS service usually includes a monitoring section with hints about the key health metrics to track. The following is an overview of some core serverless services and their key metrics of scale and performance:
Lambda
The total count of account-level concurrent function executions is key to determining whether your Lambda usage is scaling to meet your application’s traffic (see Chapter 6 for more details on account-level concurrency). Other function-level invocation metrics, such as the counts of errors or throttles, are useful for understanding your application’s behavior and spotting potential bugs. You should also monitor memory usage and duration to understand the performance of your functions (see Chapter 6 for information about how Lambda Insights can help you analyze this data). For more information, see the Lambda documentation.
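As an illustration, the following sketch uses the AWS SDK for Python to create an alarm on the account-level ConcurrentExecutions metric. The alarm name and the threshold of 800 (roughly 80% of a hypothetical 1,000 concurrent execution quota) are assumptions you would adjust for your own account.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm when account-level concurrency approaches the (assumed) 1,000 quota.
cloudwatch.put_metric_alarm(
    AlarmName="lambda-account-concurrency-high",  # hypothetical name
    Namespace="AWS/Lambda",
    MetricName="ConcurrentExecutions",            # account-level, no dimensions
    Statistic="Maximum",
    Period=60,
    EvaluationPeriods=3,
    Threshold=800,                                # assumption: ~80% of quota
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
)
```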
API Gateway
The core metrics for your APIs are the total count of requests received by your API and the total number of 400- and 500-series errors returned. Always express alarm thresholds as a percentage of total requests rather than as an absolute count of 400 and 500 errors, so that spikes in API traffic do not distort your alerts. You should also monitor whether latency remains within the integration response limits. By default, metrics are emitted per API stage, but they can be emitted per API method if you enable the detailed metrics setting. For more information, see the API Gateway documentation.
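One way to express such a percentage threshold is a CloudWatch metric math alarm. The sketch below alarms when more than 1% of requests to a hypothetical REST API named orders-api return a 5XX error; the API name, period, and threshold are illustrative assumptions.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm on the percentage of 5XX errors rather than an absolute count.
cloudwatch.put_metric_alarm(
    AlarmName="orders-api-5xx-rate",  # hypothetical name
    EvaluationPeriods=3,
    Threshold=1.0,                    # assumption: alarm above 1% errors
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    Metrics=[
        {
            "Id": "error_rate",
            "Expression": "(errors / requests) * 100",
            "Label": "5XX error percentage",
            "ReturnData": True,
        },
        {
            "Id": "errors",
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/ApiGateway",
                    "MetricName": "5XXError",
                    "Dimensions": [{"Name": "ApiName", "Value": "orders-api"}],
                },
                "Period": 300,
                "Stat": "Sum",
            },
            "ReturnData": False,
        },
        {
            "Id": "requests",
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/ApiGateway",
                    "MetricName": "Count",
                    "Dimensions": [{"Name": "ApiName", "Value": "orders-api"}],
                },
                "Period": 300,
                "Stat": "Sum",
            },
            "ReturnData": False,
        },
    ],
)
```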
Step Functions
You can use the ExecutionsStarted and ExecutionsSucceeded metrics to monitor the expected behavior of your workflows and the ExecutionsAborted, ExecutionsFailed, ExecutionThrottled, and ExecutionsTimedOut metrics to detect issues with workflow execution. The ExecutionTime metric can be used to monitor the overall latency of your workflows. For more information, see the Step Functions documentation.
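For example, the following sketch alarms on any failed execution of a single state machine; the state machine ARN and alarm name are hypothetical.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm on any failed execution of a specific state machine.
cloudwatch.put_metric_alarm(
    AlarmName="order-workflow-failures",  # hypothetical name
    Namespace="AWS/States",
    MetricName="ExecutionsFailed",
    Dimensions=[{
        "Name": "StateMachineArn",
        # hypothetical ARN
        "Value": "arn:aws:states:eu-west-1:123456789012:stateMachine:OrderWorkflow",
    }],
    Statistic="Sum",
    Period=60,
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
)
```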
DynamoDB
You should consider setting alarms based on the ConsumedReadCapacityUnits, ConsumedWriteCapacityUnits, and ThrottledRequests metrics to be alerted to issues at scale. The UserErrors metric is also useful for indicating bugs with DynamoDB SDK client requests, such as invalid parameters. For more information, see the DynamoDB documentation.
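The sketch below alarms when read requests against a hypothetical orders table are throttled; the table name, operation, and threshold are assumptions. ThrottledRequests is reported with both TableName and Operation dimensions, so both are included here.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm when GetItem requests against the orders table are throttled.
cloudwatch.put_metric_alarm(
    AlarmName="orders-table-read-throttles",  # hypothetical name
    Namespace="AWS/DynamoDB",
    MetricName="ThrottledRequests",
    Dimensions=[
        {"Name": "TableName", "Value": "orders"},  # hypothetical table
        {"Name": "Operation", "Value": "GetItem"},
    ],
    Statistic="Sum",
    Period=60,
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
)
```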
SQS
One of the key SQS metrics is ApproximateAgeOfOldestMessage. You should consider configuring an alarm for this metric with a threshold that allows you to take action before the message exceeds the maximum retention period of the queue. For more information, see the SQS documentation.
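As a sketch, the alarm below fires when the oldest message in a hypothetical orders queue is more than one hour old, leaving plenty of time to act before a default four-day retention period expires; the queue name and threshold are assumptions.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm when the oldest queued message is more than one hour old.
cloudwatch.put_metric_alarm(
    AlarmName="orders-queue-message-age",  # hypothetical name
    Namespace="AWS/SQS",
    MetricName="ApproximateAgeOfOldestMessage",
    Dimensions=[{"Name": "QueueName", "Value": "orders-queue"}],  # hypothetical queue
    Statistic="Maximum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=3600,  # assumption: one hour, well below the retention period
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
)
```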
Kinesis Firehose
You can monitor the expected volume of data being processed by a stream using the IncomingBytes and IncomingRecords metrics. To ensure data is moving through your stream efficiently, you should monitor the destination-specific DataFreshness and Success metrics (for example, DeliveryToS3.DataFreshness for an S3 destination). Throttling of stream ingestion can be detected using the ThrottledRecords metric. For more information, see the Kinesis Firehose documentation.
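For instance, assuming a delivery stream with an S3 destination, the sketch below alarms when data freshness exceeds 15 minutes; the stream name and threshold are assumptions.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm when the age of the oldest undelivered record exceeds 15 minutes.
cloudwatch.put_metric_alarm(
    AlarmName="clickstream-delivery-freshness",  # hypothetical name
    Namespace="AWS/Firehose",
    MetricName="DeliveryToS3.DataFreshness",     # assumes an S3 destination
    Dimensions=[{"Name": "DeliveryStreamName", "Value": "clickstream"}],  # hypothetical
    Statistic="Maximum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=900,  # assumption: 15 minutes, in seconds
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
)
```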
EventBridge
EventBridge emits various metrics that can be used to determine the performance of your rules and the delivery of events to targets. For example, you can track the TriggeredRules metric to understand whether your rules are being triggered at expected levels based on the volume of upstream requests in your application, while the DeadLetterInvocations and FailedInvocations metrics can be used to understand whether your targets are failing to receive events.
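The sketch below alarms on any failed delivery to the targets of a single rule; the rule name and alarm name are hypothetical.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm when EventBridge fails to deliver events to a rule's targets.
cloudwatch.put_metric_alarm(
    AlarmName="order-created-rule-delivery-failures",  # hypothetical name
    Namespace="AWS/Events",
    MetricName="FailedInvocations",
    Dimensions=[{"Name": "RuleName", "Value": "order-created"}],  # hypothetical rule
    Statistic="Sum",
    Period=60,
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
)
```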
Synthetic Monitoring for Low-Traffic Critical Paths
Synthetic monitoring involves sending artificial traffic to your application in a production environment. It is a form of what is known as testing in production, where the functionality of the system under test is verified post-deployment, under production conditions and against production services and databases. Synthetic monitoring can be used to implement a simple “heartbeat” check or to simulate full user journeys. Scripts are usually executed periodically. For example, a synthetic monitor might make an HTTP request to an API endpoint every 15 minutes to check if a 200 response is received.
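As a minimal sketch, a heartbeat check like the one just described could be implemented as a Lambda function scheduled by an EventBridge rule. The endpoint URL is a hypothetical placeholder, and a failed check is raised as an exception so that it surfaces in the function’s Errors metric.

```python
import urllib.request

# Hypothetical endpoint for the heartbeat check.
HEALTH_CHECK_URL = "https://api.example.com/orders/health"

def handler(event, context):
    """Scheduled (e.g., every 15 minutes) heartbeat against a critical API path."""
    request = urllib.request.Request(HEALTH_CHECK_URL, method="GET")
    with urllib.request.urlopen(request, timeout=10) as response:
        if response.status != 200:
            # Raising an error fails the invocation, which can then trigger an alarm.
            raise RuntimeError(f"Heartbeat failed with status {response.status}")
    return {"status": "healthy"}
```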
The predominant use case for synthetic monitoring is low-traffic critical paths. For example, you may have an API resource that is only used between certain times or irregularly throughout the day that you want to ensure is operating correctly when it is eventually required. Or perhaps you have a daily or nightly batch process, such as an Amazon Macie pipeline or data lake export job, that you cannot afford to have fail when the time comes.
Typically, you would not use a synthetic monitor to track the health of your high-traffic critical paths as it is unlikely the synthetic traffic would surface any problems that real user traffic would not.
If you have a good use case for synthetic monitoring, you should consider implementing and executing your test scripts with Amazon CloudWatch Synthetics, a fully managed synthetic monitoring solution.