Service level objectives
Service level objectives (SLOs) are performance targets that indicate how much your service can fail before the experience of your users is significantly degraded. SLOs are based on the realization that you cannot operate your product at 100% success all the time, and at some point your users will encounter issues. The goal is to determine the amount of unreliability they will tolerate and, rather than striving for the impossible goal of perfection, to ensure the service always operates at least at that level.
SLOs are informed by service level indicators (SLIs). An SLI is essentially a binary measure of whether your service is performing well or not. For example, you could establish an SLI for your API’s response time with a threshold of 6 seconds: all requests that are responded to within 6 seconds are considered “good,” and any outside this limit are “bad.” You could then set an SLO of 99.98%, meaning your objective is to respond to 99.98% of API requests within 6 seconds.
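As a rough illustration of this relationship, the following Python sketch classifies requests against the 6-second SLI threshold and compares the result to the 99.98% SLO. The latency values are hypothetical:

SLI_THRESHOLD_SECONDS = 6.0   # responses within this limit are "good"
SLO_TARGET = 0.9998           # objective: 99.98% of requests are good

def slo_compliance(response_times_seconds):
    """Return the fraction of requests that met the SLI threshold."""
    good = sum(1 for t in response_times_seconds if t <= SLI_THRESHOLD_SECONDS)
    return good / len(response_times_seconds)

latencies = [0.8, 1.2, 7.5, 0.4, 2.1]   # hypothetical sample of response times
compliance = slo_compliance(latencies)
print(f"SLI compliance: {compliance:.2%}, SLO met: {compliance >= SLO_TARGET}")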
For a comprehensive guide to SLOs, we recommend Alex Hidalgo’s book Implementing Service Level Objectives (O’Reilly).
When establishing an SLO for a service or feature, you should set the threshold slightly lower than the level at which a user may begin to experience unacceptable levels of frustration or discontent. In this way, you can normalize a certain level of failure (among users and engineers alike) without reaching a point where confidence or trust is negatively impacted.
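To make this concrete, a target like the 99.98% SLO above implies an error budget of 0.02%. Assuming, for illustration, a 30-day evaluation window, that works out to roughly 8.6 minutes of “bad” responses per month, or 2 bad requests in every 10,000; your own window and tolerance will depend on your users.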
Decoupling what from why
In Observability Engineering, the authors state: “Decoupling ‘what’ from ‘why’ is one of the most important distinctions in writing good monitoring with maximum signal and minimum noise.” Your alerts should only tell you what is wrong. After receiving an alert, it is then up to you to discover why something went wrong, using your logs, traces, metrics, and dashboards.
If you attempt to use your alerts to tell you why something went wrong, you are likely to end up with a barrage of meaningless, unactionable alerts. This is usually the case when using system-level metrics and thresholds to trigger alarms.
Consider one of the metrics used to indicate the health of a Kinesis Firehose stream with an S3 target: DeliveryToS3.Success (the proportion of put requests to S3 that succeeded). An alarm based on this metric could indicate that something is wrong with data delivery to S3 from the Firehose stream. This could be due to issues with the target S3 bucket, the stream’s permissions, the S3 or Kinesis services themselves, and so on. However, it could also be an external issue, such as excessive demand on the stream. Nor does it tell you whether there has been an unintended drop in the number of records being put on the stream.
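For illustration, a system-level alarm of this kind might be configured with a boto3 sketch like the following. The alarm name, delivery stream name, evaluation period, and SNS topic ARN are placeholders, and the exact values would depend on your setup:

import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="orders-firehose-s3-delivery",   # hypothetical alarm name
    Namespace="AWS/Firehose",
    MetricName="DeliveryToS3.Success",
    Dimensions=[{"Name": "DeliveryStreamName", "Value": "orders-stream"}],  # placeholder stream
    Statistic="Average",
    Period=300,                                # evaluate over 5-minute windows
    EvaluationPeriods=1,
    Threshold=1.0,                             # any failed S3 put drops the ratio below 1
    ComparisonOperator="LessThanThreshold",
    AlarmActions=["arn:aws:sns:..."],          # replace with your SNS topic ARN
)

Such an alarm can tell you that S3 delivery is failing, but nothing about why, and a fleet of similar metric-level alarms quickly becomes the barrage of noise described above.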
Instead of specific, system-level metrics, SLOs can be used to configure alarms and determine when you should receive an alert. An SLO-based alert will simply tell you that the critical service or capability being monitored is not working as expected. You can then utilize the core analysis loop (described later in this chapter) to debug from first principles and form a holistic view of the potential root causes.
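A minimal sketch of what an SLO-based check could look like, assuming counts of “good” and “bad” events for the monitored capability are already being recorded; the counts shown are hypothetical:

SLO_TARGET = 0.9998

def should_alert(good_events, bad_events):
    """Alert when the observed success ratio drops below the SLO target."""
    total = good_events + bad_events
    if total == 0:
        return False
    return (good_events / total) < SLO_TARGET

# Example: 49,985 good and 15 bad requests in the evaluation window
print(should_alert(49_985, 15))   # True: 99.97% is below the 99.98% objective

The alert carries no assumption about the cause; it only signals that the capability is outside its objective, leaving the “why” to your logs, traces, metrics, and dashboards.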