Critical Health Dashboard
Just as you saw with testing in Chapter 7, operations can benefit from a focus on your critical paths. But even your critical paths will have aspects that are more important than others when it comes to assessing operational health and performance at scale.
You can apply the RED method to ascertain the critical health of your system and services:
Rate
The rate of requests being received
Errors
The number of requests that are failing
Duration
The time taken to respond to requests
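To make the three RED measurements concrete, the following minimal sketch derives them from a batch of request records collected over one observation window. The `RequestRecord` shape and the `red_summary` helper are hypothetical names introduced for illustration. In practice these values would usually come from your monitoring service rather than from hand-rolled code.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass
class RequestRecord:
    """One handled request (hypothetical shape, for illustration only)."""
    duration_ms: float
    failed: bool


def red_summary(records: list[RequestRecord], window_seconds: float) -> dict:
    """Summarize Rate, Errors, and Duration for one observation window."""
    total = len(records)
    errors = sum(1 for record in records if record.failed)
    durations = sorted(record.duration_ms for record in records)

    # quantiles() needs at least two data points, so fall back to the
    # single observed value (or zero) for very small windows.
    if total >= 2:
        p95 = quantiles(durations, n=20, method="inclusive")[-1]
    else:
        p95 = durations[0] if durations else 0.0

    return {
        "rate_per_second": total / window_seconds,        # Rate
        "error_count": errors,                            # Errors
        "error_ratio": errors / total if total else 0.0,  # Errors, normalized
        "p95_duration_ms": p95,                           # Duration
    }
```

Reporting a percentile for duration, rather than an average, is a deliberate choice in this sketch: a small number of very slow requests can be invisible in a mean but still matter to the users who experience them.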
When launching a new feature, service, or product, it can be useful to create a critical health dashboard with charts showing the rates, errors, and durations for your core components (see Figure 8-3). This dashboard should collect all of the key metrics across your application or microservice to provide a single view of system health. At a glance, you can then immediately answer the question, “Is everything working?” Of course, in a distributed serverless application there will always be plenty of nuance and hidden elements to overall health, but this can at least provide a place to start your assessment of critical health.
Figure 8-3. A critical health dashboard
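As one possible way to define such a dashboard in code, the sketch below builds a dashboard body with a row of rate, errors, and duration charts for each core component. It assumes the components are AWS Lambda functions observed through Amazon CloudWatch, using the `Invocations`, `Errors`, and `Duration` metrics from the `AWS/Lambda` namespace. The helper names, function names, and region are hypothetical placeholders.

```python
import json


def red_widgets(function_name: str, region: str) -> list[dict]:
    """Build three metric widgets (rate, errors, duration) for one function."""
    charts = [
        ("Rate (invocations)", "Invocations", "Sum"),
        ("Errors", "Errors", "Sum"),
        ("Duration (p95, ms)", "Duration", "p95"),
    ]
    widgets = []
    for index, (title, metric, stat) in enumerate(charts):
        widgets.append({
            "type": "metric",
            # The dashboard grid is 24 units wide, so three widgets of
            # width 8 fill exactly one row.
            "x": index * 8, "y": 0, "width": 8, "height": 6,
            "properties": {
                "title": f"{function_name}: {title}",
                "metrics": [["AWS/Lambda", metric, "FunctionName", function_name]],
                "stat": stat,
                "period": 60,
                "region": region,
            },
        })
    return widgets


def dashboard_body(function_names: list[str], region: str) -> str:
    """Stack one row of RED widgets per function and serialize to JSON."""
    widgets = []
    for row, name in enumerate(function_names):
        for widget in red_widgets(name, region):
            widget["y"] = row * 6  # place each function on its own row
            widgets.append(widget)
    return json.dumps({"widgets": widgets})
```

The resulting JSON string could then be supplied as the dashboard body when creating the dashboard, for example through infrastructure as code or the CloudWatch `PutDashboard` API.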
After releasing a code change into production, you can use your dashboards to monitor the impact of the change and spot any immediate bugs. However, you should rely on your alerts to catch emerging bugs and transient faults over time, rather than constantly watching your dashboards.
Use your critical health dashboard to spot potentially anomalous performance. Look for spikes and unusual curves in the charts, alarms that have been triggered, and incidents reported on third-party status pages.
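What counts as a spike depends on the metric and its normal variation. As a deliberately naive illustration (not the algorithm used by any particular monitoring product), the sketch below flags a data point as a spike when it exceeds the mean of the preceding window by more than a chosen number of standard deviations. The `find_spikes` name and its default parameters are assumptions made for this example.

```python
from statistics import mean, stdev


def find_spikes(
    values: list[float],
    window: int = 5,
    threshold: float = 3.0,
    min_sigma: float = 1.0,
) -> list[int]:
    """Return indexes of points far above the trailing-window baseline."""
    spikes = []
    for i in range(window, len(values)):
        baseline = values[i - window:i]
        mu = mean(baseline)
        # A perfectly flat baseline has zero deviation, so apply a floor
        # to avoid flagging every tiny increase as a spike.
        sigma = max(stdev(baseline), min_sigma)
        if values[i] > mu + threshold * sigma:
            spikes.append(i)
    return spikes
```

For example, given a series of p95 durations hovering around 100 ms with a single reading of 400 ms, only the index of that reading is returned.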
A critical health dashboard can also include information about the services in use, the key metrics being monitored, and links (or even dynamic data if there is a suitable API provided) to third-party status pages. Building out a comprehensive dashboard in this way will enable a wide pool of people in your organization to use it to make assessments, without needing deep working knowledge or experience of the system being observed.
Although a critical health dashboard can provide instant reassurance in certain high-pressure scenarios, like product launches and sales events, you should keep in mind that it is more useful (and accurate) to operate and observe your serverless application as a set of distinct applications.
If you follow the guidance provided throughout this book, you will take the utmost care to decouple your serverless microservices. This deliberate isolation should continue into operation, and you should resist trying to monitor decoupled services as one whole system.
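One reason to assess each service separately is that a system-wide aggregate can hide a failing low-traffic service behind healthy high-traffic ones. The following sketch illustrates this with hypothetical service names and request counts. The helper functions are assumptions made for this example, not part of any library.

```python
# Maps a service name to (total requests, failed requests) for one window.
ServiceStats = dict[str, tuple[int, int]]


def error_ratio(requests: int, errors: int) -> float:
    """Fraction of requests that failed (zero when there was no traffic)."""
    return errors / requests if requests else 0.0


def aggregate_error_ratio(stats: ServiceStats) -> float:
    """Error ratio when every service is treated as one whole system."""
    total_requests = sum(requests for requests, _ in stats.values())
    total_errors = sum(errors for _, errors in stats.values())
    return error_ratio(total_requests, total_errors)


def unhealthy_services(stats: ServiceStats, max_error_ratio: float) -> list[str]:
    """Names of services whose own error ratio exceeds the limit."""
    return sorted(
        name
        for name, (requests, errors) in stats.items()
        if error_ratio(requests, errors) > max_error_ratio
    )


# Hypothetical figures: checkout is failing 40% of its requests.
stats = {
    "checkout": (100, 40),
    "catalog": (10_000, 10),
    "search": (5_000, 5),
}
```

With these figures and a 1% limit, `aggregate_error_ratio(stats)` is roughly 0.4% and appears healthy, while `unhealthy_services(stats, 0.01)` returns `["checkout"]`.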