Using Distributed Tracing to Understand the Whole System – Operating Serverless

Using Distributed Tracing to Understand the Whole System

The most common challenge with understanding the health of a serverless system is the inherent distribution of compute across microservices and managed services. This is also true when that health is diminished and the system is experiencing an issue that requires debugging and remediation. The traditional means of determining the root cause of an issue is to analyze application logs. In non-serverless applications, logs are typically emitted from a single process and tell a linear, chronological story of a transaction in one stream, without needing to be augmented by any other information to provide an understanding of the complete picture.

In contrast, logs from disparate services can be difficult to correlate and order chronologically. Logs are also only as useful as the data they contain, and missing or incomplete logs may mean you only see part of the problem. In a serverless application, there are logs that you control—primarily from Lambda functions—and logs from managed services. Some managed services allow you to customize the logs they emit (such as API Gateway access logs) but most do not.
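To make the correlation problem concrete, here is a minimal sketch of stitching log records from separate streams into one timeline per request. Everything in it is an assumption for illustration: the service names, the record shape, and especially the `correlation_id` field, which only exists if every service you control was written to log it.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical records from three separate log streams (illustrative data only).
api_logs = [
    {"ts": "2024-01-01T10:00:00.120+00:00", "service": "api",
     "correlation_id": "req-1", "msg": "request received"},
    {"ts": "2024-01-01T10:00:00.150+00:00", "service": "api",
     "correlation_id": "req-2", "msg": "request received"},
]
order_logs = [
    {"ts": "2024-01-01T10:00:00.300+00:00", "service": "orders",
     "correlation_id": "req-1", "msg": "order saved"},
]
payment_logs = [
    {"ts": "2024-01-01T10:00:00.210+00:00", "service": "payments",
     "correlation_id": "req-1", "msg": "payment authorized"},
    # A record without a correlation ID cannot be tied to any request.
    {"ts": "2024-01-01T10:00:00.250+00:00", "service": "payments",
     "msg": "card declined"},
]


def correlate(*streams):
    """Group records by correlation ID, then order each group by timestamp."""
    grouped = defaultdict(list)
    for stream in streams:
        for record in stream:
            grouped[record.get("correlation_id", "unknown")].append(record)
    for records in grouped.values():
        records.sort(key=lambda r: datetime.fromisoformat(r["ts"]))
    return dict(grouped)


timeline = correlate(api_logs, order_logs, payment_logs)
for record in timeline["req-1"]:
    print(record["ts"], record["service"], record["msg"])
```

The record that lands under `"unknown"` is the important part: a single log line without the shared identifier drops out of the story, which is exactly the "missing or incomplete logs" problem described above.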

Prefer traces to logs

A robust tracing setup will always tell the full story across entire distributed systems, through owned and managed services. Logs can tell you what went wrong, but you first have to know where to look. Traces can tell you where something went wrong, and then you can dig deeper.
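The idea that a trace tells you where to look can be sketched with a deliberately simplified span model. This is an assumption-laden illustration, not the data format of any particular tracing tool: the `Span` class, its fields, and the sample services are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Span:
    """One unit of work in a trace (hypothetical, simplified model)."""
    span_id: str
    parent_id: Optional[str]  # None marks the root span of the trace
    service: str
    duration_ms: int
    error: bool = False


# A hypothetical trace for one request passing through several services.
trace = [
    Span("a", None, "API Gateway", 480, error=True),
    Span("b", "a", "Lambda: create-order", 450, error=True),
    Span("c", "b", "S3", 12),
    Span("d", "b", "DynamoDB", 400, error=True),
]


def deepest_failing_span(spans: List[Span]) -> Optional[Span]:
    """Return the first failing span that has no failing children.

    Errors propagate up to the callers, so a failing span with no failing
    child is a likely origin of the problem.
    """
    parents_of_failures = {s.parent_id for s in spans if s.error}
    for span in spans:
        if span.error and span.span_id not in parents_of_failures:
            return span
    return None


origin = deepest_failing_span(trace)
print(origin.service)  # prints "DynamoDB"
```

With the failing service identified, the logs for that one service and time window become the place to dig deeper, rather than the starting point of the search.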

To achieve an effective level of serverless observability, you must move from a reactive approach to monitoring that relies on logs and dashboards to a proactive approach that fully leverages traces.

You should be aware that your application’s observability can be impacted by your architectural decisions. For example, you cannot initialize traces when using API Gateway HTTP APIs.
