Identifying the Units of Scale
Most of the serverless managed services you will use in your architecture provide automatic scaling and pay-per-use billing. This means you will be able to scale your application to meet demand while paying only for the resources you use, when you use them. For example, if one of your Lambda functions is invoked once in an hour, you will pay only for that single invocation. Likewise, if the function is not invoked during a month, it will not appear on your bill. Conversely, if that function suddenly becomes very popular and is invoked thousands of times in a day, you will not have to change its configuration or code, and you will be billed for that usage.
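To make the pay-per-use arithmetic concrete, the following is a minimal sketch of a monthly cost estimate for a single function. The `PayPerUseRates` values and the `estimate_monthly_cost` helper are hypothetical and for illustration only: the rates are placeholders rather than published AWS prices, and the sketch ignores any free tier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PayPerUseRates:
    """Illustrative placeholder rates (assumptions, not published AWS prices)."""
    per_million_requests: float = 0.20
    per_gb_second: float = 0.0000166667


def estimate_monthly_cost(
    invocations: int,
    avg_duration_ms: float,
    memory_mb: int,
    rates: PayPerUseRates = PayPerUseRates(),
) -> float:
    """Estimate the cost of a function billed per request and per GB-second."""
    # Charge for the number of requests made.
    request_cost = invocations / 1_000_000 * rates.per_million_requests
    # Charge for compute: duration in seconds multiplied by memory in GB.
    gb_seconds = invocations * (avg_duration_ms / 1000) * (memory_mb / 1024)
    compute_cost = gb_seconds * rates.per_gb_second
    return request_cost + compute_cost


# No invocations means no charge; cost then grows in proportion to usage.
idle_month = estimate_monthly_cost(0, avg_duration_ms=100, memory_mb=128)
busy_month = estimate_monthly_cost(90_000, avg_duration_ms=100, memory_mb=128)
```

Under these assumptions the cost is strictly proportional to the number of invocations, which is why an idle function contributes nothing to the bill and a busy one needs no configuration change.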
Non-serverless scaling involves monitoring the CPU and bandwidth utilization of resources, or the remaining disk space available, and adding servers or clusters to cope with demand. While these concerns are alleviated by serverless, there are still units of scale in an autoscaling application; they are simply different units. With serverless, the units of scale will depend on the type of resource (e.g., compute, storage, application integration) and the quota stipulated by the managed service. For example, Lambda scales in terms of function execution concurrency (see Chapter 6 for more information), Kinesis streams have input and delivery buffer limits, DynamoDB restricts read and write capacity, and API Gateway tracks response latency.
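As a rough illustration of concurrency as a unit of scale, the number of concurrent executions a function needs can be approximated as the average request rate multiplied by the average execution duration. The sketch below assumes a steady request rate; the helper names and the `ASSUMED_CONCURRENCY_QUOTA` value are hypothetical, so check the actual quota for your own account and Region.

```python
import math

# Assumed quota for illustration only; look up the real value for your account.
ASSUMED_CONCURRENCY_QUOTA = 1000


def required_concurrency(requests_per_second: float, avg_duration_seconds: float) -> int:
    """Approximate concurrent executions as request rate times average duration."""
    return math.ceil(requests_per_second * avg_duration_seconds)


def fits_within_quota(
    requests_per_second: float,
    avg_duration_seconds: float,
    quota: int = ASSUMED_CONCURRENCY_QUOTA,
) -> bool:
    """Return True if the estimated concurrency stays within the quota."""
    return required_concurrency(requests_per_second, avg_duration_seconds) <= quota


# 100 requests per second at 0.5 seconds each needs about 50 concurrent executions.
steady_load = required_concurrency(100, 0.5)
# 5,000 requests per second at 0.5 seconds each would exceed the assumed quota.
spike_fits = fits_within_quota(5000, 0.5)
```

The point of the estimate is that halving the execution duration halves the concurrency needed for the same traffic, so function optimization directly affects how close you run to the quota.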
Each AWS managed service enforces an implicit contract of usage: the service provides features and performance guarantees in line with acceptable usage within documented quotas. These constraints force you to be creative with your architectural decisions and designs and can, in many cases, promote sensible and optimal use of managed services.
Scaling Communication
Non-serverless scaling efforts typically involve monitoring and optimizing public API requests and responses. These APIs need to be available to continually accept requests at peak scale and respond with minimum latency.
A serverless application will leverage managed services such as API Gateway and Lambda to form a similar architecture. Lambda functions will still need to be optimized and API resources will still need to be monitored for availability and latency, but the focus shifts.
As seen in Chapter 6, serverless applications are increasingly becoming a mix of business logic and application integration infrastructure. In an architecture where managed services are used to transport messages, transform data, and emit events, the scalability of the system depends on its ability to handle this throughput efficiently. Serverless shifts scale away from APIs and compute to events and messages—that is, to communication.
You should become very familiar with the quota pages in the documentation for the managed services you are using, such as the Step Functions quota page, and pay particular attention to whether a limit is soft or hard (i.e., whether it can be increased or not). Understanding service limits should be part of your solution design process and cost estimation (see Chapter 9). Annotate your architecture diagrams with expected usage and the limits that apply. Analysis of bills or production incidents must include inspection of the related service limits and pricing, and quotas can drive any consequent optimizations.
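One way to make this review repeatable is to record each quota next to the expected peak usage and flag the ones that leave little headroom. The following is a minimal sketch under stated assumptions: the `Quota` type, the `review_usage` helper, the 80% warning threshold, and the example quota values are all hypothetical and are not taken from any AWS quota page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quota:
    service: str
    name: str
    limit: float
    adjustable: bool  # True for a soft limit that can be increased


def review_usage(quota: Quota, expected_peak: float, warn_ratio: float = 0.8) -> str:
    """Classify expected peak usage against a quota."""
    if quota.limit <= 0:
        raise ValueError("quota limit must be positive")
    utilization = expected_peak / quota.limit
    if utilization < warn_ratio:
        return "ok"
    # Near or over the limit: a soft limit can be raised, a hard limit cannot.
    return "request increase" if quota.adjustable else "redesign"


# Hypothetical quotas and usage figures, for illustration only.
soft_quota = Quota("example-service", "concurrent executions", 1000, adjustable=True)
hard_quota = Quota("example-service", "payload size (KB)", 256, adjustable=False)

findings = {
    soft_quota.name: review_usage(soft_quota, expected_peak=900),
    hard_quota.name: review_usage(hard_quota, expected_peak=300),
}
```

Separating soft from hard limits in the result matters because the remedies differ: a soft limit near its ceiling calls for a quota increase request, whereas a hard limit calls for a change to the design.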