Effective error handling strategies for Lambda invocation models
1. Errors in Lambda
Lambda, being a compute service, is a fundamental building block of any AWS serverless architecture. Although humans can invoke Lambda functions manually, other AWS services will generally trigger them as part of an architecture.
Implementing a solid retry strategy is essential for operating robust serverless applications. When AWS services trigger our Lambda functions, we have less control over the retry strategy than when we write and run code on non-serverless compute resources.
But how can we control retry behaviours in these scenarios? Do these services have a built-in retry mechanism? If so, what configuration options are available? If not, what can we do to automate the retry process?
This post is a high-level overview of the error-handling options for the various Lambda invocation models.
2. Error handling scenarios for different invocation models
AWS services trigger Lambda differently, leading to various scenarios and configuration options.
Triggers can invoke Lambda in three different ways, known as invocation models. Using the same error-handling strategy for all invocation scenarios would be a mistake, as these models differ significantly in their nature.
Let’s explore the configuration options and strategies available for each.
2.1. Synchronous invocation
In synchronous or request-response invocation, the Lambda service forwards the request directly to the function. The function runs the code and returns the response to the Lambda service, which then passes it on to the calling client. Services like API Gateway, Application Load Balancer, and Cognito invoke Lambda this way.
For example, when API Gateway invokes a Lambda function, it will not retry the invocation if an error occurs. Instead, it returns the error to the client that called the API endpoint. Therefore, the client, whether a browser or another service’s compute resource, must implement the retry strategy.
In a browser, it’s a good idea to show the user an error message that indicates whether it’s worth trying again. We can also automate error handling with the retry mechanisms that front-end frameworks and libraries usually provide.
For service-to-service architectures, we can implement various retry solutions, such as sending an error message to an SQS queue for Lambda to retrieve and process.
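When the caller is another service, the client-side retry strategy could look like the minimal sketch below, which uses exponential backoff with jitter. The names call_with_retries and RetryableError are hypothetical and not part of any AWS SDK; the request callable is assumed to raise RetryableError for errors worth retrying (for example, a 5xx response from the API).

```python
import random
import time


class RetryableError(Exception):
    """Hypothetical error type: raised by the request callable for errors worth retrying."""


def call_with_retries(request, max_attempts=3, base_delay=0.2, sleep=time.sleep):
    """Call request() until it succeeds or max_attempts is reached.

    Between attempts, wait base_delay * 2**attempt seconds plus random jitter.
    """
    for attempt in range(max_attempts):
        try:
            return request()
        except RetryableError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            delay = base_delay * (2 ** attempt)
            # Jitter spreads out retries from many clients failing at once.
            sleep(delay + random.uniform(0, delay))
```

The sleep parameter is injectable only to make the helper easy to test; in production code the default time.sleep is enough.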
2.2. Asynchronous invocation
In event-based architectures, services like SNS, EventBridge, or S3 invoke Lambda asynchronously. Error handling configurations are more complex in this case.
With asynchronous invocations, Lambda places incoming messages in an internal queue and returns a 202 Accepted status code to the event source. Another process in the Lambda service polls this internal queue and invokes the function with the messages.
Errors can occur in two scenarios.
First, the message doesn’t reach Lambda’s internal queue due to an error (e.g., network conditions). This scenario is shown on the left side of the dashed line. The Lambda service returns an error code instead of 202. In this case, the retry policy of the event source (e.g., SNS) determines how message delivery is retried.
In the second scenario, SNS successfully delivers the message to Lambda, but the function then fails with either an invocation error (the function doesn’t run) or a function error (the function runs but throws an exception).
Lambda manages how frequently messages in the internal queue are processed. We can configure the Maximum age of event setting (up to 6 hours) to control how long Lambda should try to process the messages in case of invocation errors. Function errors can be controlled by setting the number of Retry attempts (between 0 and 2 times).
AWS also recommends setting up an on-failure event destination for failed messages. As of writing this, valid destinations can be an SNS topic, an SQS queue, a Lambda function, an EventBridge event bus, and, as a new feature, an S3 bucket.
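As an illustration, the three asynchronous settings above (maximum event age, retry attempts, and on-failure destination) can be applied together with the boto3 put_function_event_invoke_config call. This is a sketch under assumptions: the helper names, the function name, and the ARN are placeholders, and the 60-second lower bound comes from the Lambda API limits rather than from this post.

```python
def build_async_invoke_config(function_name, on_failure_arn,
                              retry_attempts=2, max_event_age_seconds=3600):
    """Build keyword arguments for put_function_event_invoke_config."""
    if not 0 <= retry_attempts <= 2:
        raise ValueError("retry_attempts must be between 0 and 2")
    # 21600 seconds = 6 hours; 60 seconds is the API's lower bound.
    if not 60 <= max_event_age_seconds <= 21600:
        raise ValueError("max_event_age_seconds must be between 60 and 21600")
    return {
        "FunctionName": function_name,
        "MaximumRetryAttempts": retry_attempts,
        "MaximumEventAgeInSeconds": max_event_age_seconds,
        "DestinationConfig": {"OnFailure": {"Destination": on_failure_arn}},
    }


def apply_async_invoke_config(config):
    """Send the configuration to Lambda (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder above works without AWS access

    return boto3.client("lambda").put_function_event_invoke_config(**config)
```

Keeping the builder separate from the API call makes the values easy to validate and review before anything is sent to AWS.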
2.3. Poll-based invocation
Queues and streams, such as SQS, Kinesis Data Streams, and DynamoDB Streams, are polled by the Lambda service to retrieve messages.
In this invocation model, Lambda fetches messages in batches. By default, Lambda retries the entire batch until it succeeds or the records expire. The error-handling configuration options for queues and streams are different.
Queues
As a first option, we can configure a redrive policy in SQS to send unprocessed messages to a dead-letter queue. The setting is called Maximum receives, and it controls how many times an unprocessed message can be received from the queue before it is sent to the dead-letter queue (between 1 and 1000).
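Outside the console, the same redrive policy can be set as the RedrivePolicy queue attribute, where maxReceiveCount corresponds to Maximum receives. The sketch below assumes boto3 is available for the final call; the helper names, queue URL, and ARN are placeholders.

```python
import json


def build_redrive_attributes(dead_letter_queue_arn, max_receive_count=5):
    """Build the SQS queue attributes that attach a dead-letter queue."""
    if not 1 <= max_receive_count <= 1000:
        raise ValueError("max_receive_count must be between 1 and 1000")
    return {
        # SQS expects the policy as a JSON-encoded string.
        "RedrivePolicy": json.dumps({
            "deadLetterTargetArn": dead_letter_queue_arn,
            "maxReceiveCount": max_receive_count,
        })
    }


def apply_redrive_policy(queue_url, attributes):
    """Apply the attributes to the source queue (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder above works without AWS access

    return boto3.client("sqs").set_queue_attributes(
        QueueUrl=queue_url, Attributes=attributes
    )
```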
We can also enable partial batch success in Lambda by turning on Report batch item failures, so that only the failed records are retried. Otherwise, the Lambda function will consume all messages in the batch again, including those it already processed successfully. Reprocessing them adds unnecessary complexity to our Lambda code and is also error-prone.
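With Report batch item failures turned on, the handler tells Lambda which messages failed by returning their message IDs in a batchItemFailures list. The following is a minimal sketch; process_message is a hypothetical stand-in for the real business logic, and the "fail" flag exists only to simulate an error.

```python
import json


def process_message(body):
    """Hypothetical business logic; raises for messages it cannot handle."""
    if body.get("fail"):
        raise ValueError("cannot process message")


def handler(event, context):
    """SQS-triggered handler that reports only the failed messages."""
    failures = []
    for record in event["Records"]:
        try:
            process_message(json.loads(record["body"]))
        except Exception:
            # Only this message returns to the queue; the others are deleted.
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```

An empty batchItemFailures list signals that the whole batch succeeded.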
Streams
Lambda offers the Retry attempts option to configure how many times to retry processing a failed batch for streams.
Since failed records block the stream’s shard, AWS recommends enabling the BisectBatchOnFunctionError and ReportBatchItemFailures features to handle unprocessed records more efficiently.
The former is called Split batch on error in the console. It divides the failed batch into two smaller batches to isolate unprocessed records. The process continues until the failing records are isolated. The latter is referred to as checkpointing, which helps Lambda track which records have been successfully processed. This way, Lambda can skip them and focus on the failed ones.
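For a Kinesis stream, checkpointing could be sketched as follows: the handler stops at the first failed record and returns its sequence number, so Lambda retries from that record onwards. The process_record function is a hypothetical placeholder, and the sketch assumes each record carries a base64-encoded JSON payload.

```python
import base64
import json


def process_record(payload):
    """Hypothetical business logic; raises for payloads it cannot handle."""
    if payload.get("fail"):
        raise ValueError("cannot process record")


def handler(event, context):
    """Kinesis-triggered handler that checkpoints at the first failed record."""
    for record in event["Records"]:
        try:
            payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
            process_record(payload)
        except Exception:
            # Records before this one are treated as processed; the retry
            # starts from this sequence number.
            sequence_number = record["kinesis"]["sequenceNumber"]
            return {"batchItemFailures": [{"itemIdentifier": sequence_number}]}
    return {"batchItemFailures": []}
```

Stopping at the first failure keeps records in order within the shard, which is usually the reason for choosing a stream in the first place.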
With these settings, Lambda sends unprocessed messages to an on-failure destination (an SNS topic or an SQS queue) after the maximum number of retry attempts.
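The stream settings described in this section live on the event source mapping. A sketch of applying them with the boto3 update_event_source_mapping call might look like this; the helper names, the mapping UUID, and the ARN are placeholders, and the retry count of 3 is an arbitrary example value.

```python
def build_stream_mapping_config(mapping_uuid, on_failure_arn, retry_attempts=3):
    """Build keyword arguments for update_event_source_mapping."""
    return {
        "UUID": mapping_uuid,
        "MaximumRetryAttempts": retry_attempts,
        # "Split batch on error" in the console.
        "BisectBatchOnFunctionError": True,
        # Enables checkpointing via the handler's batchItemFailures response.
        "FunctionResponseTypes": ["ReportBatchItemFailures"],
        "DestinationConfig": {"OnFailure": {"Destination": on_failure_arn}},
    }


def apply_stream_mapping_config(config):
    """Update the mapping (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder above works without AWS access

    return boto3.client("lambda").update_event_source_mapping(**config)
```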
3. Summary
Proper error handling is crucial for robust and resilient serverless applications. The error handling options and configurations vary depending on the Lambda invocation model: synchronous, asynchronous, or poll-based.
4. References, further reading
HTTP response status codes - A list of status codes and what they mean
Amazon SNS message delivery retries - SNS retry policy
How EventBridge retries delivering events - EventBridge retry behaviour
Handling errors for an SQS event source in Lambda - Read about partial batch responses in SQS
Using dead-letter queues in Amazon SQS - Information about redrive policy configuration
Configuring partial batch response with Kinesis Data Streams and Lambda - More on bisecting and checkpointing for streams