From 140092d8e820837ecabb53741293d789a37e70f6 Mon Sep 17 00:00:00 2001
From: Trent Mick
Date: Wed, 9 Mar 2022 14:18:25 -0800
Subject: [PATCH] Spec that agents in Lambda should *not* do back-off

---
 .../tracing-instrumentation-aws-lambda.md | 26 ++++++++++++++++---
 specs/agents/transport.md | 2 +-
 2 files changed, 23 insertions(+), 5 deletions(-)

diff --git a/specs/agents/tracing-instrumentation-aws-lambda.md b/specs/agents/tracing-instrumentation-aws-lambda.md
index 952ad3cc..1e058cfb 100644
--- a/specs/agents/tracing-instrumentation-aws-lambda.md
+++ b/specs/agents/tracing-instrumentation-aws-lambda.md
@@ -198,10 +198,28 @@ Field | Value | Description | Source
 `context.cloud.origin.region` | e.g. `us-east-1` | S3 bucket region. | `record.awsRegion`
 `context.cloud.origin.provider` | `aws` | Use `aws` as fix value. | -
-## Data Flushing
-Lambda functions are immediately frozen as soon as the handler method ends. In case APM data is sent in an asyncronous way (as most of the agents do by default) data can get lost if not sent before the lambda function ends.
+## Transport
-Therefore, the Lambda instrumentation has to ensure that data is flushed in a blocking way before the execution of the handler function ends.
+Typically, Lambda functions using an APM agent will include the [APM Lambda
+Extension](https://github.com/elastic/apm-aws-lambda/tree/main/apm-lambda-extension)
+to which the APM agent sends data locally. There are some changes to the APM
+agents' [transport behavior](./transport.md) to APM Server in this environment.
-Some Lambda functions will use the custom-built Lambda extension that allows the agent to send its data locally. The extension asynchronously forwards the data it receives from the agent to the APM server so the Lambda function can return its result with minimal delay. In order for the extension to know when it can flush its data, it must receive a signal indicating that the lambda function has completed.
There are two possible signals: one is via a subscription to the AWS Lambda Logs API and the other is an agent intake request with the query param `flushed=true`. A signal from the agent is preferrable because there is an inherent delay with the sending of the Logs API signal.
+
+### Data Flushing
+
+Lambda function VMs are frozen as soon as the handler method ends and any extensions signal completion. In case APM data is sent in an asynchronous way (as most of the agents do by default) data can get lost if not sent before the lambda function ends. Therefore, the Lambda instrumentation has to ensure that data is flushed in a blocking way before the execution of the handler function ends.
+
+The extension asynchronously forwards the data it receives from the agent to the APM server so the Lambda function can return its result with minimal delay. In order for the extension to know when it can flush its data, it must receive a signal indicating that the lambda function has completed. There are two possible signals: one is via a subscription to the AWS Lambda Logs API and the other is an agent intake request with the query param `flushed=true`. A signal from the agent is preferrable because there is an inherent delay with the sending of the Logs API signal. Therefore, the agent must send its final intake request at the end of the function invocation with the query param `flushed=true`. In case there is no more data to send at the end of the function invocation, the agent must send an empty intake request with this query param.
+
+### Transport errors
+
+APM agents in a Lambda VM, sending to the local extension SHOULD NOT implement
+the back-off / grace period after failed intake requests that is described in
+[the transport spec](./transport.md#transport-errors). It is the responsibility
+of the extension to handle back-off and buffering, if at all.
Because the
+extension *asynchronously* passes APM data on to APM server, it does not return
+APM server responses to the agent; therefore the agent cannot meaningfully
+handle backpressure.
+
diff --git a/specs/agents/transport.md b/specs/agents/transport.md
index 3cd3d7cf..a3f8db96 100644
--- a/specs/agents/transport.md
+++ b/specs/agents/transport.md
@@ -49,7 +49,7 @@ When a request fails, the agent has no way of knowing exactly what data was succ
 The agent should therefore drop the entire compressed buffer: both the internal zlib buffer, and potentially the already compressed data if such data is also buffered. Data subsequently written to the compression library can be directed to a new HTTP request.
-The new HTTP request should not necessarily be started immediately after the previous HTTP request fails, as the reason for the failure might not have been resolved up-stream. Instead an incremental back-off algorithm SHOULD be used to delay new requests. The grace period should be calculated in seconds using the algorithm `min(reconnectCount++, 6) ** 2 ± 10%`, where `reconnectCount` starts at zero. So the delay after the first error is 0 seconds, then circa 1, 4, 9, 16, 25 and finally 36 seconds. We add ±10% jitter to the calculated grace period in case multiple agents entered the grace period simultaneously. This way they will not all try to reconnect at the same time.
+The new HTTP request should not necessarily be started immediately after the previous HTTP request fails, as the reason for the failure might not have been resolved up-stream. Instead an incremental back-off algorithm SHOULD be used to delay new requests. The grace period should be calculated in seconds using the algorithm `min(reconnectCount++, 6) ** 2 ± 10%`, where `reconnectCount` starts at zero. So the delay after the first error is 0 seconds, then circa 1, 4, 9, 16, 25 and finally 36 seconds.
We add ±10% jitter to the calculated grace period in case multiple agents entered the grace period simultaneously. This way they will not all try to reconnect at the same time. (APM agents in an AWS Lambda function that are sending to the APM Lambda Extension SHOULD NOT implement back-off. See [the Lambda instrumentation section on transport errors](tracing-instrumentation-aws-lambda.md#transport-errors).)
 Agents should support specifying multiple server URLs. When a transport error occurs, the agent should switch to another server URL at the same time as backing off.
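
Note for reviewers: the grace-period formula `min(reconnectCount++, 6) ** 2 ± 10%` and the Lambda exception added by this patch can be sketched roughly as follows. This is only an illustration, not code from any agent. The function names, the `sending_to_lambda_extension` flag, and the choice of a uniform distribution for the jitter are assumptions made for the example; the spec text only states the base formula and the ±10% range.

```python
import random


def base_grace_period(reconnect_count):
    """Grace period in seconds before jitter: min(reconnectCount, 6) ** 2."""
    return min(reconnect_count, 6) ** 2


def next_grace_period(reconnect_count, sending_to_lambda_extension=False):
    """Return (delay_seconds, updated_reconnect_count) after a failed request.

    `sending_to_lambda_extension` is a hypothetical flag for this sketch:
    when set, no back-off is applied, mirroring the SHOULD NOT in the patch.
    """
    if sending_to_lambda_extension:
        # No grace period; the extension is responsible for any back-off.
        return 0.0, reconnect_count
    base = base_grace_period(reconnect_count)
    # Assumption: jitter is drawn uniformly from [-10%, +10%] of the base.
    jitter = random.uniform(-0.1, 0.1)
    return base * (1 + jitter), reconnect_count + 1


if __name__ == "__main__":
    count = 0
    for _ in range(8):
        delay, count = next_grace_period(count)
        print(f"wait {delay:.2f}s before the next intake request")
```

Without jitter, the base delays for the first eight consecutive failures come out as 0, 1, 4, 9, 16, 25, 36, 36 seconds, matching the sequence listed in the transport spec paragraph.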