APM Server TBS FAQ (#2330)

carsonip · web-flow · commit 8e50e5ddc7d1 · 2025-08-04T13:07:07.000Z
Add a FAQ section for APM Server Tail based sampling
diff --git a/solutions/observability/apm/tail-based-sampling.md b/solutions/observability/apm/tail-based-sampling.md
@@ -189,3 +189,39 @@ This metric can also be used to get an estimate of the storage requirements for
 ### `apm-server.sampling.tail.storage.value_log_size` [sampling-tail-monitoring-storage-value-log-size-ref]
 
 This metric tracks the storage size for value log files used by the previous implementation of a tail-based sampler. This metric was deprecated in 9.0.0 and should always report `0`.
+
+## Frequently Asked Questions (FAQ) [sampling-tail-faq-ref]
+
+:::{dropdown} Why doesn't the sampling rate shown in Storage Explorer match the configured tail sampling rate?
+
+In APM Server, the tail sampling policy applied to a distributed trace is determined by evaluating the configured policies in order against the root transaction (the transaction without a parent). To learn more about how tail sampling policies are applied, see the examples in [Configure Tail-based sampling](/solutions/observability/apm/transaction-sampling.md#apm-configure-tail-based-sampling).
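+
+As a minimal sketch of in-order evaluation, the following `apm-server.yml` fragment for a standalone APM Server binary defines three policies. The service name `checkout` and all sample rates are hypothetical values chosen for illustration:
+
+```yaml
+apm-server.sampling.tail:
+  enabled: true
+  policies:
+    # Evaluated first: keep every trace with a failure outcome.
+    - sample_rate: 1.0
+      trace.outcome: failure
+    # Evaluated second: keep half of the traces matched to the
+    # hypothetical "checkout" service.
+    - sample_rate: 0.5
+      service.name: checkout
+    # Catch-all: keep 10% of the remaining traces.
+    - sample_rate: 0.1
+```
+
+With this configuration, a failed trace whose root transaction belongs to `checkout` matches the first policy and is sampled at 100%, not 50%, because the first matching policy applies.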
+
+In contrast, the APM UI Storage Explorer calculates the effective average sampling rate for each service using a different method. It considers both head-based and tail-based sampling, but does not account for root transactions. As a result, the sampling rate displayed in Storage Explorer may differ from the configured tail sampling rate, which can give the false impression that tail-based sampling is not functioning correctly.
+
+For more information, check the related [Kibana issue](https://github.com/elastic/kibana/issues/226600).
+:::
+
+:::{dropdown} Why do transactions disappear after enabling tail-based sampling?
+
+If a transaction is consistently not sampled after enabling tail-based sampling, verify that your instrumentation is not missing root transactions (transactions without a parent). APM Server makes sampling decisions when a distributed trace ends, which occurs when the root transaction ends. If the root transaction is not received by APM Server, it cannot make a sampling decision and will silently drop all associated trace events.
+
+This issue often arises when you assume that a particular service (for example, service A) always produces the root transaction, but another service (for example, service B) actually precedes it in the trace. If service B is not instrumented or sends data to a different APM Server cluster, the root transaction will be missing. To resolve this, ensure that all relevant services are instrumented and send data to the same APM Server cluster, or adjust the trace continuation strategy accordingly.
+
+To identify traces missing a root transaction, run the following {{esql}} query during a period when tail-based sampling is disabled. Use a short time range to limit the number of results:
+
+```esql
+FROM "traces-apm-*"
+| STATS total_docs = COUNT(*), total_child_docs = COUNT(parent.id) BY trace.id, transaction.id
+| WHERE total_docs == total_child_docs
+| KEEP trace.id, transaction.id
+```
+:::
+
+:::{dropdown} Why is the configured tail sampling rate ignored and every trace sampled, causing unexpected load on Elasticsearch?
+
+When the storage limit for tail-based sampling is reached, APM Server can no longer store new trace events for sampling and logs "configured limit reached" (or "configured storage limit reached" in version 8). By default, trace events that cannot be stored bypass sampling and are always indexed, so the effective sampling rate becomes 100%. This can cause a sudden increase in indexing load and potentially overload Elasticsearch, because it must process all incoming traces instead of only the sampled subset.
+
+To mitigate this risk, enable the [`discard_on_write_failure`](#sampling-tail-discard-on-write-failure-ref) setting. When set to `true`, APM Server discards traces that cannot be written due to storage or indexing failures, rather than indexing them all. This helps protect Elasticsearch from excessive load. Note that enabling this option can result in data loss and broken traces, so it should be used with caution and only when system stability is a priority.
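+
+For a standalone APM Server binary, enabling the setting might look like the following `apm-server.yml` fragment. This is a sketch that assumes tail-based sampling policies are already configured elsewhere in the same file:
+
+```yaml
+apm-server.sampling.tail:
+  enabled: true
+  # Discard traces that cannot be written instead of indexing them
+  # all unsampled. The default is false.
+  discard_on_write_failure: true
+```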
+
+For more information, refer to the [Discard On Write Failure](#sampling-tail-discard-on-write-failure-ref) section.
+:::