fix: Explore how to achieve telemetry suppression with OTLP #3084

cijothomas wants to merge 2 commits into open-telemetry:main
Conversation
```rust
    static SUPPRESS_GUARD: RefCell<Option<opentelemetry::ContextGuard>> = const { RefCell::new(None) };
}

// #[tokio::main]
```
If the user's application relies on tokio, they can create a runtime themselves and wrap their main logic inside `rt.block_on(async { /* app code */ })`, ensuring telemetry initialization happens outside of it.
Pull Request Overview
This PR demonstrates how to suppress telemetry-induced-telemetry loops in OTLP exporters by creating a dedicated Tokio runtime with telemetry suppression enabled. The approach prevents the infinite telemetry generation that occurs when OpenTelemetry exporters using tonic/hyper generate their own telemetry data, which then gets exported again in a loop.
- Creates a dedicated Tokio runtime with thread start/stop hooks that enable telemetry suppression
- Removes manual log filtering that was previously suppressing hyper/tonic logs globally
- Moves all OpenTelemetry initialization to use the dedicated runtime
[nitpick] Remove the commented-out `// #[tokio::main]` attribute as it's no longer needed and adds unnecessary clutter to the code.
```rust
.worker_threads(1) // Don't think this matters as no matter how many threads
// are created, we intercept the thread start to set suppress guard.
```

[nitpick] The comment spans multiple lines but uses single-line comment syntax. Consider using proper multi-line comment format, or clarify the reasoning more concisely in a single line:

```rust
.worker_threads(1) /* Don't think this matters as no matter how many threads
   are created, we intercept the thread start to set suppress guard. */
```
```rust
    })
    .build()
    .expect("Failed to create tokio runtime");
let logger_provider = rt.block_on(async { init_logs() });
```

The `init_logs()` function is not async but is being wrapped in an async block unnecessarily. Consider calling it directly:

```rust
let logger_provider = init_logs();
```
```diff
  // allow internal-logs from Tracing/Metrics initializer to be captured.

- let tracer_provider = init_traces();
+ let tracer_provider = rt.block_on(async { init_traces() });
```

The `init_traces()` function is not async but is being wrapped in an async block unnecessarily. Consider calling it directly:

```rust
let tracer_provider = init_traces();
```
```diff
  global::set_tracer_provider(tracer_provider.clone());

- let meter_provider = init_metrics();
+ let meter_provider = rt.block_on(async { init_metrics() });
```

The `init_metrics()` function is not async but is being wrapped in an async block unnecessarily. Consider calling it directly:

```rust
let meter_provider = init_metrics();
```
**Codecov Report**

✅ All modified and coverable lines are covered by tests.

```
@@        Coverage Diff        @@
##          main   #3084  +/-  ##
=================================
  Coverage   80.1%   80.1%
=================================
  Files        126     126
  Lines      21957   21957
=================================
  Hits       17603   17603
  Misses      4354    4354
```

View full report in Codecov by Sentry.
Although I can see that this works, it feels like a big leak of impl details into the user's domain. What would it look like as a helper in OTel itself? I expect this would still require the user to not use `#[tokio::main]` directly.

My other question is: do we impact any of our users by requiring a separate tokio runtime in this case, for instance, folks in resource-constrained environments?
It feels like our first suggestion to users should be regular log filtering of hyper/tonic; then this is only necessary for a subset of users.
Good point. This is already the case even without this PR! See https://github.com/open-telemetry/opentelemetry-rust/blob/main/opentelemetry-otlp/src/lib.rs#L112-L113

Exposing a helper/feature in the OTLP Exporter bloats the public API, and it'll be less flexible than users giving a runtime to us (they can do other things inside thread_start/stop apart from just the suppression, etc.). At some point in the future, we could work with tokio-tracing maintainers and see if we can agree on a mutual suppression mechanism.
Quite a valid point! It is not mandatory to use a separate tokio runtime - it is only required if users are not okay with filtering the logs from hyper/tonic etc. globally, and want to suppress them only when they originate from the OTLP export context. If a user needs that capability, asking them to create another runtime will strain resources, but not by much - it's just one thread, sitting idle 99% of the time. We already have such concerns with our BatchProcessor/PeriodicReader - by default they each create a separate thread instead of plugging into the user's existing runtime, though users can avoid it by opting into currently experimental features.
Can we ask the tokio::main macro to provide on_start/on_end callbacks, similar to the way it offers on_panic?
todo: clean up the HTTP one and confirm it does not need this technique by default.
todo: see if the client authors offer a way to opt out of telemetry.
```rust
// #[tokio::main]
fn main() -> Result<(), Box<dyn Error + Send + Sync + 'static>> {
    let rt = tokio::runtime::Builder::new_multi_thread()
```

Check if this can be wrapped inside `std::thread`.
I reckon that if we reasonably expect to agree on a suppression mechanism in the future, it makes sense not to extend the public API for now - although I have no concept of how big this effort would be!
Good point - regular filtering is the "default" and this is an opt-in thing for folks who want to selectively keep some http client logging.
A more universally agreed concept of suppression would certainly help here.
For what it's worth, I have been able to use this approach downstream.
While reviewing the telemetry suppression approach here, I noticed a related issue: if an exporter (or its underlying transport) emits logs during export, and those logs flow back through a processor, they get exported again. This is likely a straightforward fix -- adding a suppression check on that path.
I tried what was suggested in this PR; here are my notes.

**Why this won't work**

This won't work if your application installs/uses its own futures runtime, because the standard processor ultimately spawns its export work on whatever runtime is currently active, which is exactly what you don't want, because the suppression guard is not set there. You can address that, but you still need to provide your own runtime implementation (e.g., a thin wrapper around a Tokio runtime).

**The deeper issue**

The bigger issue I ran into is that parts of the export path do not consult the current context at all, which defeats the global suppression guard because it gets ignored there.

**Workaround I ended up with**

I ended up creating a dedicated "telemetry" Tokio runtime and adding a filter to the OTel layer that drops spans/events originating from telemetry-runtime threads. The idea is: just set a thread-local that tells you if you are in a telemetry export. Below is what I have.

```rust
use std::cell::Cell;
use std::fmt::Debug;
use std::future::Future;
use std::ops::Deref;
use std::sync::LazyLock;

thread_local! {
    static IS_TELEMETRY_THREAD: Cell<bool> = const { Cell::new(false) };
}

static TELEMETRY_RUNTIME: LazyLock<tokio::runtime::Runtime> = LazyLock::new(|| {
    tokio::runtime::Builder::new_multi_thread()
        .worker_threads(1)
        .thread_name("telemetry-runtime")
        .enable_all()
        .on_thread_start(|| IS_TELEMETRY_THREAD.set(true))
        .on_thread_stop(|| IS_TELEMETRY_THREAD.set(false))
        .build()
        .expect("Failed to create tokio runtime")
});

#[derive(Debug, Clone)]
pub struct TelemetryRuntime;

impl opentelemetry_sdk::runtime::Runtime for TelemetryRuntime {
    fn spawn<F>(&self, future: F)
    where
        F: Future<Output = ()> + Send + 'static,
    {
        let _ = TELEMETRY_RUNTIME.spawn(future);
    }

    fn delay(&self, duration: std::time::Duration) -> impl Future<Output = ()> + Send + 'static {
        let _guard = TELEMETRY_RUNTIME.enter();
        tokio::time::sleep(duration)
    }
}

impl opentelemetry_sdk::runtime::RuntimeChannel for TelemetryRuntime {
    type Receiver<T: Debug + Send> = tokio_stream::wrappers::ReceiverStream<T>;
    type Sender<T: Debug + Send> = tokio::sync::mpsc::Sender<T>;

    fn batch_message_channel<T: std::fmt::Debug + Send>(
        &self,
        capacity: usize,
    ) -> (Self::Sender<T>, Self::Receiver<T>) {
        let _guard = TELEMETRY_RUNTIME.enter();
        let (sender, receiver) = tokio::sync::mpsc::channel(capacity);
        (
            sender,
            tokio_stream::wrappers::ReceiverStream::new(receiver),
        )
    }
}

impl<S> tracing_subscriber::layer::Filter<S> for TelemetryRuntime {
    fn enabled(
        &self,
        _: &tracing::Metadata<'_>,
        _: &tracing_subscriber::layer::Context<'_, S>,
    ) -> bool {
        !IS_TELEMETRY_THREAD.get()
    }

    fn event_enabled(
        &self,
        _: &tracing::Event<'_>,
        _: &tracing_subscriber::layer::Context<'_, S>,
    ) -> bool {
        !IS_TELEMETRY_THREAD.get()
    }
}

impl Deref for TelemetryRuntime {
    type Target = tokio::runtime::Runtime;

    fn deref(&self) -> &Self::Target {
        &*TELEMETRY_RUNTIME
    }
}
```

**Example: wiring the telemetry runtime into the SDK + applying the filter**

```rust
let _guard = TelemetryRuntime.enter();

let processor = BatchLogProcessor::builder(exporter, TelemetryRuntime).build();
let provider = SdkLoggerProvider::builder()
    .with_log_processor(processor)
    .build();

OpenTelemetryTracingBridge::new(&provider)
    .with_filter(TelemetryRuntime);
```
One way of addressing #2877
This PR does not introduce a “fix” inside the OTLP Exporters themselves, but instead demonstrates how users can address the issue without requiring changes in OpenTelemetry.
Background
OpenTelemetry provides a mechanism to suppress telemetry based on the current Context. However, this suppression only works if every component involved properly propagates OpenTelemetry’s Context. Libraries like tonic and hyper are not aware of OTel’s Context and therefore do not propagate it across threads.
As a result, OTel’s suppression can fail, leading to telemetry-induced-telemetry—where the act of exporting telemetry (e.g., sending data via tonic/hyper) itself generates additional telemetry. This newly generated telemetry is then exported again, triggering yet more telemetry in a loop, potentially overwhelming the system.
What this PR does
OTLP/gRPC exporters rely on the tonic client, which captures the current runtime at creation time and uses it to drive futures. Instead of reusing the application’s existing runtime, this PR creates a dedicated Tokio runtime exclusively for the OTLP Exporter.
In this dedicated runtime:
1. We intercept the on_start / on_stop thread events.
2. We set OTel's suppression flag in the context.
This ensures that telemetry generated by libraries such as hyper/tonic will be suppressed only within the exporter’s dedicated runtime. If those same libraries are used elsewhere for application logic, they continue to function normally and emit telemetry as expected.
Depending on the feedback, we could either address this purely through documentation and examples, or we could enhance the OTLP Exporter itself to expose a feature flag that, when enabled, would automatically create the tonic client within its own dedicated runtime.