
fix: Explore how to achieve telemetry suppression with OTLP #3084

Open
cijothomas wants to merge 2 commits into open-telemetry:main from cijothomas:cijothomas/tryfix2877

Conversation

@cijothomas
Member

@cijothomas cijothomas commented Jul 25, 2025

One way of addressing #2877

This PR does not introduce a “fix” inside the OTLP Exporters themselves, but instead demonstrates how users can address the issue without requiring changes in OpenTelemetry.

Background

OpenTelemetry provides a mechanism to suppress telemetry based on the current Context. However, this suppression only works if every component involved properly propagates OpenTelemetry’s Context. Libraries like tonic and hyper are not aware of OTel’s Context and therefore do not propagate it across threads.

As a result, OTel’s suppression can fail, leading to telemetry-induced-telemetry—where the act of exporting telemetry (e.g., sending data via tonic/hyper) itself generates additional telemetry. This newly generated telemetry is then exported again, triggering yet more telemetry in a loop, potentially overwhelming the system.

What this PR does

OTLP/gRPC exporters rely on the tonic client, which captures the current runtime at creation time and uses it to drive futures. Instead of reusing the application’s existing runtime, this PR creates a dedicated Tokio runtime exclusively for the OTLP Exporter.

In this dedicated runtime:
1. We intercept the on_thread_start / on_thread_stop events.
2. We set OTel's suppression flag in the context.

This ensures that telemetry generated by libraries such as hyper/tonic will be suppressed only within the exporter’s dedicated runtime. If those same libraries are used elsewhere for application logic, they continue to function normally and emit telemetry as expected.

Depending on the feedback, we could either address this purely through documentation and examples, or we could enhance the OTLP Exporter itself to expose a feature flag that, when enabled, would automatically create the tonic client within its own dedicated runtime.

@cijothomas cijothomas requested a review from a team as a code owner July 25, 2025 16:37
@cijothomas cijothomas changed the title from "Fix: Explore how to achieve telemetry suppression with OTLP" to "fix: Explore how to achieve telemetry suppression with OTLP" on Jul 25, 2025
static SUPPRESS_GUARD: RefCell<Option<opentelemetry::ContextGuard>> = const { RefCell::new(None) };
}

// #[tokio::main]
Member Author


If the user's application relies on tokio, they can create a runtime themselves and wrap their main logic inside rt.block_on({ ...app code... }), ensuring that telemetry initialization happens outside of it.

@cijothomas cijothomas requested a review from Copilot July 25, 2025 16:41
Contributor

Copilot AI left a comment


Pull Request Overview

This PR demonstrates how to suppress telemetry-induced-telemetry loops in OTLP exporters by creating a dedicated Tokio runtime with telemetry suppression enabled. The approach prevents the infinite telemetry generation that occurs when OpenTelemetry exporters using tonic/hyper generate their own telemetry data, which then gets exported again in a loop.

  • Creates a dedicated Tokio runtime with thread start/stop hooks that enable telemetry suppression
  • Removes manual log filtering that was previously suppressing hyper/tonic logs globally
  • Moves all OpenTelemetry initialization to use the dedicated runtime

static SUPPRESS_GUARD: RefCell<Option<opentelemetry::ContextGuard>> = const { RefCell::new(None) };
}

// #[tokio::main]

Copilot AI Jul 25, 2025


[nitpick] Remove the commented out #[tokio::main] attribute as it's no longer needed and adds unnecessary clutter to the code.

Suggested change:
- // #[tokio::main]

Comment on lines +70 to +71
.worker_threads(1) // Don't think this matters as no matter how many threads
// are created, we intercept the thread start to set suppress guard.

Copilot AI Jul 25, 2025


[nitpick] The comment spans multiple lines but uses single-line comment syntax. Consider using proper multi-line comment format or clarify the reasoning more concisely in a single line.

Suggested change:
- .worker_threads(1) // Don't think this matters as no matter how many threads
- // are created, we intercept the thread start to set suppress guard.
+ .worker_threads(1) /* Don't think this matters as no matter how many threads
+ are created, we intercept the thread start to set suppress guard. */

})
.build()
.expect("Failed to create tokio runtime");
let logger_provider = rt.block_on(async { init_logs() });

Copilot AI Jul 25, 2025


The init_logs() function is not async but is being wrapped in an async block unnecessarily. Consider calling it directly: let logger_provider = init_logs();

Suggested change:
- let logger_provider = rt.block_on(async { init_logs() });
+ let logger_provider = init_logs();

// allow internal-logs from Tracing/Metrics initializer to be captured.

let tracer_provider = init_traces();
let tracer_provider = rt.block_on(async { init_traces() });

Copilot AI Jul 25, 2025


The init_traces() function is not async but is being wrapped in an async block unnecessarily. Consider calling it directly: let tracer_provider = init_traces();

Suggested change:
- let tracer_provider = rt.block_on(async { init_traces() });
+ let tracer_provider = init_traces();

global::set_tracer_provider(tracer_provider.clone());

let meter_provider = init_metrics();
let meter_provider = rt.block_on(async { init_metrics() });

Copilot AI Jul 25, 2025


The init_metrics() function is not async but is being wrapped in an async block unnecessarily. Consider calling it directly: let meter_provider = init_metrics();

Suggested change:
- let meter_provider = rt.block_on(async { init_metrics() });
+ let meter_provider = init_metrics();

@codecov

codecov bot commented Jul 25, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 80.1%. Comparing base (e9ca158) to head (54e05d5).

Additional details and impacted files
@@          Coverage Diff          @@
##            main   #3084   +/-   ##
=====================================
  Coverage   80.1%   80.1%           
=====================================
  Files        126     126           
  Lines      21957   21957           
=====================================
  Hits       17603   17603           
  Misses      4354    4354           


@scottgerring
Member

Although I can see that this works, it feels like a big leak of impl details into the user's domain. What would it look like as a helper in OTel itself? E.g. something like withTelemetrySuppression(_ => { /* setup otel here */ } ) ?

I expect this would still require the user to not use a tokio_main but rather explicitly create their runtime after setting up OTel using this helper to wrap, but this would still go a ways to make it feel a bit less leaky.

My other question is - do we impact any of our users by requiring a separate tokio runtime in this case, for instance, folks in resource-constrained environments?

@scottgerring
Member

It feels like our first suggestion to users should be:

If you're not using tonic/hyper as a HTTP client in your app, simply tune your tracing subscriber to suppress their telemetry

Then this is only necessary for a subset of users.

@cijothomas
Member Author

Although I can see that this works, it feels like a big leak of impl details into the user's domain. What would it look like as a helper in OTel itself? E.g. something like withTelemetrySuppression(_ => { /* setup otel here */ } ) ?

I expect this would still require the user to not use a tokio_main but rather explicitly create their runtime after setting up OTel using this helper to wrap, but this would still go a ways to make it feel a bit less leaky.

My other question is - do we impact any of our users by requiring a separate tokio runtime in this case, for instance, folks in resource-constrained environments?

Good point. This is already the case even without this PR! See https://github.com/open-telemetry/opentelemetry-rust/blob/main/opentelemetry-otlp/src/lib.rs#L112-L113
The OTLP/gRPC Exporter already requires a tokio runtime - either it captures the current one if the user has tokio::main, or we explicitly ask users to create a runtime and do the OTLP instantiation inside it.

Exposing a helper/feature in OTLP Exporter bloats public API, and it'll be less flexible than users giving a runtime to us. (they can do other things inside thread_start/stop apart from just the suppression etc.)

At some point in the future, we could work with tokio-tracing maintainers and see if we can agree on a mutual Context field for suppression, but this requires a lot of research and co-ordination. The approach shown in this PR is just a way for users to unblock themselves right now, without OTel/OTLP doing anything extra.

@cijothomas
Member Author

My other question is - do we impact any of our users by requiring a separate tokio runtime in this case, for instance, folks in resource-constrained environments?

Quite a valid point! A separate tokio runtime is not mandatory - it is only required if users are not okay with filtering the logs from hyper/tonic etc. globally, and want to suppress them only when they originate from the OTLP export context. If a user does need that capability, asking them to create another runtime will strain resources, but not by much - it is just one thread, sitting idle 99% of the time. We already have similar concerns with our BatchProcessor/PeriodicReader - by default they each create a separate thread instead of plugging into the user's existing runtime, though users can avoid that by opting into currently experimental features.

static SUPPRESS_GUARD: RefCell<Option<opentelemetry::ContextGuard>> = const { RefCell::new(None) };
}

// #[tokio::main]
Member Author


Can we ask the tokio::main macro to provide on_start/on_end callbacks, similar to the way it offers on_panic?

static SUPPRESS_GUARD: RefCell<Option<opentelemetry::ContextGuard>> = const { RefCell::new(None) };
}

// #[tokio::main]
Member Author


todo: clean up the HTTP one and confirm it does not need this technique by default.
todo: see if the client authors offer a way to opt out of telemetry.


// #[tokio::main]
fn main() -> Result<(), Box<dyn Error + Send + Sync + 'static>> {
let rt = tokio::runtime::Builder::new_multi_thread()
Member Author


check if this can be wrapped inside std::thread

@scottgerring
Member

Exposing a helper/feature in OTLP Exporter bloats public API, and it'll be less flexible than users giving a runtime to us. (they can do other things inside thread_start/stop apart from just the suppression etc.)
...
At some point in the future, we could work with tokio-tracing maintainers and see if we can agree on a mutual Context field for suppression, but this requires a lot of research and co-ordination. The approach shown in this PR is just a way for users to unblock themselves right now, without OTel/OTLP doing anything extra.

I reckon if we reasonably expect to be able to agree on a suppression mechanism in the future it makes sense to not extend the public API for now, although I have no concept of how big this effort would be!

It is not mandatory to use separate tokio runtime - it is only required if users are not okay with the filtering the logs from hyper/tonic etc, and want to do it only when originating from otlp export context

Good point - regular filtering is the "default" and this is an opt-in thing for folks who want to selectively keep some http client logging.

@cijothomas
Member Author

I reckon if we reasonably expect to be able to agree on a suppression mechanism in the future it makes sense to not extend the public API for now, although I have no concept of how big this effort would be!

A more universally agreed concept of Context would be nice, but it'll require a lot of work to drive something like that.
Another alternative is for the clients we use (tonic/hyper etc.) to expose a way to opt out of their usual logging, with OTLP then opting out that way. That'd also require some effort to drive across the clients we use!

@davidhewitt
Contributor

davidhewitt commented Aug 20, 2025

For what it's worth, I have been able to use this approach downstream in logfire to successfully suppress all export telemetry.

  • To avoid reqwest spawning a background thread outside of my control, I had to switch to use the reqwest-client (async client) in opentelemetry-otlp.
  • Due to that client needing a tokio runtime, I decided to just spawn a background tokio runtime inside the logfire SDK for the exporters. Using the approach here I suppress all telemetry on that runtime's threads.
  • ... and similarly I needed to use the experimental async batch exporters, because the async reqwest client doesn't work in the background thread of the sync BatchSpanExporter (etc) because those threads don't have a tokio context.

pydantic/logfire-rust#95

@bryantbiggs
Contributor

While reviewing the telemetry suppression approach here, I noticed that SimpleConcurrentLogProcessor::emit() in opentelemetry-sdk/src/logs/concurrent_log_processor.rs does not enter a telemetry-suppressed scope before calling the exporter, unlike every other processor in the SDK:

  • SimpleLogProcessor::emit() has let _suppress_guard = Context::enter_telemetry_suppressed_scope();
  • BatchLogProcessor suppresses in both its export and batch-processing paths
  • SimpleConcurrentLogProcessor::emit() calls futures_executor::block_on(self.exporter.export(...)) with no suppression at all

If an exporter (or its underlying transport) emits logs during export, and those logs flow back through a SimpleConcurrentLogProcessor, there is nothing to break the recursion.

This is likely a straightforward fix -- adding let _suppress_guard = Context::enter_telemetry_suppressed_scope(); at the top of SimpleConcurrentLogProcessor::emit(), consistent with how SimpleLogProcessor does it. Might be worth folding into this PR or a follow-up.

@michaelvanstraten

michaelvanstraten commented Feb 21, 2026

I tried what was suggested in this PR; here are my notes.

Why this won't work

This won’t work if your application installs/uses its own futures runtime, because the standard processor ultimately calls futures_executor::block_on

.and_then(|exporter| futures_executor::block_on(exporter.export(vec![span])));

That ends up using whatever runtime is currently active, which is exactly what you don’t want, because the suppression guard is not set there.

You can address the block_on issue by switching to the experimental processors that accept an async runtime, like the ones defined here:

pub struct BatchSpanProcessor<R: RuntimeChannel> {

However, you still need to provide your own runtime implementation (e.g., a thin wrapper around a Tokio runtime).

The deeper issue

The bigger issue I ran into is that h2 creates a parentless span when it starts a new connection here

https://github.com/hyperium/h2/blob/5634dddea8ff9ed4e8df327a64765738f3e997d8/src/proto/connection.rs#L129

That defeats the global suppression guard because it gets ignored by tracing-opentelemetry here:

https://github.com/tokio-rs/tracing-opentelemetry/blob/91c4faa0082a3bfe2b3e9e59ed1db295830f55ec/src/layer.rs#L949

when is_contextual returns false.

Workaround I ended up with

I ended up creating a dedicated "telemetry" Tokio runtime and adding a filter to the OTel layer that drops spans/events originating from the telemetry runtime's threads. The idea is: set a thread-local that tells you whether you are in a telemetry export.

Below is what I have.

use std::cell::Cell;
use std::fmt::Debug;
use std::ops::Deref;
use std::sync::LazyLock;

thread_local! {
    static IS_TELEMETRY_THREAD: Cell<bool> = const { Cell::new(false) };
}

static TELEMETRY_RUNTIME: LazyLock<tokio::runtime::Runtime> = LazyLock::new(|| {
    tokio::runtime::Builder::new_multi_thread()
        .worker_threads(1)
        .thread_name("telemetry-runtime")
        .enable_all()
        .on_thread_start(|| IS_TELEMETRY_THREAD.set(true))
        .on_thread_stop(|| IS_TELEMETRY_THREAD.set(false))
        .build()
        .expect("Failed to create tokio runtime")
});

#[derive(Debug, Clone)]
pub struct TelemetryRuntime;

impl opentelemetry_sdk::runtime::Runtime for TelemetryRuntime {
    fn spawn<F>(&self, future: F)
    where
        F: Future<Output = ()> + Send + 'static,
    {
        let _ = TELEMETRY_RUNTIME.spawn(future);
    }

    fn delay(&self, duration: std::time::Duration) -> impl Future<Output = ()> + Send + 'static {
        let _guard = TELEMETRY_RUNTIME.enter();
        tokio::time::sleep(duration)
    }
}

impl opentelemetry_sdk::runtime::RuntimeChannel for TelemetryRuntime {
    type Receiver<T: Debug + Send> = tokio_stream::wrappers::ReceiverStream<T>;
    type Sender<T: Debug + Send> = tokio::sync::mpsc::Sender<T>;

    fn batch_message_channel<T: std::fmt::Debug + Send>(
        &self,
        capacity: usize,
    ) -> (Self::Sender<T>, Self::Receiver<T>) {
        let _guard = TELEMETRY_RUNTIME.enter();
        let (sender, receiver) = tokio::sync::mpsc::channel(capacity);
        (
            sender,
            tokio_stream::wrappers::ReceiverStream::new(receiver),
        )
    }
}

impl<S> tracing_subscriber::layer::Filter<S> for TelemetryRuntime {
    fn enabled(
        &self,
        _: &tracing::Metadata<'_>,
        _: &tracing_subscriber::layer::Context<'_, S>,
    ) -> bool {
        !IS_TELEMETRY_THREAD.get()
    }

    fn event_enabled(
        &self,
        _: &tracing::Event<'_>,
        _: &tracing_subscriber::layer::Context<'_, S>,
    ) -> bool {
        !IS_TELEMETRY_THREAD.get()
    }
}

impl Deref for TelemetryRuntime {
    type Target = tokio::runtime::Runtime;

    fn deref(&self) -> &Self::Target {
        &*TELEMETRY_RUNTIME
    }
}

Example: wiring the telemetry runtime into the SDK + applying the filter

let _guard = TelemetryRuntime.enter();

let processor = BatchLogProcessor::builder(exporter, TelemetryRuntime).build();

let provider = SdkLoggerProvider::builder()
    .with_log_processor(processor)
    .build();

OpenTelemetryTracingBridge::new(&provider)
    .with_filter(TelemetryRuntime);
