feat: KEP for OpenTelemetry integration in Kubeflow SDK #382

Open
XploY04 wants to merge 1 commit into kubeflow:main from XploY04:main

XploY04 commented Mar 12, 2026

What this PR does / why we need it:

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):

Fixes #164

Checklist:

  • Docs included if any changes are user facing

@jaiakash @kramaranya @dhanishaphadate please review

…ow#164)

Signed-off-by: XploY04 <2004agarwalyash@gmail.com>
Copilot AI review requested due to automatic review settings March 12, 2026 16:51
@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign andreyvelich for approval. For more information see the Kubernetes Code Review Process.

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@github-actions

🎉 Welcome to the Kubeflow SDK! 🎉

Thanks for opening your first PR! We're happy to have you as part of our community 🚀

Here's what happens next:

  • If you haven't already, please check out our Contributing Guide for repo-specific guidelines and the Kubeflow Contributor Guide for general community standards
  • Our team will review your PR soon! cc @kubeflow/kubeflow-sdk-team

Feel free to ask questions in the comments if you need any help or clarification!
Thanks again for contributing to Kubeflow! 🙏

Copilot AI left a comment

Pull request overview

Adds a KEP (proposal) describing how to integrate OpenTelemetry (OTel) into the Kubeflow Python SDK to provide tracing/metrics/log correlation for SDK operations (initially focused on TrainerClient, with extensibility to other clients).

Changes:

  • Introduces KEP-164 with proposed architecture, dependency approach, span/attribute conventions, context propagation, and a phased implementation plan.
  • Documents intended user experience (installation + usage) and outlines a test plan using an in-memory exporter/reader.

Comment on lines +14 to +21
The SDK depends on `opentelemetry-api` only, which is no-op by default. Users who want
actual telemetry install `opentelemetry-sdk` and an exporter separately. No SDK code
changes required on their end.

The first implementation covers `TrainerClient` and all three backends (Kubernetes,
Container, LocalProcess). The shared telemetry module in `kubeflow/common/telemetry/` is
built to be reused by `PipelinesClient`, `OptimizerClient`, `ModelRegistryClient`, and
`SparkClient` later.

Copilot AI Mar 12, 2026

The summary says the SDK depends on opentelemetry-api only, but the proposal later adds opentelemetry-semantic-conventions as a core dependency. Please align the wording to reflect both (or justify why semconv would be optional).

Comment on lines +242 to +248
        return job_name
    except Exception as e:
        span.set_status(StatusCode.ERROR, str(e))
        span.record_exception(e)
        raise
```

Copilot AI Mar 12, 2026

The examples use span.set_status(StatusCode.ERROR, str(e)). Older releases of the OpenTelemetry Python API expect Span.set_status to receive a Status object rather than a status code plus description, so this snippet may raise or type-mismatch depending on the installed version. Update the example to construct a Status(StatusCode.ERROR, description=...) (or the equivalent supported by the target OTel version).

Comment on lines +514 to +521
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, InMemorySpanExporter


def setup_test_telemetry() -> InMemorySpanExporter:
    """Set up an in-memory span exporter for test assertions."""
    exporter = InMemorySpanExporter()

Copilot AI Mar 12, 2026

InMemorySpanExporter is typically imported from opentelemetry.sdk.trace.export.in_memory_span_exporter, not from opentelemetry.sdk.trace.export. As written, the sample helper may not import on common OTel versions, so please correct the import path in the KEP to avoid copy/paste failures.

Development

Successfully merging this pull request may close these issues.

Integrate Kubeflow SDK with OpenTelemetry
