feat: KEP for OpenTelemetry integration in Kubeflow SDK#382
feat: KEP for OpenTelemetry integration in Kubeflow SDK#382XploY04 wants to merge 1 commit intokubeflow:mainfrom
Conversation
…ow#164) Signed-off-by: XploY04 <2004agarwalyash@gmail.com>
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
🎉 Welcome to the Kubeflow SDK! 🎉 Thanks for opening your first PR! We're happy to have you as part of our community 🚀 Here's what happens next:
Join the community:
Feel free to ask questions in the comments if you need any help or clarification! |
There was a problem hiding this comment.
Pull request overview
Adds a KEP (proposal) describing how to integrate OpenTelemetry (OTel) into the Kubeflow Python SDK to provide tracing/metrics/log correlation for SDK operations (initially focused on TrainerClient, with extensibility to other clients).
Changes:
- Introduces KEP-164 with proposed architecture, dependency approach, span/attribute conventions, context propagation, and a phased implementation plan.
- Documents intended user experience (installation + usage) and outlines a test plan using an in-memory exporter/reader.
You can also share your feedback on Copilot code review. Take the survey.
| The SDK depends on `opentelemetry-api` only, which is no-op by default. Users who want | ||
| actual telemetry install `opentelemetry-sdk` and an exporter separately. No SDK code | ||
| changes required on their end. | ||
|
|
||
| The first implementation covers `TrainerClient` and all three backends (Kubernetes, | ||
| Container, LocalProcess). The shared telemetry module in `kubeflow/common/telemetry/` is | ||
| built to be reused by `PipelinesClient`, `OptimizerClient`, `ModelRegistryClient`, and | ||
| `SparkClient` later. |
There was a problem hiding this comment.
The summary says the SDK depends on opentelemetry-api only, but later the proposal adds opentelemetry-semantic-conventions as a core dependency; please align the wording to reflect both (or justify why semconv would be optional).
| return job_name | ||
| except Exception as e: | ||
| span.set_status(StatusCode.ERROR, str(e)) | ||
| span.record_exception(e) | ||
| raise | ||
| ``` | ||
|
|
There was a problem hiding this comment.
The examples use span.set_status(StatusCode.ERROR, str(e)), but in the OpenTelemetry Python API Span.set_status expects a Status object (and not a status code + description tuple), so this snippet is likely to raise/type-mismatch when implemented; update the example to construct a Status(StatusCode.ERROR, description=...) (or the equivalent supported by the target OTel version).
| from opentelemetry import trace | ||
| from opentelemetry.sdk.trace import TracerProvider | ||
| from opentelemetry.sdk.trace.export import SimpleSpanProcessor, InMemorySpanExporter | ||
|
|
||
|
|
||
| def setup_test_telemetry() -> InMemorySpanExporter: | ||
| """Set up an in-memory span exporter for test assertions.""" | ||
| exporter = InMemorySpanExporter() |
There was a problem hiding this comment.
InMemorySpanExporter is typically imported from opentelemetry.sdk.trace.export.in_memory_span_exporter, not from opentelemetry.sdk.trace.export; as written, the sample helper may not import on common OTel versions, so please correct the import path in the KEP to avoid copy/paste failures.
What this PR does / why we need it:
Which issue(s) this PR fixes (optional, in
Fixes #<issue number>, #<issue number>, ...format, will close the issue(s) when PR gets merged):Fixes #164
Checklist:
@jaiakash @kramaranya @dhanishaphadate please review