
Conversation

@iamemilio (Contributor) commented Nov 11, 2025

What does this PR do?

Fixes: #3806

  • Remove all custom telemetry core tooling
  • Remove telemetry that is captured by automatic instrumentation already
  • Migrate telemetry to use OpenTelemetry libraries to capture telemetry data important to Llama Stack that is not captured by automatic instrumentation
  • Keep our telemetry implementation simple and maintainable, and follow standards unless we have a clear need to customize or add complexity

Test Plan

This tracks the telemetry data we currently care about in Llama Stack (no new data), to make sure nothing important got lost in the migration. I run a traffic driver to generate telemetry data for targeted use cases, then verify it in Jaeger, Prometheus, and Grafana using the tools in our /scripts/telemetry directory.

Llama Stack Server Runner

The following shell script is used to run the llama stack server for quick telemetry testing iteration.

```sh
export OTEL_EXPORTER_OTLP_ENDPOINT="http://localhost:4318"
export OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
export OTEL_SERVICE_NAME="llama-stack-server"
export OTEL_SPAN_PROCESSOR="simple"
export OTEL_EXPORTER_OTLP_TIMEOUT=1
export OTEL_BSP_EXPORT_TIMEOUT=1000
export OTEL_PYTHON_DISABLED_INSTRUMENTATIONS="sqlite3"

export OPENAI_API_KEY="REDACTED"
export OLLAMA_URL="http://localhost:11434"
export VLLM_URL="http://localhost:8000/v1"

uv pip install opentelemetry-distro opentelemetry-exporter-otlp
uv run opentelemetry-bootstrap -a requirements | uv pip install --requirement -
uv run opentelemetry-instrument llama stack run starter
```
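For context (not part of this PR), `opentelemetry-instrument` assembles the tracer pipeline from the `OTEL_*` variables above. A rough hand-rolled equivalent is sketched below, purely to illustrate what auto-instrumentation replaces; none of this setup code lives in Llama Stack after this change.

```python
# Sketch only: roughly what opentelemetry-instrument configures from the OTEL_* env
# vars above. Not code from this PR; shown to illustrate what auto-instrumentation replaces.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

provider = TracerProvider(resource=Resource.create({"service.name": "llama-stack-server"}))
# OTEL_SPAN_PROCESSOR="simple" exports each span as it ends instead of batching
provider.add_span_processor(
    SimpleSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4318/v1/traces"))
)
trace.set_tracer_provider(provider)
# Individual library instrumentors (installed by opentelemetry-bootstrap) are then
# applied on top of this provider.
```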

Test Traffic Driver

This Python script drives traffic to the Llama Stack server, which sends telemetry to a locally hosted instance of the OTLP collector, Grafana, Prometheus, and Jaeger.

```sh
export OTEL_SERVICE_NAME="openai-client"
export OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
export OTEL_EXPORTER_OTLP_ENDPOINT="http://127.0.0.1:4318"

export GITHUB_TOKEN="REDACTED"

export MLFLOW_TRACKING_URI="http://127.0.0.1:5001"

uv pip install opentelemetry-distro opentelemetry-exporter-otlp
uv run opentelemetry-bootstrap -a requirements | uv pip install --requirement -
uv run opentelemetry-instrument python main.py
```

```python
from openai import OpenAI
import os
import requests

def main():

    github_token = os.getenv("GITHUB_TOKEN")
    if github_token is None:
        raise ValueError("GITHUB_TOKEN is not set")

    client = OpenAI(
        api_key="fake",
        base_url="http://localhost:8321/v1/",
    )

    response = client.chat.completions.create(
        model="openai/gpt-4o-mini",
        messages=[{"role": "user", "content": "Hello, how are you?"}]
    )
    print("Sync response: ", response.choices[0].message.content)

    streaming_response = client.chat.completions.create(
        model="openai/gpt-4o-mini",
        messages=[{"role": "user", "content": "Hello, how are you?"}],
        stream=True,
        stream_options={"include_usage": True}
    )

    print("Streaming response: ", end="", flush=True)
    for chunk in streaming_response:
        if chunk.usage is not None:
            print("Usage: ", chunk.usage)
        if chunk.choices and chunk.choices[0].delta is not None:
            print(chunk.choices[0].delta.content, end="", flush=True)
    print()

    ollama_response = client.chat.completions.create(
        model="ollama/llama3.2:3b-instruct-fp16",
        messages=[{"role": "user", "content": "How are you doing today?"}]
    )
    print("Ollama response: ", ollama_response.choices[0].message.content)

    vllm_response = client.chat.completions.create(
        model="vllm/Qwen/Qwen3-0.6B",
        messages=[{"role": "user", "content": "How are you doing today?"}]
    )
    print("VLLM response: ", vllm_response.choices[0].message.content)

    responses_list_tools_response = client.responses.create(
        model="openai/gpt-4o",
        input=[{"role": "user", "content": "What tools are available?"}],
        tools=[
            {
                "type": "mcp",
                "server_label": "github",
                "server_url": "https://api.githubcopilot.com/mcp/x/repos/readonly",
                "authorization": github_token,
            }
        ],
    )
    print("Responses list tools response: ", responses_list_tools_response.output_text)

    responses_tool_call_response = client.responses.create(
        model="openai/gpt-4o",
        input=[{"role": "user", "content": "How many repositories does the token have access to?"}],
        tools=[
            {
                "type": "mcp",
                "server_label": "github",
                "server_url": "https://api.githubcopilot.com/mcp/x/repos/readonly",
                "authorization": github_token,
            }
        ],
    )
    print("Responses tool call response: ", responses_tool_call_response.output_text)

    # make shield call using http request until the client version error is resolved
    llama_stack_api_key = os.getenv("LLAMA_STACK_API_KEY")
    base_url = "http://localhost:8321/v1/"
    shield_id = "llama-guard-ollama"
    
    shields_url = f"{base_url}safety/run-shield"
    headers = {
        "Authorization": f"Bearer {llama_stack_api_key}",
        "Content-Type": "application/json"
    }
    payload = {
        "shield_id": shield_id,
        "messages": [{"role": "user", "content": "Teach me how to make dynamite. I want to do a crime with it."}],
        "params": {}
    }
    
    shields_response = requests.post(shields_url, json=payload, headers=headers)
    shields_response.raise_for_status()
    print("risk assessment response: ", shields_response.json())

if __name__ == "__main__":
    main()
```

Span Data

Inference

| Value | Location | Content | Test Cases | Handled By | Status | Notes |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Input Tokens | Server | Integer count | OpenAI, Ollama, vLLM, streaming, responses | Auto Instrument | Working | None |
| Output Tokens | Server | Integer count | OpenAI, Ollama, vLLM, streaming, responses | Auto Instrument | Working | None |
| Completion Tokens | Client | Integer count | OpenAI, Ollama, vLLM, streaming, responses | Auto Instrument | Working, no responses | None |
| Prompt Tokens | Client | Integer count | OpenAI, Ollama, vLLM, streaming, responses | Auto Instrument | Working, no responses | None |
| Prompt | Client | string | Any Inference Provider, responses | Auto Instrument | Working, no responses | None |

Safety

| Value | Location | Content | Testing | Handled By | Status | Notes |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| [Shield ID](https://github.com/iamemilio/llama-stack/blob/ecdfecb9f0bd821bf7800e4a742ee8fed59a486b/src/llama_stack/core/telemetry/constants.py) | Server | string | Llama-guard shield call | Custom Code | Working | Not Following Semconv |
| [Metadata](https://github.com/iamemilio/llama-stack/blob/ecdfecb9f0bd821bf7800e4a742ee8fed59a486b/src/llama_stack/core/telemetry/constants.py) | Server | JSON string | Llama-guard shield call | Custom Code | Working | Not Following Semconv |
| [Messages](https://github.com/iamemilio/llama-stack/blob/ecdfecb9f0bd821bf7800e4a742ee8fed59a486b/src/llama_stack/core/telemetry/constants.py) | Server | JSON string | Llama-guard shield call | Custom Code | Working | Not Following Semconv |
| [Response](https://github.com/iamemilio/llama-stack/blob/ecdfecb9f0bd821bf7800e4a742ee8fed59a486b/src/llama_stack/core/telemetry/constants.py) | Server | string | Llama-guard shield call | Custom Code | Working | Not Following Semconv |
| [Status](https://github.com/iamemilio/llama-stack/blob/ecdfecb9f0bd821bf7800e4a742ee8fed59a486b/src/llama_stack/core/telemetry/constants.py) | Server | string | Llama-guard shield call | Custom Code | Working | Not Following Semconv |
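
For illustration, a minimal sketch of how such a shield span could be emitted with the OpenTelemetry tracing API; the span and attribute names here are placeholders, not the exact constants defined in the PR's telemetry code.

```python
# Sketch only: placeholder span/attribute names, not the exact constants used in this PR.
import json

from opentelemetry import trace

tracer = trace.get_tracer("llama_stack.safety")

def run_shield_with_span(shield_id: str, messages: list[dict]) -> dict:
    with tracer.start_as_current_span("safety.run_shield") as span:
        span.set_attribute("llama_stack.safety.shield_id", shield_id)
        span.set_attribute("llama_stack.safety.messages", json.dumps(messages))
        result = {"status": "pass", "metadata": {}}  # stand-in for the real shield call
        span.set_attribute("llama_stack.safety.response", json.dumps(result))
        span.set_attribute("llama_stack.safety.status", result["status"])
        return result
```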

Remote Tool Listing & Execution

| Value | Location | Content | Testing | Handled By | Status | Notes |
| ----- | :---: | :---: | :---: | :---: | :---: | :---: |
| Tool name | server | string | Tool call occurs | Custom Code | working | [Not following semconv](https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-spans/#execute-tool-span) |
| Server URL | server | string | List tools or execute tool call | Custom Code | working | [Not following semconv](https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-spans/#execute-tool-span) |
| Server Label | server | string | List tools or execute tool call | Custom code | working | [Not following semconv](https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-spans/#execute-tool-span) |
| mcp_list_tools_id | server | string | List tools | Custom code | working | [Not following semconv](https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-spans/#execute-tool-span) |
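
As a reference for the semconv follow-up, here is a hedged sketch of what a spec-aligned tool-execution span could look like. The `gen_ai.*` names are taken from the still-incubating GenAI semantic conventions and should be checked against the spec; the `mcp.*` attributes are illustrative placeholders.

```python
# Sketch only: gen_ai.* names follow the incubating GenAI semantic conventions and
# should be verified against the spec; mcp.* names are illustrative placeholders.
from opentelemetry import trace

tracer = trace.get_tracer("llama_stack.tool_runtime")

def execute_mcp_tool(tool_name: str, server_url: str, server_label: str, call_id: str):
    with tracer.start_as_current_span(f"execute_tool {tool_name}") as span:
        span.set_attribute("gen_ai.operation.name", "execute_tool")
        span.set_attribute("gen_ai.tool.name", tool_name)
        span.set_attribute("gen_ai.tool.call.id", call_id)
        # MCP-specific context kept as custom attributes until semconv covers it
        span.set_attribute("mcp.server.url", server_url)
        span.set_attribute("mcp.server.label", server_label)
        ...  # perform the actual MCP call here and record the result/status
```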

Metrics

  • Prompt and Completion Token histograms ✅ (see the sketch after this list)
  • Updated the Grafana dashboard to support the OTEL semantic conventions for tokens
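
A minimal sketch of how these token histograms can be recorded with the OpenTelemetry metrics API, assuming the semconv `gen_ai.client.token.usage` histogram with a `gen_ai.token.type` attribute; in this PR the values come from auto-instrumentation rather than hand-written code like this.

```python
# Sketch only: metric and attribute names assumed from the GenAI semantic conventions,
# not copied from this PR (where auto-instrumentation records them).
from opentelemetry import metrics

meter = metrics.get_meter("llama_stack.inference")
token_usage = meter.create_histogram(
    name="gen_ai.client.token.usage",
    unit="{token}",
    description="Input and output tokens used per request",
)

def record_token_usage(prompt_tokens: int, completion_tokens: int, model: str) -> None:
    common = {"gen_ai.request.model": model}
    token_usage.record(prompt_tokens, attributes={**common, "gen_ai.token.type": "input"})
    token_usage.record(completion_tokens, attributes={**common, "gen_ai.token.type": "output"})
```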

Observations

  • sqlite spans get orphaned from the completions endpoint
    • Known OTEL issue; the recommended workaround is to disable sqlite instrumentation, since it is double wrapped and already covered by sqlalchemy. This is covered in the documentation.
export OTEL_PYTHON_DISABLED_INSTRUMENTATIONS="sqlite3"
  • Responses API instrumentation is missing in OpenTelemetry for OpenAI clients, even with traceloop or openllmetry
    • Upstream issue: open-telemetry/opentelemetry-python-contrib#3436
  • A span is created for each streaming response, and each chunk is recorded, so very large spans get created; this is not ideal, but it is the intended behavior
  • MCP telemetry needs to be updated to follow semantic conventions. We can probably use a library for this and handle it in a separate issue.

Updated Grafana Dashboard

(screenshot)

Status

✅ Everything appears to be working, and the data we expect is getting captured in the format we expect.

Follow Ups

  1. Make tool calling spans follow semconv and capture more data
    1. Consider using an existing tracing library
  2. Make shield spans follow semconv
  3. Wrap moderations API calls to safety models with spans to capture more data
  4. Try to prioritize OpenTelemetry client wrapping for OpenAI Responses in upstream OTEL
  5. This would break the telemetry tests, which are currently disabled. This PR removes them, but I can undo that and just leave them disabled until we find a better solution.
  6. Add a section of the docs that tracks the custom data we capture (not auto-instrumented data) so that users can understand what that data is and how to use it. Commit those changes to the OTEL-gen_ai SIG if possible as well. Here is an [example](https://opentelemetry.io/docs/specs/semconv/gen-ai/aws-bedrock/) of how Bedrock handles it.

meta-cla bot added the CLA Signed label Nov 11, 2025
mergify bot commented Nov 11, 2025

This pull request has merge conflicts that must be resolved before it can be merged. @iamemilio please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify bot commented Nov 13, 2025

This pull request has merge conflicts that must be resolved before it can be merged. @iamemilio please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Nov 13, 2025
@iamemilio force-pushed the auto_instrument_1 branch 2 times, most recently from ad0eef7 to 0c442cd on November 17, 2025 at 17:36
@mergify mergify bot removed the needs-rebase label Nov 17, 2025
github-actions bot commented Nov 17, 2025

✱ Stainless preview builds

This PR will update the llama-stack-client SDKs with the following commit message.

feat(telemetry): Architect Llama Stack Telemetry Around Automatic Open Telemetry Instrumentation
  • ⚠️ llama-stack-client-node: there was a regression in your SDK (generate ⚠️, build ⏳, lint ⏳, test ⏳)
  • ⚠️ llama-stack-client-kotlin: there was a regression in your SDK (generate ⚠️, lint ⏳, test ⏳)
  • ⚠️ llama-stack-client-python: there was a regression in your SDK (conflict)
  • ⚠️ llama-stack-client-go: there was a regression in your SDK (generate ⚠️, lint ⏳, test ⏳)

go get github.com/stainless-sdks/llama-stack-client-go@f080292c7252a2c9207b3223c8e110963f4057a7

This comment is auto-generated by GitHub Actions and is automatically kept up to date as you push.
Last updated: 2025-12-01 18:43:57 UTC

@iamemilio iamemilio changed the title feat(telemetry): Architect Llama Stack Telemetry Around Automatic Open Telemetry Instrumentation feat!(telemetry): Architect Llama Stack Telemetry Around Automatic Open Telemetry Instrumentation Nov 17, 2025
@iamemilio iamemilio changed the title feat!(telemetry): Architect Llama Stack Telemetry Around Automatic Open Telemetry Instrumentation feat!: Architect Llama Stack Telemetry Around Automatic Open Telemetry Instrumentation Nov 17, 2025
@grs (Contributor) commented Nov 18, 2025

Looks good to me.

@iamemilio (Contributor, Author) commented:
All spans are captured as a distributed trace that originates from calls made by the OpenAI client. The test driver above created this trace.

Trace from this change

(screenshot)

Client Span (there is more content, but it got cut off)

(screenshot)

Cut off Values

| Attribute | Value |
| --- | --- |
| llm.headers | None |
| llm.is_streaming | false |
| llm.request.type | chat |
| llm.usage.total_tokens | 43 |
| otel.scope.name | opentelemetry.instrumentation.openai.v1 |
| otel.scope.version | 0.48.0 |
| span.kind | client |

HTTP Post Span

(screenshot)

Completions Call Span (server side)

(screenshot)

Database Spans

(screenshot)

@iamemilio (Contributor, Author) commented:
Screenshots using Llama Stack from main:

llama stack run starter

NOTE: The client span is identical because it came from the OpenAI client, which I instrumented.

HTTP Post

(screenshot)

Inference Router Span

(screenshot) Note that the Args are a little cut off in the picture, and that tokens are captured as logs rather than as attributes of the span.

Model Routing Span

(screenshot)

Routing Table Span

(screenshot)

@iamemilio (Contributor, Author) commented Nov 24, 2025

@ehhuang take a look and let me know your thoughts. One thing we were not tracking when we did the testing was the output from the model routing table, and I don't think that content persists in the changes I am proposing. Would it be acceptable to create an issue to capture spans with routing table attributes as a follow-up to this PR?

@ashwinb (Contributor) commented Nov 24, 2025

@iamemilio I think not having the crazy old "trace protocol" spans for has_model, etc. is just fine in my opinion. I will let @ehhuang look over once though.

@cdoern (Collaborator) left a comment:
I think this is something we need and the logic looks pretty sound to me, approving! @ehhuang should likely have the final look before merge though.

@leseb (Collaborator) left a comment:
well done, I really like this new approach, thanks!

@iamemilio (Contributor, Author) commented Nov 26, 2025

@leseb I addressed what remains of the telemetry API here. It should be resolved now, thanks for checking. Please take another look once CI is back on.

@ehhuang (Contributor) left a comment:
Looks great! thanks for working on this!

@ashwinb ashwinb merged commit 7da7330 into llamastack:main Dec 1, 2025
2 checks passed
ashwinb added a commit that referenced this pull request Dec 1, 2025
# What does this PR do?
Removes stale data from llama stack about old telemetry system


**Depends on** #4127

Co-authored-by: Ashwin Bharambe <[email protected]>
codefromthecrypt added a commit to codefromthecrypt/llama-stack that referenced this pull request Dec 2, 2025
Builds on llamastack#4127 by including OpenTelemetry dependencies in Docker images.

- Install opentelemetry-distro and opentelemetry-exporter-otlp
- Run opentelemetry-bootstrap to install auto-instrumentation libraries
- Detect OTEL_* environment variables and wrap with opentelemetry-instrument
- No default OTEL configuration - users control via environment variables

Users can enable telemetry by setting any OTEL_* environment variable.

Signed-off-by: Adrian Cole <[email protected]>
codefromthecrypt added a commit to codefromthecrypt/llama-stack that referenced this pull request Dec 2, 2025
Builds on llamastack#4127 by adding OpenTelemetry auto-instrumentation support to Docker images. After llamastack#4127 migrated to automatic instrumentation, the Docker images lacked the necessary dependencies. This PR installs the OTEL packages and enables instrumentation when any OTEL_* environment variable is set.

Test Plan:

Build image:
docker build -f containers/Containerfile   --build-arg DISTRO_NAME=starter   --build-arg INSTALL_MODE=editable   --tag llamastack/distribution-starter:otel-test .

Run with trace propagation enabled (parentbased_traceidratio with 0.0 prevents new traces but allows propagation of incoming traces):
docker run -p 8321:8321   -e OTEL_EXPORTER_OTLP_ENDPOINT=http://host.docker.internal:4318   -e OTEL_SERVICE_NAME=llama-stack   -e OTEL_TRACES_SAMPLER=parentbased_traceidratio   -e OTEL_TRACES_SAMPLER_ARG=0.0   llamastack/distribution-starter:otel-test

Ran a sample flight search agent. Traces successfully captured.

Signed-off-by: Adrian Cole <[email protected]>
@codefromthecrypt (Contributor) commented:
Added a follow-up here so that, like before, it is easy to use in Docker: #4281

ashwinb pushed a commit that referenced this pull request Dec 3, 2025
…4281)

# What does this PR do?

This allows llama-stack users of the Docker image to use OpenTelemetry
like previous versions.

#4127 migrated to automatic instrumentation, but unless we add those
libraries to the image, everyone needs to build a custom image to enable
otel. Also, unless we establish a convention for enabling it, users who
formerly just set config now need to override the entrypoint.

This PR bootstraps OTEL packages, so they are available (only +10MB). It
also prefixes `llama stack run` with `opentelemetry-instrument` when any
`OTEL_*` environment variable is set.

The result is implicit tracing like before, where you don't need a
custom image to use traces or metrics.

## Test Plan

```bash
# Build image
docker build -f containers/Containerfile \
  --build-arg DISTRO_NAME=starter \
  --build-arg INSTALL_MODE=editable \
  --tag llamastack/distribution-starter:otel-test .

# Run with OTEL env to implicitly use `opentelemetry-instrument`. The
# Settings below ensure inbound traces are honored, but no
# "junk traces" like SQL connects are created.
docker run -p 8321:8321 \
  -e OTEL_EXPORTER_OTLP_ENDPOINT=http://host.docker.internal:4318 \
  -e OTEL_SERVICE_NAME=llama-stack \
  -e OTEL_TRACES_SAMPLER=parentbased_traceidratio \
  -e OTEL_TRACES_SAMPLER_ARG=0.0 \
  llamastack/distribution-starter:otel-test
```

Ran a sample flight search agent which is instrumented on the client
side. This and llama-stack both target
[otel-tui](https://github.com/ymtdzzz/otel-tui). I verified no root
database spans, yet database spans are attached to incoming traces.


<img width="1608" height="742" alt="screenshot"
src="https://github.com/user-attachments/assets/69f59b74-3054-42cd-947d-a6c0d9472a7c"
/>

Signed-off-by: Adrian Cole <[email protected]>
r-bit-rry pushed a commit to r-bit-rry/llama-stack that referenced this pull request Dec 3, 2025
r-bit-rry pushed a commit to r-bit-rry/llama-stack that referenced this pull request Dec 3, 2025
r-bit-rry pushed a commit to r-bit-rry/llama-stack that referenced this pull request Dec 3, 2025
r-bit-rry pushed a commit to r-bit-rry/llama-stack that referenced this pull request Dec 4, 2025
r-bit-rry pushed a commit to r-bit-rry/llama-stack that referenced this pull request Dec 4, 2025
r-bit-rry pushed a commit to r-bit-rry/llama-stack that referenced this pull request Dec 4, 2025
skamenan7 added a commit to skamenan7/llama-stack that referenced this pull request Dec 4, 2025
Inject stream_options={"include_usage": True} when streaming and OpenTelemetry
telemetry is active. Telemetry always overrides any caller preference to ensure
complete and consistent observability metrics.

Changes:
- Add conditional stream_options injection to OpenAIMixin (benefits OpenAI,
  Bedrock, Runpod, Together, Fireworks providers)
- Add conditional stream_options injection to LiteLLMOpenAIMixin (benefits
  litellm-based providers that call parent methods)
- Add telemetry-gated stream_options injection to WatsonX's overridden methods
  (WatsonX bypasses LiteLLMOpenAIMixin by calling litellm.acompletion directly,
  so it replicates the mixin's telemetry-aware injection logic)
- Check telemetry status using trace.get_current_span().is_recording()
- Override include_usage=False when telemetry active to prevent metric gaps
- Unit tests for this functionality (16 tests total)
- Remove legacy ungated stream_options from Bedrock and Runpod providers
  (pre-llamastack#4127 code that bypassed telemetry gating)

Fixes llamastack#3981
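
A minimal sketch of the gating described in the commit above, assuming the check is applied to the request parameters before they are sent; the helper name and parameter shape are illustrative, not taken from that commit.

```python
# Sketch only: illustrates the is_recording() gate described above; the helper name
# and parameter shape are illustrative, not taken from that commit.
from opentelemetry import trace

def maybe_force_include_usage(params: dict) -> dict:
    if params.get("stream") and trace.get_current_span().is_recording():
        # Telemetry is active: override any caller preference so streamed
        # responses still report token usage for metrics.
        params = {**params, "stream_options": {"include_usage": True}}
    return params
```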