docs/docs/providers/inference/remote_watsonx.mdx (4 changes: 2 additions & 2 deletions)

@@ -16,8 +16,8 @@ IBM WatsonX inference provider for accessing AI models on IBM's WatsonX platform
 |-------|------|----------|---------|-------------|
 | `allowed_models` | `list[str] \| None` | No | | List of models that should be registered with the model registry. If None, all models are allowed. |
 | `url` | `<class 'str'>` | No | https://us-south.ml.cloud.ibm.com | A base url for accessing the watsonx.ai |
-| `api_key` | `pydantic.types.SecretStr \| None` | No | | The watsonx API key |
-| `project_id` | `str \| None` | No | | The Project ID key |
+| `api_key` | `pydantic.types.SecretStr \| None` | No | | The watsonx.ai API key |
+| `project_id` | `str \| None` | No | | The watsonx.ai project ID |
 | `timeout` | `<class 'int'>` | No | 60 | Timeout for the HTTP requests |

 ## Sample Configuration
llama_stack/core/routers/inference.py (2 changes: 1 addition & 1 deletion)

@@ -611,7 +611,7 @@ async def stream_tokens_and_compute_metrics_openai_chat(
             completion_text += "".join(choice_data["content_parts"])

         # Add metrics to the chunk
-        if self.telemetry and chunk.usage:
+        if self.telemetry and hasattr(chunk, "usage") and chunk.usage:
Collaborator: this looks unrelated

Contributor: I think there is a bug around this. I have seen a few PRs which introduced the same logic: #3392, #3422

Contributor (author): FWIW, my tests for watsonx.ai wouldn't succeed without this fix (as noted in the PR description). I am fine with letting some other PR put the fix in, but I think this PR should probably wait until that one is in if that's the plan.

Collaborator: does watsonx return usage information? we need to fix/adapt in the adapter, not in the core. putting this in the core obscures the issue.

Contributor (author): This line was already handling the case where there was no usage information by checking chunk.usage. So I would argue this change is just doing what the line was already doing, but in a more robust way. FWIW, the watsonx.ai API does return a usage block, e.g.:

	"usage": {
		"completion_tokens": 54,
		"prompt_tokens": 79,
		"total_tokens": 133
	},

Notably though, it doesn't put the usage information on every chunk. It only puts it on the final chunk in the stream. I can see that when I call the streaming REST API directly. However, I don't see it when I call LiteLLM directly with streaming=True. I do see it when I call LiteLLM without streaming, FWIW. So I think LiteLLM might be dropping the usage information from the last chunk. Here is how I tested this in a notebook:

import litellm, asyncio
import nest_asyncio
nest_asyncio.apply()

async def get_litellm_response():
    return await litellm.acompletion(
        model="watsonx/meta-llama/llama-3-3-70b-instruct",
        messages=[{"role": "user", "content": "What is the capital of France?"}],
        stream=True
    )

async def print_litellm_response():
    response = await get_litellm_response()
    async for chunk in response:
        print(chunk)

asyncio.run(print_litellm_response())

With all that said, even if LiteLLM was correctly including this on the last chunk, we'd still have the issue that it is missing from all of the other chunks (unless LiteLLM put in an explicit None for this field for each other chunk). So I still think we should adopt this change here and let the line handle both missing AND explicit None.
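
To make the distinction concrete, here is a minimal sketch with stand-in chunk classes (illustrative only, not LiteLLM's actual types); the old expression raises AttributeError when the attribute is missing, while the new one tolerates both a missing attribute and an explicit None:

# Stand-in chunk classes (hypothetical, not LiteLLM's types) to show the difference.
class ChunkWithoutUsage:
    content = "partial text"  # no `usage` attribute at all


class ChunkWithNoneUsage:
    content = "partial text"
    usage = None  # attribute present but explicitly None


for chunk in (ChunkWithoutUsage(), ChunkWithNoneUsage()):
    # Old check (`chunk.usage` alone) raises AttributeError when the attribute is missing.
    # New check handles both the missing attribute and the explicit None.
    if hasattr(chunk, "usage") and chunk.usage:
        print("emit usage metrics")
    else:
        print("skip usage metrics")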

Collaborator: ok, fair enough 😄

             metrics = self._construct_metrics(
                 prompt_tokens=chunk.usage.prompt_tokens,
                 completion_tokens=chunk.usage.completion_tokens,
llama_stack/providers/remote/inference/watsonx/__init__.py (11 changes: 2 additions & 9 deletions)

@@ -4,19 +4,12 @@
 # This source code is licensed under the terms described in the LICENSE file in
 # the root directory of this source tree.

-from llama_stack.apis.inference import Inference
-
 from .config import WatsonXConfig


-async def get_adapter_impl(config: WatsonXConfig, _deps) -> Inference:
-    # import dynamically so `llama stack build` does not fail due to missing dependencies
+async def get_adapter_impl(config: WatsonXConfig, _deps):
+    # import dynamically so the import is used only when it is needed
     from .watsonx import WatsonXInferenceAdapter

-    if not isinstance(config, WatsonXConfig):
-        raise RuntimeError(f"Unexpected config type: {type(config)}")
     adapter = WatsonXInferenceAdapter(config)
     return adapter
-
-
-__all__ = ["get_adapter_impl", "WatsonXConfig"]
llama_stack/providers/remote/inference/watsonx/config.py (4 changes: 2 additions & 2 deletions)

@@ -27,11 +27,11 @@ class WatsonXConfig(RemoteInferenceProviderConfig):
     )
     api_key: SecretStr | None = Field(
         default_factory=lambda: os.getenv("WATSONX_API_KEY"),
-        description="The watsonx API key",
+        description="The watsonx.ai API key",
     )
     project_id: str | None = Field(
         default_factory=lambda: os.getenv("WATSONX_PROJECT_ID"),
-        description="The Project ID key",
+        description="The watsonx.ai project ID",
     )
     timeout: int = Field(
         default=60,
llama_stack/providers/remote/inference/watsonx/models.py (47 changes: 0 additions & 47 deletions)

This file was deleted.
