feat(weave): support merged scorers in monitors#6262

Open
jtschoonhoven wants to merge 7 commits into master from jon/errorcat

Contributor

@jtschoonhoven jtschoonhoven commented Mar 4, 2026

Description

Updates to support trace analysis features:

  • Defines a new ClassifierScorer that inherits from LLMAsAJudgeScorer
  • Adds Monitor.merge* attributes to control how Scorers are combined. See below.
  • Adds Monitor.is_traced so that tracing within monitors is configurable.
  • Adds a Scorer.display_name property (not necessary but nice-to-have)
  • Adds LLMAsAJudgeScorer.inject_exception and LLMAsAJudgeScorer.inject_source_code_on_exception which can be used to make that data available to the scorer.
  • Updates the completions_create function to receive a parent_id so these completions can be traced correctly.

Testing

All these changes are backwards-compatible and default to the old behavior. Tested locally.

@codecov

codecov bot commented Mar 4, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

@wandbot-3000

wandbot-3000 bot commented Mar 4, 2026

Comment on lines +5505 to +5506
trace_id = req.trace_id or generate_id()
parent_id = req.parent_id
Contributor Author

Passing through the parent_id allows us to make this completion a child call within a trace.
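
A minimal sketch of the parent/child linkage described above. The `CompletionRequest` and `CallRecord` dataclasses, the `start_completion_call` helper, and this `generate_id` are hypothetical stand-ins for illustration, not the actual Weave types:

```python
from __future__ import annotations

import uuid
from dataclasses import dataclass


def generate_id() -> str:
    """Stand-in ID generator (assumption: the real one may differ)."""
    return uuid.uuid4().hex


@dataclass
class CompletionRequest:  # hypothetical request shape
    trace_id: str | None = None
    parent_id: str | None = None


@dataclass
class CallRecord:  # hypothetical record of a traced call
    id: str
    trace_id: str
    parent_id: str | None


def start_completion_call(req: CompletionRequest) -> CallRecord:
    # Reuse the caller's trace when one is given, otherwise start a new trace.
    trace_id = req.trace_id or generate_id()
    # Pass parent_id through unchanged; None means this call is a trace root.
    parent_id = req.parent_id
    return CallRecord(id=generate_id(), trace_id=trace_id, parent_id=parent_id)
```

With both IDs supplied, the completion is recorded inside the existing trace under its parent; with neither, it becomes the root of a fresh trace.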

Comment on lines +80 to +82
@register_object
class ClassifierScorer(LLMAsAJudgeScorer):
    """A classifier LLM scorer that tags calls (displayed as pills in the UI)."""
Contributor Author

Currently ClassifierScorer shares all its behavior with LLMAsAJudgeScorer but we still need to be able to distinguish between them.

Instead of making this its own class I considered adding an is_classifier attribute on LLMAsAJudgeScorer. But I expect these classes to diverge in the future and it will be nice to encapsulate classifier-specific behavior in its own class.
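
To illustrate the trade-off, here is a rough sketch of how a subclass with no added behavior can still be told apart by type. The classes below are simplified stand-ins (the real ones are registered objects with many more fields), and `is_classifier` is a hypothetical helper:

```python
class LLMAsAJudgeScorer:  # simplified stand-in, not the real Weave class
    def __init__(self, name: str) -> None:
        self.name = name


class ClassifierScorer(LLMAsAJudgeScorer):
    """Same behavior today; the type alone marks it as a classifier."""


def is_classifier(scorer: LLMAsAJudgeScorer) -> bool:
    # A type check replaces the `is_classifier` attribute that was considered.
    return isinstance(scorer, ClassifierScorer)
```

Classifier-specific behavior can later be added to `ClassifierScorer` without touching callers that already branch on the type.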

@jtschoonhoven jtschoonhoven changed the title from "trace analysis WIP" to "feat(weave): support merged scorers in monitors" Mar 9, 2026

Comment on lines +43 to +51
# Optionally inject extra fields in score_args
inject_exception: bool = Field(
    default=False,
    description="Whether `call.exception` should be automatically added to `score_args`.",
)
inject_source_code_on_exception: bool = Field(
    default=False,
    description="Whether the source code for the op should be automatically added to `score_args` (when `call.exception` is set).",
)
Contributor Author

For trace analysis we want scorers to have enough context to accurately classify failed calls.

Rather than make this configurable I could instead hardcode this behavior in the scoring worker. But I'm erring on the side of making things configurable.
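
A sketch of how a worker might honor these two flags when assembling `score_args`. The `Call` dataclass, the `build_score_args` helper, and the `"exception"` / `"source_code"` key names are assumptions made for illustration; the actual scoring worker may be structured differently:

```python
from __future__ import annotations

import inspect
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Call:  # hypothetical, pared-down view of a traced call
    inputs: dict[str, Any]
    output: Any
    exception: str | None = None


def build_score_args(
    call: Call,
    op_fn: Callable[..., Any],
    *,
    inject_exception: bool = False,
    inject_source_code_on_exception: bool = False,
) -> dict[str, Any]:
    score_args: dict[str, Any] = {"inputs": call.inputs, "output": call.output}
    if inject_exception:
        # Key name "exception" is an assumption.
        score_args["exception"] = call.exception
    if inject_source_code_on_exception and call.exception is not None:
        # Source is only attached for failed calls; key name is an assumption.
        try:
            score_args["source_code"] = inspect.getsource(op_fn)
        except (OSError, TypeError):
            # Source may be unavailable (e.g. built-ins or dynamically defined code).
            score_args["source_code"] = None
    return score_args
```

Both flags default to `False`, so existing scorers see the same `score_args` as before.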

@jtschoonhoven jtschoonhoven marked this pull request as ready for review March 9, 2026 17:08
@jtschoonhoven jtschoonhoven requested a review from a team as a code owner March 9, 2026 17:08
Comment on lines +68 to +83
merge_scorers: bool = Field(
    default=False,
    description="If True, scorers are merged and treated as a single scorer.",
)
merged_scorers_prompt_header: str | None = Field(
    default=None,
    description="Text prepended before the merged classifier prompts.",
)
merged_scorers_prompt_footer: str | None = Field(
    default=None,
    description="Text appended after the merged classifier prompts.",
)
merged_scorers_prompt_section_header: str = Field(
    default="{display_name}",
    description="Text to prepend before each merged scorer prompt (use `{display_name}` to access the scorer's name).",
)
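
As a rough illustration of how these four fields could combine into one prompt: the `build_merged_prompt` helper below is hypothetical, and joining the parts with blank lines is an assumption rather than the PR's actual behavior.

```python
from __future__ import annotations


def build_merged_prompt(
    scorer_prompts: dict[str, str],
    *,
    header: str | None = None,
    footer: str | None = None,
    section_header: str = "{display_name}",
) -> str:
    """Combine per-scorer prompts (keyed by display name) into a single prompt."""
    parts: list[str] = []
    if header:
        parts.append(header)
    for display_name, prompt in scorer_prompts.items():
        # Only the section header is a template; the prompt text is used verbatim.
        parts.append(section_header.format(display_name=display_name))
        parts.append(prompt)
    if footer:
        parts.append(footer)
    return "\n\n".join(parts)
```

For example, two classifier prompts named "Timeout" and "Auth" with `section_header="## {display_name}"` would yield one prompt with a `## Timeout` section followed by a `## Auth` section.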
Collaborator

Does all of this make sense if scorers are not LLMAsAJudgeScorers? Right now we only support those, but soon we will support custom code scorers, for which "merging" may not make sense.

Contributor Author

These only apply to LLMAsAJudgeScorers. Alternatively I could:

  • Clarify that in the attribute names and descriptions
  • Hardcode this behavior in scoring_worker.py for classifiers
  • Something else

Collaborator

I think we should hard-code the behavior in scoring_worker.py, not expose it as settings.
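
One way the hard-coded approach might look, sketched under assumptions: the class hierarchy is a bare stand-in, and `plan_scoring` is a hypothetical helper, not code from `scoring_worker.py`. Only classifiers are grouped for merging; every other scorer, including future custom code scorers, runs individually.

```python
from __future__ import annotations


class Scorer: ...  # stand-in base class


class LLMAsAJudgeScorer(Scorer): ...


class ClassifierScorer(LLMAsAJudgeScorer): ...


def plan_scoring(scorers: list[Scorer]) -> tuple[list[Scorer], list[Scorer]]:
    """Split scorers into (to be merged, to be run individually)."""
    merged = [s for s in scorers if isinstance(s, ClassifierScorer)]
    individual = [s for s in scorers if not isinstance(s, ClassifierScorer)]
    return merged, individual
```

Because the split depends only on scorer type, no merge-related settings need to be exposed on `Monitor`.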

2 participants