feat(weave): support merged scorers in monitors#6262
feat(weave): support merged scorers in monitors#6262jtschoonhoven wants to merge 7 commits intomasterfrom
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
|
Preview this PR with FeatureBee: https://beta.wandb.ai/?betaVersion=2a1fb22df25b6a7150827880d2079a7834ea8c73 |
| trace_id = req.trace_id or generate_id() | ||
| parent_id = req.parent_id |
There was a problem hiding this comment.
Passing through the parent_id allows us to make this completion a child call within a trace.
| @register_object | ||
| class ClassifierScorer(LLMAsAJudgeScorer): | ||
| """A classifier LLM scorer that tags calls (displayed as pills in the UI).""" |
There was a problem hiding this comment.
Currently ClassifierScorer shares all its behavior with LLMAsAJudgeScorer but we still need to be able to distinguish between them.
Instead of making this its own class I considered adding an is_classifier attribute on LLMAsAJudgeScorer. But I expect these classes to diverge in the future and it will be nice to encapsulate classifier-specific behavior in its own class.
35779b7 to
eed7750
Compare
| # Optionally inject extra fields in score_args | ||
| inject_exception: bool = Field( | ||
| default=False, | ||
| description="Whether `call.exception` should be automatically added to `score_args`.", | ||
| ) | ||
| inject_source_code_on_exception: bool = Field( | ||
| default=False, | ||
| description="Whether the source code for the op should be automatically added to `score_args` (when `call.exception` is set).", | ||
| ) |
There was a problem hiding this comment.
For trace analysis we want scorers to have enough context to accurately classify failed calls.
Rather than make this configurable I could instead hardcode this behavior in the scoring worker. But I'm erring on the side of making things configurable.
| merge_scorers: bool = Field( | ||
| default=False, | ||
| description="If True, scorers are merged and treated as a single scorer.", | ||
| ) | ||
| merged_scorers_prompt_header: str | None = Field( | ||
| default=None, | ||
| description="Text prepended before the merged classifier prompts.", | ||
| ) | ||
| merged_scorers_prompt_footer: str | None = Field( | ||
| default=None, | ||
| description="Text appended after the merged classifier prompts.", | ||
| ) | ||
| merged_scorers_prompt_section_header: str = Field( | ||
| default="{display_name}", | ||
| description="Text to prepend before each merged scorer prompt (use `{display_name}` to access the scorer's name).", | ||
| ) |
There was a problem hiding this comment.
Does all of this make sense if scorers are not LLMAsAJudgeScorers? Right now we only support those, but soon we will support custom code scorers, for which "merging" may not make sense.
There was a problem hiding this comment.
These only apply to LLMAsAJudgeScorers. Alternatively I could:
- Clarify that in the attribute names and descriptions
- Hardcode this behavior in
scoring_worker.pyfor classifiers - Something else
There was a problem hiding this comment.
I think we should hard-code the behavior in scoring_worker.py, not expose it as settings.
Description
Updates to support trace analysis features:
ClassifierScorerthat inherits fromLLMAsAJudgeScorerMonitor.merge*attributes to control how Scorers are combined. See below.Monitor.is_tracedso that tracing within monitors is configurable.Scorer.display_nameproperty (not necessary but nice-to-have)LLMAsAJudgeScorer.inject_exceptionandLLMAsAJudgeScorer.inject_source_code_on_exceptionwhich can be used to make that data available to the scorer.completions_createfunction to receive aparent_idso these completions can be traced correctly.Testing
All these changes are backwards-compatible and default to the old behavior. Tested locally.