LangSmith supports various evaluation types for different stages of development and deployment. Understanding when to use each helps build a comprehensive evaluation strategy.
## Offline evaluation types
Offline evaluation tests applications on curated datasets before deployment. By running evaluations on examples with reference outputs, teams can compare versions, validate functionality, and build confidence before exposing changes to users.
Run offline evaluations client-side using the LangSmith SDK ([Python](https://docs.smith.langchain.com/reference/python/reference) and [TypeScript](https://docs.smith.langchain.com/reference/js)) or server-side via the [Prompt Playground](/langsmith/observability-concepts#prompt-playground) or [automations](/langsmith/rules).
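
For example, a minimal client-side run with the Python SDK might look like the sketch below; the dataset name, target function, and correctness evaluator are hypothetical stand-ins for your own.

```python
from langsmith.evaluation import evaluate

def my_app(inputs: dict) -> dict:
    # Stand-in for your application: receives an example's inputs, returns outputs.
    return {"answer": f"Echo: {inputs['question']}"}

def correctness(run, example) -> dict:
    # Compares the app's output against the example's reference output.
    predicted = run.outputs["answer"]
    expected = example.outputs["answer"]
    return {"key": "correctness", "score": int(predicted == expected)}

# Runs my_app on every example in the dataset and records the results as an experiment.
results = evaluate(
    my_app,
    data="my-qa-dataset",          # placeholder name of an existing LangSmith dataset
    evaluators=[correctness],
    experiment_prefix="baseline",
)
```
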
### Benchmarking
_Benchmarking_ compares multiple application versions on a curated dataset to identify the best performer. This process involves creating a dataset of representative inputs, defining performance metrics, and testing each version.
Benchmarking requires dataset curation with gold-standard reference outputs and well-designed comparison metrics. Examples:

- **RAG Q&A bot**: Dataset of questions and reference answers, with an LLM-as-judge evaluator checking semantic equivalence between actual and reference answers.
- **ReACT agent**: Dataset of user requests and reference tool calls, with a heuristic evaluator verifying all expected tool calls were made.
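
As an illustration, the heuristic evaluator in the agent example could be a plain Python function like this sketch; the `tool_calls` output schema is an assumption, not a fixed LangSmith format.

```python
def expected_tool_calls(run, example) -> dict:
    """Score 1 if the agent made every reference tool call, else 0."""
    # Assumed schema: both the run and the reference store a list of tool names
    # under a "tool_calls" key.
    made = set(run.outputs.get("tool_calls", []))
    expected = set(example.outputs.get("tool_calls", []))
    missing = expected - made
    return {
        "key": "expected_tool_calls",
        "score": int(not missing),
        "comment": f"Missing: {sorted(missing)}" if missing else "All expected tools called",
    }
```
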
### Unit tests
_Unit tests_ verify the correctness of individual system components. In LLM contexts, [unit tests are often rule-based assertions](https://hamel.dev/blog/posts/evals/#level-1-unit-tests) on inputs or outputs (e.g., verifying LLM-generated code compiles, JSON loads successfully) that validate basic functionality.
Unit tests typically expect consistent passing results, making them suitable for CI pipelines. When running in CI, configure caching to minimize LLM API calls and associated costs.
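
For example, a rule-based unit test might simply assert that the model's output parses as JSON; `generate_json` below is a placeholder for your own LLM call.

```python
import json

def generate_json(prompt: str) -> str:
    # Placeholder for an LLM call that is expected to return a JSON string.
    return '{"name": "Ada", "age": 36}'

def test_output_is_valid_json():
    raw = generate_json("Return the user's name and age as JSON.")
    parsed = json.loads(raw)  # raises, and fails the test, if the output is not valid JSON
    assert {"name", "age"} <= parsed.keys()
```
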
### Regression tests
_Regression tests_ measure performance consistency across application versions over time. They ensure new versions do not degrade performance on cases the current version handles correctly, and ideally demonstrate improvements over the baseline. These tests typically run when making updates expected to affect user experience (e.g., model or architecture changes).
LangSmith's comparison view highlights regressions (red) and improvements (green) relative to the baseline, enabling quick identification of changes.

### Backtesting

_Backtesting_ evaluates new application versions against historical production data. Production logs are converted into a dataset, then newer versions process these examples to assess performance on past, realistic user inputs.
This approach is commonly used for evaluating new model releases. For example, when a new model becomes available, test it on the most recent production runs and compare results to actual production outcomes.
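
One way to assemble such a dataset with the Python SDK is sketched below; the project and dataset names are placeholders, and you may want additional filters on which runs to copy.

```python
from datetime import datetime, timedelta
from langsmith import Client

client = Client()

# Pull the last week of top-level production runs from a tracing project.
runs = client.list_runs(
    project_name="my-production-project",   # placeholder project name
    start_time=datetime.now() - timedelta(days=7),
    is_root=True,
)

dataset = client.create_dataset(dataset_name="prod-backtest")
for run in runs:
    # Keep the production input; store the old output so it can be compared later.
    client.create_example(
        inputs=run.inputs,
        outputs=run.outputs,
        dataset_id=dataset.id,
    )
```
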
### Pairwise evaluation
_Pairwise evaluation_ compares outputs from two versions by determining relative quality rather than assigning absolute scores. For some tasks, [determining "version A is better than B"](https://www.oreilly.com/radar/what-we-learned-from-a-year-of-building-with-llms-part-i/) is easier than scoring each version independently.
This approach proves particularly useful for LLM-as-judge evaluations on subjective tasks. For example, in summarization, determining "Which summary is clearer and more concise?" is often simpler than assigning numeric clarity scores.
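
As a rough sketch of the idea, independent of any particular SDK signature, a pairwise judge sees one input and two candidate outputs and returns a preference rather than two absolute scores; `call_llm` is a placeholder for a real model call.

```python
def call_llm(prompt: str) -> str:
    # Placeholder for a call to a judge model.
    return "A"

def judge_pair(source_text: str, summary_a: str, summary_b: str) -> dict:
    """Ask a judge model which summary is clearer and more concise."""
    prompt = (
        "Which summary of the source text is clearer and more concise?\n\n"
        f"Source: {source_text}\n\nSummary A: {summary_a}\n\nSummary B: {summary_b}\n\n"
        "Answer with exactly 'A' or 'B'."
    )
    verdict = call_llm(prompt).strip().upper()
    # A relative preference, not an absolute score for each version.
    return {"preferred": "A" if verdict.startswith("A") else "B"}
```
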
Learn [how to run pairwise evaluations](/langsmith/evaluate-pairwise).
## Online evaluation types
Online evaluation assesses production application outputs in near real-time. Without reference outputs, these evaluations focus on detecting issues, monitoring quality trends, and identifying edge cases that inform future offline testing.
Online evaluators typically run server-side. LangSmith provides built-in [LLM-as-judge evaluators](/langsmith/llm-as-judge) that you can configure, and also supports custom code evaluators that run within LangSmith.
### Real-time monitoring
Monitor application quality continuously as users interact with the system. Online evaluations run automatically on production traffic, providing immediate feedback on each interaction. This enables detection of quality degradation, unusual patterns, or unexpected behaviors before they impact significant user populations.
### Anomaly detection
Identify outliers and edge cases that deviate from expected patterns. Online evaluators can flag runs with unusual characteristics—extremely long or short responses, unexpected error rates, or outputs that fail safety checks—for human review and potential addition to offline datasets.
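
A simple reference-free check along these lines might flag responses whose length falls outside an expected range. The sketch below is generic Python; the exact signature and wiring depend on how the custom evaluator is configured in LangSmith, and the thresholds are illustrative.

```python
def flag_unusual_length(run) -> dict:
    """Flag runs whose response length is far outside the expected range."""
    response = (run.outputs or {}).get("answer", "")
    too_short = len(response) < 20
    too_long = len(response) > 4000
    return {
        "key": "unusual_length",
        "score": int(too_short or too_long),  # 1 = anomalous; route to human review
        "comment": "too short" if too_short else ("too long" if too_long else "ok"),
    }
```
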
### Production feedback loop
Use insights from production to improve offline evaluation. Online evaluations surface real-world issues and usage patterns that may not appear in curated datasets. Failed production runs become candidates for dataset examples, creating an iterative cycle where production experience continuously refines testing coverage.
The following sections help you create datasets, run evaluations, and analyze results.

LangSmith supports two types of evaluations based on when and where they run:

<CardGroup cols={2}>
<Card
title="Offline Evaluation"
icon="flask"
>
**Test before you ship**

Run evaluations on curated datasets during development to compare versions, benchmark performance, and catch regressions.
</Card>

<Card
title="Online Evaluation"
icon="radar"
>
**Monitor in production**

Evaluate real user interactions in real-time to detect issues and measure quality on live traffic.
</Card>
</CardGroup>

## Evaluation workflow

<Tabs>
<Tab title="Offline evaluation flow">

<Steps>
<Step title="Create a dataset">
Create a [dataset](/langsmith/manage-datasets) with <Tooltip tip="Individual test cases with inputs and reference outputs">[examples](/langsmith/evaluation-concepts#examples)</Tooltip> from manually curated test cases, historical production traces, or synthetic data generation.
</Step>

<Step title="Define evaluators">
Create <Tooltip tip="Functions that score how well your application performs">[evaluators](/langsmith/evaluation-concepts#evaluators)</Tooltip> to score performance, such as heuristic checks, LLM-as-judge evaluators, or human review.
</Step>

<Step title="Run an experiment">
Execute your application on the dataset to create an <Tooltip tip="Results of evaluating a specific application version on a dataset">[experiment](/langsmith/evaluation-concepts#experiment)</Tooltip>. Configure [repetitions, concurrency, and caching](/langsmith/experiment-configuration) to optimize runs.
</Step>

<Step title="Analyze results">
Compare experiments for [benchmarking](/langsmith/evaluation-types#benchmarking), [unit tests](/langsmith/evaluation-types#unit-tests), [regression tests](/langsmith/evaluation-types#regression-tests), or [backtesting](/langsmith/evaluation-types#backtesting).
</Step>
</Steps>

</Tab>

<Tab title="Online evaluation flow">

<Steps>
<Step title="Deploy your application">
Each interaction creates a <Tooltip tip="A single execution trace including inputs, outputs, and intermediate steps">[run](/langsmith/evaluation-concepts#runs)</Tooltip> without reference outputs.
</Step>

<Step title="Configure online evaluators">
Set up [evaluators](/langsmith/online-evaluations) to run automatically on production traces: safety checks, format validation, quality heuristics, and reference-free LLM-as-judge. Apply [filters and sampling rates](/langsmith/online-evaluations#4-optional-configure-a-sampling-rate) to control costs.
</Step>

<Step title="Monitor in real-time">
Evaluators run automatically on [runs](/langsmith/evaluation-concepts#runs) or <Tooltip tip="Collections of related runs forming multi-turn conversations">[threads](/langsmith/online-evaluations#configure-multi-turn-online-evaluators)</Tooltip>, providing real-time monitoring, anomaly detection, and alerting.
</Step>

<Step title="Establish a feedback loop">
Add failing production traces to your [dataset](/langsmith/manage-datasets), create targeted evaluators, validate fixes with offline experiments, and redeploy.
</Step>
</Steps>

</Tab>
</Tabs>

<Tip>
For more on the differences between offline and online evaluation, refer to the [Evaluation concepts](/langsmith/evaluation-concepts#quick-reference-offline-vs-online-evaluation) page.
</Tip>

## Get started

<Columns cols={3}>
<Card
title="Evaluation quickstart"
icon="rocket"
href="/langsmith/evaluation-quickstart"
arrow="true"
>
Get started with offline evaluation.
</Card>

<Card
title="Run offline evaluations"
icon="microscope"
href="/langsmith/evaluate-llm-application"
arrow="true"
>
Explore evaluation types, techniques, and frameworks for comprehensive testing.
</Card>

<Card
title="Run online evaluations"
icon="radar"
href="/langsmith/online-evaluations"
arrow="true"
>
Monitor production quality in real-time from the Observability tab.
</Card>
</Columns>

## Experiment configuration

LangSmith supports several configuration options for experiments:

- [Repetitions](#repetitions)
- [Concurrency](#concurrency)
- [Caching](#caching)

### Repetitions
_Repetitions_ run an experiment multiple times to account for LLM output variability. Since LLM outputs are non-deterministic, multiple repetitions provide a more accurate performance estimate.
Configure repetitions by passing the `num_repetitions` argument to `evaluate` / `aevaluate` ([Python](https://docs.smith.langchain.com/reference/python/evaluation/langsmith.evaluation._runner.evaluate), [TypeScript](https://docs.smith.langchain.com/reference/js/interfaces/evaluation.EvaluateOptions#numrepetitions)). Each repetition re-runs both the target function and all evaluators.
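
For example, with a hypothetical target function and placeholder dataset name:

```python
from langsmith.evaluation import evaluate

def my_app(inputs: dict) -> dict:
    # Stand-in target function.
    return {"answer": inputs["question"].upper()}

# Runs the whole experiment three times; each repetition re-runs the target and evaluators.
evaluate(
    my_app,
    data="my-qa-dataset",  # placeholder dataset name
    num_repetitions=3,
)
```
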
Learn more in the [repetitions how-to guide](/langsmith/repetition).
### Concurrency
_Concurrency_ controls how many examples run simultaneously during an experiment. Configure it by passing the `max_concurrency` argument to `evaluate` / `aevaluate`. The semantics differ between the two functions:
#### `evaluate`
The `max_concurrency` argument specifies the maximum number of concurrent threads for running both the target function and evaluators.
#### `aevaluate`
The `max_concurrency` argument uses a semaphore to limit concurrent tasks. `aevaluate` creates a task for each example, where each task runs the target function and all evaluators for that example. The `max_concurrency` argument specifies the maximum number of concurrent examples to process.
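
A minimal async sketch, again with placeholder names:

```python
import asyncio
from langsmith.evaluation import aevaluate

async def my_async_app(inputs: dict) -> dict:
    # Stand-in async target function.
    return {"answer": inputs["question"][::-1]}

def exact_match(run, example) -> dict:
    return {"key": "exact_match", "score": int(run.outputs == example.outputs)}

async def main():
    # At most 5 examples are processed at once; each task runs the target
    # plus all evaluators for a single example.
    await aevaluate(
        my_async_app,
        data="my-qa-dataset",  # placeholder dataset name
        evaluators=[exact_match],
        max_concurrency=5,
    )

asyncio.run(main())
```
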
### Caching
_Caching_ stores API call results to disk to speed up future experiments. Set the `LANGSMITH_TEST_CACHE` environment variable to a valid folder path with write access. Future experiments that make identical API calls will reuse cached results instead of making new requests.
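
For instance, set the variable before running experiments (the cache path here is arbitrary and assumed to exist):

```python
import os

# Identical API calls in later experiments are served from this on-disk cache.
os.environ["LANGSMITH_TEST_CACHE"] = "tests/cassettes"
```
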

## How to run evaluations with pytest (beta)

The LangSmith pytest plugin lets Python developers define their datasets and evaluations as pytest test cases.
Compared to the standard evaluation flow, this is useful when:

* **Each example requires different evaluation logic**: Standard evaluation flows assume consistent application and evaluator execution across all dataset examples. For more complex systems or comprehensive evaluations, specific system subsets may require evaluation with particular input types and metrics. These heterogeneous evaluations are simpler to write as distinct test case suites that are tracked together.
* **You want to assert binary expectations**: Track assertions in LangSmith and raise assertion errors locally (e.g., in CI pipelines). Testing tools help when you are both evaluating system outputs and asserting basic properties about them.
* **You want pytest-like terminal outputs**: Get familiar pytest output formatting in your terminal.
* **You already use pytest to test your app**: Add LangSmith tracking to existing pytest workflows.
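
A minimal sketch of what such a test can look like; `generate_sql` stands in for your application, and the `langsmith.testing` helpers shown here may change while the integration is in beta.

```python
import pytest
from langsmith import testing as t

def generate_sql(question: str) -> str:
    # Placeholder for your LLM-backed application code.
    return "SELECT COUNT(*) FROM users WHERE signup_date >= date('now', '-7 days');"

@pytest.mark.langsmith
def test_sql_generation_is_select_only():
    question = "How many users signed up in the last week?"
    t.log_inputs({"question": question})

    sql = generate_sql(question)
    t.log_outputs({"sql": sql})

    # Binary expectation: tracked in LangSmith and raised locally on failure.
    assert sql.strip().upper().startswith("SELECT")
```
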
<Warning>
The pytest integration is in beta and is subject to change in upcoming releases.
</Warning>

## Run evaluations with Vitest/Jest

LangSmith provides integrations with Vitest and Jest that allow JavaScript and TypeScript developers to define their datasets and evaluations as test cases.

Compared to the `evaluate()` evaluation flow, this is useful when:
* **Each example requires different evaluation logic**: Standard evaluation flows assume consistent application and evaluator execution across all dataset examples. For more complex systems or comprehensive evaluations, specific system subsets may require evaluation with particular input types and metrics. These heterogeneous evaluations are simpler to write as distinct test case suites that are tracked together.
* **You want to assert binary expectations**: Track assertions in LangSmith and raise assertion errors locally (e.g., in CI pipelines). Testing tools help when you are both evaluating system outputs and asserting basic properties about them.
* **You want to take advantage of mocks, watch mode, local results, or other features of the Vitest/Jest ecosystems.**