LangSmith supports various evaluation types for different stages of development and deployment. Understanding when to use each helps build a comprehensive evaluation strategy.
## Offline evaluation types
Offline evaluation tests applications on curated datasets before deployment. By running evaluations on examples with reference outputs, teams can compare versions, validate functionality, and build confidence before exposing changes to users.
Run offline evaluations client-side using the LangSmith SDK ([Python](https://docs.smith.langchain.com/reference/python/reference) and [TypeScript](https://docs.smith.langchain.com/reference/js)) or server-side via the [Prompt Playground](/langsmith/observability-concepts#prompt-playground) or [automations](/langsmith/rules).
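
For example, a minimal client-side run with the Python SDK might look like the sketch below; the dataset name, target function, and correctness evaluator are hypothetical stand-ins for your own.

```python
from langsmith.evaluation import evaluate

def my_app(inputs: dict) -> dict:
    # Stand-in for your application: receives an example's inputs, returns outputs.
    return {"answer": f"Echo: {inputs['question']}"}

def correctness(run, example) -> dict:
    # Compares the app's output against the example's reference output.
    predicted = run.outputs["answer"]
    expected = example.outputs["answer"]
    return {"key": "correctness", "score": int(predicted == expected)}

# Runs my_app on every example in the dataset and records the results as an experiment.
results = evaluate(
    my_app,
    data="my-qa-dataset",          # placeholder name of an existing LangSmith dataset
    evaluators=[correctness],
    experiment_prefix="baseline",
)
```
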
### Benchmarking
_Benchmarking_ compares multiple application versions on a curated dataset to identify the best performer. This process involves creating a dataset of representative inputs, defining performance metrics, and testing each version.
Benchmarking requires dataset curation with gold-standard reference outputs and well-designed comparison metrics. Examples:

- **RAG Q&A bot**: Dataset of questions and reference answers, with an LLM-as-judge evaluator checking semantic equivalence between actual and reference answers.
- **ReACT agent**: Dataset of user requests and reference tool calls, with a heuristic evaluator verifying all expected tool calls were made.
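
As an illustration, the heuristic evaluator in the agent example could be a plain Python function like this sketch; the `tool_calls` output schema is an assumption, not a fixed LangSmith format.

```python
def expected_tool_calls(run, example) -> dict:
    """Score 1 if the agent made every reference tool call, else 0."""
    # Assumed schema: both the run and the reference store a list of tool names
    # under a "tool_calls" key.
    made = set(run.outputs.get("tool_calls", []))
    expected = set(example.outputs.get("tool_calls", []))
    missing = expected - made
    return {
        "key": "expected_tool_calls",
        "score": int(not missing),
        "comment": f"Missing: {sorted(missing)}" if missing else "All expected tools called",
    }
```
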
### Unit tests
_Unit tests_ verify the correctness of individual system components. In LLM contexts, [unit tests are often rule-based assertions](https://hamel.dev/blog/posts/evals/#level-1-unit-tests) on inputs or outputs (e.g., verifying LLM-generated code compiles, JSON loads successfully) that validate basic functionality.
Unit tests typically expect consistent passing results, making them suitable for CI pipelines. When running in CI, configure caching to minimize LLM API calls and associated costs.
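
For example, a rule-based unit test might simply assert that the model's output parses as JSON; `generate_json` below is a placeholder for your own LLM call.

```python
import json

def generate_json(prompt: str) -> str:
    # Placeholder for an LLM call that is expected to return a JSON string.
    return '{"name": "Ada", "age": 36}'

def test_output_is_valid_json():
    raw = generate_json("Return the user's name and age as JSON.")
    parsed = json.loads(raw)  # raises, and fails the test, if the output is not valid JSON
    assert {"name", "age"} <= parsed.keys()
```
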
### Regression tests
_Regression tests_ measure performance consistency across application versions over time. They ensure new versions do not degrade performance on cases the current version handles correctly, and ideally demonstrate improvements over the baseline. These tests typically run when making updates expected to affect user experience (e.g., model or architecture changes).
LangSmith's comparison view highlights regressions (red) and improvements (green) relative to the baseline, enabling quick identification of changes.

### Backtesting

_Backtesting_ evaluates new application versions against historical production data. Production logs are converted into a dataset, then newer versions process these examples to assess performance on past, realistic user inputs.
This approach is commonly used for evaluating new model releases. For example, when a new model becomes available, test it on the most recent production runs and compare results to actual production outcomes.
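
One way to assemble such a dataset with the Python SDK is sketched below; the project and dataset names are placeholders, and you may want additional filters on which runs to copy.

```python
from datetime import datetime, timedelta
from langsmith import Client

client = Client()

# Pull the last week of top-level production runs from a tracing project.
runs = client.list_runs(
    project_name="my-production-project",   # placeholder project name
    start_time=datetime.now() - timedelta(days=7),
    is_root=True,
)

dataset = client.create_dataset(dataset_name="prod-backtest")
for run in runs:
    # Keep the production input; store the old output so it can be compared later.
    client.create_example(
        inputs=run.inputs,
        outputs=run.outputs,
        dataset_id=dataset.id,
    )
```
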
### Pairwise evaluation
_Pairwise evaluation_ compares outputs from two versions by determining relative quality rather than assigning absolute scores. For some tasks, [determining "version A is better than B"](https://www.oreilly.com/radar/what-we-learned-from-a-year-of-building-with-llms-part-i/) is easier than scoring each version independently.
This approach proves particularly useful for LLM-as-judge evaluations on subjective tasks. For example, in summarization, determining "Which summary is clearer and more concise?" is often simpler than assigning numeric clarity scores.
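
As a rough sketch of the idea, independent of any particular SDK signature, a pairwise judge sees one input and two candidate outputs and returns a preference rather than two absolute scores; `call_llm` is a placeholder for a real model call.

```python
def call_llm(prompt: str) -> str:
    # Placeholder for a call to a judge model.
    return "A"

def judge_pair(source_text: str, summary_a: str, summary_b: str) -> dict:
    """Ask a judge model which summary is clearer and more concise."""
    prompt = (
        "Which summary of the source text is clearer and more concise?\n\n"
        f"Source: {source_text}\n\nSummary A: {summary_a}\n\nSummary B: {summary_b}\n\n"
        "Answer with exactly 'A' or 'B'."
    )
    verdict = call_llm(prompt).strip().upper()
    # A relative preference, not an absolute score for each version.
    return {"preferred": "A" if verdict.startswith("A") else "B"}
```
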
Learn [how to run pairwise evaluations](/langsmith/evaluate-pairwise).
## Online evaluation types
Online evaluation assesses production application outputs in near real-time. Without reference outputs, these evaluations focus on detecting issues, monitoring quality trends, and identifying edge cases that inform future offline testing.
Online evaluators typically run server-side. LangSmith provides built-in [LLM-as-judge evaluators](/langsmith/llm-as-judge) that you can configure, and also supports custom code evaluators that run within LangSmith.
### Real-time monitoring
Monitor application quality continuously as users interact with the system. Online evaluations run automatically on production traffic, providing immediate feedback on each interaction. This enables detection of quality degradation, unusual patterns, or unexpected behaviors before they impact significant user populations.
### Anomaly detection
Identify outliers and edge cases that deviate from expected patterns. Online evaluators can flag runs with unusual characteristics—extremely long or short responses, unexpected error rates, or outputs that fail safety checks—for human review and potential addition to offline datasets.
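
A simple reference-free check along these lines might flag responses whose length falls outside an expected range. The sketch below is generic Python; the exact signature and wiring depend on how the custom evaluator is configured in LangSmith, and the thresholds are illustrative.

```python
def flag_unusual_length(run) -> dict:
    """Flag runs whose response length is far outside the expected range."""
    response = (run.outputs or {}).get("answer", "")
    too_short = len(response) < 20
    too_long = len(response) > 4000
    return {
        "key": "unusual_length",
        "score": int(too_short or too_long),  # 1 = anomalous; route to human review
        "comment": "too short" if too_short else ("too long" if too_long else "ok"),
    }
```
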
### Production feedback loop
Use insights from production to improve offline evaluation. Online evaluations surface real-world issues and usage patterns that may not appear in curated datasets. Failed production runs become candidates for dataset examples, creating an iterative cycle where production experience continuously refines testing coverage.
The following sections help you create datasets, run evaluations, and analyze results.

LangSmith supports two types of evaluations based on when and where they run:

<CardGroup cols={2}>
<Card
title="Offline Evaluation"
icon="flask"
>
**Test before you ship**

Run evaluations on curated datasets during development to compare versions, benchmark performance, and catch regressions.
</Card>

<Card
title="Online Evaluation"
icon="radar"
>
**Monitor in production**

Evaluate real user interactions in real-time to detect issues and measure quality on live traffic.
</Card>
</CardGroup>

## Evaluation workflow

<Tabs>
<Tab title="Offline evaluation flow">

<Steps>
<Step title="Create a dataset">
Create a [dataset](/langsmith/manage-datasets) with <Tooltip tip="Individual test cases with inputs and reference outputs">[examples](/langsmith/evaluation-concepts#examples)</Tooltip> from manually curated test cases, historical production traces, or synthetic data generation.
</Step>

<Step title="Define evaluators">
Create <Tooltip tip="Functions that score how well your application performs">[evaluators](/langsmith/evaluation-concepts#evaluators)</Tooltip> to score performance, such as heuristic checks, LLM-as-judge evaluators, or human review.
</Step>

<Step title="Run an experiment">
Execute your application on the dataset to create an <Tooltip tip="Results of evaluating a specific application version on a dataset">[experiment](/langsmith/evaluation-concepts#experiment)</Tooltip>. Configure [repetitions, concurrency, and caching](/langsmith/experiment-configuration) to optimize runs.
</Step>

<Step title="Analyze results">
Compare experiments for [benchmarking](/langsmith/evaluation-types#benchmarking), [unit tests](/langsmith/evaluation-types#unit-tests), [regression tests](/langsmith/evaluation-types#regression-tests), or [backtesting](/langsmith/evaluation-types#backtesting).
</Step>
</Steps>

</Tab>

<Tab title="Online evaluation flow">

<Steps>
<Step title="Deploy your application">
Each interaction creates a <Tooltip tip="A single execution trace including inputs, outputs, and intermediate steps">[run](/langsmith/evaluation-concepts#runs)</Tooltip> without reference outputs.
</Step>

<Step title="Configure online evaluators">
Set up [evaluators](/langsmith/online-evaluations) to run automatically on production traces: safety checks, format validation, quality heuristics, and reference-free LLM-as-judge. Apply [filters and sampling rates](/langsmith/online-evaluations#4-optional-configure-a-sampling-rate) to control costs.
</Step>

<Step title="Monitor in real-time">
Evaluators run automatically on [runs](/langsmith/evaluation-concepts#runs) or <Tooltip tip="Collections of related runs forming multi-turn conversations">[threads](/langsmith/online-evaluations#configure-multi-turn-online-evaluators)</Tooltip>, providing real-time monitoring, anomaly detection, and alerting.
</Step>

<Step title="Establish a feedback loop">
Add failing production traces to your [dataset](/langsmith/manage-datasets), create targeted evaluators, validate fixes with offline experiments, and redeploy.
</Step>
</Steps>

</Tab>
</Tabs>

<Tip>
For more on the differences between offline and online evaluation, refer to the [Evaluation concepts](/langsmith/evaluation-concepts#quick-reference-offline-vs-online-evaluation) page.
</Tip>

## Get started

<Columns cols={3}>
<Card
title="Evaluation quickstart"
icon="rocket"
href="/langsmith/evaluation-quickstart"
arrow="true"
>
Get started with offline evaluation.
</Card>

<Card
title="Run offline evaluations"
icon="microscope"
href="/langsmith/evaluate-llm-application"
arrow="true"
>
Explore evaluation types, techniques, and frameworks for comprehensive testing.
</Card>

<Card
title="Run online evaluations"
icon="radar"
href="/langsmith/online-evaluations"
arrow="true"
>
Monitor production quality in real-time from the Observability tab.
</Card>
</Columns>

## Experiment configuration

LangSmith supports several configuration options for experiments:

- [Repetitions](#repetitions)
- [Concurrency](#concurrency)
- [Caching](#caching)

### Repetitions
_Repetitions_ run an experiment multiple times to account for LLM output variability. Since LLM outputs are non-deterministic, multiple repetitions provide a more accurate performance estimate.
Configure repetitions by passing the `num_repetitions` argument to `evaluate` / `aevaluate` ([Python](https://docs.smith.langchain.com/reference/python/evaluation/langsmith.evaluation._runner.evaluate), [TypeScript](https://docs.smith.langchain.com/reference/js/interfaces/evaluation.EvaluateOptions#numrepetitions)). Each repetition re-runs both the target function and all evaluators.
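
For example, with a hypothetical target function and placeholder dataset name:

```python
from langsmith.evaluation import evaluate

def my_app(inputs: dict) -> dict:
    # Stand-in target function.
    return {"answer": inputs["question"].upper()}

# Runs the whole experiment three times; each repetition re-runs the target and evaluators.
evaluate(
    my_app,
    data="my-qa-dataset",  # placeholder dataset name
    num_repetitions=3,
)
```
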
Learn more in the [repetitions how-to guide](/langsmith/repetition).
### Concurrency
_Concurrency_ controls how many examples run simultaneously during an experiment. Configure it by passing the `max_concurrency` argument to `evaluate` / `aevaluate`. The semantics differ between the two functions:
#### `evaluate`
The `max_concurrency` argument specifies the maximum number of concurrent threads for running both the target function and evaluators.
#### `aevaluate`
The `max_concurrency` argument uses a semaphore to limit concurrent tasks. `aevaluate` creates a task for each example, where each task runs the target function and all evaluators for that example. The `max_concurrency` argument specifies the maximum number of concurrent examples to process.
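
A minimal async sketch, again with placeholder names:

```python
import asyncio
from langsmith.evaluation import aevaluate

async def my_async_app(inputs: dict) -> dict:
    # Stand-in async target function.
    return {"answer": inputs["question"][::-1]}

def exact_match(run, example) -> dict:
    return {"key": "exact_match", "score": int(run.outputs == example.outputs)}

async def main():
    # At most 5 examples are processed at once; each task runs the target
    # plus all evaluators for a single example.
    await aevaluate(
        my_async_app,
        data="my-qa-dataset",  # placeholder dataset name
        evaluators=[exact_match],
        max_concurrency=5,
    )

asyncio.run(main())
```
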
### Caching
_Caching_ stores API call results to disk to speed up future experiments. Set the `LANGSMITH_TEST_CACHE` environment variable to a valid folder path with write access. Future experiments that make identical API calls will reuse cached results instead of making new requests.
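
For instance, set the variable before running experiments (the cache path here is arbitrary and assumed to exist):

```python
import os

# Identical API calls in later experiments are served from this on-disk cache.
os.environ["LANGSMITH_TEST_CACHE"] = "tests/cassettes"
```
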

## How to run evaluations with pytest (beta)

The LangSmith pytest plugin lets Python developers define their datasets and evaluations as pytest test cases.
Compared to the standard evaluation flow, this is useful when:

* **Each example requires different evaluation logic**: Standard evaluation flows assume consistent application and evaluator execution across all dataset examples. For more complex systems or comprehensive evaluations, specific system subsets may require evaluation with particular input types and metrics. These heterogeneous evaluations are simpler to write as distinct test case suites that are tracked together.
* **You want to assert binary expectations**: Track assertions in LangSmith and raise assertion errors locally (e.g., in CI pipelines). Testing tools help when you are both evaluating system outputs and asserting basic properties about them.
* **You want pytest-like terminal outputs**: Get familiar pytest output formatting in your terminal.
* **You already use pytest to test your app**: Add LangSmith tracking to existing pytest workflows.
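
A minimal sketch of what such a test can look like; `generate_sql` stands in for your application, and the `langsmith.testing` helpers shown here may change while the integration is in beta.

```python
import pytest
from langsmith import testing as t

def generate_sql(question: str) -> str:
    # Placeholder for your LLM-backed application code.
    return "SELECT COUNT(*) FROM users WHERE signup_date >= date('now', '-7 days');"

@pytest.mark.langsmith
def test_sql_generation_is_select_only():
    question = "How many users signed up in the last week?"
    t.log_inputs({"question": question})

    sql = generate_sql(question)
    t.log_outputs({"sql": sql})

    # Binary expectation: tracked in LangSmith and raised locally on failure.
    assert sql.strip().upper().startswith("SELECT")
```
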
<Warning>
The pytest integration is in beta and is subject to change in upcoming releases.
</Warning>

## Run evaluations with Vitest/Jest

LangSmith provides integrations with Vitest and Jest that allow JavaScript and TypeScript developers to define their datasets and evaluations as test cases.

Compared to the `evaluate()` evaluation flow, this is useful when:
* **Each example requires different evaluation logic**: Standard evaluation flows assume consistent application and evaluator execution across all dataset examples. For more complex systems or comprehensive evaluations, specific system subsets may require evaluation with particular input types and metrics. These heterogeneous evaluations are simpler to write as distinct test case suites that are tracked together.
* **You want to assert binary expectations**: Track assertions in LangSmith and raise assertion errors locally (e.g., in CI pipelines). Testing tools help when you are both evaluating system outputs and asserting basic properties about them.
* **You want to take advantage of mocks, watch mode, local results, or other features of the Vitest/Jest ecosystems.**