Production-ready tracing and evaluations for a weather chat app built with Next.js and the Vercel AI SDK, instrumented with Braintrust for online/offline scoring.
- Next.js app with Vercel AI SDK tools and streaming responses
- Braintrust tracing: root span for each request, tool sub-spans, automatic model I/O tracing
- Online (“in-app”) evaluators scored at the end of each user request
- Offline evaluations via Braintrust `Eval` with shared scorers
Prerequisites:

- Node 18+
- Braintrust account and API key
- OpenAI API key (or use the Braintrust AI providers proxy)
Create `.env.local` in the project root:

```
BRAINTRUST_API_KEY=<your-braintrust-api-key>
BRAINTRUST_PROJECT_NAME=<your-braintrust-project-name>
OPENAI_API_KEY=<your-openai-api-key>
OTEL_EXPORTER_OTLP_ENDPOINT=https://api.braintrust.dev/otel
OTEL_EXPORTER_OTLP_HEADERS="Authorization=Bearer <your-braintrust-api-key>, x-bt-parent=project_name:<your-braintrust-project-name>"
```
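If the project exports spans over OpenTelemetry (which is what the two `OTEL_*` variables above are for), the usual Next.js wiring is an `instrumentation.ts` hook. The file below is a sketch of that pattern; the file name, the use of `@vercel/otel`, and the service name are assumptions, not something this README states:

```ts
// instrumentation.ts (project root): one common way the OTEL_* variables get consumed.
// A standard OTLP trace exporter reads OTEL_EXPORTER_OTLP_ENDPOINT / _HEADERS from the env.
import { registerOTel } from "@vercel/otel";

export function register() {
  registerOTel({ serviceName: "weather-chat" }); // service name is illustrative
}
```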
Install dependencies and start the dev server:

```bash
npm install
npm run dev
# open http://localhost:3000
```
Key files:

- `app/(preview)/api/chat/route.ts`
  - Wraps the Vercel AI SDK OpenAI model with `wrapAISDKModel`
  - Wraps the `POST` handler in a `traced` span named `POST /api/chat`
  - Logs input/output and simple online scores (`fahrenheit_presence`, `contains_number`)
  - Adds asynchronous LLM-judge and content scores via `logger.updateSpan`
  - Supports `?mode=text` to return plain text (useful for experiments)
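  A minimal sketch of how these pieces fit together (the chat model name, the `tools` export name, and the import paths are assumptions; the real handler logs more detail and also implements the `?mode=text` branch described later):

  ```ts
  // app/(preview)/api/chat/route.ts: simplified sketch, not the repo's full handler
  import { openai } from "@ai-sdk/openai";
  import { streamText } from "ai";
  import { traced, wrapAISDKModel, currentSpan } from "@/lib/braintrust";
  import { tools } from "@/components/tools"; // assumed export name

  // Wrapping the model once is what makes model I/O show up in each trace automatically
  const model = wrapAISDKModel(openai("gpt-4o")); // chat model name is illustrative

  export async function POST(req: Request) {
    // Every request gets a root span named "POST /api/chat"
    return traced(
      async () => {
        const { messages } = await req.json();
        currentSpan().log({ input: messages });

        const result = streamText({ model, messages, tools });
        return result.toDataStreamResponse(); // framed stream for the chat UI
      },
      { name: "POST /api/chat" }
    );
  }
  ```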
- `components/tools.ts`
  - Weather tools are wrapped with `wrapTraced` so tool calls appear as child spans
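  The wrapping looks roughly like this (the tool name, schema, and hard-coded weather lookup are illustrative, not the repo's actual implementation):

  ```ts
  // components/tools.ts: sketch of one traced weather tool
  import { tool } from "ai";
  import { z } from "zod";
  import { wrapTraced } from "@/lib/braintrust";

  // wrapTraced gives the function its own span, nested under the request's root span
  const fetchWeather = wrapTraced(async function fetchWeather({ city }: { city: string }) {
    // ...call a real weather API here; hard-coded for the sketch...
    return { city, temperatureF: 72, condition: "sunny" };
  });

  export const weatherTool = tool({
    description: "Get the current weather for a city",
    parameters: z.object({ city: z.string() }),
    execute: ({ city }) => fetchWeather({ city }),
  });
  ```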
- `lib/braintrust.ts`
  - Initializes the Braintrust logger and re-exports helpers: `traced`, `wrapTraced`, `wrapAISDKModel`, `currentSpan`
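  Initialization presumably looks close to this (a sketch; the exact options passed to `initLogger` in the repo may differ):

  ```ts
  // lib/braintrust.ts: sketch of logger setup and re-exports
  import {
    initLogger,
    traced,
    wrapTraced,
    wrapAISDKModel,
    currentSpan,
  } from "braintrust";

  // One project-level logger for the whole app, configured from .env.local
  export const logger = initLogger({
    projectName: process.env.BRAINTRUST_PROJECT_NAME,
    apiKey: process.env.BRAINTRUST_API_KEY,
  });

  export { traced, wrapTraced, wrapAISDKModel, currentSpan };
  ```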
- `lib/scorers.ts`
  - Shared scorer implementations used by both online tracing and offline evals:
    - `contentAccuracyScore`: synonym- and partial-match tolerant; adds lenient score floors
    - `weatherLLMJudgeScore`: lenient weather-domain LLM judge (uses `openai("gpt-4o-mini")`)
    - `generalLLMJudgeScore`: general lenient LLM judge (uses `openai("gpt-4o-mini")`)
  - All include calibration metadata and bounded scores in [0, 1]
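  The scorers share a simple shape; the sketch below shows it for `contentAccuracyScore` only, with the matching logic heavily simplified (the real scorer does synonym and partial matching, and the field names here are illustrative):

  ```ts
  // lib/scorers.ts: sketch of the shared scorer shape (heavily simplified)
  export async function contentAccuracyScore({
    output,
    expected,
  }: {
    output: string;
    expected: string;
  }) {
    const matched = output.toLowerCase().includes(expected.toLowerCase());

    // Lenient soft floor: partial credit instead of a hard zero, clamped to [0, 1]
    const score = Math.min(1, Math.max(matched ? 1 : 0.3, 0));

    return {
      name: "content_accuracy",
      score,
      metadata: { calibration: "lenient", matched },
    };
  }
  ```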
- `scripts/eval.agent.ts`
  - Offline evaluation using `Eval` with a set of test cases
  - Calls the local API at `http://localhost:3000/api/chat?mode=text` for clean, plain-text outputs
  - Uses the shared scorers from `lib/scorers.ts`
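  A sketch of that script (the test data, request body shape, and import paths are assumptions; the real script uses a larger curated set):

  ```ts
  // scripts/eval.agent.ts: simplified sketch of the offline eval
  import { config } from "dotenv";
  import { Eval } from "braintrust";
  import {
    contentAccuracyScore,
    weatherLLMJudgeScore,
    generalLLMJudgeScore,
  } from "../lib/scorers";

  config({ path: ".env.local" });

  Eval(process.env.BRAINTRUST_PROJECT_NAME ?? "weather-chat", {
    data: () => [
      { input: "What's the weather in Tokyo in Fahrenheit?", expected: "Fahrenheit" },
      // ...more curated cases...
    ],
    task: async (input) => {
      // ?mode=text returns plain concatenated text instead of the framed stream
      const res = await fetch("http://localhost:3000/api/chat?mode=text", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ messages: [{ role: "user", content: input }] }),
      });
      return res.text();
    },
    scores: [contentAccuracyScore, weatherLLMJudgeScore, generalLLMJudgeScore],
  });
  ```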
In `route.ts`, we log simple online metrics and also asynchronously compute LLM-judge and content scores after the model finishes (see the sketch after this list):

- Simple scores:
  - `fahrenheit_presence`: 1 if the response mentions Fahrenheit (or `F`), else 0
  - `contains_number`: 1 if the response contains any digit, else 0
- LLM-judge scores (async, non-blocking):
  - `weather_llm_judge`: lenient, weather-focused judge
  - `general_llm_judge`: lenient, general-purpose judge
  - `content_accuracy`: tolerant phrase-based accuracy with calibration

These scores are attached to the same root span with `logger.updateSpan`.
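A sketch of that flow, assuming the AI SDK's `onFinish` callback; the argument shape passed to `logger.updateSpan`, the scorer call, and the hypothetical `startChatStream` helper (which exists only to keep the excerpt self-contained) are assumptions rather than the repo's exact code:

```ts
// Sketch of online scoring in route.ts (hypothetical helper, simplified regexes)
import { streamText } from "ai";
import { currentSpan, logger } from "@/lib/braintrust";
import { weatherLLMJudgeScore } from "@/lib/scorers";

export function startChatStream(model: any, messages: any[], tools: any) {
  const rootSpanId = currentSpan().id; // capture the root span before streaming starts

  return streamText({
    model,
    messages,
    tools,
    onFinish: async ({ text }) => {
      // Simple scores: cheap string checks, attached as soon as the model finishes
      logger.updateSpan({
        id: rootSpanId,
        output: text,
        scores: {
          fahrenheit_presence: /fahrenheit|\bF\b/i.test(text) ? 1 : 0,
          contains_number: /\d/.test(text) ? 1 : 0,
        },
      });

      // LLM-judge score: best-effort; a scoring failure never blocks the user reply
      try {
        const judge = await weatherLLMJudgeScore({ output: text, expected: "" });
        logger.updateSpan({ id: rootSpanId, scores: { weather_llm_judge: judge.score } });
      } catch {
        // swallow: online scoring is non-blocking by design
      }
    },
  });
}
```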
Run a full evaluation across curated test cases with shared scorers:

```bash
npm run eval:agent
```

This will create a new Braintrust experiment (visible in your project) with:

- Scores: `content_accuracy`, `general_llm_judge`, `weather_llm_judge`
- Per-datapoint metadata: reasons, calibration details, and feedback
By default, the Vercel AI SDK returns a framed data stream. To store clean text in experiments, the API supports:

```
POST /api/chat?mode=text
```

This returns a concatenated text stream as the HTTP response body, which the evaluation script uses.
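The switch presumably looks something like this (a hypothetical helper; in the repo the logic lives inline in `route.ts`, and the two response methods belong to the Vercel AI SDK, whose names vary slightly across versions):

```ts
// Sketch of the response-mode switch behind ?mode=text
type StreamResult = {
  toTextStreamResponse(): Response;
  toDataStreamResponse(): Response;
};

function respond(req: Request, result: StreamResult): Response {
  const mode = new URL(req.url).searchParams.get("mode");

  if (mode === "text") {
    // Plain concatenated text body: what scripts/eval.agent.ts stores in experiments
    return result.toTextStreamResponse();
  }

  // Default: the framed data stream consumed by the chat UI
  return result.toDataStreamResponse();
}
```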
Edit `lib/scorers.ts`:

- Switch the judge model by changing `openai("gpt-4o-mini")` to another (e.g., `openai("gpt-4o")`).
- Adjust leniency by tweaking the soft-floor thresholds in each scorer's calibration.
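For example (the `judgeModel` variable is illustrative; in the repo the model may be passed inline to the judge call):

```ts
// lib/scorers.ts: choosing the judge model (illustrative variable name)
import { openai } from "@ai-sdk/openai";

// const judgeModel = openai("gpt-4o-mini"); // default: fast, cheap judge
const judgeModel = openai("gpt-4o"); // stronger judge, higher latency and cost
```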
Troubleshooting:

- No logs in Braintrust:
  - Ensure `BRAINTRUST_API_KEY` and `BRAINTRUST_PROJECT_NAME` are set in `.env.local`
  - Confirm the app is running and requests are hitting `/api/chat`
- Evals fail with missing keys:
  - `scripts/eval.agent.ts` loads `.env.local` via `dotenv`; confirm the file exists and contains the keys
- Frame-like experiment outputs:
  - Ensure the eval is calling `http://localhost:3000/api/chat?mode=text`
Notes:

- Logging is best-effort and non-blocking: if online LLM-judge scoring fails, the user response is still returned
- Tool calls are traced with preserved hierarchy under the request’s root span
Commands:

```bash
npm run dev         # Start Next.js
npm run eval:agent  # Run offline evaluation
```