
Conversation


@clutchski clutchski commented Jan 2, 2026

This PR adds timeouts and retries to prevent network errors from hanging evals.

Report from the customer:

We have eval jobs (500+ rows of PDF docs) running on a VM that consistently freeze ~15 minutes after reaching 100% completion, stuck in experiment.summarize(). The hang occurs in fetch_base_experiment() on a POST to /api/base_experiment/get_id (https://github.com/braintrustdata/braintrust-sdk/blob/main/py/src/braintrust/logger.py#L3606C9-L3606C78). This only reproduces with large datasets on the VM; it does not occur with smaller datasets or when run locally.
Root Cause
We believe the issue is that fetch_base_experiment() uses app_conn() (Vercel IP), which is called at experiment start (registration) and then not used again until summarize(). During the long eval run, all other logging traffic goes through api_conn() (AWS IP), leaving the app_conn() connection idle for 15+ minutes.
Azure NAT gateways have a ~4-minute idle timeout and silently drop idle connections. By the time summarize() reuses the stale connection, the NAT has already removed the TCP session, leading to the hang and eventual connection failure. The customer confirmed via a network capture that TCP retransmissions fail at this point, which is consistent with a stale NAT mapping.
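The shape of the fix (timeouts plus retries) can be sketched as follows. This is an illustrative, stdlib-only sketch, not the SDK's actual implementation; post_with_retry and its parameters are hypothetical names. The key idea: a read timeout well below the NAT's ~4-minute idle cutoff turns a silent hang into an error, and each retry opens a fresh TCP connection with a fresh NAT mapping.

```python
import time
import urllib.request
from urllib.error import URLError

def post_with_retry(url, data, *, timeout=60.0, retries=3, backoff=0.5,
                    opener=urllib.request.urlopen, sleep=time.sleep):
    """POST with a socket timeout and exponential-backoff retries.

    Hypothetical helper: a connection silently dropped by a NAT gateway
    would otherwise hang indefinitely; the timeout surfaces the drop as
    an error, and each retry attempt opens a fresh connection.
    """
    for attempt in range(retries + 1):
        try:
            req = urllib.request.Request(url, data=data, method="POST")
            with opener(req, timeout=timeout) as resp:
                return resp.read()
        except (URLError, TimeoutError, ConnectionError):
            if attempt == retries:
                raise
            sleep(backoff * (2 ** attempt))  # 0.5s, 1s, 2s, ...
```

The opener and sleep parameters are injectable only to make the sketch testable; a real implementation would likely configure this at the connection/session layer instead of per call.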

@clutchski clutchski changed the title Matt/long eval fix hanging evals Jan 2, 2026
this tightly couples our retry logic to braintrust state, which is weird.
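One way to avoid that coupling (a hypothetical sketch, not code from this PR) is to keep the retry policy as a standalone decorator that knows nothing about Braintrust connections or experiment state, and apply it at the call sites:

```python
import time
from functools import wraps

def with_retries(retries=3, backoff=0.5, exceptions=(OSError, TimeoutError),
                 sleep=time.sleep):
    """Generic retry decorator with exponential backoff.

    Hypothetical sketch: the policy is pure retry mechanics, so it can
    wrap any flaky call without referencing SDK-internal state.
    """
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(retries + 1):
                try:
                    return fn(*args, **kwargs)
                except exceptions:
                    if attempt == retries:
                        raise
                    sleep(backoff * (2 ** attempt))
        return wrapper
    return decorator
```

The sleep parameter is injectable so tests can skip real delays; the decorated function stays oblivious to the retry policy.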
@clutchski clutchski marked this pull request as ready for review January 2, 2026 19:52
