
Conversation


@clutchski clutchski commented Jan 2, 2026

This PR adds timeouts and retries to prevent network errors from hanging evals.

Report from the customer:

We have eval jobs (500+ rows of PDF docs) running on a VM that consistently freeze ~15 minutes after reaching 100% completion, stuck in experiment.summarize(). The hang occurs in fetch_base_experiment() on a POST to /api/base_experiment/get_id (https://github.com/braintrustdata/braintrust-sdk/blob/main/py/src/braintrust/logger.py#L3606C9-L3606C78). This only reproduces with large datasets on the VM; it does not occur with smaller datasets or when run locally.
Root Cause
We believe the issue is that fetch_base_experiment() uses app_conn() (Vercel IP), which is called at experiment start (registration) and then not used again until summarize(). During the long eval run, all other logging traffic goes through api_conn() (AWS IP), leaving the app_conn() connection idle for 15+ minutes.
Azure NAT gateways have a ~4-minute idle timeout and silently drop idle connections. By the time summarize() reuses the stale connection, the NAT has already removed the TCP session, leading to the hang and eventual connection failure. The customer confirmed via a network capture that TCP retransmissions fail at this point, which is consistent with a stale NAT mapping.
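The shape of the fix (timeouts plus retries) can be sketched as follows. This is an illustrative, stdlib-only sketch, not the SDK's actual implementation; post_with_retry and its parameters are hypothetical names. The key idea: a read timeout well below the NAT's ~4-minute idle cutoff turns a silent hang into an error, and each retry opens a fresh TCP connection with a fresh NAT mapping.

```python
import time
import urllib.request
from urllib.error import URLError

def post_with_retry(url, data, *, timeout=60.0, retries=3, backoff=0.5,
                    opener=urllib.request.urlopen, sleep=time.sleep):
    """POST with a socket timeout and exponential-backoff retries.

    Hypothetical helper: a connection silently dropped by a NAT gateway
    would otherwise hang indefinitely; the timeout surfaces the drop as
    an error, and each retry attempt opens a fresh connection.
    """
    for attempt in range(retries + 1):
        try:
            req = urllib.request.Request(url, data=data, method="POST")
            with opener(req, timeout=timeout) as resp:
                return resp.read()
        except (URLError, TimeoutError, ConnectionError):
            if attempt == retries:
                raise
            sleep(backoff * (2 ** attempt))  # 0.5s, 1s, 2s, ...
```

The opener and sleep parameters are injectable only to make the sketch testable; a real implementation would likely configure this at the connection/session layer instead of per call.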

@clutchski clutchski changed the title Matt/long eval fix hanging evals Jan 2, 2026
this tightly couples our retry logic to braintrust state, which is weird.
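One way to avoid that coupling (a hypothetical sketch, not code from this PR) is to keep the retry policy as a standalone decorator that knows nothing about Braintrust connections or experiment state, and apply it at the call sites:

```python
import time
from functools import wraps

def with_retries(retries=3, backoff=0.5, exceptions=(OSError, TimeoutError),
                 sleep=time.sleep):
    """Generic retry decorator with exponential backoff.

    Hypothetical sketch: the policy is pure retry mechanics, so it can
    wrap any flaky call without referencing SDK-internal state.
    """
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(retries + 1):
                try:
                    return fn(*args, **kwargs)
                except exceptions:
                    if attempt == retries:
                        raise
                    sleep(backoff * (2 ** attempt))
        return wrapper
    return decorator
```

The sleep parameter is injectable so tests can skip real delays; the decorated function stays oblivious to the retry policy.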
@clutchski clutchski marked this pull request as ready for review January 2, 2026 19:52
