feat: classify run failure error codes and improve error logging #1340
Draft
pranaygp wants to merge 16 commits into graphite-base/1340 from
* Fix connector line showing for run events
* Improve row toggle
* improve error reporting
* add sorting for events
* add changeset
* update search to be best match
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
🦋 Changeset detected
Latest commit: e7068bf
The changes in this PR will be included in the next version bump. This PR includes changesets to release 21 packages.
🧪 E2E Test Results
❌ Some tests failed

❌ Failed Tests
▲ Vercel Production (141 failed)
- astro (15 failed)
- example (16 failed)
- express (8 failed)
- fastify (15 failed)
- hono (13 failed)
- nextjs-turbopack (10 failed)
- nextjs-webpack (10 failed)
- nitro (14 failed)
- nuxt (13 failed)
- sveltekit (13 failed)
- vite (14 failed)

🐘 Local Postgres (1 failed)
- sveltekit-stable (1 failed)

🌍 Community Worlds (56 failed)
- mongodb (3 failed)
- redis (2 failed)
- turso (51 failed)

📋 Other (1 failed)
- e2e-local-postgres-nest-stable (1 failed)

Details by Category
- ❌ ▲ Vercel Production
- ✅ 💻 Local Development
- ✅ 📦 Local Production
- ❌ 🐘 Local Postgres
- ✅ 🪟 Windows
- ❌ 🌍 Community Worlds
- ❌ 📋 Other

❌ Some E2E test jobs failed. Check the workflow run for details.
The e2e tests spawn a CLI subprocess for every inspect/cancel/health call. Each subprocess performs an npm registry version check on startup, which can hang under load and exceed the 20s spawn timeout, causing SIGTERM. Add WORKFLOW_NO_UPDATE_CHECK=1 env var support to skip the check, and set it in the e2e test harness.
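A minimal sketch of how such an opt-out might be wired, assuming the version check is gated by a single helper. The names `shouldCheckForUpdates` and `harnessEnv` are hypothetical, and treating only the value `"1"` as the opt-out is an assumption; the commit message names only the `WORKFLOW_NO_UPDATE_CHECK=1` variable.

```typescript
// Hypothetical helpers for illustration; the real CLI may structure this differently.
type Env = Record<string, string | undefined>;

/** Returns false when the opt-out variable is set to "1" (assumed accepted value). */
function shouldCheckForUpdates(env: Env): boolean {
  return env.WORKFLOW_NO_UPDATE_CHECK !== "1";
}

/** Builds the environment an e2e harness could pass to every CLI subprocess. */
function harnessEnv(base: Env): Env {
  return { ...base, WORKFLOW_NO_UPDATE_CHECK: "1" };
}
```

In a harness, `harnessEnv(process.env)` would be passed as the `env` option when spawning the CLI, so no subprocess waits on the registry.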
The health command's `endpoint` flag and the shared `env` flag both declared `char: 'e'`, causing ambiguity. Remove the short flag from `endpoint` so `-e` unambiguously maps to `--env`.
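To illustrate the ambiguity, here is a small sketch that detects two flags claiming the same short character. `FlagDef` and `findShortFlagCollisions` are hypothetical and deliberately independent of the CLI framework's real types; only the `endpoint`/`env` flags and `char: 'e'` come from the commit message.

```typescript
// Hypothetical flag shape; not the CLI framework's actual type.
interface FlagDef {
  char?: string;
}

/** Maps each short character to the flag names claiming it, keeping only conflicts. */
function findShortFlagCollisions(flags: Record<string, FlagDef>): Record<string, string[]> {
  const byChar: Record<string, string[]> = {};
  for (const name of Object.keys(flags)) {
    const char = flags[name].char;
    if (char === undefined) continue;
    if (byChar[char] === undefined) byChar[char] = [];
    byChar[char].push(name);
  }
  const collisions: Record<string, string[]> = {};
  for (const char of Object.keys(byChar)) {
    if (byChar[char].length > 1) collisions[char] = byChar[char];
  }
  return collisions;
}

// Before the fix both flags declared `e`; after it only `env` does.
const flagsBefore: Record<string, FlagDef> = { endpoint: { char: "e" }, env: { char: "e" } };
const flagsAfter: Record<string, FlagDef> = { endpoint: {}, env: { char: "e" } };
```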
Set git user.name and user.email so that the 'Version Packages' commit created by changesets/action is attributed to the app's bot account instead of the default github-actions[bot].
Set setupGitUser: false on changesets/action to prevent it from overwriting our git config with the hardcoded github-actions[bot] identity. The git identity is now configured in a prior step using the app-slug output from actions/create-github-app-token.
spawn({timeout}) sends SIGTERM by default, but the CLI process ignores it
(likely due to undici keeping active connections alive). The process then
runs for minutes until vitest or GitHub Actions eventually kills it.
Using killSignal: 'SIGKILL' ensures the process is forcefully terminated
when the 20s timeout fires, since SIGKILL cannot be caught or ignored.
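A minimal sketch of the approach this commit message describes, using the `timeout` and `killSignal` options of Node's `child_process.spawn`. The wrapper name `runWithHardTimeout` is hypothetical; the harness's actual helper is not shown here.

```typescript
import { spawn } from "node:child_process";

/** Runs a command and resolves with the signal that ended it (null on a normal exit). */
function runWithHardTimeout(
  command: string,
  args: string[],
  timeoutMs: number,
): Promise<NodeJS.Signals | null> {
  return new Promise((resolve, reject) => {
    const child = spawn(command, args, {
      stdio: "ignore",
      timeout: timeoutMs, // Node signals the child once this many ms elapse
      killSignal: "SIGKILL", // the default is SIGTERM, which a process can handle or ignore
    });
    child.on("error", reject); // e.g. the command could not be started
    child.on("exit", (_code, signal) => resolve(signal));
  });
}
```

A call such as `runWithHardTimeout("node", ["cli.js", "inspect"], 20_000)` (arguments are placeholders) would bound each subprocess at 20s regardless of whether it handles SIGTERM.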
…processes" This reverts commit 6d4637e.
* fix: separate infrastructure vs user code error handling in runtime and step handler

  Transient network errors (ECONNRESET, etc.) during infrastructure calls (event listing, event creation) were caught by a shared try/catch that also handles user code errors, incorrectly marking runs as run_failed or steps as step_failed instead of letting the queue redeliver.

  - runtime.ts: Move infrastructure calls outside the user-code try/catch so errors propagate to the queue handler for automatic retry
  - step-handler.ts: Same structural separation; only stepFn.apply() is wrapped in the try/catch that produces step_failed/step_retrying
  - helpers.ts: Add isTransientNetworkError() and update withServerErrorRetry to retry network errors in addition to 5xx responses
  - helpers.test.ts: Add tests for network error detection and retry

* add changeset

* remove withServerErrorRetry and isTransientNetworkError

  Redundant with undici RetryAgent which already handles 5xx retries and network error retries at the HTTP dispatcher level.

* address review feedback: move getEncryptionKeyForRun out of user-code try/catch, re-add 5xx/410 safety net in step handler, relax e2e test assertion

* remove serverError5xxRetryWorkflow e2e test

  This test validated withServerErrorRetry's in-process retry behavior, which was removed. Queue-level retry with process-scoped fault injection is unreliable across serverless instances and too slow for e2e timeouts.

* remove serverError5xxRetryWorkflow and fault injection helpers from e2e workflows

* remove inline comment about deleted test
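The structural separation described in the first commit above can be sketched roughly as follows. This is an illustration under assumptions, not the code in runtime.ts: `handleRun`, `loadEvents`, `userWorkflow`, and the `RunOutcome` shape are all hypothetical stand-ins.

```typescript
type RunOutcome =
  | { type: "run_completed"; result: unknown }
  | { type: "run_failed"; message: string };

/**
 * Hypothetical handler shape. `loadEvents` stands in for infrastructure calls
 * (event listing, event creation); `userWorkflow` stands in for user code.
 */
async function handleRun(
  loadEvents: () => Promise<unknown[]>,
  userWorkflow: (events: unknown[]) => Promise<unknown>,
): Promise<RunOutcome> {
  // Infrastructure call sits OUTSIDE the try/catch: a transient failure
  // rejects out of handleRun, so the queue handler can redeliver the message.
  const events = await loadEvents();

  try {
    // Only user code is wrapped, so only user errors become run_failed.
    const result = await userWorkflow(events);
    return { type: "run_completed", result };
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    return { type: "run_failed", message };
  }
}
```

The design point is that the position of the `try` decides the failure mode: anything before it is retried by the queue, anything inside it is recorded as a terminal failure.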
- Add RUN_ERROR_CODES (USER_ERROR, RUNTIME_ERROR) to @workflow/errors
- Populate errorCode in run_failed events via classifyRunError()
- Update web UI StatusBadge to show amber dot for infrastructure errors
- Improve world-local queue error logging (concise, no body dump)
- Improve schema validation error messages (concise, verbose behind DEBUG)
- Add e2e tests for error code flow and infrastructure error retry

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
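As a rough sketch of how the first two items could fit together: the names `RUN_ERROR_CODES` and `classifyRunError` come from the commit message, but the classification rule (an error whose `name` is `WorkflowRuntimeError`), the `eventType` field, and `buildRunFailedEvent` are assumptions made purely for illustration.

```typescript
// Names mirror the commit message; the values and the rule below are assumptions.
const RUN_ERROR_CODES = {
  USER_ERROR: "USER_ERROR",
  RUNTIME_ERROR: "RUNTIME_ERROR",
} as const;
type RunErrorCode = (typeof RUN_ERROR_CODES)[keyof typeof RUN_ERROR_CODES];

/** Assumed rule: errors tagged as internal are RUNTIME_ERROR, everything else is USER_ERROR. */
function classifyRunError(err: unknown): RunErrorCode {
  const isInternal = err instanceof Error && err.name === "WorkflowRuntimeError";
  return isInternal ? RUN_ERROR_CODES.RUNTIME_ERROR : RUN_ERROR_CODES.USER_ERROR;
}

/** Hypothetical builder: errorCode travels as a sibling of error inside eventData. */
function buildRunFailedEvent(err: unknown) {
  const error =
    err instanceof Error
      ? { message: err.message, stack: err.stack }
      : { message: String(err), stack: undefined };
  return {
    eventType: "run_failed" as const, // field name is an assumption
    eventData: { error, errorCode: classifyRunError(err) },
  };
}
```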
6655edc to e7068bf
Summary
- Add error codes (`USER_ERROR`, `RUNTIME_ERROR`) to `run_failed` events, populating the existing but previously unused `errorCode` field

Details
Error codes
After #1339's structural separation, the `run_failed` try/catch only catches:
- `USER_ERROR` (throws from workflow functions, propagated step failures)
- `RUNTIME_ERROR` (internal bugs such as a corrupted event log or missing timestamps)

Infrastructure errors (ECONNRESET, 5xx, schema validation) never produce `run_failed` at all; they propagate to the queue for retry.

The error code flows through the existing (previously unused) plumbing:
`eventData.errorCode` → `StructuredError.code` → `WorkflowRunFailedError.cause.code`

Note on storage: `errorCode` is stored inline as a plain DynamoDB attribute on the run entity; it does NOT go through refs/encryption. Only the `error` object (message + stack) goes through `refTracker` → `errorRef`. The `errorCode` is a sibling field in `eventData`, extracted and stored separately by the server (events.ts:846).

Web UI
- `RUNTIME_ERROR`: amber dot + "Internal Error" tooltip header
- `USER_ERROR` / absent (backward compat): red dot + "Error Details" tooltip header

Logging improvements
See examples below.
Error log examples (captured from e2e test runs)
Runtime error logs (before → after)
Before:
After (now includes `errorCode`):

Queue error logs (before → after)
Before (dumped full request body with traceCarrier, runId, stepId, etc.):
After (concise, actionable):

Schema validation error messages (before → after)
Before (full Zod error dump + CBOR debug context always included):
After (concise issue list, verbose context only when `DEBUG` env var is set):

Debug curl reproduction (before → after)
Before: only shown when `DEBUG=1` (exact string match)
After: shown when `DEBUG` is set to any truthy value (consistent with `debug` package)

Test plan
- `classifyRunError` (7 tests, all pass)
- `errorWorkflowNested`: asserts `error.cause.code === 'USER_ERROR'` and `runData.error.code === 'USER_ERROR'`
- `errorRetryFatal`: asserts `error.cause.code === 'USER_ERROR'`
- `infraErrorRetryWorkflow`: validates infra errors on `run_completed` retry via queue (not `run_failed`)

🤖 Generated with Claude Code
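The Web UI rule described earlier in this description (amber for `RUNTIME_ERROR`, red for `USER_ERROR` or a missing code) can be sketched as a small pure function. `failedRunBadge` and `BadgeStyle` are hypothetical names; the actual StatusBadge component is not shown in this PR text.

```typescript
type BadgeStyle = { dot: "amber" | "red"; tooltipHeader: string };

/** Hypothetical helper mapping a failed run's error code to its badge presentation. */
function failedRunBadge(errorCode?: string): BadgeStyle {
  if (errorCode === "RUNTIME_ERROR") {
    return { dot: "amber", tooltipHeader: "Internal Error" };
  }
  // USER_ERROR, or no code at all on runs recorded before this change.
  return { dot: "red", tooltipHeader: "Error Details" };
}
```

Treating an absent code the same as `USER_ERROR` keeps older runs rendering exactly as they did before the field was populated.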