feat: classify run failure error codes and improve error logging #1340

Draft
pranaygp wants to merge 16 commits into graphite-base/1340 from pgp/run-failed-schema-vailidation-error

pranaygp (Collaborator) commented Mar 12, 2026

Summary

  • Adds error code classification (USER_ERROR, RUNTIME_ERROR) to run_failed events, populating the existing but previously unused errorCode field
  • Improves error logging across world-local queue, world-vercel schema validation, and runtime to be more concise and user-friendly
  • Stacked on top of #1339 (fix: separate infrastructure vs user code error handling), which structurally separates infrastructure error handling from user code error handling

Details

Error codes

After #1339's structural separation, the run_failed try/catch only catches:

  • User code errors → USER_ERROR (throws from workflow functions, propagated step failures)
  • WorkflowRuntimeError → RUNTIME_ERROR (corrupted event log, missing timestamps; these are internal bugs)

Infrastructure errors (ECONNRESET, 5xx, schema validation) never produce run_failed at all — they propagate to the queue for retry.
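For reviewers, a minimal sketch of the classification rule described above. The constant and class here are local stand-ins, not the real exports from @workflow/errors, and the actual classifyRunError may check more than this; the name-based check is an assumption chosen to keep the sketch independent of compile target.

```typescript
// Hypothetical stand-ins; the real definitions may differ in shape.
const RUN_ERROR_CODES = {
  USER_ERROR: "USER_ERROR",
  RUNTIME_ERROR: "RUNTIME_ERROR",
} as const;

type RunErrorCode = (typeof RUN_ERROR_CODES)[keyof typeof RUN_ERROR_CODES];

// Stand-in for the runtime's internal error class (corrupted event log,
// missing timestamps, and similar internal bugs).
class WorkflowRuntimeError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "WorkflowRuntimeError";
  }
}

// Anything that is not a WorkflowRuntimeError is treated as a user code
// failure, including non-Error throws such as plain strings.
function classifyRunError(err: unknown): RunErrorCode {
  const isRuntimeError =
    err instanceof Error && err.name === "WorkflowRuntimeError";
  return isRuntimeError
    ? RUN_ERROR_CODES.RUNTIME_ERROR
    : RUN_ERROR_CODES.USER_ERROR;
}
```

With this sketch, `classifyRunError(new Error("boom"))` yields `USER_ERROR` and `classifyRunError(new WorkflowRuntimeError("corrupted event log"))` yields `RUNTIME_ERROR`. Infrastructure errors never reach this function because they are not caught by the run_failed try/catch.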

The error code flows through the existing (previously unused) plumbing:
eventData.errorCode → StructuredError.code → WorkflowRunFailedError.cause.code
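A rough sketch of how a code could travel from the event's errorCode field to StructuredError.code and on to WorkflowRunFailedError.cause.code. The type shapes, the `toStructuredError` helper, and the error message format are all assumptions made for illustration; only the three field names come from this PR.

```typescript
// Assumed shape of run_failed event data; errorCode is a sibling of error.
interface RunFailedEventData {
  error: { message: string; stack?: string };
  errorCode?: string; // absent on events written before this change
}

// Assumed shape of the structured error exposed to callers.
interface StructuredError {
  message: string;
  stack?: string;
  code?: string;
}

// Stand-in for the error thrown to the caller when a run has failed.
class WorkflowRunFailedError extends Error {
  constructor(runId: string, public readonly cause: StructuredError) {
    super(`Workflow run ${runId} failed: ${cause.message}`);
    this.name = "WorkflowRunFailedError";
  }
}

// Hypothetical helper: copy the code across without inventing one when the
// event predates error codes.
function toStructuredError(data: RunFailedEventData): StructuredError {
  const structured: StructuredError = { message: data.error.message };
  if (data.error.stack !== undefined) structured.stack = data.error.stack;
  if (data.errorCode !== undefined) structured.code = data.errorCode;
  return structured;
}
```

A caller catching the stand-in error can then branch on `err.cause.code`, which is what the E2E assertions in the test plan check.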

Note on storage: errorCode is stored inline as a plain DynamoDB attribute on the run entity; it does NOT go through refs/encryption. Only the error object (message + stack) goes through refTracker → errorRef. The errorCode is a sibling field in eventData, extracted and stored separately by the server (events.ts:846).

Web UI

  • RUNTIME_ERROR: amber dot + "Internal Error" tooltip header
  • USER_ERROR / absent (backward compat): red dot + "Error Details" tooltip header
  • Error code shown as a label in the tooltip
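The badge rules above can be sketched as a small pure function. `failureBadge` and the `FailureBadge` shape are hypothetical names for illustration; the real StatusBadge component is a UI component and is not reproduced here.

```typescript
type DotColor = "amber" | "red";

interface FailureBadge {
  dot: DotColor;
  tooltipHeader: string;
  codeLabel?: string; // shown as a label in the tooltip when a code exists
}

// Maps a run's error code to badge presentation. A missing code falls back to
// the red "Error Details" treatment for backward compatibility.
function failureBadge(errorCode?: string): FailureBadge {
  if (errorCode === "RUNTIME_ERROR") {
    return { dot: "amber", tooltipHeader: "Internal Error", codeLabel: errorCode };
  }
  const badge: FailureBadge = { dot: "red", tooltipHeader: "Error Details" };
  if (errorCode !== undefined) badge.codeLabel = errorCode;
  return badge;
}
```

Treating unknown codes like USER_ERROR is an assumption of this sketch; the PR only specifies the behavior for RUNTIME_ERROR, USER_ERROR, and an absent code.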

Logging improvements

See examples below.

Error log examples (captured from e2e test runs)

Runtime error logs (before → after)

Before:

[Workflow] Error while running workflow {
  workflowRunId: 'wrun_01KKFXW09EHA9M3QXWNEJNC52Z',
  errorName: 'Error',
  errorStack: 'Error: Nested workflow error\n    at errorNested3 ...'
}

After (now includes errorCode):

[Workflow] Error while running workflow {
  workflowRunId: 'wrun_01KKFY7MARP7D16PN69HCMYNVQ',
  errorCode: 'USER_ERROR',
  errorName: 'Error',
  errorStack: 'Error: Nested workflow error\n    at errorNested3 ...'
}

Queue error logs (before → after)

Before (dumped full request body with traceCarrier, runId, stepId, etc.):

[local world] Failed to queue message {
  queueName: '__wkf_step_...',
  text: '"WorkflowAPIError: Injected 5xx"',
  status: 500,
  headers: { ... },
  body: '{"workflowName":"workflow//./workflows/99_e2e//serverError5xxRetryWorkflow",
    "workflowRunId":"wrun_01KKF...",
    "workflowStartedAt":1773282422605,
    "stepId":"step_01KKF...",
    "traceCarrier":{"traceparent":"00-778ab...","baggage":"workflow.run_id=wrun_01KKF..."},
    "requestedAt":"2026-03-12T02:27:02.778Z"}'
}

After (concise, actionable):

[world-local] Queue message failed (attempt 1/3, status 500): "WorkflowAPIError: Injected 5xx" {
  queueName: '__wkf_step_...',
  messageId: 'msg_01KKF...'
}
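To make the shape of the new queue log concrete, here is a sketch of a formatter that would produce the "after" line. `QueueFailure` and `formatQueueFailure` are hypothetical names; the actual world-local logging code may be structured differently.

```typescript
interface QueueFailure {
  attempt: number;
  maxAttempts: number;
  status: number;
  text: string; // response text, logged as received
  queueName: string;
  messageId: string;
}

interface QueueFailureLog {
  message: string;
  context: { queueName: string; messageId: string };
}

// Builds a one-line message plus a small context object. The request body,
// headers, and trace carrier are deliberately left out.
function formatQueueFailure(failure: QueueFailure): QueueFailureLog {
  const { attempt, maxAttempts, status, text, queueName, messageId } = failure;
  return {
    message: `[world-local] Queue message failed (attempt ${attempt}/${maxAttempts}, status ${status}): ${text}`,
    context: { queueName, messageId },
  };
}
```

The result could be passed to a logger as `console.error(log.message, log.context)`, which keeps identifiers needed for debugging while dropping the full request body.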

Schema validation error messages (before → after)

Before (full Zod error dump + CBOR debug context always included):

Schema validation failed for POST /v2/runs/wrun_.../events:

[
  {
    "expected": "object",
    "code": "invalid_type",
    "path": ["run", "error"],
    "message": "Invalid input: expected object, received undefined"
  }
]

Response context: Content-Type: application/cbor, 1589 bytes (CBOR), preview: {
  event: { runId: 'wrun_...', eventId: 'evnt_...', correlationId: 'wrun_...',
    eventType: 'run_failed', eventData: { error: { _ref: 's3rf:team_...', _type: 'RemoteRef' } },
    createdAt: 20... }
}

After (concise issue list, verbose context only when DEBUG env var is set):

Schema validation failed for POST /v2/runs/wrun_.../events:
  run.error: Invalid input: expected object, received undefined
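A sketch of how the concise issue list could be produced from validation issues. `ValidationIssue` mirrors only the `path` and `message` fields visible in the "before" dump, and `formatValidationError` is a hypothetical helper, not the function used in world-vercel. The `(root)` label for an empty path is an assumption.

```typescript
// Minimal issue shape: just the fields needed for the concise output.
interface ValidationIssue {
  path: Array<string | number>;
  message: string;
}

// One header line, then one indented "path: message" line per issue.
function formatValidationError(
  method: string,
  url: string,
  issues: ValidationIssue[]
): string {
  const lines = issues.map((issue) => {
    const path = issue.path.length > 0 ? issue.path.join(".") : "(root)";
    return `  ${path}: ${issue.message}`;
  });
  return [`Schema validation failed for ${method} ${url}:`, ...lines].join("\n");
}
```

Verbose context such as the response preview would be appended separately, and only when debugging is enabled.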

Debug curl reproduction (before → after)

Before: only shown when DEBUG=1 (exact string match)
After: shown when DEBUG is set to any truthy value (consistent with debug package)
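One possible reading of "any truthy value", sketched as a pure function. Exactly which strings count as truthy is an assumption here (non-empty, and not "0" or "false"); the PR does not spell this out, and `isDebugEnabled` is a hypothetical name.

```typescript
// Returns true when the DEBUG value should enable verbose output.
// The old behavior corresponds to `value === "1"`.
function isDebugEnabled(value: string | undefined): boolean {
  if (value === undefined) return false;
  const normalized = value.trim().toLowerCase();
  return normalized !== "" && normalized !== "0" && normalized !== "false";
}
```

Under this reading, a namespace-style value such as `workflow:*` enables the curl reproduction just as `1` does, which is the point of aligning with how the debug package is usually configured.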

Test plan

  • Unit tests for classifyRunError (7 tests, all pass)
  • All 478 core unit tests pass
  • E2E: errorWorkflowNested — asserts error.cause.code === 'USER_ERROR' and runData.error.code === 'USER_ERROR'
  • E2E: errorRetryFatal — asserts error.cause.code === 'USER_ERROR'
  • E2E: infraErrorRetryWorkflow — validates infra errors on run_completed retry via queue (not run_failed)
  • Visual: verify amber vs red badge in web UI

🤖 Generated with Claude Code

karthikscale3 and others added 2 commits March 11, 2026 16:51
* Fix connector line showing for run events

* Improve row toggle

* improve error reporting

* add sorting for events

* add changeset

* update search to be best match
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

changeset-bot bot commented Mar 12, 2026

🦋 Changeset detected

Latest commit: e7068bf

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 21 packages
Name Type
@workflow/errors Patch
@workflow/core Patch
@workflow/web Patch
@workflow/world-local Patch
@workflow/world-vercel Patch
@workflow/cli Patch
@workflow/nest Patch
@workflow/vitest Patch
@workflow/builders Patch
workflow Patch
@workflow/world-postgres Patch
@workflow/next Patch
@workflow/nitro Patch
@workflow/web-shared Patch
@workflow/world-testing Patch
@workflow/astro Patch
@workflow/rollup Patch
@workflow/sveltekit Patch
@workflow/vite Patch
@workflow/ai Patch
@workflow/nuxt Patch

github-actions bot (Contributor) commented Mar 12, 2026

🧪 E2E Test Results

Some tests failed

Summary

Category Passed Failed Skipped Total
❌ ▲ Vercel Production 430 141 67 638
✅ 💻 Local Development 612 0 84 696
✅ 📦 Local Production 612 0 84 696
❌ 🐘 Local Postgres 611 1 84 696
✅ 🪟 Windows 55 0 3 58
❌ 🌍 Community Worlds 118 56 15 189
❌ 📋 Other 146 1 27 174
Total 2584 199 364 3147

❌ Failed Tests

▲ Vercel Production (141 failed)

astro (15 failed):

  • promiseRaceStressTestWorkflow
  • error handling error propagation workflow errors nested function calls preserve message and stack trace
  • error handling error propagation step errors basic step error preserves message and stack trace
  • error handling retry behavior FatalError fails immediately without retries
  • error handling retry behavior infrastructure error on run_completed retries via queue (not run_failed)
  • hookCleanupTestWorkflow - hook token reuse after workflow completion
  • concurrent hook token conflict - two workflows cannot use the same hook token simultaneously
  • hookDisposeTestWorkflow - hook token reuse after explicit disposal while workflow still running
  • stepFunctionPassingWorkflow - step function references can be passed as arguments (without closure vars)
  • Calculator.calculate - static workflow method using static step methods from another class
  • ChainableService.processWithThis - static step methods using this to reference the class
  • customSerializationWorkflow - custom class serialization with WORKFLOW_SERIALIZE/WORKFLOW_DESERIALIZE
  • cancelRun - cancelling a running workflow
  • cancelRun via CLI - cancelling a running workflow
  • hookWithSleepWorkflow - hook payloads delivered correctly with concurrent sleep

example (16 failed):

  • addTenWorkflow
  • parallelSleepWorkflow
  • error handling error propagation workflow errors nested function calls preserve message and stack trace
  • error handling error propagation step errors cross-file step error preserves message and function names in stack
  • error handling retry behavior FatalError fails immediately without retries
  • error handling retry behavior RetryableError respects custom retryAfter delay
  • hookCleanupTestWorkflow - hook token reuse after workflow completion
  • concurrent hook token conflict - two workflows cannot use the same hook token simultaneously
  • hookDisposeTestWorkflow - hook token reuse after explicit disposal while workflow still running
  • stepFunctionWithClosureWorkflow - step function with closure variables passed as argument
  • spawnWorkflowFromStepWorkflow - spawning a child workflow using start() inside a step
  • Calculator.calculate - static workflow method using static step methods from another class
  • ChainableService.processWithThis - static step methods using this to reference the class
  • stepFunctionAsStartArgWorkflow - step function reference passed as start() argument
  • cancelRun via CLI - cancelling a running workflow
  • hookWithSleepWorkflow - hook payloads delivered correctly with concurrent sleep

express (8 failed):

  • error handling error propagation workflow errors nested function calls preserve message and stack trace
  • error handling retry behavior FatalError fails immediately without retries
  • hookCleanupTestWorkflow - hook token reuse after workflow completion
  • concurrent hook token conflict - two workflows cannot use the same hook token simultaneously
  • hookDisposeTestWorkflow - hook token reuse after explicit disposal while workflow still running
  • crossContextSerdeWorkflow - classes defined in step code are deserializable in workflow context
  • cancelRun - cancelling a running workflow
  • hookWithSleepWorkflow - hook payloads delivered correctly with concurrent sleep

fastify (15 failed):

  • addTenWorkflow
  • error handling error propagation workflow errors nested function calls preserve message and stack trace
  • error handling error propagation step errors cross-file step error preserves message and function names in stack
  • error handling retry behavior FatalError fails immediately without retries
  • error handling retry behavior RetryableError respects custom retryAfter delay
  • error handling retry behavior infrastructure error on run_completed retries via queue (not run_failed)
  • hookCleanupTestWorkflow - hook token reuse after workflow completion
  • concurrent hook token conflict - two workflows cannot use the same hook token simultaneously
  • hookDisposeTestWorkflow - hook token reuse after explicit disposal while workflow still running
  • spawnWorkflowFromStepWorkflow - spawning a child workflow using start() inside a step
  • Calculator.calculate - static workflow method using static step methods from another class
  • ChainableService.processWithThis - static step methods using this to reference the class
  • customSerializationWorkflow - custom class serialization with WORKFLOW_SERIALIZE/WORKFLOW_DESERIALIZE
  • cancelRun - cancelling a running workflow
  • hookWithSleepWorkflow - hook payloads delivered correctly with concurrent sleep

hono (13 failed):

  • error handling error propagation workflow errors nested function calls preserve message and stack trace
  • error handling error propagation step errors basic step error preserves message and stack trace
  • error handling retry behavior FatalError fails immediately without retries
  • hookCleanupTestWorkflow - hook token reuse after workflow completion
  • concurrent hook token conflict - two workflows cannot use the same hook token simultaneously
  • hookDisposeTestWorkflow - hook token reuse after explicit disposal while workflow still running
  • stepFunctionPassingWorkflow - step function references can be passed as arguments (without closure vars)
  • spawnWorkflowFromStepWorkflow - spawning a child workflow using start() inside a step
  • Calculator.calculate - static workflow method using static step methods from another class
  • ChainableService.processWithThis - static step methods using this to reference the class
  • customSerializationWorkflow - custom class serialization with WORKFLOW_SERIALIZE/WORKFLOW_DESERIALIZE
  • cancelRun - cancelling a running workflow
  • hookWithSleepWorkflow - hook payloads delivered correctly with concurrent sleep

nextjs-turbopack (10 failed):

  • parallelSleepWorkflow
  • error handling error propagation workflow errors nested function calls preserve message and stack trace
  • error handling error propagation step errors cross-file step error preserves message and function names in stack
  • error handling retry behavior FatalError fails immediately without retries
  • error handling retry behavior RetryableError respects custom retryAfter delay
  • hookCleanupTestWorkflow - hook token reuse after workflow completion
  • concurrent hook token conflict - two workflows cannot use the same hook token simultaneously
  • hookDisposeTestWorkflow - hook token reuse after explicit disposal while workflow still running
  • spawnWorkflowFromStepWorkflow - spawning a child workflow using start() inside a step
  • cancelRun - cancelling a running workflow

nextjs-webpack (10 failed):

  • error handling error propagation workflow errors nested function calls preserve message and stack trace
  • error handling retry behavior FatalError fails immediately without retries
  • hookCleanupTestWorkflow - hook token reuse after workflow completion
  • concurrent hook token conflict - two workflows cannot use the same hook token simultaneously
  • hookDisposeTestWorkflow - hook token reuse after explicit disposal while workflow still running
  • spawnWorkflowFromStepWorkflow - spawning a child workflow using start() inside a step
  • Calculator.calculate - static workflow method using static step methods from another class
  • ChainableService.processWithThis - static step methods using this to reference the class
  • customSerializationWorkflow - custom class serialization with WORKFLOW_SERIALIZE/WORKFLOW_DESERIALIZE
  • cancelRun via CLI - cancelling a running workflow

nitro (14 failed):

  • addTenWorkflow
  • parallelSleepWorkflow
  • promiseRaceStressTestWorkflow
  • error handling error propagation workflow errors nested function calls preserve message and stack trace
  • error handling error propagation step errors basic step error preserves message and stack trace
  • error handling retry behavior FatalError fails immediately without retries
  • hookCleanupTestWorkflow - hook token reuse after workflow completion
  • concurrent hook token conflict - two workflows cannot use the same hook token simultaneously
  • hookDisposeTestWorkflow - hook token reuse after explicit disposal while workflow still running
  • Calculator.calculate - static workflow method using static step methods from another class
  • ChainableService.processWithThis - static step methods using this to reference the class
  • customSerializationWorkflow - custom class serialization with WORKFLOW_SERIALIZE/WORKFLOW_DESERIALIZE
  • cancelRun - cancelling a running workflow
  • hookWithSleepWorkflow - hook payloads delivered correctly with concurrent sleep

nuxt (13 failed):

  • addTenWorkflow
  • error handling error propagation workflow errors nested function calls preserve message and stack trace
  • error handling error propagation step errors cross-file step error preserves message and function names in stack
  • error handling retry behavior FatalError fails immediately without retries
  • error handling retry behavior infrastructure error on run_completed retries via queue (not run_failed)
  • hookCleanupTestWorkflow - hook token reuse after workflow completion
  • concurrent hook token conflict - two workflows cannot use the same hook token simultaneously
  • hookDisposeTestWorkflow - hook token reuse after explicit disposal while workflow still running
  • stepFunctionWithClosureWorkflow - step function with closure variables passed as argument
  • spawnWorkflowFromStepWorkflow - spawning a child workflow using start() inside a step
  • crossContextSerdeWorkflow - classes defined in step code are deserializable in workflow context
  • cancelRun - cancelling a running workflow
  • hookWithSleepWorkflow - hook payloads delivered correctly with concurrent sleep

sveltekit (13 failed):

  • parallelSleepWorkflow
  • error handling error propagation workflow errors nested function calls preserve message and stack trace
  • error handling error propagation step errors basic step error preserves message and stack trace
  • error handling retry behavior FatalError fails immediately without retries
  • hookCleanupTestWorkflow - hook token reuse after workflow completion
  • concurrent hook token conflict - two workflows cannot use the same hook token simultaneously
  • hookDisposeTestWorkflow - hook token reuse after explicit disposal while workflow still running
  • spawnWorkflowFromStepWorkflow - spawning a child workflow using start() inside a step
  • Calculator.calculate - static workflow method using static step methods from another class
  • ChainableService.processWithThis - static step methods using this to reference the class
  • customSerializationWorkflow - custom class serialization with WORKFLOW_SERIALIZE/WORKFLOW_DESERIALIZE
  • cancelRun - cancelling a running workflow
  • hookWithSleepWorkflow - hook payloads delivered correctly with concurrent sleep

vite (14 failed):

  • promiseRaceStressTestWorkflow
  • error handling error propagation workflow errors nested function calls preserve message and stack trace
  • error handling retry behavior FatalError fails immediately without retries
  • error handling retry behavior infrastructure error on run_completed retries via queue (not run_failed)
  • hookCleanupTestWorkflow - hook token reuse after workflow completion
  • concurrent hook token conflict - two workflows cannot use the same hook token simultaneously
  • hookDisposeTestWorkflow - hook token reuse after explicit disposal while workflow still running
  • spawnWorkflowFromStepWorkflow - spawning a child workflow using start() inside a step
  • Calculator.calculate - static workflow method using static step methods from another class
  • ChainableService.processWithThis - static step methods using this to reference the class
  • customSerializationWorkflow - custom class serialization with WORKFLOW_SERIALIZE/WORKFLOW_DESERIALIZE
  • cancelRun - cancelling a running workflow
  • cancelRun via CLI - cancelling a running workflow
  • hookWithSleepWorkflow - hook payloads delivered correctly with concurrent sleep

🐘 Local Postgres (1 failed)

sveltekit-stable (1 failed):

  • webhookWorkflow

🌍 Community Worlds (56 failed)

mongodb (3 failed):

  • hookWorkflow is not resumable via public webhook endpoint
  • webhookWorkflow
  • concurrent hook token conflict - two workflows cannot use the same hook token simultaneously

redis (2 failed):

  • hookWorkflow is not resumable via public webhook endpoint
  • concurrent hook token conflict - two workflows cannot use the same hook token simultaneously

turso (51 failed):

  • addTenWorkflow
  • addTenWorkflow
  • wellKnownAgentWorkflow (.well-known/agent)
  • should work with react rendering in step
  • promiseAllWorkflow
  • promiseRaceWorkflow
  • promiseAnyWorkflow
  • importedStepOnlyWorkflow
  • hookWorkflow
  • hookWorkflow is not resumable via public webhook endpoint
  • webhookWorkflow
  • sleepingWorkflow
  • parallelSleepWorkflow
  • nullByteWorkflow
  • workflowAndStepMetadataWorkflow
  • fetchWorkflow
  • promiseRaceStressTestWorkflow
  • error handling error propagation workflow errors nested function calls preserve message and stack trace
  • error handling error propagation workflow errors cross-file imports preserve message and stack trace
  • error handling error propagation step errors basic step error preserves message and stack trace
  • error handling error propagation step errors cross-file step error preserves message and function names in stack
  • error handling retry behavior regular Error retries until success
  • error handling retry behavior FatalError fails immediately without retries
  • error handling retry behavior RetryableError respects custom retryAfter delay
  • error handling retry behavior maxRetries=0 disables retries
  • error handling retry behavior infrastructure error on run_completed retries via queue (not run_failed)
  • error handling catchability FatalError can be caught and detected with FatalError.is()
  • hookCleanupTestWorkflow - hook token reuse after workflow completion
  • concurrent hook token conflict - two workflows cannot use the same hook token simultaneously
  • hookDisposeTestWorkflow - hook token reuse after explicit disposal while workflow still running
  • stepFunctionPassingWorkflow - step function references can be passed as arguments (without closure vars)
  • stepFunctionWithClosureWorkflow - step function with closure variables passed as argument
  • closureVariableWorkflow - nested step functions with closure variables
  • spawnWorkflowFromStepWorkflow - spawning a child workflow using start() inside a step
  • health check (queue-based) - workflow and step endpoints respond to health check messages
  • pathsAliasWorkflow - TypeScript path aliases resolve correctly
  • Calculator.calculate - static workflow method using static step methods from another class
  • AllInOneService.processNumber - static workflow method using sibling static step methods
  • ChainableService.processWithThis - static step methods using this to reference the class
  • thisSerializationWorkflow - step function invoked with .call() and .apply()
  • customSerializationWorkflow - custom class serialization with WORKFLOW_SERIALIZE/WORKFLOW_DESERIALIZE
  • instanceMethodStepWorkflow - instance methods with "use step" directive
  • crossContextSerdeWorkflow - classes defined in step code are deserializable in workflow context
  • stepFunctionAsStartArgWorkflow - step function reference passed as start() argument
  • cancelRun - cancelling a running workflow
  • cancelRun via CLI - cancelling a running workflow
  • pages router addTenWorkflow via pages router
  • pages router promiseAllWorkflow via pages router
  • pages router sleepingWorkflow via pages router
  • hookWithSleepWorkflow - hook payloads delivered correctly with concurrent sleep
  • sleepWithSequentialStepsWorkflow - sequential steps work with concurrent sleep (control)

📋 Other (1 failed)

e2e-local-postgres-nest-stable (1 failed):

  • webhookWorkflow

Details by Category

❌ ▲ Vercel Production
App Passed Failed Skipped
❌ astro 36 15 7
❌ example 35 16 7
❌ express 43 8 7
❌ fastify 36 15 7
❌ hono 38 13 7
❌ nextjs-turbopack 46 10 2
❌ nextjs-webpack 46 10 2
❌ nitro 37 14 7
❌ nuxt 38 13 7
❌ sveltekit 38 13 7
❌ vite 37 14 7
✅ 💻 Local Development
App Passed Failed Skipped
✅ astro-stable 49 0 9
✅ express-stable 49 0 9
✅ fastify-stable 49 0 9
✅ hono-stable 49 0 9
✅ nextjs-turbopack-canary 55 0 3
✅ nextjs-turbopack-stable 55 0 3
✅ nextjs-webpack-canary 55 0 3
✅ nextjs-webpack-stable 55 0 3
✅ nitro-stable 49 0 9
✅ nuxt-stable 49 0 9
✅ sveltekit-stable 49 0 9
✅ vite-stable 49 0 9
✅ 📦 Local Production
App Passed Failed Skipped
✅ astro-stable 49 0 9
✅ express-stable 49 0 9
✅ fastify-stable 49 0 9
✅ hono-stable 49 0 9
✅ nextjs-turbopack-canary 55 0 3
✅ nextjs-turbopack-stable 55 0 3
✅ nextjs-webpack-canary 55 0 3
✅ nextjs-webpack-stable 55 0 3
✅ nitro-stable 49 0 9
✅ nuxt-stable 49 0 9
✅ sveltekit-stable 49 0 9
✅ vite-stable 49 0 9
❌ 🐘 Local Postgres
App Passed Failed Skipped
✅ astro-stable 49 0 9
✅ express-stable 49 0 9
✅ fastify-stable 49 0 9
✅ hono-stable 49 0 9
✅ nextjs-turbopack-canary 55 0 3
✅ nextjs-turbopack-stable 55 0 3
✅ nextjs-webpack-canary 55 0 3
✅ nextjs-webpack-stable 55 0 3
✅ nitro-stable 49 0 9
✅ nuxt-stable 49 0 9
❌ sveltekit-stable 48 1 9
✅ vite-stable 49 0 9
✅ 🪟 Windows
App Passed Failed Skipped
✅ nextjs-turbopack 55 0 3
❌ 🌍 Community Worlds
App Passed Failed Skipped
✅ mongodb-dev 3 0 2
❌ mongodb 52 3 3
✅ redis-dev 3 0 2
❌ redis 53 2 3
✅ turso-dev 3 0 2
❌ turso 4 51 3
❌ 📋 Other
App Passed Failed Skipped
✅ e2e-local-dev-nest-stable 49 0 9
❌ e2e-local-postgres-nest-stable 48 1 9
✅ e2e-local-prod-nest-stable 49 0 9

📋 View full workflow run


Some E2E test jobs failed:

  • Vercel Prod: failure
  • Local Dev: success
  • Local Prod: success
  • Local Postgres: failure
  • Windows: success

Check the workflow run for details.

TooTallNate and others added 10 commits March 12, 2026 12:21
The e2e tests spawn a CLI subprocess for every inspect/cancel/health call.
Each subprocess performs an npm registry version check on startup, which
can hang under load and exceed the 20s spawn timeout, causing SIGTERM.

Add WORKFLOW_NO_UPDATE_CHECK=1 env var support to skip the check, and
set it in the e2e test harness.
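A sketch of the opt-out described in this commit. The environment is passed in as a plain object so the logic stays testable; `shouldCheckForUpdates` and `withUpdateCheckDisabled` are hypothetical names, and treating only the exact value "1" as the opt-out is an assumption.

```typescript
type Env = Record<string, string | undefined>;

// The CLI would consult this before contacting the npm registry.
function shouldCheckForUpdates(env: Env): boolean {
  return env["WORKFLOW_NO_UPDATE_CHECK"] !== "1";
}

// The e2e harness would use something like this when building the
// environment for each spawned CLI subprocess.
function withUpdateCheckDisabled(env: Env): Env {
  return { ...env, WORKFLOW_NO_UPDATE_CHECK: "1" };
}
```

In a harness, the second helper would typically wrap the parent environment before it is handed to the subprocess, so every inspect/cancel/health call skips the version check.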
The health command's `endpoint` flag and the shared `env` flag both
declared `char: 'e'`, causing ambiguity. Remove the short flag from
`endpoint` so `-e` unambiguously maps to `--env`.

Set git user.name and user.email so that the 'Version Packages' commit
created by changesets/action is attributed to the app's bot account
instead of the default github-actions[bot].

Set setupGitUser: false on changesets/action to prevent it from
overwriting our git config with the hardcoded github-actions[bot]
identity. The git identity is now configured in a prior step using
the app-slug output from actions/create-github-app-token.

spawn({timeout}) sends SIGTERM by default, but the CLI process ignores it
(likely due to undici keeping active connections alive). The process then
runs for minutes until vitest or GitHub Actions eventually kills it.

Using killSignal: 'SIGKILL' ensures the process is forcefully terminated
when the 20s timeout fires, since SIGKILL cannot be caught or ignored.
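For reference, a minimal sketch of the spawn options this commit describes, using Node's `child_process.spawn` with its `timeout` and `killSignal` options. `cliSpawnOptions` and `spawnCli` are hypothetical helper names, and the real test harness may pass additional options.

```typescript
import { spawn } from "node:child_process";
import type { SpawnOptions } from "node:child_process";

const CLI_TIMEOUT_MS = 20_000;

// When the timeout elapses, Node sends killSignal to the child. The default
// is SIGTERM, which a process can ignore; SIGKILL cannot be caught or ignored.
function cliSpawnOptions(timeoutMs: number = CLI_TIMEOUT_MS): SpawnOptions {
  return { timeout: timeoutMs, killSignal: "SIGKILL", stdio: "pipe" };
}

// Spawns a CLI subprocess that is force-killed if it outlives the timeout.
function spawnCli(command: string, args: string[]) {
  return spawn(command, args, cliSpawnOptions());
}
```

A caller can inspect the `signal` argument of the child's `exit` event to tell a timeout kill apart from a normal exit.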
* fix: separate infrastructure vs user code error handling in runtime and step handler

Transient network errors (ECONNRESET, etc.) during infrastructure calls
(event listing, event creation) were caught by a shared try/catch that
also handles user code errors, incorrectly marking runs as run_failed
or steps as step_failed instead of letting the queue redeliver.

- runtime.ts: Move infrastructure calls outside the user-code try/catch
  so errors propagate to the queue handler for automatic retry
- step-handler.ts: Same structural separation — only stepFn.apply() is
  wrapped in the try/catch that produces step_failed/step_retrying
- helpers.ts: Add isTransientNetworkError() and update withServerErrorRetry
  to retry network errors in addition to 5xx responses
- helpers.test.ts: Add tests for network error detection and retry
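The structural separation described for step-handler.ts can be sketched as follows. This is a simplified, synchronous illustration: `StepDeps`, `loadEvents`, and `recordOutcome` are hypothetical names standing in for the infrastructure calls, and the real handler is asynchronous and also produces step_retrying.

```typescript
type StepOutcome =
  | { type: "step_completed"; result: unknown }
  | { type: "step_failed"; error: unknown };

// Hypothetical infrastructure dependencies (event listing, event creation).
interface StepDeps {
  loadEvents: () => unknown[];
  recordOutcome: (outcome: StepOutcome) => void;
}

function handleStep(
  stepFn: (...args: unknown[]) => unknown,
  args: unknown[],
  deps: StepDeps
): StepOutcome {
  // Infrastructure call: not wrapped, so a throw here propagates to the
  // queue handler and the message can be redelivered.
  deps.loadEvents();

  let outcome: StepOutcome;
  try {
    // Only user code runs inside the try/catch.
    outcome = { type: "step_completed", result: stepFn.apply(undefined, args) };
  } catch (error) {
    outcome = { type: "step_failed", error };
  }

  // Infrastructure call again, also outside the try/catch.
  deps.recordOutcome(outcome);
  return outcome;
}
```

The design point is that a transient failure in `loadEvents` or `recordOutcome` is never mistaken for a step failure, because those calls sit outside the block that produces step_failed.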

* add changeset

* remove withServerErrorRetry and isTransientNetworkError

Redundant with undici RetryAgent which already handles 5xx retries
and network error retries at the HTTP dispatcher level.

* address review feedback: move getEncryptionKeyForRun out of user-code try/catch, re-add 5xx/410 safety net in step handler, relax e2e test assertion

* remove serverError5xxRetryWorkflow e2e test

This test validated withServerErrorRetry's in-process retry behavior,
which was removed. Queue-level retry with process-scoped fault injection
is unreliable across serverless instances and too slow for e2e timeouts.

* remove serverError5xxRetryWorkflow and fault injection helpers from e2e workflows

* remove inline comment about deleted test

- Add RUN_ERROR_CODES (USER_ERROR, RUNTIME_ERROR) to @workflow/errors
- Populate errorCode in run_failed events via classifyRunError()
- Update web UI StatusBadge to show amber dot for runtime (internal) errors
- Improve world-local queue error logging (concise, no body dump)
- Improve schema validation error messages (concise, verbose behind DEBUG)
- Add e2e tests for error code flow and infrastructure error retry

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
5 participants