Move heavy hydration I/O from orchestrator Lambda into agent container (follow-up to PR #52 interactive-agents design) #53

@scoropeza

Background

PR #52 ships the rev-6 async-only architecture (single AgentCore runtime, polling CLI, notification fan-out, SSE deletion). The design doc that PR introduces — docs/design/INTERACTIVE_AGENTS.md — locks in architectural decision AD-11: hydration (heavy I/O: GitHub issue/PR fetch, prompt assembly, Memory retrieval, guardrail screening, S3 blueprint reads) should live inside the agent container at startup, not in the orchestrator Lambda.

PR #52 defers the actual relocation to a follow-up for scope reasons. This issue tracks it.

Current state

Hydration lives in cdk/src/handlers/shared/context-hydration.ts (1,190 LOC) and is invoked from OrchestratorFn at cdk/src/handlers/shared/orchestrator.ts:256:

const hydratedContext = await hydrateContext(task, { /* ... */ });

The result is passed into the agent container's invocation payload. agent/src/pipeline.py already has a local-batch fallback path (if hydrated_context: … else: …) that demonstrates the target shape — it fetches the GitHub issue and assembles the prompt in-container when the orchestrator didn't.
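
The fallback branch described above might look roughly like the following. This is a sketch of the shape only, not the contents of agent/src/pipeline.py: `resolve_context`, `assemble_prompt`, the payload keys, and the injected `fetch_issue` callable are all hypothetical names chosen for illustration.

```python
from typing import Callable, Optional


def assemble_prompt(task_description: str, issue_body: str) -> str:
    """Hypothetical prompt assembly: task text followed by the issue body."""
    return f"{task_description}\n\n## Issue\n{issue_body}"


def resolve_context(
    payload: dict,
    fetch_issue: Callable[[str, int], str],
) -> dict:
    """Use the orchestrator's hydrated context if present, else hydrate locally."""
    hydrated: Optional[dict] = payload.get("hydrated_context")
    if hydrated:
        # The orchestrator already did the heavy I/O; pass its result through.
        return hydrated
    # Local fallback: fetch the issue and assemble the prompt in-container.
    body = fetch_issue(payload["repo"], payload["issue_number"])
    return {"prompt": assemble_prompt(payload["task_description"], body)}
```

Injecting `fetch_issue` keeps the branch testable without network access; the real fetch would sit behind that callable.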

Why move it

Industry precedent

Long-running async agent systems overwhelmingly put heavy I/O in the worker, not the dispatcher:

| System | Hydration locus | Source |
|---|---|---|
| Cursor background agents | Inside the ephemeral VM | https://docs.cursor.com/background-agents |
| GitHub Copilot coding agent | Inside the ephemeral GitHub Actions runner | https://docs.github.com/en/enterprise-cloud@latest/copilot/how-tos/use-copilot-agents/coding-agent/customize-the-agent-environment |
| Devin (Cognition) | Inside per-agent isolated VM | https://cognition.ai/blog/devin-can-now-manage-devins |
| LangGraph Platform | Worker-side (API server only validates + enqueues) | https://blog.langchain.com/context-engineering-for-agents/ |
| Temporal | Inside Activities (Workers), not Workflows | https://temporal.io/blog/how-many-activities-should-i-use-in-my-temporal-workflow |
| OpenAI Assistants API | Resolved at run-time inside execution | https://platform.openai.com/docs/assistants/deep-dive |
| AWS Step Functions + Bedrock | Dispatcher-side (exception, but for short-lived agent invocations) | https://aws.amazon.com/blogs/compute/effectively-building-ai-agents-on-aws-serverless/ |

6 of 7 systems surveyed put heavy hydration worker-side. The lone exception (AWS Step Functions + Bedrock Agents) targets short-lived request-response workloads, not multi-hour container runs, and our workload is the latter.

Temporal's programming model treats dispatcher-side I/O as an anti-pattern: Workflow code has to stay deterministic, so I/O is delegated to Activities running in Workers. One community guide puts it as "Workflows should not do I/O; delegate to Activities in Workers" (https://raphaelbeamonte.com/posts/good-practices-for-writing-temporal-workflows-and-activities/). Our orchestrator Lambda is textbook "Workflow doing I/O."

Concrete risks of the current architecture

  1. 15-minute Lambda ceiling can be hit. hydrateContext performs serial network hops: SecretsManager → GitHub REST issue (30s timeout) → GitHub comments (30s) → GraphQL review threads (paginated for PR tasks) → Memory service → Bedrock ApplyGuardrail on the assembled prompt. A PR-review task with 200+ review threads, a cold Memory retrieval, and a backpressured guardrail can realistically take 2-5 minutes, and because the hops run serially, every additional delay adds directly to the total and pushes it toward the ceiling. On Lambda timeout, the task sits in HYDRATING until the stranded-task reconciler's 20-minute threshold catches it, which makes this an effectively silent latency bomb.

  2. IAM blast radius. The orchestrator Lambda currently holds:

    • secretsmanager:GetSecretValue on PAT secrets
    • bedrock:ApplyGuardrail
    • AgentCore Memory read
    • S3 blueprint read

    The agent container already holds (or should hold) these for its own tool execution. Moving hydration shrinks the orchestrator's blast radius without adding anything new to the container.

  3. Compute portability. context-hydration.ts imports @aws-sdk/client-bedrock-runtime, @aws-sdk/client-secrets-manager, uses fetch with Node-20 AbortSignal.timeout, and relies on Lambda warm-start caching. An ECS/Fargate swap (already partially scaffolded in cdk/src/handlers/shared/strategies/ecs-strategy.ts) means either re-invoking this TS module from every compute backend or dual-maintaining a Python port anyway. Agent-side hydration consolidates to one implementation.

  4. Phase 2 + resume semantics. bgagent ask --agent and any future checkpoint-resume feature need the container to be idempotently hydratable from a task_id alone. If hydration only lives in the orchestrator, restart paths must round-trip through it — duplicating logic or accepting that restarts lose context.
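
One way to read "idempotently hydratable from a task_id alone" is that repeated hydration calls for the same task return the same context without repeating the heavy I/O. The sketch below illustrates that property under loud assumptions: `HydrationStore` and `get_or_hydrate` are hypothetical names, and the in-memory dict is only a stand-in for whatever durable store a real restart path would need.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class HydrationStore:
    """Hypothetical per-task cache; in-memory here purely for illustration."""

    _by_task: Dict[str, dict] = field(default_factory=dict)

    def get_or_hydrate(self, task_id: str, hydrate: Callable[[str], dict]) -> dict:
        # Re-running with the same task_id returns the stored context,
        # so a resumed run does not repeat the fetches or guardrail calls.
        if task_id not in self._by_task:
            self._by_task[task_id] = hydrate(task_id)
        return self._by_task[task_id]
```

The important part is the signature: the only input is `task_id`, so a restart path can call it without round-tripping through the orchestrator.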

Why defer (why this isn't in PR #52)

  1. Zero user-visible behavior change. Pure architectural relocation. Every UX feature works identically before and after.

  2. Large re-implementation surface:

    • 1,190 LOC of TypeScript to port to Python (new boto3 surfaces for Bedrock guardrail, AgentCore Memory, Secrets Manager; new GraphQL GitHub client)
    • 1,514 LOC of tests across 89 test cases to port from Jest to pytest
    • Total: ~2,700 LOC

  3. PR #52 narrative coherence. PR #52 tells one story: "rev-6 async-only architecture: delete SSE, fix UX, ship notifications." The chunks shipping SSE deletion and UX polish all fit cleanly. Adding hydration relocation turns that into two unrelated architectural moves and dilutes reviewer attention.

  4. Drift risk is bounded. The contract is explicit: HydratedContext interface in context-hydration.ts + the matching Pydantic model in agent/src/models.py with SUPPORTED_HYDRATED_CONTEXT_VERSION = 1. The Python side enforces a version check on receive.
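
The version check on receive could be sketched as follows. This is not the actual model in agent/src/models.py: a stdlib dataclass stands in for the Pydantic model so the example has no dependencies, and the `version` and `prompt` field names, `parse_hydrated_context`, and `UnsupportedContextVersion` are assumptions. Only `SUPPORTED_HYDRATED_CONTEXT_VERSION = 1` comes from the contract described above.

```python
from dataclasses import dataclass

SUPPORTED_HYDRATED_CONTEXT_VERSION = 1


class UnsupportedContextVersion(ValueError):
    """Raised when the payload version is not the one the agent supports."""


@dataclass(frozen=True)
class HydratedContext:
    version: int
    prompt: str  # hypothetical field, for illustration only


def parse_hydrated_context(raw: dict) -> HydratedContext:
    """Reject payloads whose version differs from the supported one."""
    version = raw.get("version")
    if version != SUPPORTED_HYDRATED_CONTEXT_VERSION:
        raise UnsupportedContextVersion(
            f"expected version {SUPPORTED_HYDRATED_CONTEXT_VERSION}, got {version!r}"
        )
    return HydratedContext(version=version, prompt=raw["prompt"])
```

Failing loudly on a mismatch is what keeps drift bounded: a TypeScript-side shape change that forgets the version bump is still a risk, but one that bumps it is caught at the container boundary.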

Proposed design — hybrid split

Not a literal move. Keep the fail-fast benefit of dispatcher-side validation; move only the heavy I/O:

Stays in orchestrator (preflight, ~150-200 LOC):

  • PAT validity check (reject revoked tokens before allocating a compute slot)
  • Repo existence check
  • Guardrail screen on the raw task_description at submission

Moves to agent container (agent/src/hydration.py, ~1000 LOC):

  • GitHub issue body + comments fetch (REST)
  • GitHub PR body + review threads fetch (GraphQL)
  • Memory retrieval
  • Blueprint merge
  • Prompt assembly
  • Guardrail screen on the assembled prompt (where most of the latency lives today)
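
Of the items above, the paginated review-thread fetch is the one with no existing Python counterpart, so here is a rough sketch of the cursor loop. It assumes GitHub's GraphQL `reviewThreads` connection with `pageInfo { hasNextPage endCursor }`; `iter_review_threads`, the injected `run_query` transport (expected to return the response's `data` object), the page size, and the `max_pages` cap are hypothetical choices, not part of the current implementation.

```python
from typing import Callable, Iterator, Optional

# Cursor-paginated query; only a couple of thread fields are selected here.
REVIEW_THREADS_QUERY = """
query($owner: String!, $name: String!, $number: Int!, $cursor: String) {
  repository(owner: $owner, name: $name) {
    pullRequest(number: $number) {
      reviewThreads(first: 50, after: $cursor) {
        pageInfo { hasNextPage endCursor }
        nodes { id isResolved }
      }
    }
  }
}
"""


def iter_review_threads(
    run_query: Callable[[str, dict], dict],
    owner: str,
    name: str,
    number: int,
    max_pages: int = 20,  # hypothetical safety cap on page count
) -> Iterator[dict]:
    """Yield review threads page by page until GitHub reports no next page."""
    cursor: Optional[str] = None
    for _ in range(max_pages):
        data = run_query(
            REVIEW_THREADS_QUERY,
            {"owner": owner, "name": name, "number": number, "cursor": cursor},
        )
        threads = data["repository"]["pullRequest"]["reviewThreads"]
        yield from threads["nodes"]
        if not threads["pageInfo"]["hasNextPage"]:
            return
        cursor = threads["pageInfo"]["endCursor"]
```

Bounding the page count means a PR with an unusually large number of threads stops after a known amount of work instead of looping indefinitely.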

This split preserves fail-fast on trusted input while moving expensive, variable-latency I/O into the 8-hour runtime budget where it can emit progress events and fail visibly.
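
To make "emit progress events and fail visibly" concrete, the agent-side stages could be driven by a small runner like the one below. It is a sketch under assumptions: `run_hydration`, the event dictionary shape, and the `emit` callback are hypothetical, and the real stages would be the fetch, Memory, blueprint, assembly, and guardrail steps listed above.

```python
import time
from typing import Callable, Dict, List, Tuple

Stage = Tuple[str, Callable[[dict], None]]


def run_hydration(stages: List[Stage], emit: Callable[[Dict], None]) -> dict:
    """Run stages in order, emitting an event before and after each one."""
    context: dict = {}
    for name, step in stages:
        emit({"stage": name, "status": "started"})
        started = time.monotonic()
        try:
            step(context)  # each stage mutates the shared context in place
        except Exception as exc:
            # Fail visibly: report which stage broke, then re-raise.
            emit({"stage": name, "status": "failed", "error": str(exc)})
            raise
        emit({
            "stage": name,
            "status": "done",
            "seconds": round(time.monotonic() - started, 3),
        })
    return context
```

Compared with a Lambda timeout, a failure here names the stage that broke at the moment it breaks, rather than leaving the task in HYDRATING for the reconciler to find.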

Acceptance criteria

  • agent/src/hydration.py handles all of: GitHub issue/PR fetch (REST + GraphQL), prompt assembly, Memory retrieval, blueprint merge, assembled-prompt guardrail
  • cdk/src/handlers/shared/context-hydration.ts shrinks to just the preflight (PAT check, repo check, raw-description guardrail) or is split/renamed accordingly
  • Orchestrator Lambda IAM role loses: bedrock:ApplyGuardrail (except for preflight scope), AgentCore Memory read, Secrets Manager PAT read (except as needed for the preflight PAT check)
  • Agent runtime role adds: any of the permissions the orchestrator loses that it does not already hold
  • Test parity: every cdk/test/handlers/shared/context-hydration.test.ts case has a pytest equivalent in agent/tests/test_hydration.py
  • SUPPORTED_HYDRATED_CONTEXT_VERSION in agent/src/models.py bumped if the payload shape changes
  • Design doc INTERACTIVE_AGENTS.md AD-11 updated to remove the "deferred" status note
  • Rollout: feature-flag (HYDRATE_IN_AGENT=false/true) allows deploying the code and flipping the switch separately. Remove flag after one release cycle on default-true.
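
The flag check itself is small; a possible parsing rule is sketched below. It is written in Python for illustration, although the Lambda-side check would be TypeScript. Only the name HYDRATE_IN_AGENT and the values false/true come from this issue. Accepting "1", "yes", and "on" as truthy, and treating unset or unrecognized values as false, are assumptions, as is the helper name `hydrate_in_agent`.

```python
import os
from typing import Mapping, Optional

# Assumed truthy spellings; the issue itself only specifies false/true.
_TRUE_VALUES = {"1", "true", "yes", "on"}


def hydrate_in_agent(env: Optional[Mapping[str, str]] = None) -> bool:
    """Read HYDRATE_IN_AGENT; unset or unrecognized values mean False."""
    source = os.environ if env is None else env
    raw = source.get("HYDRATE_IN_AGENT", "false")
    return raw.strip().lower() in _TRUE_VALUES
```

Defaulting to false on anything unrecognized keeps a typo in the environment from silently switching hydration paths during the rollout window.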

Out of scope for this issue

Risk & rollout

  • Risk: re-implementation bugs. Mitigated by (a) feature flag, (b) test-case parity requirement, (c) staged rollout in backgroundagent-dev first.
  • Rollback: flip HYDRATE_IN_AGENT=false env var on the Lambda; orchestrator keeps calling hydrateContext as today. No schema migration needed.

Cited research

This issue is backed by a cross-system survey and adversarial review:

Related
