Move heavy hydration I/O from orchestrator Lambda into agent container (follow-up to PR #52 interactive-agents design)

### Background

[PR #52](https://github.com/aws-samples/sample-autonomous-cloud-coding-agents/pull/52) ships the rev-6 async-only architecture (single AgentCore runtime, polling CLI, notification fan-out, SSE deletion). The design doc that PR introduces — [`docs/design/INTERACTIVE_AGENTS.md`](https://github.com/aws-samples/sample-autonomous-cloud-coding-agents/blob/main/docs/design/INTERACTIVE_AGENTS.md) — locks in architectural decision AD-11: **hydration** (heavy I/O: GitHub issue/PR fetch, prompt assembly, Memory retrieval, guardrail screening, S3 blueprint reads) should live **inside the agent container at startup**, not in the orchestrator Lambda.

PR #52 defers the actual relocation to a follow-up for scope reasons. This issue tracks it.

### Current state

Hydration lives in `cdk/src/handlers/shared/context-hydration.ts` (1,190 LOC) and is invoked from `OrchestratorFn` at `cdk/src/handlers/shared/orchestrator.ts:256`:

```ts
const hydratedContext = await hydrateContext(task, { /* ... */ });
```

The result is passed into the agent container's invocation payload. `agent/src/pipeline.py` already has a local-batch fallback path (`if hydrated_context: … else: …`) that demonstrates the target shape — it fetches the GitHub issue and assembles the prompt in-container when the orchestrator didn't.
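
The fallback branch described above can be sketched roughly as follows. This is a minimal illustration, not the actual code in `agent/src/pipeline.py`: the function name `resolve_context`, the payload keys, and the `hydrate_locally` callable are all assumed names.

```python
from typing import Any, Callable, Optional


def resolve_context(
    payload: dict[str, Any],
    hydrate_locally: Callable[[str], dict[str, Any]],
) -> dict[str, Any]:
    """Return the hydrated context, hydrating in-container when it is absent.

    `payload`, its keys, and `hydrate_locally` are hypothetical names used
    only for illustration.
    """
    hydrated_context: Optional[dict[str, Any]] = payload.get("hydrated_context")
    if hydrated_context:
        # The orchestrator already did the heavy I/O; use its result as-is.
        return hydrated_context
    # Fallback: fetch the issue and assemble the prompt inside the container.
    return hydrate_locally(payload["task_id"])
```

Passing the local hydrator in as a callable keeps the branch easy to test without network access.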

### Why move it

#### Industry precedent

Long-running async coding agents overwhelmingly put heavy I/O in the worker, not the dispatcher:

| System | Hydration locus | Source |
|---|---|---|
| **Cursor background agents** | Inside the ephemeral VM | <https://docs.cursor.com/background-agents> |
| **GitHub Copilot coding agent** | Inside the ephemeral GitHub Actions runner | <https://docs.github.com/en/enterprise-cloud@latest/copilot/how-tos/use-copilot-agents/coding-agent/customize-the-agent-environment> |
| **Devin (Cognition)** | Inside per-agent isolated VM | <https://cognition.ai/blog/devin-can-now-manage-devins> |
| **LangGraph Platform** | Worker-side (API server only validates + enqueues) | <https://blog.langchain.com/context-engineering-for-agents/> |
| **Temporal** | Inside Activities (Workers), not Workflows | <https://temporal.io/blog/how-many-activities-should-i-use-in-my-temporal-workflow> |
| **OpenAI Assistants API** | Resolved at run-time inside execution | <https://platform.openai.com/docs/assistants/deep-dive> |
| AWS Step Functions + Bedrock | Dispatcher-side (exception — but for short-lived agent invocations) | <https://aws.amazon.com/blogs/compute/effectively-building-ai-agents-on-aws-serverless/> |

**6 of 7 systems** surveyed put heavy hydration worker-side. The lone exception (AWS Step Functions + Bedrock Agents) targets short-lived request-response workloads, not multi-hour container runs, and our workload is the latter.

Temporal practice treats dispatcher-side I/O as an anti-pattern. One good-practices write-up puts it as *"Workflows should not do I/O; delegate to Activities in Workers"* (<https://raphaelbeamonte.com/posts/good-practices-for-writing-temporal-workflows-and-activities/>). Our orchestrator Lambda is textbook "Workflow doing I/O."

#### Concrete risks of the current architecture

1. **15-minute Lambda ceiling can be hit.** `hydrateContext` performs serial network hops: SecretsManager → GitHub REST issue (30s timeout) → GitHub comments (30s) → GraphQL review threads (paginated for PR tasks) → Memory service → Bedrock `ApplyGuardrail` on the assembled prompt. A PR-review task with 200+ review threads, a cold Memory retrieval, and a backpressured guardrail can realistically take 2-5 minutes, and heavier pagination or slower dependencies push it further toward the ceiling. On Lambda timeout, the task sits in `HYDRATING` until the stranded-task reconciler's 20-minute threshold catches it: an effectively silent latency bomb.

2. **IAM blast radius.** The orchestrator Lambda currently holds:
   - `secretsmanager:GetSecretValue` on PAT secrets
   - `bedrock:ApplyGuardrail`
   - AgentCore Memory read
   - S3 blueprint read

   The agent container already holds (or should hold) these for its own tool execution. Moving hydration shrinks the orchestrator's blast radius without adding anything new to the container.

3. **Compute portability.** `context-hydration.ts` imports `@aws-sdk/client-bedrock-runtime`, `@aws-sdk/client-secrets-manager`, uses `fetch` with Node-20 `AbortSignal.timeout`, and relies on Lambda warm-start caching. An ECS/Fargate swap (already partially scaffolded in `cdk/src/handlers/shared/strategies/ecs-strategy.ts`) means either re-invoking this TS module from every compute backend or dual-maintaining a Python port anyway. Agent-side hydration consolidates to one implementation.

4. **Phase 2 + resume semantics.** `bgagent ask --agent` and any future checkpoint-resume feature need the container to be idempotently hydratable from a `task_id` alone. If hydration only lives in the orchestrator, restart paths must round-trip through it — duplicating logic or accepting that restarts lose context.
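
For risk 1 above, a back-of-the-envelope budget shows how serial hops add up against the Lambda ceiling. Only the two 30-second GitHub REST timeouts and the 15-minute limit come from this issue. Every other timeout and the page count are assumed values chosen purely for illustration.

```python
LAMBDA_MAX_TIMEOUT_S = 15 * 60  # the 15-minute Lambda ceiling


def worst_case_serial_seconds(hops: list[tuple[str, float, int]]) -> float:
    """Sum (per-call timeout * sequential calls) over hops that run in series."""
    return sum(timeout_s * calls for _name, timeout_s, calls in hops)


# (hop name, per-call timeout in seconds, number of sequential calls)
# ASSUMPTION: all values except the two 30 s GitHub REST timeouts are made up.
EXAMPLE_HOPS = [
    ("secrets_manager", 5.0, 1),
    ("github_issue_rest", 30.0, 1),
    ("github_comments_rest", 30.0, 1),
    ("github_review_threads_graphql", 30.0, 4),  # 4 pages, one after another
    ("memory_retrieval", 60.0, 1),
    ("apply_guardrail", 60.0, 1),
]

worst_case = worst_case_serial_seconds(EXAMPLE_HOPS)
headroom = LAMBDA_MAX_TIMEOUT_S - worst_case
print(f"worst case: {worst_case:.0f}s, headroom: {headroom:.0f}s")
```

Under these assumed numbers the worst case is 305 seconds. Every extra GraphQL page adds another full timeout to it, which is why pagination-heavy PR tasks are the ones that erode the headroom.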

### Why defer (why this isn't in PR #52)

1. **Zero user-visible behavior change.** Pure architectural relocation. Every UX feature works identically before and after.

2. **Large re-implementation surface:**
   - 1,190 LOC of TypeScript to port to Python (new boto3 surfaces for Bedrock guardrail, AgentCore Memory, Secrets Manager; new GraphQL GitHub client)
   - 1,514 LOC of tests across 89 test cases to port from Jest to pytest
   - Total: ~2,700 LOC

3. **PR #52 narrative coherence.** PR #52 tells one story: *"rev-6 async-only architecture: delete SSE, fix UX, ship notifications."* The chunks shipping SSE deletion and UX polish all fit cleanly. Adding hydration relocation turns that into two unrelated architectural moves and dilutes reviewer attention.

4. **Drift risk is bounded.** The contract is explicit: `HydratedContext` interface in `context-hydration.ts` + the matching Pydantic model in `agent/src/models.py` with `SUPPORTED_HYDRATED_CONTEXT_VERSION = 1`. The Python side enforces a version check on receive.
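
The version check on receive might look something like the sketch below. It is dependency-free for brevity, whereas the real model is Pydantic. The `prompt` field, the exception class, and `parse_hydrated_context` are hypothetical names; only `SUPPORTED_HYDRATED_CONTEXT_VERSION = 1` comes from the text above.

```python
from dataclasses import dataclass
from typing import Any

SUPPORTED_HYDRATED_CONTEXT_VERSION = 1


class UnsupportedContextVersion(ValueError):
    """Raised when the payload version is not the one this agent understands."""


@dataclass(frozen=True)
class HydratedContext:
    version: int
    prompt: str  # hypothetical field; the real schema lives in models.py


def parse_hydrated_context(raw: dict[str, Any]) -> HydratedContext:
    """Reject payloads whose version differs from the supported one."""
    version = raw.get("version")
    if version != SUPPORTED_HYDRATED_CONTEXT_VERSION:
        raise UnsupportedContextVersion(
            f"got version {version!r}, "
            f"expected {SUPPORTED_HYDRATED_CONTEXT_VERSION}"
        )
    return HydratedContext(version=version, prompt=str(raw.get("prompt", "")))
```

Failing loudly on a mismatch is what keeps drift between the TypeScript interface and the Python model bounded: a schema change without a matching bump surfaces at the first invocation rather than as silently missing fields.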

### Proposed design — hybrid split

Not a literal move. Keep the **fail-fast** benefit of dispatcher-side validation; move only the **heavy I/O**:

**Stays in orchestrator (preflight, ~150-200 LOC):**
- PAT validity check (reject revoked tokens before allocating a compute slot)
- Repo existence check
- Guardrail screen on the **raw** `task_description` at submission

**Moves to agent container (`agent/src/hydration.py`, ~1000 LOC):**
- GitHub issue body + comments fetch (REST)
- GitHub PR body + review threads fetch (GraphQL)
- Memory retrieval
- Blueprint merge
- Prompt assembly
- Guardrail screen on the **assembled** prompt (where most of the latency lives today)

This split preserves fail-fast on trusted input while moving expensive, variable-latency I/O into the 8-hour runtime budget where it can emit progress events and fail visibly.
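
One possible shape for the agent-side half is a small step runner that emits a progress event around each hop and fails with the name of the step that broke. This is a sketch under assumptions, not the planned `agent/src/hydration.py`: `run_hydration`, `HydrationError`, the event strings, and the step names are all illustrative.

```python
from typing import Any, Callable

Step = Callable[[dict[str, Any]], dict[str, Any]]
ProgressSink = Callable[[str, str], None]  # (step name, status)


class HydrationError(RuntimeError):
    """Wraps a step failure so the failing hop is visible to the caller."""

    def __init__(self, step: str, cause: Exception) -> None:
        super().__init__(f"hydration step {step!r} failed: {cause}")
        self.step = step


def run_hydration(
    task: dict[str, Any],
    steps: list[tuple[str, Step]],
    emit: ProgressSink,
) -> dict[str, Any]:
    """Run steps in order, merging each step's output into the context."""
    context: dict[str, Any] = {"task_id": task["task_id"]}
    for name, step in steps:
        emit(name, "started")
        try:
            context.update(step(context))
        except Exception as exc:
            emit(name, "failed")
            raise HydrationError(name, exc) from exc
        emit(name, "completed")
    return context
```

In this shape each moved responsibility (issue fetch, review-thread fetch, Memory retrieval, blueprint merge, prompt assembly, assembled-prompt guardrail) would become one `(name, step)` pair, so a slow or failing hop shows up as a named event instead of an opaque Lambda timeout.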

### Acceptance criteria

- [ ] `agent/src/hydration.py` handles all of: GitHub issue/PR fetch (REST + GraphQL), prompt assembly, Memory retrieval, blueprint merge, assembled-prompt guardrail
- [ ] `cdk/src/handlers/shared/context-hydration.ts` shrinks to just the preflight (PAT check, repo check, raw-description guardrail) or is split/renamed accordingly
- [ ] Orchestrator Lambda IAM role loses: `bedrock:ApplyGuardrail` (except for preflight scope), AgentCore Memory read, S3 blueprint read, Secrets Manager PAT read (except whatever the preflight PAT validity check still needs)
- [ ] Agent runtime role adds: any of these permissions it does not already hold
- [ ] Test parity: every `cdk/test/handlers/shared/context-hydration.test.ts` case has a pytest equivalent in `agent/tests/test_hydration.py`
- [ ] `SUPPORTED_HYDRATED_CONTEXT_VERSION` in `agent/src/models.py` bumped if the payload shape changes
- [ ] Design doc `INTERACTIVE_AGENTS.md` AD-11 updated to remove the "deferred" status note
- [ ] Rollout: a feature flag (`HYDRATE_IN_AGENT=false`/`true`) lets the code deploy and the switch flip separately. Remove the flag after one release cycle at default-true.

### Out of scope for this issue

- Changing the `HydratedContext` schema (unless the hybrid-split forces it)
- ECS/Fargate runtime support (separate issue)
- Any user-facing feature work (covered by PR #52 and subsequent chunks)

### Risk & rollout

- **Risk:** re-implementation bugs. Mitigated by (a) feature flag, (b) test-case parity requirement, (c) staged rollout in `backgroundagent-dev` first.
- **Rollback:** flip `HYDRATE_IN_AGENT=false` env var on the Lambda; orchestrator keeps calling `hydrateContext` as today. No schema migration needed.
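
The flag's decision logic can be sketched as below. The orchestrator itself is TypeScript, so this Python version only illustrates the intended behavior. Treating anything other than `true` as the legacy path is an assumed design choice (it makes an unset or mistyped flag roll back safely), and the function names are hypothetical.

```python
import os
from typing import Mapping

FLAG_NAME = "HYDRATE_IN_AGENT"


def hydrate_in_agent(env: Mapping[str, str] = os.environ) -> bool:
    """True only when the flag is exactly 'true' (case-insensitive).

    ASSUMPTION: unset, empty, or unrecognized values select the legacy path.
    """
    return env.get(FLAG_NAME, "false").strip().lower() == "true"


def choose_hydration_path(env: Mapping[str, str] = os.environ) -> str:
    """Return 'agent' for in-container hydration, else 'orchestrator'."""
    return "agent" if hydrate_in_agent(env) else "orchestrator"
```

With this shape, rollback is a configuration change only: setting the variable back to `false` (or removing it) returns the orchestrator to calling `hydrateContext`.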

### Cited research

This issue is backed by a cross-system survey and adversarial review:

- Industry precedent table above — primary sources linked
- Temporal: <https://temporal.io/blog/how-many-activities-should-i-use-in-my-temporal-workflow>
- LangGraph context engineering: <https://blog.langchain.com/context-engineering-for-agents/>
- AWS prescriptive guidance: <https://docs.aws.amazon.com/prescriptive-guidance/latest/agentic-ai-serverless/pattern-agentic-ai-orchestration.html>
- AWS AgentCore samples: <https://aws.amazon.com/blogs/machine-learning/getting-started-with-amazon-bedrock-agents-custom-orchestrator/>
- Cursor background agents docs: <https://docs.cursor.com/background-agents>
- GitHub Copilot coding agent docs: <https://docs.github.com/en/enterprise-cloud@latest/copilot/how-tos/use-copilot-agents/coding-agent/customize-the-agent-environment>
- Devin architecture: <https://cognition.ai/blog/devin-can-now-manage-devins>
- Production patterns: <https://tianpan.co/blog/2026-03-07-async-agent-workflows-long-running-task-design>

### Related

- PR #52 (rev-6 async-only architecture) — context
- `docs/design/INTERACTIVE_AGENTS.md` AD-11 — design intent

