Background
PR #52 ships the rev-6 async-only architecture (single AgentCore runtime, polling CLI, notification fan-out, SSE deletion). The design doc that PR introduces — docs/design/INTERACTIVE_AGENTS.md — locks in architectural decision AD-11: hydration (heavy I/O: GitHub issue/PR fetch, prompt assembly, Memory retrieval, guardrail screening, S3 blueprint reads) should live inside the agent container at startup, not in the orchestrator Lambda.
PR #52 defers the actual relocation to a follow-up for scope reasons. This issue tracks it.
Current state
Hydration lives in cdk/src/handlers/shared/context-hydration.ts (1,190 LOC) and is invoked from OrchestratorFn at cdk/src/handlers/shared/orchestrator.ts:256.
The result is passed into the agent container's invocation payload. agent/src/pipeline.py already has a local-batch fallback path (if hydrated_context: … else: …) that demonstrates the target shape — it fetches the GitHub issue and assembles the prompt in-container when the orchestrator didn't.
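A minimal sketch of that fallback shape — the `if hydrated_context: … else: …` dispatch in agent/src/pipeline.py is described above, but every field and function name below is an assumed illustration, not the real module:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HydratedContext:
    # Assumed field; the real HydratedContext contract lives in
    # context-hydration.ts and agent/src/models.py.
    prompt: str


def assemble_prompt_locally(task_id: str) -> str:
    # Stand-in for the in-container GitHub issue fetch + prompt assembly.
    return f"prompt assembled in-container for {task_id}"


def build_prompt(hydrated_context: Optional[HydratedContext], task_id: str) -> str:
    if hydrated_context:
        # Orchestrator already did the heavy I/O; use its assembled prompt.
        return hydrated_context.prompt
    # Local-batch fallback: hydrate in-container — the shape that
    # agent-side hydration would generalize to all tasks.
    return assemble_prompt_locally(task_id)
```

The fallback branch is exactly the "target shape": everything needed to build a prompt from a task alone already runs inside the container.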
Why move it
Industry precedent
Long-running async coding agents universally put heavy I/O in the worker, not the dispatcher:
6 of 7 systems surveyed put heavy hydration worker-side. The lone exception (AWS Step Functions + Bedrock Agents) targets short-lived request-response workloads, not multi-hour container runs — our workload is squarely the latter.
Temporal explicitly calls dispatcher-side I/O an anti-pattern: "Workflows should not do I/O; delegate to Activities in Workers" (https://raphaelbeamonte.com/posts/good-practices-for-writing-temporal-workflows-and-activities/). Our orchestrator Lambda is textbook "Workflow doing I/O."
Concrete risks of the current architecture
15-minute Lambda ceiling can be hit. hydrateContext performs serial network hops: SecretsManager → GitHub REST issue (30s timeout) → GitHub comments (30s) → GraphQL review threads (paginated for PR tasks) → Memory service → Bedrock ApplyGuardrail on the assembled prompt. A PR-review task with 200+ review threads + a cold Memory retrieval + backpressured guardrail can realistically hit 2-5 minutes. On Lambda timeout, the task sits in HYDRATING until the stranded-task reconciler's 20-minute threshold catches it — an effectively silent latency bomb.
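To make the ceiling concrete, a back-of-envelope tally of the serial hop chain — the two 30 s GitHub timeouts are stated above; the remaining figures are illustrative assumptions, not measurements:

```python
# Worst-case budget for the serial hop chain described above. Only the two
# 30 s GitHub timeouts come from the text; the rest are assumed figures.
worst_case_seconds = {
    "secretsmanager_get": 5,
    "github_rest_issue": 30,        # 30 s timeout per the text
    "github_rest_comments": 30,     # 30 s timeout per the text
    "github_graphql_reviews": 120,  # paginated; grows with review-thread count
    "memory_retrieval": 60,         # cold retrieval
    "apply_guardrail": 60,          # backpressured guardrail
}
total = sum(worst_case_seconds.values())
print(f"worst case: {total} s of a 900 s Lambda budget")  # → 305 s, ~5 min
```

Because the hops are serial, the worst cases add rather than overlap — a third of the Lambda budget gone before the agent even starts, with no retry headroom.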
IAM blast radius. The orchestrator Lambda currently holds:
secretsmanager:GetSecretValue on PAT secrets
bedrock:ApplyGuardrail
AgentCore Memory read
S3 blueprint read
The agent container already holds (or should hold) these for its own tool execution. Moving hydration shrinks the orchestrator's blast radius without adding anything new to the container.
Compute portability. context-hydration.ts imports @aws-sdk/client-bedrock-runtime, @aws-sdk/client-secrets-manager, uses fetch with Node-20 AbortSignal.timeout, and relies on Lambda warm-start caching. An ECS/Fargate swap (already partially scaffolded in cdk/src/handlers/shared/strategies/ecs-strategy.ts) means either re-invoking this TS module from every compute backend or dual-maintaining a Python port anyway. Agent-side hydration consolidates to one implementation.
Phase 2 + resume semantics. bgagent ask --agent and any future checkpoint-resume feature need the container to be idempotently hydratable from a task_id alone. If hydration only lives in the orchestrator, restart paths must round-trip through it — duplicating logic or accepting that restarts lose context.
Why defer (why this isn't in PR #52)
Zero user-visible behavior change. Pure architectural relocation. Every UX feature works identically before and after.
Large re-implementation surface:
PR #52 narrative coherence. PR #52 tells one story: "rev-6 async-only architecture: delete SSE, fix UX, ship notifications." The chunks shipping SSE deletion and UX polish all fit cleanly. Adding hydration relocation turns that into two unrelated architectural moves and dilutes reviewer attention.
Drift risk is bounded. The contract is explicit: HydratedContext interface in context-hydration.ts + the matching Pydantic model in agent/src/models.py with SUPPORTED_HYDRATED_CONTEXT_VERSION = 1. The Python side enforces a version check on receive.
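The receive-time version gate might look like the following — SUPPORTED_HYDRATED_CONTEXT_VERSION = 1 and the existence of the check are stated above; the real model is Pydantic, while this dependency-free sketch uses a plain dataclass, and all other fields are assumptions:

```python
from dataclasses import dataclass

SUPPORTED_HYDRATED_CONTEXT_VERSION = 1  # from agent/src/models.py per the text


@dataclass
class HydratedContext:
    # The real contract is a Pydantic model mirroring the TS interface;
    # fields other than the version are illustrative.
    hydrated_context_version: int
    prompt: str

    def __post_init__(self) -> None:
        # Receive-time gate: reject payloads from a drifted orchestrator
        # instead of silently running with a mismatched shape.
        if self.hydrated_context_version != SUPPORTED_HYDRATED_CONTEXT_VERSION:
            raise ValueError(
                f"hydrated_context version {self.hydrated_context_version} "
                f"not supported (expected {SUPPORTED_HYDRATED_CONTEXT_VERSION})"
            )
```

This is what bounds the drift risk: a schema change on either side fails loudly at the container boundary rather than deep inside prompt assembly.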
Proposed design — hybrid split
Not a literal move. Keep the fail-fast benefit of dispatcher-side validation; move only the heavy I/O:
Stays in orchestrator (preflight, ~150-200 LOC):
PAT validity check (reject revoked tokens before allocating a compute slot)
Repo existence check
Guardrail screen on the raw task_description at submission
Moves to agent container (agent/src/hydration.py, ~1000 LOC):
GitHub issue body + comments fetch (REST)
GitHub PR body + review threads fetch (GraphQL)
Memory retrieval
Blueprint merge
Prompt assembly
Guardrail screen on the assembled prompt (where most of the latency lives today)
This split preserves fail-fast on trusted input while moving expensive, variable-latency I/O into the 8-hour runtime budget where it can emit progress events and fail visibly.
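The agent-side half of the split can be sketched as follows — the step order mirrors the list above, but agent/src/hydration.py does not exist yet, so every function name and body here is an assumed stub:

```python
# Illustrative shape of the proposed agent-side hydration module
# (agent/src/hydration.py). All bodies are stand-in stubs.

def emit_progress(step: str) -> None:
    # In the real container this would publish a progress event; the split
    # makes each heavy-I/O step individually observable instead of opaque.
    print(f"hydration: {step}")


def fetch_issue(task_id: str) -> str:
    emit_progress("github-issue")    # issue body + comments (REST)
    return f"issue body for {task_id}"


def fetch_review_threads(task_id: str) -> list[str]:
    emit_progress("review-threads")  # PR review threads (GraphQL, paginated)
    return []


def retrieve_memory(task_id: str) -> list[str]:
    emit_progress("memory")          # AgentCore Memory retrieval
    return []


def merge_blueprint(task_id: str) -> str:
    emit_progress("blueprint")       # S3 blueprint merge
    return ""


def screen_with_guardrail(prompt: str) -> str:
    emit_progress("guardrail")       # guardrail on the assembled prompt
    return prompt


def hydrate(task_id: str) -> str:
    """Heavy-I/O hydration run in-container, inside the 8-hour runtime budget."""
    parts = [
        fetch_issue(task_id),
        *fetch_review_threads(task_id),
        *retrieve_memory(task_id),
        merge_blueprint(task_id),
    ]
    return screen_with_guardrail("\n".join(p for p in parts if p))
```

A slow GraphQL pagination or backpressured guardrail here costs minutes of an 8-hour budget and emits progress events along the way, instead of silently eating a 15-minute Lambda.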
Acceptance criteria
agent/src/hydration.py handles all of: GitHub issue/PR fetch (REST + GraphQL), prompt assembly, Memory retrieval, blueprint merge, assembled-prompt guardrail
cdk/src/handlers/shared/context-hydration.ts shrinks to just the preflight (PAT check, repo check, raw-description guardrail) or is split/renamed accordingly
Orchestrator Lambda IAM role loses: bedrock:ApplyGuardrail (except for preflight scope), AgentCore Memory read, Secrets Manager PAT read
Agent runtime role adds: whatever the orchestrator loses
Test parity: every cdk/test/handlers/shared/context-hydration.test.ts case has a pytest equivalent in agent/tests/test_hydration.py
SUPPORTED_HYDRATED_CONTEXT_VERSION in agent/src/models.py bumped if the payload shape changes
Design doc INTERACTIVE_AGENTS.md AD-11 updated to remove the "deferred" status note
Rollout: feature-flag (HYDRATE_IN_AGENT=false/true) allows deploying the code and flipping the switch separately. Remove flag after one release cycle on default-true.
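The flag check itself is trivial — shown in Python for consistency with the other sketches even though the orchestrator is TypeScript; only the HYDRATE_IN_AGENT name and false default come from the rollout plan above, the function name is assumed:

```python
import os


def should_hydrate_in_agent() -> bool:
    # Default false: orchestrator keeps calling hydrateContext exactly as
    # today, so the relocated code can be deployed dark and flipped on later.
    return os.environ.get("HYDRATE_IN_AGENT", "false").lower() == "true"
```

Deploying the code and flipping the switch as separate operations means a bad flip is a one-line env-var revert, not a rollback.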
Out of scope for this issue
Changing the HydratedContext schema (unless the hybrid-split forces it)
Risk & rollout
backgroundagent-dev first.
HYDRATE_IN_AGENT=false env var on the Lambda; orchestrator keeps calling hydrateContext as today. No schema migration needed.
Cited research
This issue is backed by a cross-system survey and adversarial review:
Related
docs/design/INTERACTIVE_AGENTS.md AD-11 — design intent