- Vision: Symbiotic Intelligence for Co-Evolution
- System Architecture
- Cortex-Action-Memory
- Core Execution Loop
- Evolution Loop
- Evaluation Loop
- Summary: The Co-Evolution Architecture
Most AI assistants today remain at the "tool" stage: they can generate content, but they often cannot reliably drive tasks to validated outcomes, operate within accountable workflows, or improve through a sustained feedback loop. Sico reframes AI agents as Digital Workers: long-lived, structured capability units that can be managed, evaluated, and continuously improved through real work.
- Digital Workers execute repeatable tasks with increasing reliability and consistency.
- Humans (Operators) set goals, evaluate outcomes, and provide corrections.
- System distills these corrections and task-level signals into two complementary forms of improvement: reusable execution experience (strategies, playbooks, memories) that takes effect on the next run, and structured training data that feeds back into the base model so the worker's intrinsic capability grows over longer cycles.
A Digital Worker's capabilities improve along two complementary tracks, both driven by real-world execution and Operator guidance:
- Training-free evolution: This track accumulates reusable strategies, playbooks, and memories around the model. These improvements can take effect within the current session or in subsequent executions.
- Training-based evolution: This track converts execution outcomes, Operator corrections, and task trajectories into high-quality training data for SFT/RL pipelines, enabling the base model to improve over longer cycles.
Operator Digital Worker
│ │
│── set goal ──────────────────────────────> │
│ │── execute (traced)
│ │── produce outcomes
│ <── request intervention (when uncertain) ─ │
│── provide corrections ──────────────────> │
│ │
│ ┌──────────────────────────────┐ │
│ │ Experience Store │ │
│ │ trajectories, corrections, │<───│
│ │ outcomes, strategies │ │
│ └──────────────┬───────────────┘ │
│ │ │
│ training-free feedback │
│ │ retrieve & apply │
│ ├───────────────────>│ (enriched context,
│ │ │ updated strategies)
│ training-based feedback │
│ │ SFT / RL │
│ └───────────────────>│ stronger base model
│ │
│ <── improved capability ───────────────────│
Each task execution generates signals about effective strategies, failed steps, and environment responses. Sico routes these signals into two feedback loops: a training-free loop that distills them into reusable experience the Digital Worker can retrieve and apply in future executions (§5.1), and a training-based loop that converts them into high-quality training data for improving the base model through SFT/RL (§5.2). Together, these loops reduce repeated errors in the short term while raising baseline competence over the long term.
As execution quality improves, the need for repetitive Operator intervention is expected to decrease. Prior human corrections are incorporated into both the worker's strategy selection process and the model improvement pipeline. Over time, Operator judgment continuously shapes worker behavior, while Digital Workers become increasingly capable of handling routine execution with less supervision.
Sico separates user-facing serving, Core orchestration, and delegated execution into clear ownership boundaries. The Backend owns HTTP/SSE ingress, authentication, and primary persistence. Core owns turn orchestration, workspace state, LLM/tool coordination, and Task Runtime execution. Sandboxes are leased per run only when isolated execution is needed. The static topology diagram shows deployed service boundaries; the runtime topology below simplifies the chat execution path.
At runtime, the frontend sends the chat request and receives the SSE stream over HTTP. Deployments may proxy this traffic before Backend, but the simplified diagram treats it as a Frontend-Backend link. Backend invokes Core's StreamChat gRPC endpoint to start the turn, while user-visible streaming is decoupled through Kafka and SSE. Core writes platform-owned state through reverse gRPC; §4.9 expands the exact services. Inside Core, ChatService initializes the workspace, routes the request, and runs ChatAgent; delegated work enters Task Runtime and may call the LLM Hub or leased sandbox APIs. The detailed per-turn sequence is expanded in §4.1.
flowchart TB
Frontend["Frontend"] <-->|"HTTP/SSE"| Backend["Backend"]
subgraph Core["Core"]
direction TB
ChatService["ChatService"] -->|"init"| Workspace["Workspace"]
ChatService -->|"build / run"| ChatAgent["ChatAgent"]
ChatAgent -->|"read / write"| Workspace
ChatAgent -->|"delegate"| TaskRuntime["Task Runtime"]
ChatAgent -->|"LLM"| LLMHub["LLM Hub"]
TaskRuntime -->|"LLM"| LLMHub
end
Backend -->|"gRPC StreamChat"| ChatService
ChatService -->|"reverse gRPC writes"| Backend
ChatService -->|"publish"| Kafka["Kafka"]
Kafka -->|"consume"| Backend
TaskRuntime -->|"sandbox APIs"| Sandbox["Sandbox"]
LLMHub -->|"provider APIs"| LLMProviders["LLM Providers"]
The Workspace provides shared execution state for a turn. Initialization materializes skills, knowledge, playbooks, and attachments before routing (§4.2); ChatAgent writes plans and intermediate artifacts during reasoning; delegated batches publish outputs under results/ for later steps to consume. This keeps cross-component coordination file-based and auditable, while platform persistence remains centralized behind Backend reverse gRPC services.
All cross-service contracts are defined in proto/ and generated to four targets:
| Target | Generator | Output |
|---|---|---|
| Go gRPC stubs | protoc | `backend/internal/transport/grpc/pb/` |
| Go HTTP DTOs | protoc + protoc-go-inject-tag | `backend/internal/transport/http/dto/` |
| Go reverse gRPC | protoc | `backend/internal/transport/reverse_grpc/pb/` |
| Python stubs | betterproto2 | `core/app/pb/` |
Backend serves two kinds of callers: human Operators using the web client and machine clients running inside Sandboxes. The two audiences use different middleware stacks:
| Audience | Mechanism | Replay protection |
|---|---|---|
| Users (web client -> operator-facing management API) | JWT (HS512) + Casbin RBAC | JWT token store (Redis when REDIS_HOST is set, else in-process cache), invalidated on logout via JWTAuth.DestroyToken |
| Sandbox clients (machine → API) | HMAC-SHA256 with X-Sico-* headers, per-client secret from SANDBOX_CLIENT_SECRET_<CLIENT_ID> | Redis nonce store |
The JWT middleware applies to user-facing APIs, with a small whitelist for login, health, and public LLM runtime routes. Sandbox client endpoints (/api/sico/sandbox/apply, /release) use HMAC-SHA256 instead of JWT; secrets are compared with hmac.Equal.
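The sandbox-client scheme above (HMAC-SHA256 signature, constant-time comparison, nonce-based replay protection) can be sketched as follows. This is a minimal illustration, not Backend's Go implementation: the canonical string layout, the timestamp skew window, and the names `sign_request`, `verify_request`, and `NonceStore` are all assumptions, and an in-memory set stands in for the Redis nonce store. Python's `hmac.compare_digest` plays the role that `hmac.Equal` plays in Go.

```python
import hashlib
import hmac
import time
from typing import Optional


def sign_request(secret: bytes, method: str, path: str,
                 timestamp: int, nonce: str, body: bytes) -> str:
    """Sign a request with HMAC-SHA256 over an assumed canonical string."""
    body_hash = hashlib.sha256(body).hexdigest()
    # Assumption: the fields and their order are illustrative only.
    message = "\n".join([method.upper(), path, str(timestamp), nonce, body_hash])
    return hmac.new(secret, message.encode("utf-8"), hashlib.sha256).hexdigest()


class NonceStore:
    """In-memory stand-in for the Redis nonce store (hypothetical)."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def claim(self, nonce: str) -> bool:
        """Return True the first time a nonce is seen, False on replay."""
        if nonce in self._seen:
            return False
        self._seen.add(nonce)
        return True


def verify_request(secret: bytes, method: str, path: str, timestamp: int,
                   nonce: str, body: bytes, signature: str,
                   nonces: NonceStore, now: Optional[int] = None,
                   max_skew_seconds: int = 300) -> bool:
    """Check freshness, then the signature, then claim the nonce."""
    current = int(time.time()) if now is None else now
    if abs(current - timestamp) > max_skew_seconds:  # assumed skew window
        return False
    expected = sign_request(secret, method, path, timestamp, nonce, body)
    # Constant-time comparison avoids leaking how many characters matched.
    if not hmac.compare_digest(expected, signature):
        return False
    return nonces.claim(nonce)
```

The nonce is claimed only after the signature verifies, so an unauthenticated caller cannot fill the store with arbitrary nonces.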
All stateful systems below are provisioned automatically by make compose-up or make kind-up:
| Dependency | Role |
|---|---|
| MySQL | Primary store (GORM), schema managed by golang-migrate, auto-applied at startup. |
| Redis | Cache, distributed locks, JWT blacklist, sandbox lease pool, sandbox nonce store. |
| Qdrant | Vector store for Mem0-backed long-term memory. |
| Kafka | Event bus for Core → Backend streaming chunks (decouples gRPC from SSE). |
| SeaweedFS | Blob storage for uploads, artifacts, and workspace assets. |
| Nginx | Single reverse-proxy entry point in front of Frontend and Backend. |
| LLM providers | OpenAI, Azure, Anthropic, Gemini, OpenRouter, etc., accessed via the LLM Hub (§3.1). |
Core never connects to MySQL directly. Primary relational persistence is mediated by Backend through reverse gRPC, while Core keeps execution artifacts and memory-related state in workspace files, local stores, and Mem0/Qdrant-backed memory. This keeps the primary data model centralized in Backend without requiring Core to own schema migrations.
A Digital Worker is not a single prompt or a model wrapper. It is a structured capability unit with three layers:
┌───────────────────────────────────────────────────────────┐
│ Digital Worker │
├──────────────┬──────────────────┬─────────────────────────┤
│ Cortex │ Action │ Memory & Sense │
│ Reasoning │ Skills, tools, │ Project knowledge, │
│ & planning │ sandbox envs │ execution experience │
└──────────────┴──────────────────┴─────────────────────────┘
All LLM traffic flows through the LLM Hub (core/app/llmhubs/), a unified runtime with adapters for multiple providers:
- Model resolution: built-in models are loaded from Core YAML configs; for DB-sourced custom models, Backend resolves the model per request and passes a `RuntimeModelDefinition` (including decrypted secrets) alongside the gRPC call, so the current main path does not require Backend DB models to be globally registered in Core.
- Adapter pattern: selects the right adapter based on `provider_template_type` from six implementations. Four target specific vendor protocols (Azure OpenAI, OpenAI-compatible, Anthropic, Gemini); two are generic, config-driven adapters (HTTP-JSON, HTTP-binary) that let an operator wire an arbitrary HTTP model endpoint into the hub purely through field mapping and JSONPath extraction, with HTTP-binary streaming returned artifacts (images, audio) to blob storage.
- ChatClient: bridges the Microsoft Agent Framework's `BaseChatClient` interface to LLMHub, handling tool calls, image input, streaming, and reasoning effort control.
The agent execution loop (ChatAgent.run_stream()) builds on top of ChatClient: ChatClient handles LLM communication, while ChatAgent orchestrates the full execution cycle (workspace setup, tool binding, streaming, and cleanup). ChatAgent leverages the Agent Framework's FunctionInvocationLayer for automatic tool call orchestration: the LLM outputs a function call -> the Framework executes it -> the result is injected back -> the LLM continues. This enables multi-step reasoning with tool use in a single streaming pass.
Planning is implemented through autonomous LLM tool calls, not hard-coded workflows. The LLM uses three plan tools (plan_read, plan_write, plan_tool_call_message_update) to create and manage execution plans in real time. Plans support cancellation (via marker files polled every 2 seconds) and status tracking (pending, in_progress, completed, failed, require_human_input).
Core defines a set of 16 built-in tools. Rather than handing the whole set to every turn, the chat agent receives a route-scoped subset (fast / inspect / task); role-level differentiation comes from skills, knowledge, playbooks, and runtime context (workspace, plan, sandbox session) on top of that shared definition (the route is decided by the intent check in §4.3, and route gating lives in core/app/biz/chat/router.py; see §4.6):
| Category | Tools |
|---|---|
| Workspace context | context |
| File I/O | read, write_file, edit, grep, remove |
| HTTP & web | webfetch, curl, download |
| Document parsing | parse_document |
| Planning | plan_read, plan_write, plan_tool_call_message_update |
| Memory | search_memory |
| Reporting | report |
| Task inspection | get_task_detail |
Durable "real work" (running commands, driving sandboxes, executing skills) is intentionally not a built-in tool. It is funneled through a single delegate tool added to the task route, which hands the work to the Delegated Task Runtime (§4.7). curl is dual-purpose: the agent uses it for arbitrary HTTP fetches, while sandbox HTTP APIs are driven by the task runtime, not by the chat agent directly.
A Skill is a packaged capability defined by a SKILL.md file (YAML frontmatter + Markdown) plus optional runtime scripts and config. Skills can be scoped to a project (shared by all agents) or to a specific agent.
Sico compiles skills ahead of time rather than re-interpreting SKILL.md at runtime. When a skill is uploaded or updated, the backend calls Core to run the Skill Resolver (§3.2 → Skill Resolver), an LLM pass that compiles the human-written skill into two artifacts:
- `resolved/cortex/`: the agent-facing reference files (the `SKILL.md` and any docs/schemas it points to), copied into the workspace for the LLM to read.
- `resolved/actions.json`: a deterministic, executable action manifest (argv steps with typed parameters and placeholders) that the task runtime executes with zero LLM calls at run time.
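To make the "argv steps with placeholders" idea concrete, here is a minimal sketch of how a runtime could expand placeholders in a resolved step without any LLM call. The function name `expand_argv` and the fail-on-unknown behavior are assumptions for illustration; the placeholder names `{workspace_dir}` and `{result_dir}` are the built-in ones this document mentions, and the real manifest schema may differ.

```python
import re
from typing import Mapping

# Matches {name} placeholders, including dotted names such as {sandbox.android}.
_PLACEHOLDER = re.compile(r"\{([A-Za-z0-9_.]+)\}")


def expand_argv(argv: list[str], values: Mapping[str, str]) -> list[str]:
    """Substitute every {name} placeholder in each argv element.

    Raises KeyError if a placeholder has no value, so a step never runs
    with an unresolved argument (an assumed policy, not a documented one).
    """
    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in values:
            raise KeyError(f"unresolved placeholder: {name}")
        return values[name]

    return [_PLACEHOLDER.sub(substitute, arg) for arg in argv]


# Hypothetical step taken from a resolved action.
step = ["python", "{workspace_dir}/skills/demo/run.py", "--out", "{result_dir}"]
command = expand_argv(step, {"workspace_dir": "/ws", "result_dir": "/ws/results/b1"})
```

Because expansion is plain string substitution over a fixed manifest, the same inputs always produce the same command line, which is what makes execution auditable.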
At the start of each chat, workspace initialization copies the resolved cortex files for all relevant skills into the agent's working directory and generates an index.json. The skill list is appended to the user message, and the LLM autonomously decides which skills to read and, when execution is needed, dispatches them through delegate.
The Skill Resolver (core/app/biz/skill/resolver.py) is what makes skills cheap and reproducible at run time: a skill author writes a normal SKILL.md, and the resolver compiles it once, at upload time, into the actions.json manifest described above. The design has four notable properties:
- Zero-LLM runtime. The expensive interpretation (what to run, in what order, with which parameters) happens once during the build-time LLM pass. At run time the task runtime only reads `actions.json` and executes argv steps, so skill execution is deterministic, fast, and auditable.
- Structured, validated output. The resolver emits a Pydantic-validated `ResolvedSkillOutput` (`cortex` + `actions`), where each action carries `infra_requirements` (e.g. `sandbox.android`, `sandbox.windows`), typed `parameters`, and `steps` (`argv` with built-in placeholders like `{workspace_dir}`, `{result_dir}` and sandbox placeholders such as `{sandbox.android}`). Invalid output is retried with the error fed back into the prompt, up to 3 attempts (`_MAX_RESOLVER_ATTEMPTS`); on persistent failure it falls back to a cortex-only skill with no actions.
- Incremental re-resolution. On re-upload the resolver diffs the previous and current skill files (budgeted to `_MAX_TOTAL_DIFF_BYTES`) and passes the diff plus the previous `actions.json` into the prompt, so unchanged skills reuse prior output and changed skills are adapted incrementally instead of recompiled from scratch.
- Versioned persistence. The backend `skill` domain stores each resolution as a new `SkillVersionModel` row, so skill definitions are versioned and a current version is always resolvable.
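The validate-and-retry behavior described above can be sketched as a small loop. This is an illustration under stated assumptions, not the code in `resolver.py`: it uses a hand-written `validate_output` instead of Pydantic, a caller-supplied `call_llm` function, and a simplified output shape; only the limit of 3 attempts and the cortex-only fallback come from this document.

```python
import json
from typing import Callable

MAX_RESOLVER_ATTEMPTS = 3  # mirrors the documented limit of 3 attempts


def validate_output(raw: str) -> dict:
    """Parse and check the resolver output; raise ValueError when invalid."""
    data = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("output must be a JSON object")
    if "cortex" not in data or "actions" not in data:
        raise ValueError("output must contain 'cortex' and 'actions'")
    if not isinstance(data["actions"], list):
        raise ValueError("'actions' must be a list")
    return data


def resolve_skill(skill_md: str, call_llm: Callable[[str], str]) -> dict:
    """Compile a skill, feeding each validation error back into the prompt."""
    prompt = skill_md
    for _ in range(MAX_RESOLVER_ATTEMPTS):
        raw = call_llm(prompt)
        try:
            return validate_output(raw)
        except ValueError as exc:
            # Assumed prompt format: append the error so the next pass can fix it.
            prompt = f"{skill_md}\n\nPrevious output was invalid: {exc}"
    # Persistent failure: fall back to a cortex-only skill with no actions.
    return {"cortex": ["SKILL.md"], "actions": []}
```

The fallback keeps the skill readable by the agent even when no executable manifest could be produced, so a bad compile degrades capability rather than blocking the upload.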
At run time, SkillLoader (core/app/biz/task_runtime/skill_loader.py) projects each resolved action into a CapabilityCard (name, parameters, infra requirements, visibility). These cards are the shared catalogue that every skill consumer chooses from when turning an instruction into a concrete skill dispatch — whether an LLM planner (e.g. the general adapter, the sub-agent loop in §4.7) or a deterministic adapter that matches cards by rule (e.g. the workbook adapter).
Sandbox capabilities are exposed as HTTP APIs on each sandbox instance. The chat agent never acquires or drives a sandbox directly; instead, when a delegated task declares a required_sandbox, the task runtime's SandboxCoordinator leases one sandbox for that run and drives its HTTP API (tap, install, reset, …), so the sandbox runtime only needs to ship its HTTP server, not per-endpoint agent-side wrappers. See §4.7 and §4.8 for details.
A Digital Worker needs different kinds of memory at very different time scales. Rather than putting everything into a single vector store, Sico splits memory into five layers, each backed by the storage that best fits its access pattern.
| Memory Type | Mechanism | Scope | Storage |
|---|---|---|---|
| Short-term (in-turn) | LLM context window + plan scratchpad | Current task | LLM context + local FS (plan.json) |
| Recent history | Last 3 turns of `conversation.json` (text-only), prepended to the prompt | Same `(user, agent_instance)` | Local FS (`CHAT_FS`) |
| Long-term | Mem0 facts extracted per turn, retrieved on demand by `search_memory` | Cross-session, per `(user, agent_instance)` | Qdrant vector store |
| Project knowledge | Knowledge bases and workspace files parsed with MarkItDown, then materialized into the agent workspace | Shared within a project or scoped to an agent | Local FS / object storage |
| Execution experience | Playbooks produced by the Reflector → Curator pipeline (§5.1) | Per `(project, agent_instance)` | Local FS + Backend knowledge service |
time scale layer who decides what to load
────────────── ───────────────────── ────────────────────────
this LLM call ───► in-turn context window ◄── ChatAgent (always)
+ plan.json scratchpad
last few turns ───► recent 3 turns (text-only) ◄── ChatAgent (always)
conversation.json files
cross-session ───► Mem0 / Qdrant facts ◄── LLM (calls search_memory)
keyed by (user, agent)
per-task setup ───► workspace skills/knowledge ◄── workspace_init + LLM read
+ playbook (§5.1)
Long-term memory is the only memory layer backed by a vector database. It uses Mem0 as the orchestration layer, with Qdrant as the vector store. In the default deployment config, Azure OpenAI serves as both the embedder (text-embedding-3-small, 1536-dim) and the fact-extraction LLM; both are swappable via mem0_config.yaml. The flow is:
chat turn ends
└─► _enqueue_memories(user_message, assistant_message)
└─► AsyncJobRunner (background worker pool, non-blocking)
└─► Mem0.add(messages, user_id, agent_id)
├─ extraction LLM: distill atomic facts
├─ embedder: encode each fact
├─ Qdrant: upsert (vector, payload{user_id, agent_id, ...})
└─ internal LLM: decide ADD vs UPDATE vs NOOP
Two design choices matter:
- Write is fully asynchronous. Memory writes go through `AsyncJobRunner` (a 16-worker `asyncio` pool). If Mem0 / Qdrant / the extraction LLM is slow or unavailable, the user-facing SSE stream is unaffected: at worst, this turn's memory write is lost.
- Read is on-demand, not auto-injected. Memory is not stuffed into the system prompt every turn. Instead, the LLM autonomously decides when to call the built-in `search_memory` tool. The tool runs a similarity search in Qdrant filtered by `(user_id, agent_id)` (default `threshold=0.5`, `top_k=5`) and returns the matching facts to the LLM. Each call is recorded as a `tool_call` on the plan, so Operators can see what the agent recalled and when.
This RAG-style "pull" model trades a little recall reliability (the LLM may forget to query) for a large gain in context cleanliness: long sessions do not get polluted by unrelated old memories.
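The read path can be illustrated with a small in-memory stand-in for the vector store. This is a sketch under assumptions: it uses cosine similarity and treats the threshold as a minimum score, whereas the real scoring is whatever Mem0 and Qdrant are configured to use; the `Fact` type and this `search_memory` signature are hypothetical. Only the `(user_id, agent_id)` filter and the defaults `threshold=0.5`, `top_k=5` come from this document.

```python
import math
from dataclasses import dataclass


@dataclass
class Fact:
    user_id: str
    agent_id: str
    text: str
    vector: list[float]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 when either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search_memory(facts: list[Fact], query_vector: list[float],
                  user_id: str, agent_id: str,
                  threshold: float = 0.5, top_k: int = 5) -> list[str]:
    """Return the best-matching facts for one (user, agent) pair."""
    scored = [
        (cosine(fact.vector, query_vector), fact.text)
        for fact in facts
        # Isolation: only facts owned by this (user_id, agent_id) are candidates.
        if fact.user_id == user_id and fact.agent_id == agent_id
    ]
    hits = [item for item in scored if item[0] >= threshold]
    hits.sort(key=lambda item: item[0], reverse=True)
    return [text for _, text in hits[:top_k]]
```

Filtering by owner before scoring means another user's facts can never surface, regardless of how similar their vectors are.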
These two are easy to confuse but have disjoint responsibilities:
| | Long-term memory (Mem0) | Playbook (Experience Learning) |
|---|---|---|
| Stores | User/conversation facts ("user is in Shanghai timezone") | Execution strategies ("for task X, prefer approach Y") |
| Produced by | Mem0 extraction on every turn | Reflector → Curator after task completion |
| Isolation | `(user, agent_instance)` | `(project, agent_instance)` |
| Injection | LLM calls `search_memory` on demand | Rendered as Markdown files in the agent workspace, optionally fused into user message |
| Quality signal | None | helpful / harmful citation counts |
Memory remembers who the user is and what was said; the Playbook remembers how to do this kind of work.
Both the project knowledge layer and the parse_document built-in tool (§4.6) share one ingestion path: a DocExtractor (core/app/document/). The abstract interface (base.py) exposes extract(file_path) and extract_from_url(url), each returning a (full_text, summary) pair; the default extract_from_url downloads a SAS/HTTP source to a temp file and delegates to extract. The shipped implementation (MarkitdownDocExtractor, markitdown.py) uses the open-source MarkItDown library to convert heterogeneous formats (PDF, Word, Excel, etc.) into Markdown, then runs a single capped LLM call (input truncated to 50K chars, summary ≤1024 tokens) to produce the summary, degrading to an empty summary if that call fails. This is what turns uploaded knowledge bases and ad-hoc attachments into the plain-text workspace files the agent can read.
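The extractor contract described above can be sketched as an abstract base class with a default URL path. This is an assumption-laden illustration, not the code in `core/app/document/`: `PlainTextExtractor` and its `summarize` callback are hypothetical stand-ins for the MarkItDown conversion and the capped LLM call, and only the `extract` / `extract_from_url` names, the `(full_text, summary)` return pair, the 50K-character input cap, and the empty-summary fallback come from this document.

```python
import os
import shutil
import tempfile
import urllib.request
from abc import ABC, abstractmethod
from typing import Callable

MAX_SUMMARY_INPUT_CHARS = 50_000  # documented input cap for the summary call


class DocExtractor(ABC):
    @abstractmethod
    def extract(self, file_path: str) -> tuple[str, str]:
        """Return (full_text, summary) for a local file."""

    def extract_from_url(self, url: str) -> tuple[str, str]:
        """Download the source to a temp file, then delegate to extract()."""
        fd, tmp_path = tempfile.mkstemp()
        try:
            with os.fdopen(fd, "wb") as tmp, urllib.request.urlopen(url) as response:
                shutil.copyfileobj(response, tmp)
            return self.extract(tmp_path)
        finally:
            os.remove(tmp_path)


class PlainTextExtractor(DocExtractor):
    """Hypothetical extractor for UTF-8 text files."""

    def __init__(self, summarize: Callable[[str], str]) -> None:
        self._summarize = summarize  # stand-in for the single LLM call

    def extract(self, file_path: str) -> tuple[str, str]:
        with open(file_path, encoding="utf-8") as handle:
            full_text = handle.read()
        try:
            summary = self._summarize(full_text[:MAX_SUMMARY_INPUT_CHARS])
        except Exception:
            summary = ""  # degrade to an empty summary if summarization fails
        return full_text, summary
```

Keeping the download logic in the base class means a new format only has to implement `extract`, and both the knowledge layer and `parse_document` get URL support for free.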
Cortex, Action, and Memory are the static anatomy of a Digital Worker. What makes the worker alive is the way these three layers are wired into three running loops - and these loops are the actual core of Sico.
| Loop | Section | What it does | What it consumes / produces |
|---|---|---|---|
| Execution Loop | §4 | Turn an Operator goal into traced agent execution | Goal -> trajectory + outcome |
| Evolution Loop | §5 | Train the Cortex and distill trajectories into reusable strategies | Outcomes -> training signals / Playbook -> stronger next run |
| Evaluation Loop | §6 | Attribute why a task failed (planned, not yet shipped) | Failed trajectory -> L1–L4 verdict -> input back into Evolution |
The three loops are exactly the operational form of the vision in §1:
- Execution Loop represents the runtime layer where Digital Workers perform tasks. The Operator specifies the goal, and the Cortex-Action-Memory stack executes with workspace tools, delegated task-runtime runs, and, when a run declares `required_sandbox`, an observable Sandbox. This loop generates structured execution traces, including actions, intermediate states, tool outputs, and environmental feedback.
- Evolution Loop converts execution traces into reusable capability. Successful strategies and recurring failure patterns are extracted from prior runs and incorporated into the worker's future prompt context. In this way, capability accumulation happens at the platform layer, rather than relying only on model-weight updates.
- Evaluation Loop provides the governance and improvement mechanism. Failure attribution classifies errors into categories such as Task Instruction Issue, Digital Worker (DW) Capability Issue, and Environment Issue. These structured signals help the Operator determine the appropriate correction and provide targeted input for the Evolution Loop.
Together, the three loops form a continuous improvement cycle: Execution produces experience, Evolution converts experience into reusable capability, and Evaluation identifies where the worker or the environment should be improved. When applied repeatedly to the same Digital Worker, this cycle enables co-evolution between the Operator and the Digital Worker.
Chat is where Operator intent meets Digital Worker execution. It coordinates four communication mechanisms (gRPC, reverse gRPC, Kafka, SSE) and orchestrates workspace setup, agent reasoning, tool execution, planning, and cleanup in a single streaming pass.
sequenceDiagram
autonumber
actor FE as Frontend
participant BE as Backend
box Core · Chat Orchestration
participant CS as ChatService
participant RT as router.py<br/>hard_guard_route · llm_intent_check
participant WS as init_workspace
participant CA as ChatAgent
end
box Core · Task Runtime
participant TM as TaskManager
participant SUB as Submitter
participant SCH as BatchScheduler
participant RC as RunCoordinator
participant SC as SandboxCoordinator
participant EX as Executors<br/>Tool/Skill/SubAgent
end
participant SBX as Sandbox
participant LLM as LLM Hub
participant KAFKA as Kafka
FE->>BE: HTTP chat request + SSE
BE->>CS: gRPC StreamChat
CS->>WS: init_workspace<br/>(skills · knowledge · playbooks)
CS->>RT: route turn
opt hard_guard_route = UNSPECIFIED
RT->>LLM: llm_intent_check
LLM-->>RT: FAST / INSPECT / TASK
end
RT-->>CS: route + tools_for_route
CS->>CA: run_stream
loop reasoning · tool calls
CA->>LLM: chat completions (stream)
LLM-->>CA: text/tool-call chunks
CA->>WS: read/write plan.json · files
CA->>KAFKA: publish chunks
KAFKA-->>BE: consume chunks
BE-->>FE: SSE events
end
opt durable work (delegate)
CA->>TM: submit_prepared(batch)
TM->>SUB: submit
SUB->>SCH: run
SCH->>RC: execute(run)
opt required_sandbox
RC->>SC: acquire
SC->>SBX: lease · reset (stores SandboxLeaseRef)
end
RC->>EX: dispatch (DispatchRouter)
EX->>WS: read inputs · write results/
opt required_sandbox
EX->>SBX: drive sandbox HTTP API using leased endpoint
end
opt sub_agent
EX->>LLM: bounded structured calls
end
EX-->>RC: TaskResult
opt sandbox leased
SC->>SBX: release
end
RC-->>SCH: terminal state
SCH-->>SUB: results
SUB-->>TM: BatchResult
TM-->>CA: BatchResult
end
CS->>KAFKA: publish END event
KAFKA-->>BE: consume final event
BE-->>FE: SSE final event
- Frontend sends the chat request to Backend over HTTP and keeps an SSE stream open for user-visible updates.
- Backend forwards the turn to Core through `StreamChat`; the response stream itself remains decoupled through Kafka and SSE.
- `ChatService` initializes the workspace before routing, materializing skills, knowledge, Playbook snapshots, and attachments so every route starts with the same execution context.
- `ChatService` asks the router to classify the turn. `hard_guard_route` handles obvious cases; otherwise `llm_intent_check` calls the LLM Hub and returns `FAST`, `INSPECT`, or `TASK` plus the route-scoped tool set.
- `ChatService` starts `ChatAgent.run_stream()`. During the main reasoning loop, `ChatAgent` streams through the LLM Hub, reads and writes workspace files such as `plan.json`, and publishes chunks to Kafka for Backend to deliver as SSE events.
- If the agent calls `delegate`, durable work enters Task Runtime. `TaskManager`, `Submitter`, `BatchScheduler`, and `RunCoordinator` claim and execute each run; `SandboxCoordinator` leases and releases a sandbox when required; executors read inputs and write outputs under `results/`.
- When execution reaches a terminal state, Core publishes the final event through Kafka for SSE delivery, persists platform-owned state through reverse gRPC, then performs non-blocking cleanup and background work such as sandbox release, plan cleanup, Mem0 memory write, and Experience Learning ingestion.
init_workspace() runs at the start of every turn (before routing and before the agent
reasons) and refreshes a workspace keyed by (agent_instance_id, user_id), not by turn. Each
turn it clears and re-materializes reusable context (skills, knowledge, playbooks), clears the
workspace history/ scratch directory, and retains attachments plus prior delegated outputs
across turns:
agent_instance/{agent_instance_id}/user/{user_id}/
turn/
{turn_id}/
plan.json # Plan state (created during execution)
conversation.json # Full turn transcript (written after execution)
workspace/ # keyed by (agent_instance_id, user_id); refreshed each turn
attachments/ # retained across turns
{file_name} # Downloaded from SAS URLs
{file_name}_url.txt # Original URL reference
index.json # [{name, path, source_turn_id}, ...]
knowledge/ # cleared + re-copied each turn
{doc_id}/ or {link_id}/
index.json
playbooks/ # cleared + re-copied each turn (if Experience Learning enabled)
{section_name}.md # Rendered from Playbook bullets
skills/ # cleared + re-copied each turn
{skill_id}/SKILL.md # Copied from project + agent skill stores
index.json # [{id, name, description, actions}, ...]
results/ # retained across turns; outputs of the `delegate` task tool
{batch_id}/ # run records (status, payloads)
artifacts/ # files produced by the runs
case_sources/ # retained across turns
parsed_documents/ # archived workbook / parsed-document manifests (*.json + *.jsonl)
Skills injection: The skills section, capability cards rendered from each skill's resolved
actions and backed by skills/index.json, is appended to the user message so the LLM sees what
capabilities are available. The LLM autonomously decides which skills to read (via the read
tool) and follow.
Playbook injection: If Experience Learning is enabled, previously learned strategies are
rendered as .md files in the workspace (§5.1.3).
Recent conversation history: The last 3 turns of text are not read from workspace/history/
by the agent. init_workspace() clears that directory in the current implementation; prompt
history is loaded directly from the persisted turn store at prompt-build time
(_load_recent_history → CHAT_FS.read_conversation) and prepended to the prompt.
After the workspace is assembled but before the ChatAgent is built, Core decides which route the turn takes. The route determines the tool surface the agent is given (fast / inspect / task; see §4.6), so misrouting either starves a real task of the delegate tool or hands a simple greeting an oversized toolset. Routing is two-stage and lives in core/app/biz/chat/router.py:
Stage 1: hard guard (cheap heuristic). hard_guard_route(user_prompt, has_attachments) is a pure keyword + attachment check that runs first and costs no LLM call:
| Signal | Route |
|---|---|
| Empty prompt with no attachments | FAST (nothing to act on) |
| Task keywords (`execute`, `run all`, `batch`, …) | TASK |
| Short greeting / thanks (`hello`, `hi`, `hey`, `thanks`; ≤24 chars, no attachments) | FAST |
| Anything else (including an empty prompt with attachments) | UNSPECIFIED (defer to stage 2) |
A confident hard-guard hit (FAST or TASK) is used directly with confidence = 1.0 and skips the LLM intent check entirely, keeping common cases (a greeting, an obvious batch request) off the critical path.
Stage 2: LLM intent check. Only when the hard guard returns UNSPECIFIED does Core call llm_intent_check, a single-round LLM classifier with structured output (ChatIntentCheckerOutput { route, confidence, reason }). It is fed rich context so the decision reflects what the turn can actually do: the user prompt and attachments, the available task adapters and direct tools, the workspace skills section, prior conversation, and prior rerun / parsed-workbook sources.
Defensive default. Routing must never block a turn. Any failure in the LLM path (non-zero invocation, empty response, JSON parse error, or schema-validation failure) falls back to route = TASK, confidence = 0.0. The bias is deliberate: when unsure, expose the fuller task toolset rather than risk withholding delegate from genuine work. The chosen route is logged as chat_route_decided (with confidence and reason) for observability.
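The two stages and the defensive default can be sketched together. This is a simplified illustration, not the code in `router.py`: the keyword tuples contain only the entries named in the table (the source list is truncated), the matching rules (substring match for task keywords, exact match for greetings after stripping punctuation) are assumptions, and `route_turn`, `parse_intent_reply`, and the `classify` callback are hypothetical names.

```python
import json
from enum import Enum
from typing import Callable


class Route(str, Enum):
    FAST = "FAST"
    INSPECT = "INSPECT"
    TASK = "TASK"
    UNSPECIFIED = "UNSPECIFIED"


# Only the entries the document names; the real lists are longer.
TASK_KEYWORDS = ("execute", "run all", "batch")
GREETINGS = ("hello", "hi", "hey", "thanks")
MAX_GREETING_CHARS = 24


def hard_guard_route(user_prompt: str, has_attachments: bool) -> Route:
    """Stage 1: keyword + attachment heuristic, no LLM call."""
    text = user_prompt.strip().lower()
    if not text:
        return Route.UNSPECIFIED if has_attachments else Route.FAST
    if any(keyword in text for keyword in TASK_KEYWORDS):
        return Route.TASK
    is_greeting = text.rstrip("!.?, ") in GREETINGS  # assumed matching rule
    if not has_attachments and len(text) <= MAX_GREETING_CHARS and is_greeting:
        return Route.FAST
    return Route.UNSPECIFIED


def parse_intent_reply(raw: str) -> tuple[Route, float]:
    """Stage 2 parsing: any malformed reply falls back to TASK with 0.0."""
    try:
        data = json.loads(raw)
        route = Route(data["route"])
        confidence = float(data["confidence"])
        if route is Route.UNSPECIFIED:
            raise ValueError("classifier must pick a concrete route")
        return route, confidence
    except (ValueError, KeyError, TypeError):
        return Route.TASK, 0.0


def route_turn(user_prompt: str, has_attachments: bool,
               classify: Callable[[str], str]) -> tuple[Route, float]:
    guard = hard_guard_route(user_prompt, has_attachments)
    if guard is not Route.UNSPECIFIED:
        return guard, 1.0  # confident hard-guard hit skips the LLM
    try:
        raw = classify(user_prompt)
    except Exception:
        return Route.TASK, 0.0  # routing must never block a turn
    return parse_intent_reply(raw)
```

Every failure path converges on the same `(TASK, 0.0)` result, so a caller can distinguish a fallback from a genuine classification by the zero confidence.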
The agent loop is built on the Microsoft Agent Framework. Two key abstractions divide responsibility:
| Component | Role |
|---|---|
| ChatClient | LLM communication layer to bridge Agent Framework's BaseChatClient to LLMHub, and to handle tool call/result serialization, image input, streaming, reasoning effort control. |
| ChatAgent | Execution orchestrator to prepare messages, run the streaming loop, and manage text buffering, plan finalization, and cleanup. |
ChatAgent.run_stream() drives the main loop:
prepared_messages = system_prompt + last_3_turns_history + user_message
async for update in client.get_response(prepared_messages, stream=True, options={
tools: route_tools, # per-route built-ins + delegate_* adapter tools
tool_choice: "auto",
allow_multiple_tool_calls: True, # parallel tool execution
reasoning.effort: "high", # extended thinking
}):
├── Check plan cancellation (every 2 seconds via marker file)
├── If text update -> buffer (flush at 32 chars or before non-text content)
├── If tool call / tool result -> log + flush buffered text (not forwarded to client)
└── Text / plan / error updates -> response_queue -> reverse gRPC + Redis cache + Kafka
The Agent Framework's FunctionInvocationLayer automates the tool call cycle: when the LLM emits a function_call, the Framework executes the corresponding tool, injects the function_result back into the conversation, and lets the LLM continue. This loop repeats until the LLM produces a final text response or hits max_iterations.
Text buffering: Pure text updates are accumulated until the buffer reaches 32 characters and are then flushed, reducing SSE push frequency. Non-text content (tool calls, tool results) triggers an immediate flush of any buffered text. The tool-call and tool-result events themselves are logged and recorded in the turn's conversation.json; they are not forwarded to the client over SSE and not persisted as individual messages.
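The buffering rule can be sketched as a small class. The name `TextBuffer` and its `emit` callback are hypothetical; only the 32-character threshold and the flush-before-non-text behavior come from the description above.

```python
from typing import Callable


class TextBuffer:
    """Accumulate text chunks and emit them in batches."""

    def __init__(self, emit: Callable[[str], None], flush_at: int = 32) -> None:
        self._emit = emit
        self._flush_at = flush_at
        self._parts: list[str] = []
        self._size = 0

    def add_text(self, chunk: str) -> None:
        """Buffer a text chunk; flush once the threshold is reached."""
        self._parts.append(chunk)
        self._size += len(chunk)
        if self._size >= self._flush_at:
            self.flush()

    def on_non_text(self) -> None:
        """A tool call or tool result arrived: flush pending text first."""
        self.flush()

    def flush(self) -> None:
        """Emit buffered text as one update; do nothing when empty."""
        if self._parts:
            self._emit("".join(self._parts))
            self._parts = []
            self._size = 0


emitted: list[str] = []
buffer = TextBuffer(emitted.append)
buffer.add_text("a" * 10)   # below threshold, nothing emitted yet
buffer.add_text("b" * 22)   # reaches 32 characters, one update emitted
buffer.add_text("tail")     # buffered
buffer.on_non_text()        # flushes "tail" before the tool event
```

Flushing before non-text content preserves ordering: the client never sees text that logically preceded a tool call arrive after it.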
Retry: The agent retries once on failure (max_attempts = 2).
Planning is implemented through autonomous LLM tool calls, not hard-coded workflows. The LLM decides when to create, read, and update plans during execution.
Three plan tools are registered in BUILTIN_TOOLS:
| Tool | Purpose |
|---|---|
| `plan_read` | Read the current plan. The system prompt instructs the LLM to call this frequently - before starting work, after completing a step, and when uncertain about the next action. |
| `plan_write` | Create or update the plan. Enforces: only one step in_progress at a time, must complete current step before starting next, require_human_input pauses execution for Operator input. |
| `plan_tool_call_message_update` | Update the progress message of an existing tool call (when the original message is too long or outdated). |
PlanEditor: Each tool records its own progress via ctx.plan_editor:
PlanEditor # Writes plan.json, notifies ChatService -> PLAN event -> Kafka -> SSE
Plan data model:
Plan
├── title: str
├── steps: [PlanStep]
│ ├── title: str
│ ├── status: pending | in_progress | completed | failed | require_human_input | cancelled
│ └── tool_calls: [ToolCall]
│ ├── tool_name, message (recommended <20 words), tool_call_id
│ └── deliverables: [ToolDeliverable] (e.g., acquired Sandbox ID)
└── extra: PlanExtra (user, agent, timestamp)
The LLM-facing plan_write schema only accepts the first five statuses (pending, in_progress, completed, failed, require_human_input). cancelled is reserved for the system: the LLM cannot emit it directly. It is produced by Plan.to_cancelled() whenever a cancellation marker file exists for the current turn (see below), so any subsequent plan_read reflects the cancelled state.
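As a rough illustration of the status rules, the sketch below models a reduced Plan with the single-in_progress check and a cancellation projection. The field set is trimmed, validate_llm_write is a hypothetical helper, and the exact behavior of to_cancelled (here: every unfinished step becomes cancelled) is an assumption rather than the documented implementation.

```python
from dataclasses import dataclass, replace

# Statuses the LLM-facing plan_write schema accepts; "cancelled" is system-only.
LLM_STATUSES = {"pending", "in_progress", "completed", "failed", "require_human_input"}


@dataclass(frozen=True)
class PlanStep:
    title: str
    status: str = "pending"


@dataclass(frozen=True)
class Plan:
    title: str
    steps: tuple[PlanStep, ...] = ()

    def validate_llm_write(self) -> None:
        """Hypothetical check mirroring the plan_write constraints."""
        for step in self.steps:
            if step.status not in LLM_STATUSES:
                raise ValueError(f"status {step.status!r} is not writable by the LLM")
        if sum(step.status == "in_progress" for step in self.steps) > 1:
            raise ValueError("only one step may be in_progress at a time")

    def to_cancelled(self) -> "Plan":
        """Assumed semantics: steps that have not finished become 'cancelled'."""
        steps = tuple(
            step if step.status in ("completed", "failed") else replace(step, status="cancelled")
            for step in self.steps
        )
        return replace(self, steps=steps)
```

Keeping the projection as a pure function of the stored plan means the persisted plan.json does not have to be rewritten for plan_read to show the cancelled state.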
Cancellation: Frontend calls Backend's CancelPlan HTTP API -> Backend forwards via gRPC to Core -> Core writes a marker file (CHAT_FS.plan.write_cancelled_marker) -> the agent loop polls is_plan_cancelled every 2 seconds (CHECK_CANCELLED_PLAN_INTERVAL_SECONDS) and breaks out of generation on detection. From that point on, plan_read returns the plan projected through to_cancelled(), surfacing the cancelled status to both the agent prompt and the frontend.
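The polling side of this flow might look like the following sketch. The function name stream_until_cancelled and the use of a plain file-existence check are assumptions for illustration; only the marker-file mechanism and the 2-second interval come from the text.

```python
import asyncio
from pathlib import Path
from typing import AsyncIterator, TypeVar

T = TypeVar("T")

CHECK_CANCELLED_PLAN_INTERVAL_SECONDS = 2.0


async def stream_until_cancelled(
    updates: AsyncIterator[T],
    marker: Path,
    interval: float = CHECK_CANCELLED_PLAN_INTERVAL_SECONDS,
) -> AsyncIterator[T]:
    """Forward updates, stopping once the cancellation marker file appears.

    The marker is checked at most once per `interval` seconds, so a fast
    token stream does not touch the file system on every chunk.
    """
    loop = asyncio.get_running_loop()
    next_check = loop.time()
    async for update in updates:
        now = loop.time()
        if now >= next_check:
            if marker.exists():
                break  # leave the generation loop on detection
            next_check = now + interval
        yield update
```

Because the check is tied to incoming updates, cancellation in this sketch is detected when the next update arrives after the interval has elapsed, not at an exact wall-clock deadline.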
Frontend interaction: Each plan update flows through PlanEditor.notify_plan_updated() -> ChatService -> PLAN-type ChatResponse -> Kafka -> SSE -> Frontend renders progress. Frontend can also poll Backend's GetPlan HTTP API (which proxies to Core via gRPC) for the full plan state.
Tools are organized into three categories. The built-in set is route-scoped: tools_for_route (core/app/biz/chat/router.py) gives the fast route no tools, the inspect route a read-only subset, and the task route the full set plus the delegate tool:
| Category | Tools | Registration |
|---|---|---|
| Built-in | context, plan_read, plan_write, plan_tool_call_message_update, read, grep, write_file, edit, remove, report, webfetch, curl, parse_document, download, search_memory, get_task_detail | BUILTIN_TOOLS list, exposed per route by tools_for_route |
| Task delegation | delegate (the kind argument selects the adapter, e.g. general, workbook) | build_adapter_tools(adapters) (TASK route only) |
| Sandbox actions | Performed per task run inside the task runtime, not by agent-side tools | Owned by SandboxCoordinator (see §4.7) |
Every tool receives a ToolContext via function_invocation_kwargs, providing access to the current user, agent instance, and plan editor.
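A simplified sketch of route scoping follows. It works on tool names rather than real tool objects, and the membership of the read-only subset is a guess made for illustration; the source only states that the inspect route gets a read-only subset, not which tools it contains.

```python
from enum import Enum


class Route(str, Enum):
    FAST = "fast"
    INSPECT = "inspect"
    TASK = "task"


BUILTIN_TOOLS = (
    "context", "plan_read", "plan_write", "plan_tool_call_message_update",
    "read", "grep", "write_file", "edit", "remove", "report", "webfetch",
    "curl", "parse_document", "download", "search_memory", "get_task_detail",
)

# Assumption: this particular subset is illustrative, not the real list.
READ_ONLY_TOOLS = ("context", "plan_read", "read", "grep", "search_memory", "get_task_detail")


def tools_for_route(route: Route) -> tuple[str, ...]:
    """Return the tool names exposed on a route."""
    if route is Route.FAST:
        return ()  # the fast route answers without tools
    if route is Route.INSPECT:
        return READ_ONLY_TOOLS
    return BUILTIN_TOOLS + ("delegate",)  # task route: full set plus delegation
```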
Sandbox leasing: sandbox reserve / acquire / reset / release is owned by the task runtime's SandboxCoordinator, which leases one sandbox per task run that declares a required_sandbox, publishes the ACQUIRED_SANDBOX deliverable card, and releases the lease when the run finishes - even if it fails or is cancelled (§4.7).
The built-in tools in §4.6 let the chat agent read, edit, and report on its workspace, but they deliberately stop short of durable side-effecting work: running commands, executing skills, and driving sandboxes. That work is delegated to a separate Task Runtime, reached through a single delegate tool on the task route. This keeps the chat agent's tool surface small and observable while giving "real work" its own scheduled, retried, crash-recoverable execution layer.
chat agent (task route)
│ delegate(kind, options_json)
▼
Adapter (general | workbook) core/app/biz/chat/adapters/
│ build_tasks() → PreparedTaskBatch (one or more TaskSpec)
▼
TaskManager.submit_prepared() core/app/biz/task_runtime/manager.py
│ Submitter: plan sandboxes, create batch + per-run records
▼
Scheduler → RunCoordinator (per run)
│ claim (fencing token) → acquire sandbox → execute → write result → release
▼
DispatchRouter → executor by kind:
├── tool (echo, file_convert, run_command via a command backend)
├── skill (execute resolved actions.json, zero LLM)
└── sub_agent (bounded LLM loop over an allow-listed capability set)
│
▼ BatchResult (per-run statuses + summaries) returned synchronously to delegate
The chat coroutine awaits the delegate call: it suspends until every run in the batch reaches a terminal state, then receives the aggregated payload as the tool result. The task runs themselves execute as separate asyncio tasks, with progress streamed back onto the plan while the chat agent waits.
delegate exposes one tool whose kind argument is a closed Literal over the registered adapters (build_default_adapters currently registers general and workbook); the options_json argument is a JSON string that decodes to that adapter's Pydantic options schema. Each adapter turns intent into a concrete PreparedTaskBatch:
- general: takes natural-language instructions and runs a single planner LLM call that maps each instruction to a dispatch (tool, skill, or sub_agent), choosing from the CapabilityCards exposed by resolved skills (§3.2 → Skill Resolver).
- workbook: extracts rows from a workbook (xlsx/csv/JSONL) and expands each case into a TaskSpec stamped with a concrete skill dispatch, used for structured batch execution such as Android test suites.
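The delegate contract (a closed set of kinds plus a JSON string decoded into a per-adapter schema) can be sketched with the standard library. The real code uses Pydantic schemas; the dataclasses below stand in for them, and their field names (instructions, path) are hypothetical.

```python
import json
from dataclasses import dataclass
from typing import Literal, get_args

AdapterKind = Literal["general", "workbook"]


@dataclass(frozen=True)
class GeneralOptions:
    instructions: list[str]  # hypothetical field name


@dataclass(frozen=True)
class WorkbookOptions:
    path: str  # hypothetical field name


OPTION_SCHEMAS = {"general": GeneralOptions, "workbook": WorkbookOptions}


def decode_delegate_call(kind: str, options_json: str):
    """Validate the kind and decode options_json into that adapter's options."""
    if kind not in get_args(AdapterKind):
        raise ValueError(f"unknown adapter kind: {kind!r}")
    payload = json.loads(options_json)  # malformed JSON raises a ValueError subclass
    if not isinstance(payload, dict):
        raise ValueError("options_json must decode to a JSON object")
    try:
        return OPTION_SCHEMAS[kind](**payload)
    except TypeError as exc:  # missing or unexpected fields
        raise ValueError(f"invalid options for {kind!r}: {exc}") from exc
```

Passing options as a JSON string keeps the tool signature identical for every adapter, at the cost of moving schema validation from the tool-call layer into the adapter boundary.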
The sub_agent dispatch is a bounded, sandboxed LLM loop (core/app/biz/task_runtime/executors/sub_agent.py). Each step makes one structured-output LLM call that either calls a capability or returns a final answer; the loop is capped by max_steps (default DEFAULT_MAX_STEPS = 12) and may only call capabilities on its dispatch's allow-list. This gives a delegated task its own constrained reasoning agent without exposing arbitrary tools or unbounded iteration.
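A minimal sketch of such a bounded loop is shown below. The decide callable stands in for the structured-output LLM call, and the behavior on a disallowed capability (report an error back to the model) and on exhausting the step budget (raise) are assumptions; the source specifies only the step cap and the allow-list.

```python
from dataclasses import dataclass
from typing import Callable, Union

DEFAULT_MAX_STEPS = 12


@dataclass(frozen=True)
class CallCapability:
    name: str
    args: dict


@dataclass(frozen=True)
class FinalAnswer:
    text: str


Decision = Union[CallCapability, FinalAnswer]


def run_sub_agent(
    decide: Callable[[list[str]], Decision],   # stand-in for one LLM call per step
    capabilities: dict[str, Callable[..., str]],
    allow_list: set[str],
    max_steps: int = DEFAULT_MAX_STEPS,
) -> str:
    """Run at most max_steps decisions, calling only allow-listed capabilities."""
    observations: list[str] = []
    for _ in range(max_steps):
        decision = decide(observations)
        if isinstance(decision, FinalAnswer):
            return decision.text
        if decision.name not in allow_list:
            # Assumed behavior: refuse the call and let the model see the error.
            observations.append(f"error: capability {decision.name!r} is not allowed")
            continue
        observations.append(capabilities[decision.name](**decision.args))
    raise RuntimeError(f"sub-agent exceeded max_steps={max_steps}")
```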
The task runtime separates what is being executed from where command-like work runs:
| Axis | Choices | Meaning |
|---|---|---|
| Dispatch kind | tool, skill, sub_agent | The semantic unit of work selected by an adapter or sub-agent planner. |
| Command backend | local, docker, k8s | The physical execution environment for command-like work. |
This matters because run_command is not exposed as a chat built-in tool. It is a task-runtime tool selected only through delegated planning and executed by ToolExecutor through the configured CommandBackend. The runtime tool catalog currently includes:
| Runtime tool | Behavior |
|---|---|
| run_command | Executes an exact shell command from args.command through the configured command backend. |
| file_convert | Converts workspace-relative Excel .xlsx / .xlsm files to CSV artifacts. |
| echo | Emits a literal message, mainly for smoke tests and placeholder runs. |
Only run_command is lowered to a CommandSpec and sent through the CommandBackend; echo and file_convert run in process inside ToolExecutor.
Skill execution uses the same backend axis: a resolved skill action is lowered to argv steps from resolved/actions.json, then SkillExecutor runs those steps through the configured CommandBackend. A sub_agent does not get arbitrary shell access; it can only call capabilities on its allow-list, and capability calls are bridged back to the same tool / skill executors.
CommandBackend selection is deployment-driven:
| Backend | How it runs | Isolation and storage notes |
|---|---|---|
| local | Runs commands as child processes on the Core host. | No process/container isolation; the workspace is the host directory. This is the zero-config default for direct local development. |
| docker | Runs each command in a throwaway docker run --rm container with bind mounts. | Docker is opt-in via TASK_RUNTIME_BACKEND=docker; it is never auto-selected just because Docker is installed. |
| k8s | Runs commands in a per-run Kubernetes sandbox pod (ensure -> exec -> delete). | Auto-selected when Core is running in-cluster unless TASK_RUNTIME_BACKEND overrides it. |
For container-style backends (docker / k8s), the shared workspace is mounted read-only for command execution, and durable outputs should be written under $SICO_RESULT_DIR; the runtime then collects and publishes those files as artifacts. This command backend mechanism is distinct from Android emulator sandbox leasing: Android / GUI sandboxes are acquired only for runs that declare a required_sandbox, while command backends decide where shell commands and resolved skill steps execute.
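The selection rules in the backend table can be summarized in a small sketch. The function name is hypothetical, and detecting "in-cluster" through the KUBERNETES_SERVICE_HOST variable (which Kubernetes injects into pods) is an assumption about how Core might do it, not a statement of the actual check.

```python
import os
from typing import Mapping

VALID_BACKENDS = ("local", "docker", "k8s")


def select_command_backend(env: Mapping[str, str] = os.environ) -> str:
    """Pick a command backend: explicit override, then in-cluster, then local."""
    explicit = env.get("TASK_RUNTIME_BACKEND", "").strip().lower()
    if explicit:
        if explicit not in VALID_BACKENDS:
            raise ValueError(f"unsupported TASK_RUNTIME_BACKEND: {explicit!r}")
        return explicit
    # Assumption: in-cluster detection via the variable Kubernetes sets in pods.
    if env.get("KUBERNETES_SERVICE_HOST"):
        return "k8s"
    # Docker is never auto-selected; local is the zero-config default.
    return "local"
```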
Runs are not fire-and-forget coroutines; they are persisted records governed by an explicit state machine (core/app/biz/task_runtime/state_machine.py):
- States: runs move QUEUED → RUNNING → a terminal state (COMPLETED, FAILED, CANCELLED, TIMED_OUT, BLOCKED); a batch can settle as PARTIAL when runs have mixed outcomes. Only retryable-terminal runs may reopen to QUEUED, guarded by compare-and-set.
- Fencing tokens: claim_run returns a token that write_result must present, so a stale worker cannot overwrite a run that was reclaimed after a crash or timeout.
- Idempotency: batch/run creation is keyed by an idempotency key, so a retried submission does not duplicate work.
- Recovery: a StaleReconciler reopens or fails runs orphaned by a crashed worker.
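The fencing-token guarantee can be shown with an in-memory sketch. The Run record, the StaleTokenError type, and reclaim_orphan (standing in for what a StaleReconciler does when it reopens an orphaned run) are hypothetical; the real store persists state through reverse gRPC rather than in process memory.

```python
import secrets
from dataclasses import dataclass
from typing import Optional

TERMINAL_STATES = {"COMPLETED", "FAILED", "CANCELLED", "TIMED_OUT", "BLOCKED"}


class StaleTokenError(Exception):
    """Raised when a worker presents a token that no longer owns the run."""


@dataclass
class Run:
    state: str = "QUEUED"
    token: Optional[str] = None
    result: Optional[str] = None


def claim_run(run: Run) -> str:
    """QUEUED -> RUNNING; hand the claiming worker a fresh fencing token."""
    if run.state != "QUEUED":
        raise ValueError(f"cannot claim a run in state {run.state}")
    run.state = "RUNNING"
    run.token = secrets.token_hex(8)
    return run.token


def write_result(run: Run, token: str, state: str, result: str) -> None:
    """RUNNING -> terminal, but only for the worker holding the current token."""
    if state not in TERMINAL_STATES:
        raise ValueError(f"{state} is not a terminal state")
    if run.state != "RUNNING" or token != run.token:
        raise StaleTokenError("run was reclaimed; result discarded")
    run.state, run.result, run.token = state, result, None


def reclaim_orphan(run: Run) -> None:
    """Hypothetical reconciler step: reopen an orphaned run for another worker."""
    run.state, run.token = "QUEUED", None
```

If a worker stalls, the run is reopened and claimed again, and the original worker later wakes up, its write_result fails on the token check instead of overwriting the newer attempt.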
The task runtime owns no MySQL connection of its own. It persists batch/run state, claims, results, and progress through a dedicated reverse gRPC service, ReverseTaskRuntimeService (§4.9), backed in production by DbRunStore (with FileRunStore for tests). Sandbox leasing follows the same pattern: SandboxCoordinator reserves, acquires, resets, and releases sandboxes per run via the backend's reverse sandbox service, and guarantees release on every terminal outcome (§4.8).
A Sandbox is an isolated, observable environment where Digital Workers execute real operations - mobile app testing, Windows automation, or general compute tasks.
Currently Sico ships the Android emulator sandbox (MuMu Player-based, ADB + HTTP API) for mobile app automation. The sandbox subsystem is designed to be extensible - additional runtime types can be added by implementing a provider adapter and exposing an HTTP control API; the task runtime reaches each sandbox through its http_api_base_url without new agent-side tool code.
Assign (Web Client) Reserve + Acquire Reset
───────────────── ────────────────── ──────────
Admin assigns sandbox Task runtime leases one Soft-reset the
instances to an agent sandbox per task run that environment before
instance via a Redis declares required_sandbox the run executes
lease pool (SandboxCoordinator)
Use Release
────────── ──────────────
The run drives the sandbox Lease returned to the pool
HTTP API (tap, install, …) when the run finishes
Automatic cleanup: SandboxCoordinator releases each run's lease when the run reaches a terminal state - with retries (release), cross-instance fallback (release_stale), and bulk cleanup (release_many) - so sandboxes are never leaked even on failure or cancellation.
Sandbox capabilities are exposed as HTTP endpoints on each sandbox instance, not as a per-endpoint set of agent-side FunctionTools. The flow is owned end-to-end by the task runtime (§4.7):
- The chat agent delegates a task (the delegate tool with kind="general" or kind="workbook") whose spec declares a required_sandbox.
- SandboxCoordinator reserves, acquires, and resets one sandbox for that run and exposes its http_api_base_url.
- The run drives the sandbox HTTP API (e.g. POST /input/tap, POST /apps/install-url) and the coordinator releases the lease when the run completes.
This keeps the agent-facing tool surface small and uniform across sandbox types: adding a new sandbox runtime requires implementing its HTTP API, not generating a new family of tool wrappers. A typed, per-endpoint generator (OpenAPI → FunctionTool) is on the roadmap but not part of the current release.
Sandboxes provide operator-facing observability during execution:
- VNC/H264 live streams: Backend proxies WebSocket streams, allowing Operators to watch runs in the browser
- Optional screenshots at key nodes: actions can attach visual state when screenshot capture is enabled and available
- Structured operation traces: tool calls, tool results, plan state, and available observations are recorded for audit and learning
Four mechanisms work in concert during a single chat turn:
gRPC (Backend -> Core, :50053): StreamChat accepts the ChatRequest and returns immediately with an empty ChatDirectResponse. This is intentionally not a server-streaming RPC - the actual response stream is decoupled via Kafka.
Reverse gRPC (Core -> Backend, :50054): Core persists conversation messages, plan updates, task-runtime batch/run state, and other platform-owned writes by calling reverse services such as ReverseConversationService.create_message(). Tool-call and tool-result events are not pushed through this channel; they live in structured logs and the turn's conversation.json workspace file. The Backend's reverse gRPC server registers four services (ReverseConversationRPC, ReverseKnowledgeRPC, ReverseSandboxRPC, ReverseTaskRuntimeRPC); on the Core side these are exposed as singletons, each bound at startup to a grpc.insecure_channel opened against REVERSE_GRPC_ADDRESS. (A ReverseLLMHubService client stub also exists but is not yet wired into the Backend server.)
Kafka event bus (Core -> Backend): Each response chunk is wrapped in a TopicMessage with a sequence number and published to the core-backend Kafka topic. Backend subscribes, buffers messages by (conversationId, turnId), and flushes in sequence order (tolerating gaps up to GAP_MAX = 5 before force-flushing). Internal messages (is_internal) are filtered out - only user-visible content reaches the frontend.
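The ordered flush can be sketched as a small reorder buffer, written in Python purely for illustration (the Backend's implementation language is not stated here). The TurnBuffer class is hypothetical, sequence numbers are assumed to start at 0, and the interpretation of GAP_MAX (force-flush once more than GAP_MAX messages are waiting behind a missing sequence number) is an assumption; the source does not define the exact trigger.

```python
from collections import defaultdict
from dataclasses import dataclass, field

GAP_MAX = 5


@dataclass
class TurnBuffer:
    """Reorders messages for one (conversationId, turnId) pair."""

    next_seq: int = 0
    pending: dict[int, str] = field(default_factory=dict)

    def offer(self, seq: int, payload: str) -> list[str]:
        """Accept one message; return the payloads that can be pushed now."""
        if seq < self.next_seq:
            return []  # already flushed past this point: drop the late arrival
        self.pending[seq] = payload
        out = self._drain()
        if len(self.pending) > GAP_MAX:
            # Assumed force-flush rule: stop waiting for the missing message.
            self.next_seq = min(self.pending)
            out += self._drain()
        return out

    def _drain(self) -> list[str]:
        """Emit consecutive messages starting at next_seq."""
        out = []
        while self.next_seq in self.pending:
            out.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return out


# One buffer per active turn, keyed the way the text describes.
buffers: defaultdict[tuple[str, str], TurnBuffer] = defaultdict(TurnBuffer)
```

Bounding the wait trades strict completeness for liveness: a single lost message delays the stream briefly instead of stalling the turn indefinitely.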
SSE (Backend -> Frontend): Backend maintains a ChatConnection per active chat turn, holding an ordered buffer and a sender. On each Kafka flush, messages are serialized as ChatStreamResponse JSON and pushed via SSE. A keepalive (sent by Core every 5 seconds) prevents connection timeout during long tool executions.
Reconnection recovery: Core caches each in-progress response in Redis (ongoing-chat:conversation:{id}:turn:{turnId}, TTL 3 days) before publishing it to Kafka. If a client disconnects and reconnects while the turn is still active, the Backend replays cached messages before resuming the live stream. On normal completion, Core deletes the ongoing-chat cache keys.
The Evolution Loop is how Sico operationalizes the Co-Evolution vision. It spans two complementary tracks that improve a Digital Worker along different axes and on different time scales:
- Action & Memory/Sense Evolution (training-free) closes the loop back into the surrounding system: the strategies the agent applies, the playbooks it reads, and the sense / tools it relies on. This is the Experience Learning subsystem (AEE / EPE) that distills reusable strategies from trajectories without touching model weights (§5.1).
- Cortex Evolution (training-based) closes the loop back into the model itself. Real-world execution outcomes are systematically reviewed, distilled into structured learning signals, and fed into SFT / RL pipelines so that baseline reasoning and decision-making capabilities continuously improve (§5.2).
Together the two tracks give a Digital Worker two distinct ways to improve through use: training-free evolution raises the ceiling of what the existing model can reliably accomplish in production, while training-based evolution raises the floor of intrinsic capability.
Action & Memory/Sense Evolution improves the agent around the model, focusing on the strategies the agent follows, the playbooks it consults, and how it uses tools and sense, all without updating model weights.
This is realized through Experience Learning, a framework that observes how Digital Workers execute tasks, distills reusable strategies from those executions, and feeds them back into future runs. Over time, tasks that once required human intervention can gradually become reliable, autonomous executions.
Experience Learning is described through two engines that operate at different time scales:
- AEE (Adaptive Experience Engine) focuses on in-task course correction. In the current implementation, this is a prompt-mediated path: when a step fails, the running agent diagnoses the failure, re-reads relevant workspace and Playbook files, and retries without writing to the Playbook or invoking a separate Reflector-Curator pipeline.
- EPE (Experience Process Engine) focuses on durable capability accumulation. It runs asynchronously after task completion. The full Reflector-to-Curator pipeline analyzes the trajectory and writes its output into the Playbook, which is a bullet-structured and sectioned artifact scoped to the project and agent context. These accumulated experiences can then be reused across future task executions.
Task execution (or step failure)
│
▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Trajectory │──────> │ Reflector │──────> │ Curator │
│ (execution │ │ (diagnose │ │ (strategy │
│ trace) │ │ outcomes) │ │ updates) │
└──────────────┘ └──────────────┘ └──────┬───────┘
│
DeltaBatch
│
┌──────────────┴──────────────┐
│ │
AEE (real-time) EPE (offline)
│ │
▼ ▼
Inject directly ┌──────────────┐
into next retry │ Playbook │──> Persist + Register
(skip playbook) │ (strategy │ via reverse gRPC
│ handbook) │
└──────┬───────┘
│
▼
Next agent execution
(enriched prompts)
The diagram shows the intended two-path architecture. In the current implementation, only EPE runs the Reflector-Curator pipeline after task completion. AEE closes the loop in-context through system-prompt instructions during the active task, then EPE can make the lesson durable after the session.
Trajectory: A structured execution trace captured as TrajectoryData including model outputs, tool calls, tool results, observations, environment state, and outcome signals when available.
Reflector: An LLM-based analyst that diagnoses execution outcomes. It identifies root causes, extracts atomic learnings scored by confidence, and evaluates whether previously cited strategies were helpful, harmful, or neutral.
Curator: An LLM-based update component that transforms Reflector analysis into Playbook mutations. It produces a DeltaBatch of atomic operations (ADD, UPDATE, TAG, REMOVE). The Curator prompt instructs it to follow rules:
- Keep each strategy atomic.
- Prefer UPDATE over ADD to avoid duplication.
- Avoid deleting strategies with helpful > 3 unless there is strong contrary evidence.
- Avoid strategies that depend on fragile, environment-specific details.
Playbook: A collection of Bullets (strategy entries) organized by section. Each Bullet tracks helpfulness counts, vector embeddings (for deduplication), and active/invalid status. Playbooks serialize to three formats: JSON (full state with embeddings and timestamps for persistence), TOON (tab-delimited compact encoding for direct LLM prompt injection), and Markdown (one <section>.md file per section, written into the agent workspace for the agent to read and Operators to audit).
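To make the Curator-to-Playbook step concrete, the sketch below applies a DeltaBatch to an in-memory Playbook. The Bullet and DeltaOp field names are hypothetical, embeddings are omitted, and treating REMOVE as a soft delete (marking the Bullet inactive) is an assumption suggested by the active/invalid status mentioned above.

```python
from dataclasses import dataclass


@dataclass
class Bullet:
    section: str
    content: str
    helpful: int = 0
    harmful: int = 0
    active: bool = True


@dataclass(frozen=True)
class DeltaOp:
    op: str             # ADD | UPDATE | TAG | REMOVE
    bullet_id: str
    section: str = ""   # used by ADD
    content: str = ""   # used by ADD and UPDATE
    tag: str = ""       # used by TAG: helpful | harmful | neutral


def apply_delta_batch(playbook: dict[str, Bullet], batch: list[DeltaOp]) -> None:
    """Apply each atomic operation in order, mutating the playbook in place."""
    for op in batch:
        if op.op == "ADD":
            playbook[op.bullet_id] = Bullet(op.section, op.content)
        elif op.op == "UPDATE":
            playbook[op.bullet_id].content = op.content
        elif op.op == "TAG":
            bullet = playbook[op.bullet_id]
            if op.tag == "helpful":
                bullet.helpful += 1
            elif op.tag == "harmful":
                bullet.harmful += 1
        elif op.op == "REMOVE":
            playbook[op.bullet_id].active = False  # assumed soft delete
        else:
            raise ValueError(f"unknown operation: {op.op!r}")
```

Expressing updates as small named operations, rather than regenerating the whole Playbook, keeps each change reviewable and limits how much an individual Curator call can rewrite.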
Deduplication: Unchecked duplication in the Playbook would waste prompt tokens, introduce conflicting guidance, and hurt auditability. Experience Learning adds an embedding-based second line of defense: vector-embed all Bullets, flag pairs above a cosine similarity threshold (default 0.85), and ask the Curator to emit consolidation ops (MERGE / DELETE / KEEP / UPDATE); KEEP decisions are persisted so the same pairs are not re-evaluated. This runs as an offline maintenance pass, not on the critical path of every execution.
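The candidate-flagging step of this pass can be sketched as follows. The function names are hypothetical, embeddings are plain lists of floats, and the exhaustive pairwise comparison is chosen for clarity; the source states only the cosine threshold of 0.85 and that KEEP decisions are remembered.

```python
import math
from itertools import combinations

SIMILARITY_THRESHOLD = 0.85


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def find_duplicate_candidates(
    embeddings: dict[str, list[float]],
    keep_pairs: frozenset = frozenset(),
    threshold: float = SIMILARITY_THRESHOLD,
) -> list[tuple[str, str]]:
    """Return Bullet id pairs whose similarity exceeds the threshold.

    Pairs already resolved as KEEP are skipped so they are not re-evaluated.
    """
    flagged = []
    for (id_a, vec_a), (id_b, vec_b) in combinations(sorted(embeddings.items()), 2):
        pair = (id_a, id_b)
        if pair in keep_pairs:
            continue
        if cosine(vec_a, vec_b) > threshold:
            flagged.append(pair)
    return flagged
```

The flagged pairs would then be handed to the Curator, which decides whether to merge, delete, keep, or update them; this sketch covers only the selection of candidates.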
Learned strategies feed back into agent execution through two paths:
Path A: Cross-task accumulation (EPE)
EPE decouples learning from the live chat path with a write-after / read-before pattern:
- Write (post-chat, fire-and-forget). When a chat session completes, ChatService schedules _try_experience_playbook_ingestion as a background task: it loads the turn's conversation.json, converts it to TrajectoryData, and runs the full Reflector → Curator → persist pipeline via add_playbook(). A PLAYBOOK_INGESTION message is emitted into the conversation so Operators can see that learning happened. Because it runs off the response path, slow LLM calls or persistence failures never block the chat reply.
- Read (pre-chat snapshot). Before the next session starts, workspace_init snapshots the current Playbook into the workspace as one Markdown file per section under playbooks/. The system prompt instructs the agent to read these files before acting and to re-read them when steps fail.
The two halves are intentionally asynchronous: the snapshot is taken once at session start, so EPE writes that land during a session do not affect the running agent; they take effect on the next session.
Path B: In-task self-correction (AEE)
When a step fails inside an active session, AEE closes the loop within the same task: the agent diagnoses the root cause, re-reads the relevant files under playbooks/ (and any execution logs already in the workspace), and retries with a better strategy. The lesson learned in this turn is not written back to the Playbook on its own; it becomes durable only after EPE distills it from the post-session trajectory.
In the current implementation this loop is realized in-context via a system-prompt directive: the running LLM plays both reflector and fixer, with no separate Reflector / Curator invocation on this path. A future iteration may upgrade it into a genuine online Reflector-Curator call without changing AEE's role.
This framework draws on recent work on evolving agent contexts, including ACE (Zhang et al., 2025) and Flex (Cai et al., 2025). From both, Sico inherits the idea that agent capability can be grown by accumulating reusable strategies in an explicit, auditable text artifact rather than by updating model weights. Four production-driven choices distinguish Sico's instantiation:
- Two complementary time scales. Sico operates the loop at two time scales: AEE closes the loop in-task on failure (re-reading the Playbook snapshot already in the workspace and retrying with a fix), while EPE runs the full Reflector-Curator pipeline asynchronously after the session and persists strategies into the Playbook (§5.1.3). A single failed step can be corrected within the same task (AEE), and once EPE has analyzed the trajectory the lesson is durably captured for future tasks, without the live chat path paying the cost of an online pipeline call.
- Multimodal, sandbox-grounded trajectories. A TrajectoryStep.state may carry a screenshot URL captured from observable Sandboxes (Android emulator today). When present, the Reflector resolves the URL to a base64 data URI and emits image blocks interleaved with the trajectory text, so a vision-capable LLM can use available UI state in GUI and mobile-automation domains.
- Operator-correction pathway reserved in the Reflector interface (partial). Sico's Reflector already accepts a ground_truth parameter alongside execution feedback, so Operator corrections are designed to enter the Reflector-Curator pipeline as a first-class signal.
- Embedded in a larger architecture, not a standalone optimizer. In Sico the Playbook is not the central artifact but one of the five memory layers described in §3.3, alongside in-turn scratchpad, recent history, Mem0 long-term facts, and project knowledge. It also participates in the three coordinated loops of §3.4, where the planned Evaluation Loop feeds L1–L4 failure attribution back into Experience Learning (§6.3). The same mechanism is exposed as a portable module for non-Sico agent stacks in §5.1.5.
The experience learning system is not limited to Sico's built-in chat agent. Any agentic system can use it through the integration pattern:
- Inject: Format the Playbook as context for the external agent using wrap_playbook_for_external_agent().
- Execute: The external agent runs its task normally.
- Learn: Convert the execution results into a TrajectoryData and call ExperienceRunner.learn_from_trajectory().
See core/app/experiences/integrations/base.py for a complete example.
Status: training pipeline not in this open-source release. The SFT / RL pipeline and weight-update workflow are run internally. The open-source components produce and persist the upstream signals (execution traces and Operator corrections) so they can feed an external training pipeline of the Operator's choosing.
Cortex Evolution targets the base model itself, rather than the surrounding context. Instead of treating model training as a one-time event, execution outcomes are collected, converted into structured learning signals, and submitted to an offline training pipeline that periodically produces updated model weights.
A Digital Worker starts from a baseline model that has been fine-tuned on structured domain knowledge for its role. During production use, the Execution Loop (§4) emits trajectories and Operator corrections; the planned Evaluation Loop (§6), once shipped, will tag failure cases with L1–L4 attribution; the Experience Learning pipeline (§5.1) records which strategies proved helpful or harmful. These artifacts are the upstream signals that the training pipeline consumes. SFT and RL are used as the optimization step inside that pipeline; the open-source codebase covers signal generation and persistence, not the trainer itself.
Compared with training-free evolution, this track operates on a longer cycle and updates a smaller set of artifacts (model weights), but it is the only mechanism that can raise the baseline capability of the worker. The two tracks are designed to be complementary: training-free evolution adapts the system between training runs, while training-based evolution periodically lifts the floor from which training-free evolution operates.
Status: planned, not yet shipped. Evaluation is not part of the current Sico release. This section describes the design that will land in a follow-up version. It is included here because Evaluation is the missing half of the Co-Evolution loop introduced in §1.2 - Experience Learning teaches the worker, Evaluation tells the platform whether the teaching worked.
If Experience Learning is how a Digital Worker gets better, Evaluation is how the platform knows where it is failing and why. The planned module focuses on failure attribution: given a failed task trajectory, it identifies the root cause and turns it into a structured signal for the Operator, Experience Learning, and future training pipelines.
Evaluation is intentionally narrow. It will not compute generic quality scores or recommend "hire/fire" verdicts. Its job is to produce an auditable explanation of what went wrong, where it happened, and who or what should be improved.
The input is a failed trajectory: plan steps, tool calls, tool results, reflections, and screenshots when available. The output is a structured attribution result with the key failing step, a short analysis, a concrete suggestion, and a confidence score.
Every attribution output traces a complete path from L1 down to L4 (no skipped levels), so distributions are directly comparable across runs and across DW types.
| Level | Purpose | Examples |
|---|---|---|
| L1 Problem Owner | Who is responsible? | Task Instruction Issue · DW Capability Issue · Environment Issue |
| L2 Module | Which subsystem failed? (only for capability issues) | Perception · Understanding & Planning · Execution · Verification |
| L3 Error Type | What kind of error? | UI Element Recognition Error · Planning Error · App/System Error |
| L4 Sub-error Type | The concrete failure mode | Similar Element Confusion · Wrong Step Ordering · App Crash |
The error tree is pluggable per DW type, so a new DW role can ship its own taxonomy without touching the attributor itself.
Attribution outputs are designed to become control signals:
- Into Experience Learning. L4 sub-error types and suggestions can become Playbook updates through the Reflector → Curator pipeline (§5.1).
- Into the Operator. Aggregated L1–L4 distributions tell humans what kind of help a Digital Worker needs: better instructions, a more stable sandbox, or a model upgrade.
- Into training. When a training pipeline is available, attribution results can help select and label hard examples for SFT / RL.
Together with Experience Learning, Evaluation closes the diagnostic half of the Co-Evolution loop: Experience Learning teaches the worker, Evaluation tells the platform what still needs teaching.
Sico's architecture is designed around a single principle: capability should compound through operator-guided execution.
┌───────────────────────────────────────────────────────────────┐
│ CO-EVOLUTION LOOP │
│ │
│ Operator --goal--> Agent Execution │
│ ├─ LLM Hub (multi-provider Cortex) │
│ ├─ Tools + Skills (curated Action) │
│ ├─ Knowledge + Memory & Sense │
│ ├─ Library enrichment <───────────────┐ │
│ ├─ Execute with Sandboxes when needed │ │
│ ├─ On failure -> real-time exp <───┐ │ │
│ ├─ Stream results via Kafka -> SSE │ │ │
│ ├─ Persist via reverse gRPC │ │ │
│ └─ Experience Learning: │ │ │
│ trajectory -> Reflector │ │ │
│ -> Curator -> playbook ────────┴──┘ │
└───────────────────────────────────────────────────────────────┘
Every component serves this loop:
- Reverse gRPC keeps Core decoupled from primary relational persistence, so execution can scale without Core owning MySQL credentials or schema migrations.
- LLM Hub makes the Cortex provider-agnostic, so the best model for each task can be selected without code changes.
- Skills, Knowledge, Playbooks, and runtime context differentiate roles on top of a shared built-in tool set, so Digital Workers are specialized without fragmenting the core action surface.
- Sandboxes provide isolated, observable execution for runs that require external environments such as Android emulators, so those parts of the work are reproducible and auditable.
- Experience Learning closes the loop, turning every execution into a learning opportunity that improves the next run.
The result is a platform where Digital Workers don't just run tasks - they get better at them, guided by the humans who work alongside them. And the Operator's role evolves from performing repetitive work to steering the evolution of a digital workforce.

