
feature: define a durable state and recovery contract across router and dashboard surfaces #1606

@Xunzhuo

Description

Describe the feature

Establish a single repository-wide control-plane state contract for the router runtime and dashboard surfaces, then use that contract to drive gradual hardening of the highest-risk stateful paths.

Today, state ownership is fragmented across in-memory defaults, workspace-local JSON and SQLite files, in-process registries, temp-owned runtime status files, subprocess stdout parsing, and frontend localStorage. This issue proposes a staged cleanup that makes durability, restart behavior, and telemetry semantics explicit instead of implicit.

Suggested scope for the first workstream:

  • inventory the major runtime and dashboard state surfaces and classify each one as ephemeral, restart-safe local, shared durable workflow state, or audit/analytics telemetry
  • make the highest-risk router surfaces restart-safe, especially response storage, router replay, vector-store metadata, file metadata, and runtime status
  • move dashboard workflow and progress state toward server-owned durable records instead of browser localStorage, in-memory job maps, or log-derived status
  • treat CLI-mounted .vllm-sr workspace state as a local-dev adapter instead of the only implicit persistence contract
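The four classes in the first bullet could be expressed as data so the inventory is reviewable (and eventually lintable) rather than prose-only. A minimal Go sketch, assuming hypothetical names — none of these types exist in-tree:

```go
package main

import "fmt"

// DurabilityClass is a hypothetical classification for the proposed
// state inventory; the names mirror the four classes in the bullet above.
type DurabilityClass int

const (
	// Ephemeral: safe to lose on restart (caches, in-process registries).
	Ephemeral DurabilityClass = iota
	// RestartSafeLocal: must survive a process restart on the same host
	// (e.g. workspace-local JSON/SQLite, runtime status files).
	RestartSafeLocal
	// SharedDurable: shared workflow state that must survive host loss
	// (e.g. dashboard job records, vector-store metadata).
	SharedDurable
	// Telemetry: audit/analytics data; append-only and loss-tolerant.
	Telemetry
)

// Surface ties a state surface to its owning path and durability class.
type Surface struct {
	Name  string // e.g. "responsestore"
	Owner string // e.g. "src/semantic-router/pkg/responsestore"
	Class DurabilityClass
}

func main() {
	// Illustrative inventory entries; classifications are assumptions.
	inventory := []Surface{
		{Name: "responsestore", Owner: "src/semantic-router/pkg/responsestore", Class: RestartSafeLocal},
		{Name: "useConversationStorage", Owner: "dashboard/frontend", Class: SharedDurable},
	}
	for _, s := range inventory {
		fmt.Printf("%-24s class=%d\n", s.Name, s.Class)
	}
}
```

A table in this shape makes the first child issue concrete: each surface gets exactly one class, and disagreements happen in review rather than at restart time.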

Primary layer

global level

Why this layer?

This gap spans router defaults, dashboard backend/frontend behavior, CLI-mounted local state, and shared recovery/telemetry rules. It is intentionally cross-cutting rather than owned by one signal, plugin, or single subsystem.

Why do you need this feature?

The repository is already beyond purely stateless routing. As traffic, workflows, and contributors grow, implicit state semantics create scaling risks:

  • user-visible features can silently lose state on restart because defaults still look production-ready while remaining memory-backed
  • restart behavior is inconsistent across router replay, Response API, vector-store/file metadata, ML pipeline jobs, model-research campaigns, OpenClaw room state, and dashboard chat/session behavior
  • dashboard health and long-running workflow progress still depend on log scraping, temp files, or subprocess stdout parsing in several paths
  • contributors have no single source of truth for deciding where new state should live or how recovery should work

A durable inventory plus staged follow-up changes would make roadmap planning, issue splitting, and cross-stack collaboration substantially cheaper.

Additional context

Repository evidence and follow-up assets already prepared in-tree:

  • docs/agent/tech-debt/td-034-runtime-and-dashboard-state-durability-and-telemetry-contract.md
  • docs/agent/plans/pl-0011-runtime-and-dashboard-state-durability-and-telemetry-ratchet.md

Representative code paths called out by the review:

  • src/semantic-router/pkg/config/canonical_defaults.go
  • src/semantic-router/pkg/responsestore/{factory.go,memory_store.go}
  • src/semantic-router/pkg/routerreplay/store/{factory.go,memory.go}
  • src/semantic-router/pkg/vectorstore/{manager.go,filestore.go}
  • src/semantic-router/pkg/startupstatus/status.go
  • dashboard/backend/{evaluation/db.go,mlpipeline/runner.go,modelresearch/manager.go}
  • dashboard/backend/handlers/{status_collectors.go,openclaw.go,openclaw_rooms.go}
  • dashboard/frontend/src/hooks/useConversationStorage.ts
  • src/vllm-sr/README.md

Suggested child issues after triage:

  1. Publish the router/dashboard state inventory and durability taxonomy.
  2. Make router-side metadata and replay/response state restart-safe.
  3. Unify dashboard workflow persistence and typed progress/health telemetry.
