
feature: define a durable state and recovery contract across router and dashboard surfaces #1606

@Xunzhuo

Description

Describe the feature

Establish a single repository-wide control-plane state contract for the router runtime and dashboard surfaces, then use that contract to drive gradual hardening of the highest-risk stateful paths.

Today, state ownership is fragmented across in-memory defaults, workspace-local JSON and SQLite files, in-process registries, temp-owned runtime status files, subprocess stdout parsing, and frontend localStorage. This issue proposes a staged cleanup that makes durability, restart behavior, and telemetry semantics explicit instead of implicit.

Suggested scope for the first workstream:

  • inventory the major runtime and dashboard state surfaces and classify each one as ephemeral, restart-safe local, shared durable workflow state, or audit/analytics telemetry
  • make the highest-risk router surfaces restart-safe, especially response storage, router replay, vector-store metadata, file metadata, and runtime status
  • move dashboard workflow and progress state toward server-owned durable records instead of browser localStorage, in-memory job maps, or log-derived status
  • treat CLI-mounted .vllm-sr workspace state as a local-dev adapter instead of the only implicit persistence contract
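The four classes in the first bullet could be expressed as data so the inventory is reviewable (and eventually lintable) rather than prose-only. A minimal Go sketch, assuming hypothetical names — none of these types exist in-tree:

```go
package main

import "fmt"

// DurabilityClass is a hypothetical classification for the proposed
// state inventory; the names mirror the four classes in the bullet above.
type DurabilityClass int

const (
	// Ephemeral: safe to lose on restart (caches, in-process registries).
	Ephemeral DurabilityClass = iota
	// RestartSafeLocal: must survive a process restart on the same host
	// (e.g. workspace-local JSON/SQLite, runtime status files).
	RestartSafeLocal
	// SharedDurable: shared workflow state that must survive host loss
	// (e.g. dashboard job records, vector-store metadata).
	SharedDurable
	// Telemetry: audit/analytics data; append-only and loss-tolerant.
	Telemetry
)

// Surface ties a state surface to its owning path and durability class.
type Surface struct {
	Name  string // e.g. "responsestore"
	Owner string // e.g. "src/semantic-router/pkg/responsestore"
	Class DurabilityClass
}

func main() {
	// Illustrative inventory entries; classifications are assumptions.
	inventory := []Surface{
		{Name: "responsestore", Owner: "src/semantic-router/pkg/responsestore", Class: RestartSafeLocal},
		{Name: "useConversationStorage", Owner: "dashboard/frontend", Class: SharedDurable},
	}
	for _, s := range inventory {
		fmt.Printf("%-24s class=%d\n", s.Name, s.Class)
	}
}
```

A table in this shape makes the first child issue concrete: each surface gets exactly one class, and disagreements happen in review rather than at restart time.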

Primary layer

global level

Why this layer?

This gap spans router defaults, dashboard backend/frontend behavior, CLI-mounted local state, and shared recovery/telemetry rules. It is intentionally cross-cutting rather than owned by one signal, plugin, or single subsystem.

Why do you need this feature?

The repository is already beyond purely stateless routing. As traffic, workflows, and contributors grow, implicit state semantics create scaling risks:

  • user-visible features can silently lose state on restart because defaults still look production-ready while remaining memory-backed
  • restart behavior is inconsistent across router replay, Response API, vector-store/file metadata, ML pipeline jobs, model-research campaigns, OpenClaw room state, and dashboard chat/session behavior
  • dashboard health and long-running workflow progress still depend on log scraping, temp files, or subprocess stdout parsing in several paths
  • contributors have no single source of truth for deciding where new state should live or how recovery should work

A durable inventory plus staged follow-up changes would make roadmap planning, issue splitting, and cross-stack collaboration substantially cheaper.

Additional context

Repository evidence and follow-up assets already prepared in-tree:

  • docs/agent/tech-debt/td-034-runtime-and-dashboard-state-durability-and-telemetry-contract.md
  • docs/agent/plans/pl-0011-runtime-and-dashboard-state-durability-and-telemetry-ratchet.md

Representative code paths called out by the review:

  • src/semantic-router/pkg/config/canonical_defaults.go
  • src/semantic-router/pkg/responsestore/{factory.go,memory_store.go}
  • src/semantic-router/pkg/routerreplay/store/{factory.go,memory.go}
  • src/semantic-router/pkg/vectorstore/{manager.go,filestore.go}
  • src/semantic-router/pkg/startupstatus/status.go
  • dashboard/backend/{evaluation/db.go,mlpipeline/runner.go,modelresearch/manager.go}
  • dashboard/backend/handlers/{status_collectors.go,openclaw.go,openclaw_rooms.go}
  • dashboard/frontend/src/hooks/useConversationStorage.ts
  • src/vllm-sr/README.md

Suggested child issues after triage:

  1. Publish the router/dashboard state inventory and durability taxonomy.
  2. Make router-side metadata and replay/response state restart-safe.
  3. Unify dashboard workflow persistence and typed progress/health telemetry.
