
feat: memory-pressure gate and concurrency cap for new sessions #1138

Open
PaytonWebber wants to merge 7 commits into main from feat/oom-preflight-gate

Conversation

Collaborator

PaytonWebber commented Apr 21, 2026

Summary

  • Memory-pressure gate — pre-flight check before spinning up a new runner. Configurable via the new memoryGate config knob: true enables it at the default 85% pressure threshold, a number in (0, 1] sets a custom threshold (e.g. 0.9 = 90%), and false (or omitting the field) disables it. Uses only os.totalmem/os.freemem/process.memoryUsage/v8.getHeapStatistics so behaviour is identical on Linux and macOS.
  • Concurrency cap — new maxConcurrentRunners config rejects new work when already at capacity. Active count is derived live from isRunning() across Linear (agentSessionManager.getAllAgentRunners()) and chat (chatSessionHandler.getAllRunners()) runners, so there's no counter to drift.
  • User-visible rejection messages wired into all four runner-creation paths: Linear agentSessionCreated + agentSessionPrompted, GitHub webhook, GitLab webhook, and the Slack chat handler. Each platform posts via its native channel (Linear activity, GitHub PR comment, GitLab MR note, Slack thread reply).

Motivation

Cyrus previously had no visibility into host memory before accepting new sessions and no ceiling on concurrent runners, so under load the process could be SIGKILL'd by systemd-oomd or the kernel OOM killer with no warning to the user. This adds opt-in guardrails so Cyrus can refuse new work gracefully with a clear message instead of dying silently.

Config shape

{
  "memoryGate": true,
  "maxConcurrentRunners": 5
}

memoryGate also accepts a number in (0, 1] to set a custom pressure threshold (e.g. 0.9). Omitting either field preserves today's behaviour (no gate).
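
A minimal sketch of how the `memoryGate` value might resolve to a threshold, and how a pressure figure (the worst of process RSS over system memory, V8 heap used over the heap size limit, and system memory used over total) could be computed from the built-ins named in the summary. The `MemorySources` field names, the helper names, and the `>=` comparison are assumptions for illustration; the PR's actual `checkMemoryHealth` / `collectMemoryMetrics` signatures are not shown here.

```typescript
import * as os from "node:os";
import * as v8 from "node:v8";

// Hypothetical shape: the PR mentions an injectable MemorySources pattern
// but does not list its fields.
interface MemorySources {
  totalmem(): number;
  freemem(): number;
  rss(): number;
  heapUsed(): number;
  heapSizeLimit(): number;
}

const realSources: MemorySources = {
  totalmem: () => os.totalmem(),
  freemem: () => os.freemem(),
  rss: () => process.memoryUsage().rss,
  heapUsed: () => v8.getHeapStatistics().used_heap_size,
  heapSizeLimit: () => v8.getHeapStatistics().heap_size_limit,
};

const DEFAULT_THRESHOLD = 0.85;

// true -> default threshold; number in (0, 1] -> custom; anything else -> gate disabled.
function resolveThreshold(memoryGate: boolean | number | undefined): number | null {
  if (memoryGate === true) return DEFAULT_THRESHOLD;
  if (typeof memoryGate === "number" && memoryGate > 0 && memoryGate <= 1) return memoryGate;
  return null;
}

// Pressure is the largest of the three normalized fractions.
function memoryPressure(src: MemorySources = realSources): number {
  const total = src.totalmem();
  const rssFraction = src.rss() / total;
  const heapFraction = src.heapUsed() / src.heapSizeLimit();
  const systemFraction = (total - src.freemem()) / total;
  return Math.max(rssFraction, heapFraction, systemFraction);
}

// Assumption: the gate rejects when pressure is at or above the threshold.
function shouldReject(
  memoryGate: boolean | number | undefined,
  src: MemorySources = realSources,
): boolean {
  const threshold = resolveThreshold(memoryGate);
  return threshold !== null && memoryPressure(src) >= threshold;
}
```

Passing the sources as a parameter is what lets tests supply fixed numbers without mocking `os` or `v8`.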

"Pressure" is the worst of three normalized dimensions: process RSS as a fraction of system memory, V8 heap usage as a fraction of the heap size limit, and system memory used as a fraction of total.

Test plan

  • Learning tests for checkMemoryHealth / collectMemoryMetrics / formatMemoryPressureMessage (injectable MemorySources pattern to sidestep ESM module mocking)
  • Schema validation tests for memoryGate and maxConcurrentRunners config fields
  • runner-gate.test.ts — 7 tests covering the concurrency cap and memory gate wiring into EdgeWorker
  • cyrus-core test suite (158 tests) passes
  • cyrus-edge-worker test suite (632 tests) passes
  • pnpm typecheck clean across all workspace projects
  • JSON schemas regenerated and committed
  • Manual verification on a memory-constrained host: confirm rejection message appears in Linear / GitHub / GitLab / Slack

PaytonWebber and others added 7 commits April 21, 2026 14:31
Cross-platform pre-flight checks (Linux + macOS) that reject new runner
creation when the host is under memory pressure or already at a configured
session ceiling, posting a user-visible "temporarily out of capacity"
message to Linear, GitHub, GitLab, or Slack instead of getting silently
OOM-killed by systemd-oomd or the kernel.

- New `memoryGate` config: maxRssPercent, minAvailableMemoryMb,
  maxHeapUsagePercent (opt-in via `enabled: true`)
- New `maxConcurrentRunners` config: hard cap on concurrent runners,
  counted live via `isRunning()` across Linear + chat runners
- Gates wired into all 4 runner-creation paths:
  Linear agentSessionCreated/Prompted, GitHub webhook, GitLab webhook,
  Slack chat handler
- Uses only `os`, `v8`, `process` Node built-ins for cross-platform parity
- memory-health.ts: `config?: MemoryGateConfig` instead of `| undefined`
- memory-health.ts: drop raw technical reason from user-facing message
  (stays in `MemoryCheckResult.reason` for operator logs)
- EdgeWorker.ts: log swallowed `runner.isRunning()` errors at debug level
  so a broken runner slipping under the cap is diagnosable
- EdgeWorker.ts: extract `enforceRunnerGate(onReject)` helper; collapse
  the four near-identical gate blocks (GitHub, GitLab, Linear, resume)
  into callsites that just pass a platform-specific rejection closure
- chat-sessions.test.ts: implement `notifyUnavailable` on TestChatAdapter
  (required by ChatPlatformAdapter since this PR)
- runner-gate.test.ts: new test file covering the gate wiring end-to-end
  via ChatSessionHandler — gate trips → notifyUnavailable called, runner
  not spawned; gate passes → runner spawned; no gate → runner spawned
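
A rough sketch of the concurrency half of the gate described in the bullets above, assuming a synchronous `onReject` closure and a hypothetical `RunnerLike` interface. Only `isRunning()`, `maxConcurrentRunners`, and the `enforceRunnerGate(onReject)` name come from the commit messages; the parameter list and the rejection wording are illustrative.

```typescript
// Hypothetical minimal runner shape; only isRunning() comes from the PR text.
interface RunnerLike {
  isRunning(): boolean;
}

// Count runners that report themselves as running. A runner whose
// isRunning() throws is logged at debug level and not counted.
function countActiveRunners(
  runners: readonly RunnerLike[],
  debug: (message: string) => void = () => {},
): number {
  let active = 0;
  for (const runner of runners) {
    try {
      if (runner.isRunning()) active += 1;
    } catch (err) {
      debug(`isRunning() threw, runner not counted: ${String(err)}`);
    }
  }
  return active;
}

// Returns true when the caller may create a runner. On rejection the
// platform-specific closure receives the user-facing message.
function enforceRunnerGate(
  maxConcurrentRunners: number | undefined,
  runners: readonly RunnerLike[],
  onReject: (message: string) => void,
): boolean {
  if (maxConcurrentRunners === undefined) return true; // cap not configured
  const active = countActiveRunners(runners);
  if (active < maxConcurrentRunners) return true;
  onReject(`Temporarily out of capacity (${active}/${maxConcurrentRunners} runners active).`);
  return false;
}
```

Because the count is recomputed from the live runner list on every call, there is no separate counter that could fall out of sync with reality.
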
* fix: close streaming prompt on result unless warm sessions enabled (#1141)

* fix: close streaming prompt on result unless warm sessions enabled

Before #1109, ClaudeRunner called streamingPrompt.complete() when the SDK
emitted a result message, which closed the async iterable and let the
for-await loop exit so the subprocess could shut down at turn end. That
close was removed to keep sessions warm for follow-up messages.

The pre-warm PR also added CYRUS_ENABLE_WARM_SESSIONS as an opt-in gate
for startup pre-warming and warm-instance attach, but the streaming-prompt
close behavior was not gated with it. Result: sessions kept their
subprocess alive after every turn even when warm sessions were disabled.

Thread the gate into ClaudeRunner via an optional keepSessionWarm
constructor arg (passed from EdgeWorker.createRunnerForType using
isWarmSessionsEnabled()). When false, complete the streaming prompt on
result as before; when true, leave it open for follow-ups. Result-message
emission stays in-loop in both modes — no revert of the pendingResultMessage
deferral.
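
The gating described above can be pictured with a small sketch. `StreamingPromptSketch` and `handleSdkMessage` are hypothetical stand-ins, not the real ClaudeRunner internals; only `keepSessionWarm`, `complete()`, and the `result` message type come from the commit message.

```typescript
// Hypothetical stand-in for the streaming prompt: complete() closes the
// async iterable that feeds the SDK, which lets the for-await loop exit.
class StreamingPromptSketch {
  private closed = false;

  complete(): void {
    this.closed = true;
  }

  get isClosed(): boolean {
    return this.closed;
  }
}

// On a result message, close the prompt unless the session should stay warm.
function handleSdkMessage(
  message: { type: string },
  prompt: StreamingPromptSketch,
  keepSessionWarm: boolean,
): void {
  if (message.type !== "result") return;
  if (!keepSessionWarm) prompt.complete();
}
```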

* docs: clarify subprocess-exit motivation in changelog entry

* feat: add base branch change notification and blocked-by dependency deferral (CYPACK-978) (#1004)

* feat: add base branch change notification and blocked-by dependency deferral (CYPACK-978)

Add two session lifecycle features:

1. Base branch push notification: GitHub push webhooks on tracked base branches
   stream rebase notifications to active sessions via addStreamMessage.

2. Blocked-by dependency deferral: Issues with unresolved blocked-by relations
   are parked instead of starting a runner. Parked sessions wake automatically
   when blocking issues complete, or on user re-prompt if blockers clear.

* docs: update changelogs for CYPACK-978 (#1004)

* refactor: replace as Exclude type assertions with known literal eventType values

Each private translate method is only called from a specific if-branch where
the eventType is already known, so we can use the exact string literal instead
of casting via Exclude<typeof event.eventType, "push">.

---------

Co-authored-by: cyrusagent <237105008+cyrusagent[bot]@users.noreply.github.com>
Co-authored-by: Cyrus Agent <agentclear@anthropic.com>
Co-authored-by: Connor Turland <1409121+Connoropolous@users.noreply.github.com>

* feat: add /linear-webhook endpoint with deprecated /webhook alias (CYPACK-1119) (#1142)

* chore: update @anthropic-ai/claude-agent-sdk to v0.2.117 (CYPACK-1120) (#1143)

* chore: update @anthropic-ai/claude-agent-sdk to v0.2.117 (CYPACK-1120)

Bumps claude-agent-sdk from 0.2.116 to 0.2.117 across all packages
(claude-runner, core, edge-worker, simple-agent-runner). This updates
the bundled Claude Code binary from v2.1.116 to v2.1.117, a parity
release with no tool-list changes.

Also fixes scripts/extract-claude-tools.sh to work with the new native
binary SDK structure introduced in v0.2.113: the SDK no longer ships a
bundled cli.js but instead installs platform-specific optional
dependencies (e.g. @anthropic-ai/claude-agent-sdk-darwin-arm64). The
script now resolves the binary by walking from the SDK package to its
optional platform dep.

Tool list verified unchanged against Claude Code v2.1.117 binary output.

* chore: add PR number to CHANGELOG entry for CYPACK-1120

* fix: post chat-session replies from result messages under warm mode (#1145)

* fix: post chat-session replies from result messages, not startStreaming resolution

With CYRUS_ENABLE_WARM_SESSIONS=1, ClaudeRunner keeps the streaming prompt
open across turns so the subprocess stays warm for follow-up messages. As
a side effect, runner.startStreaming() never resolves until the entire
session is torn down — but ChatSessionHandler awaited startStreaming()
and then called adapter.postReply() inline, so Slack/GitHub chat sessions
never posted a reply at all under warm mode.

Decouple reply posting from session termination:

- Maintain a FIFO queue of pending reply events per sessionId. Enqueue at
  each entry point that sends a prompt: new session, resume, and
  follow-up injection via addStreamMessage (which was also a fire-and-
  forget path with no reply today, regardless of warm mode).

- In handleAgentMessage, when the SDK emits a `result` message, dequeue
  the oldest event for that session and call adapter.postReply() with
  it. Each turn's reply pairs with the prompt that triggered it.

- Kick off start()/startStreaming() as a non-awaited promise; log the
  resolved sessionInfo and forward errors via the existing logger. The
  handler's handleEvent() returns once the prompt has been dispatched,
  so onWebhookEnd fires promptly under warm mode.

No completeStream() calls — the streaming prompt intentionally stays
open under warm mode so addStreamMessage follow-ups continue to work.

* fix: clear pending reply queue when chat runner errors before result

Codex review feedback on #1145: enqueueReply() runs before start/startStreaming,
but the error catch only logged and left the queued event in place. If the
runner dies before emitting a result (e.g. SDK error message terminates the
session) the stale event stays at the head of pendingReplyEvents — a later
resumeSession() on the same sessionId would then pair it with the new
runner's first result and shift every subsequent reply by one turn.

Add clearPendingReplies(sessionId) and call it from both the initial-turn
and resume error paths. Logs the count discarded so drift is visible.
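
As a sketch of the per-session FIFO described in the two commits above, assuming a generic event type and a hypothetical `PendingReplyQueue` class (the PR names `enqueueReply`, `clearPendingReplies`, and `pendingReplyEvents`, but their exact shapes are not shown here):

```typescript
// Hypothetical container: one FIFO of pending reply events per sessionId.
class PendingReplyQueue<E> {
  private readonly queues = new Map<string, E[]>();

  // Called at every entry point that sends a prompt.
  enqueue(sessionId: string, event: E): void {
    const queue = this.queues.get(sessionId) ?? [];
    queue.push(event);
    this.queues.set(sessionId, queue);
  }

  // Called when a result message arrives: the oldest pending event
  // is the prompt this result answers.
  dequeue(sessionId: string): E | undefined {
    const queue = this.queues.get(sessionId);
    if (queue === undefined) return undefined;
    const event = queue.shift();
    if (queue.length === 0) this.queues.delete(sessionId);
    return event;
  }

  // Called on runner error before a result; returns how many stale
  // events were discarded so the caller can log the count.
  clear(sessionId: string): number {
    const discarded = this.queues.get(sessionId)?.length ?? 0;
    this.queues.delete(sessionId);
    return discarded;
  }
}
```

Clearing on error is what keeps a later resume on the same `sessionId` from pairing its first result with a prompt that never got an answer.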

* fix: de-flake EgressProxy CI tests (CYPACK-1122) (#1147)

* fix: de-flake EgressProxy CI tests (CYPACK-1122)

Two root causes behind the flaky `EgressProxy` suite in CI:

1. Port collision on EADDRINUSE (127.0.0.1:19281): the test allocated
   ports via `19080 + Math.floor(Math.random() * 1000)`, a narrow range
   that occasionally collides with another process or a port still in
   TIME_WAIT. Tests now bind to port 0 and read the OS-assigned
   ephemeral port via `proxy.getHttpProxyPort()` / `getSocksProxyPort()`.
   `EgressProxy.startHttpProxy` / `startSocksProxy` update the stored
   port to the actual bound `server.address().port` after `listen`.

2. SOCKS5 `blocks non-allowed domains` race: the server wrote the
   denial reply then immediately destroyed the socket. Because
   `socket.write` is async and `socket.destroy` aborts pending I/O,
   the reply could be truncated before reaching the client, which
   then timed out. Replaced `socket.write(reply); socket.destroy()`
   with `socket.end(reply)` so the reply is flushed before FIN in
   all three SOCKS5 error paths.
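
Both fixes can be illustrated with Node's `net` module. This is a sketch under assumptions, not the EgressProxy code: `listenOnEphemeralPort` and `denySocks5` are hypothetical helper names, and the reply bytes follow the SOCKS5 reply layout with REP set to 0x02 (connection not allowed by ruleset).

```typescript
import * as net from "node:net";

// Bind to port 0 so the OS picks a free port, then read the real port back.
function listenOnEphemeralPort(server: net.Server, host = "127.0.0.1"): Promise<number> {
  return new Promise((resolve, reject) => {
    server.once("error", reject);
    server.listen(0, host, () => {
      const address = server.address();
      if (address === null || typeof address === "string") {
        reject(new Error("server is not bound to a TCP port"));
        return;
      }
      resolve(address.port);
    });
  });
}

// Send a SOCKS5 denial and close. socket.end(data) queues the write and
// sends FIN only after it has been flushed, unlike write() + destroy().
function denySocks5(socket: net.Socket): void {
  // VER=5, REP=2, RSV=0, ATYP=1 (IPv4), BND.ADDR=0.0.0.0, BND.PORT=0
  const reply = Buffer.from([0x05, 0x02, 0x00, 0x01, 0, 0, 0, 0, 0, 0]);
  socket.end(reply);
}
```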

* chore: add PR link to CYPACK-1122 changelog entry

* chore: remove stale package-lock.json (fixes Dependabot picomatch alert)

This repo is pnpm-managed (`packageManager: pnpm@10.13.1` and
`pnpm-lock.yaml` is the source of truth). A stale `package-lock.json`
was accidentally committed and pinned a vulnerable `picomatch@2.3.1`
(GHSA-3v7f-55p6-f55p), which Dependabot has been flagging.

Removing the stray lockfile and gitignoring `package-lock.json` /
`yarn.lock` so npm/yarn lockfiles can't sneak back in.

* chore: bump pnpm to 10.33.1

* fix: spread process.env in claude-agent-sdk invocations (forward-port of v0.2.49 hotfix) (#1152)

* fix: spread process.env in claude-agent-sdk invocations (#1150)

* spread process.env again, since the SDK reverted to not overlaying it (see https://github.com/anthropics/claude-agent-sdk-typescript/blob/main/CHANGELOG.md#02113)


* docs: changelog for HOME/env regression fix

Document that re-spreading process.env restores HOME and other inherited
env vars for Claude sessions, fixing GPG-signed commits, gh CLI auth,
and other tools that depend on the user's shell environment.

* docs: record 0.2.49 hotfix release in CHANGELOG

The v0.2.49 hotfix was released from the `cypack-1123` branch. Add a
corresponding `## [0.2.49]` section on main so the release history shows
that it happened, and move the process.env bullet out of Unreleased into
0.2.49 (where it was actually shipped).

* chore: bump package versions to 0.2.49

* feat(claude-runner): enable SDK debug mode on DEBUG log level (CYPACK-1124) (#1153)

* feat(cli): optional Sentry error tracking with opt-out env var (CYPACK-1142) (#1163)

* feat(cli): integrate optional Sentry error tracking (CYPACK-1142)

Adds an ErrorReporter abstraction in cyrus-core (interface + Noop) and a
Sentry-backed implementation in apps/cli. The reporter is initialised
before Application bootstrap and wired into the existing
uncaughtException / unhandledRejection handlers, so unhandled CLI errors
are reported automatically.

Configuration:
- CYRUS_SENTRY_DSN — DSN to use (overrides the bundled default).
- CYRUS_SENTRY_DISABLED — opt-out switch (1/true/yes/on).
- CYRUS_SENTRY_ENVIRONMENT — environment tag (defaults to "production").

The bundled DEFAULT_SENTRY_DSN is left empty: the OAuth token used to
scope this work cannot create the ceedar/cyrus-cli Sentry project, so an
admin must create the project (team: cyrus, platform: node) and paste
the DSN into apps/cli/src/services/createErrorReporter.ts to enable
zero-config reporting for end users.

* feat(core): forward all logger.error calls to ErrorReporter (CYPACK-1142)

Makes the Sentry hook universal: every Logger.error(...) call across
edge-worker, claude-runner (Claude Code session errors), GitHub/Slack/
Linear transports, persistence manager, etc. now forwards to the global
ErrorReporter. The reporter is installed once during CLI bootstrap via
setGlobalErrorReporter(); libraries imported without the CLI continue to
use a Noop reporter, so test harnesses and SDK consumers are unaffected.

The Logger extracts the first Error in the trailing args (also unwrapping
`{ error: Error }` shapes used by transports) and reports it as an
exception with component / sessionId / platform / issueIdentifier /
repository tags pulled from LogContext. When no Error is present, the
message is captured at "error" severity so silent failure paths still
produce events.

Application's uncaughtException / unhandledRejection handlers were
collapsed onto the same path to avoid double-reporting now that Logger
forwards automatically.
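
A small sketch of the "first Error in the trailing args" extraction described above. `extractFirstError` is a hypothetical name; the unwrapping of `{ error: Error }` shapes follows the commit message.

```typescript
// Return the first Error found in the args, looking one level into
// objects that carry it under an `error` property.
function extractFirstError(args: readonly unknown[]): Error | undefined {
  for (const arg of args) {
    if (arg instanceof Error) return arg;
    if (typeof arg === "object" && arg !== null && "error" in arg) {
      const inner = (arg as { error: unknown }).error;
      if (inner instanceof Error) return inner;
    }
  }
  return undefined;
}
```

When this returns `undefined`, the commit message says the log message itself is captured at "error" severity instead.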

* feat(cli): tag Sentry events with team_id from CYRUS_TEAM_ID (CYPACK-1142)

Reads CYRUS_TEAM_ID at reporter construction and applies it as a global
`team_id` tag via Sentry's initialScope, so every captured event (and
every future logger.error forwarded through the reporter) is filterable
per Cyrus tenant in Sentry without requiring tag plumbing at each
capture site.

Additional global tags can be added in buildInitialTags() in
createErrorReporter.ts; capture-site context (component, sessionId,
issueIdentifier, etc.) continues to flow via Logger.error.

* fix(core): apply CYRUS_TEAM_ID tag in Logger.error forwarding (CYPACK-1142)

Sentry's initialScope tags are overridden when call sites set per-event
tags via withScope().setTag(), and Logger.error builds an explicit tag
map for every forwarded event. So even though createErrorReporter set
team_id via initialScope, every Logger-routed exception/message ended up
without it.

Introduces setGlobalErrorTags / getGlobalErrorTags in cyrus-core as the
canonical place to register process-wide tags. Bootstrap mirrors the
team_id from CYRUS_TEAM_ID into both Sentry's initialScope (for events
emitted directly via the SDK) and the Logger registry (for events
emitted via logger.error). Per-call context tags still win on collision
so capture sites can override.

* Set default Sentry DSN for error reporting

* feat(sentry): scrub secrets, add fingerprint + sample-rate (CYPACK-1142)

Address the three high-severity gaps from the PR review:

1. Secret scrubbing (sentryScrubber.ts) — beforeSend hook redacts
   sensitive keys (token, secret, authorization, etc.) and
   token-shaped strings at any depth in event extras/contexts/request.
   Wired into createErrorReporter so every Sentry-bound event is filtered.

2. Stable fingerprinting (Logger + ErrorReporterContext) — Logger.error
   now sets fingerprint = ["logger", component, templatized(message)],
   collapsing UUIDs/issue identifiers/paths/long numbers to placeholders
   so messages with embedded IDs no longer fragment Sentry groups.

3. CYRUS_SENTRY_SAMPLE_RATE env var — wired through createErrorReporter
   to SentryErrorReporter so high-volume deployments can downsample
   without code changes.
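
Items 1 and 2 above lend themselves to a short sketch. The key pattern, the token-shape regex, the placeholder strings, and the function names are all assumptions for illustration; the real rules live in sentryScrubber.ts and the Logger and are not reproduced here.

```typescript
const REDACTED = "[REDACTED]";
// Assumed subset of sensitive key names; the commit lists "token, secret, authorization, etc."
const SENSITIVE_KEY = /token|secret|authorization/i;
// Assumed definition of "token-shaped": a long run of URL-safe characters.
const TOKEN_SHAPED = /^[A-Za-z0-9_-]{32,}$/;

// Recursively redact sensitive keys and token-shaped strings at any depth.
function scrub(value: unknown): unknown {
  if (typeof value === "string") return TOKEN_SHAPED.test(value) ? REDACTED : value;
  if (Array.isArray(value)) return value.map((item) => scrub(item));
  if (typeof value === "object" && value !== null) {
    const out: Record<string, unknown> = {};
    for (const [key, inner] of Object.entries(value)) {
      out[key] = SENSITIVE_KEY.test(key) ? REDACTED : scrub(inner);
    }
    return out;
  }
  return value;
}

// Collapse volatile fragments so messages that differ only by ID group together.
function templatize(message: string): string {
  return message
    .replace(/[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}/gi, "<uuid>")
    .replace(/\b[A-Z][A-Z0-9]+-\d+\b/g, "<issue>")
    .replace(/(?:\/[\w.-]+){2,}/g, "<path>")
    .replace(/\d{5,}/g, "<num>");
}

function fingerprintFor(component: string, message: string): string[] {
  return ["logger", component, templatize(message)];
}
```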

* feat(sentry): structured cyrus context + extra-error/console integrations (CYPACK-1142)

Per Payton's follow-up: configure Sentry for structured logging.

- Every event now carries a structured `cyrus` context block (team_id,
  environment, release, plus optional linear_workspace from
  CYRUS_LINEAR_WORKSPACE and deployment_id from CYRUS_DEPLOYMENT_ID).
  Tags stay indexed/searchable; the context block groups the richer
  typed fields together in the Sentry UI.
- Add extraErrorDataIntegration so Error subclass own-properties (e.g.
  err.statusCode, err.requestId, custom Cyrus error fields) surface as
  structured `extra` data instead of being dropped.
- Add consoleIntegration so console.* output is captured as breadcrumbs
  giving each event a structured trail of the last log lines.

Tests cover the new structured context fields and the bundled DSN
fallback path.

* feat(sentry): forward all logs to Sentry Logs with team_id (CYPACK-1142)

Wire the existing logging abstraction (cyrus-core/logging/Logger)
through to Sentry's structured Logs API
(https://docs.sentry.io/product/explore/logs/), so every log line at
every level — not just errors — is searchable in Sentry tagged with
team_id and per-session metadata.

- Bump @sentry/node 8.55.1 → 9.47.1 (Logs API requires v9.41+).
- Add `enableLogs: true` to Sentry.init.
- Extend ErrorReporter interface with `log(level, message, attrs?)`,
  ErrorReporterLogLevel and ErrorReporterLogAttributes types. Noop
  reporter implements as no-op; SentryErrorReporter dispatches to
  Sentry.logger.{trace,debug,info,warn,error,fatal} merging
  process-wide tags (team_id, …) into per-log attributes.
- Logger.{debug,info,warn,error} now forward each call through
  reporter.log(...) attributing with component, sessionId,
  issueIdentifier, repository, and a primitive-summarised tail of
  trailing args. Errors continue to additionally capture as Sentry
  Issues for the alerting flow.
- Tests cover all-levels forwarding, team_id attribute presence,
  error-arg summarisation, and the enableLogs init flag.

* fix(sentry): gate SDK debug on CYRUS_SENTRY_DEBUG, not CYRUS_LOG_LEVEL (CYPACK-1142)

Reported on PR: with CYRUS_LOG_LEVEL=DEBUG the terminal floods with
"Sentry Logger [log]: [Tracing] Inheriting..." messages from
@sentry/opentelemetry's auto-instrumentation. Those are SDK-internal
diagnostics, not Cyrus app logs, and only fire when the SDK's debug
flag is on.

App debug logging and Sentry SDK debug logging are orthogonal — gate
on the dedicated CYRUS_SENTRY_DEBUG env var so app debugging stays
clean. The tracing-inheritance lines disappear automatically because
the SDK gates them on debug being on.

* fix(sentry): gate Logs forwarding on CYRUS_LOG_LEVEL (CYPACK-1142)

Reported on PR: previously the console-output gate respected
CYRUS_LOG_LEVEL but the Sentry Logs forward did not, so users running
at the default INFO level still paid for every logger.debug(...) call
in their Sentry quota.

Fixed by moving forwardLog inside the level-gated branch for
debug/info/warn. Two intentional asymmetries:

- error: structured-log forwarding is gated by level (so SILENT drops
  it), but the Sentry Issue capture is NOT gated — silencing logs must
  not silence real failure alerts. The level controls verbosity, not
  visibility of bugs.
- forwardToErrorReporter (Issue capture path) stays outside the gate
  for that reason.

Tests cover INFO/WARN drop behaviour, DEBUG enabling all four levels,
and SILENT still capturing errors as Issues while dropping the log
stream.

* fix(sentry): forward all logs unconditionally regardless of CYRUS_LOG_LEVEL (CYPACK-1142)

Reverses the prior level-gating change per Payton's follow-up:
CYRUS_LOG_LEVEL is for local terminal verbosity only; the structured
Sentry Logs stream should be the always-on backbone so debug-level
traces remain available in Sentry even when running at INFO in
production.

forwardLog is now called unconditionally on every Logger.debug/info/
warn/error call. Console output is still gated by level. Errors still
additionally capture as Sentry Issues regardless of level (unchanged).

* feat(sentry): scrub Logs, gate forwarding on team_id, lifecycle events (CYPACK-1142)

Addresses three review gaps identified in PR #1163:

1. Sentry Logs are now scrubbed via a dedicated `beforeSendLog` hook —
   `beforeSend` only runs for Issues, so the structured-log stream was
   previously bypassing redaction. Adds `scrubSentryLog` mirroring the
   key/string token rules already applied to events. Also scrubs
   `event.breadcrumbs[]` (consoleIntegration captures every console.*
   line, which can contain headers/tokens).

2. Logs forwarding is now gated on `CYRUS_TEAM_ID` AND the absence of
   `CYRUS_SENTRY_DISABLED`. Issue capture remains gated only on
   `CYRUS_SENTRY_DISABLED`, so installs without tenant tagging still
   surface errors but don't ship the higher-volume log stream.
   Implemented as a new `logsEnabled` option on SentryErrorReporter
   that mirrors into `Sentry.init({ enableLogs })` and short-circuits
   `log()` (belt-and-braces).

3. Reworked which logs forward:
     - WARN/ERROR auto-forward (always).
     - debug/info no longer forward — too high volume to ship blanket.
     - New `logger.event(name, attributes?)` API on ILogger for major
       lifecycle events that should always reach Sentry Logs regardless
       of local CYRUS_LOG_LEVEL.
   Wired `event()` into ClaudeRunner (session_started, session_resumed,
   claude_session_id_assigned, message_emitted, session_completed,
   session_stopped, session_stop_requested, claude_query_options) and
   EdgeWorker (webhook_received). Each event carries identifier,
   claudeSessionId, team_id, component, and other contextual metadata
   already merged via the Logger context/global tag plumbing.

SOLID notes: SentryErrorReporter still owns only SDK translation
(SRP); ErrorReporter contract unchanged (LSP/DIP); new event method
extends ILogger without altering existing methods (OCP); new
LogEventAttributes type narrows to primitives only (ISP).

* fix(sentry): stop redacting sessionId / claudeSessionId attributes (CYPACK-1142)

The "session" substring in SENSITIVE_KEY_PATTERNS was too broad — it
matched our identifier attributes (sessionId, claudeSessionId) and
redacted them in the Logs explorer, defeating the whole point of
forwarding lifecycle events.

Real session secrets are still caught by the more specific patterns:
session_token (token), session_cookie (cookie), session_secret (secret).
Plain identifiers now pass through. Added a regression test pinning
both directions.

* fix(sentry): gate Issues on CYRUS_TEAM_ID alongside Logs (CYPACK-1142)

Per follow-up requirement: Issue capture should also be gated on
CYRUS_TEAM_ID, not just Logs. Both surfaces now share a single gate —
installs without a tenant tag stay silent so the team's Sentry org
isn't flooded with untenanted self-hosted noise we can't slice.

Resolution order in createErrorReporter is now:
  1. CYRUS_SENTRY_DISABLED truthy → Noop
  2. CYRUS_TEAM_ID unset           → Noop
  3. No DSN configured             → Noop
  4. Otherwise                     → SentryErrorReporter

Dropped the now-redundant logsEnabled plumbing from
SentryErrorReporter — by the time the constructor runs both Issues
and Logs are wanted, so enableLogs is unconditionally true.

Tests refactored: removed assertions for "Issues without team_id"
behavior, added a regression for "no captureException when team_id
is unset", added a regression for "Noop when team_id is unset". 94
CLI tests still pass.
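
The resolution order listed above can be sketched as a pure function over the environment. `resolveReporterKind` and `isTruthyFlag` are hypothetical names, and the case-insensitive flag parsing is an assumption; the env var names and the 1/true/yes/on values come from the commit messages in this PR.

```typescript
type ReporterKind = "noop" | "sentry";

// Assumption: flag values are compared case-insensitively after trimming.
function isTruthyFlag(value: string | undefined): boolean {
  if (value === undefined) return false;
  return ["1", "true", "yes", "on"].includes(value.trim().toLowerCase());
}

function resolveReporterKind(
  env: Record<string, string | undefined>,
  bundledDsn: string,
): ReporterKind {
  if (isTruthyFlag(env["CYRUS_SENTRY_DISABLED"])) return "noop"; // 1. explicit opt-out
  if (!env["CYRUS_TEAM_ID"]) return "noop"; // 2. no tenant tag
  const dsn = env["CYRUS_SENTRY_DSN"] || bundledDsn; // env DSN overrides the bundled default
  if (!dsn) return "noop"; // 3. nothing to send to
  return "sentry"; // 4. otherwise
}
```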

* fix(sentry): redact sensitive env keys in claude_query_options before serialise (CYPACK-1142)

The query options spread includes process.env, which carries every
PAT / OAuth token / webhook secret on the host. Sentry's server-side
data scrubber matches token-shaped substrings inside the JSON string
and replaces the *whole* attribute value with [Filtered], wiping the
diagnostic payload entirely.

Now redact sensitive keys (token, secret, password, apikey,
authorization, cookie, private_key, client_secret, refresh_token,
access_token, bearer, dsn, webhook_secret, signing_secret) at
serialisation time via the existing JSON.stringify replacer hook.
Sentry no longer sees the raw secrets, so it stops marking the
options field as filtered and the rest of the payload survives.

Patterns mirror apps/cli/src/services/sentryScrubber.ts but live
locally in claude-runner so this package keeps no upward dep on the
CLI app.

* fix(sentry): ship claude_query_options as flat primitive attributes (CYPACK-1142)

Server-side Sentry data scrubbing was filtering the whole `options`
attribute to `[Filtered]` regardless of how thoroughly we redacted
sensitive substrings inside the JSON. Stuffing a long nested-JSON
string under a single attribute key reliably trips at least one
matcher (token-shaped substrings, length, key name patterns) — and
when it does the *whole* payload disappears.

Switch strategy: build a sanitized projection that drops everything
secret-bearing or unbounded (full env values, MCP server inner
config, prompt text, system-prompt append) and keep only the
diagnostic surface — model, tool counts/previews, MCP server names,
env key NAMES (not values), system prompt shape, presence flags for
hooks/plugins/sandbox, etc. Then *flatten* the projection into one
attribute per datum (`cqo.model`, `cqo.allowedToolsCount`,
`cqo.envKeyNamesPreview`, …) so a per-key filter (if it ever fires)
loses one attribute, not the whole payload.

Local DEBUG console still logs the full untruncated options JSON so
on-machine troubleshooting is unaffected — the projection only
applies to what leaves the process.
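
A sketch of the flat projection idea, assuming a hypothetical `QueryOptionsLike` input with only three fields and a preview limit of ten key names. The attribute names `cqo.model`, `cqo.allowedToolsCount`, and `cqo.envKeyNamesPreview` come from the commit message; everything else is illustrative.

```typescript
type Primitive = string | number | boolean;

// Hypothetical, heavily reduced view of the query options.
interface QueryOptionsLike {
  model?: string;
  allowedTools?: string[];
  env?: Record<string, string | undefined>;
}

// Emit one primitive attribute per datum. Env values are never read,
// only the key names, so no secret can leave through this projection.
function projectQueryOptions(options: QueryOptionsLike): Record<string, Primitive> {
  const attributes: Record<string, Primitive> = {};
  if (options.model !== undefined) attributes["cqo.model"] = options.model;
  attributes["cqo.allowedToolsCount"] = options.allowedTools?.length ?? 0;
  attributes["cqo.envKeyNamesPreview"] = Object.keys(options.env ?? {})
    .sort()
    .slice(0, 10)
    .join(",");
  return attributes;
}
```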

---------

Co-authored-by: Payton Webber <53197664+PaytonWebber@users.noreply.github.com>

* feat: configurable mirror Claude session transcripts to hosted control plane (CYPACK-1121) (#1144)

* feat: mirror Claude session transcripts to hosted control plane (CYPACK-1121)

Introduce HttpSessionStore — a Claude Agent SDK SessionStore adapter that
POSTs session transcript entries to the Cyrus hosted control plane,
authenticated with the team-scoped CYRUS_API_KEY. Wire it through
ClaudeRunnerConfig and construct it in EdgeWorker when both
CYRUS_API_KEY and CYRUS_APP_URL are set so the SDK dual-writes local
JSONL + remote transcripts, letting sessions survive ephemeral
worktree teardowns.

Vendor the 13-check behavioral conformance suite from the upstream SDK
examples (ported from bun:test to vitest) and run it against
HttpSessionStore with an in-process fake backend. All 19 conformance +
transport tests pass; full claude-runner suite (105 tests) and
edge-worker suite (586 tests) are green.

* refactor(sessions): pass team id explicitly via CYRUS_TEAM_ID (CYPACK-1121)

Previously the edge only sent Authorization: Bearer <CYRUS_API_KEY>; the
server had to reverse-lookup the team by a SHA-256 hash of the key. The
edge actually knows its own team id, so send it directly as
X-Cyrus-Team-Id and skip the hash column entirely — O(1) primary-key
lookup on the server instead.

- HttpSessionStore now requires a teamId option; sends it as header on
  every request. buildRequestHeaders() is protected so alternate auth
  schemes can extend without rewriting the transport.
- EdgeWorker gates the remote store on all three env vars being set
  (CYRUS_APP_URL, CYRUS_API_KEY, CYRUS_TEAM_ID).
- Tests cover the new header + constructor validation.
- All 106 claude-runner + 586 edge-worker tests pass; typecheck clean.
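
A sketch of the header construction and the three-variable gate described in this commit. The free-function form, the `Content-Type` header, and the error messages are assumptions; in the PR, `buildRequestHeaders()` is a protected method on HttpSessionStore.

```typescript
// Build the headers sent on every request to the control plane.
function buildRequestHeaders(apiKey: string, teamId: string): Record<string, string> {
  if (apiKey === "") throw new Error("apiKey is required");
  if (teamId === "") throw new Error("teamId is required");
  return {
    Authorization: `Bearer ${apiKey}`,
    "X-Cyrus-Team-Id": teamId,
    "Content-Type": "application/json", // assumption: entries are posted as JSON
  };
}

// The remote store is only constructed when all three variables are set.
function remoteStoreConfigured(env: Record<string, string | undefined>): boolean {
  return Boolean(env["CYRUS_APP_URL"] && env["CYRUS_API_KEY"] && env["CYRUS_TEAM_ID"]);
}
```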

* docs(sessions): link Linear ticket references in session-store comments (CYPACK-1121)

Add links to the SDK session-storage docs and the upstream reference-adapter
examples directly in the file headers so future readers can jump to both
sources without leaving the code.

* feat(sessions): add CYRUS_DISABLE_REMOTE_SESSION_STORE opt-out (CYPACK-1121)

Lets operators keep CYRUS_APP_URL/CYRUS_API_KEY/CYRUS_TEAM_ID set (other
features depend on those) while suppressing the remote Claude session
transcript mirror. When the env var is set to 1/true, EdgeWorker logs the
opt-out and skips constructing HttpSessionStore so transcripts stay local.

---------

Co-authored-by: Connor Turland <1409121+Connoropolous@users.noreply.github.com>

* feat(edge-worker): block session stop when work is unshipped (CYPACK-1140) (#1161)

* feat(edge-worker): block session stop when work is unshipped (CYPACK-1140)

The previous Stop hook used `additionalContext` (not a valid Stop-hook
output field per the SDK type) plus `continue: true`, which advised the
agent but did not actually block the stop — sessions ended without a PR
even after making code changes.

Replace the no-op with a real guardrail that inspects the worktree at
the session cwd:

- If `stop_hook_active` is set, allow the stop (prevents loops).
- If `git status --porcelain` shows uncommitted changes, or HEAD is
  ahead of `@{u}` (or `origin/HEAD` when no upstream is configured),
  return `decision: "block"` with a reason explaining what is unshipped
  and instructing the agent to commit, push, and open a PR.
- If the cwd is not a git repo or git is unavailable, return null and
  do not block.

The reason is fed back to the agent via the SDK's native blocking
mechanism, so the next turn actually sees it. Sessions with no code
changes (questions, research) stop normally.

Adds unit tests covering: non-git cwd, clean tree with upstream,
uncommitted changes, and commits ahead of upstream.
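
The decision rules above can be sketched as a pure function over facts already collected from git. `WorktreeState`, `decideStop`, and the reason wording are hypothetical; the sketch returns `null` for every "allow" case, which is an assumption (the commit message only states `null` for the non-git case).

```typescript
// Facts gathered beforehand; null means "not a git repo or git unavailable".
interface WorktreeState {
  porcelain: string; // output of `git status --porcelain`
  commitsAhead: number; // commits on HEAD that are not on the upstream ref
}

type StopDecision = { decision: "block"; reason: string } | null;

function decideStop(stopHookActive: boolean, state: WorktreeState | null): StopDecision {
  if (stopHookActive) return null; // already blocked once; avoid a loop
  if (state === null) return null; // nothing to inspect, do not block
  const problems: string[] = [];
  if (state.porcelain.trim() !== "") problems.push("uncommitted changes");
  if (state.commitsAhead > 0) problems.push(`${state.commitsAhead} unpushed commit(s)`);
  if (problems.length === 0) return null; // clean tree: stop normally
  return {
    decision: "block",
    reason: `Unshipped work: ${problems.join(", ")}. Commit, push, and open a PR before stopping.`,
  };
}
```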

* docs: add PR link to CYPACK-1140 changelog entry

* feat(edge-worker): ensure Cyrus PR marker is always present (CYPACK-1141) (#1162)

* feat(edge-worker): ensure Cyrus PR marker is always present (CYPACK-1141)

Adds a PostToolUse hook on Bash that, after gh pr create/edit, glab mr
create/update/edit, or gt submit commands, idempotently appends
<!-- generated-by-cyrus --> to the live PR/MR description if missing.

This guarantees the GitHub/GitLab webhook handlers can identify
Cyrus-authored PRs (so "Changes requested" events get forwarded back)
even when the agent forgets to include the marker in the body it submits.

Implemented with a Provider strategy (GitHub via gh, GitLab via glab)
so new forges plug in without modifying the hook.
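
A sketch of the two pure pieces of this hook: recognising the commands that touch a PR/MR, and appending the marker idempotently. The function names and regexes are illustrative assumptions; fetching and updating the live description through `gh` or `glab` is left out.

```typescript
const CYRUS_MARKER = "<!-- generated-by-cyrus -->";

// True for the command families named in the commit message.
function touchesPullRequest(command: string): boolean {
  return (
    /\bgh\s+pr\s+(?:create|edit)\b/.test(command) ||
    /\bglab\s+mr\s+(?:create|update|edit)\b/.test(command) ||
    /\bgt\s+submit\b/.test(command)
  );
}

// Append the marker only when it is missing, so repeated runs are no-ops.
function ensureMarker(body: string): string {
  if (body.includes(CYRUS_MARKER)) return body;
  const trimmed = body.trimEnd();
  return trimmed.length === 0 ? CYRUS_MARKER : `${trimmed}\n\n${CYRUS_MARKER}`;
}
```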

* docs: add PR link to CYPACK-1141 changelog entry

* fix: only call query.interrupt() on warm Claude sessions (CYPACK-1145) (#1165)

* fix: only call query.interrupt() on warm Claude sessions (CYPACK-1145)

Previously the stop-signal handler unconditionally invoked
runner.interrupt() on the first stop, which called the Claude SDK's
query.interrupt() even for non-warm sessions. The SDK aborts the
in-flight request in that case and surfaces "Error: Request was
aborted".

Stop signals now branch on whether the runner reports isWarm():
- Non-warm sessions: immediate full stop on the first signal.
- Warm sessions: interrupt on first stop, full terminate on a second
  stop within 10s (unchanged UX).

ClaudeRunner.interrupt() also now defensively falls back to stop()
when the runner is non-warm, so any stale callers can't reintroduce
the abort error.
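
A minimal sketch of the branching described above, assuming a runner that exposes `isWarm()`, `interrupt()`, and `stop()`. The `StoppableRunner` interface, `createStopHandler`, and the injectable clock are hypothetical and exist only to make the example self-contained.

```typescript
// Hypothetical minimal runner surface; only the three method names
// appear in the commit message.
interface StoppableRunner {
  isWarm(): boolean;
  interrupt(): void;
  stop(): void;
}

type StopAction = "interrupt" | "stop";

const DOUBLE_STOP_WINDOW_MS = 10000; // "a second stop within 10s"

function createStopHandler(
  runner: StoppableRunner,
  now: () => number = Date.now,
): () => StopAction {
  let lastInterruptAt: number | null = null;

  return function onStopSignal(): StopAction {
    // Non-warm session: never call interrupt(), stop immediately.
    if (!runner.isWarm()) {
      runner.stop();
      return "stop";
    }
    const t = now();
    // Warm session, second stop inside the window: terminate fully.
    if (lastInterruptAt !== null && t - lastInterruptAt <= DOUBLE_STOP_WINDOW_MS) {
      lastInterruptAt = null;
      runner.stop();
      return "stop";
    }
    // Warm session, first stop (or the window expired): interrupt only.
    lastInterruptAt = t;
    runner.interrupt();
    return "interrupt";
  };
}
```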

* chore: add PR link to CYPACK-1145 changelog entry

* feat(cursor-runner): switch from CLI spawn to @cursor/sdk (CYPACK-1149) (#1169)

* feat(cursor-runner): switch from CLI spawn to @cursor/sdk (CYPACK-1149)

Replace the cursor-agent CLI spawn-and-parse implementation with direct use
of the new @cursor/sdk TypeScript SDK. Enforce tool permissions through
.cursor/hooks.json (validated as the actual gate in headless mode) instead
of .cursor/cli.json (validated as ignored by the SDK). MCP servers are now
passed inline to Agent.create() rather than synced through .cursor/mcp.json,
and the agent mcp list/enable preflight is dropped.

Highlights:
- New permissions.ts translates Cyrus tool patterns (Read/Bash/mcp__*) into
  the hook helper's pattern format (Read/Shell/Mcp/Tool).
- New permission-check.mjs ships with the package and is copied into the
  worktree's .cursor/ directory at session start; it enforces allow/deny
  at preToolUse, beforeShellExecution, beforeReadFile, and
  beforeMCPExecution with failClosed: true.
- cwd is passed as string[] to match the other runners.
- SDK sandbox plumbing is wired through but the Agent.create call is
  commented out pending Cursor exposing configureSandboxPrereqs in the
  public SDK (bug filed; tracked in TODO).

Verified: cyrus-cursor-runner build/typecheck pass with 25/25 tests;
cyrus-edge-worker typecheck passes with 611/611 tests.

* fix(cursor): default model to composer-2; map legacy gpt-5/auto to default

The new @cursor/sdk enforces a strict model-id allowlist (default,
composer-2, gpt-5.4, claude-sonnet-4-6, ...). Two issues surfaced during
F1 validation:

1. CursorRunner kept the old CLI alias `gpt-5 -> auto`. The SDK rejects
   `auto` outright. Now both `gpt-5` and `auto` map to `default`, which
   is a real SDK id and lets the server resolve.
2. RunnerSelectionService defaulted cursor to `gpt-5`, which isn't in the
   SDK's accepted list either. The default is now `composer-2` (Cursor's
   named default), and the runner picks up `cursorDefaultModel` /
   `cursorDefaultFallbackModel` from config.

Adds matching schema fields to EdgeConfig and propagation through
ConfigManager. F1 test drive doc included.

Verified: F1 e2e ran a real Cursor session under composer-2, agent read
several files and produced a working FixedWindow implementation +
unit test in the worktree. Session completed (subtype: success).

* fix(cursor): coalesce streaming text deltas into one assistant message

The @cursor/sdk emits multiple `assistant` events per turn, each carrying
a partial text delta. The runner was emitting one SDKAssistantMessage per
event, which produced one Linear `thought` activity per token (e.g.
"Expl" / "oring the codebase to" / " locate" / ...).

Fix: buffer text from consecutive `assistant` events in the runner and
flush — emitting one consolidated SDKAssistantMessage — when:
- a `tool_use` block appears (in the same or a subsequent assistant event)
- any non-assistant SDK event arrives (user, tool_call, thinking, status)
- the run stream finalizes

Verified via a new replay test (8 deltas across two turns coalesce to 2
messages with the expected concatenated text) and an F1 drive: previous
run produced 241 activities (~100 fragmented thoughts); after the fix the
same prompt produced 67 activities (3 coalesced thoughts).
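
The buffering rule can be sketched as a pure function over a list of events. The `StreamEvent` and `Emitted` shapes below are simplified, hypothetical stand-ins and do not mirror the real @cursor/sdk event types; the flush conditions follow the three bullets above.

```typescript
// Simplified, hypothetical event shapes for illustration only.
type StreamEvent =
  | { type: "assistant"; text?: string; toolUse?: { name: string } }
  | { type: "user" | "tool_call" | "thinking" | "status" };

type Emitted =
  | { kind: "assistant_text"; text: string }
  | { kind: "tool_use"; name: string }
  | { kind: "other"; type: string };

function coalesce(events: StreamEvent[]): Emitted[] {
  const out: Emitted[] = [];
  let buffer = "";

  // Emit the buffered text as one message, if there is any.
  const flush = (): void => {
    if (buffer.length > 0) {
      out.push({ kind: "assistant_text", text: buffer });
      buffer = "";
    }
  };

  for (const ev of events) {
    if (ev.type === "assistant") {
      if (ev.text) buffer += ev.text;
      if (ev.toolUse) {
        flush(); // a tool_use block ends the current text run
        out.push({ kind: "tool_use", name: ev.toolUse.name });
      }
    } else {
      flush(); // any non-assistant event ends the current text run
      out.push({ kind: "other", type: ev.type });
    }
  }
  flush(); // the stream finalized
  return out;
}
```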

* chore: gitignore .claude/scheduled_tasks.lock harness runtime artifact

* fix(cursor): lazy-import @cursor/sdk to unblock CI on Node 18 / sqlite-broken Nodes

@cursor/sdk pulls in @connectrpc/connect-node -> undici@7.x and sqlite3@5.x
as transitive deps. Both crash at module-evaluation time on common CI
environments:
- undici 7.x requires Node >=20.18.1 and references the global File at
  module init; on Node 18 it throws ReferenceError: File is not defined.
- sqlite3@5.1.7 has no prebuilt binary for newer Node versions and crashes
  in bindings.js: Could not locate the bindings file.

Both fire as soon as anything statically imports @cursor/sdk, which the
CursorRunner module did at the top. That meant edge-worker tests which
mock cyrus-cursor-runner via the workspace alias still loaded its real
module graph and exploded — 26 of 50 test files unable to start across
Node 18.x and 22.x runs.

Move the import inside start():
  const { Agent } = await import("@cursor/sdk");
TypeScript type imports are erased at runtime and stay at the top, so
nothing else changes shape. Vitest's vi.mock("@cursor/sdk") still
intercepts dynamic imports, so the cursor-runner unit tests keep passing
unchanged.

Verified: pnpm --filter cyrus-edge-worker test:run -> 611/611 pass.
pnpm --filter cyrus-cursor-runner test:run -> 26/26 pass.

* feat(cursor): record token usage from turn-ended deltas in result message

The cursor-runner was emitting result messages with all-zero token usage
because the SDK does not surface tokens through `run.stream()` events
or `RunResult` — only via the `onDelta({ update })` callback's
`turn-ended` update:

  { type: "turn-ended", usage?: {
      inputTokens, outputTokens, cacheReadTokens, cacheWriteTokens
  }}

Wire `onDelta` on `agent.send()` and accumulate per-turn deltas across the
run. Map the SDK's four counters into Cyrus's `SDKResultMessage.usage`
shape:
  inputTokens  -> input_tokens
  outputTokens -> output_tokens
  cacheReadTokens  -> cache_read_input_tokens
  cacheWriteTokens -> cache_creation_input_tokens
The SDK does not split ephemeral 1h vs 5m the way Anthropic does, so we
report 0 in both buckets and put the full count in the parent counter.

Cost (USD) is intentionally not reported: the SDK does not expose per-run
cost in its public types — Cursor handles billing server-side. We leave
`total_cost_usd: 0` for now.

Test coverage: new test drives two `turn-ended` deltas plus a non-token
delta and asserts the result message reports the correct accumulated
totals.
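
A sketch of the accumulation and field mapping, under stated assumptions: `DeltaUpdate`, `ResultUsage`, and `accumulateUsage` are hypothetical names, each counter is treated as optional, and the nested ephemeral 1h/5m buckets mentioned above are left out for brevity.

```typescript
// Counters carried by a `turn-ended` update, per the commit message.
interface TurnUsage {
  inputTokens?: number;
  outputTokens?: number;
  cacheReadTokens?: number;
  cacheWriteTokens?: number;
}

// Hypothetical, simplified view of an onDelta update.
interface DeltaUpdate {
  type: string;
  usage?: TurnUsage;
}

// Subset of the result usage shape named in the mapping above.
interface ResultUsage {
  input_tokens: number;
  output_tokens: number;
  cache_read_input_tokens: number;
  cache_creation_input_tokens: number;
}

function accumulateUsage(updates: DeltaUpdate[]): ResultUsage {
  const total: ResultUsage = {
    input_tokens: 0,
    output_tokens: 0,
    cache_read_input_tokens: 0,
    cache_creation_input_tokens: 0,
  };
  for (const update of updates) {
    // Only turn-ended updates carry token counts; skip everything else.
    if (update.type !== "turn-ended" || !update.usage) continue;
    total.input_tokens += update.usage.inputTokens ?? 0;
    total.output_tokens += update.usage.outputTokens ?? 0;
    total.cache_read_input_tokens += update.usage.cacheReadTokens ?? 0;
    total.cache_creation_input_tokens += update.usage.cacheWriteTokens ?? 0;
  }
  return total;
}
```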

* test(cursor): use SDKResultMessage type guard instead of as-unknown cast

* fix(cursor): bare Read/Write/Bash also emit path-level allow rules (CYPACK-1150)

A real production session on CYPACK-1150 was denying every file read
and shell command. The user's repo config had bare names like:
  "Read", "Write", "Edit", "Bash", "Glob", "Grep", "TodoWrite"

In Claude SDK semantics, bare `Read` means "allow Read with no path
restriction." But our translator only emitted `Tool(Read)`, which gates
the SDK's `preToolUse` event but does NOT match the subsequent
`beforeReadFile` event whose candidate is `Read(<path>)`. With nothing
in the allow list of kind `Read(...)`, the helper denied the read with
"no allow rule matched (event=beforeReadFile, candidates=[{kind:'Read',value:'README.md'}])".

The agent narrated the deny back to the user verbatim:
  "blocked by a Cyrus hook (beforeReadFile: no allow rule matched
   for Read -> README.md)"

Fix: bare path-bearing names now expand to BOTH the preToolUse gate
AND the path/command-level gate:
  Read|Glob|Grep      -> Tool(Read)  + Read(**)
  Write|Edit|...      -> Tool(Write) + Write(**)
  Bash|Shell          -> Tool(Shell) + Shell(*)

Existing system-root + workspace-sibling auto-deny still applies
because the `Read(**)` allow is detected as a "broad" pattern.
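
The expansion table can be sketched as follows. `expandBareToolName` and `expandAll` are hypothetical names. The Write group in the commit message is elided ("Write|Edit|..."), so only the two names that are spelled out are handled here, and passing other bare names through as `Tool(<name>)` is an assumption.

```typescript
// Expands one bare tool name into the preToolUse gate plus, for
// path-bearing or command-bearing tools, the path/command-level gate.
function expandBareToolName(name: string): string[] {
  switch (name) {
    case "Read":
    case "Glob":
    case "Grep":
      return ["Tool(Read)", "Read(**)"];
    case "Write":
    case "Edit":
      return ["Tool(Write)", "Write(**)"];
    case "Bash":
    case "Shell":
      return ["Tool(Shell)", "Shell(*)"];
    default:
      // Assumption: other bare names only need the preToolUse gate.
      return ["Tool(" + name + ")"];
  }
}

// Expands a whole allow list and drops duplicate rules, keeping order.
function expandAll(names: string[]): string[] {
  const rules: string[] = [];
  for (const name of names) {
    for (const rule of expandBareToolName(name)) {
      if (rules.indexOf(rule) === -1) rules.push(rule);
    }
  }
  return rules;
}
```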

Verified via F1 drive: same prompt that previously failed
("read README.md and package.json") now completes successfully —
both Read actions complete, no deny messages, session subtype=success.

New regression test asserts the bare-name expansion. All 28 cursor-runner
tests pass; typecheck clean.

* fix(cursor): map MCP server name from transport for hook permission checks (CYPACK-1151)

A real production session on CYPACK-1151 reported "Posting this to Linear
was blocked by MCP hooks here". The agent was trying to call
mcp__linear__save_comment with a properly configured allow rule
(`Mcp(linear:*)`), but the hook denied it.

Root cause: the SDK's beforeMCPExecution hook payload only carries the
bare tool_name (e.g. "save_comment") and the underlying transport
(`command` for stdio, `url` for http) — never the logical server name.
The helper was only reading `tool_name`, so a candidate `Mcp(save_comment)`
never matched a server-scoped pattern like `Mcp(linear:*)`.

Verified via a learning test (test-mcp-hook-payload.mjs in the SDK
sandbox): the captured stdin payload for beforeMCPExecution contains
`{ tool_name, tool_input, command, ... }` with no server identifier.

Fix:
- buildCyrusPermissionsConfig now accepts the SDK-shaped mcpServers map
  and emits a `mcpServers: [{name, commandLine|url}]` lookup table into
  cyrus-permissions.json (alongside allow/deny).
- permission-check.mjs reads the lookup and resolves the logical server
  from `payload.command` (stdio) or `payload.url` (http) before
  building the candidate.
- For beforeMCPExecution we now emit two candidates:
    Mcp(<server>:<tool_name>)   — when server lookup succeeds
    Mcp(<tool_name>)            — always, as a fallback / for unscoped patterns
  Patterns like `Mcp(linear:*)` and `Mcp(linear:save_comment)` now match
  for the configured server.

CursorRunner passes the mcpConfig (already mapped to SDK shape) into
buildCyrusPermissionsConfig at session start.
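
A sketch of the lookup and the two-candidate emission. The names `McpServerEntry`, `resolveServer`, `mcpCandidates`, and `matchesPattern` are hypothetical, exact string equality for the transport match is an assumption, and the pattern matcher only handles a trailing `*` for the sake of the example.

```typescript
// One row of the lookup table written into cyrus-permissions.json.
interface McpServerEntry {
  name: string;
  commandLine?: string;
  url?: string;
}

// Fields of the beforeMCPExecution payload that the helper reads.
interface McpHookPayload {
  tool_name: string;
  command?: string;
  url?: string;
}

// Resolves the logical server name from the transport, if it is known.
function resolveServer(payload: McpHookPayload, servers: McpServerEntry[]): string | undefined {
  for (const server of servers) {
    if (payload.command !== undefined && server.commandLine === payload.command) return server.name;
    if (payload.url !== undefined && server.url === payload.url) return server.name;
  }
  return undefined;
}

// Server-scoped candidate first (when resolvable), bare tool name always.
function mcpCandidates(payload: McpHookPayload, servers: McpServerEntry[]): string[] {
  const candidates: string[] = [];
  const server = resolveServer(payload, servers);
  if (server !== undefined) candidates.push("Mcp(" + server + ":" + payload.tool_name + ")");
  candidates.push("Mcp(" + payload.tool_name + ")");
  return candidates;
}

// Toy matcher: exact match, or prefix match for patterns ending in "*)".
function matchesPattern(pattern: string, candidate: string): boolean {
  if (pattern.slice(-2) === "*)") return candidate.indexOf(pattern.slice(0, -2)) === 0;
  return pattern === candidate;
}

function isAllowed(candidates: string[], allow: string[]): boolean {
  return candidates.some((c) => allow.some((p) => matchesPattern(p, c)));
}
```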

Tests: three new permission-check cases (server-scoped allow via
command lookup, via url lookup, and deny when no server matches).
All 31 cursor-runner tests pass; 611/611 edge-worker tests pass.
F1 sanity session (Read README.md/package.json) still completes
subtype=success.

* feat(cursor): wire SDK 1.0.11 sandbox; translate Claude SandboxSettings to .cursor/sandbox.json

@cursor/sdk@1.0.11 ships an auto-discoverable `cursorsandbox` helper via
platform optional deps (`@cursor/sdk-<platform>-<arch>`). The previous
ConfigurationError on macOS is gone — `local.sandboxOptions: { enabled: true }`
now engages Apple Seatbelt / Linux Landlock as designed. Verified by
running real sandboxed sessions:
  CURSOR_SANDBOX=seatbelt observed in agent shell env.
  Off-workspace home write blocked, /etc reads allowed (default policy),
  outbound network blocked unless allow-listed in .cursor/sandbox.json,
  workspace writes succeed.

Implementation:

- Bump @cursor/sdk to ^1.0.11.
- Add .npmrc public-hoist-pattern for @cursor/sdk-* / @cursor/february-*
  so pnpm exposes the platform binary at <root>/node_modules/@cursor/...
  where the SDK's `resolvePlatformPackageBinary` walk-up search can find
  it. (Without this, pnpm strict mode keeps it under .pnpm/ and the SDK
  silently falls back to "sandboxing not supported".)
- New `packages/cursor-runner/src/sandbox.ts`:
    - `buildCursorSandboxJson({workspace, sandboxSettings, egressCaCertPath,
      additionalReadwritePaths})` translates the Claude SDK SandboxSettings
      shape into the Cursor sandbox.json schema. Mapping:
        filesystem.allowWrite -> additionalReadwritePaths
        filesystem.allowRead  -> additionalReadonlyPaths
        network.allowedDomains -> networkPolicy.allow
        network.deniedDomains  -> networkPolicy.deny
        network.{httpProxyPort,socksProxyPort} -> + 127.0.0.1/::1/localhost
                                                  to allow list
        egressCaCertPath -> + readonly path so child processes can read it
      Default policy is `workspace_readwrite` + `networkPolicy.default: "deny"`,
      mirroring Claude's "block by default, allow what's needed" model.
    - `buildSandboxEnv` returns the env vars to set on `process.env` so
      sandboxed shell tools inherit cert trust + proxy hints
      (NODE_EXTRA_CA_CERTS, SSL_CERT_FILE, GIT_SSL_CAINFO, REQUESTS_CA_BUNDLE,
      PIP_CERT, CURL_CA_BUNDLE, CARGO_HTTP_CAINFO, AWS_CA_BUNDLE, DENO_CERT,
      HTTP_PROXY/HTTPS_PROXY/ALL_PROXY).
- `CursorRunnerConfig` gains `sandboxSettings?: CursorSandboxInput` (a
  structurally-compatible subset of Claude's SandboxSettings; defined
  locally to avoid a hard dep on cyrus-claude-runner) and
  `egressCaCertPath?: string`. Drops the deprecated `sandbox: "enabled"|
  "disabled"` string flag — no callers used it.
- CursorRunner:
    - Sets `local.sandboxOptions: { enabled: <sandboxSettings.enabled> }`
      on Agent.create / Agent.resume.
    - At session start, writes `<workspace>/.cursor/sandbox.json` (with
      backup/restore symmetric to .cursor/hooks.json), and snapshots-and-
      sets the sandbox env vars on process.env.
    - At session end, removes the file (restoring any backup) and
      restores the prior env values.
- RunnerConfigBuilder: drops the legacy CYRUS_SANDBOX env-var passthrough
  for cursor; instead forwards the same `sandboxSettings` and
  `egressCaCertPath` it already gives Claude. The cursor runner does the
  schema translation internally.
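
The settings translation listed above can be sketched roughly as below. This is an illustration under assumptions, not the contents of sandbox.ts: `buildSandboxJsonSketch` is a hypothetical name, returning `null` when the sandbox is disabled is an assumption, dropping the workspace from the read-write list is an assumption about what "dedup of workspace" means, and the key that carries the `workspace_readwrite` policy is not stated in the commit message and is therefore omitted.

```typescript
// Structurally-compatible subset of the input, as described above.
interface CursorSandboxInput {
  enabled?: boolean;
  filesystem?: { allowWrite?: string[]; allowRead?: string[] };
  network?: {
    allowedDomains?: string[];
    deniedDomains?: string[];
    httpProxyPort?: number;
    socksProxyPort?: number;
  };
}

// Output fields named in the mapping table of the commit message.
interface CursorSandboxJson {
  additionalReadwritePaths: string[];
  additionalReadonlyPaths: string[];
  networkPolicy: { default: "deny"; allow: string[]; deny: string[] };
}

const LOOPBACK_HOSTS = ["127.0.0.1", "::1", "localhost"];

function unique(values: string[]): string[] {
  return values.filter((v, i) => values.indexOf(v) === i);
}

function buildSandboxJsonSketch(opts: {
  workspace: string;
  sandboxSettings?: CursorSandboxInput;
  egressCaCertPath?: string;
}): CursorSandboxJson | null {
  const settings = opts.sandboxSettings;
  if (!settings || !settings.enabled) return null; // assumption: nothing to write

  const net = settings.network;
  const usesProxy = net?.httpProxyPort !== undefined || net?.socksProxyPort !== undefined;

  // The CA cert must be readable by sandboxed child processes.
  const readonlyPaths = (settings.filesystem?.allowRead ?? []).slice();
  if (opts.egressCaCertPath) readonlyPaths.push(opts.egressCaCertPath);

  return {
    // Assumption: the workspace is already writable under the default policy.
    additionalReadwritePaths: unique(settings.filesystem?.allowWrite ?? []).filter(
      (p) => p !== opts.workspace,
    ),
    additionalReadonlyPaths: unique(readonlyPaths),
    networkPolicy: {
      default: "deny", // block by default, allow what is needed
      allow: unique((net?.allowedDomains ?? []).concat(usesProxy ? LOOPBACK_HOSTS : [])),
      deny: unique(net?.deniedDomains ?? []),
    },
  };
}
```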

Caveats (documented inline in sandbox.ts):

- `filesystem.denyRead` / `denyWrite` from Claude's settings are accepted
  but not enforced — Cursor's `workspace_readwrite` policy doesn't expose
  per-path denies under the default profile. Use `.cursor/hooks.json`
  (the existing Cyrus permission-check helper) for fine-grained read
  blocking when needed.
- Sandbox features Claude exposes that Cursor's sandbox.json doesn't:
  `network.allowAllUnixSockets`, `allowMachLookup`, `allowLocalBinding`.
  The default Cursor profile covers most of these implicitly.

Tests:

- New sandbox.test.ts: 10 unit tests covering filesystem/network mapping,
  proxy-port loopback injection, CA cert path, dedup of workspace, and
  empty-when-disabled.
- New CursorRunner.test.ts cases: verifies sandbox.json is written when
  enabled, sandboxOptions.enabled flows through to Agent.create, and
  process.env is restored after the session ends.

Verification:
- pnpm --filter cyrus-cursor-runner test:run -> 43/43 pass
- pnpm --filter cyrus-edge-worker  test:run -> 611/611 pass
- F1 drive with CYRUS_SANDBOX=1: sandbox policy installed at session
  start (allowReadwrite=4, networkAllow=3 incl. loopback), workspace
  reads completed, session subtype=success.
- F1 drive with sandbox disabled: still completes subtype=success.
- Live SDK learning tests confirm Apple Seatbelt engages on macOS,
  off-workspace writes blocked, network deny-default + allowlist works.

* fix(cursor): defer MCP allow check from preToolUse to beforeMCPExecution (CYPACK-1154/1155)

The previous MCP fix (commit ace17a7) added server-name lookup at
beforeMCPExecution, but production sessions on CYPACK-1154 and
CYPACK-1155 still showed "no allow rule matched MCP:get_issue" — and
the agent's own narration revealed the deny came from preToolUse, not
beforeMCPExecution.

Why: confirmed via a learning test that captures both events for the
same MCP call. The SDK fires:
  preToolUse:           tool_name="MCP:get_issue"   command=undefined  url=undefined
  beforeMCPExecution:   tool_name="get_issue"       command="node /path/to/server.mjs"

At preToolUse there is NO transport identifier, so we cannot resolve
the logical server name (e.g. "linear") to evaluate `Mcp(linear:*)`.
Our helper was emitting a candidate `Tool("MCP:get_issue")`, which
naturally matches none of the standard `Tool(Read|Shell|Write)` allows
nor `Mcp(linear:*)` — denied at the first hook.

Fix: when preToolUse arrives with a tool_name starting with `MCP:`,
emit no candidates, which falls through to "allow" in the helper. The
subsequent beforeMCPExecution event has full server context and runs
the existing server-scoped check (`Mcp(linear:save_comment)` /
`Mcp(linear:*)`). Net effect: the actual permission gate for MCP
tools is the second hook, which is the only place we can correctly
evaluate it.
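
The deferral amounts to emitting no candidates at the first hook. A minimal sketch, with hypothetical helper names (`preToolUseCandidates`, `decide`) and exact-match rule comparison for simplicity:

```typescript
type Decision = "allow" | "deny";

// At preToolUse, MCP calls carry no transport identifier, so no candidate
// is produced and the check is left to beforeMCPExecution.
function preToolUseCandidates(toolName: string): string[] {
  if (toolName.indexOf("MCP:") === 0) return [];
  return ["Tool(" + toolName + ")"];
}

// An empty candidate list means this hook has nothing to evaluate.
function decide(candidates: string[], allow: string[]): Decision {
  if (candidates.length === 0) return "allow";
  return candidates.some((c) => allow.indexOf(c) !== -1) ? "allow" : "deny";
}
```

With this shape, an allow list that only contains `Mcp(linear:*)` no longer denies `MCP:get_issue` at the first hook, while non-MCP tools are still gated there.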

Note on rollout: the artifacts under <worktree>/.cursor/ (hooks.json,
cyrus-permission-check.mjs, cyrus-permissions.json) are rewritten
fresh at every session start by CursorRunner.installPermissionsArtifacts.
So once this commit is built and the Cyrus daemon process restarts to
load the new dist, every new Cursor session picks up the fix
automatically — no per-worktree cleanup needed.

Test:
  permission-check.test.ts — new regression case asserting that
  preToolUse with tool_name="MCP:get_issue" returns allow when the
  user's allow list scopes via Mcp(linear:*). 44/44 cursor-runner
  tests pass; 611/611 edge-worker tests pass.

* Bump @openai/codex-sdk to 0.125.x (CYPACK-1151) (#1171)

* Bump @openai/codex-sdk to ^0.125.0

Aligns the Codex runner with OpenAI Codex CLI 0.125.x bundled by the SDK
(including additive usage metadata such as reasoning output tokens).

Closes Cyrus assessment for CYPACK-1151.

Made-with: Cursor

* Changelog: link PR #1171 for Codex SDK bump (CYPACK-1151)

Made-with: Cursor

* Fix Biome errors after merging main

* CI: drop Node 18 from matrix

* fix(cursor-runner): allow sqlite3 install script to run under pnpm 10 (CYPACK-1158) (#1174)

* fix(cursor-runner): allow sqlite3 install script to run under pnpm 10 (CYPACK-1158)

@cursor/sdk@1.0.11 pulls sqlite3@5.1.7 as a runtime dep. Its native
node_sqlite3.node binding is fetched by an install lifecycle script,
which pnpm 10 blocks by default. Without it, sqlite3 is "installed"
but missing its .node binding, and Cursor sessions crash on first
import with "Could not locate the bindings file".

Adds sqlite3 to pnpm.onlyBuiltDependencies so the install script runs
on pnpm install.

* docs: add PR link to CYPACK-1158 changelog entry

* fix(deps): override tar to >=7.5.11 to patch sqlite3 advisories (CYPACK-1159) (#1175)

The @cursor/sdk → sqlite3 → tar@6.2.1 chain introduced in CYPACK-1149
was flagged for 6 high-severity path-traversal/hardlink/symlink CVEs.
sqlite3@5.1.7 is the latest release and pins tar^6, so a root override
is the only way to reach the patched transitive.

* Update @anthropic-ai/claude-agent-sdk to v0.2.123 and @anthropic-ai/sdk to ^0.91.0 (CYPACK-1152) (#1172)

* chore(deps): update @anthropic-ai/claude-agent-sdk to v0.2.123 and @anthropic-ai/sdk to ^0.91.0 (CYPACK-1152)

Bumps @anthropic-ai/claude-agent-sdk from 0.2.117 to 0.2.123 across all
packages and @anthropic-ai/sdk from ^0.90.0 to ^0.91.0. Removes LSP from
availableTools in config.ts — LSP is no longer shipped in claude-agent-sdk
v0.2.123. Updates the corresponding test fixtures.

* chore: add PR link to CHANGELOG for CYPACK-1152

* Prepare release v0.2.50 (#1176)

* Prepare release v0.2.51 (#1177)

* test: add learning tests for memory-pressure gate (CYPACK-1165)

Pin down behavior of the OOM-preflight gate and concurrency cap added
in feat/oom-preflight-gate:

- memory-health: partial-threshold semantics, no-threshold no-op,
  free-vs-heap precedence, strict comparison boundaries, metrics
  snapshot on rejection
- memory-gate-schema: Zod constraints on MemoryGateConfigSchema and
  EdgeConfigSchema.maxConcurrentRunners
- runner-gate: gate invoked once per event, userMessage propagated
  verbatim, onWebhookEnd fires on reject

* fix: propagate memoryGate and maxConcurrentRunners through CLI + ConfigManager (CYPACK-1165)

The CLI builder in WorkerService.ts and the hot-reload merge in
ConfigManager.ts both assembled EdgeWorkerConfig field-by-field. New
EdgeConfig fields like memoryGate and maxConcurrentRunners were loaded
from disk but silently dropped before reaching EdgeWorker, so the
runner gate added in feat/oom-preflight-gate never fired in production.
detectGlobalConfigChanges had the same hand-maintained whitelist, so
changing those fields at runtime wouldn't have triggered a hot-reload
even after the propagation fix.

Refactored both call sites to spread the on-disk config first, then
overlay runtime/handler fields and env-var overrides. New EdgeConfig
keys now flow through structurally without code changes here.

- apps/cli/src/services/WorkerService.ts: spread edgeConfig at the top,
  layer caller/runtime fields, layer env precedence last.
- packages/edge-worker/src/config-merge.ts: extract pure mergeEdgeConfig
  and hasGlobalConfigChanges helpers; the latter computes the diff key
  set from the live objects via a runtime-only-keys denylist.
- packages/edge-worker/src/ConfigManager.ts: delegate to the helpers.
- packages/edge-worker/test/config-merge.test.ts: 14 regression tests
  pinning down propagation, legacy-alias resolution, and structural
  change detection.
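
The spread-then-overlay approach can be sketched as below. The signatures are hypothetical (the real `mergeEdgeConfig` and `hasGlobalConfigChanges` may take different arguments), `handlers` is an invented example of a runtime-only key, and comparing values through `JSON.stringify` is a simplification.

```typescript
type Config = Record<string, unknown>;

// On-disk config first, then runtime/handler fields, env overrides last.
function mergeEdgeConfig(onDisk: Config, runtime: Config, envOverrides: Config): Config {
  return { ...onDisk, ...runtime, ...envOverrides };
}

// Hypothetical denylist of keys that never count as a config change.
const RUNTIME_ONLY_KEYS: string[] = ["handlers"];

// Derives the key set from the live objects rather than a fixed allowlist,
// so newly added config fields are compared automatically.
function hasGlobalConfigChanges(prev: Config, next: Config): boolean {
  const keys = Object.keys(prev).concat(Object.keys(next));
  for (const key of keys) {
    if (RUNTIME_ONLY_KEYS.indexOf(key) !== -1) continue;
    if (JSON.stringify(prev[key]) !== JSON.stringify(next[key])) return true;
  }
  return false;
}
```

Because nothing is copied field by field, a new key such as `memoryGate` reaches the merged config and participates in change detection without further edits to either helper.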

* fix: post memory-pressure rejection as a response activity with Stop signal (CYPACK-1165)

When the runner gate rejects a new Linear session, the user-facing
"Cyrus is at capacity / temporarily out of capacity" message was
being emitted as a thought activity. Linear treats thoughts as
intermediate updates, so the session stayed open after the rejection
even though no runner was ever spawned to follow up.

Emit the rejection as a response activity with AgentActivitySignal.Stop
so Linear treats it as the session's terminal message and closes the
agent session.

* fix: drop Stop signal from memory-pressure rejection (CYPACK-1165)

Linear's API rejects signal=stop on response-type activities — that
signal is only allowed on prompt activities. The Stop signal was
added in the previous commit to terminate the agent session in the
Linear UI, but it triggered:

  Invalid signal: "stop" is only allowed for prompt type activities.

Drop the signal. The response-type activity by itself is sufficient
to render the rejection as the session's terminal message in Linear.

* refactor: collapse memoryGate to a single intuitive knob (CYPACK-1165)

The previous shape exposed four sub-fields:

  memoryGate: {
    enabled: true,
    maxRssPercent: 0.75,
    minAvailableMemoryMb: 500,
    maxHeapUsagePercent: 0.85
  }

Operators had to reason about three thresholds and a separate enable
flag, and the absolute MB threshold required retuning per host size.

Replace with a single value (boolean | number):

  memoryGate: true     # enabled at default 85% pressure threshold
  memoryGate: 0.9      # enabled at custom threshold
  memoryGate: false    # disabled (also: omit entirely)

Internally, 'pressure' is the worst of three normalized dimensions:
process RSS, V8 heap, and used system memory. The single percentage
captures all three concerns from the legacy config and is portable
across host sizes (no absolute-MB knob to retune).

The rejection reason still names the dominant dimension (RSS / heap /
system memory) so operator logs remain diagnosable.
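
A sketch of how the single knob and the "worst of three" pressure value might fit together. `resolveThreshold`, `evaluateMemoryGate`, and `MemorySnapshot` are hypothetical names. How each dimension is normalized to a 0..1 value is not shown in the commit message, so the snapshot is taken as an input, and rejecting only when pressure strictly exceeds the threshold is an assumption.

```typescript
type MemoryGateConfig = boolean | number;

const DEFAULT_MEMORY_PRESSURE_THRESHOLD = 0.85;

// Returns null when the gate is disabled, otherwise the active threshold.
function resolveThreshold(gate: MemoryGateConfig | undefined): number | null {
  if (gate === undefined || gate === false) return null;
  if (gate === true) return DEFAULT_MEMORY_PRESSURE_THRESHOLD;
  if (!(gate > 0 && gate <= 1)) {
    throw new RangeError("memoryGate must be a boolean or a number in (0, 1]");
  }
  return gate;
}

// Each field is assumed to be already normalized to the range 0..1.
interface MemorySnapshot {
  rssPercent: number;
  heapPercent: number;
  systemUsedPercent: number;
}

type Dimension = "RSS" | "heap" | "system memory";

function evaluateMemoryGate(
  gate: MemoryGateConfig | undefined,
  snapshot: MemorySnapshot,
): { allowed: boolean; pressure: number; dominant: Dimension } {
  const dims: Array<[Dimension, number]> = [
    ["RSS", snapshot.rssPercent],
    ["heap", snapshot.heapPercent],
    ["system memory", snapshot.systemUsedPercent],
  ];
  // Pressure is the worst (largest) of the three dimensions.
  let dominant = dims[0];
  for (const dim of dims) {
    if (dim[1] > dominant[1]) dominant = dim;
  }
  const threshold = resolveThreshold(gate);
  // Assumption: reject only when pressure strictly exceeds the threshold.
  const allowed = threshold === null || dominant[1] <= threshold;
  return { allowed, pressure: dominant[1], dominant: dominant[0] };
}
```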

- core/src/memory-health.ts: type MemoryGateConfig = boolean | number;
  export DEFAULT_MEMORY_PRESSURE_THRESHOLD = 0.85; collectMemoryMetrics
  computes systemUsedPercent and pressure.
- core/src/config-schemas.ts: MemoryGateConfigSchema is now z.union of
  boolean and a (0,1] number; legacy verbose object form is rejected.
- core/test/memory-health.test.ts + memory-gate-schema.test.ts:
  rewritten around the new shape (29 tests).
- edge-worker/test/config-merge.test.ts: migrated regression cases.
- CHANGELOG: rewritten unreleased entry to describe the new knob.
- core/schemas/*.json: regenerated.

---------

Co-authored-by: Connor Turland <1409121+Connoropolous@users.noreply.github.com>
Co-authored-by: cyrusagent <237105008+cyrusagent[bot]@users.noreply.github.com>
Co-authored-by: Cyrus Agent <agentclear@anthropic.com>
Co-authored-by: Payton Webber <53197664+PaytonWebber@users.noreply.github.com>
# Conflicts:
#	CHANGELOG.md
#	packages/edge-worker/src/ConfigManager.ts
… message

When the memory-pressure gate rejects a new session, the user-facing
message now discloses how many Cyrus sessions are already running on the
host. This helps users see whether the pressure is coming from Cyrus's
own workload or from elsewhere on the box.