feat: memory-pressure gate and concurrency cap for new sessions #1138
Open
PaytonWebber wants to merge 7 commits into
Cross-platform pre-flight checks (Linux + macOS) that reject new runner creation when the host is under memory pressure or already at a configured session ceiling. A rejected request posts a user-visible "temporarily out of capacity" message to Linear, GitHub, GitLab, or Slack, instead of the process being silently OOM-killed by systemd-oomd or the kernel.

- New `memoryGate` config: `maxRssPercent`, `minAvailableMemoryMb`, `maxHeapUsagePercent` (opt-in via `enabled: true`)
- New `maxConcurrentRunners` config: hard cap on concurrent runners, counted live via `isRunning()` across Linear + chat runners
- Gates wired into all 4 runner-creation paths: Linear agentSessionCreated/Prompted, GitHub webhook, GitLab webhook, Slack chat handler
- Uses only `os`, `v8`, `process` Node built-ins for cross-platform parity
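A minimal sketch of how such a memory check could look, using only the Node built-ins named above. The three threshold names and `enabled` come from the description; the `MemorySnapshot` shape, the function names, and the use of `os.freemem()` as a stand-in for "available" memory are assumptions for illustration, not the PR's actual code.

```typescript
import * as os from "node:os";
import * as v8 from "node:v8";

// Threshold names come from the PR description; everything else is hypothetical.
interface MemoryGateConfig {
  enabled?: boolean;
  maxRssPercent?: number;
  minAvailableMemoryMb?: number;
  maxHeapUsagePercent?: number;
}

// Hypothetical snapshot shape so the decision logic stays pure and testable.
interface MemorySnapshot {
  totalBytes: number;
  availableBytes: number;
  rssBytes: number;
  heapUsedBytes: number;
  heapLimitBytes: number;
}

interface MemoryCheckResult {
  ok: boolean;
  reason?: string; // operator-facing detail, not shown to end users
}

// Reads host and process memory with Node built-ins only.
// Assumption: os.freemem() approximates "available" memory.
function readMemorySnapshot(): MemorySnapshot {
  const heap = v8.getHeapStatistics();
  return {
    totalBytes: os.totalmem(),
    availableBytes: os.freemem(),
    rssBytes: process.memoryUsage().rss,
    heapUsedBytes: heap.used_heap_size,
    heapLimitBytes: heap.heap_size_limit,
  };
}

// Returns ok: false as soon as one configured threshold is violated.
function checkMemory(snapshot: MemorySnapshot, config?: MemoryGateConfig): MemoryCheckResult {
  if (!config?.enabled) return { ok: true }; // the gate is opt-in

  const rssPercent = (snapshot.rssBytes / snapshot.totalBytes) * 100;
  if (config.maxRssPercent !== undefined && rssPercent > config.maxRssPercent) {
    return { ok: false, reason: `rss ${rssPercent.toFixed(1)}% > ${config.maxRssPercent}%` };
  }

  const availableMb = snapshot.availableBytes / (1024 * 1024);
  if (config.minAvailableMemoryMb !== undefined && availableMb < config.minAvailableMemoryMb) {
    return { ok: false, reason: `available ${availableMb.toFixed(0)}MB < ${config.minAvailableMemoryMb}MB` };
  }

  const heapPercent = (snapshot.heapUsedBytes / snapshot.heapLimitBytes) * 100;
  if (config.maxHeapUsagePercent !== undefined && heapPercent > config.maxHeapUsagePercent) {
    return { ok: false, reason: `heap ${heapPercent.toFixed(1)}% > ${config.maxHeapUsagePercent}%` };
  }

  return { ok: true };
}
```

A caller would run `checkMemory(readMemorySnapshot(), config.memoryGate)` before creating a runner and post the capacity message when `ok` is false.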
# Conflicts: # CHANGELOG.md
- memory-health.ts: `config?: MemoryGateConfig` instead of `| undefined`
- memory-health.ts: drop raw technical reason from user-facing message (stays in `MemoryCheckResult.reason` for operator logs)
- EdgeWorker.ts: log swallowed `runner.isRunning()` errors at debug level so a broken runner slipping under the cap is diagnosable
- EdgeWorker.ts: extract `enforceRunnerGate(onReject)` helper; collapse the four near-identical gate blocks (GitHub, GitLab, Linear, resume) into callsites that just pass a platform-specific rejection closure
- chat-sessions.test.ts: implement `notifyUnavailable` on TestChatAdapter (required by ChatPlatformAdapter since this PR)
- runner-gate.test.ts: new test file covering the gate wiring end-to-end via ChatSessionHandler: gate trips → notifyUnavailable called, runner not spawned; gate passes → runner spawned; no gate → runner spawned
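The concurrency cap and the shared gate helper could be sketched roughly as follows. Only `isRunning()`, `maxConcurrentRunners`, and the idea of passing a rejection closure to `enforceRunnerGate` come from the notes above; the `RunnerLike` and `GateInput` shapes, `countRunningRunners`, and the synchronous signature are assumptions made to keep the example short.

```typescript
// Hypothetical minimal runner shape; only isRunning() is named in the PR.
interface RunnerLike {
  isRunning(): boolean;
}

// Counts live runners. A runner whose isRunning() throws is not counted,
// but the error is logged at debug level so it stays diagnosable.
function countRunningRunners(
  runners: Iterable<RunnerLike>,
  logDebug: (message: string) => void,
): number {
  let running = 0;
  for (const runner of runners) {
    try {
      if (runner.isRunning()) running += 1;
    } catch (error) {
      logDebug(`isRunning() threw: ${String(error)}`);
    }
  }
  return running;
}

// Hypothetical input bundle for the gate.
interface GateInput {
  runners: Iterable<RunnerLike>;
  maxConcurrentRunners?: number; // undefined means "no cap configured"
  logDebug: (message: string) => void;
}

// Returns true when a new runner may be created. Otherwise calls the
// platform-specific rejection closure and returns false.
// Assumption: synchronous for brevity; a real helper would likely be async.
function enforceRunnerGate(input: GateInput, onReject: (reason: string) => void): boolean {
  if (input.maxConcurrentRunners === undefined) return true;
  const running = countRunningRunners(input.runners, input.logDebug);
  if (running >= input.maxConcurrentRunners) {
    onReject(`${running} runners active, cap is ${input.maxConcurrentRunners}`);
    return false;
  }
  return true;
}
```

Each callsite then differs only in its closure, for example one that posts to Slack and one that comments on a GitHub issue.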
* fix: close streaming prompt on result unless warm sessions enabled (#1141) * fix: close streaming prompt on result unless warm sessions enabled Before #1109, ClaudeRunner called streamingPrompt.complete() when the SDK emitted a result message, which closed the async iterable and let the for-await loop exit so the subprocess could shut down at turn end. That close was removed to keep sessions warm for follow-up messages. The pre-warm PR also added CYRUS_ENABLE_WARM_SESSIONS as an opt-in gate for startup pre-warming and warm-instance attach, but the streaming-prompt close behavior was not gated with it. Result: sessions kept their subprocess alive after every turn even when warm sessions were disabled. Thread the gate into ClaudeRunner via an optional keepSessionWarm constructor arg (passed from EdgeWorker.createRunnerForType using isWarmSessionsEnabled()). When false, complete the streaming prompt on result as before; when true, leave it open for follow-ups. Result-message emission stays in-loop in both modes — no revert of the pendingResultMessage deferral. * docs: clarify subprocess-exit motivation in changelog entry * feat: add base branch change notification and blocked-by dependency deferral (CYPACK-978) (#1004) * feat: add base branch change notification and blocked-by dependency deferral (CYPACK-978) Add two session lifecycle features: 1. Base branch push notification: GitHub push webhooks on tracked base branches stream rebase notifications to active sessions via addStreamMessage. 2. Blocked-by dependency deferral: Issues with unresolved blocked-by relations are parked instead of starting a runner. Parked sessions wake automatically when blocking issues complete, or on user re-prompt if blockers clear. 
* docs: update changelogs for CYPACK-978 (#1004) * refactor: replace as Exclude type assertions with known literal eventType values Each private translate method is only called from a specific if-branch where the eventType is already known, so we can use the exact string literal instead of casting via Exclude<typeof event.eventType, "push">. --------- Co-authored-by: cyrusagent <237105008+cyrusagent[bot]@users.noreply.github.com> Co-authored-by: Cyrus Agent <agentclear@anthropic.com> Co-authored-by: Connor Turland <1409121+Connoropolous@users.noreply.github.com> * feat: add /linear-webhook endpoint with deprecated /webhook alias (CYPACK-1119) (#1142) * chore: update @anthropic-ai/claude-agent-sdk to v0.2.117 (CYPACK-1120) (#1143) * chore: update @anthropic-ai/claude-agent-sdk to v0.2.117 (CYPACK-1120) Bumps claude-agent-sdk from 0.2.116 to 0.2.117 across all packages (claude-runner, core, edge-worker, simple-agent-runner). This updates the bundled Claude Code binary from v2.1.116 to v2.1.117, a parity release with no tool-list changes. Also fixes scripts/extract-claude-tools.sh to work with the new native binary SDK structure introduced in v0.2.113: the SDK no longer ships a bundled cli.js but instead installs platform-specific optional dependencies (e.g. @anthropic-ai/claude-agent-sdk-darwin-arm64). The script now resolves the binary by walking from the SDK package to its optional platform dep. Tool list verified unchanged against Claude Code v2.1.117 binary output. * chore: add PR number to CHANGELOG entry for CYPACK-1120 * fix: post chat-session replies from result messages under warm mode (#1145) * fix: post chat-session replies from result messages, not startStreaming resolution With CYRUS_ENABLE_WARM_SESSIONS=1, ClaudeRunner keeps the streaming prompt open across turns so the subprocess stays warm for follow-up messages. 
As a side effect, runner.startStreaming() never resolves until the entire session is torn down. But ChatSessionHandler awaited startStreaming() and then called adapter.postReply() inline, so Slack/GitHub chat sessions never posted a reply at all under warm mode.

Decouple reply posting from session termination:

- Maintain a FIFO queue of pending reply events per sessionId. Enqueue at each entry point that sends a prompt: new session, resume, and follow-up injection via addStreamMessage (which was also a fire-and-forget path with no reply today, regardless of warm mode).
- In handleAgentMessage, when the SDK emits a `result` message, dequeue the oldest event for that session and call adapter.postReply() with it. Each turn's reply pairs with the prompt that triggered it.
- Kick off start()/startStreaming() as a non-awaited promise; log the resolved sessionInfo and forward errors via the existing logger. The handler's handleEvent() returns once the prompt has been dispatched, so onWebhookEnd fires promptly under warm mode.

No completeStream() calls: the streaming prompt intentionally stays open under warm mode so addStreamMessage follow-ups continue to work.

* fix: clear pending reply queue when chat runner errors before result

Codex review feedback on #1145: enqueueReply() runs before start/startStreaming, but the error catch only logged and left the queued event in place. If the runner dies before emitting a result (e.g. SDK error message terminates the session), the stale event stays at the head of pendingReplyEvents; a later resumeSession() on the same sessionId would then pair it with the new runner's first result and shift every subsequent reply by one turn.

Add clearPendingReplies(sessionId) and call it from both the initial-turn and resume error paths. Logs the count discarded so drift is visible.
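The per-session FIFO described in these two commits can be illustrated with a small queue class. The names `enqueueReply`, `clearPendingReplies`, and `pendingReplyEvents` appear in the commit messages; the class wrapper, `dequeueReply`, and the generic event type are hypothetical and only meant to show the pairing and cleanup behaviour.

```typescript
// Hypothetical wrapper around the per-session reply queue.
class PendingReplyQueue<E> {
  private readonly pendingReplyEvents = new Map<string, E[]>();

  // Called at every entry point that sends a prompt to the runner.
  enqueueReply(sessionId: string, event: E): void {
    const queue = this.pendingReplyEvents.get(sessionId) ?? [];
    queue.push(event);
    this.pendingReplyEvents.set(sessionId, queue);
  }

  // Called when a `result` message arrives: returns the oldest pending
  // event for the session, so each reply pairs with its own prompt.
  dequeueReply(sessionId: string): E | undefined {
    const queue = this.pendingReplyEvents.get(sessionId);
    const event = queue?.shift();
    if (queue && queue.length === 0) this.pendingReplyEvents.delete(sessionId);
    return event;
  }

  // Called on runner error paths: drops stale events and returns how many
  // were discarded so the caller can log the count.
  clearPendingReplies(sessionId: string): number {
    const discarded = this.pendingReplyEvents.get(sessionId)?.length ?? 0;
    this.pendingReplyEvents.delete(sessionId);
    return discarded;
  }
}
```

Without the clear step, an event left behind by a failed turn would be dequeued by the next successful result and every later reply would be off by one.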
* fix: de-flake EgressProxy CI tests (CYPACK-1122) (#1147)

* fix: de-flake EgressProxy CI tests (CYPACK-1122)

Two root causes behind the flaky `EgressProxy` suite in CI:

1. Port collision on EADDRINUSE (127.0.0.1:19281): the test allocated ports via `19080 + Math.floor(Math.random() * 1000)`, a narrow range that occasionally collides with another process or a port still in TIME_WAIT. Tests now bind to port 0 and read the OS-assigned ephemeral port via `proxy.getHttpProxyPort()` / `getSocksProxyPort()`. `EgressProxy.startHttpProxy` / `startSocksProxy` update the stored port to the actual bound `server.address().port` after `listen`.
2. SOCKS5 `blocks non-allowed domains` race: the server wrote the denial reply then immediately destroyed the socket. Because `socket.write` is async and `socket.destroy` aborts pending I/O, the reply could be truncated before reaching the client, which then timed out. Replaced `socket.write(reply); socket.destroy()` with `socket.end(reply)` so the reply is flushed before FIN in all three SOCKS5 error paths.

* chore: add PR link to CYPACK-1122 changelog entry

* chore: remove stale package-lock.json (fixes Dependabot picomatch alert)

This repo is pnpm-managed (`packageManager: pnpm@10.13.1`, with `pnpm-lock.yaml` as the source of truth). A stale `package-lock.json` was accidentally committed and pinned a vulnerable `picomatch@2.3.1` (GHSA-3v7f-55p6-f55p), which Dependabot has been flagging. This removes the stray lockfile and gitignores `package-lock.json` / `yarn.lock` so npm/yarn lockfiles can't sneak back in.
* chore: bump pnpm to 10.33.1 * fix: spread process.env in claude-agent-sdk invocations (forward-port of v0.2.49 hotfix) (#1152) * fix: spread process.env in claude-agent-sdk invocations (#1150) * spread process.env again since the sdk https://github.com/anthropics/claude-agent-sdk-typescript/blob/main/CHANGELOG.md#02113 reverted to not overlaying * docs: changelog for HOME/env regression fix Document that re-spreading process.env restores HOME and other inherited env vars for Claude sessions, fixing GPG-signed commits, gh CLI auth, and other tools that depend on the user's shell environment. * docs: record 0.2.49 hotfix release in CHANGELOG The v0.2.49 hotfix was released from the `cypack-1123` branch. Add a corresponding `## [0.2.49]` section on main so the release history shows that it happened, and move the process.env bullet out of Unreleased into 0.2.49 (where it was actually shipped). * chore: bump package versions to 0.2.49 * feat(claude-runner): enable SDK debug mode on DEBUG log level (CYPACK-1124) (#1153) * feat(cli): optional Sentry error tracking with opt-out env var (CYPACK-1142) (#1163) * feat(cli): integrate optional Sentry error tracking (CYPACK-1142) Adds an ErrorReporter abstraction in cyrus-core (interface + Noop) and a Sentry-backed implementation in apps/cli. The reporter is initialised before Application bootstrap and wired into the existing uncaughtException / unhandledRejection handlers, so unhandled CLI errors are reported automatically. Configuration: - CYRUS_SENTRY_DSN — DSN to use (overrides the bundled default). - CYRUS_SENTRY_DISABLED — opt-out switch (1/true/yes/on). - CYRUS_SENTRY_ENVIRONMENT — environment tag (defaults to "production"). 
The bundled DEFAULT_SENTRY_DSN is left empty: the OAuth token used to scope this work cannot create the ceedar/cyrus-cli Sentry project, so an admin must create the project (team: cyrus, platform: node) and paste the DSN into apps/cli/src/services/createErrorReporter.ts to enable zero-config reporting for end users. * feat(core): forward all logger.error calls to ErrorReporter (CYPACK-1142) Makes the Sentry hook universal: every Logger.error(...) call across edge-worker, claude-runner (Claude Code session errors), GitHub/Slack/ Linear transports, persistence manager, etc. now forwards to the global ErrorReporter. The reporter is installed once during CLI bootstrap via setGlobalErrorReporter(); libraries imported without the CLI continue to use a Noop reporter, so test harnesses and SDK consumers are unaffected. The Logger extracts the first Error in the trailing args (also unwrapping `{ error: Error }` shapes used by transports) and reports it as an exception with component / sessionId / platform / issueIdentifier / repository tags pulled from LogContext. When no Error is present, the message is captured at "error" severity so silent failure paths still produce events. Application's uncaughtException / unhandledRejection handlers were collapsed onto the same path to avoid double-reporting now that Logger forwards automatically. * feat(cli): tag Sentry events with team_id from CYRUS_TEAM_ID (CYPACK-1142) Reads CYRUS_TEAM_ID at reporter construction and applies it as a global `team_id` tag via Sentry's initialScope, so every captured event (and every future logger.error forwarded through the reporter) is filterable per Cyrus tenant in Sentry without requiring tag plumbing at each capture site. Additional global tags can be added in buildInitialTags() in createErrorReporter.ts; capture-site context (component, sessionId, issueIdentifier, etc.) continues to flow via Logger.error. 
* fix(core): apply CYRUS_TEAM_ID tag in Logger.error forwarding (CYPACK-1142) Sentry's initialScope tags are overridden when call sites set per-event tags via withScope().setTag(), and Logger.error builds an explicit tag map for every forwarded event. So even though createErrorReporter set team_id via initialScope, every Logger-routed exception/message ended up without it. Introduces setGlobalErrorTags / getGlobalErrorTags in cyrus-core as the canonical place to register process-wide tags. Bootstrap mirrors the team_id from CYRUS_TEAM_ID into both Sentry's initialScope (for events emitted directly via the SDK) and the Logger registry (for events emitted via logger.error). Per-call context tags still win on collision so capture sites can override. * Set default Sentry DSN for error reporting * feat(sentry): scrub secrets, add fingerprint + sample-rate (CYPACK-1142) Address the three high-severity gaps from the PR review: 1. Secret scrubbing (sentryScrubber.ts) — beforeSend hook redacts sensitive keys (token, secret, authorization, etc.) and token-shaped strings at any depth in event extras/contexts/request. Wired into createErrorReporter so every Sentry-bound event is filtered. 2. Stable fingerprinting (Logger + ErrorReporterContext) — Logger.error now sets fingerprint = ["logger", component, templatized(message)], collapsing UUIDs/issue identifiers/paths/long numbers to placeholders so messages with embedded IDs no longer fragment Sentry groups. 3. CYRUS_SENTRY_SAMPLE_RATE env var — wired through createErrorReporter to SentryErrorReporter so high-volume deployments can downsample without code changes. * feat(sentry): structured cyrus context + extra-error/console integrations (CYPACK-1142) Per Payton's follow-up: configure Sentry for structured logging. - Every event now carries a structured `cyrus` context block (team_id, environment, release, plus optional linear_workspace from CYRUS_LINEAR_WORKSPACE and deployment_id from CYRUS_DEPLOYMENT_ID). 
Tags stay indexed/searchable; the context block groups the richer typed fields together in the Sentry UI. - Add extraErrorDataIntegration so Error subclass own-properties (e.g. err.statusCode, err.requestId, custom Cyrus error fields) surface as structured `extra` data instead of being dropped. - Add consoleIntegration so console.* output is captured as breadcrumbs giving each event a structured trail of the last log lines. Tests cover the new structured context fields and the bundled DSN fallback path. * feat(sentry): forward all logs to Sentry Logs with team_id (CYPACK-1142) Wire the existing logging abstraction (cyrus-core/logging/Logger) through to Sentry's structured Logs API (https://docs.sentry.io/product/explore/logs/), so every log line at every level — not just errors — is searchable in Sentry tagged with team_id and per-session metadata. - Bump @sentry/node 8.55.1 → 9.47.1 (Logs API requires v9.41+). - Add `enableLogs: true` to Sentry.init. - Extend ErrorReporter interface with `log(level, message, attrs?)`, ErrorReporterLogLevel and ErrorReporterLogAttributes types. Noop reporter implements as no-op; SentryErrorReporter dispatches to Sentry.logger.{trace,debug,info,warn,error,fatal} merging process-wide tags (team_id, …) into per-log attributes. - Logger.{debug,info,warn,error} now forward each call through reporter.log(...) attributing with component, sessionId, issueIdentifier, repository, and a primitive-summarised tail of trailing args. Errors continue to additionally capture as Sentry Issues for the alerting flow. - Tests cover all-levels forwarding, team_id attribute presence, error-arg summarisation, and the enableLogs init flag. * fix(sentry): gate SDK debug on CYRUS_SENTRY_DEBUG, not CYRUS_LOG_LEVEL (CYPACK-1142) Reported on PR: with CYRUS_LOG_LEVEL=DEBUG the terminal floods with "Sentry Logger [log]: [Tracing] Inheriting..." messages from @sentry/opentelemetry's auto-instrumentation. 
Those are SDK-internal diagnostics, not Cyrus app logs, and only fire when the SDK's debug flag is on. App debug logging and Sentry SDK debug logging are orthogonal — gate on the dedicated CYRUS_SENTRY_DEBUG env var so app debugging stays clean. The tracing-inheritance lines disappear automatically because the SDK gates them on debug being on. * fix(sentry): gate Logs forwarding on CYRUS_LOG_LEVEL (CYPACK-1142) Reported on PR: previously the console-output gate respected CYRUS_LOG_LEVEL but the Sentry Logs forward did not, so users running at the default INFO level still paid for every logger.debug(...) call in their Sentry quota. Fixed by moving forwardLog inside the level-gated branch for debug/info/warn. Two intentional asymmetries: - error: structured-log forwarding is gated by level (so SILENT drops it), but the Sentry Issue capture is NOT gated — silencing logs must not silence real failure alerts. The level controls verbosity, not visibility of bugs. - forwardToErrorReporter (Issue capture path) stays outside the gate for that reason. Tests cover INFO/WARN drop behaviour, DEBUG enabling all four levels, and SILENT still capturing errors as Issues while dropping the log stream. * fix(sentry): forward all logs unconditionally regardless of CYRUS_LOG_LEVEL (CYPACK-1142) Reverses the prior level-gating change per Payton's follow-up: CYRUS_LOG_LEVEL is for local terminal verbosity only; the structured Sentry Logs stream should be the always-on backbone so debug-level traces remain available in Sentry even when running at INFO in production. forwardLog is now called unconditionally on every Logger.debug/info/ warn/error call. Console output is still gated by level. Errors still additionally capture as Sentry Issues regardless of level (unchanged). * feat(sentry): scrub Logs, gate forwarding on team_id, lifecycle events (CYPACK-1142) Addresses three review gaps identified in PR #1163: 1. 
Sentry Logs are now scrubbed via a dedicated `beforeSendLog` hook — `beforeSend` only runs for Issues, so the structured-log stream was previously bypassing redaction. Adds `scrubSentryLog` mirroring the key/string token rules already applied to events. Also scrubs `event.breadcrumbs[]` (consoleIntegration captures every console.* line, which can contain headers/tokens). 2. Logs forwarding is now gated on `CYRUS_TEAM_ID` AND the absence of `CYRUS_SENTRY_DISABLED`. Issue capture remains gated only on `CYRUS_SENTRY_DISABLED`, so installs without tenant tagging still surface errors but don't ship the higher-volume log stream. Implemented as a new `logsEnabled` option on SentryErrorReporter that mirrors into `Sentry.init({ enableLogs })` and short-circuits `log()` (belt-and-braces). 3. Reworked which logs forward: - WARN/ERROR auto-forward (always). - debug/info no longer forward — too high volume to ship blanket. - New `logger.event(name, attributes?)` API on ILogger for major lifecycle events that should always reach Sentry Logs regardless of local CYRUS_LOG_LEVEL. Wired `event()` into ClaudeRunner (session_started, session_resumed, claude_session_id_assigned, message_emitted, session_completed, session_stopped, session_stop_requested, claude_query_options) and EdgeWorker (webhook_received). Each event carries identifier, claudeSessionId, team_id, component, and other contextual metadata already merged via the Logger context/global tag plumbing. SOLID notes: SentryErrorReporter still owns only SDK translation (SRP); ErrorReporter contract unchanged (LSP/DIP); new event method extends ILogger without altering existing methods (OCP); new LogEventAttributes type narrows to primitives only (ISP). 
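Key-based redaction at any depth, as applied by the scrubbing hooks in this commit series, might look like the sketch below. `SENSITIVE_KEY_PATTERNS` is a name used in the commits, but the pattern list here is a shortened assumption and `scrub` is a hypothetical helper; the token-shaped string matching that the commits also mention is deliberately left out.

```typescript
// Assumption: a shortened pattern list. Note there is no bare "session"
// pattern, so identifiers such as sessionId are left untouched.
const SENSITIVE_KEY_PATTERNS: RegExp[] = [
  /token/i,
  /secret/i,
  /authorization/i,
  /cookie/i,
  /password/i,
];

const REDACTED = "[redacted]";

// Recursively copies a value, replacing the value of any sensitive key.
// The input is never mutated.
function scrub(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(scrub);
  if (value !== null && typeof value === "object") {
    const out: Record<string, unknown> = {};
    for (const [key, inner] of Object.entries(value as Record<string, unknown>)) {
      const sensitive = SENSITIVE_KEY_PATTERNS.some((pattern) => pattern.test(key));
      out[key] = sensitive ? REDACTED : scrub(inner);
    }
    return out;
  }
  return value; // primitives pass through unchanged
}
```

The same function could be wired into both an event hook and a log hook, which is the gap the `beforeSendLog` change closes.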
* fix(sentry): stop redacting sessionId / claudeSessionId attributes (CYPACK-1142)

The "session" substring in SENSITIVE_KEY_PATTERNS was too broad: it matched our identifier attributes (sessionId, claudeSessionId) and redacted them in the Logs explorer, defeating the whole point of forwarding lifecycle events. Real session secrets are still caught by the more specific patterns: session_token (token), session_cookie (cookie), session_secret (secret). Plain identifiers now pass through. Added a regression test pinning both directions.

* fix(sentry): gate Issues on CYRUS_TEAM_ID alongside Logs (CYPACK-1142)

Per follow-up requirement: Issue capture should also be gated on CYRUS_TEAM_ID, not just Logs. Both surfaces now share a single gate, so installs without a tenant tag stay silent and the team's Sentry org isn't flooded with untenanted self-hosted noise we can't slice.

Resolution order in createErrorReporter is now:

1. CYRUS_SENTRY_DISABLED truthy → Noop
2. CYRUS_TEAM_ID unset → Noop
3. No DSN configured → Noop
4. Otherwise → SentryErrorReporter

Dropped the now-redundant logsEnabled plumbing from SentryErrorReporter: by the time the constructor runs both Issues and Logs are wanted, so enableLogs is unconditionally true.

Tests refactored: removed assertions for "Issues without team_id" behavior, added a regression for "no captureException when team_id is unset", added a regression for "Noop when team_id is unset". 94 CLI tests still pass.

* fix(sentry): redact sensitive env keys in claude_query_options before serialise (CYPACK-1142)

The query options spread includes process.env, which carries every PAT / OAuth token / webhook secret on the host. Sentry's server-side data scrubber matches token-shaped substrings inside the JSON string and replaces the *whole* attribute value with [Filtered], wiping the diagnostic payload entirely.
Now redact sensitive keys (token, secret, password, apikey, authorization, cookie, private_key, client_secret, refresh_token, access_token, bearer, dsn, webhook_secret, signing_secret) at serialisation time via the existing JSON.stringify replacer hook. Sentry no longer sees the raw secrets, so it stops marking the options field as filtered and the rest of the payload survives. Patterns mirror apps/cli/src/services/sentryScrubber.ts but live locally in claude-runner so this package keeps no upward dep on the CLI app. * fix(sentry): ship claude_query_options as flat primitive attributes (CYPACK-1142) Server-side Sentry data scrubbing was filtering the whole \`options\` attribute to \`[Filtered]\` regardless of how thoroughly we redacted sensitive substrings inside the JSON. Stuffing a long nested-JSON string under a single attribute key reliably trips at least one matcher (token-shaped substrings, length, key name patterns) — and when it does the *whole* payload disappears. Switch strategy: build a sanitized projection that drops everything secret-bearing or unbounded (full env values, MCP server inner config, prompt text, system-prompt append) and keep only the diagnostic surface — model, tool counts/previews, MCP server names, env key NAMES (not values), system prompt shape, presence flags for hooks/plugins/sandbox, etc. Then *flatten* the projection into one attribute per datum (\`cqo.model\`, \`cqo.allowedToolsCount\`, \`cqo.envKeyNamesPreview\`, …) so a per-key filter (if it ever fires) loses one attribute, not the whole payload. Local DEBUG console still logs the full untruncated options JSON so on-machine troubleshooting is unaffected — the projection only applies to what leaves the process. 
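The "sanitized projection, then flatten" strategy can be sketched as below. The attribute names `cqo.model`, `cqo.allowedToolsCount`, and `cqo.envKeyNamesPreview` are quoted in the commit message; `QueryOptionsLike`, `projectQueryOptions`, the other attribute names, and the preview length of five are assumptions for illustration.

```typescript
type Primitive = string | number | boolean;

// Hypothetical subset of the query options; the real object is larger.
interface QueryOptionsLike {
  model?: string;
  allowedTools?: string[];
  env?: Record<string, string | undefined>;
  mcpServers?: Record<string, unknown>;
}

// Builds one primitive attribute per datum. Env VALUES and MCP server inner
// config never leave this function; only names and counts do.
function projectQueryOptions(options: QueryOptionsLike): Record<string, Primitive> {
  const envKeys = Object.keys(options.env ?? {}).sort();
  const mcpNames = Object.keys(options.mcpServers ?? {}).sort();
  return {
    "cqo.model": options.model ?? "unknown",
    "cqo.allowedToolsCount": options.allowedTools?.length ?? 0,
    "cqo.envKeyCount": envKeys.length, // hypothetical attribute name
    "cqo.envKeyNamesPreview": envKeys.slice(0, 5).join(","),
    "cqo.mcpServerNames": mcpNames.join(","), // hypothetical attribute name
  };
}
```

Because every datum sits under its own key, a server-side filter that fires on one attribute drops that attribute alone rather than the whole payload.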
--------- Co-authored-by: Payton Webber <53197664+PaytonWebber@users.noreply.github.com> * feat: configurable mirror Claude session transcripts to hosted control plane (CYPACK-1121) (#1144) * feat: mirror Claude session transcripts to hosted control plane (CYPACK-1121) Introduce HttpSessionStore — a Claude Agent SDK SessionStore adapter that POSTs session transcript entries to the Cyrus hosted control plane, authenticated with the team-scoped CYRUS_API_KEY. Wire it through ClaudeRunnerConfig and construct it in EdgeWorker when both CYRUS_API_KEY and CYRUS_APP_URL are set so the SDK dual-writes local JSONL + remote transcripts, letting sessions survive ephemeral worktree teardowns. Vendor the 13-check behavioral conformance suite from the upstream SDK examples (ported from bun:test to vitest) and run it against HttpSessionStore with an in-process fake backend. All 19 conformance + transport tests pass; full claude-runner suite (105 tests) and edge-worker suite (586 tests) are green. * refactor(sessions): pass team id explicitly via CYRUS_TEAM_ID (CYPACK-1121) Previously the edge only sent Authorization: Bearer <CYRUS_API_KEY>; the server had to reverse-lookup the team by a SHA-256 hash of the key. The edge actually knows its own team id, so send it directly as X-Cyrus-Team-Id and skip the hash column entirely — O(1) primary-key lookup on the server instead. - HttpSessionStore now requires a teamId option; sends it as header on every request. buildRequestHeaders() is protected so alternate auth schemes can extend without rewriting the transport. - EdgeWorker gates the remote store on all three env vars being set (CYRUS_APP_URL, CYRUS_API_KEY, CYRUS_TEAM_ID). - Tests cover the new header + constructor validation. - All 106 claude-runner + 586 edge-worker tests pass; typecheck clean. 
* docs(sessions): link Linear ticket references in session-store comments (CYPACK-1121)

Add links to the SDK session-storage docs and the upstream reference-adapter examples directly in the file headers so future readers can jump to both sources without leaving the code.

* feat(sessions): add CYRUS_DISABLE_REMOTE_SESSION_STORE opt-out (CYPACK-1121)

Lets operators keep CYRUS_APP_URL/CYRUS_API_KEY/CYRUS_TEAM_ID set (other features depend on those) while suppressing the remote Claude session transcript mirror. When the env var is set to 1/true, EdgeWorker logs the opt-out and skips constructing HttpSessionStore so transcripts stay local.

---------

Co-authored-by: Connor Turland <1409121+Connoropolous@users.noreply.github.com>

* feat(edge-worker): block session stop when work is unshipped (CYPACK-1140) (#1161)

* feat(edge-worker): block session stop when work is unshipped (CYPACK-1140)

The previous Stop hook used `additionalContext` (not a valid Stop-hook output field per the SDK type) plus `continue: true`, which advised the agent but did not actually block the stop, so sessions ended without a PR even after making code changes.

Replace the no-op with a real guardrail that inspects the worktree at the session cwd:

- If `stop_hook_active` is set, allow the stop (prevents loops).
- If `git status --porcelain` shows uncommitted changes, or HEAD is ahead of `@{u}` (or `origin/HEAD` when no upstream is configured), return `decision: "block"` with a reason explaining what is unshipped and instructing the agent to commit, push, and open a PR.
- If the cwd is not a git repo or git is unavailable, return null and do not block.

The reason is fed back to the agent via the SDK's native blocking mechanism, so the next turn actually sees it. Sessions with no code changes (questions, research) stop normally.

Adds unit tests covering: non-git cwd, clean tree with upstream, uncommitted changes, and commits ahead of upstream.
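The guardrail's three rules can be separated into a pure decision function plus a small git reader, roughly as follows. The git commands and the `decision: "block"` shape follow the commit message; `WorktreeState`, `decideStop`, `inspectWorktree`, the reason wording, and treating an unresolvable base ref as zero commits ahead are assumptions of this sketch.

```typescript
import { execFileSync } from "node:child_process";

// Hypothetical summary of what the hook learns from git.
interface WorktreeState {
  uncommitted: boolean;
  commitsAhead: number;
}

type StopDecision = { decision: "block"; reason: string } | null;

// Pure decision logic: null means "allow the stop".
function decideStop(stopHookActive: boolean, state: WorktreeState | null): StopDecision {
  if (stopHookActive) return null; // already blocked once; avoid loops
  if (state === null) return null; // not a git repo, or git unavailable
  const problems: string[] = [];
  if (state.uncommitted) problems.push("uncommitted changes");
  if (state.commitsAhead > 0) problems.push(`${state.commitsAhead} unpushed commit(s)`);
  if (problems.length === 0) return null;
  return {
    decision: "block",
    reason: `Unshipped work (${problems.join(", ")}): commit, push, and open a PR before stopping.`,
  };
}

function git(cwd: string, args: string[]): string {
  return execFileSync("git", args, { cwd, encoding: "utf8", stdio: ["ignore", "pipe", "ignore"] });
}

// Counts commits on HEAD that the base does not have, trying the upstream
// first and origin/HEAD second. Assumption: no resolvable base counts as 0.
function countAhead(cwd: string): number {
  for (const base of ["@{u}", "origin/HEAD"]) {
    try {
      return Number.parseInt(git(cwd, ["rev-list", "--count", `${base}..HEAD`]).trim(), 10) || 0;
    } catch {
      // base ref not available; try the next candidate
    }
  }
  return 0;
}

// Returns null when cwd is not a git repo or git cannot be run.
function inspectWorktree(cwd: string): WorktreeState | null {
  let porcelain: string;
  try {
    porcelain = git(cwd, ["status", "--porcelain"]);
  } catch {
    return null;
  }
  return { uncommitted: porcelain.trim().length > 0, commitsAhead: countAhead(cwd) };
}
```

A hook handler would call `decideStop(input.stop_hook_active, inspectWorktree(cwd))` and return the result; keeping the decision pure makes the four cases listed in the commit straightforward to unit test.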
* docs: add PR link to CYPACK-1140 changelog entry * feat(edge-worker): ensure Cyrus PR marker is always present (CYPACK-1141) (#1162) * feat(edge-worker): ensure Cyrus PR marker is always present (CYPACK-1141) Adds a PostToolUse hook on Bash that, after gh pr create/edit, glab mr create/update/edit, or gt submit commands, idempotently appends <!-- generated-by-cyrus --> to the live PR/MR description if missing. This guarantees the GitHub/GitLab webhook handlers can identify Cyrus-authored PRs (so "Changes requested" events get forwarded back) even when the agent forgets to include the marker in the body it submits. Implemented with a Provider strategy (GitHub via gh, GitLab via glab) so new forges plug in without modifying the hook. * docs: add PR link to CYPACK-1141 changelog entry * fix: only call query.interrupt() on warm Claude sessions (CYPACK-1145) (#1165) * fix: only call query.interrupt() on warm Claude sessions (CYPACK-1145) Previously the stop-signal handler unconditionally invoked runner.interrupt() on the first stop, which called the Claude SDK's query.interrupt() even for non-warm sessions. The SDK aborts the in-flight request in that case and surfaces "Error: Request was aborted". Stop signals now branch on whether the runner reports isWarm(): - Non-warm sessions: immediate full stop on the first signal. - Warm sessions: interrupt on first stop, full terminate on a second stop within 10s (unchanged UX). ClaudeRunner.interrupt() also now defensively falls back to stop() when the runner is non-warm, so any stale callers can't reintroduce the abort error. * chore: add PR link to CYPACK-1145 changelog entry * feat(cursor-runner): switch from CLI spawn to @cursor/sdk (CYPACK-1149) (#1169) * feat(cursor-runner): switch from CLI spawn to @cursor/sdk (CYPACK-1149) Replace the cursor-agent CLI spawn-and-parse implementation with direct use of the new @cursor/sdk TypeScript SDK. 
Enforce tool permissions through .cursor/hooks.json (validated as the actual gate in headless mode) instead of .cursor/cli.json (validated as ignored by the SDK). MCP servers are now passed inline to Agent.create() rather than synced through .cursor/mcp.json, and the agent mcp list/enable preflight is dropped. Highlights: - New permissions.ts translates Cyrus tool patterns (Read/Bash/mcp__*) into the hook helper's pattern format (Read/Shell/Mcp/Tool). - New permission-check.mjs ships with the package and is copied into the worktree's .cursor/ directory at session start; it enforces allow/deny at preToolUse, beforeShellExecution, beforeReadFile, and beforeMCPExecution with failClosed: true. - cwd is passed as string[] to match the other runners. - SDK sandbox plumbing is wired through but the Agent.create call is commented out pending Cursor exposing configureSandboxPrereqs in the public SDK (bug filed; tracked in TODO). Verified: cyrus-cursor-runner build/typecheck pass with 25/25 tests; cyrus-edge-worker typecheck passes with 611/611 tests. * fix(cursor): default model to composer-2; map legacy gpt-5/auto to default The new @cursor/sdk enforces a strict model-id allowlist (default, composer-2, gpt-5.4, claude-sonnet-4-6, ...). Two issues surfaced during F1 validation: 1. CursorRunner kept the old CLI alias `gpt-5 -> auto`. The SDK rejects `auto` outright. Now both `gpt-5` and `auto` map to `default`, which is a real SDK id and lets the server resolve. 2. RunnerSelectionService defaulted cursor to `gpt-5`, which isn't in the SDK's accepted list either. Default now `composer-2` (Cursor's named default) and the runner picks up `cursorDefaultModel` / `cursorDefaultFallbackModel` from config. Adds matching schema fields to EdgeConfig and propagation through ConfigManager. F1 test drive doc included. Verified: F1 e2e ran a real Cursor session under composer-2, agent read several files and produced a working FixedWindow implementation + unit test in the worktree. 
Session completed (subtype: success). * fix(cursor): coalesce streaming text deltas into one assistant message The @cursor/sdk emits multiple `assistant` events per turn, each carrying a partial text delta. The runner was emitting one SDKAssistantMessage per event, which produced one Linear `thought` activity per token (e.g. "Expl" / "oring the codebase to" / " locate" / ...). Fix: buffer text from consecutive `assistant` events in the runner and flush — emitting one consolidated SDKAssistantMessage — when: - a `tool_use` block appears (in the same or a subsequent assistant event) - any non-assistant SDK event arrives (user, tool_call, thinking, status) - the run stream finalizes Verified via a new replay test (8 deltas across two turns coalesce to 2 messages with the expected concatenated text) and an F1 drive: previous run produced 241 activities (~100 fragmented thoughts); after the fix the same prompt produced 67 activities (3 coalesced thoughts). * chore: gitignore .claude/scheduled_tasks.lock harness runtime artifact * fix(cursor): lazy-import @cursor/sdk to unblock CI on Node 18 / sqlite-broken Nodes @cursor/sdk pulls in @connectrpc/connect-node -> undici@7.x and sqlite3@5.x as transitive deps. Both crash at module-evaluation time on common CI environments: - undici 7.x requires Node >=20.18.1 and references the global File at module init; on Node 18 it throws ReferenceError: File is not defined. - sqlite3@5.1.7 has no prebuilt binary for newer Node versions and crashes in bindings.js: Could not locate the bindings file. Both fire as soon as anything statically imports @cursor/sdk, which the CursorRunner module did at the top. That meant edge-worker tests which mock cyrus-cursor-runner via the workspace alias still loaded its real module graph and exploded — 26 of 50 test files unable to start across Node 18.x and 22.x runs. 
Move the import inside start(): const { Agent } = await import("@cursor/sdk"); TypeScript type imports are erased at runtime and stay at the top, so nothing else changes shape. Vitest's vi.mock("@cursor/sdk") still intercepts dynamic imports, so the cursor-runner unit tests keep passing unchanged. Verified: pnpm --filter cyrus-edge-worker test:run -> 611/611 pass. pnpm --filter cyrus-cursor-runner test:run -> 26/26 pass. * feat(cursor): record token usage from turn-ended deltas in result message The cursor-runner was emitting result messages with all-zero token usage because the SDK does not surface tokens through `run.stream()` events or `RunResult` — only via the `onDelta({ update })` callback's `turn-ended` update: { type: "turn-ended", usage?: { inputTokens, outputTokens, cacheReadTokens, cacheWriteTokens }} Wire `onDelta` on `agent.send()` and accumulate per-turn deltas across the run. Map the SDK's four counters into Cyrus's `SDKResultMessage.usage` shape: inputTokens -> input_tokens outputTokens -> output_tokens cacheReadTokens -> cache_read_input_tokens cacheWriteTokens -> cache_creation_input_tokens The SDK does not split ephemeral 1h vs 5m the way Anthropic does, so we report 0 in both buckets and put the full count in the parent counter. Cost (USD) is intentionally not reported: the SDK does not expose per-run cost in its public types — Cursor handles billing server-side. We leave `total_cost_usd: 0` for now. Test coverage: new test drives two `turn-ended` deltas plus a non-token delta and asserts the result message reports the correct accumulated totals. * test(cursor): use SDKResultMessage type guard instead of as-unknown cast * fix(cursor): bare Read/Write/Bash also emit path-level allow rules (CYPACK-1150) A real production session on CYPACK-1150 was denying every file read and shell command. 
The user's repo config had bare names like: "Read", "Write", "Edit", "Bash", "Glob", "Grep", "TodoWrite" In Claude SDK semantics, bare `Read` means "allow Read with no path restriction." But our translator only emitted `Tool(Read)`, which gates the SDK's `preToolUse` event but does NOT match the subsequent `beforeReadFile` event whose candidate is `Read(<path>)`. With nothing in the allow list of kind `Read(...)`, the helper denied the read with "no allow rule matched (event=beforeReadFile, candidates=[{kind:'Read',value:'README.md'}])". The agent narrated the deny back to the user verbatim: "blocked by a Cyrus hook (beforeReadFile: no allow rule matched for Read -> README.md)" Fix: bare path-bearing names now expand to BOTH the preToolUse gate AND the path/command-level gate: Read|Glob|Grep -> Tool(Read) + Read(**) Write|Edit|... -> Tool(Write) + Write(**) Bash|Shell -> Tool(Shell) + Shell(*) Existing system-root + workspace-sibling auto-deny still applies because the `Read(**)` allow is detected as a "broad" pattern. Verified via F1 drive: same prompt that previously failed ("read README.md and package.json") now completes successfully — both Read actions complete, no deny messages, session subtype=success. New regression test asserts the bare-name expansion. All 28 cursor-runner tests pass; typecheck clean. * fix(cursor): map MCP server name from transport for hook permission checks (CYPACK-1151) A real production session on CYPACK-1151 reported "Posting this to Linear was blocked by MCP hooks here". The agent was trying to call mcp__linear__save_comment with a properly configured allow rule (`Mcp(linear:*)`), but the hook denied it. Root cause: the SDK's beforeMCPExecution hook payload only carries the bare tool_name (e.g. "save_comment") and the underlying transport (`command` for stdio, `url` for http) — never the logical server name. 
The helper was only reading `tool_name`, so a candidate `Mcp(save_comment)` never matched a server-scoped pattern like `Mcp(linear:*)`. Verified via a learning test (test-mcp-hook-payload.mjs in the SDK sandbox): the captured stdin payload for beforeMCPExecution contains `{ tool_name, tool_input, command, ... }` with no server identifier. Fix: - buildCyrusPermissionsConfig now accepts the SDK-shaped mcpServers map and emits a `mcpServers: [{name, commandLine|url}]` lookup table into cyrus-permissions.json (alongside allow/deny). - permission-check.mjs reads the lookup and resolves the logical server from `payload.command` (stdio) or `payload.url` (http) before building the candidate. - For beforeMCPExecution we now emit two candidates: Mcp(<server>:<tool_name>) — when server lookup succeeds Mcp(<tool_name>) — always, as a fallback / for unscoped patterns Patterns like `Mcp(linear:*)` and `Mcp(linear:save_comment)` now match for the configured server. CursorRunner passes the mcpConfig (already mapped to SDK shape) into buildCyrusPermissionsConfig at session start. Tests: three new permission-check cases (server-scoped allow via command lookup, via url lookup, and deny when no server matches). All 31 cursor-runner tests pass; 611/611 edge-worker tests pass. F1 sanity session (Read README.md/package.json) still completes subtype=success. * feat(cursor): wire SDK 1.0.11 sandbox; translate Claude SandboxSettings to .cursor/sandbox.json @cursor/sdk@1.0.11 ships an auto-discoverable `cursorsandbox` helper via platform optional deps (`@cursor/sdk-<platform>-<arch>`). The previous ConfigurationError on macOS is gone — `local.sandboxOptions: { enabled: true }` now engages Apple Seatbelt / Linux Landlock as designed. Verified by running real sandboxed sessions: CURSOR_SANDBOX=seatbelt observed in agent shell env. 
Off-workspace home write blocked, /etc reads allowed (default policy), outbound network blocked unless allow-listed in .cursor/sandbox.json, workspace writes succeed. Implementation: - Bump @cursor/sdk to ^1.0.11. - Add .npmrc public-hoist-pattern for @cursor/sdk-* / @cursor/february-* so pnpm exposes the platform binary at <root>/node_modules/@cursor/... where the SDK's `resolvePlatformPackageBinary` walk-up search can find it. (Without this, pnpm strict mode keeps it under .pnpm/ and the SDK silently falls back to "sandboxing not supported".) - New `packages/cursor-runner/src/sandbox.ts`: - `buildCursorSandboxJson({workspace, sandboxSettings, egressCaCertPath, additionalReadwritePaths})` translates the Claude SDK SandboxSettings shape into the Cursor sandbox.json schema. Mapping: filesystem.allowWrite -> additionalReadwritePaths filesystem.allowRead -> additionalReadonlyPaths network.allowedDomains -> networkPolicy.allow network.deniedDomains -> networkPolicy.deny network.{httpProxyPort,socksProxyPort} -> + 127.0.0.1/::1/localhost to allow list egressCaCertPath -> + readonly path so child processes can read it Default policy is `workspace_readwrite` + `networkPolicy.default: "deny"`, mirroring Claude's "block by default, allow what's needed" model. - `buildSandboxEnv` returns the env vars to set on `process.env` so sandboxed shell tools inherit cert trust + proxy hints (NODE_EXTRA_CA_CERTS, SSL_CERT_FILE, GIT_SSL_CAINFO, REQUESTS_CA_BUNDLE, PIP_CERT, CURL_CA_BUNDLE, CARGO_HTTP_CAINFO, AWS_CA_BUNDLE, DENO_CERT, HTTP_PROXY/HTTPS_PROXY/ALL_PROXY). - `CursorRunnerConfig` gains `sandboxSettings?: CursorSandboxInput` (a structurally-compatible subset of Claude's SandboxSettings; defined locally to avoid a hard dep on cyrus-claude-runner) and `egressCaCertPath?: string`. Drops the deprecated `sandbox: "enabled"| "disabled"` string flag — no callers used it. 
- CursorRunner: - Sets `local.sandboxOptions: { enabled: <sandboxSettings.enabled> }` on Agent.create / Agent.resume. - At session start, writes `<workspace>/.cursor/sandbox.json` (with backup/restore symmetric to .cursor/hooks.json), and snapshots-and- sets the sandbox env vars on process.env. - At session end, removes the file (restoring any backup) and restores the prior env values. - RunnerConfigBuilder: drops the legacy CYRUS_SANDBOX env-var passthrough for cursor; instead forwards the same `sandboxSettings` and `egressCaCertPath` it already gives Claude. The cursor runner does the schema translation internally. Caveats (documented inline in sandbox.ts): - `filesystem.denyRead` / `denyWrite` from Claude's settings are accepted but not enforced — Cursor's `workspace_readwrite` policy doesn't expose per-path denies under the default profile. Use `.cursor/hooks.json` (the existing Cyrus permission-check helper) for fine-grained read blocking when needed. - Sandbox features Claude exposes that Cursor's sandbox.json doesn't: `network.allowAllUnixSockets`, `allowMachLookup`, `allowLocalBinding`. The default Cursor profile covers most of these implicitly. Tests: - New sandbox.test.ts: 10 unit tests covering filesystem/network mapping, proxy-port loopback injection, CA cert path, dedup of workspace, and empty-when-disabled. - New CursorRunner.test.ts cases: verifies sandbox.json is written when enabled, sandboxOptions.enabled flows through to Agent.create, and process.env is restored after the session ends. Verification: - pnpm --filter cyrus-cursor-runner test:run -> 43/43 pass - pnpm --filter cyrus-edge-worker test:run -> 611/611 pass - F1 drive with CYRUS_SANDBOX=1: sandbox policy installed at session start (allowReadwrite=4, networkAllow=3 incl. loopback), workspace reads completed, session subtype=success. - F1 drive with sandbox disabled: still completes subtype=success. 
- Live SDK learning tests confirm Apple Seatbelt engages on macOS, off-workspace writes blocked, network deny-default + allowlist works. * fix(cursor): defer MCP allow check from preToolUse to beforeMCPExecution (CYPACK-1154/1155) The previous MCP fix (commit ace17a7) added server-name lookup at beforeMCPExecution, but production sessions on CYPACK-1154 and CYPACK-1155 still showed "no allow rule matched MCP:get_issue" — and the agent's own narration revealed the deny came from preToolUse, not beforeMCPExecution. Why: confirmed via a learning test that captures both events for the same MCP call. The SDK fires: preToolUse: tool_name="MCP:get_issue" command=undefined url=undefined beforeMCPExecution: tool_name="get_issue" command="node /path/to/server.mjs" At preToolUse there is NO transport identifier, so we cannot resolve the logical server name (e.g. "linear") to evaluate `Mcp(linear:*)`. Our helper was emitting a candidate `Tool("MCP:get_issue")`, which naturally matches none of the standard `Tool(Read|Shell|Write)` allows nor `Mcp(linear:*)` — denied at the first hook. Fix: when preToolUse arrives with a tool_name starting with `MCP:`, emit no candidates, which falls through to "allow" in the helper. The subsequent beforeMCPExecution event has full server context and runs the existing server-scoped check (`Mcp(linear:save_comment)` / `Mcp(linear:*)`). Net effect: the actual permission gate for MCP tools is the second hook, which is the only place we can correctly evaluate it. Note on rollout: the artifacts under <worktree>/.cursor/ (hooks.json, cyrus-permission-check.mjs, cyrus-permissions.json) are rewritten fresh at every session start by CursorRunner.installPermissionsArtifacts. So once this commit is built and the Cyrus daemon process restarts to load the new dist, every new Cursor session picks up the fix automatically — no per-worktree cleanup needed. 
Test: permission-check.test.ts — new regression case asserting that preToolUse with tool_name="MCP:get_issue" returns allow when the user's allow list scopes via Mcp(linear:*). 44/44 cursor-runner tests pass; 611/611 edge-worker tests pass. * Bump @openai/codex-sdk to 0.125.x (CYPACK-1151) (#1171) * Bump @openai/codex-sdk to ^0.125.0 Aligns the Codex runner with OpenAI Codex CLI 0.125.x bundled by the SDK (including additive usage metadata such as reasoning output tokens). Closes Cyrus assessment for CYPACK-1151. Made-with: Cursor * Changelog: link PR #1171 for Codex SDK bump (CYPACK-1151) Made-with: Cursor * Fix Biome errors after merging main * CI: drop Node 18 from matrix * fix(cursor-runner): allow sqlite3 install script to run under pnpm 10 (CYPACK-1158) (#1174) * fix(cursor-runner): allow sqlite3 install script to run under pnpm 10 (CYPACK-1158) @cursor/sdk@1.0.11 pulls sqlite3@5.1.7 as a runtime dep. Its native node_sqlite3.node binding is fetched by an install lifecycle script, which pnpm 10 blocks by default. Without it, sqlite3 is "installed" but missing its .node binding, and Cursor sessions crash on first import with "Could not locate the bindings file". Adds sqlite3 to pnpm.onlyBuiltDependencies so the install script runs on pnpm install. * docs: add PR link to CYPACK-1158 changelog entry * fix(deps): override tar to >=7.5.11 to patch sqlite3 advisories (CYPACK-1159) (#1175) The @cursor/sdk → sqlite3 → tar@6.2.1 chain introduced in CYPACK-1149 was flagged for 6 high-severity path-traversal/hardlink/symlink CVEs. sqlite3@5.1.7 is the latest release and pins tar^6, so a root override is the only way to reach the patched transitive. 
* Update @anthropic-ai/claude-agent-sdk to v0.2.123 and @anthropic-ai/sdk to ^0.91.0 (CYPACK-1152) (#1172)

* chore(deps): update @anthropic-ai/claude-agent-sdk to v0.2.123 and @anthropic-ai/sdk to ^0.91.0 (CYPACK-1152)

Bumps @anthropic-ai/claude-agent-sdk from 0.2.117 to 0.2.123 across all packages and @anthropic-ai/sdk from ^0.90.0 to ^0.91.0. Removes LSP from availableTools in config.ts: LSP is no longer shipped in claude-agent-sdk v0.2.123. Updates the corresponding test fixtures.

* chore: add PR link to CHANGELOG for CYPACK-1152

* Prepare release v0.2.50 (#1176)

* Prepare release v0.2.51 (#1177)

* test: add learning tests for memory-pressure gate (CYPACK-1165)

Pin down behavior of the OOM-preflight gate and concurrency cap added in feat/oom-preflight-gate:

- memory-health: partial-threshold semantics, no-threshold no-op, free-vs-heap precedence, strict comparison boundaries, metrics snapshot on rejection
- memory-gate-schema: Zod constraints on MemoryGateConfigSchema and EdgeConfigSchema.maxConcurrentRunners
- runner-gate: gate invoked once per event, userMessage propagated verbatim, onWebhookEnd fires on reject

* fix: propagate memoryGate and maxConcurrentRunners through CLI + ConfigManager (CYPACK-1165)

The CLI builder in WorkerService.ts and the hot-reload merge in ConfigManager.ts both assembled EdgeWorkerConfig field-by-field. New EdgeConfig fields like memoryGate and maxConcurrentRunners were loaded from disk but silently dropped before reaching EdgeWorker, so the runner gate added in feat/oom-preflight-gate never fired in production.

detectGlobalConfigChanges had the same hand-maintained whitelist, so changing those fields at runtime wouldn't have triggered a hot-reload even after the propagation fix.

Refactored both call sites to spread the on-disk config first, then overlay runtime/handler fields and env-var overrides. New EdgeConfig keys now flow through structurally without code changes here.

- apps/cli/src/services/WorkerService.ts: spread edgeConfig at the top, layer caller/runtime fields, layer env precedence last.
- packages/edge-worker/src/config-merge.ts: extract pure mergeEdgeConfig and hasGlobalConfigChanges helpers; the latter computes the diff key set from the live objects via a runtime-only-keys denylist.
- packages/edge-worker/src/ConfigManager.ts: delegate to the helpers.
- packages/edge-worker/test/config-merge.test.ts: 14 regression tests pinning down propagation, legacy-alias resolution, and structural change detection.

* fix: post memory-pressure rejection as a response activity with Stop signal (CYPACK-1165)

When the runner gate rejects a new Linear session, the user-facing "Cyrus is at capacity / temporarily out of capacity" message was being emitted as a thought activity. Linear treats thoughts as intermediate updates, so the session stayed open after the rejection even though no runner was ever spawned to follow up.

Emit the rejection as a response activity with AgentActivitySignal.Stop so Linear treats it as the session's terminal message and closes the agent session.

* fix: drop Stop signal from memory-pressure rejection (CYPACK-1165)

Linear's API rejects signal=stop on response-type activities; that signal is only allowed on prompt activities. The Stop signal was added in the previous commit to terminate the agent session in the Linear UI, but it triggered:

    Invalid signal: "stop" is only allowed for prompt type activities.

Drop the signal. The response-type activity by itself is sufficient to render the rejection as the session's terminal message in Linear.

* refactor: collapse memoryGate to a single intuitive knob (CYPACK-1165)

The previous shape exposed four sub-fields:

    memoryGate: {
      enabled: true,
      maxRssPercent: 0.75,
      minAvailableMemoryMb: 500,
      maxHeapUsagePercent: 0.85
    }

Operators had to reason about three thresholds and a separate enable flag, and the absolute MB threshold required retuning per host size.

Replace with a single value (boolean | number):

    memoryGate: true    # enabled at default 85% pressure threshold
    memoryGate: 0.9     # enabled at custom threshold
    memoryGate: false   # disabled (also: omit entirely)

Internally, 'pressure' is the worst of three normalized dimensions: process RSS, V8 heap, and used system memory. The single percentage captures all three concerns from the legacy config and is portable across host sizes (no absolute-MB knob to retune). The rejection reason still names the dominant dimension (RSS / heap / system memory) so operator logs remain diagnosable.

- core/src/memory-health.ts: type MemoryGateConfig = boolean | number; export DEFAULT_MEMORY_PRESSURE_THRESHOLD = 0.85; collectMemoryMetrics computes systemUsedPercent and pressure.
- core/src/config-schemas.ts: MemoryGateConfigSchema is now z.union of boolean and a (0,1] number; legacy verbose object form is rejected.
- core/test/memory-health.test.ts + memory-gate-schema.test.ts: rewritten around the new shape (29 tests).
- edge-worker/test/config-merge.test.ts: migrated regression cases.
- CHANGELOG: rewritten unreleased entry to describe the new knob.
- core/schemas/*.json: regenerated.

---------

Co-authored-by: Connor Turland <1409121+Connoropolous@users.noreply.github.com>
Co-authored-by: cyrusagent <237105008+cyrusagent[bot]@users.noreply.github.com>
Co-authored-by: Cyrus Agent <agentclear@anthropic.com>
Co-authored-by: Payton Webber <53197664+PaytonWebber@users.noreply.github.com>
# Conflicts:
#	CHANGELOG.md
#	packages/edge-worker/src/ConfigManager.ts
… message When the memory-pressure gate rejects a new session, the user-facing message now discloses how many Cyrus sessions are already running on the host. This helps users see whether the pressure is coming from Cyrus's own workload or from elsewhere on the box.
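The running-session count mentioned above can be derived live from the runners themselves rather than from a separate counter. The following is a minimal sketch under stated assumptions: the `RunnerLike` interface, the helper names, and the message wording are hypothetical and are not taken from the PR's source; only `isRunning()` and the idea of swallowing its errors come from the commit notes.

```typescript
// Hypothetical minimal runner shape; only isRunning() is named in the PR.
interface RunnerLike {
  isRunning(): boolean;
}

// Count runners that report themselves as running. A runner whose
// isRunning() throws is not counted (the PR logs such errors at debug level).
function countActiveRunners(runners: Iterable<RunnerLike>): number {
  let active = 0;
  for (const runner of runners) {
    try {
      if (runner.isRunning()) active += 1;
    } catch {
      // Swallowed on purpose: a broken runner must not block the gate.
    }
  }
  return active;
}

// The cap applies only when a ceiling is configured.
function isAtCapacity(active: number, maxConcurrentRunners?: number): boolean {
  return maxConcurrentRunners !== undefined && active >= maxConcurrentRunners;
}

// Illustrative wording only; the real user-facing message may differ.
function formatRejection(active: number): string {
  const noun = active === 1 ? "session" : "sessions";
  return `Cyrus is temporarily out of capacity (${active} ${noun} already running on this host).`;
}
```

Because the count is recomputed on every check, a runner that exits without notifying anyone simply stops being counted on the next check.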
Summary
- `memoryGate` config knob: `true` enables it at the default 85% pressure threshold, a number in `(0, 1]` sets a custom threshold (e.g. `0.9` = 90%), and `false` (or omitting the field) disables it. Uses only `os.totalmem` / `os.freemem` / `process.memoryUsage` / `v8.getHeapStatistics`, so behaviour is identical on Linux and macOS.
- `maxConcurrentRunners` config rejects new work when already at capacity. Active count is derived live from `isRunning()` across Linear (`agentSessionManager.getAllAgentRunners()`) and chat (`chatSessionHandler.getAllRunners()`) runners, so there's no counter to drift.
- Gates are wired into all four runner-creation paths: Linear `agentSessionCreated` + `agentSessionPrompted`, GitHub webhook, GitLab webhook, and the Slack chat handler. Each platform posts via its native channel (Linear activity, GitHub PR comment, GitLab MR note, Slack thread reply).

Motivation
Cyrus previously had no visibility into host memory before accepting new sessions and no ceiling on concurrent runners, so under load the process could be SIGKILL'd by `systemd-oomd` or the kernel OOM killer with no warning to the user. This adds opt-in guardrails so Cyrus can refuse new work gracefully with a clear message instead of dying silently.

Config shape
    { "memoryGate": true, "maxConcurrentRunners": 5 }

`memoryGate` also accepts a number in `(0, 1]` to set a custom pressure threshold (e.g. `0.9`). Omitting either field preserves today's behaviour (no gate).

"Pressure" is the worst of three normalized dimensions: process RSS as a fraction of system memory, V8 heap usage as a fraction of the heap size limit, and system memory used as a fraction of total.
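To make the "worst of three dimensions" definition concrete, here is a minimal sketch of how such a pressure value could be computed from the four Node built-ins named in the Summary. The type and function names (`MemorySnapshot`, `computePressure`, `isUnderPressure`) are hypothetical, and two details are assumptions rather than facts from the PR: "system memory used" is taken as `(total - free) / total`, and the gate trips only when pressure strictly exceeds the threshold.

```typescript
import * as os from "node:os";
import * as v8 from "node:v8";

// Hypothetical snapshot shape; the PR's own metrics type may differ.
interface MemorySnapshot {
  rssBytes: number;
  heapUsedBytes: number;
  heapLimitBytes: number;
  totalBytes: number;
  freeBytes: number;
}

type Dimension = "RSS" | "heap" | "system memory";

const DEFAULT_THRESHOLD = 0.85; // default named in the PR description

// Read the live values using only Node built-ins.
function readSnapshot(): MemorySnapshot {
  const heap = v8.getHeapStatistics();
  return {
    rssBytes: process.memoryUsage().rss,
    heapUsedBytes: heap.used_heap_size,
    heapLimitBytes: heap.heap_size_limit,
    totalBytes: os.totalmem(),
    freeBytes: os.freemem(),
  };
}

// Pressure is the largest of the three normalized fractions.
function computePressure(s: MemorySnapshot): { pressure: number; dominant: Dimension } {
  const candidates: Array<[Dimension, number]> = [
    ["RSS", s.rssBytes / s.totalBytes],
    ["heap", s.heapUsedBytes / s.heapLimitBytes],
    ["system memory", (s.totalBytes - s.freeBytes) / s.totalBytes],
  ];
  const [dominant, pressure] = candidates.reduce((worst, cur) => (cur[1] > worst[1] ? cur : worst));
  return { pressure, dominant };
}

// true -> default threshold, number -> custom threshold, false/undefined -> gate off.
function isUnderPressure(
  setting: boolean | number | undefined,
  snapshot: MemorySnapshot = readSnapshot(),
): boolean {
  if (setting === undefined || setting === false) return false;
  const threshold = setting === true ? DEFAULT_THRESHOLD : setting;
  // Assumption: strict comparison; the real boundary behaviour may differ.
  return computePressure(snapshot).pressure > threshold;
}
```

Returning the dominant dimension alongside the number mirrors the PR's note that the rejection reason still names RSS, heap, or system memory for operator logs.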
Test plan
- Tests for `checkMemoryHealth` / `collectMemoryMetrics` / `formatMemoryPressureMessage` (injectable `MemorySources` pattern to sidestep ESM module mocking)
- Tests for the `memoryGate` and `maxConcurrentRunners` config fields
- `runner-gate.test.ts`: 7 tests covering the concurrency cap and memory gate wiring into `EdgeWorker`
- `cyrus-core` test suite (158 tests) passes
- `cyrus-edge-worker` test suite (632 tests) passes
- `pnpm typecheck` clean across all workspace projects
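
The injectable-sources idea in the first test-plan item can be illustrated with a small sketch. The interface below is a hypothetical stand-in: the PR names a `MemorySources` pattern but its fields are not shown here, so the two-method shape and the `systemUsedFraction` helper are assumptions for illustration only.

```typescript
import * as os from "node:os";

// Hypothetical shape; the PR's actual MemorySources type is not shown in this page.
interface MemorySourcesSketch {
  totalmem(): number;
  freemem(): number;
}

// Production default: delegate to the real Node built-ins.
const realSources: MemorySourcesSketch = {
  totalmem: () => os.totalmem(),
  freemem: () => os.freemem(),
};

// The function under test takes its sources as a parameter, so a test can
// pass plain objects instead of mocking the "node:os" ES module.
function systemUsedFraction(sources: MemorySourcesSketch = realSources): number {
  const total = sources.totalmem();
  return (total - sources.freemem()) / total;
}

// In a test, fixed numbers make the result deterministic.
const fakeSources: MemorySourcesSketch = {
  totalmem: () => 1000,
  freemem: () => 250,
};
const usedFraction = systemUsedFraction(fakeSources); // (1000 - 250) / 1000
```

The design trade-off is one extra optional parameter in exchange for tests that need no module-level mocking.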