Resource Monitoring

Alessio Rocchi edited this page Jan 29, 2026 · 1 revision

A guide to monitoring and controlling agent resource consumption with the Resource Exhaustion Service.


Overview

The Resource Exhaustion Service prevents runaway agents by:

  • Tracking resource usage - Files, API calls, tokens, subtasks
  • Progressive intervention - Warning → Pause → Terminate
  • Deliverable checkpoints - Require periodic progress markers
  • Automatic enforcement - Configurable thresholds with auto-pause

Resource Metrics Tracked

| Metric | Description |
| --- | --- |
| filesRead | Number of files read |
| filesWritten | Number of files created |
| filesModified | Number of files modified |
| apiCallsCount | Total API calls made |
| subtasksSpawned | Number of subtasks created |
| tokensConsumed | Total tokens used |
| timeWithoutDeliverable | Duration since last deliverable |

Phase Progression

Agents progress through phases based on resource consumption:

stateDiagram-v2
    [*] --> Normal: Agent started
    Normal --> Warning: Approaching threshold
    Warning --> Normal: Deliverable recorded
    Warning --> Intervention: Threshold exceeded
    Intervention --> Warning: Agent resumed
    Intervention --> Termination: No response
    Termination --> [*]
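The diagram can also be read as a transition table. The following is a minimal sketch of that reading, not the service's implementation: the phase names match the values returned by `evaluateAgent`, while `ALLOWED_TRANSITIONS` and `canTransition` are hypothetical names introduced here for illustration.

```typescript
// Phase names as returned by evaluateAgent() on this page.
type Phase = 'normal' | 'warning' | 'intervention' | 'termination';

// Hypothetical lookup table mirroring the arrows in the state diagram.
const ALLOWED_TRANSITIONS: Record<Phase, readonly Phase[]> = {
  normal: ['warning'],                      // approaching threshold
  warning: ['normal', 'intervention'],      // deliverable recorded / threshold exceeded
  intervention: ['warning', 'termination'], // agent resumed / no response
  termination: [],                          // terminal state
};

// Returns true when the diagram contains an arrow from `from` to `to`.
function canTransition(from: Phase, to: Phase): boolean {
  return ALLOWED_TRANSITIONS[from].includes(to);
}
```

Note that the diagram has no direct arrow from Normal to Intervention or from Intervention back to Normal; an agent passes through Warning in both directions.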

Phase Descriptions

| Phase | Description | Actions |
| --- | --- | --- |
| Normal | Operating within limits | No action |
| Warning | Approaching limits (default 80%) | Log warning, notify |
| Intervention | Exceeded limits | Pause agent, require approval |
| Termination | Unrecoverable | Force stop agent |
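To make the threshold arithmetic concrete, here is a minimal sketch of how one metric might map to a phase. It assumes the warning phase begins at `limit * warningThresholdPercent` and intervention begins once the limit is exceeded; the exact comparison operators the service uses are not documented here, and `classifyUsage` and `worstPhase` are hypothetical helpers. Termination is left out because it depends on how the agent responds, not on a threshold.

```typescript
type Phase = 'normal' | 'warning' | 'intervention';

// Hypothetical helper: classify a single metric against its limit.
function classifyUsage(used: number, limit: number, warningFraction = 0.8): Phase {
  if (used > limit) return 'intervention';               // limit exceeded
  if (used >= limit * warningFraction) return 'warning'; // approaching limit
  return 'normal';
}

// Hypothetical helper: the overall phase is the most severe phase of any metric.
function worstPhase(phases: Phase[]): Phase {
  if (phases.includes('intervention')) return 'intervention';
  if (phases.includes('warning')) return 'warning';
  return 'normal';
}

// With a limit of 50 API calls and a 0.8 fraction, the warning phase starts at 40 calls.
const apiPhase = classifyUsage(40, 50);
const tokenPhase = classifyUsage(45000, 100000);
const overall = worstPhase([apiPhase, tokenPhase]);
```

Under these assumptions, a single metric in the warning range is enough to put the whole agent in the warning phase.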

Configuration

interface ResourceExhaustionConfig {
  enabled: boolean;
  warningThresholdPercent: number;  // Fraction, not percent. Default: 0.8 (80%)
  checkIntervalMs: number;          // Default: 60000 (1 minute)
  pauseOnIntervention: boolean;     // Default: true
  autoTerminate: boolean;           // Default: false
  thresholds: ResourceThresholds;
}

interface ResourceThresholds {
  maxFilesAccessed: number;           // Default: 100
  maxApiCalls: number;                // Default: 50
  maxSubtasksSpawned: number;         // Default: 20
  maxTokensConsumed: number;          // Default: 100000
  maxTimeWithoutDeliverableMs: number; // Default: 300000 (5 min)
}

Example Configuration

{
  "resourceExhaustion": {
    "enabled": true,
    "warningThresholdPercent": 0.8,
    "checkIntervalMs": 60000,
    "pauseOnIntervention": true,
    "autoTerminate": false,
    "thresholds": {
      "maxFilesAccessed": 100,
      "maxApiCalls": 50,
      "maxSubtasksSpawned": 20,
      "maxTokensConsumed": 100000,
      "maxTimeWithoutDeliverableMs": 300000
    }
  }
}
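Most deployments override only a few values. The sketch below shows one way to merge partial overrides onto the documented defaults. `DEFAULT_CONFIG`, `ConfigOverrides`, and `resolveConfig` are hypothetical names, the default for `enabled` is an assumption (this page does not state it), and the range check on `warningThresholdPercent` is an added safeguard, not documented service behavior.

```typescript
interface ResourceThresholds {
  maxFilesAccessed: number;
  maxApiCalls: number;
  maxSubtasksSpawned: number;
  maxTokensConsumed: number;
  maxTimeWithoutDeliverableMs: number;
}

interface ResourceExhaustionConfig {
  enabled: boolean;
  warningThresholdPercent: number;
  checkIntervalMs: number;
  pauseOnIntervention: boolean;
  autoTerminate: boolean;
  thresholds: ResourceThresholds;
}

// Defaults taken from the interface comments on this page.
const DEFAULT_CONFIG: ResourceExhaustionConfig = {
  enabled: true, // assumption: the default for `enabled` is not documented
  warningThresholdPercent: 0.8,
  checkIntervalMs: 60000,
  pauseOnIntervention: true,
  autoTerminate: false,
  thresholds: {
    maxFilesAccessed: 100,
    maxApiCalls: 50,
    maxSubtasksSpawned: 20,
    maxTokensConsumed: 100000,
    maxTimeWithoutDeliverableMs: 300000,
  },
};

type ConfigOverrides = Partial<Omit<ResourceExhaustionConfig, 'thresholds'>> & {
  thresholds?: Partial<ResourceThresholds>;
};

// Merge overrides onto the defaults; thresholds are merged key by key.
function resolveConfig(overrides: ConfigOverrides = {}): ResourceExhaustionConfig {
  const merged: ResourceExhaustionConfig = {
    ...DEFAULT_CONFIG,
    ...overrides,
    thresholds: { ...DEFAULT_CONFIG.thresholds, ...overrides.thresholds },
  };
  // Guard against passing 80 where 0.8 is intended.
  if (merged.warningThresholdPercent <= 0 || merged.warningThresholdPercent > 1) {
    throw new RangeError('warningThresholdPercent must be a fraction in (0, 1]');
  }
  return merged;
}
```

For example, `resolveConfig({ thresholds: { maxApiCalls: 20 } })` keeps every other default and tightens only the API call limit.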

Deliverable Checkpoints

Deliverables are progress markers that indicate an agent is making meaningful progress, not just consuming resources.

Deliverable Types

| Type | Description |
| --- | --- |
| code_commit | Code committed to repository |
| test_passed | Tests passing |
| review_complete | Code review finished |
| documentation | Documentation produced |
| analysis_report | Analysis or report generated |
| deployment | Deployment completed |
| other | Custom deliverable type |

Recording Deliverables

Agents (or orchestrators) should record deliverables periodically:

import { getResourceExhaustionService } from '@blackms/aistack';

const resourceService = getResourceExhaustionService(store, config);

// Record a deliverable
const checkpoint = resourceService.recordDeliverable(
  agentId,
  'code_commit',
  'Implemented user authentication module',
  ['src/auth/login.ts', 'src/auth/jwt.ts']
);

Recording a deliverable:

  1. Creates a checkpoint in the database
  2. Updates the agent's lastDeliverableAt timestamp
  3. Returns the agent from the warning phase to the normal phase
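The three steps above can be sketched with an in-memory stand-in for the database-backed service. This is an illustration under stated assumptions, not the library's code: `InMemoryDeliverableLog`, the `Checkpoint` field names, and `setPhase` are hypothetical. Following the state diagram, the sketch resets only agents in the warning phase and leaves an agent in intervention untouched.

```typescript
type Phase = 'normal' | 'warning' | 'intervention' | 'termination';
type DeliverableType =
  | 'code_commit' | 'test_passed' | 'review_complete'
  | 'documentation' | 'analysis_report' | 'deployment' | 'other';

// Hypothetical checkpoint shape; the real record may differ.
interface Checkpoint {
  agentId: string;
  type: DeliverableType;
  description: string;
  artifacts: string[];
  createdAt: Date;
}

interface AgentState {
  phase: Phase;
  lastDeliverableAt: Date | null;
}

class InMemoryDeliverableLog {
  readonly checkpoints: Checkpoint[] = [];
  private readonly agents = new Map<string, AgentState>();

  // Look up an agent's state, creating it in the normal phase on first use.
  getState(agentId: string): AgentState {
    let state = this.agents.get(agentId);
    if (!state) {
      state = { phase: 'normal', lastDeliverableAt: null };
      this.agents.set(agentId, state);
    }
    return state;
  }

  setPhase(agentId: string, phase: Phase): void {
    this.getState(agentId).phase = phase;
  }

  recordDeliverable(
    agentId: string,
    type: DeliverableType,
    description: string,
    artifacts: string[] = [],
  ): Checkpoint {
    const state = this.getState(agentId);
    const checkpoint: Checkpoint = { agentId, type, description, artifacts, createdAt: new Date() };
    this.checkpoints.push(checkpoint);                     // 1. store the checkpoint
    state.lastDeliverableAt = checkpoint.createdAt;        // 2. update the timestamp
    if (state.phase === 'warning') state.phase = 'normal'; // 3. warning -> normal
    return checkpoint;
  }
}
```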

Programmatic API

Initialize Agent Tracking

import { getResourceExhaustionService } from '@blackms/aistack';

const resourceService = getResourceExhaustionService(store, config);

// Start tracking a new agent
const metrics = resourceService.initializeAgent(agentId, 'coder');

Record Operations

// Record file operations
resourceService.recordFileOperation(agentId, 'read');
resourceService.recordFileOperation(agentId, 'write');
resourceService.recordFileOperation(agentId, 'modify');

// Record API calls
resourceService.recordApiCall(agentId, 1500); // with token count

// Record subtask spawning
resourceService.recordSubtaskSpawn(agentId);
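Calling the recording methods by hand at every call site is easy to forget. One option is a small wrapper that records usage whenever a call completes. The sketch below is hypothetical: `ApiCallRecorder` describes only the `recordApiCall(agentId, tokens)` shape used above, and `trackedApiCall` and the `tokensUsed` field are assumptions about the calling code, not part of the library.

```typescript
// Minimal shape of the service that this helper depends on.
interface ApiCallRecorder {
  recordApiCall(agentId: string, tokens?: number): void;
}

// Hypothetical wrapper: run an API call, then record it with its token count.
async function trackedApiCall<T extends { tokensUsed: number }>(
  recorder: ApiCallRecorder,
  agentId: string,
  call: () => Promise<T>,
): Promise<T> {
  const result = await call();
  // Recorded only after the call succeeds; a failed call is not counted here.
  recorder.recordApiCall(agentId, result.tokensUsed);
  return result;
}
```

Whether failed calls should also count against the budget is a design choice; this sketch leaves them out because no token count is available when the call throws.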

Check Agent Status

// Get current metrics
const metrics = resourceService.getAgentMetrics(agentId);
console.log(metrics);
// {
//   agentId: 'uuid',
//   filesRead: 15,
//   filesWritten: 3,
//   filesModified: 8,
//   apiCallsCount: 12,
//   subtasksSpawned: 2,
//   tokensConsumed: 45000,
//   phase: 'normal',
//   lastDeliverableAt: Date,
//   ...
// }

// Evaluate current phase
const phase = resourceService.evaluateAgent(agentId);
// Returns: 'normal' | 'warning' | 'intervention' | 'termination'
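An orchestrator that wants to react to phase changes, instead of reading the phase on demand, could poll `evaluateAgent` and compare results. The following is a sketch of that idea; `createPhaseWatcher` is a hypothetical helper, and the polling interval in the usage comment is only an example.

```typescript
type Phase = 'normal' | 'warning' | 'intervention' | 'termination';

// Hypothetical helper: returns a tick() function that calls onChange
// whenever the evaluated phase differs from the previous tick.
function createPhaseWatcher(
  evaluate: () => Phase,
  onChange: (previous: Phase, current: Phase) => void,
): () => Phase {
  let previous: Phase = 'normal'; // agents start in the normal phase
  return () => {
    const current = evaluate();
    if (current !== previous) {
      onChange(previous, current);
      previous = current;
    }
    return current;
  };
}

// Usage sketch (resourceService and agentId as in the examples above):
// const tick = createPhaseWatcher(
//   () => resourceService.evaluateAgent(agentId),
//   (from, to) => console.warn(`Agent ${agentId}: ${from} -> ${to}`),
// );
// const timer = setInterval(tick, 60000);
```

Keeping the timer outside the helper makes the change detection easy to test without waiting on real time.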

Manual Intervention

// Pause an agent
await resourceService.pauseAgent(agentId, 'Manual review required');

// Check if paused
const isPaused = resourceService.isAgentPaused(agentId);

// Resume agent
resourceService.resumeAgent(agentId);

// Terminate agent
resourceService.terminateAgent(agentId, 'Exceeded all limits');

Get Metrics Summary

const summary = resourceService.getResourceMetrics(new Date('2026-01-01'));
// {
//   totalAgentsTracked: 5,
//   agentsByPhase: { normal: 3, warning: 1, intervention: 1, termination: 0 },
//   pausedAgents: 1,
//   totalWarnings: 15,
//   totalInterventions: 3,
//   totalTerminations: 0,
//   recentEvents: [...]
// }
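The `agentsByPhase` field is a tally of per-agent phases. As a rough illustration of how such a tally relates to individual agent metrics, the sketch below counts phases over a list of snapshots. `AgentSnapshot` and `countByPhase` are hypothetical; the service computes its summary internally.

```typescript
type Phase = 'normal' | 'warning' | 'intervention' | 'termination';

// Hypothetical reduced view of an agent's metrics.
interface AgentSnapshot {
  agentId: string;
  phase: Phase;
}

// Count agents per phase, reporting zero for phases with no agents.
function countByPhase(agents: AgentSnapshot[]): Record<Phase, number> {
  const counts: Record<Phase, number> = {
    normal: 0,
    warning: 0,
    intervention: 0,
    termination: 0,
  };
  for (const agent of agents) {
    counts[agent.phase] += 1;
  }
  return counts;
}
```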

Integration with system_health

The system_health check reports the Resource Exhaustion Service under checks.resourceExhaustion:

{
  "status": "healthy",
  "checks": {
    "database": true,
    "vectorSearch": true,
    "github": true,
    "resourceExhaustion": {
      "enabled": true,
      "agentsTracked": 5,
      "agentsByPhase": {
        "normal": 3,
        "warning": 1,
        "intervention": 1
      },
      "pausedAgents": 1
    }
  }
}

Prometheus Metrics

When Prometheus metrics are enabled, the service exposes:

| Metric | Type | Description |
| --- | --- | --- |
| agent_files_accessed | Histogram | Files accessed per agent |
| agent_api_calls | Histogram | API calls per agent |
| agent_tokens_consumed | Histogram | Tokens consumed per agent |
| agents_paused_current | Gauge | Currently paused agents |
| resource_exhaustion_warnings_total | Counter | Total warnings issued |
| resource_exhaustion_interventions_total | Counter | Total interventions |
| resource_exhaustion_terminations_total | Counter | Total terminations |

Best Practices

Set Appropriate Thresholds

// For exploratory/research agents - higher limits
{
  "thresholds": {
    "maxFilesAccessed": 500,
    "maxApiCalls": 100,
    "maxTimeWithoutDeliverableMs": 600000  // 10 minutes
  }
}

// For production/deployment agents - stricter limits
{
  "thresholds": {
    "maxFilesAccessed": 50,
    "maxApiCalls": 20,
    "maxSubtasksSpawned": 5,
    "maxTimeWithoutDeliverableMs": 180000  // 3 minutes
  }
}

Record Deliverables Proactively

// After completing meaningful work, record a deliverable
if (testsPassed) {
  resourceService.recordDeliverable(
    agentId,
    'test_passed',
    `All ${testCount} tests passing`
  );
}

// This resets the "time without deliverable" timer
// and transitions warning → normal

Monitor Warning Phase

const metrics = resourceService.getAgentMetrics(agentId);

if (metrics.phase === 'warning') {
  // Agent approaching limits
  // Consider: completing current task, recording deliverable, or pausing
  console.warn(`Agent ${agentId} in warning phase`);
}

Handle Paused Agents

// Check if agent is paused before assigning work
if (resourceService.isAgentPaused(agentId)) {
  // Either wait for resume or use different agent
  const resumed = await resourceService.waitForResume(agentId);
  if (!resumed) {
    // Agent was terminated, handle accordingly
  }
}
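The example above relies on `waitForResume` resolving to `false` when the agent is terminated instead of resumed. To illustrate that contract, here is a minimal promise-based pause gate. It is a sketch of one possible mechanism, not the service's implementation, and `PauseGate` is a hypothetical class.

```typescript
// Hypothetical pause gate: waitForResume() resolves to true on resume
// and to false on termination.
class PauseGate {
  private paused = false;
  private terminated = false;
  private waiters: Array<(resumed: boolean) => void> = [];

  pause(): void {
    if (!this.terminated) this.paused = true;
  }

  isPaused(): boolean {
    return this.paused;
  }

  waitForResume(): Promise<boolean> {
    if (this.terminated) return Promise.resolve(false);
    if (!this.paused) return Promise.resolve(true);
    // Park the caller until resume() or terminate() releases it.
    return new Promise<boolean>((resolve) => {
      this.waiters.push(resolve);
    });
  }

  resume(): void {
    this.release(true);
  }

  terminate(): void {
    this.terminated = true;
    this.release(false);
  }

  // Clear the paused flag and settle every pending waiter with the outcome.
  private release(resumed: boolean): void {
    this.paused = false;
    const pending = this.waiters;
    this.waiters = [];
    for (const resolve of pending) resolve(resumed);
  }
}
```

A production version would likely add a timeout so callers do not wait indefinitely on an agent that is never resumed or terminated.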

Troubleshooting

Agent Stuck in Warning

Problem: Agent keeps hitting warning threshold

Solutions:

  1. Record deliverables more frequently
  2. Increase threshold limits
  3. Break task into smaller subtasks

Intervention Too Aggressive

Problem: Agents getting paused too often

Solutions:

  1. Increase warningThresholdPercent (e.g., 0.9) so warnings start later
  2. Increase the absolute thresholds
  3. Reduce checkIntervalMs so checks run more often, for more gradual detection

Missing Metrics

Problem: Metrics not being recorded

Solutions:

  1. Ensure enabled is set to true in the config
  2. Call initializeAgent() when the agent starts
  3. Verify the service has been started with start()
