CPRA - Concurrent Pulse-Remediation-Alerting System

Monitor millions of services concurrently with automated remediation and dynamic worker scaling.

CPRA is a high-performance infrastructure monitoring system designed for platform teams managing large-scale microservice architectures. Built on an Entity-Component-System (ECS) architecture and queueing-theory principles, CPRA handles 1,000,000+ concurrent health checks and automatically scales its worker pools to meet SLO targets.

Why CPRA?

Use CPRA when you need to:

Monitor 100,000+ concurrent services, containers, or endpoints
Automatically remediate failures without human intervention
Scale monitoring infrastructure dynamically based on load
Achieve sub-100ms P95 latency from detection to alerting
Minimize memory footprint (~100 bytes per monitor)

Key Features

🚀 Massive Scalability

Handles 1,000,000+ concurrent monitors on commodity hardware
Linear scaling with minimal overhead per monitor
Memory-efficient design: ~100 bytes per monitor

⚡ High Performance

10,000+ health checks per second per pipeline
P95 latency < 100ms from schedule to result processing
Batch processing and lock-free queues minimize overhead

🔄 Automated Remediation

Three independent pipelines:
1. Pulse: Health checking (HTTP, TCP, ICMP, custom scripts)
2. Intervention: Automated recovery (restart services, scale resources, run scripts)
3. Code: Alerting and notifications (email, SMS, webhooks, PagerDuty)

🧠 Intelligent Scaling

M/M/c queueing theory: Automatically calculates optimal worker count
Allen-Cunneen approximation: Handles real-world workload variability
SLO-driven sizing: Dynamically scales to meet latency targets

🏗️ Data-Oriented Architecture

Entity-Component-System (ECS) using mlange-42/ark
Cache-friendly memory layout for maximum performance
Minimal allocations and GC pressure

🔧 Production-Ready

Built-in pprof profiling for debugging
Graceful shutdown with context cancellation
Comprehensive logging with debug mode
Memory management with automatic GC triggering

Performance Characteristics

| Metric | Value |
| --- | --- |
| Max Concurrent Monitors | 1,000,000+ |
| Throughput | 10,000+ checks/sec/pipeline |
| Latency (P95) | < 100ms (configurable via SLO) |
| Memory per Monitor | ~100 bytes |
| Total Memory (1M monitors) | ~100 MB + worker pool overhead |
| Worker Scaling | Dynamic (M/M/c based) |

See Architecture Overview for detailed benchmarks and analysis.

Architecture

CPRA uses a three-pipeline architecture built on Entity-Component-System principles:

Three Independent Processing Pipelines

Pulse Pipeline: Executes health checks (HTTP requests, TCP connections, custom scripts)
Intervention Pipeline: Performs automated remediation when monitors fail
Code Pipeline: Sends alert notifications to incident management systems

Each pipeline operates independently with its own queue and dynamically-scaled worker pool, enabling:

Pipeline-specific tuning: Configure each pipeline separately
Fault isolation: One pipeline failure doesn't affect others
Independent scaling: Scale workers based on per-pipeline load

Queue and Worker Pool Architecture

Queue Implementations:

HybridQueue: Ring buffer + overflow slice for reliable FIFO processing
AdaptiveQueue: Auto-scaling ring buffer for variable load
WorkivaQueue: Lock-free ring buffer for ultra-low latency
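
To illustrate the "ring buffer + overflow slice" idea, here is a minimal single-goroutine sketch. It is not the actual HybridQueue: the type and method names are hypothetical, it is not safe for concurrent use, and the overflow policy shown is just one way to keep FIFO order.

```go
package main

import "fmt"

// hybridQueue stores items in a fixed ring and spills into a slice when the
// ring is full. Not safe for concurrent use.
type hybridQueue[T any] struct {
	ring     []T
	mask     uint64 // capacity-1; valid because capacity is a power of two
	head     uint64 // next slot to pop
	tail     uint64 // next slot to push
	overflow []T
}

func newHybridQueue[T any](capacity int) *hybridQueue[T] {
	if capacity <= 0 || capacity&(capacity-1) != 0 {
		panic("capacity must be a positive power of two")
	}
	return &hybridQueue[T]{ring: make([]T, capacity), mask: uint64(capacity - 1)}
}

// Len returns the number of queued items across ring and overflow.
func (q *hybridQueue[T]) Len() int { return int(q.tail-q.head) + len(q.overflow) }

// Push appends an item. Once the overflow holds anything, new items keep
// going there so that ring items always remain the oldest ones.
func (q *hybridQueue[T]) Push(v T) {
	if len(q.overflow) > 0 || q.tail-q.head == uint64(len(q.ring)) {
		q.overflow = append(q.overflow, v)
		return
	}
	q.ring[q.tail&q.mask] = v
	q.tail++
}

// Pop removes the oldest item: ring first, then overflow.
func (q *hybridQueue[T]) Pop() (T, bool) {
	var zero T
	if q.head != q.tail {
		v := q.ring[q.head&q.mask]
		q.ring[q.head&q.mask] = zero // drop the reference for the GC
		q.head++
		return v, true
	}
	if len(q.overflow) > 0 {
		v := q.overflow[0]
		q.overflow[0] = zero
		q.overflow = q.overflow[1:]
		return v, true
	}
	return zero, false
}

func main() {
	q := newHybridQueue[int](4)
	for i := 0; i < 6; i++ {
		q.Push(i) // the last two pushes land in the overflow slice
	}
	for v, ok := q.Pop(); ok; v, ok = q.Pop() {
		fmt.Println(v)
	}
}
```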

Dynamic Worker Pools:

Powered by panjf2000/ants goroutine pool
Automatic scaling using M/M/c queueing theory
Configurable min/max workers and SLO targets

For a comprehensive architecture explanation, see the Architecture Overview.

Quick Start

Option 1: Build and Run Locally

# Prerequisites: Go 1.25 or later
go version  # Should show go1.25 or higher

# Build from source
git clone https://github.com/ziad/cpra.git
cd cpra
go build .

# Run with example configuration
./cpra --yaml mock-servers/test_10k.yaml

Expected Output:

Starting CPRA Optimized Controller for 1M Monitors
Profiling server listening at http://localhost:6060/debug/pprof/
Loading monitors from mock-servers/test_10k.yaml...
Monitor loading completed in 1.2s
[INFO] Controller started successfully
[INFO] Pulse pipeline processing 10,000 monitors
[INFO] Worker pool scaled to 143 workers (target SLO: 100ms)

Installation

Prerequisites

Go 1.25 or later (download)
Docker (optional, for containerized deployment)

Building from Source

Clone the repository:

git clone https://github.com/ziad/cpra.git
cd cpra

Download dependencies:
```
go mod download
```
Build the application:
```
go build .
```
Verify installation:
```
./cpra --help
```

Docker Deployment

Build the Docker image:

docker build -f docker/Dockerfile -t cpra:latest .

Run the container:

docker run -it --rm \
  -v "$(pwd)/my-monitors.yaml:/app/monitors.yaml" \
  cpra:latest \
  ./cpra --yaml monitors.yaml

Configuration

Monitor Configuration (YAML)

Create a monitors.yaml file to define health checks:

monitors:
  - name: "my-service-health-check"
    pulse_check:
      type: http
      interval: 30s
      timeout: 5s
      max_failures: 3
      config:
        method: GET
        url: http://my-service.example.com/health
        retries: 2
    intervention:
      action: docker
      config:
        container: my-service-container
        action: restart
    codes:
      red:
        dispatch: true
        notify: pagerduty
        config:
          url: https://events.pagerduty.com/v2/enqueue
      yellow:
        dispatch: true
        notify: log
        config:
          file: /var/log/cpra-alerts.log

Generating Test Configurations:

Use mock-servers/generate_monitors.py to generate test configurations with any number of monitors.

Application Configuration

Configure CPRA behavior programmatically:

package main

import (
    "time"

    "cpra/internal/controller"
)

func main() {
    config := controller.DefaultConfig()

    // Debug mode
    config.Debug = true

    // Worker pool settings (applies to all three pipelines)
    config.WorkerConfig.MinWorkers = 10
    config.WorkerConfig.MaxWorkers = 500

    // Queue settings
    config.QueueCapacity = 131072  // Must be power of 2

    // Performance tuning
    config.BatchSize = 2000
    config.SizingServiceTime = 20 * time.Millisecond  // Average job duration
    config.SizingSLO = 100 * time.Millisecond         // Target latency
    config.SizingHeadroomPct = 0.15                   // 15% safety buffer

    ctrl := controller.NewController(config)
    // ... rest of initialization
}
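
The `QueueCapacity` comment above requires a power of two. Ring buffers commonly impose this so that an index can be wrapped with a bit mask instead of a modulo; whether CPRA's queues do exactly that is an assumption here. A minimal sketch of validating a capacity or rounding it up (the helper names are hypothetical, not part of CPRA's API):

```go
package main

import (
	"fmt"
	"math/bits"
)

// isPowerOfTwo reports whether n is a positive power of two.
func isPowerOfTwo(n int) bool {
	return n > 0 && n&(n-1) == 0
}

// nextPowerOfTwo returns the smallest power of two that is >= n (1 for n <= 1).
func nextPowerOfTwo(n int) int {
	if n <= 1 {
		return 1
	}
	return 1 << bits.Len(uint(n-1))
}

func main() {
	for _, want := range []int{65536, 100000, 131072} {
		fmt.Println(want, isPowerOfTwo(want), nextPowerOfTwo(want))
	}
	// With a power-of-two capacity, wrapping an index is a single AND.
	capacity := nextPowerOfTwo(100000)
	fmt.Println(uint64(capacity+5)&uint64(capacity-1) == 5)
}
```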

See the API Reference for complete configuration options.

Command-Line Options

./cpra [OPTIONS]

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| `--yaml` | string | `internal/loader/replicated_test.yaml` | Path to monitors YAML file |
| `--config` | string | - | Configuration file path (optional) |
| `--debug` | bool | `false` | Enable debug-level logging |
| `--pprof` | bool | `true` | Enable pprof profiling server |
| `--pprof.addr` | string | `localhost:6060` | Pprof server listen address |

Examples:

# Run with debug logging
./cpra --yaml monitors.yaml --debug

# Run with custom pprof port
./cpra --yaml monitors.yaml --pprof.addr localhost:8080

# Disable profiling
./cpra --yaml monitors.yaml --pprof=false
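
For readers extending the CLI, the options table could be expressed with Go's standard `flag` package roughly as follows. This is a sketch under the assumption that CPRA uses `flag`; the `options` struct and `parseOptions` function are hypothetical, while the flag names and defaults are taken from the table above.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// options mirrors the documented command-line flags.
type options struct {
	yamlPath   string
	configPath string
	debug      bool
	pprof      bool
	pprofAddr  string
}

// parseOptions parses args (without the program name) into options.
func parseOptions(args []string) (options, error) {
	var o options
	fs := flag.NewFlagSet("cpra", flag.ContinueOnError)
	fs.StringVar(&o.yamlPath, "yaml", "internal/loader/replicated_test.yaml", "path to monitors YAML file")
	fs.StringVar(&o.configPath, "config", "", "configuration file path (optional)")
	fs.BoolVar(&o.debug, "debug", false, "enable debug-level logging")
	fs.BoolVar(&o.pprof, "pprof", true, "enable pprof profiling server")
	fs.StringVar(&o.pprofAddr, "pprof.addr", "localhost:6060", "pprof server listen address")
	err := fs.Parse(args)
	return o, err
}

func main() {
	opts, err := parseOptions(os.Args[1:])
	if err != nil {
		os.Exit(2) // the flag package has already printed the error
	}
	fmt.Printf("%+v\n", opts)
}
```

With the `flag` package, boolean flags must be negated as `--pprof=false`; writing `--pprof false` leaves the flag enabled and treats `false` as a positional argument, which matches the form used in the example above.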

Documentation

Comprehensive Guides

Architecture Overview - System design, diagrams, and performance analysis
API Reference - Complete API documentation with function signatures
Types Reference - Data structures and component definitions
Quickstart Tutorial - Get started in 5-10 minutes
Common Tasks - How-to guides for typical operations

Additional Resources

Getting Started - Detailed setup and deployment guide

Troubleshooting

Common Issues

Issue: YAML file not found

Warning: YAML file monitors.yaml not found, starting without loading monitors

Solution: Verify the file path is correct. Use absolute paths or paths relative to where you run the binary:

./cpra --yaml "$(pwd)/monitors.yaml"

Issue: Build fails with Go version error

go.mod requires go >= 1.25

Solution: Upgrade Go to version 1.25 or later:

go version  # Check current version
# Download Go 1.25+ from https://go.dev/dl/

Issue: High memory usage

Solution: Check memory usage with pprof:

# While CPRA is running, access pprof
go tool pprof http://localhost:6060/debug/pprof/heap

# View top memory consumers
(pprof) top

If the worker pools or queues dominate the profile, reduce their sizes in the configuration:

config.WorkerConfig.MaxWorkers = 200  // Reduce max workers
config.QueueCapacity = 65536          // Reduce queue size

Issue: Worker pool not scaling

Solution: Enable debug logging to see scaling decisions:

./cpra --yaml monitors.yaml --debug

Check queueing theory parameters:

config.SizingServiceTime = 50 * time.Millisecond  // Increase if jobs take longer
config.SizingSLO = 200 * time.Millisecond         // Relax SLO if needed

Issue: Monitors not executing

Solution: Verify monitor configuration format and check logs:

./cpra --yaml monitors.yaml --debug 2>&1 | grep ERROR

Validate YAML syntax:

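# Parse the file with PyYAML (pip install pyyaml); a syntax error prints a traceback
python -c 'import sys, yaml; yaml.safe_load(open(sys.argv[1]))' monitors.yaml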

Getting Help

Documentation: Check the docs/ folder for detailed guides
Issues: Open an issue for bugs or feature requests
Discussions: Ask questions and share ideas in GitHub Discussions
Logs: Always provide logs when reporting issues (use --debug flag)

Contributing

We welcome contributions from the community! CPRA is an open-source project and we appreciate:

🐛 Bug reports and fixes
✨ Feature requests and implementations
📖 Documentation improvements
🧪 Test coverage enhancements
💡 Performance optimizations

Getting Started:

Look for issues labeled good first issue
Fork the repository and submit a pull request

Development Resources:

Architecture Overview - Understand the system design
API Reference - Function signatures and usage

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

CPRA is built on excellent open-source libraries:

mlange-42/ark - High-performance Entity-Component-System
panjf2000/ants - Goroutine pool with dynamic scaling
Workiva/go-datastructures - Lock-free data structures
uber-go/zap - Structured logging

Documentation • Architecture • Issues

Built with ❤️ for platform teams managing large-scale infrastructure
