ziad-hsn/cpra

CPRA - Concurrent Pulse-Remediation-Alerting System


Monitor millions of services concurrently with automated remediation and dynamic worker scaling.

CPRA is a high-performance infrastructure monitoring system designed for platform teams managing large-scale microservice architectures. Built on Entity-Component-System (ECS) architecture and queueing theory principles, CPRA handles 1,000,000+ concurrent health checks with automatic worker pool scaling to meet SLO targets.


Why CPRA?

Use CPRA when you need to:

  • Monitor 100,000+ concurrent services, containers, or endpoints
  • Automatically remediate failures without human intervention
  • Scale monitoring infrastructure dynamically based on load
  • Achieve sub-100ms P95 latency from detection to alerting
  • Minimize memory footprint (~100 bytes per monitor)

Key Features

🚀 Massive Scalability

  • Handles 1,000,000+ concurrent monitors on commodity hardware
  • Linear scaling with minimal overhead per monitor
  • Memory-efficient design: ~100 bytes per monitor

⚡ High Performance

  • 10,000+ health checks per second per pipeline
  • P95 latency < 100ms from schedule to result processing
  • Batch processing and lock-free queues minimize overhead

🔄 Automated Remediation

  • Three independent pipelines:
    1. Pulse: Health checking (HTTP, TCP, ICMP, custom scripts)
    2. Intervention: Automated recovery (restart services, scale resources, run scripts)
    3. Code: Alerting and notifications (email, SMS, webhooks, PagerDuty)

🧠 Intelligent Scaling

  • M/M/c queueing theory: Automatically calculates optimal worker count
  • Allen-Cunneen approximation: Handles real-world workload variability
  • SLO-driven sizing: Dynamically scales to meet latency targets
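For reference, the standard textbook forms of the two results named above are given here. How CPRA applies them internally is not stated in this README, so treat this as background rather than a description of the implementation. The symbols are the usual ones and are assumptions of this note: arrival rate \lambda, per-worker service rate \mu, worker count c, and the coefficients of variation C_a and C_s of inter-arrival and service times.

```latex
% Offered load and utilization (stability requires \rho < 1)
a = \frac{\lambda}{\mu}, \qquad \rho = \frac{a}{c} < 1

% Erlang C: probability that an arriving job has to wait
C(c, a) = \frac{\dfrac{a^{c}}{c!}\,\dfrac{1}{1-\rho}}
               {\displaystyle\sum_{k=0}^{c-1}\frac{a^{k}}{k!} + \frac{a^{c}}{c!}\,\frac{1}{1-\rho}}

% Mean queueing delay for M/M/c
W_q^{M/M/c} = \frac{C(c, a)}{c\mu - \lambda}

% Allen-Cunneen approximation for general arrival and service distributions
W_q^{G/G/c} \approx W_q^{M/M/c} \cdot \frac{C_a^{2} + C_s^{2}}{2}
```

When both coefficients of variation equal 1 (exponential arrivals and service), the correction factor is 1 and the approximation reduces to the M/M/c result.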

🏗️ Data-Oriented Architecture

  • Entity-Component-System (ECS) using mlange-42/ark
  • Cache-friendly memory layout for maximum performance
  • Minimal allocations and GC pressure

🔧 Production-Ready

  • Built-in pprof profiling for debugging
  • Graceful shutdown with context cancellation
  • Comprehensive logging with debug mode
  • Memory management with automatic GC triggering

Performance Characteristics

| Metric | Value |
|---|---|
| Max Concurrent Monitors | 1,000,000+ |
| Throughput | 10,000+ checks/sec/pipeline |
| Latency (P95) | < 100ms (configurable via SLO) |
| Memory per Monitor | ~100 bytes |
| Total Memory (1M monitors) | ~100 MB + worker pool overhead |
| Worker Scaling | Dynamic (M/M/c based) |

See Architecture Overview for detailed benchmarks and analysis.


Architecture

CPRA uses a three-pipeline architecture built on Entity-Component-System principles:

[Figure: ECS Architecture]

Three Independent Processing Pipelines

[Figure: Pipeline Flow]

  1. Pulse Pipeline: Executes health checks (HTTP requests, TCP connections, custom scripts)
  2. Intervention Pipeline: Performs automated remediation when monitors fail
  3. Code Pipeline: Sends alert notifications to incident management systems

Each pipeline operates independently with its own queue and dynamically-scaled worker pool, enabling:

  • Pipeline-specific tuning: Configure each pipeline separately
  • Fault isolation: One pipeline failure doesn't affect others
  • Independent scaling: Scale workers based on per-pipeline load
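The pipeline separation described above can be sketched with one queue and one worker group per stage. This is an illustrative model only: the `Job` type, the `startStage` and `run` helpers, the worker counts, and the rule that every intervention also raises an alert are assumptions made for the example, not CPRA's actual types or policy.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// Job is a hypothetical unit of work: a monitor and its last pulse result.
type Job struct {
	Monitor string
	Healthy bool
}

// startStage launches `workers` goroutines that drain this stage's own queue.
// handle returns true when the job should be forwarded to the next stage.
func startStage(workers int, in <-chan Job, next chan<- Job, handle func(Job) bool) *sync.WaitGroup {
	wg := &sync.WaitGroup{}
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := range in {
				if handle(j) && next != nil {
					next <- j
				}
			}
		}()
	}
	return wg
}

// run pushes jobs through pulse -> intervention -> code and returns how many
// interventions ran and how many alerts were sent.
func run(jobs []Job) (interventions, alerts int64) {
	// Each stage has its own buffered queue, sized so sends never block here.
	pulseQ := make(chan Job, len(jobs))
	interventionQ := make(chan Job, len(jobs))
	codeQ := make(chan Job, len(jobs))

	pulse := startStage(4, pulseQ, interventionQ, func(j Job) bool { return !j.Healthy })
	intervention := startStage(2, interventionQ, codeQ, func(j Job) bool {
		atomic.AddInt64(&interventions, 1)
		return true // assumption: every intervention also raises an alert
	})
	code := startStage(1, codeQ, nil, func(j Job) bool {
		atomic.AddInt64(&alerts, 1)
		return false
	})

	for _, j := range jobs {
		pulseQ <- j
	}
	// Shut down front to back so each stage drains before the next closes.
	close(pulseQ)
	pulse.Wait()
	close(interventionQ)
	intervention.Wait()
	close(codeQ)
	code.Wait()
	return interventions, alerts
}

func main() {
	i, a := run([]Job{{"api", true}, {"db", false}, {"cache", false}})
	fmt.Printf("interventions=%d alerts=%d\n", i, a)
}
```

Because each stage owns its queue and worker count, the stages can be tuned or scaled separately, which is the property the list above describes.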

Queue and Worker Pool Architecture

[Figure: Queue and Worker Pool]

Queue Implementations:

  • HybridQueue: Ring buffer + overflow slice for reliable FIFO processing
  • AdaptiveQueue: Auto-scaling ring buffer for variable load
  • WorkivaQueue: Lock-free ring buffer for ultra-low latency
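To make the "ring buffer + overflow slice" idea concrete, here is a minimal single-goroutine sketch. It is not CPRA's HybridQueue: the type, its methods, and the `int` payload are hypothetical, and the real implementation may differ, for example in concurrency control. The sketch also shows why a ring buffer commonly requires a power-of-two capacity: the index can be wrapped with a bit mask instead of a modulo.

```go
package main

import "fmt"

// HybridQueue is an illustrative FIFO: a fixed power-of-two ring buffer plus
// an overflow slice that is used only while the ring is full.
// It is not safe for concurrent use.
type HybridQueue struct {
	ring     []int
	mask     uint64
	head     uint64 // next slot to read
	tail     uint64 // next slot to write
	overflow []int
}

// NewHybridQueue rejects capacities that are not a power of two, because the
// ring index is computed as position & mask.
func NewHybridQueue(capacity int) (*HybridQueue, error) {
	if capacity <= 0 || capacity&(capacity-1) != 0 {
		return nil, fmt.Errorf("capacity %d is not a power of two", capacity)
	}
	return &HybridQueue{ring: make([]int, capacity), mask: uint64(capacity - 1)}, nil
}

// Push appends v. Once anything sits in overflow, new items must queue behind
// it so that FIFO order is preserved.
func (q *HybridQueue) Push(v int) {
	if len(q.overflow) > 0 || q.tail-q.head == uint64(len(q.ring)) {
		q.overflow = append(q.overflow, v)
		return
	}
	q.ring[q.tail&q.mask] = v
	q.tail++
}

// Pop removes the oldest item, then refills the freed ring slot from overflow.
func (q *HybridQueue) Pop() (int, bool) {
	if q.tail == q.head {
		return 0, false
	}
	v := q.ring[q.head&q.mask]
	q.head++
	if len(q.overflow) > 0 {
		q.ring[q.tail&q.mask] = q.overflow[0]
		q.tail++
		q.overflow = q.overflow[1:]
	}
	return v, true
}

func main() {
	q, err := NewHybridQueue(4)
	if err != nil {
		panic(err)
	}
	for i := 1; i <= 6; i++ { // items 5 and 6 spill into overflow
		q.Push(i)
	}
	for v, ok := q.Pop(); ok; v, ok = q.Pop() {
		fmt.Print(v, " ")
	}
	fmt.Println()
}
```

The overflow slice trades bounded memory for never dropping a job; a production queue would also need a policy for how large the overflow may grow.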

Dynamic Worker Pools:

  • Powered by panjf2000/ants goroutine pool
  • Automatic scaling using M/M/c queueing theory
  • Configurable min/max workers and SLO targets

For a comprehensive architecture explanation, see the Architecture Overview.


Quick Start

Option 1: Build and Run Locally

# Prerequisites: Go 1.25 or later
go version  # Should show go1.25 or higher

# Build from source
git clone https://github.com/ziad-hsn/cpra.git
cd cpra
go build .

# Run with example configuration
./cpra --yaml mock-servers/test_10k.yaml

Expected Output:

Starting CPRA Optimized Controller for 1M Monitors
Profiling server listening at http://localhost:6060/debug/pprof/
Loading monitors from mock-servers/test_10k.yaml...
Monitor loading completed in 1.2s
[INFO] Controller started successfully
[INFO] Pulse pipeline processing 10,000 monitors
[INFO] Worker pool scaled to 143 workers (target SLO: 100ms)

Installation

Prerequisites

  • Go 1.25 or later (download)
  • Docker (optional, for containerized deployment)

Building from Source

  1. Clone the repository:

    git clone https://github.com/ziad-hsn/cpra.git
    cd cpra
  2. Download dependencies:

    go mod download
  3. Build the application:

    go build .
  4. Verify installation:

    ./cpra --help

Docker Deployment

  1. Build the Docker image:

    docker build -f docker/Dockerfile -t cpra:latest .
  2. Run the container:

    docker run -it --rm \
      -v "$(pwd)/my-monitors.yaml:/app/monitors.yaml" \
      cpra:latest \
      ./cpra --yaml monitors.yaml

Configuration

Monitor Configuration (YAML)

Create a monitors.yaml file to define health checks:

monitors:
  - name: "my-service-health-check"
    pulse_check:
      type: http
      interval: 30s
      timeout: 5s
      max_failures: 3
      config:
        method: GET
        url: http://my-service.example.com/health
        retries: 2
    intervention:
      action: docker
      config:
        container: my-service-container
        action: restart
    codes:
      red:
        dispatch: true
        notify: pagerduty
        config:
          url: https://events.pagerduty.com/v2/enqueue
      yellow:
        dispatch: true
        notify: log
        config:
          file: /var/log/cpra-alerts.log

Generating Test Configurations:

Use mock-servers/generate_monitors.py to generate test configurations with any number of monitors.

Application Configuration

Configure CPRA behavior programmatically:

package main

import (
    "time" // needed for the time.Millisecond durations below

    "cpra/internal/controller"
)

func main() {
    config := controller.DefaultConfig()

    // Debug mode
    config.Debug = true

    // Worker pool settings (applies to all three pipelines)
    config.WorkerConfig.MinWorkers = 10
    config.WorkerConfig.MaxWorkers = 500

    // Queue settings
    config.QueueCapacity = 131072  // Must be power of 2

    // Performance tuning
    config.BatchSize = 2000
    config.SizingServiceTime = 20 * time.Millisecond  // Average job duration
    config.SizingSLO = 100 * time.Millisecond         // Target latency
    config.SizingHeadroomPct = 0.15                   // 15% safety buffer

    ctrl := controller.NewController(config)
    // ... rest of initialization
}

See the API Reference for complete configuration options.


Command-Line Options

./cpra [OPTIONS]

| Option | Type | Default | Description |
|---|---|---|---|
| --yaml | string | internal/loader/replicated_test.yaml | Path to monitors YAML file |
| --config | string | - | Configuration file path (optional) |
| --debug | bool | false | Enable debug-level logging |
| --pprof | bool | true | Enable pprof profiling server |
| --pprof.addr | string | localhost:6060 | Pprof server listen address |

Examples:

# Run with debug logging
./cpra --yaml monitors.yaml --debug

# Run with custom pprof port
./cpra --yaml monitors.yaml --pprof.addr localhost:8080

# Disable profiling
./cpra --yaml monitors.yaml --pprof=false



Troubleshooting

Common Issues

Issue: YAML file not found

Warning: YAML file monitors.yaml not found, starting without loading monitors

Solution: Verify the file path is correct. Use absolute paths or paths relative to where you run the binary:

./cpra --yaml "$(pwd)/monitors.yaml"

Issue: Build fails with Go version error

go.mod requires go >= 1.25

Solution: Upgrade Go to version 1.25 or later:

go version  # Check current version
# Download Go 1.25+ from https://go.dev/dl/

Issue: High memory usage

Solution: Check memory usage with pprof:

# While CPRA is running, access pprof
go tool pprof http://localhost:6060/debug/pprof/heap

# View top memory consumers
(pprof) top

Adjust memory limits in configuration:

config.WorkerConfig.MaxWorkers = 200  // Reduce max workers
config.QueueCapacity = 65536          // Reduce queue size

Issue: Worker pool not scaling

Solution: Enable debug logging to see scaling decisions:

./cpra --yaml monitors.yaml --debug

Check queueing theory parameters:

config.SizingServiceTime = 50 * time.Millisecond  // Increase if jobs take longer
config.SizingSLO = 200 * time.Millisecond         // Relax SLO if needed
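A quick check of why the service-time setting matters so much. The arrival rate used here is an assumption for illustration, taken from the 10,000 checks/sec per-pipeline figure quoted earlier; \lambda is the arrival rate, S the mean service time, a the offered load, and c the worker count.

```latex
% Offered load (average number of busy workers)
a = \lambda \cdot S

% With S = 20 ms
a = 10{,}000\ \text{s}^{-1} \times 0.020\ \text{s} = 200

% With S = 50 ms
a = 10{,}000\ \text{s}^{-1} \times 0.050\ \text{s} = 500

% A stable queue requires strictly more workers than the offered load
c > a
```

Under these assumed numbers, raising the service time from 20 ms to 50 ms raises the offered load from 200 to 500, so a pool capped at 500 workers could no longer keep up and MaxWorkers would need to rise as well.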

Issue: Monitors not executing

Solution: Verify monitor configuration format and check logs:

./cpra --yaml monitors.yaml --debug 2>&1 | grep ERROR

Validate YAML syntax:

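# Parse the file with PyYAML (pip install pyyaml); a syntax error prints a traceback
python -c "import sys, yaml; yaml.safe_load(open(sys.argv[1]))" monitors.yaml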

Getting Help

  • Documentation: Check the docs/ folder for detailed guides
  • Issues: Open an issue for bugs or feature requests
  • Discussions: Ask questions and share ideas in GitHub Discussions
  • Logs: Always provide logs when reporting issues (use --debug flag)

Contributing

We welcome contributions from the community! CPRA is an open-source project and we appreciate:

  • 🐛 Bug reports and fixes
  • ✨ Feature requests and implementations
  • 📖 Documentation improvements
  • 🧪 Test coverage enhancements
  • 💡 Performance optimizations

Getting Started:

  1. Look for issues labeled good first issue
  2. Fork the repository and submit a pull request



License

This project is licensed under the MIT License - see the LICENSE file for details.


Acknowledgments

CPRA is built on excellent open-source libraries, including mlange-42/ark (Entity-Component-System) and panjf2000/ants (goroutine pool).



Built with ❤️ for platform teams managing large-scale infrastructure
