Monitor millions of services concurrently with automated remediation and dynamic worker scaling.
CPRA is a high-performance infrastructure monitoring system designed for platform teams managing large-scale microservice architectures. Built on an Entity-Component-System (ECS) architecture and queueing theory principles, CPRA handles 1,000,000+ concurrent health checks and scales its worker pools automatically to meet SLO targets.
- Why CPRA?
- Key Features
- Performance Characteristics
- Architecture
- Quick Start
- Installation
- Configuration
- Command-Line Options
- Documentation
- Troubleshooting
- Contributing
- License
Use CPRA when you need to:
- Monitor 100,000+ concurrent services, containers, or endpoints
- Automatically remediate failures without human intervention
- Scale monitoring infrastructure dynamically based on load
- Achieve sub-100ms P95 latency from detection to alerting
- Minimize memory footprint (~100 bytes per monitor)
- Handles 1,000,000+ concurrent monitors on commodity hardware
- Linear scaling with minimal overhead per monitor
- Memory-efficient design: ~100 bytes per monitor
- 10,000+ health checks per second per pipeline
- P95 latency < 100ms from schedule to result processing
- Batch processing and lock-free queues minimize overhead
- Three independent pipelines:
- Pulse: Health checking (HTTP, TCP, ICMP, custom scripts)
- Intervention: Automated recovery (restart services, scale resources, run scripts)
- Code: Alerting and notifications (email, SMS, webhooks, PagerDuty)
- M/M/c queueing theory: Automatically calculates optimal worker count
- Allen-Cunneen approximation: Handles real-world workload variability
- SLO-driven sizing: Dynamically scales to meet latency targets
- Entity-Component-System (ECS) using mlange-42/ark
- Cache-friendly memory layout for maximum performance
- Minimal allocations and GC pressure
- Built-in pprof profiling for debugging
- Graceful shutdown with context cancellation
- Comprehensive logging with debug mode
- Memory management with automatic GC triggering
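For readers unfamiliar with the queueing terms in the feature list, the standard textbook formulas behind M/M/c sizing and the Allen-Cunneen correction are summarized below. The symbols are introduced here for illustration and are not taken from the CPRA source: λ is the arrival rate of jobs, μ the service rate of one worker, c the number of workers, and C_a², C_s² the squared coefficients of variation of inter-arrival and service times. How CPRA applies these formulas internally is not shown in this README.

```latex
% Offered load and per-worker utilization (the queue is stable only when \rho < 1)
a = \frac{\lambda}{\mu}, \qquad \rho = \frac{a}{c}

% Erlang C: probability that an arriving job has to wait for a worker
C(c, a) = \frac{\dfrac{a^{c}}{c!}\,\dfrac{1}{1-\rho}}
               {\displaystyle\sum_{k=0}^{c-1}\frac{a^{k}}{k!} + \dfrac{a^{c}}{c!}\,\dfrac{1}{1-\rho}}

% Mean queueing delay in an M/M/c system
W_q^{M/M/c} = \frac{C(c, a)}{c\mu - \lambda}

% Allen-Cunneen approximation for general arrival and service distributions
W_q^{G/G/c} \approx W_q^{M/M/c} \cdot \frac{C_a^{2} + C_s^{2}}{2}
```

When arrivals and service times are both exponential, C_a² = C_s² = 1 and the correction factor reduces to 1, recovering the plain M/M/c result.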
| Metric | Value |
|---|---|
| Max Concurrent Monitors | 1,000,000+ |
| Throughput | 10,000+ checks/sec/pipeline |
| Latency (P95) | < 100ms (configurable via SLO) |
| Memory per Monitor | ~100 bytes |
| Total Memory (1M monitors) | ~100 MB + worker pool overhead |
| Worker Scaling | Dynamic (M/M/c based) |
See Architecture Overview for detailed benchmarks and analysis.
CPRA uses a three-pipeline architecture built on Entity-Component-System principles:
- Pulse Pipeline: Executes health checks (HTTP requests, TCP connections, custom scripts)
- Intervention Pipeline: Performs automated remediation when monitors fail
- Code Pipeline: Sends alert notifications to incident management systems
Each pipeline operates independently with its own queue and dynamically-scaled worker pool, enabling:
- Pipeline-specific tuning: Configure each pipeline separately
- Fault isolation: One pipeline failure doesn't affect others
- Independent scaling: Scale workers based on per-pipeline load
Queue Implementations:
- HybridQueue: Ring buffer + overflow slice for reliable FIFO processing
- AdaptiveQueue: Auto-scaling ring buffer for variable load
- WorkivaQueue: Lock-free ring buffer for ultra-low latency
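To illustrate the "ring buffer + overflow slice" idea named for HybridQueue, here is a minimal single-goroutine sketch. The type `hybridQueue` and its methods are hypothetical names invented for this example; CPRA's real queue types, element types, and concurrency handling may differ.

```go
package main

import "fmt"

// hybridQueue is a hypothetical FIFO queue: a fixed-size ring buffer plus an
// overflow slice that absorbs bursts beyond the ring's capacity.
// It is not safe for concurrent use.
type hybridQueue struct {
	ring     []int // fixed-capacity circular buffer
	head     int   // index of the oldest element in ring
	size     int   // number of elements currently stored in ring
	overflow []int // elements that arrived while the ring was full
}

func newHybridQueue(capacity int) *hybridQueue {
	if capacity < 1 {
		capacity = 1 // the ring needs at least one slot
	}
	return &hybridQueue{ring: make([]int, capacity)}
}

// Push appends v. While the overflow slice holds items, new items must also
// go to overflow, otherwise they would overtake older ones.
func (q *hybridQueue) Push(v int) {
	if len(q.overflow) == 0 && q.size < len(q.ring) {
		q.ring[(q.head+q.size)%len(q.ring)] = v
		q.size++
		return
	}
	q.overflow = append(q.overflow, v)
}

// Pop removes and returns the oldest item, then refills the freed ring slot
// from the front of the overflow slice so FIFO order is preserved.
func (q *hybridQueue) Pop() (int, bool) {
	if q.size == 0 {
		return 0, false
	}
	v := q.ring[q.head]
	q.head = (q.head + 1) % len(q.ring)
	q.size--
	if len(q.overflow) > 0 {
		q.ring[(q.head+q.size)%len(q.ring)] = q.overflow[0]
		q.overflow = q.overflow[1:]
		q.size++
	}
	return v, true
}

func main() {
	q := newHybridQueue(2) // tiny ring so that items 3..5 spill into overflow
	for i := 1; i <= 5; i++ {
		q.Push(i)
	}
	for v, ok := q.Pop(); ok; v, ok = q.Pop() {
		fmt.Println(v) // items come out in insertion order
	}
}
```

The design trade-off in this sketch is that the ring gives allocation-free operation in the common case, while the slice trades some allocation for never dropping items during a burst.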
Dynamic Worker Pools:
- Powered by panjf2000/ants goroutine pool
- Automatic scaling using M/M/c queueing theory
- Configurable min/max workers and SLO targets
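The sizing idea in the list above can be sketched with the standard Erlang C formula. This is an illustration under stated assumptions, not CPRA's implementation: the function names are hypothetical, the criterion used here is mean latency (queueing delay plus service time) rather than a P95 target, and headroom is applied as a simple multiplier.

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// erlangC returns the probability that an arriving job has to queue in an
// M/M/c system with c workers and offered load a = arrivalRate * serviceTime.
// It uses the numerically stable Erlang B recursion and requires a < c.
func erlangC(c int, a float64) float64 {
	b := 1.0 // Erlang B blocking probability with zero workers
	for k := 1; k <= c; k++ {
		b = a * b / (float64(k) + a*b)
	}
	rho := a / float64(c) // per-worker utilization
	return b / (1 - rho*(1-b))
}

// meanLatency estimates mean time in system (queueing delay plus service) in
// seconds. ca2 and cs2 are the squared coefficients of variation of
// inter-arrival and service times; the Allen-Cunneen factor (ca2+cs2)/2
// equals 1 for a pure M/M/c queue.
func meanLatency(c int, rate, serviceSec, ca2, cs2 float64) float64 {
	a := rate * serviceSec
	if float64(c) <= a {
		return math.Inf(1) // unstable: the queue grows without bound
	}
	wq := erlangC(c, a) * serviceSec / (float64(c) - a)
	return wq*(ca2+cs2)/2 + serviceSec
}

// sizeWorkers returns the smallest worker count whose estimated mean latency
// meets sloSec, padded by headroom and clamped to [minW, maxW].
func sizeWorkers(rate, serviceSec, sloSec, headroom float64, minW, maxW int) int {
	c := int(rate*serviceSec) + 1 // smallest count that keeps utilization below 1
	for c < maxW && meanLatency(c, rate, serviceSec, 1, 1) > sloSec {
		c++
	}
	c = int(math.Ceil(float64(c) * (1 + headroom)))
	if c > maxW {
		c = maxW
	}
	if c < minW {
		c = minW
	}
	return c
}

func main() {
	service := 20 * time.Millisecond // plays the role of SizingServiceTime
	slo := 100 * time.Millisecond    // plays the role of SizingSLO
	// Assumed load for the example: 1,000 jobs per second, 15% headroom.
	workers := sizeWorkers(1000, service.Seconds(), slo.Seconds(), 0.15, 10, 500)
	fmt.Println("workers:", workers)
}
```

With these example inputs the offered load is 20 busy workers on average, 21 workers already meet the 100 ms mean-latency target, and the 15% headroom rounds the result up to 25.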
For a comprehensive architecture explanation, see the Architecture Overview.
# Prerequisites: Go 1.25 or later
go version # Should show go1.25 or higher
# Build from source
git clone https://github.com/ziad/cpra.git
cd cpra
go build .
# Run with example configuration
./cpra --yaml mock-servers/test_10k.yaml

Expected Output:
Starting CPRA Optimized Controller for 1M Monitors
Profiling server listening at http://localhost:6060/debug/pprof/
Loading monitors from mock-servers/test_10k.yaml...
Monitor loading completed in 1.2s
[INFO] Controller started successfully
[INFO] Pulse pipeline processing 10,000 monitors
[INFO] Worker pool scaled to 143 workers (target SLO: 100ms)
- Go 1.25 or later (download from https://go.dev/dl/)
- Docker (optional, for containerized deployment)
- Clone the repository:
  git clone https://github.com/ziad/cpra.git
  cd cpra
- Download dependencies:
  go mod download
- Build the application:
  go build .
- Verify installation:
  ./cpra --help
- Build the Docker image:
  docker build -f docker/Dockerfile -t cpra:latest .
- Run the container:
  docker run -it --rm \
    -v $(pwd)/my-monitors.yaml:/app/monitors.yaml \
    cpra:latest \
    ./cpra --yaml monitors.yaml
Create a monitors.yaml file to define health checks:
monitors:
  - name: "my-service-health-check"
    pulse_check:
      type: http
      interval: 30s
      timeout: 5s
      max_failures: 3
      config:
        method: GET
        url: http://my-service.example.com/health
        retries: 2
    intervention:
      action: docker
      config:
        container: my-service-container
        action: restart
    codes:
      red:
        dispatch: true
        notify: pagerduty
        config:
          url: https://events.pagerduty.com/v2/enqueue
      yellow:
        dispatch: true
        notify: log
        config:
          file: /var/log/cpra-alerts.log

Generating Test Configurations:
Use mock-servers/generate_monitors.py to generate test configurations with any number of monitors.
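If Python is not available, a small Go program can produce a comparable test file. The sketch below is not a port of generate_monitors.py, whose options are not documented here. The function name `generateMonitors`, the monitor names, the localhost URL pattern, and the exact field nesting are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// generateMonitors returns a YAML document containing n HTTP monitors.
// The nesting of pulse_check/config and the URL pattern are assumptions;
// adjust them to match the schema your CPRA version expects.
func generateMonitors(n int) string {
	var b strings.Builder
	b.WriteString("monitors:\n")
	for i := 0; i < n; i++ {
		fmt.Fprintf(&b, "  - name: \"test-monitor-%d\"\n", i)
		b.WriteString("    pulse_check:\n")
		b.WriteString("      type: http\n")
		b.WriteString("      interval: 30s\n")
		b.WriteString("      timeout: 5s\n")
		b.WriteString("      config:\n")
		b.WriteString("        method: GET\n")
		fmt.Fprintf(&b, "        url: http://localhost:8080/health/%d\n", i)
	}
	return b.String()
}

func main() {
	// Print three monitors to stdout; redirect the output to a file and
	// pass that file to CPRA with --yaml.
	fmt.Print(generateMonitors(3))
}
```

Saved under a hypothetical name such as gen.go, it could be used as `go run gen.go > monitors.yaml` followed by `./cpra --yaml monitors.yaml`.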
Configure CPRA behavior programmatically:
package main

import (
	"time"

	"cpra/internal/controller"
)

func main() {
	config := controller.DefaultConfig()

	// Debug mode
	config.Debug = true

	// Worker pool settings (applies to all three pipelines)
	config.WorkerConfig.MinWorkers = 10
	config.WorkerConfig.MaxWorkers = 500

	// Queue settings
	config.QueueCapacity = 131072 // Must be a power of 2

	// Performance tuning
	config.BatchSize = 2000
	config.SizingServiceTime = 20 * time.Millisecond // Average job duration
	config.SizingSLO = 100 * time.Millisecond        // Target latency
	config.SizingHeadroomPct = 0.15                  // 15% safety buffer

	ctrl := controller.NewController(config)
	// ... rest of initialization
}

See the API Reference for complete configuration options.
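Because QueueCapacity must be a power of 2, it can help to check or round a desired capacity before assigning it. The helpers below are a hypothetical sketch and are not part of CPRA's API. Ring buffers commonly require this constraint so that an index can be wrapped with a bit mask instead of a modulo; whether CPRA's queues do exactly that is an assumption.

```go
package main

import (
	"fmt"
	"math/bits"
)

// isPowerOfTwo reports whether n is a positive power of two.
// A power of two has exactly one bit set, so n & (n-1) clears it to zero.
func isPowerOfTwo(n int) bool {
	return n > 0 && n&(n-1) == 0
}

// nextPowerOfTwo rounds n up to the nearest power of two (minimum 1).
func nextPowerOfTwo(n int) int {
	if n <= 1 {
		return 1
	}
	// bits.Len(x) is the number of bits needed to represent x, so shifting 1
	// by the length of n-1 yields the smallest power of two >= n.
	return 1 << bits.Len(uint(n-1))
}

func main() {
	wanted := 100000 // desired capacity that is not a power of two
	fmt.Println(isPowerOfTwo(wanted), nextPowerOfTwo(wanted))
}
```

For example, a desired capacity of 100,000 rounds up to 131,072, the value used in the configuration above.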
./cpra [OPTIONS]

| Option | Type | Default | Description |
|---|---|---|---|
| --yaml | string | internal/loader/replicated_test.yaml | Path to monitors YAML file |
| --config | string | - | Configuration file path (optional) |
| --debug | bool | false | Enable debug-level logging |
| --pprof | bool | true | Enable pprof profiling server |
| --pprof.addr | string | localhost:6060 | Pprof server listen address |
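The option table maps naturally onto Go's standard `flag` package, which accepts both `-name` and `--name` and requires the `--name=false` form to turn off a boolean flag. The sketch below assumes that package is used; CPRA's actual flag-parsing code is not shown in this README, and the `options` struct and `parseOptions` function are hypothetical.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// options mirrors the documented command-line options.
type options struct {
	yamlPath  string
	config    string
	debug     bool
	pprof     bool
	pprofAddr string
}

// parseOptions parses args (without the program name) using the names and
// defaults from the option table.
func parseOptions(args []string) (options, error) {
	var o options
	fs := flag.NewFlagSet("cpra", flag.ContinueOnError)
	fs.StringVar(&o.yamlPath, "yaml", "internal/loader/replicated_test.yaml", "Path to monitors YAML file")
	fs.StringVar(&o.config, "config", "", "Configuration file path (optional)")
	fs.BoolVar(&o.debug, "debug", false, "Enable debug-level logging")
	fs.BoolVar(&o.pprof, "pprof", true, "Enable pprof profiling server")
	fs.StringVar(&o.pprofAddr, "pprof.addr", "localhost:6060", "Pprof server listen address")
	err := fs.Parse(args)
	return o, err
}

func main() {
	o, err := parseOptions(os.Args[1:])
	if err != nil {
		os.Exit(2) // the flag package has already printed the error and usage
	}
	fmt.Printf("%+v\n", o)
}
```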
Examples:
# Run with debug logging
./cpra --yaml monitors.yaml --debug
# Run with custom pprof port
./cpra --yaml monitors.yaml --pprof.addr localhost:8080
# Disable profiling
./cpra --yaml monitors.yaml --pprof=false

- Architecture Overview - System design, diagrams, and performance analysis
- API Reference - Complete API documentation with function signatures
- Types Reference - Data structures and component definitions
- Quickstart Tutorial - Get started in 5-10 minutes
- Common Tasks - How-to guides for typical operations
- Getting Started - Detailed setup and deployment guide
Issue: YAML file not found
Warning: YAML file monitors.yaml not found, starting without loading monitors
Solution: Verify the file path is correct. Use absolute paths or paths relative to where you run the binary:
./cpra --yaml $(pwd)/monitors.yaml

Issue: Build fails with Go version error
go.mod requires go >= 1.25
Solution: Upgrade Go to version 1.25 or later:
go version # Check current version
# Download Go 1.25+ from https://go.dev/dl/

Issue: High memory usage
Solution: Check memory usage with pprof:
# While CPRA is running, access pprof
go tool pprof http://localhost:6060/debug/pprof/heap
# View top memory consumers
(pprof) top

Adjust memory limits in configuration:
config.WorkerConfig.MaxWorkers = 200 // Reduce max workers
config.QueueCapacity = 65536 // Reduce queue size

Issue: Worker pool not scaling
Solution: Enable debug logging to see scaling decisions:
./cpra --yaml monitors.yaml --debug

Check queueing theory parameters:
config.SizingServiceTime = 50 * time.Millisecond // Increase if jobs take longer
config.SizingSLO = 200 * time.Millisecond // Relax SLO if needed

Issue: Monitors not executing
Solution: Verify monitor configuration format and check logs:
./cpra --yaml monitors.yaml --debug 2>&1 | grep ERROR

Validate YAML syntax:
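# Parse the file with PyYAML (pip install pyyaml); a syntax error produces a traceback
python -c 'import sys, yaml; yaml.safe_load(open(sys.argv[1]))' monitors.yaml

- Documentation: Check the docs/ folder for detailed guides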
- Issues: Open an issue for bugs or feature requests
- Discussions: Ask questions and share ideas in GitHub Discussions
- Logs: Always provide logs when reporting issues (use the --debug flag)
We welcome contributions from the community! CPRA is an open-source project and we appreciate:
- 🐛 Bug reports and fixes
- ✨ Feature requests and implementations
- 📖 Documentation improvements
- 🧪 Test coverage enhancements
- 💡 Performance optimizations
Getting Started:
- Look for issues labeled good first issue
- Fork the repository and submit a pull request
Development Resources:
- Architecture Overview - Understand the system design
- API Reference - Function signatures and usage
This project is licensed under the MIT License - see the LICENSE file for details.
CPRA is built on excellent open-source libraries:
- mlange-42/ark - High-performance Entity-Component-System
- panjf2000/ants - Goroutine pool with dynamic scaling
- Workiva/go-datastructures - Lock-free data structures
- uber-go/zap - Structured logging
Built with ❤️ for platform teams managing large-scale infrastructure


