Merged
30 commits
401dd71
add OTEL-aware telemetry core that now registers HTTP runtime metrics
ginaxu1 Nov 17, 2025
4974806
Refactor to exchange/pkg/monitoring
ginaxu1 Nov 17, 2025
734cd5b
Add metrics to api-server-go
ginaxu1 Nov 17, 2025
f7460c8
Add metrics to PDP and CE too
ginaxu1 Nov 17, 2025
b3f98b7
Add README for pkg/monitoring
ginaxu1 Nov 17, 2025
0841ae5
Address PR comments
ginaxu1 Nov 18, 2025
637d041
Update README
ginaxu1 Nov 18, 2025
cf8a92c
Address PR comments
ginaxu1 Nov 19, 2025
be19b1b
add OTEL-aware telemetry core that now registers HTTP runtime metrics
ginaxu1 Nov 17, 2025
acf256f
Refactor to exchange/pkg/monitoring
ginaxu1 Nov 17, 2025
c242942
Add metrics to PDP and CE too
ginaxu1 Nov 17, 2025
e4f2e60
Address PR comments
ginaxu1 Nov 18, 2025
68b3baf
add monitoring for other services
ginaxu1 Nov 21, 2025
a65b062
Fix Build fail
ginaxu1 Nov 21, 2025
59d6fd2
Fix OE Unit tests
ginaxu1 Nov 21, 2025
8a309d3
Remove unneeded files
ginaxu1 Nov 21, 2025
68491e5
Fix merge conflicts w main
ginaxu1 Nov 25, 2025
d40d287
Rewrite observability/
ginaxu1 Nov 27, 2025
2f4c8d2
revert changes to services, will add in next PR
ginaxu1 Nov 27, 2025
be8049c
more clean up, make PR smaller and focused only on adding observabili…
ginaxu1 Nov 27, 2025
67e47c1
remove pkg/monitoring
ginaxu1 Nov 27, 2025
9a8c1a3
Update observability/README.md
ginaxu1 Nov 28, 2025
a6e2957
Update observability/README.md
ginaxu1 Nov 28, 2025
7a402ee
Update observability/docker-compose.yml
ginaxu1 Nov 28, 2025
b2a6fbc
Update observability/README.md
ginaxu1 Nov 28, 2025
c164e21
Address remaining comments
ginaxu1 Nov 28, 2025
6340c6f
Address comments
ginaxu1 Dec 2, 2025
f60e87b
Update README for observability, refernce superapp format
ginaxu1 Dec 2, 2025
59c1ed7
Fix merge conflict with main:
ginaxu1 Dec 2, 2025
956059b
revert portal-backend/main.go to main version
ginaxu1 Dec 2, 2025
9 changes: 9 additions & 0 deletions audit-service/docker-compose.yml
@@ -16,6 +16,8 @@ services:
     depends_on:
       - postgres
     restart: unless-stopped
+    networks:
+      - opendif-network

   postgres:
     image: postgres:15-alpine
@@ -28,6 +30,13 @@ services:
     volumes:
       - postgres_data:/var/lib/postgresql/data
     restart: unless-stopped
+    networks:
+      - opendif-network

 volumes:
   postgres_data:
+
+networks:
+  opendif-network:
+    name: opendif-network
+    external: true
11 changes: 6 additions & 5 deletions exchange/docker-compose.yml
@@ -28,7 +28,7 @@ services:
       start_period: 10s
     restart: unless-stopped
     networks:
-      - exchange-network
+      - opendif-network

   consent-engine:
     build:
@@ -55,7 +55,7 @@ services:
       start_period: 10s
     restart: unless-stopped
     networks:
-      - exchange-network
+      - opendif-network

   orchestration-engine:
     build:
@@ -82,8 +82,9 @@ services:
       start_period: 10s
     restart: unless-stopped
     networks:
-      - exchange-network
+      - opendif-network

 networks:
-  exchange-network:
-    driver: bridge
+  opendif-network:
+    name: opendif-network
+    external: true
342 changes: 342 additions & 0 deletions observability/README.md
@@ -0,0 +1,342 @@
# Observability Stack for OpenDIF MVP

Local development stack: **Go Services** → **Prometheus** → **Grafana**

Collects real-time metrics from all Go services for debugging performance and errors.

---

## Quick Start

```bash
cd observability
docker compose up -d
```

**Services:**

- **Prometheus**: http://localhost:9090 (raw metrics & queries)
- **Grafana**: http://localhost:3002 (dashboards, login: `admin` / `admin`)

**Prerequisites:**

Ensure all Go services are running and connected to the `opendif-network`:
- Orchestration Engine (port 4000)
- Consent Engine (port 8081)
- Policy Decision Point (port 8082)
- Portal Backend (port 3000)
- Audit Service (port 3001)

---

## Metrics Overview

### HTTP Request Metrics

| Metric | Type | Labels | Purpose |
| ------------------------------- | --------- | ----------------------------------------- | -------------------------- |
| `http_requests_total` | Counter | `method`, `route`, `status_code` | Request volume by endpoint |
| `http_request_duration_seconds` | Histogram | `method`, `route` | API latency percentiles |

**Label Definitions:**
- `method`: HTTP method (GET, POST, PUT, DELETE, etc.)
- `route`: Normalized route path (e.g., `/consents`, `/policies`)
- `status_code`: HTTP response status code (200, 404, 500, etc.)
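
To illustrate how these three labels could be derived per request, here is a minimal, standard-library-only sketch. It is not the monitoring package's implementation: `requestCounter`, `normalizeRoute`, `simulate`, and `demoHandler` are hypothetical names, the "numeric segment becomes `:id`" rule is an assumed normalization, and an in-memory map stands in for the real `http_requests_total` counter.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strconv"
	"strings"
	"sync"
)

// statusRecorder remembers the status code written by the wrapped handler.
type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (r *statusRecorder) WriteHeader(code int) {
	r.status = code
	r.ResponseWriter.WriteHeader(code)
}

// requestCounter is a hypothetical in-memory stand-in for http_requests_total.
type requestCounter struct {
	mu     sync.Mutex
	counts map[string]int
}

func newRequestCounter() *requestCounter {
	return &requestCounter{counts: make(map[string]int)}
}

func labelKey(method, route string, status int) string {
	return method + " " + route + " " + strconv.Itoa(status)
}

// normalizeRoute replaces purely numeric path segments with ":id" so that
// /consents/42 and /consents/43 share one label value (assumed rule).
func normalizeRoute(path string) string {
	parts := strings.Split(path, "/")
	for i, p := range parts {
		if _, err := strconv.Atoi(p); err == nil {
			parts[i] = ":id"
		}
	}
	return strings.Join(parts, "/")
}

// Middleware counts each request by method, normalized route, and status code.
func (c *requestCounter) Middleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// Handlers that never call WriteHeader implicitly respond with 200.
		rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
		next.ServeHTTP(rec, r)
		c.mu.Lock()
		c.counts[labelKey(r.Method, normalizeRoute(r.URL.Path), rec.status)]++
		c.mu.Unlock()
	})
}

// Count returns the recorded value for one label combination.
func (c *requestCounter) Count(method, route string, status int) int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.counts[labelKey(method, route, status)]
}

// simulate sends one in-memory request through h and returns the status code.
func simulate(h http.Handler, method, path string) int {
	rw := httptest.NewRecorder()
	h.ServeHTTP(rw, httptest.NewRequest(method, path, nil))
	return rw.Code
}

// demoHandler serves a single illustrative route.
func demoHandler() http.Handler {
	mux := http.NewServeMux()
	mux.HandleFunc("/consents/", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "ok")
	})
	return mux
}

func main() {
	counter := newRequestCounter()
	h := counter.Middleware(demoHandler())
	simulate(h, "GET", "/consents/42")
	simulate(h, "GET", "/consents/43")
	fmt.Println(counter.Count("GET", "/consents/:id", 200)) // prints 2
}
```

Normalizing the route before using it as a label keeps cardinality bounded: without it, every distinct ID would create a new time series.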

### External Call Metrics

| Metric | Type | Labels | Purpose |
| --------------------------------- | --------- | ----------------------------------------- | -------------------------- |
| `external_calls_total` | Counter | `external_target`, `external_operation`, `external_success` | External call volume |
| `external_call_duration_seconds` | Histogram | `external_target`, `external_operation` | External call latency |
| `external_call_errors_total` | Counter | `external_target`, `external_operation` | Failed external calls |

**Label Definitions:**
- `external_target`: Target service or system (e.g., `postgres`, `redis`, `external-api`)
- `external_operation`: Operation type (e.g., `query`, `insert`, `get`, `set`)
- `external_success`: Success status (`true` or `false`)
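
The three external-call metrics can all be fed from one wrapper around the call site. The sketch below shows that shape using only the standard library; `externalMetrics`, `Observe`, `Snapshot`, and `errDemo` are hypothetical names, and the in-memory struct is an assumed stand-in for the real counters and histogram, not the monitoring package's API.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// errDemo is a placeholder failure used by the example.
var errDemo = errors.New("demo failure")

// callStats stands in for external_calls_total, external_call_errors_total,
// and the running sum of external_call_duration_seconds for one label pair.
type callStats struct {
	Calls         int
	Errors        int
	TotalDuration time.Duration
}

// externalMetrics is a hypothetical in-memory recorder.
type externalMetrics struct {
	mu    sync.Mutex
	stats map[string]*callStats
}

func newExternalMetrics() *externalMetrics {
	return &externalMetrics{stats: make(map[string]*callStats)}
}

// Observe runs fn, times it, and records the outcome under target/operation.
// The error from fn is returned unchanged so callers keep their own handling.
func (m *externalMetrics) Observe(target, operation string, fn func() error) error {
	start := time.Now()
	err := fn()
	elapsed := time.Since(start)

	m.mu.Lock()
	defer m.mu.Unlock()
	key := target + "/" + operation
	s, ok := m.stats[key]
	if !ok {
		s = &callStats{}
		m.stats[key] = s
	}
	s.Calls++
	s.TotalDuration += elapsed
	if err != nil {
		s.Errors++
	}
	return err
}

// Snapshot returns a copy of the stats for one target/operation pair.
func (m *externalMetrics) Snapshot(target, operation string) callStats {
	m.mu.Lock()
	defer m.mu.Unlock()
	if s, ok := m.stats[target+"/"+operation]; ok {
		return *s
	}
	return callStats{}
}

func main() {
	m := newExternalMetrics()
	_ = m.Observe("postgres", "query", func() error { return nil })
	_ = m.Observe("postgres", "query", func() error { return errDemo })
	s := m.Snapshot("postgres", "query")
	fmt.Printf("calls=%d errors=%d\n", s.Calls, s.Errors) // prints calls=2 errors=1
}
```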

### Database Metrics

| Metric | Type | Labels | Purpose |
| ------------------------- | --------- | ------------------------- | --------------------- |
| `db_latency_seconds` | Histogram | `db_name`, `db_operation` | Database query timing |

**Label Definitions:**
- `db_name`: Database name or identifier
- `db_operation`: Database operation type (e.g., `select`, `insert`, `update`, `delete`)

### Business Event Metrics

| Metric | Type | Labels | Purpose |
| ------------------------- | ------- | ------------------------------- | ---------------------- |
| `business_events_total` | Counter | `business_action`, `business_outcome` | Business KPI tracking |

**Label Definitions:**
- `business_action`: Business action type (e.g., `consent_created`, `policy_evaluated`)
- `business_outcome`: Outcome of the action (e.g., `success`, `failure`, `pending`)

### Workflow Metrics

| Metric | Type | Labels | Purpose |
| ------------------------------- | --------------- | --------------- | --------------------- |
| `workflow_duration_seconds` | Histogram | `workflow_name` | End-to-end workflow timing |
| `workflow_inflight` | UpDownCounter | `workflow_name` | Active workflow count |

**Label Definitions:**
- `workflow_name`: Name of the workflow (e.g., `data_exchange`, `consent_flow`)
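
An UpDownCounter such as `workflow_inflight` must be decremented on every exit path, including panics, or the value drifts upward. A minimal sketch of that pattern follows; `inflightGauge`, `Track`, and `runWorkflow` are hypothetical names, and an atomic integer is assumed as a stand-in for the real instrument.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// inflightGauge is a hypothetical stand-in for workflow_inflight.
type inflightGauge struct {
	n atomic.Int64 // requires Go 1.19 or later
}

// Track increments the gauge and returns the matching decrement,
// intended to be called via defer.
func (g *inflightGauge) Track() func() {
	g.n.Add(1)
	return func() { g.n.Add(-1) }
}

// Value reports how many workflows are currently running.
func (g *inflightGauge) Value() int64 { return g.n.Load() }

// runWorkflow wraps work so the gauge is decremented even if work panics.
func runWorkflow(g *inflightGauge, work func()) {
	done := g.Track()
	defer done()
	work()
}

func main() {
	var g inflightGauge
	var wg sync.WaitGroup
	for i := 0; i < 3; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			runWorkflow(&g, func() {})
		}()
	}
	wg.Wait()
	fmt.Println(g.Value()) // prints 0 once every workflow has finished
}
```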

### Cache Metrics

| Metric | Type | Labels | Purpose |
| --------------------- | ------- | ------------------------- | ----------------- |
| `cache_events_total` | Counter | `cache_name`, `cache_result` | Cache hit/miss tracking |

**Label Definitions:**
- `cache_name`: Cache identifier or name
- `cache_result`: Cache operation result (`hit` or `miss`)
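
As a rough illustration of where `hit` and `miss` events would be emitted, the sketch below wraps a map-backed cache and counts lookups by result. `countingCache` and its methods are hypothetical names, the map of counts is an assumed stand-in for `cache_events_total`, and the type is not safe for concurrent use.

```go
package main

import "fmt"

// countingCache is a hypothetical cache that records one event per lookup.
type countingCache struct {
	name   string            // would become the cache_name label
	data   map[string]string // cached values
	events map[string]int    // keyed by cache_result: "hit" or "miss"
}

func newCountingCache(name string) *countingCache {
	return &countingCache{
		name:   name,
		data:   make(map[string]string),
		events: make(map[string]int),
	}
}

// Get looks up key and records the result as a hit or a miss.
func (c *countingCache) Get(key string) (string, bool) {
	v, ok := c.data[key]
	if ok {
		c.events["hit"]++
	} else {
		c.events["miss"]++
	}
	return v, ok
}

// Set stores a value; writes are not counted as cache events here.
func (c *countingCache) Set(key, value string) { c.data[key] = value }

// HitRatio returns hits / (hits + misses), or 0 before any lookup.
func (c *countingCache) HitRatio() float64 {
	total := c.events["hit"] + c.events["miss"]
	if total == 0 {
		return 0
	}
	return float64(c.events["hit"]) / float64(total)
}

func main() {
	c := newCountingCache("policy")
	c.Get("a") // miss
	c.Set("a", "1")
	c.Get("a") // hit
	c.Get("a") // hit
	fmt.Printf("%.2f\n", c.HitRatio()) // prints 0.67 (2 hits out of 3 lookups)
}
```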

### Policy Decision Metrics (PDP)

| Metric | Type | Labels | Purpose |
| ----------------------------- | --------- | ----------------- | ---------------------- |
| `decision_latency_seconds` | Histogram | `decision_type` | Policy evaluation time |
| `decision_failures_total` | Counter | `failure_reason` | Policy decision errors |

**Label Definitions:**
- `decision_type`: Type of policy decision (e.g., `allow`, `deny`, `conditional`)
- `failure_reason`: Reason for decision failure (e.g., `policy_not_found`, `evaluation_error`)

### Go Runtime Metrics (Automatic)

The monitoring package automatically exposes Go runtime and process metrics:
- `process_cpu_seconds_total` - CPU usage
- `go_memstats_*` - Memory statistics (alloc, sys, heap, etc.)
- `go_goroutines` - Goroutine count
- `go_gc_duration_seconds` - Garbage collection pause times

---

## Useful Prometheus Queries

**Request Rate by Endpoint:**
```promql
sum by (route, method) (rate(http_requests_total[5m]))
```

**95th Percentile Latency by Endpoint:**
```promql
histogram_quantile(0.95, sum by (route, le) (rate(http_request_duration_seconds_bucket[5m])))
```

**5xx Error Rate by Endpoint (errors per second):**
```promql
sum by (route) (rate(http_requests_total{status_code=~"5.."}[5m]))
```

**Top 10 Slowest Endpoints:**
```promql
topk(10, histogram_quantile(0.95, sum by (route, le) (rate(http_request_duration_seconds_bucket[5m]))))
```

**External Call Error Rate:**
```promql
sum by (external_target, external_operation) (rate(external_call_errors_total[5m]))
```

**95th Percentile Database Latency:**
```promql
histogram_quantile(0.95, sum by (db_name, db_operation, le) (rate(db_latency_seconds_bucket[5m])))
```

**Service Availability:**
```promql
up{job=~"orchestration-engine|consent-engine|policy-decision-point|portal-backend|audit-service"}
```

**Current Metric Values (All):**
```promql
{__name__=~"http_.*|external_.*|db_.*|business_.*|workflow_.*|cache_.*|decision_.*"}
```

---

## Grafana Dashboard

Pre-configured dashboard: **Go Services Metrics**

**URL:** http://localhost:3002/d/go-services/go-services-metrics

**Panels:**

- HTTP Traffic (req/s)
- HTTP Latency (P95)
- Service Health (1=up, 0=down)
- External Calls per Second
- External Call Error %
- Business Events

---

## Stop Services

```bash
docker compose down
```

**Stop containers without removing them (containers and volumes persist):**
```bash
docker compose stop
```

**Remove containers and volumes (deletes stored metrics and Grafana data):**
```bash
docker compose down -v
```

---

## Production Deployment

This setup is for **local development only**. For production:

1. **Use Managed Service**: Grafana Cloud (free tier), Datadog, New Relic
2. **Or Self-Host**: Deploy Prometheus HA, Thanos/Mimir for long-term storage
3. **Security Hardening**: Change Grafana admin password, enable OAuth/SSO, use reverse proxy
4. **Storage & Retention**: Adjust `--storage.tsdb.retention.time` based on storage capacity
5. **Alerting**: Configure Alertmanager for production alerts

---

## Architecture

```
Go Services (Ports 3000, 4000, 8081, 8082, 3001)
↓ /metrics endpoint (Prometheus format)
Prometheus (localhost:9090, 30d retention)
↓ PromQL queries
Grafana (localhost:3002, dashboards)
```

**Stack Components:**

- **prometheus**: `prom/prometheus:v2.55.1`
- **grafana**: `grafana/grafana:11.2.0`

**Network:**

All services run on a shared Docker network (`opendif-network`) to enable service discovery. Services are referenced by their Docker Compose service names in Prometheus configuration (e.g., `orchestration-engine:4000`).

**Volumes:**

- `prometheus-data`: Metric storage (30 day retention)
- `grafana-data`: Dashboard configs & user data

---

## Troubleshooting

### Prometheus Can't Scrape Services

**Issue**: Targets show as DOWN in Prometheus (http://localhost:9090/targets)

**Solutions:**

1. Verify services are running and exposing metrics:
```bash
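curl http://localhost:4000/metrics   # Orchestration Engine
curl http://localhost:8081/metrics   # Consent Engine
curl http://localhost:8082/metrics   # Policy Decision Point
curl http://localhost:3000/metrics   # Portal Backend
curl http://localhost:3001/metrics   # Audit Service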
```

2. Check Prometheus logs:
```bash
docker compose logs prometheus
```

3. Verify network connectivity:
- Ensure all services are on the same `opendif-network`
- Check service names match Prometheus configuration
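
The manual checks above can also be scripted. The sketch below probes each service's `/metrics` endpoint from the host and reports whether it returns at least one sample line; `checkMetrics` and `hasSample` are hypothetical helpers, and the ports are the ones listed under Prerequisites. It checks host reachability only, so a target can still be DOWN in Prometheus if the container is not on `opendif-network`.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"net/http"
	"strings"
	"time"
)

// hasSample reports whether body contains at least one line that is neither
// blank nor a "# HELP" / "# TYPE" comment, i.e. an actual metric sample.
func hasSample(body string) bool {
	for _, line := range strings.Split(body, "\n") {
		line = strings.TrimSpace(line)
		if line != "" && !strings.HasPrefix(line, "#") {
			return true
		}
	}
	return false
}

// checkMetrics fetches url and verifies it looks like a metrics endpoint.
func checkMetrics(client *http.Client, url string) error {
	resp, err := client.Get(url)
	if err != nil {
		return fmt.Errorf("request failed: %w", err)
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("unexpected status %d", resp.StatusCode)
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return fmt.Errorf("reading body: %w", err)
	}
	if !hasSample(string(body)) {
		return errors.New("response contains no metric samples")
	}
	return nil
}

func main() {
	client := &http.Client{Timeout: 2 * time.Second}
	// Ports taken from the Prerequisites list in this README.
	for _, port := range []int{4000, 8081, 8082, 3000, 3001} {
		url := fmt.Sprintf("http://localhost:%d/metrics", port)
		if err := checkMetrics(client, url); err != nil {
			fmt.Printf("DOWN %s: %v\n", url, err)
			continue
		}
		fmt.Printf("UP   %s\n", url)
	}
}
```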

### Grafana Can't Connect to Prometheus

**Issue**: "Data source is not working" in Grafana

**Solutions:**

1. Verify Prometheus is running: `curl http://localhost:9090/-/healthy`
2. Check datasource URL in `grafana/provisioning/datasources/datasource.yml` (should be `http://prometheus:9090`)
3. Ensure both containers are on the same Docker network (`opendif-network`)

### Network Issues

**Issue**: Services can't communicate with each other

**Solutions:**

1. Verify network exists: `docker network ls | grep opendif-network`
2. Check service is on network: `docker network inspect opendif-network`
3. Recreate network if needed:
```bash
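docker compose down
docker network rm opendif-network      # fails while containers from other stacks are still attached
docker network create opendif-network  # only needed where the network is declared `external: true`
docker compose up -d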
```

---

## How to Add Metrics to New Go Services

1. **Import the monitoring package:**
```go
import "github.com/gov-dx-sandbox/exchange/shared/monitoring"
```

2. **Expose metrics endpoint in main.go:**
```go
mux.Handle("/metrics", monitoring.Handler())
```

3. **Wrap HTTP handlers with metrics middleware:**
```go
handler := monitoring.HTTPMetricsMiddleware(mux)
```

4. **Add service to Prometheus configuration:**
Edit `prometheus/prometheus.yml`:
```yaml
  - job_name: your-service
    metrics_path: /metrics
    static_configs:
      - targets:
          - your-service:PORT
        labels:
          service: 'your-service'
          port: 'PORT'
```

5. **Ensure service is on `opendif-network`:**
In your service's `docker-compose.yml`:
```yaml
services:
  your-service:
    networks:
      - opendif-network

networks:
  opendif-network:
    name: opendif-network
    external: true
```

6. **Restart Prometheus:**
```bash
docker compose restart prometheus
```

---

## Additional Resources

- [Prometheus Documentation](https://prometheus.io/docs/)
- [Grafana Documentation](https://grafana.com/docs/)