317 Add Observability Stack for this project #319
Merged
30 commits
401dd71
add OTEL-aware telemetry core that now registers HTTP runtime metrics
ginaxu1 4974806
Refactor to exchange/pkg/monitoring
ginaxu1 734cd5b
Add metrics to api-server-go
ginaxu1 f7460c8
Add metrics to PDP and CE too
ginaxu1 b3f98b7
Add README for pkg/monitoring
ginaxu1 0841ae5
Address PR comments
ginaxu1 637d041
Update README
ginaxu1 cf8a92c
Address PR comments
ginaxu1 be19b1b
add OTEL-aware telemetry core that now registers HTTP runtime metrics
ginaxu1 acf256f
Refactor to exchange/pkg/monitoring
ginaxu1 c242942
Add metrics to PDP and CE too
ginaxu1 e4f2e60
Address PR comments
ginaxu1 68b3baf
add monitoring for other services
ginaxu1 a65b062
Fix Build fail
ginaxu1 59d6fd2
Fix OE Unit tests
ginaxu1 8a309d3
Remove unneeded files
ginaxu1 68491e5
Fix merge conflicts w main
ginaxu1 d40d287
Rewrite observability/
ginaxu1 2f4c8d2
revert changes to services, will add in next PR
ginaxu1 be8049c
more clean up, make PR smaller and focused only on adding observabili…
ginaxu1 67e47c1
remove pkg/monitoring
ginaxu1 9a8c1a3
Update observability/README.md
ginaxu1 a6e2957
Update observability/README.md
ginaxu1 7a402ee
Update observability/docker-compose.yml
ginaxu1 b2a6fbc
Update observability/README.md
ginaxu1 c164e21
Address remaining comments
ginaxu1 6340c6f
Address comments
ginaxu1 f60e87b
Update README for observability, reference superapp format
ginaxu1 59c1ed7
Fix merge conflict with main:
ginaxu1 956059b
revert portal-backend/main.go to main version
ginaxu1
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,342 @@ | ||
| # Observability Stack for OpenDIF MVP | ||
|
|
||
| Local development stack: **Go Services** → **Prometheus** → **Grafana** | ||
|
|
||
| Collects real-time metrics from all Go services for debugging performance and errors. | ||
|
|
||
| --- | ||
|
|
||
| ## Quick Start | ||
|
|
||
| ```bash | ||
| cd observability | ||
| docker compose up -d | ||
| ``` | ||
|
|
||
| **Services:** | ||
|
|
||
| - **Prometheus**: http://localhost:9090 (raw metrics & queries) | ||
| - **Grafana**: http://localhost:3002 (dashboards, login: `admin` / `admin`) | ||
|
|
||
| **Prerequisites:** | ||
|
|
||
| Ensure all Go services are running and connected to the `opendif-network`: | ||
| - Orchestration Engine (port 4000) | ||
| - Consent Engine (port 8081) | ||
| - Policy Decision Point (port 8082) | ||
| - Portal Backend (port 3000) | ||
| - Audit Service (port 3001) | ||
|
|
||
| --- | ||
|
|
||
| ## Metrics Overview | ||
|
|
||
| ### HTTP Request Metrics | ||
|
|
||
| | Metric | Type | Labels | Purpose | | ||
| | ------------------------------- | --------- | ----------------------------------------- | -------------------------- | | ||
| | `http_requests_total` | Counter | `method`, `route`, `status_code` | Request volume by endpoint | | ||
| | `http_request_duration_seconds` | Histogram | `method`, `route` | API latency percentiles | | ||
|
|
||
| **Label Definitions:** | ||
| - `method`: HTTP method (GET, POST, PUT, DELETE, etc.) | ||
| - `route`: Normalized route path (e.g., `/consents`, `/policies`) | ||
| - `status_code`: HTTP response status code (200, 404, 500, etc.) | ||
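The internals of the project's monitoring package are not part of this diff, so the following is only a minimal standard-library sketch of how a middleware could derive the `method`, `route`, and `status_code` labels. The names `requestCounter`, `statusRecorder`, `consentHandler`, and `serve` are hypothetical, and the in-memory map stands in for a real Prometheus counter.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync"
)

// labelKey mirrors the three labels of http_requests_total.
type labelKey struct {
	Method     string
	Route      string
	StatusCode int
}

// requestCounter is a hypothetical in-memory stand-in for a counter vector.
type requestCounter struct {
	mu     sync.Mutex
	counts map[labelKey]int
}

func newRequestCounter() *requestCounter {
	return &requestCounter{counts: make(map[labelKey]int)}
}

// statusRecorder captures the status code written by the wrapped handler.
type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (r *statusRecorder) WriteHeader(code int) {
	r.status = code
	r.ResponseWriter.WriteHeader(code)
}

// Wrap counts each request under (method, route, status_code). The route is
// passed in explicitly so that path parameters never become label values.
func (c *requestCounter) Wrap(route string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, req *http.Request) {
		// Default to 200: handlers that never call WriteHeader respond with 200.
		rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
		next.ServeHTTP(rec, req)
		c.mu.Lock()
		c.counts[labelKey{req.Method, route, rec.status}]++
		c.mu.Unlock()
	})
}

// Count returns the recorded value for one label combination.
func (c *requestCounter) Count(method, route string, status int) int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.counts[labelKey{method, route, status}]
}

// consentHandler is a toy handler: 200 for GET, 405 for everything else.
func consentHandler() http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, req *http.Request) {
		if req.Method != http.MethodGet {
			w.WriteHeader(http.StatusMethodNotAllowed)
			return
		}
		fmt.Fprintln(w, "ok")
	})
}

// serve sends one in-process request through h and returns the status code.
func serve(h http.Handler, method, path string) int {
	rr := httptest.NewRecorder()
	h.ServeHTTP(rr, httptest.NewRequest(method, path, nil))
	return rr.Code
}

func main() {
	c := newRequestCounter()
	h := c.Wrap("/consents", consentHandler())
	serve(h, "GET", "/consents/123")
	serve(h, "GET", "/consents/456")
	serve(h, "DELETE", "/consents/123")
	fmt.Println(c.Count("GET", "/consents", 200), c.Count("DELETE", "/consents", 405))
}
```

Both GET requests land on the same `/consents` series even though their paths differ, which is the point of a normalized `route` label: raw paths with IDs would create one time series per ID.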
|
|
||
| ### External Call Metrics | ||
|
|
||
| | Metric | Type | Labels | Purpose | | ||
| | --------------------------------- | --------- | ----------------------------------------- | -------------------------- | | ||
| | `external_calls_total` | Counter | `external_target`, `external_operation`, `external_success` | External call volume | | ||
| | `external_call_duration_seconds` | Histogram | `external_target`, `external_operation` | External call latency | | ||
| | `external_call_errors_total` | Counter | `external_target`, `external_operation` | Failed external calls | | ||
|
|
||
| **Label Definitions:** | ||
| - `external_target`: Target service or system (e.g., `postgres`, `redis`, `external-api`) | ||
| - `external_operation`: Operation type (e.g., `query`, `insert`, `get`, `set`) | ||
| - `external_success`: Success status (`true` or `false`) | ||
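To illustrate how the `external_target`, `external_operation`, and `external_success` labels could be populated, here is a hedged standard-library sketch of a wrapper that times a call and records its outcome. `externalMetrics`, `Observe`, `Calls`, and `errUnavailable` are hypothetical names chosen for this example; the monitoring package's actual API is not shown in this diff.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// callKey mirrors the labels of external_calls_total.
type callKey struct {
	Target    string
	Operation string
	Success   bool
}

// externalMetrics is a hypothetical in-memory recorder used for illustration.
type externalMetrics struct {
	mu        sync.Mutex
	calls     map[callKey]int
	durations map[string][]time.Duration // keyed by "target/operation"
}

func newExternalMetrics() *externalMetrics {
	return &externalMetrics{
		calls:     make(map[callKey]int),
		durations: make(map[string][]time.Duration),
	}
}

// Observe runs fn, times it, and records one call whose external_success
// label is derived from the returned error. The error is passed through.
func (m *externalMetrics) Observe(target, operation string, fn func() error) error {
	start := time.Now()
	err := fn()
	elapsed := time.Since(start)

	m.mu.Lock()
	defer m.mu.Unlock()
	m.calls[callKey{target, operation, err == nil}]++
	key := target + "/" + operation
	m.durations[key] = append(m.durations[key], elapsed)
	return err
}

// Calls returns the recorded count for one label combination.
func (m *externalMetrics) Calls(target, operation string, success bool) int {
	m.mu.Lock()
	defer m.mu.Unlock()
	return m.calls[callKey{target, operation, success}]
}

// errUnavailable simulates a failing dependency.
var errUnavailable = errors.New("dependency unavailable")

func main() {
	m := newExternalMetrics()
	_ = m.Observe("postgres", "query", func() error { return nil })
	_ = m.Observe("redis", "get", func() error { return errUnavailable })
	fmt.Println(m.Calls("postgres", "query", true), m.Calls("redis", "get", false))
}
```

In this sketch the failed-call count is simply the `external_success=false` slice of the call counter; the table above lists `external_call_errors_total` as a separate counter, which carries the same information without the success label.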
|
|
||
| ### Database Metrics | ||
|
|
||
| | Metric | Type | Labels | Purpose | | ||
| | ------------------------- | --------- | ------------------------- | --------------------- | | ||
| | `db_latency_seconds` | Histogram | `db_name`, `db_operation` | Database query timing | | ||
|
|
||
| **Label Definitions:** | ||
| - `db_name`: Database name or identifier | ||
| - `db_operation`: Database operation type (e.g., `select`, `insert`, `update`, `delete`) | ||
|
|
||
| ### Business Event Metrics | ||
|
|
||
| | Metric | Type | Labels | Purpose | | ||
| | ------------------------- | ------- | ------------------------------- | ---------------------- | | ||
| | `business_events_total` | Counter | `business_action`, `business_outcome` | Business KPI tracking | | ||
|
|
||
| **Label Definitions:** | ||
| - `business_action`: Business action type (e.g., `consent_created`, `policy_evaluated`) | ||
| - `business_outcome`: Outcome of the action (e.g., `success`, `failure`, `pending`) | ||
|
|
||
| ### Workflow Metrics | ||
|
|
||
| | Metric | Type | Labels | Purpose | | ||
| | ------------------------------- | --------------- | --------------- | --------------------- | | ||
| | `workflow_duration_seconds` | Histogram | `workflow_name` | End-to-end workflow timing | | ||
| | `workflow_inflight` | UpDownCounter | `workflow_name` | Active workflow count | | ||
|
|
||
| **Label Definitions:** | ||
| - `workflow_name`: Name of the workflow (e.g., `data_exchange`, `consent_flow`) | ||
|
|
||
| ### Cache Metrics | ||
|
|
||
| | Metric | Type | Labels | Purpose | | ||
| | --------------------- | ------- | ------------------------- | ----------------- | | ||
| | `cache_events_total` | Counter | `cache_name`, `cache_result` | Cache hit/miss tracking | | ||
|
|
||
| **Label Definitions:** | ||
| - `cache_name`: Cache identifier or name | ||
| - `cache_result`: Cache operation result (`hit` or `miss`) | ||
|
|
||
| ### Policy Decision Metrics (PDP) | ||
|
|
||
| | Metric | Type | Labels | Purpose | | ||
| | ----------------------------- | --------- | ----------------- | ---------------------- | | ||
| | `decision_latency_seconds` | Histogram | `decision_type` | Policy evaluation time | | ||
| | `decision_failures_total` | Counter | `failure_reason` | Policy decision errors | | ||
|
|
||
| **Label Definitions:** | ||
| - `decision_type`: Type of policy decision (e.g., `allow`, `deny`, `conditional`) | ||
| - `failure_reason`: Reason for decision failure (e.g., `policy_not_found`, `evaluation_error`) | ||
|
|
||
| ### Go Runtime Metrics (Automatic) | ||
|
|
||
| The monitoring package automatically instruments Go runtime metrics: | ||
| - `process_cpu_seconds_total` - CPU usage | ||
| - `go_memstats_*` - Memory statistics (alloc, sys, heap, etc.) | ||
| - `go_goroutines` - Goroutine count | ||
| - `go_gc_duration_seconds` - Garbage collection pause times | ||
|
|
||
| --- | ||
|
|
||
| ## Useful Prometheus Queries | ||
|
|
||
| **Request Rate by Endpoint:** | ||
| ```promql | ||
| sum by (route, method) (rate(http_requests_total[5m])) | ||
| ``` | ||
|
|
||
| **95th Percentile Latency by Endpoint:** | ||
| ```promql | ||
| histogram_quantile(0.95, sum by (route, le) (rate(http_request_duration_seconds_bucket[5m]))) | ||
| ``` | ||
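The P95 query above does not read exact latencies; `histogram_quantile` estimates the quantile from cumulative bucket counts by linear interpolation inside the bucket that contains the target rank. The sketch below reproduces that estimate for classic (`le`-labelled) histograms under simplifying assumptions: the bucket bounds and counts in `exampleBuckets` are made up, all bounds are positive, and out-of-range `q` simply yields NaN here.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// bucket is one `le` series of a classic Prometheus histogram.
type bucket struct {
	UpperBound float64 // value of the `le` label
	Count      float64 // cumulative count (or rate) of observations <= UpperBound
}

// quantile estimates the q-quantile (0 < q <= 1) from cumulative buckets.
func quantile(q float64, buckets []bucket) float64 {
	if q <= 0 || q > 1 || len(buckets) < 2 {
		return math.NaN()
	}
	sorted := append([]bucket(nil), buckets...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].UpperBound < sorted[j].UpperBound })

	top := sorted[len(sorted)-1]
	if !math.IsInf(top.UpperBound, 1) || top.Count == 0 {
		return math.NaN() // a +Inf bucket with observations is required
	}

	// rank is the number of observations at or below the quantile.
	rank := q * top.Count
	// Find the first finite bucket whose cumulative count reaches the rank.
	i := sort.Search(len(sorted)-1, func(i int) bool { return sorted[i].Count >= rank })
	if i == len(sorted)-1 {
		// The rank falls into the +Inf bucket: report the highest finite bound.
		return sorted[len(sorted)-2].UpperBound
	}

	lower, below := 0.0, 0.0 // the first bucket is assumed to start at 0
	if i > 0 {
		lower, below = sorted[i-1].UpperBound, sorted[i-1].Count
	}
	// Interpolate linearly between the bucket's lower and upper bound.
	return lower + (sorted[i].UpperBound-lower)*(rank-below)/(sorted[i].Count-below)
}

// exampleBuckets holds made-up cumulative counts for 100 requests.
func exampleBuckets() []bucket {
	return []bucket{{0.1, 50}, {0.25, 80}, {0.5, 95}, {1, 99}, {math.Inf(1), 100}}
}

func main() {
	for _, q := range []float64{0.5, 0.9, 0.95} {
		fmt.Printf("p%.0f = %.3fs\n", q*100, quantile(q, exampleBuckets()))
	}
}
```

With these buckets the P90 estimate is about 0.417 s: rank 90 falls in the 0.25 to 0.5 bucket, two thirds of the way from 80 to 95. The estimate can never be finer than the bucket layout, so choosing bucket boundaries near the latencies you care about matters more than the quantile itself.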
|
|
||
| **5xx Error Rate by Endpoint (errors per second):** | ||
| ```promql | ||
| sum by (route) (rate(http_requests_total{status_code=~"5.."}[5m])) | ||
| ``` | ||
|
|
||
| **Top 10 Slowest Endpoints:** | ||
| ```promql | ||
| topk(10, histogram_quantile(0.95, sum by (route, le) (rate(http_request_duration_seconds_bucket[5m])))) | ||
| ``` | ||
|
|
||
| **External Call Error Rate:** | ||
| ```promql | ||
| sum by (external_target, external_operation) (rate(external_call_errors_total[5m])) | ||
| ``` | ||
|
|
||
| **95th Percentile Database Latency:** | ||
| ```promql | ||
| histogram_quantile(0.95, sum by (db_name, db_operation, le) (rate(db_latency_seconds_bucket[5m]))) | ||
| ``` | ||
|
|
||
| **Service Availability:** | ||
| ```promql | ||
| up{job=~"orchestration-engine|consent-engine|policy-decision-point|portal-backend|audit-service"} | ||
| ``` | ||
|
|
||
| **Current Metric Values (All):** | ||
| ```promql | ||
| {__name__=~"http_.*|external_.*|db_.*|business_.*|workflow_.*|cache_.*|decision_.*"} | ||
| ``` | ||
|
|
||
| --- | ||
|
|
||
| ## Grafana Dashboard | ||
|
|
||
| Pre-configured dashboard: **Go Services Metrics** | ||
|
|
||
| **URL:** http://localhost:3002/d/go-services/go-services-metrics | ||
|
|
||
| **Panels:** | ||
|
|
||
| - HTTP Traffic (req/s) | ||
| - HTTP Latency (P95) | ||
| - Service Health (1=up, 0=down) | ||
| - External Calls per Second | ||
| - External Call Error % | ||
| - Business Events | ||
|
|
||
| --- | ||
|
|
||
| ## Stop Services | ||
|
|
||
| ```bash | ||
| docker compose down | ||
| ``` | ||
|
|
||
| **Stop containers without removing them (containers and volumes persist):** | ||
| ```bash | ||
| docker compose stop | ||
| ``` | ||
|
|
||
| **Remove containers and named volumes (deletes stored Prometheus and Grafana data):** | ||
| ```bash | ||
| docker compose down -v | ||
| ``` | ||
|
|
||
| --- | ||
|
|
||
| ## Production Deployment | ||
|
|
||
| This setup is for **local development only**. For production: | ||
|
|
||
| 1. **Use a managed service**: for example Grafana Cloud (free tier), Datadog, or New Relic | ||
| 2. **Or self-host**: deploy Prometheus in HA, with Thanos or Mimir for long-term storage | ||
| 3. **Security hardening**: change the default Grafana admin password, enable OAuth/SSO, and put the stack behind a reverse proxy | ||
| 4. **Storage & Retention**: Adjust `--storage.tsdb.retention.time` based on storage capacity | ||
| 5. **Alerting**: Configure Alertmanager for production alerts | ||
|
|
||
| --- | ||
|
|
||
| ## Architecture | ||
|
|
||
| ``` | ||
| Go Services (Ports 3000, 4000, 8081, 8082, 3001) | ||
| ↓ /metrics endpoint (Prometheus format) | ||
| Prometheus (localhost:9090, 30d retention) | ||
| ↓ PromQL queries | ||
| Grafana (localhost:3002, dashboards) | ||
| ``` | ||
|
|
||
| **Stack Components:** | ||
|
|
||
| - **prometheus**: `prom/prometheus:v2.55.1` | ||
| - **grafana**: `grafana/grafana:11.2.0` | ||
|
|
||
| **Network:** | ||
|
|
||
| All services run on a shared Docker network (`opendif-network`) to enable service discovery. Services are referenced by their Docker Compose service names in Prometheus configuration (e.g., `orchestration-engine:4000`). | ||
|
|
||
| **Volumes:** | ||
|
|
||
| - `prometheus-data`: Metric storage (30 day retention) | ||
| - `grafana-data`: Dashboard configs & user data | ||
|
|
||
| --- | ||
|
|
||
| ## Troubleshooting | ||
|
|
||
| ### Prometheus Can't Scrape Services | ||
|
|
||
| **Issue**: Targets show as DOWN in Prometheus (http://localhost:9090/targets) | ||
|
|
||
| **Solutions:** | ||
|
|
||
| 1. Verify services are running and exposing metrics: | ||
| ```bash | ||
| curl http://localhost:4000/metrics | ||
| curl http://localhost:8081/metrics | ||
| curl http://localhost:8082/metrics | ||
| ``` | ||
|
|
||
| 2. Check Prometheus logs: | ||
| ```bash | ||
| docker compose logs prometheus | ||
| ``` | ||
|
|
||
| 3. Verify network connectivity: | ||
| - Ensure all services are on the same `opendif-network` | ||
| - Check service names match Prometheus configuration | ||
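The manual `curl` checks above can also be scripted. The following is a rough standard-library sketch that fetches a `/metrics` endpoint and reports which expected metric names are absent. The helper names `metricNames` and `checkTarget` are hypothetical, the parser handles only simple sample lines of the Prometheus text format, and the URLs are the host ports listed in this README.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
	"time"
)

// metricNames extracts sample names from Prometheus text exposition format.
// Comment lines (# HELP, # TYPE) and blank lines are skipped.
func metricNames(text string) map[string]bool {
	names := make(map[string]bool)
	for _, line := range strings.Split(text, "\n") {
		line = strings.TrimSpace(line)
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		// A sample line looks like `name{labels} value` or `name value`.
		end := strings.IndexAny(line, "{ ")
		if end <= 0 {
			continue
		}
		names[line[:end]] = true
	}
	return names
}

// checkTarget fetches url and returns the wanted metric names that are missing.
func checkTarget(url string, want ...string) ([]string, error) {
	client := &http.Client{Timeout: 2 * time.Second}
	resp, err := client.Get(url)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("%s returned %s", url, resp.Status)
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, err
	}
	names := metricNames(string(body))
	var missing []string
	for _, w := range want {
		if !names[w] {
			missing = append(missing, w)
		}
	}
	return missing, nil
}

func main() {
	// Host ports of the Orchestration Engine, Consent Engine, and PDP.
	for _, port := range []string{"4000", "8081", "8082"} {
		url := "http://localhost:" + port + "/metrics"
		missing, err := checkTarget(url, "http_requests_total", "go_goroutines")
		switch {
		case err != nil:
			fmt.Printf("%s: unreachable (%v)\n", url, err)
		case len(missing) > 0:
			fmt.Printf("%s: up, but missing %v\n", url, missing)
		default:
			fmt.Printf("%s: ok\n", url)
		}
	}
}
```

This runs from the host, so it only confirms that a service exposes metrics; Prometheus scrapes over the Docker network using service names, which is why the network checks in step 3 still matter when the host check passes but the target stays DOWN.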
|
|
||
| ### Grafana Can't Connect to Prometheus | ||
|
|
||
| **Issue**: "Data source is not working" in Grafana | ||
|
|
||
| **Solutions:** | ||
|
|
||
| 1. Verify Prometheus is running: `curl http://localhost:9090/-/healthy` | ||
| 2. Check datasource URL in `grafana/provisioning/datasources/datasource.yml` (should be `http://prometheus:9090`) | ||
| 3. Ensure both containers are on the same Docker network (`opendif-network`) | ||
|
|
||
| ### Network Issues | ||
|
|
||
| **Issue**: Services can't communicate with each other | ||
|
|
||
| **Solutions:** | ||
|
|
||
| 1. Verify network exists: `docker network ls | grep opendif-network` | ||
| 2. Check service is on network: `docker network inspect opendif-network` | ||
| 3. Recreate network if needed: | ||
| ```bash | ||
| docker compose down | ||
| docker network rm opendif-network | ||
| docker compose up -d | ||
| ``` | ||
|
|
||
| --- | ||
|
|
||
| ## How to Add Metrics to New Go Services | ||
|
|
||
| 1. **Import the monitoring package:** | ||
| ```go | ||
| import "github.com/gov-dx-sandbox/exchange/shared/monitoring" | ||
| ``` | ||
|
|
||
| 2. **Expose metrics endpoint in main.go:** | ||
| ```go | ||
| mux.Handle("/metrics", monitoring.Handler()) | ||
| ``` | ||
|
|
||
| 3. **Wrap HTTP handlers with metrics middleware:** | ||
| ```go | ||
| handler := monitoring.HTTPMetricsMiddleware(mux) | ||
| ``` | ||
|
|
||
| 4. **Add service to Prometheus configuration:** | ||
| Edit `prometheus/prometheus.yml`: | ||
| ```yaml | ||
|   - job_name: your-service | ||
|     metrics_path: /metrics | ||
|     static_configs: | ||
|       - targets: | ||
|           - your-service:PORT | ||
|         labels: | ||
|           service: 'your-service' | ||
|           port: 'PORT' | ||
| ``` | ||
|
|
||
| 5. **Ensure service is on `opendif-network`:** | ||
| In your service's `docker-compose.yml`: | ||
| ```yaml | ||
| services: | ||
|   your-service: | ||
|     networks: | ||
|       - opendif-network | ||
|
|
||
| networks: | ||
|   opendif-network: | ||
|     name: opendif-network | ||
|     external: true | ||
| ``` | ||
|
|
||
| 6. **Restart Prometheus:** | ||
| ```bash | ||
| docker compose restart prometheus | ||
| ``` | ||
|
|
||
| --- | ||
|
|
||
| ## Additional Resources | ||
|
|
||
| - [Prometheus Documentation](https://prometheus.io/docs/) | ||
| - [Grafana Documentation](https://grafana.com/docs/) |