27 changes: 26 additions & 1 deletion docs/docs/monitoring.md
Original file line number Diff line number Diff line change
Expand Up @@ -20,10 +20,17 @@ for all OpenTelemetry-related configurables.

## Prometheus

OPA exposes an HTTP endpoint that can be used to collect performance metrics
OPA exposes an HTTP endpoint at `/metrics` that can be used to collect performance metrics
for all API calls. The Prometheus endpoint is enabled by default when you run
OPA as a server.

OPA provides two ways to access performance metrics:

1. **System-wide metrics** via the `/metrics` Prometheus endpoint - Instance-level metrics across all OPA operations
2. **Per-query metrics** via API responses with `?metrics=true` - Metrics for individual query executions

These serve different purposes: system metrics for OPA instance monitoring and alerting, per-query metrics for debugging and optimization.

You can enable metric collection from OPA with the following `prometheus.yml` config:

```yaml
Expand Down Expand Up @@ -86,6 +93,24 @@ When Prometheus is enabled in the status plugin (see [Configuration](./configura
| last_success_bundle_request | gauge | Last successful bundle request in UNIX nanoseconds. | STABLE |
| bundle_loading_duration_ns | histogram | A histogram of duration for bundle loading. | STABLE |

## Available Metrics

The Prometheus `/metrics` endpoint exposes instance-level metrics. Its key properties are:

- **URL**: `http://localhost:8181/metrics` (default configuration)
- **Method**: HTTP GET
- **Format**: Prometheus text format
- **Contents**: Instance-level counters, timers, histograms, Go runtime metrics
- **Use case**: Monitoring dashboards, alerting, performance trends
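As a rough illustration of consuming the endpoint outside of Prometheus, the sketch below fetches the URL listed above and parses the text format into a dictionary. It is a simplified reader, not a full parser: it assumes each sample line ends with its value and carries no trailing timestamp. The `sample` text is made up for the example and is not real OPA output.

```python
import urllib.request


def parse_prometheus_text(text):
    """Parse Prometheus text-format lines into {series: value}.

    Blank lines and comment lines (# HELP, # TYPE) are skipped. Labels stay
    part of the series key. Assumes no trailing timestamp on sample lines.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the last space so label values containing spaces survive.
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


def scrape(url="http://localhost:8181/metrics"):
    """Fetch the endpoint with HTTP GET and parse the response body."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return parse_prometheus_text(resp.read().decode("utf-8"))


# Illustrative input only; real output depends on the running OPA instance.
sample = """# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 12
http_request_duration_seconds_count{code="200",handler="v1/data",method="post"} 3
"""
parsed = parse_prometheus_text(sample)
print(parsed["go_goroutines"])
```

Calling `scrape()` requires an OPA server listening on the default address. For production monitoring, a Prometheus server with the `prometheus.yml` scrape configuration remains the intended consumer.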

### Additional Resources

- **Per-query metrics**: See [REST API Performance Metrics](./rest-api#performance-metrics) for debugging individual queries
- **Policy performance**: See [Policy Performance](./policy-performance#performance-metrics) for optimization guidance
- **Status API**: See [Status API](./management-status) for metrics reporting via status updates
- **Decision logs**: See [Decision Logs](./management-decision-logs) for including metrics in decision logs
- **CLI tools**: See [opa eval](./cli#eval) and [opa bench](./cli#bench) for command-line metric collection

## Health Checks

OPA exposes a `/health` API endpoint that can be used to perform health checks.
61 changes: 61 additions & 0 deletions docs/docs/policy-performance.md
Expand Up @@ -969,6 +969,66 @@ This feature can be enabled for `opa run`, `opa eval`, and `opa bench` by settin

Users are recommended to do performance testing to determine the optimal configuration for their use case.

## Performance Metrics

OPA exposes metrics for policy evaluation performance. These are available through:

- **System-wide metrics** at the `/metrics` Prometheus endpoint
- **Per-query metrics** with individual API responses when `?metrics=true` is specified

See [Monitoring](./monitoring#prometheus) for more details.

### Common Built-in Function Metrics

#### HTTP Built-ins

`http.send` metrics help identify I/O bottlenecks:

- `timer_rego_builtin_http_send_ns` - Total time spent in http.send calls
- `counter_rego_builtin_http_send_interquery_cache_hits` - Inter-query cache hits
- `counter_rego_builtin_http_send_network_requests` - Actual network requests made

High cache hit ratios indicate effective caching and reduced network overhead.
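One way to turn these counters into a single number is sketched below. It assumes that each `http.send` call is counted either as an inter-query cache hit or as a network request, so the ratio is hits divided by the sum of the two. That accounting is an assumption made for this example, and the `example` dictionary is invented data.

```python
HITS = "counter_rego_builtin_http_send_interquery_cache_hits"
REQUESTS = "counter_rego_builtin_http_send_network_requests"


def http_send_cache_hit_ratio(metrics):
    """Return hits / (hits + network requests), or None if there were no calls.

    Assumes a call is counted as either a cache hit or a network request.
    """
    hits = metrics.get(HITS, 0)
    requests = metrics.get(REQUESTS, 0)
    total = hits + requests
    if total == 0:
        return None
    return hits / total


# Invented per-query metrics, shaped like the "metrics" object of a response.
example = {HITS: 3, REQUESTS: 1, "timer_rego_builtin_http_send_ns": 1200000}
ratio = http_send_cache_hit_ratio(example)
print(ratio)
```

A ratio near 1 means most calls were served from the cache. A ratio near 0 means most calls went to the network.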

#### Regex Built-ins

Regex operation metrics help optimize pattern matching:

- `timer_rego_builtin_regex_interquery_ns` - Time spent in regex operations
- `counter_rego_builtin_regex_interquery_cache_hits` - Regex pattern cache hits
- `counter_rego_builtin_regex_interquery_value_cache_hits` - Regex value cache hits

Effective regex caching improves performance when the same patterns are used repeatedly.

### Core Query Metrics

Basic query evaluation phases:

- `timer_rego_query_parse_ns` - Time parsing the query string
- `timer_rego_query_compile_ns` - Time compiling the query
- `timer_rego_query_eval_ns` - Time executing the compiled query

For complex queries, compilation can take longer than evaluation, so check all three timers before optimizing the policy itself.
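A minimal sketch of comparing the three phases follows: it computes each timer's share of the combined parse, compile, and eval time. The timer values in `example` are invented for illustration.

```python
PHASE_TIMERS = (
    "timer_rego_query_parse_ns",
    "timer_rego_query_compile_ns",
    "timer_rego_query_eval_ns",
)


def phase_shares(metrics):
    """Return each phase timer's fraction of the combined phase time."""
    present = {name: metrics[name] for name in PHASE_TIMERS if name in metrics}
    total = sum(present.values())
    if total == 0:
        return {}
    return {name: ns / total for name, ns in present.items()}


# Invented timings in nanoseconds.
example = {
    "timer_rego_query_parse_ns": 50_000,
    "timer_rego_query_compile_ns": 150_000,
    "timer_rego_query_eval_ns": 300_000,
}
shares = phase_shares(example)
slowest = max(shares, key=shares.get)
print(slowest, shares[slowest])
```

Working with shares rather than absolute values makes it easier to compare queries of very different sizes.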

### High-Level Metrics

Server-level metrics for overall performance:

- `timer_server_handler_ns` - Total request handler execution time
- `counter_server_query_cache_hit` - Server-level query cache hits

### Using Metrics for Optimization

1. **Query phases**: Compare parse, compile, and eval times to see which phase dominates.
2. **Cache effectiveness**: Low cache hit rates suggest that cache settings are worth reviewing.
3. **I/O bottlenecks**: A high `http.send` network request count relative to inter-query cache hits suggests that responses are not being cached effectively.
4. **Pattern matching**: Monitor regex cache hits for frequently used patterns.

Access metrics via:
- REST API: Add `?metrics=true` to policy evaluation requests
- CLI: Use `--metrics` flag with `opa eval` or `opa bench`
- Prometheus: See [Monitoring](./monitoring#prometheus) for system-wide metrics
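The REST API option can be sketched as follows, assuming an OPA server at `http://localhost:8181` and a hypothetical policy path `example/allow`. The code posts an input document to the Data API with `?metrics=true` and returns the `result` and `metrics` fields of the JSON response. The helper names are illustrative and not part of OPA.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url, policy_path, metrics=True):
    """Build a Data API URL, optionally requesting per-query metrics."""
    url = f"{base_url.rstrip('/')}/v1/data/{policy_path.strip('/')}"
    if metrics:
        url += "?" + urllib.parse.urlencode({"metrics": "true"})
    return url


def query_with_metrics(base_url, policy_path, input_doc):
    """POST an input document and return (result, metrics) from the response."""
    body = json.dumps({"input": input_doc}).encode("utf-8")
    req = urllib.request.Request(
        build_query_url(base_url, policy_path),
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        payload = json.load(resp)
    return payload.get("result"), payload.get("metrics", {})


# "example/allow" is a hypothetical policy path used only for illustration.
url = build_query_url("http://localhost:8181/", "example/allow")
print(url)
```

Calling `query_with_metrics` requires a running server with a policy loaded at the chosen path. The returned `metrics` dictionary is keyed by names such as `timer_rego_query_eval_ns`.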

## Key Takeaways

For high-performance use cases:
Expand All @@ -979,3 +1039,4 @@ For high-performance use cases:
- Write your policies with indexed statements so that [rule-indexing](https://blog.openpolicyagent.org/optimizing-opa-rule-indexing-59f03f17caf3) is effective.
- Use the profiler to help identify portions of the policy that would benefit the most from improved performance.
- Use the benchmark tools to help get real world timing data and detect policy performance changes.
- Monitor performance metrics to track optimization impact and identify bottlenecks.
10 changes: 10 additions & 0 deletions docs/docs/policy-reference/builtins/glob.mdx
Original file line number Diff line number Diff line change
Expand Up @@ -27,3 +27,13 @@ The following table shows examples of how `glob.match` works:
| `output := glob.match("{cat,bat,[fr]at}", [], "bat")` | `true` | A glob with pattern-alternatives matchers. |
| `output := glob.match("{cat,bat,[fr]at}", [], "rat")` | `true` | A glob with pattern-alternatives matchers. |
| `output := glob.match("{cat,bat,[fr]at}", [], "at")` | `false` | A glob with pattern-alternatives matchers. |

## Performance Metrics

When per-query metrics are requested (by adding `?metrics=true` to an API request), `glob.match` reports the following metric:

| Metric | Description |
| ------ | ----------- |
| `counter_rego_builtin_glob_interquery_value_cache_hits` | Number of inter-query cache hits for compiled glob patterns |

Effective glob pattern caching improves performance when the same patterns are used repeatedly across queries. High cache hit ratios indicate that glob compilation overhead is being minimized through caching.
12 changes: 12 additions & 0 deletions docs/docs/policy-reference/builtins/http.mdx
Original file line number Diff line number Diff line change
Expand Up @@ -113,3 +113,15 @@ The table below shows examples of calling `http.send`:
| Files containing TLS material | `http.send({"method": "get", "url": "https://127.0.0.1:65331", "tls_ca_cert_file": "testdata/ca.pem", "tls_client_cert_file": "testdata/client-cert.pem", "tls_client_key_file": "testdata/client-key.pem"})` |
| Environment variables containing TLS material | `http.send({"method": "get", "url": "https://127.0.0.1:65360", "tls_ca_cert_env_variable": "CLIENT_CA_ENV", "tls_client_cert_env_variable": "CLIENT_CERT_ENV", "tls_client_key_env_variable": "CLIENT_KEY_ENV"})` |
| Unix Socket URL Format | `http.send({"method": "get", "url": "unix://localhost/?socket=%2Fpath%2Ffile.socket"})` |

## Performance Metrics

When per-query metrics are requested (by adding `?metrics=true` to an API request), `http.send` reports the following metrics:

| Metric | Description |
| ------ | ----------- |
| `timer_rego_builtin_http_send_ns` | Total time spent in `http.send` calls during query evaluation |
| `counter_rego_builtin_http_send_interquery_cache_hits` | Number of inter-query cache hits for `http.send` requests |
| `counter_rego_builtin_http_send_network_requests` | Number of actual network requests made by `http.send` |

High cache hit ratios indicate effective caching and reduced network overhead. These metrics help identify I/O bottlenecks in policies that make external HTTP requests.
10 changes: 10 additions & 0 deletions docs/docs/policy-reference/builtins/regex.mdx
Original file line number Diff line number Diff line change
Expand Up @@ -110,3 +110,13 @@ overlap. This can be useful when using patterns to define permissions or access
rules. The function returns `true` if the two patterns overlap and `false` otherwise.

<PlaygroundExample dir={require.context('../_examples/regex/globs_match/role_patterns')} />

## Performance Metrics

When per-query metrics are requested (by adding `?metrics=true` to an API request), regex operations report the following metric:

| Metric | Description |
| ------ | ----------- |
| `counter_rego_builtin_regex_interquery_value_cache_hits` | Number of inter-query value cache hits for compiled regex patterns |

Effective regex caching improves performance when the same patterns are used repeatedly. High cache hit ratios indicate that regex compilation overhead is being minimized through caching.
9 changes: 6 additions & 3 deletions docs/docs/rest-api.md
Original file line number Diff line number Diff line change
Expand Up @@ -2285,9 +2285,12 @@ Query instrumentation can help diagnose performance problems, however, it can
add significant overhead to query evaluation. We recommend leaving query
instrumentation off unless you are debugging a performance problem.

When instrumentation is enabled there are several additional performance metrics
for the compilation stages. They follow the format of `timer_compile_stage_*_ns`
and `timer_query_compile_stage_*_ns` for the query and module compilation stages.
When query instrumentation is enabled (`instrument=true`), the following additional evaluation and compilation metrics are included:

- `timer_eval_op_*`: Evaluation operation timers (e.g., `timer_eval_op_plug_ns`, `timer_eval_op_resolve_ns`)
- `histogram_eval_op_*`: Histograms of evaluation operation time distributions
- `timer_rego_builtin_*`: Built-in function execution times
- `counter_rego_builtin_*`: Built-in function call counts and cache hits
- `timer_compile_stage_*_ns` and `timer_query_compile_stage_*_ns`: Compilation stage timers for the module and query compilation stages
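When an instrumented response returns many metrics, grouping them by name prefix can make the output easier to scan. The sketch below buckets a metrics dictionary by the prefixes named in this section. The `example` values are invented, and the `"other"` bucket is a convention of this example only.

```python
FAMILIES = (
    "timer_eval_op_",
    "histogram_eval_op_",
    "timer_rego_builtin_",
    "counter_rego_builtin_",
    "timer_compile_stage_",
    "timer_query_compile_stage_",
)


def group_by_family(metrics):
    """Bucket metric names by the first matching prefix, else "other"."""
    groups = {}
    for name, value in metrics.items():
        family = next((p for p in FAMILIES if name.startswith(p)), "other")
        groups.setdefault(family, {})[name] = value
    return groups


# Invented values; the names follow the patterns described in this section.
example = {
    "timer_eval_op_plug_ns": 4000,
    "timer_eval_op_resolve_ns": 9000,
    "timer_rego_query_eval_ns": 52000,
}
groups = group_by_family(example)
print(sorted(groups))
```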

## Provenance
