41 commits
599d932 Initial commit (ghanse, Aug 28, 2025)
b767a07 Update docs/dqx/docs/guide/quality_checks_apply.mdx (mwojtyczka, Aug 29, 2025)
1dd3806 Update docs/dqx/docs/guide/quality_checks_apply.mdx (mwojtyczka, Aug 29, 2025)
71c0178 Update docs/dqx/docs/guide/summary_metrics.mdx (mwojtyczka, Aug 29, 2025)
843316f Update docs/dqx/docs/guide/summary_metrics.mdx (mwojtyczka, Aug 29, 2025)
a252923 Update docs/dqx/docs/guide/summary_metrics.mdx (mwojtyczka, Aug 29, 2025)
fb44082 Update docs/dqx/docs/guide/summary_metrics.mdx (mwojtyczka, Aug 29, 2025)
ae18de4 Update docs/dqx/docs/guide/summary_metrics.mdx (mwojtyczka, Aug 29, 2025)
4bd4cc4 Update docs/dqx/docs/guide/summary_metrics.mdx (mwojtyczka, Aug 29, 2025)
152f28f Update src/databricks/labs/dqx/engine.py (mwojtyczka, Aug 29, 2025)
cede7f0 Update engine methods, docs, and tests (ghanse, Sep 1, 2025)
5f6ea9d Merge branch 'main' into summary_metrics (mwojtyczka, Sep 1, 2025)
fd12138 Add streaming support (ghanse, Sep 5, 2025)
49303d6 Merge branch 'main' into summary_metrics (mwojtyczka, Sep 15, 2025)
15ae1ec Merge branch 'main' into summary_metrics (mwojtyczka, Sep 16, 2025)
31c3b88 Merge branch 'main' into summary_metrics (mwojtyczka, Sep 18, 2025)
b32a26c Merge branch 'main' into summary_metrics (mwojtyczka, Sep 19, 2025)
9916b0f Refactor (ghanse, Sep 21, 2025)
d367ef3 Update docs and tests (ghanse, Sep 21, 2025)
461b90f Update tests (ghanse, Sep 21, 2025)
f349b08 Refactor (ghanse, Sep 24, 2025)
f5a8dc7 Merge branch 'main' into summary_metrics (mwojtyczka, Sep 30, 2025)
32630fa Merge branch 'refs/heads/main' into summary_metrics (ghanse, Oct 1, 2025)
998490a Update engine methods, tests, and docs (ghanse, Oct 1, 2025)
7e25a97 Merge remote-tracking branch 'origin/summary_metrics' into summary_me… (ghanse, Oct 1, 2025)
9da72dc Update unit tests (ghanse, Oct 1, 2025)
9cfee10 Fix unit test type (ghanse, Oct 1, 2025)
b9969a9 Merge branch 'main' into summary_metrics (mwojtyczka, Oct 2, 2025)
52e3a22 Refactor (ghanse, Oct 2, 2025)
1950db7 Merge remote-tracking branch 'origin/summary_metrics' into summary_me… (ghanse, Oct 2, 2025)
9c319e6 Add pytest-benchmark performance baseline (ghanse, Oct 2, 2025)
835e1a7 Update implementation, tests, and docs (ghanse, Oct 3, 2025)
f74d877 Fix observation return (ghanse, Oct 3, 2025)
6fcf084 Fix user metadata in summary metrics table (ghanse, Oct 3, 2025)
88525ca Fix user metadata in summary metrics table (ghanse, Oct 3, 2025)
aadcf42 Merge branch 'main' into summary_metrics (mwojtyczka, Oct 3, 2025)
57a011a Merge remote-tracking branch 'origin/summary_metrics' into summary_me… (ghanse, Oct 6, 2025)
783027d Merge branch 'refs/heads/main' into summary_metrics (ghanse, Oct 6, 2025)
a5c2944 Update comments and tests (ghanse, Oct 6, 2025)
d402446 Update tests (ghanse, Oct 6, 2025)
a718803 Merge branch 'main' into summary_metrics (mwojtyczka, Oct 7, 2025)
13 changes: 12 additions & 1 deletion docs/dqx/docs/guide/index.mdx
@@ -45,4 +45,15 @@ Quality rules can be defined in the following ways:

Additionally, quality rule candidates can be auto-generated using the DQX profiler.

For more details, see the [Quality Checks Definition Guide](/docs/guide/quality_checks_definition).

## Summary metrics and monitoring

DQX can capture and store summary metrics about data quality across multiple tables and runs. Metrics are computed lazily and become available once a checked dataset is counted, displayed, or written to a table or files. Users can:

- Capture quality metrics for each checked dataset
- Track both default (e.g. input/error/warning/valid counts) and custom quality metrics
- Store quality metrics in Delta tables for historical analysis and alerting
- Centralize quality metrics across datasets, jobs, or job runs in a unified data quality history table

For more details, see the [Summary Metrics Guide](/docs/guide/summary_metrics).
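
Because metrics are collected through a Spark observation, they are only populated after an action runs on the checked DataFrame. A minimal sketch of the pattern, assuming `input_df` and `checks` are already defined; the full API is shown in the apply guide changes below:

```python
from databricks.labs.dqx.engine import DQEngine
from databricks.labs.dqx.metrics_observer import DQMetricsObserver

dq_engine = DQEngine(observer=DQMetricsObserver())

# Apply checks; `input_df` and `checks` are assumed to be defined elsewhere
checked_df, observation = dq_engine.apply_checks(input_df, checks)

checked_df.count()      # metrics are computed lazily, so run an action first
print(observation.get)  # default metrics such as input/error/warning/valid counts
```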
65 changes: 65 additions & 0 deletions docs/dqx/docs/guide/quality_checks_apply.mdx
@@ -717,6 +717,7 @@ The following fields from the [configuration file](/docs/installation/#configuration-file)
- `output_config`: configuration for the output data. 'location' is autogenerated when the workflow is executed for patterns.
- `quarantine_config`: (optional) configuration for the quarantine data. 'location' is autogenerated when the workflow is executed for patterns.
- `checks_location`: location of the quality checks in storage. Autogenerated when the workflow is executed for patterns.
- `metrics_config`: (optional) configuration for storing summary metrics.
- `serverless_clusters`: whether to use serverless clusters for running the workflow (default: `true`). Using serverless clusters is recommended as it allows for automated cluster management and scaling.
- `e2e_spark_conf`: (optional) spark configuration to use for the e2e workflow, only applicable if `serverless_clusters` is set to `false`.
- `e2e_override_clusters`: (optional) cluster configuration to use for the e2e workflow, only applicable if `serverless_clusters` is set to `false`.
@@ -733,6 +734,7 @@ The following fields from the [configuration file](/docs/installation/#configuration-file)
- `limit`: maximum number of records to analyze.
- `filter`: filter for the input data as a string SQL expression (default: None).
- `extra_params`: (optional) extra parameters to pass to the jobs, such as result column names and `user_metadata`.
- `custom_metrics`: (optional) list of Spark SQL expressions for capturing custom summary metrics. By default, the number of input, warning, and error rows will be tracked. When custom metrics are defined, they will be tracked in addition to the default metrics.
- `custom_check_functions`: (optional) custom check functions defined in Python files that can be used in the quality checks.
- `reference_tables`: (optional) reference tables that can be used in the quality checks.

@@ -762,6 +764,10 @@ Example of the configuration file (relevant fields only):
#checkpointLocation: /Volumes/catalog/schema/volume/checkpoint # only applicable if input_config.is_streaming is enabled
#trigger: # streaming trigger, only applicable if input_config.is_streaming is enabled
# availableNow: true
metrics_config: # optional - summary metrics storage
  format: delta
  location: main.nytaxi.dq_metrics
  mode: append
profiler_config:
  limit: 1000
  sample_fraction: 0.3
@@ -775,8 +781,67 @@ Example of the configuration file (relevant fields only):
input_config:
  format: delta
  location: main.nytaxi.ref
# Global custom metrics for summary statistics (optional)
custom_metrics:
  - "sum(array_size(_warnings)) as total_warnings"
  - "sum(array_size(_errors)) as total_errors"
```

## Summary Metrics

DQX can automatically capture and store summary metrics about your data quality checks. When enabled, it collects both default metrics (input, error, warning, and valid counts) and any custom metrics you define. Metrics can be configured programmatically or via the configuration file when installing DQX as a tool in the workspace.

### Enabling summary metrics programmatically

To enable summary metrics programmatically, create and pass a `DQMetricsObserver` when initializing the `DQEngine`:

<Tabs>
<TabItem value="Python" label="Python" default>
```python
from databricks.labs.dqx.config import InputConfig, OutputConfig
from databricks.labs.dqx.engine import DQEngine
from databricks.labs.dqx.metrics_observer import DQMetricsObserver

# set up a DQMetricsObserver with default name and metrics
dq_observer = DQMetricsObserver()
dq_engine = DQEngine(observer=dq_observer)

# Option 1: apply quality checks and return a single result DataFrame plus a metrics observation
valid_and_invalid_df, metrics_observation = dq_engine.apply_checks(input_df, checks)
valid_and_invalid_df.count()  # metrics are computed lazily; run an action before reading them
print(metrics_observation.get)

# Option 2: apply quality checks and return valid and invalid (quarantined) DataFrames plus a metrics observation
valid_df, invalid_df, metrics_observation = dq_engine.apply_checks_and_split(input_df, checks)
valid_df.count()  # trigger an action so the observation is populated
print(metrics_observation.get)

# Option 3 (end-to-end): apply quality checks to a table, save valid and invalid (quarantined) results to tables, and save metrics to a metrics table
dq_engine.apply_checks_and_save_in_table(
checks=checks,
input_config=InputConfig(location="catalog.schema.input"),
output_config=OutputConfig(location="catalog.schema.valid"),
quarantine_config=OutputConfig(location="catalog.schema.quarantine"),
metrics_config=OutputConfig(location="catalog.schema.metrics"),
)

# Option 4 (end-to-end): apply quality checks to a table, save results to an output table, and save metrics to a metrics table
dq_engine.apply_checks_and_save_in_table(
checks=checks,
input_config=InputConfig(location="catalog.schema.input"),
output_config=OutputConfig(location="catalog.schema.output"),
metrics_config=OutputConfig(location="catalog.schema.metrics"),
)
```
</TabItem>
</Tabs>
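
Once metrics are saved via `metrics_config` (options 3 and 4 above), the metrics table can be queried like any other Delta table. A minimal sketch; the table location matches the example above, but the column name used for ordering is an assumption about the metrics table schema:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# "run_time" is an assumed column name; adjust to the actual metrics table schema
metrics_df = spark.table("catalog.schema.metrics")
metrics_df.orderBy(metrics_df["run_time"].desc()).show(truncate=False)
```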

### Enabling summary metrics in DQX workflows

Summary metrics can also be enabled for DQX workflows. Metrics can be configured in two ways:

1. **During installation**: when prompted, choose to store summary metrics and configure the metrics table location for the default run config
2. **In the configuration file**: add `custom_metrics` and `metrics_config` to the [configuration file](/docs/installation/#configuration-file)

For detailed information about summary metrics, including examples and best practices, see the [Summary Metrics Guide](/docs/guide/summary_metrics).

## Quality checking results

Quality check results are added as additional columns to the output or quarantine (if defined) DataFrame or tables (if saved). These columns capture the outcomes of the checks performed on the input data.
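
For example, rows flagged by checks can be inspected through these columns directly. A minimal sketch, assuming `checked_df` was returned by `apply_checks` and uses the default `_errors` and `_warnings` reporting columns referenced in the custom metrics example above:

```python
# Show rows that failed at least one check; the columns hold per-check details
checked_df.select("_errors", "_warnings").where("_errors IS NOT NULL").show(truncate=False)
```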