41 commits
599d932 Initial commit (ghanse, Aug 28, 2025)
b767a07 Update docs/dqx/docs/guide/quality_checks_apply.mdx (mwojtyczka, Aug 29, 2025)
1dd3806 Update docs/dqx/docs/guide/quality_checks_apply.mdx (mwojtyczka, Aug 29, 2025)
71c0178 Update docs/dqx/docs/guide/summary_metrics.mdx (mwojtyczka, Aug 29, 2025)
843316f Update docs/dqx/docs/guide/summary_metrics.mdx (mwojtyczka, Aug 29, 2025)
a252923 Update docs/dqx/docs/guide/summary_metrics.mdx (mwojtyczka, Aug 29, 2025)
fb44082 Update docs/dqx/docs/guide/summary_metrics.mdx (mwojtyczka, Aug 29, 2025)
ae18de4 Update docs/dqx/docs/guide/summary_metrics.mdx (mwojtyczka, Aug 29, 2025)
4bd4cc4 Update docs/dqx/docs/guide/summary_metrics.mdx (mwojtyczka, Aug 29, 2025)
152f28f Update src/databricks/labs/dqx/engine.py (mwojtyczka, Aug 29, 2025)
cede7f0 Update engine methods, docs, and tests (ghanse, Sep 1, 2025)
5f6ea9d Merge branch 'main' into summary_metrics (mwojtyczka, Sep 1, 2025)
fd12138 Add streaming support (ghanse, Sep 5, 2025)
49303d6 Merge branch 'main' into summary_metrics (mwojtyczka, Sep 15, 2025)
15ae1ec Merge branch 'main' into summary_metrics (mwojtyczka, Sep 16, 2025)
31c3b88 Merge branch 'main' into summary_metrics (mwojtyczka, Sep 18, 2025)
b32a26c Merge branch 'main' into summary_metrics (mwojtyczka, Sep 19, 2025)
9916b0f Refactor (ghanse, Sep 21, 2025)
d367ef3 Update docs and tests (ghanse, Sep 21, 2025)
461b90f Update tests (ghanse, Sep 21, 2025)
f349b08 Refactor (ghanse, Sep 24, 2025)
f5a8dc7 Merge branch 'main' into summary_metrics (mwojtyczka, Sep 30, 2025)
32630fa Merge branch 'refs/heads/main' into summary_metrics (ghanse, Oct 1, 2025)
998490a Update engine methods, tests, and docs (ghanse, Oct 1, 2025)
7e25a97 Merge remote-tracking branch 'origin/summary_metrics' into summary_me… (ghanse, Oct 1, 2025)
9da72dc Update unit tests (ghanse, Oct 1, 2025)
9cfee10 Fix unit test type (ghanse, Oct 1, 2025)
b9969a9 Merge branch 'main' into summary_metrics (mwojtyczka, Oct 2, 2025)
52e3a22 Refactor (ghanse, Oct 2, 2025)
1950db7 Merge remote-tracking branch 'origin/summary_metrics' into summary_me… (ghanse, Oct 2, 2025)
9c319e6 Add pytest-benchmark performance baseline (ghanse, Oct 2, 2025)
835e1a7 Update implementation, tests, and docs (ghanse, Oct 3, 2025)
f74d877 Fix observation return (ghanse, Oct 3, 2025)
6fcf084 Fix user metadata in summary metrics table (ghanse, Oct 3, 2025)
88525ca Fix user metadata in summary metrics table (ghanse, Oct 3, 2025)
aadcf42 Merge branch 'main' into summary_metrics (mwojtyczka, Oct 3, 2025)
57a011a Merge remote-tracking branch 'origin/summary_metrics' into summary_me… (ghanse, Oct 6, 2025)
783027d Merge branch 'refs/heads/main' into summary_metrics (ghanse, Oct 6, 2025)
a5c2944 Update comments and tests (ghanse, Oct 6, 2025)
d402446 Update tests (ghanse, Oct 6, 2025)
a718803 Merge branch 'main' into summary_metrics (mwojtyczka, Oct 7, 2025)
13 changes: 12 additions & 1 deletion docs/dqx/docs/guide/index.mdx
@@ -45,4 +45,15 @@ Quality rules can be defined in the following ways:

Additionally, quality rule candidates can be auto-generated using the DQX profiler.

For more details, see the [Quality Checks Definition Guide](/docs/guide/quality_checks_definition).

## Summary metrics and monitoring

DQX can capture and store summary metrics about data quality across multiple tables and runs. Metrics are computed lazily and become available once a checked dataset is counted, displayed, or written to a table or files. Users can:

- Capture quality metrics for each checked dataset
- Track both default (e.g. input/error/warning/valid counts) and custom quality metrics
- Store quality metrics in Delta tables for historical analysis and alerting
- Centralize quality metrics across datasets, jobs, or job runs in a unified data quality history table

For more details, see the [Summary Metrics Guide](/docs/guide/summary_metrics).
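
Because metrics are collected through a Spark observation, they are only populated after an action runs on the checked DataFrame. A minimal sketch of the pattern, assuming `input_df` and `checks` are already defined; the full API is shown in the apply guide changes below:

```python
from databricks.labs.dqx.engine import DQEngine
from databricks.labs.dqx.metrics_observer import DQMetricsObserver

dq_engine = DQEngine(observer=DQMetricsObserver())

# Apply checks; `input_df` and `checks` are assumed to be defined elsewhere
checked_df, observation = dq_engine.apply_checks(input_df, checks)

checked_df.count()      # metrics are computed lazily, so run an action first
print(observation.get)  # default metrics such as input/error/warning/valid counts
```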
65 changes: 65 additions & 0 deletions docs/dqx/docs/guide/quality_checks_apply.mdx
@@ -717,6 +717,7 @@ The following fields from the [configuration file](/docs/installation/#configuration-file)
- `output_config`: configuration for the output data. 'location' is autogenerated when the workflow is executed for patterns.
- `quarantine_config`: (optional) configuration for the quarantine data. 'location' is autogenerated when the workflow is executed for patterns.
- `checks_location`: location of the quality checks in storage. Autogenerated when the workflow is executed for patterns.
- `metrics_config`: (optional) configuration for storing summary metrics.
- `serverless_clusters`: whether to use serverless clusters for running the workflow (default: `true`). Using serverless clusters is recommended as it allows for automated cluster management and scaling.
- `e2e_spark_conf`: (optional) spark configuration to use for the e2e workflow, only applicable if `serverless_clusters` is set to `false`.
- `e2e_override_clusters`: (optional) cluster configuration to use for the e2e workflow, only applicable if `serverless_clusters` is set to `false`.
@@ -733,6 +734,7 @@ The following fields from the [configuration file](/docs/installation/#configuration-file)
- `limit`: maximum number of records to analyze.
- `filter`: filter for the input data as a string SQL expression (default: None).
- `extra_params`: (optional) extra parameters to pass to the jobs, such as result column names and `user_metadata`.
- `custom_metrics`: (optional) list of Spark SQL expressions for capturing custom summary metrics. By default, the number of input, warning, and error rows will be tracked. When custom metrics are defined, they will be tracked in addition to the default metrics.
- `custom_check_functions`: (optional) custom check functions defined in Python files that can be used in the quality checks.
- `reference_tables`: (optional) reference tables that can be used in the quality checks.

@@ -762,6 +764,10 @@ Example of the configuration file (relevant fields only):
#checkpointLocation: /Volumes/catalog/schema/volume/checkpoint # only applicable if input_config.is_streaming is enabled
#trigger: # streaming trigger, only applicable if input_config.is_streaming is enabled
# availableNow: true
metrics_config: # optional - summary metrics storage
  format: delta
  location: main.nytaxi.dq_metrics
  mode: append
profiler_config:
  limit: 1000
  sample_fraction: 0.3
@@ -775,8 +781,67 @@ Example of the configuration file (relevant fields only):
input_config:
  format: delta
  location: main.nytaxi.ref
# Global custom metrics for summary statistics (optional)
custom_metrics:
  - "sum(array_size(_warnings)) as total_warnings"
  - "sum(array_size(_errors)) as total_errors"
```

## Summary Metrics

DQX can automatically capture and store summary metrics about your data quality checks. When enabled, it collects both default metrics (input, error, warning, and valid counts) and any custom metrics you define. Metrics can be configured programmatically or via the configuration file when installing DQX as a tool in the workspace.

### Enabling summary metrics programmatically

To enable summary metrics programmatically, create and pass a `DQMetricsObserver` when initializing the `DQEngine`:

<Tabs>
<TabItem value="Python" label="Python" default>
```python
from databricks.labs.dqx.config import InputConfig, OutputConfig
from databricks.labs.dqx.engine import DQEngine
from databricks.labs.dqx.metrics_observer import DQMetricsObserver

# set up a DQMetricsObserver with default name and metrics
dq_observer = DQMetricsObserver()
dq_engine = DQEngine(observer=dq_observer)

# Option 1: apply quality checks and return a single result DataFrame plus a metrics observation
valid_and_invalid_df, metrics_observation = dq_engine.apply_checks(input_df, checks)
valid_and_invalid_df.count()  # metrics are computed lazily; run an action before reading them
print(metrics_observation.get)

# Option 2: apply quality checks and return valid and invalid (quarantined) DataFrames plus a metrics observation
valid_df, invalid_df, metrics_observation = dq_engine.apply_checks_and_split(input_df, checks)
valid_df.count()  # trigger an action so the observation is populated
print(metrics_observation.get)

# Option 3 (end-to-end): apply quality checks to a table, save valid and invalid (quarantined) results to tables, and save metrics to a metrics table
dq_engine.apply_checks_and_save_in_table(
checks=checks,
input_config=InputConfig(location="catalog.schema.input"),
output_config=OutputConfig(location="catalog.schema.valid"),
quarantine_config=OutputConfig(location="catalog.schema.quarantine"),
metrics_config=OutputConfig(location="catalog.schema.metrics"),
)

# Option 4 (end-to-end): apply quality checks to a table, save results to an output table, and save metrics to a metrics table
dq_engine.apply_checks_and_save_in_table(
checks=checks,
input_config=InputConfig(location="catalog.schema.input"),
output_config=OutputConfig(location="catalog.schema.output"),
metrics_config=OutputConfig(location="catalog.schema.metrics"),
)
```
</TabItem>
</Tabs>
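
Once metrics are saved via `metrics_config` (options 3 and 4 above), the metrics table can be queried like any other Delta table. A minimal sketch; the table location matches the example above, but the column name used for ordering is an assumption about the metrics table schema:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# "run_time" is an assumed column name; adjust to the actual metrics table schema
metrics_df = spark.table("catalog.schema.metrics")
metrics_df.orderBy(metrics_df["run_time"].desc()).show(truncate=False)
```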

### Enabling summary metrics in DQX workflows

Summary metrics can also be enabled for DQX workflows. Metrics can be configured in two ways:

1. **During installation**: when prompted, choose to store summary metrics and configure the metrics table location for the default run config
2. **In the configuration file**: add `custom_metrics` and `metrics_config` to the [configuration file](/docs/installation/#configuration-file)

For detailed information about summary metrics, including examples and best practices, see the [Summary Metrics Guide](/docs/guide/summary_metrics).

## Quality checking results

Quality check results are added as additional columns to the output or quarantine (if defined) DataFrame or tables (if saved). These columns capture the outcomes of the checks performed on the input data.
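
For example, rows flagged by checks can be inspected through these columns directly. A minimal sketch, assuming `checked_df` was returned by `apply_checks` and uses the default `_errors` and `_warnings` reporting columns referenced in the custom metrics example above:

```python
# Show rows that failed at least one check; the columns hold per-check details
checked_df.select("_errors", "_warnings").where("_errors IS NOT NULL").show(truncate=False)
```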