Add summary metrics #553
Conversation
❌ 388/404 passed, 5 flaky, 16 failed, 18 skipped, 4h47m15s total

❌ test_e2e_workflow_with_custom_install_folder: [gw5] linux -- Python 3.12.11 /home/runner/work/dqx/dqx/.venv/bin/python (726ms)
❌ test_quality_checker_workflow_with_metrics: pyspark.errors.exceptions.connect.AnalysisException: [TABLE_OR_VIEW_NOT_FOUND] The table or view `main`.`dummy_scowo`.`metrics_gttua7` cannot be found. Verify the spelling and correctness of the schema and catalog. (2m40.658s)
❌ test_custom_metrics_in_workflow: pyspark.errors.exceptions.connect.AnalysisException: [TABLE_OR_VIEW_NOT_FOUND] The table or view `main`.`dummy_sccc2`.`metrics_pm3txo` cannot be found. Verify the spelling and correctness of the schema and catalog. (3m1.997s)
❌ test_quality_checker_workflow_with_quarantine_and_metrics: databricks.sdk.errors.platform.Unknown: apply_checks: Run failed with error message (1m5.863s)
❌ test_profiler_workflow_with_custom_install_folder: [gw9] linux -- Python 3.12.11 /home/runner/work/dqx/dqx/.venv/bin/python (812ms)
❌ test_e2e_workflow_for_multiple_run_configs: databricks.sdk.errors.platform.Unknown: prepare: Run failed with error message (1m33.08s)
❌ test_quality_checker_workflow_for_multiple_run_configs: databricks.labs.dqx.errors.InvalidConfigError: Run config flag is required (2m47.981s)
❌ test_e2e_workflow_with_metrics: pyspark.errors.exceptions.connect.AnalysisException: [TABLE_OR_VIEW_NOT_FOUND] The table or view `main`.`dummy_selzg`.`metrics_twfapr` cannot be found. Verify the spelling and correctness of the schema and catalog. (8m28.732s)
❌ test_quality_checker_workflow_for_multiple_run_configs_table_checks_storage: databricks.labs.dqx.errors.InvalidConfigError: Run config flag is required (3m11.575s)
❌ test_observer_metrics_output: chispa.dataframe_comparer.DataFramesNotEqualError: (11.486s)
❌ test_observer_metrics_output_with_quarantine: chispa.dataframe_comparer.DataFramesNotEqualError: (13.597s)
❌ test_streaming_observer_metrics_output: pyspark.errors.exceptions.connect.AnalysisException: [TABLE_OR_VIEW_NOT_FOUND] The table or view `main`.`dummy_seqiy`.`metrics_p3llvn` cannot be found. Verify the spelling and correctness of the schema and catalog. (12.845s)
❌ test_save_results_in_table_batch_with_metrics: chispa.dataframe_comparer.DataFramesNotEqualError: (10.446s)
❌ test_save_results_in_table_streaming_with_metrics: pyspark.errors.exceptions.connect.AnalysisException: [TABLE_OR_VIEW_NOT_FOUND] The table or view `main`.`dummy_sj69h`.`metrics_bqkexy` cannot be found. Verify the spelling and correctness of the schema and catalog. (11.513s)
❌ test_quality_checker_workflow_with_custom_install_folder: [gw6] linux -- Python 3.12.11 /home/runner/work/dqx/dqx/.venv/bin/python (630ms)
❌ test_list_tables: databricks.sdk.errors.platform.NotFound: Catalog 'ucx_9tmnasv5gbmchcct' does not exist. (18m39.399s)
Flaky tests:
Running from acceptance #2724
Pull Request Overview
This PR introduces summary metrics as outputs of quality checking methods, using Spark's Observation feature to track data quality metrics. The `DQObserver` class manages Spark observations and tracks both default metrics (input/error/warning/valid counts) and custom user-defined SQL expressions.
- Adds `DQObserver` class for managing Spark observations and tracking summary metrics
- Updates `DQEngine` methods to return tuples with both DataFrames and observations
- Integrates metrics collection with existing workflows and configuration system
Reviewed Changes
Copilot reviewed 17 out of 17 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| src/databricks/labs/dqx/observer.py | New DQObserver class for managing Spark observations and tracking metrics |
| src/databricks/labs/dqx/engine.py | Updated engine methods to support metrics collection and storage |
| src/databricks/labs/dqx/config.py | Added metrics configuration fields to RunConfig and WorkspaceConfig |
| tests/unit/test_observer.py | Unit tests for the DQObserver class functionality |
| tests/integration/test_summary_metrics.py | Integration tests for end-to-end metrics collection |
| tests/integration/test_metrics_workflow.py | Tests for workflow-based metrics collection |
src/databricks/labs/dqx/engine.py (outdated)

    save_dataframe_as_table(metrics_df, metrics_config)
agree
    assert run_config.metrics_config is None

    ctx.deployed_workflows.run_workflow("quality-checker", run_config.name)
    assert not ws.tables.exists(run_config.metrics_config.location).table_exists
Copilot AI commented on Aug 29, 2025:
This assertion will fail because run_config.metrics_config is None when metrics are disabled, causing a NoneType attribute access error. The assertion should check that run_config.metrics_config is None instead.
Suggested change:

    - assert not ws.tables.exists(run_config.metrics_config.location).table_exists
    + # Cannot check for metrics table existence as metrics_config is None.
true
Going in the right direction
- `sample_seed`: seed for reproducible sampling.
- `limit`: maximum number of records to analyze.
- `extra_params`: (optional) extra parameters to pass to the jobs such as result column names and user_metadata
- `custom_metrics`: (optional) list of Spark SQL expressions for capturing custom summary metrics.
It would be worth adding that a set of default metrics is always used regardless.
Do you think this is clear?
By default, the number of input, warning, and error rows will be tracked. When custom metrics are defined, they will be tracked in addition to the default metrics.
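For illustration, the default metrics could be expressed as Spark SQL aggregate expressions along these lines (a sketch only; the exact expressions and metric names in dqx may differ, and `_errors`/`_warnings` are the default result column names):

```python
# Hypothetical illustration of the default summary metrics; the actual
# expressions used by dqx may differ.
default_metrics = [
    "count(1) as input_row_count",
    "count(case when _errors is not null then 1 end) as error_row_count",
    "count(case when _warnings is not null then 1 end) as warning_row_count",
    "count(case when _errors is null and _warnings is null then 1 end) as valid_row_count",
]
```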
src/databricks/labs/dqx/observer.py (outdated)

        A list of Spark SQL expressions as strings
        """
        result_columns = self.result_columns or {}
        errors_column = result_columns.get(ColumnArguments.ERRORS.value, DefaultColumnNames.ERRORS.value)
It would be better to provide this directly to avoid repeating the engine implementation. I would set this in the engine. Otherwise, users have to provide extra params twice, which would be error-prone.
We need a way to update the column names in SQL expressions used for default metrics. I added a `_set_column_names` method in the `DQMetricsObserver` that is called whenever `DQEngine` is initialized with an observer.
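A minimal sketch of that wiring, assuming the method and attribute names mentioned in this thread (the real implementation likely differs in detail):

```python
class DQMetricsObserver:
    def __init__(self, custom_metrics: list[str] | None = None):
        self.custom_metrics = custom_metrics or []
        # Defaults match the engine's standard result column names.
        self._errors_column = "_errors"
        self._warnings_column = "_warnings"

    def _set_column_names(self, errors_column: str, warnings_column: str) -> None:
        # Called by DQEngine at initialization, so users don't have to pass
        # the result column names twice via extra params.
        self._errors_column = errors_column
        self._warnings_column = warnings_column
```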
src/databricks/labs/dqx/observer.py (outdated)

    """
    Spark `Observation` which can be attached to a `DataFrame` to track summary metrics. Metrics will be collected
    when the 1st action is triggered on the attached `DataFrame`. Subsequent operations on the attached `DataFrame`
    will not update the observed metrics. See: https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Observation.html
Make this a proper link so that it renders nicely in the API docs.
Updated
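For reference, a minimal standalone example of the `pyspark.sql.Observation` behavior described in that docstring (assumes an existing DataFrame `df` with an `age` column and an illustrative output table name):

```python
from pyspark.sql import Observation
import pyspark.sql.functions as F

observation = Observation("summary_metrics")
observed_df = df.observe(
    observation,
    F.count(F.lit(1)).alias("input_row_count"),
    F.avg("age").alias("avg_age"),
)

# Metrics are collected when the first action runs on the observed DataFrame;
# subsequent actions do not update them.
observed_df.write.mode("overwrite").saveAsTable("main.demo.output")
print(observation.get)  # e.g. {'input_row_count': 100, 'avg_age': 34.2}
```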
    def test_engine_with_observer_before_action(ws, spark):
        """Test that summary metrics are empty before running a Spark action."""
        custom_metrics = ["avg(age) as avg_age", "sum(salary) as total_salary"]
Do we need custom metrics here? It seems we can remove them for this test.
Updated
updated docs
…trics (merge with conflicts resolved in: docs/dqx/docs/reference/benchmarks.mdx, src/databricks/labs/dqx/engine.py, tests/perf/.benchmarks/baseline.json)
Changes
This PR introduces summary metrics as outputs of quality checking methods. Summary metrics computation relies on Spark's Observation feature.
Basic usage
The `DQObserver` can be added to `DQEngine` to manage Spark observations and track summary metrics on datasets checked with DQX. Methods of `DQEngine` have been updated to optionally return the Spark observation associated with a given run, as sketched below.
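A hedged sketch of the intended usage (signatures are inferred from this PR's description and file list, and `input_df`/`checks` are assumed to exist; the final API may differ):

```python
from databricks.sdk import WorkspaceClient
from databricks.labs.dqx.engine import DQEngine
from databricks.labs.dqx.observer import DQObserver

# Custom metrics are optional; default counts are always tracked.
observer = DQObserver(custom_metrics=["avg(age) as avg_age"])
engine = DQEngine(WorkspaceClient(), observer=observer)

# Checking methods now also return the Spark observation for the run.
checked_df, observation = engine.apply_checks_by_metadata(input_df, checks)

# Metrics are populated once the first action runs on the checked DataFrame.
checked_df.write.mode("overwrite").saveAsTable("main.demo.checked_data")
print(observation.get)  # default counts plus any custom metrics
```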
Writing summary metrics with checked data

When `DQEngine` methods write results to an output sink, metrics can also be written; see the sketch below.
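A sketch of that flow, assuming the `metrics_config` parameter added in this PR and dqx's `InputConfig`/`OutputConfig` config classes (parameter names and table locations here are illustrative and may differ in the merged version):

```python
from databricks.labs.dqx.config import InputConfig, OutputConfig

# Hypothetical call: checked data goes to the output table and summary
# metrics are saved to a separate metrics table.
engine.apply_checks_by_metadata_and_save_in_table(
    checks=checks,
    input_config=InputConfig(location="main.demo.input_data"),
    output_config=OutputConfig(location="main.demo.checked_data"),
    metrics_config=OutputConfig(location="main.demo.dq_metrics"),
)
```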
Integration with installed workflows

I have also updated the quality checking and e2e workflows to allow users to specify an output table where metrics are stored.
TODO
Linked issues
Resolves #376
Tests