
Conversation

ghanse (Contributor) commented Aug 29, 2025

Changes

This PR introduces summary metrics as outputs of quality checking methods. Summary metrics computation relies on Spark's Observation feature.

Basic usage

A DQObserver can be passed to DQEngine to manage Spark observations and track summary metrics on datasets checked with DQX:

from databricks.labs.dqx.engine import DQEngine
from databricks.labs.dqx.observer import DQObserver

observer = DQObserver()
engine = DQEngine(ws, observer=observer)

Methods of DQEngine have been updated to optionally return the Spark observation associated with a given run:

checked_df, observation = engine.apply_checks(input_df, checks)
checked_df.count()  # or any other action like saving the dataframe to a table
metrics = observation.get  # dict mapping metric names to observed values
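Metrics follow standard Spark Observation semantics: they are collected when the first action runs on the checked DataFrame, and observation.get returns a dictionary keyed by metric name. For illustration only (the values shown are hypothetical):

print(metrics)
# e.g. {'input_row_count': 4, 'error_row_count': 1, 'warning_row_count': 1, 'valid_row_count': 2}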

Writing summary metrics with checked data

When DQEngine methods write results to an output sink, metrics can also be written:

engine.apply_checks_and_save_in_table(
  checks=...,
  input_config=...,
  output_config=...,
  metrics_config=OutputConfig("main.dqx.summary_metrics")
)
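When a metrics_config is provided, each metric is persisted as a row in the target table; based on the integration tests, the rows carry fields such as run_name, input_location, output_location, quarantine_location, checks_location, metric_name, metric_value, and the configured error/warning column names. A minimal sketch of inspecting the saved metrics, assuming a SparkSession named spark and reusing the table name from the example above:

metrics_df = spark.table("main.dqx.summary_metrics")
metrics_df.select("run_name", "metric_name", "metric_value").show()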

Integration with installed workflows

I have also updated the quality checking and e2e workflows to allow users to specify an output table where metrics are stored.

TODO

  • Baseline functionality
  • Add handlers for streaming
  • Add handlers for custom callback functions
  • Update tests
  • Update docs
  • Update demos

Linked issues

Resolves #376

Tests

  • manually tested
  • added unit tests
  • added integration tests
  • added end-to-end tests

github-actions bot commented Aug 29, 2025

❌ 388/404 passed, 5 flaky, 16 failed, 18 skipped, 4h47m15s total

❌ test_e2e_workflow_with_custom_install_folder: [gw5] linux -- Python 3.12.11 /home/runner/work/dqx/dqx/.venv/bin/python (726ms)
[gw5] linux -- Python 3.12.11 /home/runner/work/dqx/dqx/.venv/bin/python
file /home/runner/work/dqx/dqx/tests/integration/test_e2e_workflow.py, line 103
  def test_e2e_workflow_with_custom_install_folder(
E       fixture 'setup_workflows_with_custom_folder' not found
>       available fixtures: acc, benchmark, benchmark_weave, cache, capfd, capfdbinary, caplog, capsys, capsysbinary, checks_json_content, checks_json_invalid_content, checks_yaml_content, checks_yaml_invalid_content, class_mocker, cov, debug_env, debug_env_name, doctest_namespace, env_or_skip, expected_checks, expected_quality_checking_output, hello, installation_ctx, installation_ctx_custom_install_folder, is_in_debug, log_account_link, log_workspace_link, make_acc_group, make_alert_permissions, make_authorization_permissions, make_catalog, make_check_file_as_json, make_check_file_as_yaml, make_cluster, make_cluster_permissions, make_cluster_policy, make_cluster_policy_permissions, make_dashboard_permissions, make_directory, make_directory_permissions, make_empty_local_json_file, make_empty_local_yaml_file, make_experiment, make_experiment_permissions, make_feature_table, make_feature_table_permissions, make_group, make_instance_pool, make_instance_pool_permissions, make_invalid_check_file_as_json, make_invalid_check_file_as_yaml, make_invalid_local_check_file_as_json, make_invalid_local_check_file_as_yaml, make_job, make_job_permissions, make_lakeview_dashboard_permissions, make_local_check_file_as_json, make_local_check_file_as_yaml, make_local_check_file_as_yaml_diff_ext, make_model, make_notebook, make_notebook_permissions, make_pipeline, make_pipeline_permissions, make_query, make_query_permissions, make_random, make_registered_model_permissions, make_repo, make_repo_permissions, make_run_as, make_schema, make_secret_scope, make_secret_scope_acl, make_serving_endpoint, make_serving_endpoint_permissions, make_storage_credential, make_table, make_udf, make_user, make_volume, make_volume_check_file_as_json, make_volume_check_file_as_yaml, make_volume_invalid_check_file_as_json, make_volume_invalid_check_file_as_yaml, make_warehouse, make_warehouse_permissions, make_workspace_file, make_workspace_file_path_permissions, make_workspace_file_permissions, mocker, module_mocker, monkeypatch, no_cover, package_mocker, product_info, pytestconfig, record_property, record_testsuite_property, record_xml_attribute, recwarn, serverless_installation_ctx, session_mocker, set_utc_timezone, setup_serverless_workflows, setup_workflows, setup_workflows_with_metrics, skip_if_runtime_not_geo_compatible, spark, sql_backend, sql_exec, sql_fetch_all, testrun_uid, tmp_path, tmp_path_factory, tmpdir, tmpdir_factory, watchdog_purge_suffix, watchdog_remove_after, webbrowser_open, worker_id, ws
>       use 'pytest --fixtures [testpath]' for help on them.

/home/runner/work/dqx/dqx/tests/integration/test_e2e_workflow.py:103
[gw5] linux -- Python 3.12.11 /home/runner/work/dqx/dqx/.venv/bin/python
❌ test_quality_checker_workflow_with_metrics: pyspark.errors.exceptions.connect.AnalysisException: [TABLE_OR_VIEW_NOT_FOUND] The table or view `main`.`dummy_scowo`.`metrics_gttua7` cannot be found. Verify the spelling and correctness of the schema and catalog. (2m40.658s)
... (skipped 28896 bytes)
 'criticality': 'error', 'name': 'name_is_not_null_and_not_empty'}
07:52:40 DEBUG [databricks.labs.dqx.checks_resolver] {ThreadPoolExecutor-4_0} Resolving function: is_not_null_and_not_empty
07:52:40 DEBUG [databricks.labs.dqx.checks_resolver] {ThreadPoolExecutor-4_0} Function is_not_null_and_not_empty resolved successfully: <function is_not_null_and_not_empty at 0xffff2b731f80>
07:52:40 DEBUG [databricks.labs.dqx.checks_serializer] {ThreadPoolExecutor-4_0} Processing check definition: {'check': {'arguments': {'column': 'id'}, 'function': 'is_not_null'}, 'criticality': 'error', 'name': 'id_is_not_null'}
07:52:40 DEBUG [databricks.labs.dqx.checks_resolver] {ThreadPoolExecutor-4_0} Resolving function: is_not_null
07:52:40 DEBUG [databricks.labs.dqx.checks_resolver] {ThreadPoolExecutor-4_0} Function is_not_null resolved successfully: <function is_not_null at 0xffff2b7320c0>
07:52:40 DEBUG [databricks.labs.dqx.checks_serializer] {ThreadPoolExecutor-4_0} Processing check definition: {'check': {'arguments': {'column': 'name'}, 'function': 'is_not_null_and_not_empty'}, 'criticality': 'error', 'name': 'name_is_not_null_and_not_empty'}
07:52:40 DEBUG [databricks.labs.dqx.checks_resolver] {ThreadPoolExecutor-4_0} Resolving function: is_not_null_and_not_empty
07:52:40 DEBUG [databricks.labs.dqx.checks_resolver] {ThreadPoolExecutor-4_0} Function is_not_null_and_not_empty resolved successfully: <function is_not_null_and_not_empty at 0xffff2b731f80>
07:52:40 DEBUG [databricks.labs.dqx.telemetry] {ThreadPoolExecutor-4_0} Added User-Agent extra check=is_not_null
07:52:41 DEBUG [databricks.sdk] {ThreadPoolExecutor-4_0} GET /api/2.1/clusters/spark-versions
< 200 OK
< {
<   "versions": [
<     {
<       "key": "12.2.x-scala2.12",
<       "name": "12.2 LTS (includes Apache Spark 3.3.2, Scala 2.12)"
<     },
<     {
<       "key": "11.3.x-photon-scala2.12",
<       "name": "11.3 LTS Photon (includes Apache Spark 3.3.0, Scala 2.12)"
<     },
<     "... (53 additional elements)"
<   ]
< }
07:52:41 DEBUG [databricks.labs.dqx.telemetry] {ThreadPoolExecutor-4_0} Added User-Agent extra check=is_not_null_and_not_empty
07:52:41 DEBUG [databricks.sdk] {ThreadPoolExecutor-4_0} GET /api/2.1/clusters/spark-versions
< 200 OK
< {
<   "versions": [
<     {
<       "key": "12.2.x-scala2.12",
<       "name": "12.2 LTS (includes Apache Spark 3.3.2, Scala 2.12)"
<     },
<     {
<       "key": "11.3.x-photon-scala2.12",
<       "name": "11.3 LTS Photon (includes Apache Spark 3.3.0, Scala 2.12)"
<     },
<     "... (53 additional elements)"
<   ]
< }
07:52:41 INFO [databricks.labs.dqx.io] {ThreadPoolExecutor-4_0} Saving data to main.dummy_scowo.uwitd3zrv0 table
07:53 INFO [databricks.labs.dqx.quality_checker.quality_checker_runner:apply_checks] Data quality checker completed.
07:53 INFO [databricks.labs.dqx.installer.workflow_installer] ---------- END REMOTE LOGS ----------
07:53 INFO [databricks.labs.dqx.installer.install] Deleting DQX v0.9.4+4420251007075033 from https://DATABRICKS_HOST
07:53 INFO [databricks.labs.dqx.installer.workflow_installer] Removing job_id=959178756216828, as it is no longer needed
07:53 INFO [databricks.labs.dqx.installer.workflow_installer] Removing job_id=610615979419241, as it is no longer needed
07:53 INFO [databricks.labs.dqx.installer.workflow_installer] Removing job_id=685879735761757, as it is no longer needed
07:53 INFO [databricks.labs.dqx.installer.install] Uninstalling DQX complete
[gw6] linux -- Python 3.12.11 /home/runner/work/dqx/dqx/.venv/bin/python
❌ test_custom_metrics_in_workflow: pyspark.errors.exceptions.connect.AnalysisException: [TABLE_OR_VIEW_NOT_FOUND] The table or view `main`.`dummy_sccc2`.`metrics_pm3txo` cannot be found. Verify the spelling and correctness of the schema and catalog. (3m1.997s)
... (skipped 28907 bytes)
'criticality': 'error', 'name': 'name_is_not_null_and_not_empty'}
07:53:05 DEBUG [databricks.labs.dqx.checks_resolver] {ThreadPoolExecutor-4_0} Resolving function: is_not_null_and_not_empty
07:53:05 DEBUG [databricks.labs.dqx.checks_resolver] {ThreadPoolExecutor-4_0} Function is_not_null_and_not_empty resolved successfully: <function is_not_null_and_not_empty at 0xffff2a98df80>
07:53:05 DEBUG [databricks.labs.dqx.checks_serializer] {ThreadPoolExecutor-4_0} Processing check definition: {'check': {'arguments': {'column': 'id'}, 'function': 'is_not_null'}, 'criticality': 'error', 'name': 'id_is_not_null'}
07:53:05 DEBUG [databricks.labs.dqx.checks_resolver] {ThreadPoolExecutor-4_0} Resolving function: is_not_null
07:53:05 DEBUG [databricks.labs.dqx.checks_resolver] {ThreadPoolExecutor-4_0} Function is_not_null resolved successfully: <function is_not_null at 0xffff2a98e0c0>
07:53:05 DEBUG [databricks.labs.dqx.checks_serializer] {ThreadPoolExecutor-4_0} Processing check definition: {'check': {'arguments': {'column': 'name'}, 'function': 'is_not_null_and_not_empty'}, 'criticality': 'error', 'name': 'name_is_not_null_and_not_empty'}
07:53:05 DEBUG [databricks.labs.dqx.checks_resolver] {ThreadPoolExecutor-4_0} Resolving function: is_not_null_and_not_empty
07:53:05 DEBUG [databricks.labs.dqx.checks_resolver] {ThreadPoolExecutor-4_0} Function is_not_null_and_not_empty resolved successfully: <function is_not_null_and_not_empty at 0xffff2a98df80>
07:53:05 DEBUG [databricks.labs.dqx.telemetry] {ThreadPoolExecutor-4_0} Added User-Agent extra check=is_not_null
07:53:05 DEBUG [databricks.sdk] {ThreadPoolExecutor-4_0} GET /api/2.1/clusters/spark-versions
< 200 OK
< {
<   "versions": [
<     {
<       "key": "12.2.x-scala2.12",
<       "name": "12.2 LTS (includes Apache Spark 3.3.2, Scala 2.12)"
<     },
<     {
<       "key": "11.3.x-photon-scala2.12",
<       "name": "11.3 LTS Photon (includes Apache Spark 3.3.0, Scala 2.12)"
<     },
<     "... (53 additional elements)"
<   ]
< }
07:53:05 DEBUG [databricks.labs.dqx.telemetry] {ThreadPoolExecutor-4_0} Added User-Agent extra check=is_not_null_and_not_empty
07:53:05 DEBUG [databricks.sdk] {ThreadPoolExecutor-4_0} GET /api/2.1/clusters/spark-versions
< 200 OK
< {
<   "versions": [
<     {
<       "key": "12.2.x-scala2.12",
<       "name": "12.2 LTS (includes Apache Spark 3.3.2, Scala 2.12)"
<     },
<     {
<       "key": "11.3.x-photon-scala2.12",
<       "name": "11.3 LTS Photon (includes Apache Spark 3.3.0, Scala 2.12)"
<     },
<     "... (53 additional elements)"
<   ]
< }
07:53:06 INFO [databricks.labs.dqx.io] {ThreadPoolExecutor-4_0} Saving data to main.dummy_sccc2.ehipsik7bq table
07:53 INFO [databricks.labs.dqx.quality_checker.quality_checker_runner:apply_checks] Data quality checker completed.
07:53 INFO [databricks.labs.dqx.installer.workflow_installer] ---------- END REMOTE LOGS ----------
07:53 INFO [databricks.labs.dqx.installer.install] Deleting DQX v0.9.4+4420251007075031 from https://DATABRICKS_HOST
07:53 INFO [databricks.labs.dqx.installer.workflow_installer] Removing job_id=820624146080410, as it is no longer needed
07:53 INFO [databricks.labs.dqx.installer.workflow_installer] Removing job_id=1043280363035729, as it is no longer needed
07:53 INFO [databricks.labs.dqx.installer.workflow_installer] Removing job_id=763085146932591, as it is no longer needed
07:53 INFO [databricks.labs.dqx.installer.install] Uninstalling DQX complete
[gw2] linux -- Python 3.12.11 /home/runner/work/dqx/dqx/.venv/bin/python
❌ test_quality_checker_workflow_with_quarantine_and_metrics: databricks.sdk.errors.platform.Unknown: apply_checks: Run failed with error message (1m5.863s)
... (skipped 3860 bytes)
abricks.labs.dqx.installer.workflow_installer] ---------- END REMOTE LOGS ----------
07:53 INFO [databricks.labs.dqx.installer.install] Please answer a couple of questions to provide TEST_SCHEMA DQX run configuration. The configuration can also be updated manually after the installation.
07:53 INFO [databricks.labs.dqx.installer.install] DQX will be installed in the TEST_SCHEMA location: '/Users/<your_user>/.dqx'
07:53 INFO [databricks.labs.dqx.installer.install] Installing DQX v0.9.4+4420251007075314
07:53 INFO [databricks.labs.dqx.installer.dashboard_installer] Creating dashboards...
07:53 INFO [databricks.labs.dqx.installer.dashboard_installer] Reading dashboard assets from /home/runner/work/dqx/dqx/src/databricks/labs/dqx/queries/quality/dashboard...
07:53 INFO [databricks.labs.dqx.installer.dashboard_installer] Using 'main.dqx_test.output_table' output table as the source table for the dashboard...
07:53 WARNING [databricks.labs.lsql.dashboards] Parsing : No expression was parsed from ''
07:53 WARNING [databricks.labs.lsql.dashboards] Parsing unsupported field in dashboard.yml: tiles.00_2_dq_error_types.hidden
07:53 WARNING [databricks.labs.lsql.dashboards] Parsing : No expression was parsed from ''
07:53 INFO [databricks.labs.dqx.installer.dashboard_installer] Installing 'DQX_Quality_Dashboard' dashboard in '/Users/3fe685a1-96cc-4fec-8cdb-6944f5c9787e/.XUrh/dashboards'
07:53 INFO [databricks.labs.dqx.installer.workflow_installer] Creating new job configuration for step=profiler
07:53 INFO [databricks.labs.dqx.installer.workflow_installer] Creating new job configuration for step=quality-checker
07:53 INFO [databricks.labs.dqx.installer.workflow_installer] Creating new job configuration for step=e2e
07:53 INFO [databricks.labs.dqx.installer.workflow_installer] Updating configuration for step=e2e job_id=402420946484129
07:53 INFO [databricks.labs.dqx.installer.workflow_installer] Updating configuration for step=e2e job_id=402420946484129
07:53 INFO [databricks.labs.dqx.installer.workflow_installer] Updating configuration for step=e2e job_id=402420946484129
07:53 INFO [databricks.labs.dqx.installer.install] Installation completed successfully!
07:53 INFO [databricks.labs.dqx.checks_storage] Saving quality rules (checks) to '/Users/3fe685a1-96cc-4fec-8cdb-6944f5c9787e/.XUrh/checks.yml' in the workspace.
07:53 INFO [databricks.labs.dqx.installer.workflow_installer] Started quality-checker workflow: https://DATABRICKS_HOST#job/808792099751747/runs/485001970057993
07:54 INFO [databricks.labs.dqx.installer.workflow_installer] ---------- REMOTE LOGS --------------
07:54 WARNING [databricks.labs.dqx.installer.workflow_installer] Cannot fetch logs as folder /Users/3fe685a1-96cc-4fec-8cdb-6944f5c9787e/.XUrh/logs/quality-checker does not exist
07:54 INFO [databricks.labs.dqx.installer.workflow_installer] ---------- END REMOTE LOGS ----------
07:54 INFO [databricks.labs.dqx.installer.install] Deleting DQX v0.9.4+4420251007075314 from https://DATABRICKS_HOST
07:54 INFO [databricks.labs.dqx.installer.workflow_installer] Removing job_id=571238872668260, as it is no longer needed
07:54 INFO [databricks.labs.dqx.installer.workflow_installer] Removing job_id=808792099751747, as it is no longer needed
07:54 INFO [databricks.labs.dqx.installer.workflow_installer] Removing job_id=402420946484129, as it is no longer needed
07:54 INFO [databricks.labs.dqx.installer.install] Uninstalling DQX complete
[gw6] linux -- Python 3.12.11 /home/runner/work/dqx/dqx/.venv/bin/python
❌ test_profiler_workflow_with_custom_install_folder: [gw9] linux -- Python 3.12.11 /home/runner/work/dqx/dqx/.venv/bin/python (812ms)
[gw9] linux -- Python 3.12.11 /home/runner/work/dqx/dqx/.venv/bin/python
file /home/runner/work/dqx/dqx/tests/integration/test_profiler_workflow.py, line 93
  def test_profiler_workflow_with_custom_install_folder(ws, spark, setup_workflows_with_custom_folder):
E       fixture 'setup_workflows_with_custom_folder' not found
>       available fixtures: acc, benchmark, benchmark_weave, cache, capfd, capfdbinary, caplog, capsys, capsysbinary, checks_json_content, checks_json_invalid_content, checks_yaml_content, checks_yaml_invalid_content, class_mocker, cov, debug_env, debug_env_name, doctest_namespace, env_or_skip, expected_checks, expected_quality_checking_output, hello, installation_ctx, installation_ctx_custom_install_folder, is_in_debug, log_account_link, log_workspace_link, make_acc_group, make_alert_permissions, make_authorization_permissions, make_catalog, make_check_file_as_json, make_check_file_as_yaml, make_cluster, make_cluster_permissions, make_cluster_policy, make_cluster_policy_permissions, make_dashboard_permissions, make_directory, make_directory_permissions, make_empty_local_json_file, make_empty_local_yaml_file, make_experiment, make_experiment_permissions, make_feature_table, make_feature_table_permissions, make_group, make_instance_pool, make_instance_pool_permissions, make_invalid_check_file_as_json, make_invalid_check_file_as_yaml, make_invalid_local_check_file_as_json, make_invalid_local_check_file_as_yaml, make_job, make_job_permissions, make_lakeview_dashboard_permissions, make_local_check_file_as_json, make_local_check_file_as_yaml, make_local_check_file_as_yaml_diff_ext, make_model, make_notebook, make_notebook_permissions, make_pipeline, make_pipeline_permissions, make_query, make_query_permissions, make_random, make_registered_model_permissions, make_repo, make_repo_permissions, make_run_as, make_schema, make_secret_scope, make_secret_scope_acl, make_serving_endpoint, make_serving_endpoint_permissions, make_storage_credential, make_table, make_udf, make_user, make_volume, make_volume_check_file_as_json, make_volume_check_file_as_yaml, make_volume_invalid_check_file_as_json, make_volume_invalid_check_file_as_yaml, make_warehouse, make_warehouse_permissions, make_workspace_file, make_workspace_file_path_permissions, make_workspace_file_permissions, mocker, module_mocker, monkeypatch, no_cover, package_mocker, product_info, pytestconfig, record_property, record_testsuite_property, record_xml_attribute, recwarn, serverless_installation_ctx, session_mocker, set_utc_timezone, setup_serverless_workflows, setup_workflows, setup_workflows_with_metrics, skip_if_runtime_not_geo_compatible, spark, sql_backend, sql_exec, sql_fetch_all, testrun_uid, tmp_path, tmp_path_factory, tmpdir, tmpdir_factory, watchdog_purge_suffix, watchdog_remove_after, webbrowser_open, worker_id, ws
>       use 'pytest --fixtures [testpath]' for help on them.

/home/runner/work/dqx/dqx/tests/integration/test_profiler_workflow.py:93
[gw9] linux -- Python 3.12.11 /home/runner/work/dqx/dqx/.venv/bin/python
❌ test_e2e_workflow_for_multiple_run_configs: databricks.sdk.errors.platform.Unknown: prepare: Run failed with error message (1m33.08s)
... (skipped 3483 bytes)
---
07:58 WARNING [databricks.labs.dqx.installer.workflow_installer] Cannot fetch logs as folder /Users/3fe685a1-96cc-4fec-8cdb-6944f5c9787e/.Oaz1/logs/e2e does not exist
07:58 INFO [databricks.labs.dqx.installer.workflow_installer] ---------- END REMOTE LOGS ----------
07:57 INFO [databricks.labs.dqx.installer.install] Please answer a couple of questions to provide TEST_SCHEMA DQX run configuration. The configuration can also be updated manually after the installation.
07:57 INFO [databricks.labs.dqx.installer.install] DQX will be installed in the TEST_SCHEMA location: '/Users/<your_user>/.dqx'
07:57 INFO [databricks.labs.dqx.installer.install] Installing DQX v0.9.4+4420251007075729
07:57 INFO [databricks.labs.dqx.installer.dashboard_installer] Creating dashboards...
07:57 INFO [databricks.labs.dqx.installer.dashboard_installer] Reading dashboard assets from /home/runner/work/dqx/dqx/src/databricks/labs/dqx/queries/quality/dashboard...
07:57 INFO [databricks.labs.dqx.installer.dashboard_installer] Using 'main.dqx_test.output_table' output table as the source table for the dashboard...
07:57 WARNING [databricks.labs.lsql.dashboards] Parsing : No expression was parsed from ''
07:57 WARNING [databricks.labs.lsql.dashboards] Parsing unsupported field in dashboard.yml: tiles.00_2_dq_error_types.hidden
07:57 WARNING [databricks.labs.lsql.dashboards] Parsing : No expression was parsed from ''
07:57 INFO [databricks.labs.dqx.installer.dashboard_installer] Installing 'DQX_Quality_Dashboard' dashboard in '/Users/3fe685a1-96cc-4fec-8cdb-6944f5c9787e/.Oaz1/dashboards'
07:57 INFO [databricks.labs.dqx.installer.workflow_installer] Creating new job configuration for step=profiler
07:57 INFO [databricks.labs.dqx.installer.workflow_installer] Creating new job configuration for step=quality-checker
07:57 INFO [databricks.labs.dqx.installer.workflow_installer] Creating new job configuration for step=e2e
07:57 INFO [databricks.labs.dqx.installer.workflow_installer] Updating configuration for step=e2e job_id=895640736201408
07:57 INFO [databricks.labs.dqx.installer.workflow_installer] Updating configuration for step=e2e job_id=895640736201408
07:57 INFO [databricks.labs.dqx.installer.workflow_installer] Updating configuration for step=e2e job_id=895640736201408
07:57 INFO [databricks.labs.dqx.installer.install] Installation completed successfully!
07:57 INFO [databricks.labs.dqx.installer.workflow_installer] Started e2e workflow: https://DATABRICKS_HOST#job/895640736201408/runs/915391438010011
07:58 INFO [databricks.labs.dqx.installer.workflow_installer] ---------- REMOTE LOGS --------------
07:58 WARNING [databricks.labs.dqx.installer.workflow_installer] Cannot fetch logs as folder /Users/3fe685a1-96cc-4fec-8cdb-6944f5c9787e/.Oaz1/logs/e2e does not exist
07:58 INFO [databricks.labs.dqx.installer.workflow_installer] ---------- END REMOTE LOGS ----------
07:58 INFO [databricks.labs.dqx.installer.install] Deleting DQX v0.9.4+4420251007075729 from https://DATABRICKS_HOST
07:58 INFO [databricks.labs.dqx.installer.workflow_installer] Removing job_id=262397866451180, as it is no longer needed
07:58 INFO [databricks.labs.dqx.installer.workflow_installer] Removing job_id=190848360489449, as it is no longer needed
07:58 INFO [databricks.labs.dqx.installer.workflow_installer] Removing job_id=895640736201408, as it is no longer needed
07:59 INFO [databricks.labs.dqx.installer.install] Uninstalling DQX complete
[gw7] linux -- Python 3.12.11 /home/runner/work/dqx/dqx/.venv/bin/python
❌ test_quality_checker_workflow_for_multiple_run_configs: databricks.labs.dqx.errors.InvalidConfigError: Run config flag is required (2m47.981s)
... (skipped 3788 bytes)
tall] Installing DQX v0.9.4+4420251007075712
07:57 INFO [databricks.labs.dqx.installer.dashboard_installer] Creating dashboards...
07:57 INFO [databricks.labs.dqx.installer.dashboard_installer] Reading dashboard assets from /home/runner/work/dqx/dqx/src/databricks/labs/dqx/queries/quality/dashboard...
07:57 INFO [databricks.labs.dqx.installer.dashboard_installer] Using 'main.dqx_test.output_table' output table as the source table for the dashboard...
07:57 WARNING [databricks.labs.lsql.dashboards] Parsing : No expression was parsed from ''
07:57 WARNING [databricks.labs.lsql.dashboards] Parsing unsupported field in dashboard.yml: tiles.00_2_dq_error_types.hidden
07:57 WARNING [databricks.labs.lsql.dashboards] Parsing : No expression was parsed from ''
07:57 INFO [databricks.labs.dqx.installer.dashboard_installer] Installing 'DQX_Quality_Dashboard' dashboard in '/Users/3fe685a1-96cc-4fec-8cdb-6944f5c9787e/.ttDJ/dashboards'
07:57 INFO [databricks.labs.dqx.installer.workflow_installer] Creating new job configuration for step=profiler
07:57 INFO [databricks.labs.dqx.installer.workflow_installer] Creating new job configuration for step=quality-checker
07:57 INFO [databricks.labs.dqx.installer.workflow_installer] Creating new job configuration for step=e2e
07:57 INFO [databricks.labs.dqx.installer.workflow_installer] Updating configuration for step=e2e job_id=802106398060023
07:57 INFO [databricks.labs.dqx.installer.workflow_installer] Updating configuration for step=e2e job_id=802106398060023
07:57 INFO [databricks.labs.dqx.installer.workflow_installer] Updating configuration for step=e2e job_id=802106398060023
07:57 INFO [databricks.labs.dqx.installer.install] Installation completed successfully!
07:57 INFO [databricks.labs.dqx.checks_storage] Saving quality rules (checks) to '/Users/3fe685a1-96cc-4fec-8cdb-6944f5c9787e/.ttDJ/checks.yml' in the workspace.
07:57 INFO [databricks.labs.dqx.installer.workflow_installer] Started quality-checker workflow: https://DATABRICKS_HOST#job/100565335467088/runs/123588368629372
07:59 INFO [databricks.labs.dqx.installer.workflow_installer] ---------- REMOTE LOGS --------------
07:59 INFO [databricks.labs.dqx:apply_checks] DQX v0.9.4+4420251007075712 After workflow finishes, see debug logs at /Workspace/Users/3fe685a1-96cc-4fec-8cdb-6944f5c9787e/.ttDJ/logs/quality-checker/run-123588368629372-0/apply_checks.log
07:59 INFO [databricks.labs.dqx.quality_checker.quality_checker_workflow:apply_checks] Running data quality workflow for all run configs
07:59 ERROR [databricks.labs.dqx:apply_checks] Execute `databricks workspace export //Users/3fe685a1-96cc-4fec-8cdb-6944f5c9787e/.ttDJ/logs/quality-checker/run-123588368629372-0/apply_checks.log` locally to troubleshoot with more details. Run config flag is required
07:59 INFO [databricks.labs.dqx.installer.workflow_installer] ---------- END REMOTE LOGS ----------
07:59 INFO [databricks.labs.dqx.installer.install] Deleting DQX v0.9.4+4420251007075712 from https://DATABRICKS_HOST
07:59 INFO [databricks.labs.dqx.installer.workflow_installer] Removing job_id=419986752944811, as it is no longer needed
07:59 INFO [databricks.labs.dqx.installer.workflow_installer] Removing job_id=100565335467088, as it is no longer needed
07:59 INFO [databricks.labs.dqx.installer.workflow_installer] Removing job_id=802106398060023, as it is no longer needed
07:59 INFO [databricks.labs.dqx.installer.install] Uninstalling DQX complete
[gw2] linux -- Python 3.12.11 /home/runner/work/dqx/dqx/.venv/bin/python
❌ test_e2e_workflow_with_metrics: pyspark.errors.exceptions.connect.AnalysisException: [TABLE_OR_VIEW_NOT_FOUND] The table or view `main`.`dummy_selzg`.`metrics_twfapr` cannot be found. Verify the spelling and correctness of the schema and catalog. (8m28.732s)
... (skipped 17680 bytes)
parsed from ''
07:54 WARNING [databricks.labs.lsql.dashboards] Parsing unsupported field in dashboard.yml: tiles.00_2_dq_error_types.hidden
07:54 WARNING [databricks.labs.lsql.dashboards] Parsing : No expression was parsed from ''
07:54 INFO [databricks.labs.dqx.installer.dashboard_installer] Installing 'DQX_Quality_Dashboard' dashboard in '/Users/3fe685a1-96cc-4fec-8cdb-6944f5c9787e/.OVHO/dashboards'
07:54 INFO [databricks.labs.dqx.installer.workflow_installer] Creating new job configuration for step=profiler
07:54 INFO [databricks.labs.dqx.installer.workflow_installer] Creating new job configuration for step=quality-checker
07:54 INFO [databricks.labs.dqx.installer.workflow_installer] Creating new job configuration for step=e2e
07:54 INFO [databricks.labs.dqx.installer.workflow_installer] Updating configuration for step=e2e job_id=568466388162100
07:54 INFO [databricks.labs.dqx.installer.workflow_installer] Updating configuration for step=e2e job_id=568466388162100
07:54 INFO [databricks.labs.dqx.installer.workflow_installer] Updating configuration for step=e2e job_id=568466388162100
07:54 INFO [databricks.labs.dqx.installer.install] Installation completed successfully!
07:54 INFO [databricks.labs.dqx.checks_storage] Saving quality rules (checks) to '/Users/3fe685a1-96cc-4fec-8cdb-6944f5c9787e/.OVHO/checks.yml' in the workspace.
07:54 INFO [databricks.labs.dqx.installer.workflow_installer] Started e2e workflow: https://DATABRICKS_HOST#job/568466388162100/runs/770013585833644
08:02 INFO [databricks.labs.dqx.installer.workflow_installer] Completed e2e workflow run 770013585833644 with state: RunResultState.SUCCESS
08:02 INFO [databricks.labs.dqx.installer.workflow_installer] Completed e2e workflow run 770013585833644 duration: 0:08:01.300000 (2025-10-07 07:54:35.637000+00:00 thru 2025-10-07 08:02:36.937000+00:00)
08:02 INFO [databricks.labs.dqx.installer.workflow_installer] ---------- REMOTE LOGS --------------
08:02 INFO [databricks.labs.dqx:prepare] DQX v0.9.4+4420251007075420 After workflow finishes, see debug logs at /Workspace/Users/3fe685a1-96cc-4fec-8cdb-6944f5c9787e/.OVHO/logs/e2e/run-770013585833644-0/prepare.log
08:02 INFO [databricks.labs.dqx.quality_checker.e2e_workflow:prepare] End-to-end: prepare start for run config: TEST_SCHEMA
08:02 INFO [databricks.labs.dqx:finalize] DQX v0.9.4+4420251007075420 After workflow finishes, see debug logs at /Workspace/Users/3fe685a1-96cc-4fec-8cdb-6944f5c9787e/.OVHO/logs/e2e/run-770013585833644-0/finalize.log
08:02 INFO [databricks.labs.dqx.quality_checker.e2e_workflow:finalize] End-to-end: finalize complete for run config: TEST_SCHEMA
08:02 INFO [databricks.labs.dqx.quality_checker.e2e_workflow:finalize] For more details please check the run logs of the profiler and quality checker jobs.
08:02 INFO [databricks.labs.dqx.installer.workflow_installer] ---------- END REMOTE LOGS ----------
08:02 INFO [databricks.labs.dqx.installer.install] Deleting DQX v0.9.4+4420251007075420 from https://DATABRICKS_HOST
08:02 INFO [databricks.labs.dqx.installer.workflow_installer] Removing job_id=616187741950763, as it is no longer needed
08:02 INFO [databricks.labs.dqx.installer.workflow_installer] Removing job_id=781031214722595, as it is no longer needed
08:02 INFO [databricks.labs.dqx.installer.workflow_installer] Removing job_id=568466388162100, as it is no longer needed
08:02 INFO [databricks.labs.dqx.installer.install] Uninstalling DQX complete
[gw6] linux -- Python 3.12.11 /home/runner/work/dqx/dqx/.venv/bin/python
❌ test_quality_checker_workflow_for_multiple_run_configs_table_checks_storage: databricks.labs.dqx.errors.InvalidConfigError: Run config flag is required (3m11.575s)
... (skipped 4556 bytes)
est.output_table' output table as the source table for the dashboard...
08:00 WARNING [databricks.labs.lsql.dashboards] Parsing : No expression was parsed from ''
08:00 WARNING [databricks.labs.lsql.dashboards] Parsing unsupported field in dashboard.yml: tiles.00_2_dq_error_types.hidden
08:00 WARNING [databricks.labs.lsql.dashboards] Parsing : No expression was parsed from ''
08:00 INFO [databricks.labs.dqx.installer.dashboard_installer] Installing 'DQX_Quality_Dashboard' dashboard in '/Users/3fe685a1-96cc-4fec-8cdb-6944f5c9787e/.xkpl/dashboards'
08:00 INFO [databricks.labs.dqx.installer.workflow_installer] Creating new job configuration for step=profiler
08:00 INFO [databricks.labs.dqx.installer.workflow_installer] Creating new job configuration for step=quality-checker
08:00 INFO [databricks.labs.dqx.installer.workflow_installer] Creating new job configuration for step=e2e
08:00 INFO [databricks.labs.dqx.installer.workflow_installer] Updating configuration for step=e2e job_id=6423572395703
08:00 INFO [databricks.labs.dqx.installer.workflow_installer] Updating configuration for step=e2e job_id=6423572395703
08:00 INFO [databricks.labs.dqx.installer.workflow_installer] Updating configuration for step=e2e job_id=6423572395703
08:00 INFO [databricks.labs.dqx.installer.install] Installation completed successfully!
08:00 INFO [databricks.labs.dqx.checks_storage] Saving quality rules (checks) to '/Users/3fe685a1-96cc-4fec-8cdb-6944f5c9787e/.xkpl/checks.yml' in the workspace.
08:00 INFO [databricks.labs.dqx.checks_storage] Loading quality rules (checks) from '/Users/3fe685a1-96cc-4fec-8cdb-6944f5c9787e/.xkpl/checks.yml' in the workspace.
08:00 INFO [databricks.labs.dqx.checks_storage] Saving quality rules (checks) to table 'main.dummy_ssyej.checks'
08:00 INFO [databricks.labs.dqx.checks_storage] Saving quality rules (checks) to table 'main.dummy_ssyej.checks'
08:00 INFO [databricks.labs.dqx.installer.workflow_installer] Started quality-checker workflow: https://DATABRICKS_HOST#job/108216133368396/runs/204303348545801
08:03 INFO [databricks.labs.dqx.installer.workflow_installer] ---------- REMOTE LOGS --------------
08:03 INFO [databricks.labs.dqx:apply_checks] DQX v0.9.4+4420251007080001 After workflow finishes, see debug logs at /Workspace/Users/3fe685a1-96cc-4fec-8cdb-6944f5c9787e/.xkpl/logs/quality-checker/run-204303348545801-0/apply_checks.log
08:03 INFO [databricks.labs.dqx.quality_checker.quality_checker_workflow:apply_checks] Running data quality workflow for all run configs
08:03 ERROR [databricks.labs.dqx:apply_checks] Execute `databricks workspace export //Users/3fe685a1-96cc-4fec-8cdb-6944f5c9787e/.xkpl/logs/quality-checker/run-204303348545801-0/apply_checks.log` locally to troubleshoot with more details. Run config flag is required
08:03 INFO [databricks.labs.dqx.installer.workflow_installer] ---------- END REMOTE LOGS ----------
08:03 INFO [databricks.labs.dqx.installer.install] Deleting DQX v0.9.4+4420251007080001 from https://DATABRICKS_HOST
08:03 INFO [databricks.labs.dqx.installer.workflow_installer] Removing job_id=716255192023503, as it is no longer needed
08:03 INFO [databricks.labs.dqx.installer.workflow_installer] Removing job_id=108216133368396, as it is no longer needed
08:03 INFO [databricks.labs.dqx.installer.workflow_installer] Removing job_id=6423572395703, as it is no longer needed
08:03 INFO [databricks.labs.dqx.installer.install] Uninstalling DQX complete
[gw2] linux -- Python 3.12.11 /home/runner/work/dqx/dqx/.venv/bin/python
❌ test_observer_metrics_output: chispa.dataframe_comparer.DataFramesNotEqualError: (11.486s)
... (skipped 538 bytes)
f1                                                                                                                                                             | df2  |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------+
|      Row(run_name='test_observer', input_location='main.dummy_sd7vc.input_qv78a3', output_location='main.dummy_sd7vc.output_sozioy', quarantine_location=None, checks_location=None, metric_name='input_row_count', metric_value='4', error_column_name='_errors', warning_column_name='_warnings', user_metadata=None)     | None |
|      Row(run_name='test_observer', input_location='main.dummy_sd7vc.input_qv78a3', output_location='main.dummy_sd7vc.output_sozioy', quarantine_location=None, checks_location=None, metric_name='error_row_count', metric_value='1', error_column_name='_errors', warning_column_name='_warnings', user_metadata=None)     | None |
|     Row(run_name='test_observer', input_location='main.dummy_sd7vc.input_qv78a3', output_location='main.dummy_sd7vc.output_sozioy', quarantine_location=None, checks_location=None, metric_name='warning_row_count', metric_value='1', error_column_name='_errors', warning_column_name='_warnings', user_metadata=None)    | None |
|      Row(run_name='test_observer', input_location='main.dummy_sd7vc.input_qv78a3', output_location='main.dummy_sd7vc.output_sozioy', quarantine_location=None, checks_location=None, metric_name='valid_row_count', metric_value='2', error_column_name='_errors', warning_column_name='_warnings', user_metadata=None)     | None |
|     Row(run_name='test_observer', input_location='main.dummy_sd7vc.input_qv78a3', output_location='main.dummy_sd7vc.output_sozioy', quarantine_location=None, checks_location=None, metric_name='avg_error_age', metric_value='35.0', error_column_name='_errors', warning_column_name='_warnings', user_metadata=None)     | None |
| Row(run_name='test_observer', input_location='main.dummy_sd7vc.input_qv78a3', output_location='main.dummy_sd7vc.output_sozioy', quarantine_location=None, checks_location=None, metric_name='total_warning_salary', metric_value='55000', error_column_name='_errors', warning_column_name='_warnings', user_metadata=None) | None |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------+
[gw7] linux -- Python 3.12.11 /home/runner/work/dqx/dqx/.venv/bin/python
08:03 INFO [databricks.labs.dqx.engine] Applying checks to main.dummy_sd7vc.input_qv78a3
08:03 INFO [databricks.labs.dqx.io] Saving data to main.dummy_sd7vc.output_sozioy table
08:03 INFO [databricks.labs.dqx.io] Saving data to main.dummy_sd7vc.metrics_fdbc0o table
08:03 INFO [databricks.labs.dqx.engine] Applying checks to main.dummy_sd7vc.input_qv78a3
08:03 INFO [databricks.labs.dqx.io] Saving data to main.dummy_sd7vc.output_sozioy table
08:03 INFO [databricks.labs.dqx.io] Saving data to main.dummy_sd7vc.metrics_fdbc0o table
[gw7] linux -- Python 3.12.11 /home/runner/work/dqx/dqx/.venv/bin/python
❌ test_observer_metrics_output_with_quarantine: chispa.dataframe_comparer.DataFramesNotEqualError: (13.597s)
... (skipped 1042 bytes)
------------------------------------------------------------------------------+------+
|      Row(run_name='test_observer', input_location='main.dummy_sita5.input_lvw3qc', output_location='main.dummy_sita5.output_hnrbrq', quarantine_location='main.dummy_sita5.quarantine_a46wpp', checks_location=None, metric_name='input_row_count', metric_value='4', error_column_name='_errors', warning_column_name='_warnings', user_metadata=None)     | None |
|      Row(run_name='test_observer', input_location='main.dummy_sita5.input_lvw3qc', output_location='main.dummy_sita5.output_hnrbrq', quarantine_location='main.dummy_sita5.quarantine_a46wpp', checks_location=None, metric_name='error_row_count', metric_value='1', error_column_name='_errors', warning_column_name='_warnings', user_metadata=None)     | None |
|     Row(run_name='test_observer', input_location='main.dummy_sita5.input_lvw3qc', output_location='main.dummy_sita5.output_hnrbrq', quarantine_location='main.dummy_sita5.quarantine_a46wpp', checks_location=None, metric_name='warning_row_count', metric_value='1', error_column_name='_errors', warning_column_name='_warnings', user_metadata=None)    | None |
|      Row(run_name='test_observer', input_location='main.dummy_sita5.input_lvw3qc', output_location='main.dummy_sita5.output_hnrbrq', quarantine_location='main.dummy_sita5.quarantine_a46wpp', checks_location=None, metric_name='valid_row_count', metric_value='2', error_column_name='_errors', warning_column_name='_warnings', user_metadata=None)     | None |
|     Row(run_name='test_observer', input_location='main.dummy_sita5.input_lvw3qc', output_location='main.dummy_sita5.output_hnrbrq', quarantine_location='main.dummy_sita5.quarantine_a46wpp', checks_location=None, metric_name='avg_error_age', metric_value='35.0', error_column_name='_errors', warning_column_name='_warnings', user_metadata=None)     | None |
| Row(run_name='test_observer', input_location='main.dummy_sita5.input_lvw3qc', output_location='main.dummy_sita5.output_hnrbrq', quarantine_location='main.dummy_sita5.quarantine_a46wpp', checks_location=None, metric_name='total_warning_salary', metric_value='55000', error_column_name='_errors', warning_column_name='_warnings', user_metadata=None) | None |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------+
[gw7] linux -- Python 3.12.11 /home/runner/work/dqx/dqx/.venv/bin/python
08:03 INFO [databricks.labs.dqx.engine] Applying checks to main.dummy_sita5.input_lvw3qc
08:03 INFO [databricks.labs.dqx.io] Saving data to main.dummy_sita5.output_hnrbrq table
08:03 INFO [databricks.labs.dqx.io] Saving data to main.dummy_sita5.quarantine_a46wpp table
08:04 INFO [databricks.labs.dqx.io] Saving data to main.dummy_sita5.metrics_hx5dnq table
08:03 INFO [databricks.labs.dqx.engine] Applying checks to main.dummy_sita5.input_lvw3qc
08:03 INFO [databricks.labs.dqx.io] Saving data to main.dummy_sita5.output_hnrbrq table
08:03 INFO [databricks.labs.dqx.io] Saving data to main.dummy_sita5.quarantine_a46wpp table
08:04 INFO [databricks.labs.dqx.io] Saving data to main.dummy_sita5.metrics_hx5dnq table
[gw7] linux -- Python 3.12.11 /home/runner/work/dqx/dqx/.venv/bin/python
❌ test_streaming_observer_metrics_output: pyspark.errors.exceptions.connect.AnalysisException: [TABLE_OR_VIEW_NOT_FOUND] The table or view `main`.`dummy_seqiy`.`metrics_p3llvn` cannot be found. Verify the spelling and correctness of the schema and catalog. (12.845s)
... (skipped 10067 bytes)
t.service.RequestContext.withContext(RequestContext.scala:349)
	at com.databricks.spark.connect.service.RequestContext.runWith(RequestContext.scala:329)
	at com.databricks.spark.connect.service.AuthenticationInterceptor$AuthenticatedServerCallListener.onHalfClose(AuthenticationInterceptor.scala:381)
	at grpc_shaded.io.grpc.PartialForwardingServerCallListener.onHalfClose(PartialForwardingServerCallListener.java:35)
	at grpc_shaded.io.grpc.ForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:23)
	at grpc_shaded.io.grpc.ForwardingServerCallListener$SimpleForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:40)
	at grpc_shaded.io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:351)
	at grpc_shaded.io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:861)
	at grpc_shaded.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
	at grpc_shaded.io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
	at org.apache.spark.util.threads.SparkThreadLocalCapturingRunnable.$anonfun$run$1(SparkThreadLocalForwardingThreadPoolExecutor.scala:165)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at com.databricks.util.LexicalThreadLocal$Handle.runWith(LexicalThreadLocal.scala:63)
	at org.apache.spark.util.threads.SparkThreadLocalCapturingHelper.$anonfun$runWithCaptured$6(SparkThreadLocalForwardingThreadPoolExecutor.scala:119)
	at com.databricks.sql.transaction.tahoe.mst.MSTThreadHelper$.runWithMstTxnId(MSTThreadHelper.scala:57)
	at org.apache.spark.util.threads.SparkThreadLocalCapturingHelper.$anonfun$runWithCaptured$5(SparkThreadLocalForwardingThreadPoolExecutor.scala:118)
	at com.databricks.spark.util.IdentityClaim$.withClaim(IdentityClaim.scala:48)
	at org.apache.spark.util.threads.SparkThreadLocalCapturingHelper.$anonfun$runWithCaptured$4(SparkThreadLocalForwardingThreadPoolExecutor.scala:117)
	at com.databricks.unity.UCSEphemeralState$Handle.runWith(UCSEphemeralState.scala:51)
	at org.apache.spark.util.threads.SparkThreadLocalCapturingHelper.runWithCaptured(SparkThreadLocalForwardingThreadPoolExecutor.scala:116)
	at org.apache.spark.util.threads.SparkThreadLocalCapturingHelper.runWithCaptured$(SparkThreadLocalForwardingThreadPoolExecutor.scala:93)
	at org.apache.spark.util.threads.SparkThreadLocalCapturingRunnable.runWithCaptured(SparkThreadLocalForwardingThreadPoolExecutor.scala:162)
	at org.apache.spark.util.threads.SparkThreadLocalCapturingRunnable.run(SparkThreadLocalForwardingThreadPoolExecutor.scala:165)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.lang.Thread.run(Thread.java:840)
[gw9] linux -- Python 3.12.11 /home/runner/work/dqx/dqx/.venv/bin/python
08:04 INFO [databricks.labs.dqx.engine] Applying checks to main.dummy_seqiy.input_kfdfwo
08:04 INFO [databricks.labs.dqx.io] Saving data to main.dummy_seqiy.output_fbeigo table
08:04 INFO [databricks.labs.dqx.io] Saving data to main.dummy_seqiy.metrics_p3llvn table
08:04 INFO [databricks.labs.dqx.engine] Applying checks to main.dummy_seqiy.input_kfdfwo
08:04 INFO [databricks.labs.dqx.io] Saving data to main.dummy_seqiy.output_fbeigo table
08:04 INFO [databricks.labs.dqx.io] Saving data to main.dummy_seqiy.metrics_p3llvn table
[gw9] linux -- Python 3.12.11 /home/runner/work/dqx/dqx/.venv/bin/python
❌ test_save_results_in_table_batch_with_metrics: chispa.dataframe_comparer.DataFramesNotEqualError: (10.446s)
chispa.dataframe_comparer.DataFramesNotEqualError: 
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------+
|                                                                                                                                                                 df1                                                                                                                                                                  | df2  |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------+
|  Row(run_name='test_save_batch_observer', input_location=None, output_location='main.dummy_sxjvs.output_ncbpyw', quarantine_location='main.dummy_sxjvs.quarantine_fo6fj9', checks_location=None, metric_name='input_row_count', metric_value='4', error_column_name='_errors', warning_column_name='_warnings', user_metadata=None)  | None |
|  Row(run_name='test_save_batch_observer', input_location=None, output_location='main.dummy_sxjvs.output_ncbpyw', quarantine_location='main.dummy_sxjvs.quarantine_fo6fj9', checks_location=None, metric_name='error_row_count', metric_value='1', error_column_name='_errors', warning_column_name='_warnings', user_metadata=None)  | None |
| Row(run_name='test_save_batch_observer', input_location=None, output_location='main.dummy_sxjvs.output_ncbpyw', quarantine_location='main.dummy_sxjvs.quarantine_fo6fj9', checks_location=None, metric_name='warning_row_count', metric_value='1', error_column_name='_errors', warning_column_name='_warnings', user_metadata=None) | None |
|  Row(run_name='test_save_batch_observer', input_location=None, output_location='main.dummy_sxjvs.output_ncbpyw', quarantine_location='main.dummy_sxjvs.quarantine_fo6fj9', checks_location=None, metric_name='valid_row_count', metric_value='2', error_column_name='_errors', warning_column_name='_warnings', user_metadata=None)  | None |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------+
[gw9] linux -- Python 3.12.11 /home/runner/work/dqx/dqx/.venv/bin/python
08:04 INFO [databricks.labs.dqx.io] Saving data to main.dummy_sxjvs.output_ncbpyw table
08:04 INFO [databricks.labs.dqx.io] Saving data to main.dummy_sxjvs.quarantine_fo6fj9 table
08:04 INFO [databricks.labs.dqx.io] Saving data to main.dummy_sxjvs.metrics_migj7o table
08:04 INFO [databricks.labs.dqx.io] Saving data to main.dummy_sxjvs.output_ncbpyw table
08:04 INFO [databricks.labs.dqx.io] Saving data to main.dummy_sxjvs.quarantine_fo6fj9 table
08:04 INFO [databricks.labs.dqx.io] Saving data to main.dummy_sxjvs.metrics_migj7o table
[gw9] linux -- Python 3.12.11 /home/runner/work/dqx/dqx/.venv/bin/python
❌ test_save_results_in_table_streaming_with_metrics: pyspark.errors.exceptions.connect.AnalysisException: [TABLE_OR_VIEW_NOT_FOUND] The table or view `main`.`dummy_sj69h`.`metrics_bqkexy` cannot be found. Verify the spelling and correctness of the schema and catalog. (11.513s)
... (skipped 9889 bytes)
.withValue(AttributionContextUtils.scala:242)
	at com.databricks.spark.connect.service.RequestContext.$anonfun$runWith$1(RequestContext.scala:336)
	at com.databricks.spark.connect.service.RequestContext.withContext(RequestContext.scala:349)
	at com.databricks.spark.connect.service.RequestContext.runWith(RequestContext.scala:329)
	at com.databricks.spark.connect.service.AuthenticationInterceptor$AuthenticatedServerCallListener.onHalfClose(AuthenticationInterceptor.scala:381)
	at grpc_shaded.io.grpc.PartialForwardingServerCallListener.onHalfClose(PartialForwardingServerCallListener.java:35)
	at grpc_shaded.io.grpc.ForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:23)
	at grpc_shaded.io.grpc.ForwardingServerCallListener$SimpleForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:40)
	at grpc_shaded.io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:351)
	at grpc_shaded.io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:861)
	at grpc_shaded.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
	at grpc_shaded.io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
	at org.apache.spark.util.threads.SparkThreadLocalCapturingRunnable.$anonfun$run$1(SparkThreadLocalForwardingThreadPoolExecutor.scala:165)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at com.databricks.util.LexicalThreadLocal$Handle.runWith(LexicalThreadLocal.scala:63)
	at org.apache.spark.util.threads.SparkThreadLocalCapturingHelper.$anonfun$runWithCaptured$6(SparkThreadLocalForwardingThreadPoolExecutor.scala:119)
	at com.databricks.sql.transaction.tahoe.mst.MSTThreadHelper$.runWithMstTxnId(MSTThreadHelper.scala:57)
	at org.apache.spark.util.threads.SparkThreadLocalCapturingHelper.$anonfun$runWithCaptured$5(SparkThreadLocalForwardingThreadPoolExecutor.scala:118)
	at com.databricks.spark.util.IdentityClaim$.withClaim(IdentityClaim.scala:48)
	at org.apache.spark.util.threads.SparkThreadLocalCapturingHelper.$anonfun$runWithCaptured$4(SparkThreadLocalForwardingThreadPoolExecutor.scala:117)
	at com.databricks.unity.UCSEphemeralState$Handle.runWith(UCSEphemeralState.scala:51)
	at org.apache.spark.util.threads.SparkThreadLocalCapturingHelper.runWithCaptured(SparkThreadLocalForwardingThreadPoolExecutor.scala:116)
	at org.apache.spark.util.threads.SparkThreadLocalCapturingHelper.runWithCaptured$(SparkThreadLocalForwardingThreadPoolExecutor.scala:93)
	at org.apache.spark.util.threads.SparkThreadLocalCapturingRunnable.runWithCaptured(SparkThreadLocalForwardingThreadPoolExecutor.scala:162)
	at org.apache.spark.util.threads.SparkThreadLocalCapturingRunnable.run(SparkThreadLocalForwardingThreadPoolExecutor.scala:165)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.lang.Thread.run(Thread.java:840)
[gw7] linux -- Python 3.12.11 /home/runner/work/dqx/dqx/.venv/bin/python
08:04 INFO [databricks.labs.dqx.io] Saving data to main.dummy_sj69h.output_qqsubd table
08:04 INFO [databricks.labs.dqx.io] Saving data to main.dummy_sj69h.metrics_bqkexy table
08:04 INFO [databricks.labs.dqx.io] Saving data to main.dummy_sj69h.output_qqsubd table
08:04 INFO [databricks.labs.dqx.io] Saving data to main.dummy_sj69h.metrics_bqkexy table
[gw7] linux -- Python 3.12.11 /home/runner/work/dqx/dqx/.venv/bin/python
❌ test_quality_checker_workflow_with_custom_install_folder: [gw6] linux -- Python 3.12.11 /home/runner/work/dqx/dqx/.venv/bin/python (630ms)
[gw6] linux -- Python 3.12.11 /home/runner/work/dqx/dqx/.venv/bin/python
file /home/runner/work/dqx/dqx/tests/integration/test_quality_checker_workflow.py, line 139
  def test_quality_checker_workflow_with_custom_install_folder(
E       fixture 'setup_workflows_with_custom_folder' not found
>       available fixtures: acc, benchmark, benchmark_weave, cache, capfd, capfdbinary, caplog, capsys, capsysbinary, checks_json_content, checks_json_invalid_content, checks_yaml_content, checks_yaml_invalid_content, class_mocker, cov, debug_env, debug_env_name, doctest_namespace, env_or_skip, expected_checks, expected_quality_checking_output, hello, installation_ctx, installation_ctx_custom_install_folder, is_in_debug, log_account_link, log_workspace_link, make_acc_group, make_alert_permissions, make_authorization_permissions, make_catalog, make_check_file_as_json, make_check_file_as_yaml, make_cluster, make_cluster_permissions, make_cluster_policy, make_cluster_policy_permissions, make_dashboard_permissions, make_directory, make_directory_permissions, make_empty_local_json_file, make_empty_local_yaml_file, make_experiment, make_experiment_permissions, make_feature_table, make_feature_table_permissions, make_group, make_instance_pool, make_instance_pool_permissions, make_invalid_check_file_as_json, make_invalid_check_file_as_yaml, make_invalid_local_check_file_as_json, make_invalid_local_check_file_as_yaml, make_job, make_job_permissions, make_lakeview_dashboard_permissions, make_local_check_file_as_json, make_local_check_file_as_yaml, make_local_check_file_as_yaml_diff_ext, make_model, make_notebook, make_notebook_permissions, make_pipeline, make_pipeline_permissions, make_query, make_query_permissions, make_random, make_registered_model_permissions, make_repo, make_repo_permissions, make_run_as, make_schema, make_secret_scope, make_secret_scope_acl, make_serving_endpoint, make_serving_endpoint_permissions, make_storage_credential, make_table, make_udf, make_user, make_volume, make_volume_check_file_as_json, make_volume_check_file_as_yaml, make_volume_invalid_check_file_as_json, make_volume_invalid_check_file_as_yaml, make_warehouse, make_warehouse_permissions, make_workspace_file, make_workspace_file_path_permissions, make_workspace_file_permissions, mocker, module_mocker, monkeypatch, no_cover, package_mocker, product_info, pytestconfig, record_property, record_testsuite_property, record_xml_attribute, recwarn, serverless_installation_ctx, session_mocker, set_utc_timezone, setup_serverless_workflows, setup_workflows, setup_workflows_with_metrics, skip_if_runtime_not_geo_compatible, spark, sql_backend, sql_exec, sql_fetch_all, testrun_uid, tmp_path, tmp_path_factory, tmpdir, tmpdir_factory, watchdog_purge_suffix, watchdog_remove_after, webbrowser_open, worker_id, ws
>       use 'pytest --fixtures [testpath]' for help on them.

/home/runner/work/dqx/dqx/tests/integration/test_quality_checker_workflow.py:139
[gw6] linux -- Python 3.12.11 /home/runner/work/dqx/dqx/.venv/bin/python
❌ test_list_tables: databricks.sdk.errors.platform.NotFound: Catalog 'ucx_9tmnasv5gbmchcct' does not exist. (18m39.399s)
databricks.sdk.errors.platform.NotFound: Catalog 'ucx_9tmnasv5gbmchcct' does not exist.
[gw1] linux -- Python 3.12.11 /home/runner/work/dqx/dqx/.venv/bin/python
[gw1] linux -- Python 3.12.11 /home/runner/work/dqx/dqx/.venv/bin/python

Flaky tests:

  • 🤪 test_e2e_workflow_for_patterns (1m47.315s)
  • 🤪 test_profiler_workflow (1m20.44s)
  • 🤪 test_e2e_workflow_for_patterns_exclude_patterns (3m44.869s)
  • 🤪 test_e2e_workflow_for_patterns_exclude_output (1m28.257s)
  • 🤪 test_quality_checker_workflow_for_patterns (1m17.628s)

Running from acceptance #2724

Copilot AI left a comment

Pull Request Overview

This PR introduces summary metrics as outputs of quality checking methods, using Spark's Observation feature to track data quality metrics. The DQObserver class manages Spark observations and tracks both default metrics (input/error/warning/valid counts) and custom user-defined SQL expressions.

  • Adds DQObserver class for managing Spark observations and tracking summary metrics
  • Updates DQEngine methods to return tuples with both DataFrames and observations
  • Integrates metrics collection with existing workflows and configuration system

Reviewed Changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 2 comments.

Summary per file:

  • src/databricks/labs/dqx/observer.py: New DQObserver class for managing Spark observations and tracking metrics
  • src/databricks/labs/dqx/engine.py: Updated engine methods to support metrics collection and storage
  • src/databricks/labs/dqx/config.py: Added metrics configuration fields to RunConfig and WorkspaceConfig
  • tests/unit/test_observer.py: Unit tests for the DQObserver class functionality
  • tests/integration/test_summary_metrics.py: Integration tests for end-to-end metrics collection
  • tests/integration/test_metrics_workflow.py: Tests for workflow-based metrics collection


Comment on lines 621 to 622
save_dataframe_as_table(metrics_df, metrics_config)
save_dataframe_as_table(metrics_df, metrics_config)
Contributor: agree

assert run_config.metrics_config is None

ctx.deployed_workflows.run_workflow("quality-checker", run_config.name)
assert not ws.tables.exists(run_config.metrics_config.location).table_exists
Copilot AI commented on Aug 29, 2025:

This assertion will fail because run_config.metrics_config is None when metrics are disabled, causing a NoneType attribute access error. The assertion should check that run_config.metrics_config is None instead.

Suggested change:
-assert not ws.tables.exists(run_config.metrics_config.location).table_exists
+# Cannot check for metrics table existence as metrics_config is None.


Contributor: true

@mwojtyczka (Contributor) left a comment:
Going in the right direction

- `sample_seed`: seed for reproducible sampling.
- `limit`: maximum number of records to analyze.
- `extra_params`: (optional) extra parameters to pass to the jobs such as result column names and user_metadata
- `custom_metrics`: (optional) list of Spark SQL expressions for capturing custom summary metrics.
Contributor: it would be worth adding that a set of default metrics is always used regardless.

Contributor Author:
Do you think this is clear?

By default, the number of input, warning, and error rows will be tracked. When custom metrics are defined, they will be tracked in addition to the default metrics.
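For illustration, custom metrics are plain Spark SQL expression strings (as in the integration tests below); the constructor arguments shown here are assumptions and may differ from the final API:

# Sketch only: argument names are assumptions, not the confirmed DQObserver signature.
custom_metrics = ["avg(age) as avg_age", "sum(salary) as total_salary"]
observer = DQObserver(name="my_run", custom_metrics=custom_metrics)
engine = DQEngine(ws, observer=observer)

checked_df, observation = engine.apply_checks(input_df, checks)
checked_df.count()
# observation.get now includes avg_age and total_salary alongside the default counters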

A list of Spark SQL expressions as strings
"""
result_columns = self.result_columns or {}
errors_column = result_columns.get(ColumnArguments.ERRORS.value, DefaultColumnNames.ERRORS.value)
Contributor: it would be better to provide this directly to avoid repeating the engine implementation. I would set this in the engine. Otherwise, users would have to provide the extra params twice, which would be error prone.

Contributor Author:
We need a way to update the column names in SQL expressions used for default metrics. I added a _set_column_names method in the DQMetricsObserver that is called whenever DQEngine is initialized with an observer.
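A rough sketch of what such a method could look like; everything except the method name _set_column_names (taken from the comment above) is an assumption for illustration:

def _set_column_names(self, errors_column: str = "_errors", warnings_column: str = "_warnings") -> None:
    """Rebuild the default metric expressions using the engine's result column names (sketch only)."""
    self._default_metrics = [
        "count(1) as input_row_count",
        f"count_if({errors_column} is not null) as error_row_count",
        f"count_if({warnings_column} is not null) as warning_row_count",
        f"count_if({errors_column} is null and {warnings_column} is null) as valid_row_count",
    ]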

"""
Spark `Observation` which can be attached to a `DataFrame` to track summary metrics. Metrics will be collected
when the 1st action is triggered on the attached `DataFrame`. Subsequent operations on the attached `DataFrame`
will not update the observed metrics. See: https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Observation.html
Contributor: make this a proper link so that it renders nicely in the API docs.

Contributor Author: Updated
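For context, the behavior described in the quoted docstring mirrors plain PySpark Observation usage; a minimal sketch independent of DQX (data and column names are hypothetical):

from pyspark.sql import Observation, SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, None), (2, "x")], ["id", "_errors"])  # hypothetical data

obs = Observation("metrics")
observed = df.observe(obs, F.count(F.lit(1)).alias("input_row_count"))

observed.collect()  # the first action triggers metric collection
print(obs.get)      # e.g. {'input_row_count': 2}; later actions do not update it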

assert run_config.metrics_config is None

ctx.deployed_workflows.run_workflow("quality-checker", run_config.name)
assert not ws.tables.exists(run_config.metrics_config.location).table_exists
Contributor: true


def test_engine_with_observer_before_action(ws, spark):
"""Test that summary metrics are empty before running a Spark action."""
custom_metrics = ["avg(age) as avg_age", "sum(salary) as total_salary"]
Contributor: do we need custom metrics here? It seems we can remove them for this test.

Contributor Author: Updated

ghanse added 3 commits October 6, 2025 10:11
…trics

# Conflicts:
#	docs/dqx/docs/reference/benchmarks.mdx
#	src/databricks/labs/dqx/engine.py
#	tests/perf/.benchmarks/baseline.json
Development

Successfully merging this pull request may close these issues.

[FEATURE]: Add summary statistics as an additional output of quality checking
