souravg-db2
Changes

Linked issues

Resolves #..

Tests

  • manually tested
  • added unit tests
  • added integration tests
  • added end-to-end tests
  • added performance tests

github-actions bot commented Sep 15, 2025

❌ 359/360 passed, 3 flaky, 1 failed, 1 skipped, 3h13m6s total

❌ test_e2e_workflow: pyspark.errors.exceptions.connect.SparkConnectGrpcException: () BAD_REQUEST: session_id is no longer usable. Generate a new session_id by detaching and reattaching the compute and then try again [sessionId=6c16fe1b-7232-4f41-9f9b-d997bf6a6890, reason=INACTIVITY_TIMEOUT]. (requestId=8537a2da-e41e-4ecd-a0b5-693b24c0c70c) (12m28.318s)
[gw5] linux -- Python 3.12.11 /home/runner/work/dqx/dqx/.venv/bin/python
22:03 INFO [databricks.labs.dqx.installer.install] Please answer a couple of questions to provide TEST_SCHEMA DQX run configuration. The configuration can also be updated manually after the installation.
22:03 INFO [databricks.labs.dqx.installer.install] DQX will be installed in the TEST_SCHEMA location: '/Users/<your_user>/.dqx'
22:03 INFO [databricks.labs.dqx.installer.install] Installing DQX v0.9.3+1820251002220314
22:03 INFO [databricks.labs.dqx.installer.dashboard_installer] Creating dashboards...
22:03 INFO [databricks.labs.dqx.installer.dashboard_installer] Reading dashboard assets from /home/runner/work/dqx/dqx/src/databricks/labs/dqx/queries/quality/dashboard...
22:03 INFO [databricks.labs.dqx.installer.dashboard_installer] Using 'main.dqx_test.output_table' output table as the source table for the dashboard...
22:03 WARNING [databricks.labs.lsql.dashboards] Parsing : No expression was parsed from ''
22:03 WARNING [databricks.labs.lsql.dashboards] Parsing unsupported field in dashboard.yml: tiles.00_2_dq_error_types.hidden
22:03 WARNING [databricks.labs.lsql.dashboards] Parsing : No expression was parsed from ''
22:03 INFO [databricks.labs.dqx.installer.dashboard_installer] Installing 'DQX_Quality_Dashboard' dashboard in '/Users/3fe685a1-96cc-4fec-8cdb-6944f5c9787e/.QQS9/dashboards'
22:03 INFO [databricks.labs.dqx.installer.workflow_installer] Creating new job configuration for step=profiler
22:03 INFO [databricks.labs.dqx.installer.workflow_installer] Creating new job configuration for step=quality-checker
22:03 INFO [databricks.labs.dqx.installer.workflow_installer] Creating new job configuration for step=e2e
22:03 INFO [databricks.labs.dqx.installer.workflow_installer] Updating configuration for step=e2e job_id=251088666848602
22:03 INFO [databricks.labs.dqx.installer.workflow_installer] Updating configuration for step=e2e job_id=251088666848602
22:03 INFO [databricks.labs.dqx.installer.workflow_installer] Updating configuration for step=e2e job_id=251088666848602
22:03 INFO [databricks.labs.dqx.installer.install] Installation completed successfully!
22:03 INFO [databricks.labs.dqx.installer.workflow_installer] Started e2e workflow: https://DATABRICKS_HOST#job/251088666848602/runs/398358204537636
22:15 INFO [databricks.labs.dqx.installer.workflow_installer] Completed e2e workflow run 398358204537636 with state: RunResultState.SUCCESS
22:15 INFO [databricks.labs.dqx.installer.workflow_installer] Completed e2e workflow run 398358204537636 duration: 0:11:58.313000 (2025-10-02 22:03:28.554000+00:00 thru 2025-10-02 22:15:26.867000+00:00)
22:15 INFO [databricks.labs.dqx.installer.workflow_installer] ---------- REMOTE LOGS --------------
22:15 INFO [databricks.labs.dqx:prepare] DQX v0.9.3+1820251002220314 After workflow finishes, see debug logs at /Workspace/Users/3fe685a1-96cc-4fec-8cdb-6944f5c9787e/.QQS9/logs/e2e/run-398358204537636-0/prepare.log
22:15 INFO [databricks.labs.dqx.quality_checker.e2e_workflow:prepare] End-to-end: prepare start for run config: TEST_SCHEMA
22:15 INFO [databricks.labs.dqx:finalize] DQX v0.9.3+1820251002220314 After workflow finishes, see debug logs at /Workspace/Users/3fe685a1-96cc-4fec-8cdb-6944f5c9787e/.QQS9/logs/e2e/run-398358204537636-0/finalize.log
22:15 INFO [databricks.labs.dqx.quality_checker.e2e_workflow:finalize] End-to-end: finalize complete for run config: TEST_SCHEMA
22:15 INFO [databricks.labs.dqx.quality_checker.e2e_workflow:finalize] For more details please check the run logs of the profiler and quality checker jobs.
22:15 INFO [databricks.labs.dqx.installer.workflow_installer] ---------- END REMOTE LOGS ----------
22:15 INFO [databricks.labs.dqx.checks_storage] Loading quality rules (checks) from '/Users/3fe685a1-96cc-4fec-8cdb-6944f5c9787e/.QQS9/checks.yml' in the workspace.
22:15 INFO [databricks.labs.dqx.installer.install] Deleting DQX v0.9.3+1820251002220314 from https://DATABRICKS_HOST
22:15 INFO [databricks.labs.dqx.installer.workflow_installer] Removing job_id=566147128685003, as it is no longer needed
22:15 INFO [databricks.labs.dqx.installer.workflow_installer] Removing job_id=965266542957203, as it is no longer needed
22:15 INFO [databricks.labs.dqx.installer.workflow_installer] Removing job_id=251088666848602, as it is no longer needed
22:15 INFO [databricks.labs.dqx.installer.install] Uninstalling DQX complete

Flaky tests:

  • 🤪 test_define_user_metadata_and_extract_dq_results (10.024s)
  • 🤪 test_quality_checker_workflow (3.719s)
  • 🤪 test_e2e_workflow (11m9.809s)

Running from acceptance #2669

Copilot AI left a comment

Pull Request Overview

This PR introduces LLM-assisted data quality rules generation capabilities to the DQX framework. The implementation leverages DSPy for LLM orchestration and includes comprehensive training examples and validation mechanisms.

  • Adds core LLM functionality with DSPy integration for automated data quality rule generation
  • Implements training dataset and validation framework for optimizing rule generation accuracy
  • Provides utility functions for schema metadata extraction and function documentation

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| src/databricks/labs/dqx/llm/llm_core.py | Core LLM implementation with DSPy module, validation logic, and compiler configuration |
| src/databricks/labs/dqx/llm/llm_engine.py | High-level interface for generating business rules using LLM |
| src/databricks/labs/dqx/llm/utils.py | Utility functions for schema extraction, training data loading, and function documentation |
| src/databricks/labs/dqx/llm/resources/training_examples.yml | Training dataset with business scenarios and expected quality rules |
| tests/unit/test_llm_utils.py | Unit tests for LLM utility functions and training data validation |
| pyproject.toml | Adds dspy dependency for LLM functionality |

mwojtyczka requested review from mwojtyczka and removed request for tombonfert on September 16, 2025 08:35
@@ -0,0 +1,93 @@
- name: "product_code_not_null_or_empty"

This should be extended to contain more examples.

logger.warning(f"✗ Rules validation errors: {validation_status.errors}")

# Content similarity score (30%)
similarity_score = _calculate_rule_similarity(expected_rules, actual_rules)

This is not needed: DQEngineCore.validate_checks(actual_rules) already validates that the generated rules are correct, so there should be no need to compare against the expected rules. The comparison would also be hard to maintain, because many arguments are optional and providing examples of every possible combination is not feasible.
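A minimal sketch of what a validation-only metric could look like, with no comparison against expected rules. The names `validation_only_score`, `Validator`, and `toy_validator` are hypothetical and not part of this PR. In DQX the validator role would be played by DQEngineCore.validate_checks, which, judging by the `validation_status.errors` usage in the diff, returns a status object; a plain list of error strings stands in for it here.

```python
import json
from typing import Any, Callable

Rule = dict[str, Any]
# Hypothetical validator shape: parsed rules in, list of error strings out
# (an empty list means the rules are valid).
Validator = Callable[[list[Rule]], list[str]]


def validation_only_score(actual: str, validate: Validator) -> float:
    """Return 1.0 when the generated rules parse and validate, else 0.0."""
    try:
        rules = json.loads(actual)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(rules, list):
        return 0.0
    # No similarity term: a rule set is good if the validator accepts it.
    return 0.0 if validate(rules) else 1.0


def toy_validator(rules: list[Rule]) -> list[str]:
    """Stand-in validator: every rule needs a 'check' mapping with a 'function'."""
    errors = []
    for i, rule in enumerate(rules):
        check = rule.get("check") if isinstance(rule, dict) else None
        if not isinstance(check, dict) or "function" not in check:
            errors.append(f"rule {i}: missing check.function")
    return errors
```

Passing the validator in as a callable keeps the metric independent of how many optional arguments a check accepts, which is the maintenance concern raised above.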


# Json parsing score (20%)
try:
    actual_rules = json.loads(actual)

This does not seem to be necessary, since the forward method in DQRuleGeneration already makes sure that valid JSON is output.
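For illustration only, here is one way a generation step could guarantee valid JSON before anything reaches the metric. This is a sketch under assumptions: `normalize_rules_json` is a hypothetical helper, and the actual DQRuleGeneration.forward implementation is not shown in this conversation and may work differently.

```python
import json

FENCE = "`" * 3  # markdown code fence that models sometimes wrap output in


def normalize_rules_json(raw: str) -> str:
    """Return canonical JSON for a list of rules, or raise ValueError."""
    text = raw.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line, including any language tag such as "json".
        text = text.split("\n", 1)[1] if "\n" in text else ""
        text = text.rstrip()
        if text.endswith(FENCE):
            text = text[: -len(FENCE)]
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"model output is not valid JSON: {exc}") from exc
    if not isinstance(parsed, list):
        raise ValueError("expected a JSON array of rules")
    return json.dumps(parsed)
```

If output is normalized like this at generation time, a separate JSON parsing score in the metric would always be at its maximum and therefore carries no signal.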
