LLM assisted rules generation #577
base: main
Conversation
❌ 359/360 passed, 3 flaky, 1 failed, 1 skipped, 3h13m6s total

Failed tests:
❌ test_e2e_workflow: pyspark.errors.exceptions.connect.SparkConnectGrpcException: () BAD_REQUEST: session_id is no longer usable. Generate a new session_id by detaching and reattaching the compute and then try again [sessionId=6c16fe1b-7232-4f41-9f9b-d997bf6a6890, reason=INACTIVITY_TIMEOUT]. (requestId=8537a2da-e41e-4ecd-a0b5-693b24c0c70c) (12m28.318s)

Flaky tests:

Running from acceptance #2669
Pull Request Overview
This PR introduces LLM-assisted data quality rules generation capabilities to the DQX framework. The implementation leverages DSPy for LLM orchestration and includes comprehensive training examples and validation mechanisms.
- Adds core LLM functionality with DSPy integration for automated data quality rule generation
- Implements training dataset and validation framework for optimizing rule generation accuracy
- Provides utility functions for schema metadata extraction and function documentation
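As a rough illustration of the schema-metadata-extraction utility mentioned above, here is a minimal, hypothetical sketch. The function name and the JSON-serialized input format (as produced by PySpark's `StructType.json()`) are assumptions for illustration, not the PR's actual API:

```python
import json

def extract_schema_metadata(schema_json: str) -> list[dict]:
    """Hypothetical helper: turn a JSON-serialized schema (e.g. the output of
    StructType.json() in PySpark) into a compact per-column summary that an
    LLM prompt can consume. Not the PR's actual implementation."""
    schema = json.loads(schema_json)
    return [
        {"name": f["name"], "type": f["type"], "nullable": f.get("nullable", True)}
        for f in schema.get("fields", [])
    ]

example_schema = json.dumps({
    "type": "struct",
    "fields": [
        {"name": "product_code", "type": "string", "nullable": False},
        {"name": "price", "type": "double", "nullable": True},
    ],
})
print(extract_schema_metadata(example_schema))
```

Feeding the model a compact summary like this, rather than the raw schema object, keeps the prompt small and model-agnostic.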
Reviewed Changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| src/databricks/labs/dqx/llm/llm_core.py | Core LLM implementation with DSPy module, validation logic, and compiler configuration |
| src/databricks/labs/dqx/llm/llm_engine.py | High-level interface for generating business rules using LLM |
| src/databricks/labs/dqx/llm/utils.py | Utility functions for schema extraction, training data loading, and function documentation |
| src/databricks/labs/dqx/llm/resources/training_examples.yml | Training dataset with business scenarios and expected quality rules |
| tests/unit/test_llm_utils.py | Unit tests for LLM utility functions and training data validation |
| pyproject.toml | Adds dspy dependency for LLM functionality |
@@ -0,0 +1,93 @@
- name: "product_code_not_null_or_empty"
this should be extended to contain more examples
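For context, a training entry pairing a business scenario with its expected DQX rule might look roughly like this. The `criticality`/`check`/`function`/`arguments` structure follows DQX's standard checks format, but the surrounding fields and the exact shape of this file are assumptions, not the file's actual contents:

```yaml
- name: "product_code_not_null_or_empty"
  # Hypothetical field: the business requirement given to the LLM.
  business_description: "Every product must have a product code"
  expected_rules:
    - criticality: error
      check:
        function: is_not_null_and_not_empty
        arguments:
          column: product_code
```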
logger.warning(f"✗ Rules validation errors: {validation_status.errors}")

# Content similarity score (30%)
similarity_score = _calculate_rule_similarity(expected_rules, actual_rules)
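The snippet references a helper `_calculate_rule_similarity`. One stdlib-only way such a score could be computed (a sketch under assumptions, not the PR's actual implementation) is a fuzzy match over the canonical JSON forms of the two rule lists:

```python
import difflib
import json

def calculate_rule_similarity(expected_rules: list[dict], actual_rules: list[dict]) -> float:
    """Sketch of a 0.0-1.0 similarity score between two rule lists, based on
    fuzzy matching of their canonical JSON serializations. Hypothetical
    stand-in for the PR's _calculate_rule_similarity."""
    expected = json.dumps(expected_rules, sort_keys=True)
    actual = json.dumps(actual_rules, sort_keys=True)
    return difflib.SequenceMatcher(None, expected, actual).ratio()

rules = [{"criticality": "error",
          "check": {"function": "is_not_null", "arguments": {"column": "id"}}}]
print(calculate_rule_similarity(rules, rules))  # identical lists score 1.0
```

Sorting keys before comparison makes the score insensitive to dictionary key ordering, which would otherwise penalize semantically identical rules.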
this is not needed: DQEngineCore.validate_checks(actual_rules) validates that the generated rule is correct, so there should be no need to compare against expected rules. The comparison would also be problematic to maintain, because many arguments are optional and providing examples of all possible combinations won't be possible
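The reviewer's suggestion amounts to scoring on structural validity alone. A sketch of that simpler metric, with the validator injected as a callable so the example stays self-contained (in real code `DQEngineCore.validate_checks` would be passed in; the assumed return type here, a list of error messages, is a simplification):

```python
from typing import Callable

def validity_score(actual_rules: list[dict],
                   validate: Callable[[list[dict]], list[str]]) -> float:
    """Award full credit only when the validator reports no errors.
    `validate` stands in for DQEngineCore.validate_checks; treating its
    result as a list of error strings is a simplifying assumption."""
    errors = validate(actual_rules)
    return 1.0 if not errors else 0.0

# Stub validator for illustration: requires each rule to carry a "check" key.
def stub_validate(rules: list[dict]) -> list[str]:
    return [] if all("check" in r for r in rules) else ["missing 'check' key"]

print(validity_score([{"check": {"function": "is_not_null"}}], stub_validate))  # 1.0
print(validity_score([{"name": "no_check_here"}], stub_validate))  # 0.0
```

Scoring against the validator rather than against expected rules sidesteps the maintenance problem the reviewer raises: optional arguments never need exhaustive example coverage.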
# Json parsing score (20%)
try:
    actual_rules = json.loads(actual)
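For reference, the JSON-parsing component presumably degrades gracefully when the model output is not valid JSON. A minimal sketch of that pattern (the weight value and function name are assumptions taken from the `(20%)` comment above):

```python
import json

def json_parsing_score(actual: str, weight: float = 0.2) -> float:
    """Award the full weight only when the raw model output parses as JSON.
    Sketch of the metric component shown above, not the PR's exact code."""
    try:
        json.loads(actual)
        return weight
    except json.JSONDecodeError:
        return 0.0

print(json_parsing_score('[{"check": {"function": "is_not_null"}}]'))  # 0.2
print(json_parsing_score("not json"))  # 0.0
```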
this does not seem to be necessary, since the forward method in DQRuleGeneration already makes sure valid JSON is output
Changes
Linked issues
Resolves #..
Tests