llm based pk detector added #543
Open: jominjohny wants to merge 28 commits into main from llm_based_pk_identification
Commits (28):

- a0612a8 feat: add LLM-based primary key detection with clean dependency injec… (jominjohny)
- 8725a75 Merge branch 'main' into llm_based_pk_identification (mwojtyczka)
- c69c933 updated the dependency version and some fixes (jominjohny)
- a5bae54 Merge branch 'llm_based_pk_identification' of github.com:databricksla… (jominjohny)
- 4b60ac1 refactor (mwojtyczka)
- c149cd6 Merge remote-tracking branch 'origin/llm_based_pk_identification' int… (mwojtyczka)
- 5e181bb refactor (mwojtyczka)
- 85f1522 refactor (mwojtyczka)
- ced8ba5 refactor (mwojtyczka)
- 7809228 refactor (mwojtyczka)
- 477c9c1 fixes added (jominjohny)
- 5477222 table_name changed to table (jominjohny)
- fcc31df fixes added (jominjohny)
- cf4f5c1 fmt fix added (jominjohny)
- e156c34 Merge branch 'main' into llm_based_pk_identification (mwojtyczka)
- 8695c57 table_name to table (jominjohny)
- 0f5e394 fmt issues fixed (jominjohny)
- 3c07faf Update demos/dqx_llm_demo.py (jominjohny)
- 2a4ee58 Update docs/dqx/docs/guide/data_profiling.mdx (jominjohny)
- eaaf3a6 updated the review comments (jominjohny)
- 792cf8d Merge branch 'main' into llm_based_pk_identification (jominjohny)
- fb97038 Merge branch 'llm_based_pk_identification' of github.com:databricksla… (jominjohny)
- f72ccf6 Merge branch 'main' into llm_based_pk_identification (mwojtyczka)
- 0b88dee Merge branch 'main' into llm_based_pk_identification (mwojtyczka)
- 4789fa1 Merge branch 'main' into llm_based_pk_identification (jominjohny)
- c6c096a Merge branch 'main' into llm_based_pk_identification (mwojtyczka)
- 68603b0 fix added (jominjohny)
- 1773ee3 Merge branch 'main' into llm_based_pk_identification (jominjohny)
# Databricks notebook source
# MAGIC %md
# MAGIC # Using DQX for LLM-based Primary Key Detection
# MAGIC DQX provides optional LLM-based primary key detection that can intelligently identify primary key columns from table schema and metadata. The feature uses large language models to analyze table structures and suggest candidate primary keys, enhancing data profiling and quality rule generation.
# MAGIC
# MAGIC LLM-based primary key detection is completely optional and activates only when explicitly requested. Regular DQX functionality works without any LLM dependencies.
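Before the LLM-specific sections below, it helps to pin down what primary key detection means. A minimal, DQX-independent sketch of the naive baseline (illustrative only, not part of the DQX API): a single column is a PK candidate when its values are all distinct.

```python
def candidate_primary_keys(rows, columns):
    """Return the single columns whose values are unique across all rows.

    Naive baseline for comparison with LLM-based detection: a column
    qualifies as a PK candidate only if it contains no duplicate values.
    """
    candidates = []
    for i, col in enumerate(columns):
        values = [row[i] for row in rows]
        if len(set(values)) == len(values):  # all values distinct
            candidates.append(col)
    return candidates


# Tiny example: 'id' is unique, 'customer_id' repeats
rows = [(1, "A001"), (2, "A002"), (3, "A001")]
print(candidate_primary_keys(rows, ["id", "customer_id"]))  # ['id']
```

The LLM approach goes beyond this by also weighing column names, schema comments, and composite-key combinations, which a naive per-column scan cannot see.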
# COMMAND ----------

# MAGIC %md
# MAGIC # Install DQX with LLM extras
# MAGIC
# MAGIC To enable LLM-based primary key detection, DQX has to be installed with the `llm` extras:
# MAGIC
# MAGIC `%pip install databricks-labs-dqx[llm]`

# COMMAND ----------

dbutils.widgets.text("test_library_ref", "", "Test Library Ref")

if dbutils.widgets.get("test_library_ref") != "":
    %pip install 'databricks-labs-dqx[llm] @ {dbutils.widgets.get("test_library_ref")}'
else:
    %pip install databricks-labs-dqx[llm]

# COMMAND ----------

dbutils.library.restartPython()

# COMMAND ----------

from databricks.sdk import WorkspaceClient
from databricks.labs.dqx.config import ProfilerConfig, LLMConfig
from databricks.labs.dqx.profiler.profiler import DQProfiler

# COMMAND ----------

# MAGIC %md
# MAGIC ## Regular Profiling (No LLM Dependencies Required)
# MAGIC
# MAGIC By default, DQX works without any LLM dependencies. Regular profiling functionality is always available.

# COMMAND ----------

# Default configuration - no LLM features
config = ProfilerConfig()
print(f"LLM PK Detection: {config.llm_config.enable_pk_detection}")  # False by default

# This works without any LLM dependencies!
print("✅ Regular profiling works out of the box!")

# COMMAND ----------

# MAGIC %md
# MAGIC ## LLM-Based Primary Key Detection
# MAGIC
# MAGIC When explicitly requested, DQX can use LLM-based analysis to detect potential primary keys in your tables.

# COMMAND ----------

# MAGIC %md
# MAGIC ### Method 1: Configuration-based enablement

# COMMAND ----------

# Enable LLM-based PK detection via configuration
config = ProfilerConfig(llm_config=LLMConfig(enable_pk_detection=True))
print(f"LLM PK Detection: {config.llm_config.enable_pk_detection}")

# COMMAND ----------

# MAGIC %md
# MAGIC ### Method 2: Options-based enablement

# COMMAND ----------

ws = WorkspaceClient()
profiler = DQProfiler(ws)

# Enable via the options parameter
summary_stats, dq_rules = profiler.profile_table(
    "catalog.schema.table",
    options={"llm": True}  # Simple LLM enablement
)
print("✅ LLM-based profiling enabled!")

# Check if a primary key was detected
if "llm_primary_key_detection" in summary_stats:
    pk_info = summary_stats["llm_primary_key_detection"]
    print(f"Detected PK: {pk_info['detected_columns']}")
    print(f"Confidence: {pk_info['confidence']}")

# COMMAND ----------

# MAGIC %md
# MAGIC ### Method 3: Direct detection method

# COMMAND ----------

# Direct LLM-based primary key detection
result = profiler.detect_primary_keys_with_llm(
    table="customers",
    llm=True,  # Explicit LLM enablement required
    options={
        "llm_pk_detection_endpoint": "databricks-meta-llama-3-1-8b-instruct"
    }
)

if result and result.get("success", False):
    print(f"✅ Detected PK: {result['primary_key_columns']}")
    print(f"Confidence: {result['confidence']}")
    print(f"Reasoning: {result['reasoning']}")
else:
    print("❌ Primary key detection failed or returned no results")

[Review comment] the demo should also showcase the rules generation with …
# COMMAND ----------

# MAGIC %md
# MAGIC ## Rules Generation with is_unique
# MAGIC
# MAGIC Once primary keys are detected via LLM, DQX can automatically generate `is_unique` data quality rules to validate the uniqueness of those columns.

# COMMAND ----------

from databricks.labs.dqx.profiler.generator import DQGenerator

# Example: generate an is_unique rule from an LLM-detected primary key
detected_pk_columns = ["customer_id", "order_id"]  # Example detected PK
confidence = "high"
reasoning = "LLM analysis indicates these columns form a composite primary key based on schema patterns"

# Generate the is_unique rule using the detected primary key
is_unique_rule = DQGenerator.dq_generate_is_unique(
    column=",".join(detected_pk_columns),
    level="error",
    columns=detected_pk_columns,
    confidence=confidence,
    reasoning=reasoning,
    llm_detected=True,
    nulls_distinct=True  # Default behavior: NULLs are treated as distinct
)

print("Generated is_unique rule:")
print(f"Rule name: {is_unique_rule['name']}")
print(f"Function: {is_unique_rule['check']['function']}")
print(f"Columns: {is_unique_rule['check']['arguments']['columns']}")
print(f"Criticality: {is_unique_rule['criticality']}")
print(f"LLM detected: {is_unique_rule['user_metadata']['llm_based_detection']}")
print(f"Confidence: {is_unique_rule['user_metadata']['pk_detection_confidence']}")
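The prints above imply a rule dictionary with a particular nesting. A hand-built sketch of that shape, inferred purely from the demo's own print statements (the rule name here is a made-up example, and the exact structure returned by `dq_generate_is_unique` may differ):

```python
detected = {"primary_key_columns": ["customer_id", "order_id"], "confidence": "high"}

# Illustration of the rule shape that the demo's prints access;
# field names are taken from those prints, not from DQX source.
sketch_rule = {
    "name": "customer_id_order_id_is_unique",  # hypothetical naming
    "criticality": "error",
    "check": {
        "function": "is_unique",
        "arguments": {
            "columns": detected["primary_key_columns"],
            "nulls_distinct": True,  # NULLs treated as distinct by default
        },
    },
    "user_metadata": {
        "llm_based_detection": True,
        "pk_detection_confidence": detected["confidence"],
    },
}

print(sketch_rule["check"]["function"])                # is_unique
print(sketch_rule["check"]["arguments"]["columns"])    # ['customer_id', 'order_id']
```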
# COMMAND ----------

# MAGIC %md
# MAGIC ### Integrated Workflow: LLM Detection + Rule Generation
# MAGIC
# MAGIC Here's how to combine LLM-based primary key detection with automatic rule generation:

# COMMAND ----------

# Create sample data for demonstration
sample_data = [
    (1, "A001", "John", "Doe"),
    (2, "A002", "Jane", "Smith"),
    (3, "A001", "Bob", "Johnson"),  # Duplicate customer_id - should fail uniqueness
    (4, "A003", "Alice", "Brown")
]

sample_df = spark.createDataFrame(
    sample_data,
    ["id", "customer_id", "first_name", "last_name"]
)

# Display sample data
sample_df.show()

# COMMAND ----------

# Simulate an LLM detection result (in practice, this would come from the LLM)
llm_detection_result = {
    "success": True,
    "primary_key_columns": ["id"],  # LLM detected 'id' as the primary key
    "confidence": "high",
    "reasoning": "Column 'id' appears to be an auto-incrementing identifier based on naming patterns and data distribution"
}

if llm_detection_result["success"]:
    # Generate an is_unique rule from the LLM detection
    pk_columns = llm_detection_result["primary_key_columns"]

    generated_rule = DQGenerator.dq_generate_is_unique(
        column=",".join(pk_columns),
        level="error",
        columns=pk_columns,
        confidence=llm_detection_result["confidence"],
        reasoning=llm_detection_result["reasoning"],
        llm_detected=True
    )

    print("✅ Generated is_unique rule from LLM detection:")
    print(f"  Rule: {generated_rule['name']}")
    print(f"  Columns: {generated_rule['check']['arguments']['columns']}")
    print(f"  Metadata: LLM-based detection with {generated_rule['user_metadata']['pk_detection_confidence']} confidence")
# COMMAND ----------

# MAGIC %md
# MAGIC ### Applying the Generated is_unique Rule
# MAGIC
# MAGIC Now let's apply the generated rule to validate data quality:

# COMMAND ----------

from databricks.labs.dqx.engine import DQEngine
from databricks.labs.dqx.rule import DQDatasetRule
from databricks.labs.dqx import check_funcs

# Convert the generated rule to a DQDatasetRule for execution
pk_columns = generated_rule['check']['arguments']['columns']
dq_rule = DQDatasetRule(
    name=generated_rule['name'],
    criticality=generated_rule['criticality'],
    check_func=check_funcs.is_unique,
    columns=pk_columns,
    check_func_kwargs={
        "nulls_distinct": generated_rule['check']['arguments']['nulls_distinct']
    }
)

# Apply the rule using DQEngine
dq_engine = DQEngine(workspace_client=ws)
result_df = dq_engine.apply_checks(sample_df, [dq_rule])

print("✅ Applied is_unique rule to sample data")
print("Result columns:", result_df.columns)

# Show results - the rule should pass since the 'id' column has unique values
result_df.select("id", "customer_id", f"dq_check_{generated_rule['name']}").show()
# COMMAND ----------

# MAGIC %md
# MAGIC ### Composite Primary Key Example
# MAGIC
# MAGIC Let's demonstrate with a composite primary key detected by LLM:

# COMMAND ----------

# Sample data with a composite key scenario
composite_data = [
    ("store_1", "2024-01-01", 100.0),
    ("store_1", "2024-01-02", 150.0),
    ("store_2", "2024-01-01", 200.0),
    ("store_2", "2024-01-02", 175.0),
    ("store_1", "2024-01-01", 120.0),  # Duplicate composite key - should fail
]

composite_df = spark.createDataFrame(
    composite_data,
    ["store_id", "date", "sales_amount"]
)

# Simulate the LLM detecting a composite primary key
composite_llm_result = {
    "success": True,
    "primary_key_columns": ["store_id", "date"],
    "confidence": "medium",
    "reasoning": "Combination of store_id and date appears to uniquely identify sales records based on business logic patterns"
}

# Generate a composite is_unique rule
composite_rule = DQGenerator.dq_generate_is_unique(
    column=",".join(composite_llm_result["primary_key_columns"]),
    level="warn",  # Use warning level for this example
    columns=composite_llm_result["primary_key_columns"],
    confidence=composite_llm_result["confidence"],
    reasoning=composite_llm_result["reasoning"],
    llm_detected=True
)

print("Generated composite is_unique rule:")
print(f"Rule name: {composite_rule['name']}")
print(f"Columns: {composite_rule['check']['arguments']['columns']}")
print(f"Criticality: {composite_rule['criticality']}")

# COMMAND ----------

# Apply composite key validation
composite_dq_rule = DQDatasetRule(
    name=composite_rule['name'],
    criticality=composite_rule['criticality'],
    check_func=check_funcs.is_unique,
    columns=composite_rule['check']['arguments']['columns'],
    check_func_kwargs={
        "nulls_distinct": composite_rule['check']['arguments']['nulls_distinct']
    }
)

composite_result_df = dq_engine.apply_checks(composite_df, [composite_dq_rule])

print("✅ Applied composite is_unique rule")
print("Data with duplicates detected:")
composite_result_df.show()
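To see why the last row trips the check, the composite uniqueness test can be reproduced in plain Python over the same sample rows, counting occurrences of each (store_id, date) pair:

```python
from collections import Counter

# Same sample rows as in the demo cell above
composite_data = [
    ("store_1", "2024-01-01", 100.0),
    ("store_1", "2024-01-02", 150.0),
    ("store_2", "2024-01-01", 200.0),
    ("store_2", "2024-01-02", 175.0),
    ("store_1", "2024-01-01", 120.0),  # duplicate (store_id, date) pair
]

# Count composite-key occurrences; any count > 1 violates uniqueness
key_counts = Counter((store_id, date) for store_id, date, _ in composite_data)
duplicates = [key for key, n in key_counts.items() if n > 1]
print(duplicates)  # [('store_1', '2024-01-01')]
```

This is exactly the violation the `is_unique` check surfaces at `warn` criticality in the cell above, just computed without Spark.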
# COMMAND ----------

# MAGIC %md
# MAGIC ## Key Features
# MAGIC
# MAGIC - **🔧 Completely Optional**: Not activated by default - requires explicit enablement
# MAGIC - **🤖 Intelligent Detection**: Uses LLM analysis of table schema and metadata
# MAGIC - **✨ Multiple Activation Methods**: Configuration, options, or a direct method call
# MAGIC - **🛡️ Graceful Fallback**: Clear messaging when dependencies are unavailable
# MAGIC - **📊 Confidence Scoring**: Provides confidence levels and reasoning
# MAGIC - **🔄 Validation**: Optionally validates detected PKs for duplicates
# MAGIC - **⚡ Automatic Rule Generation**: Converts detected PKs into executable `is_unique` rules
# MAGIC - **🔗 End-to-End Workflow**: From LLM detection to data quality validation
[Review comment] Please link the demo in the docs/demos page.