jominjohny
Contributor

Changes

LLM-based primary key (PK) detector

Linked issues

#484

Resolves #..

Tests

  • manually tested
  • added unit tests
  • added integration tests
  • added end-to-end tests

@jominjohny jominjohny requested a review from a team as a code owner August 25, 2025 05:39
@jominjohny jominjohny requested review from grusin-db and removed request for a team August 25, 2025 05:39
Contributor

@mwojtyczka mwojtyczka left a comment

Thanks for the PR. We should clarify the scope. The primary purpose of PK detection is to support the compare_datasets check in cases where users don't know the primary keys to compare on. It should also be possible to call it as a standalone method, and the profiler seems like a good place for that. So a new method that can be called from the profiler should be added, e.g. detect_primary_keys_with_llm. If we want to generate a uniqueness check from the profiler, it should suggest the existing is_unique check function. Yes, we can add this as another profile and use it for rules generation.
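To make the suggestion concrete, here is a minimal sketch of how a detected candidate key could be validated and turned into a rule that reuses is_unique. It is plain Python over a list of dicts rather than Spark, and the helper names (`has_duplicates`, `suggest_uniqueness_check`) are hypothetical. The rule layout (`criticality`, `check.function`, `check.arguments.columns`) is an assumption about the DQX metadata format and should be checked against the DQX docs.

```python
from collections import Counter
from typing import Any, Optional


def has_duplicates(rows: list[dict[str, Any]], columns: list[str]) -> bool:
    """Return True if any combination of values in `columns` occurs more than once."""
    counts = Counter(tuple(row[col] for col in columns) for row in rows)
    return any(n > 1 for n in counts.values())


def suggest_uniqueness_check(
    rows: list[dict[str, Any]], candidate_columns: list[str]
) -> Optional[dict[str, Any]]:
    """Hypothetical helper: emit an is_unique rule only if the candidate key
    is actually unique in the sampled rows; otherwise return None."""
    if not candidate_columns or has_duplicates(rows, candidate_columns):
        return None
    # Assumed rule layout; reuses the existing is_unique check function
    # instead of introducing a new PK-specific check.
    return {
        "criticality": "error",
        "check": {
            "function": "is_unique",
            "arguments": {"columns": list(candidate_columns)},
        },
    }


sample = [
    {"order_id": 1, "customer_id": 10},
    {"order_id": 2, "customer_id": 10},
]
rule = suggest_uniqueness_check(sample, ["order_id"])         # unique in the sample
rejected = suggest_uniqueness_check(sample, ["customer_id"])  # has duplicates
```

In this sketch the LLM would only propose `candidate_columns`; the duplicate check decides whether a rule is suggested at all.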

@mwojtyczka mwojtyczka requested a review from Copilot August 29, 2025 10:28
Contributor

@Copilot Copilot AI left a comment

Pull Request Overview

This PR adds LLM-based primary key detection capabilities to the DQX data quality framework. The functionality is completely optional and only activates when explicitly requested by users.

Key changes:

  • Implements intelligent primary key detection using Large Language Models via DSPy and Databricks Model Serving
  • Adds comprehensive configuration options for LLM-based detection with graceful fallback when dependencies are unavailable
  • Integrates seamlessly with existing profiling workflow while maintaining backward compatibility
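The "optional, with graceful fallback" behaviour described above could look roughly like the following. This is a sketch under stated assumptions, not the PR's code: `detect_primary_keys`, its parameters, and the injected `llm_detector` callable are hypothetical, and only the standard-library `importlib.util.find_spec` call is real.

```python
import importlib.util
from typing import Callable, Optional


def is_module_available(module_name: str) -> bool:
    """Check whether a top-level module can be imported, without importing it."""
    return importlib.util.find_spec(module_name) is not None


def detect_primary_keys(
    table: str,
    use_llm: bool = False,
    llm_module: str = "dspy",
    llm_detector: Optional[Callable[[str], list[str]]] = None,
) -> Optional[list[str]]:
    """Hypothetical entry point: run LLM detection only when explicitly requested
    and when the optional dependency is installed; otherwise return None."""
    if not use_llm:
        return None  # feature is opt-in, so default profiling is unchanged
    if llm_detector is None or not is_module_available(llm_module):
        return None  # graceful fallback instead of raising ImportError
    return llm_detector(table)


# Default call: LLM detection is never triggered.
default_result = detect_primary_keys("catalog.schema.orders")

# Requested, but the (deliberately fake) dependency is missing: falls back to None.
fallback_result = detect_primary_keys(
    "catalog.schema.orders",
    use_llm=True,
    llm_module="surely_not_installed_module_xyz",
    llm_detector=lambda table: ["order_id"],
)
```

Returning None rather than raising keeps existing profiler callers working when the LLM extras are not installed; a real implementation might also log a warning.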

Reviewed Changes

Copilot reviewed 10 out of 11 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| src/databricks/labs/dqx/llm/pk_identifier.py | Core LLM detection engine with table metadata analysis and duplicate validation |
| src/databricks/labs/dqx/profiler/profiler.py | Enhanced profiler with LLM detection methods and lazy import handling |
| src/databricks/labs/dqx/profiler/generator.py | Added primary key rule generation with LLM-specific metadata |
| src/databricks/labs/dqx/profiler/runner.py | Updated runner to support table-based profiling with PK detection |
| src/databricks/labs/dqx/config.py | Added LLM configuration fields to ProfilerConfig |
| src/databricks/labs/dqx/check_funcs.py | Implemented is_primary_key validation function |
| tests/unit/test_llm_based_pk_identifier.py | Comprehensive unit tests with graceful dependency handling |
| tests/integration/test_pk_detection_integration.py | End-to-end integration tests for the complete workflow |
| src/databricks/labs/dqx/llm/demo.py | Usage demonstration showing optional LLM activation |
| src/databricks/labs/dqx/llm/README.md | Detailed documentation with examples and best practices |


github-actions bot commented Sep 5, 2025

✅ 405/405 passed, 5 flaky, 2 skipped, 4h12m46s total

Flaky tests:

  • 🤪 test_apply_checks_and_save_in_tables_for_patterns_with_custom_suffix (2.607s)
  • 🤪 test_save_results_in_table (1.253s)
  • 🤪 test_quality_checker_workflow_with_custom_check_func_rel_path (1m48.517s)
  • 🤪 test_save_streaming_results_in_table (10.013s)
  • 🤪 test_e2e_workflow (8m21.604s)

Running from acceptance #2746
