Profile subset dataframe #589

STEFANOVIVAS · 2025-09-26T23:04:59Z

Changes

Added the ability to profile and generate rules on a subset of the input data by introducing a SQL expression filter.
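The idea can be illustrated with a small sketch. It is not the DQX implementation: it uses Python's built-in sqlite3 module as a stand-in for Spark, and the table, the column names, and the `profile_min_max` helper are hypothetical. It applies a SQL expression as a row filter before computing simple statistics, so the resulting profile describes only the selected subset.

```python
import sqlite3


def profile_min_max(conn, table, column, filter_expr=None):
    """Return (min, max, row_count) of `column`, optionally restricted by a SQL filter.

    Hypothetical helper for illustration only. The filter is treated as trusted
    input and interpolated directly into the query text.
    """
    where = f" WHERE {filter_expr}" if filter_expr else ""
    query = f"SELECT MIN({column}), MAX({column}), COUNT(*) FROM {table}{where}"
    return conn.execute(query).fetchone()


# Small in-memory dataset standing in for the input DataFrame.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (country TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("US", 10.0), ("US", 250.0), ("DE", 5.0), ("DE", 40.0)],
)

# Profile of the whole table versus profile of the filtered subset.
full_profile = profile_min_max(conn, "orders", "amount")
subset_profile = profile_min_max(conn, "orders", "amount", "country = 'DE'")
```

In this sketch the subset profile sees a narrower value range than the full profile, which is why rules generated from it can be tighter for that segment.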

Linked issues

Resolves #569

Tests

… submitted to profile.

… dataset_filter_expression as one of its parameters

…ethod. Added test in the test_rules_generator.py file.

…nserting key, value pair in parameter attribute to adding new class attribute called filter.

…m inserting key, value pair in parameter attribute to adding new class attribute called filter. Filter is now another key for the check dict.

… and running test for save and load checks FileChecksStorageConfig, WorkspaceFileChecksStorageConfig, InstallationChecksStorageConfig, TableChecksStorageConfig, VolumeFileChecksStorageConfig.

… if filter is not None

Copilot

Pull Request Overview

This pull request adds support for profiling a subset of a DataFrame by introducing a filter capability to the data quality profiling system. The changes enable filtering datasets during profiling so that the generated rules are specific to a data subset rather than the entire dataset.

Adds a filter field to the DQProfile class to store filter conditions
Enhances the profiler to apply filters during data sampling and propagate them to generated rules
Updates the rule generator to include filter conditions in the output data quality rules
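The three points above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from this pull request: `SketchProfile` and `to_rule` are hypothetical names, and the rule dict layout is assumed. It shows a profile object carrying a filter and a generator that emits the `filter` key only when a filter was set.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class SketchProfile:
    """Hypothetical stand-in for a profile; `filter` holds the SQL expression used."""

    name: str
    column: str
    parameters: dict[str, Any] = field(default_factory=dict)
    filter: Optional[str] = None


def to_rule(profile: SketchProfile, criticality: str = "error") -> dict[str, Any]:
    """Build a rule dict, adding the `filter` key only when a filter was set."""
    rule: dict[str, Any] = {
        "check": {
            "function": profile.name,
            "arguments": {"column": profile.column, **profile.parameters},
        },
        "criticality": criticality,
    }
    if profile.filter is not None:
        # The filter sits next to "check", not inside it (assumed layout).
        rule["filter"] = profile.filter
    return rule


unfiltered = to_rule(SketchProfile("is_not_null", "amount"))
filtered = to_rule(SketchProfile("is_not_null", "amount", filter="country = 'DE'"))
```

Omitting the key when no filter is set keeps rules generated without a filter identical to what they were before the change.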

Reviewed Changes

Copilot reviewed 7 out of 8 changed files in this pull request and generated 5 comments.

File	Description
src/databricks/labs/dqx/profiler/profiler.py	Adds filter field to DQProfile class and implements filter logic in profiler
src/databricks/labs/dqx/profiler/generator.py	Updates rule generator to handle filter conditions in generated rules
tests/integration/test_profiler.py	Adds comprehensive test cases for filter functionality including duplicated test
tests/integration/test_rules_generator.py	Updates existing tests and adds new tests for filter handling
tests/integration/test_save_and_load_checks_from_table.py	Updates test data to include filter fields
tests/integration/test_save_checks_to_workspace_file.py	Updates test data to include filter fields
tests/integration/test_load_checks_from_uc_volume.py	Updates test data to include filter fields

tests/integration/test_profiler.py

tests/integration/test_rules_generator.py

tests/integration/test_profiler.py

mwojtyczka

Please format the code before committing: make fmt

tests/integration/test_load_checks_from_uc_volume.py

tests/integration/test_profiler.py

src/databricks/labs/dqx/profiler/profiler.py

tests/integration/test_profiler.py

Copilot

Pull Request Overview

Copilot reviewed 7 out of 8 changed files in this pull request and generated 4 comments.

tests/integration/test_save_checks_to_workspace_file.py

tests/integration/test_load_checks_from_uc_volume.py

mwojtyczka

Please address remaining comments

…st file

Copilot

Pull Request Overview

Copilot reviewed 7 out of 8 changed files in this pull request and generated 2 comments.

tests/integration/test_rules_generator.py

tests/integration/test_save_checks_to_workspace_file.py

mwojtyczka

formatting is still wrong

Co-authored-by: Copilot <[email protected]>

mwojtyczka

LGTM - will merge after testing

mwojtyczka

Documentation has to be updated:

The filter parameter should also be added to the ProfilerConfig so that workflows can use it.

Everything else looks good
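A rough sketch of what exposing the filter at the configuration level could look like is shown below. The class name `SketchProfilerConfig`, its field names and defaults, and the `to_profiler_options` helper are all assumptions made for illustration; they are not the actual ProfilerConfig definition.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class SketchProfilerConfig:
    """Hypothetical workflow-level profiler settings; field names are assumptions."""

    sample_fraction: float = 0.3
    limit: int = 1000
    filter: Optional[str] = None  # SQL expression; None means profile all rows


def to_profiler_options(config: SketchProfilerConfig) -> dict[str, Any]:
    """Translate the config into options for a profiling call, omitting an unset filter."""
    options: dict[str, Any] = {
        "sample_fraction": config.sample_fraction,
        "limit": config.limit,
    }
    if config.filter is not None:
        options["filter"] = config.filter
    return options


default_options = to_profiler_options(SketchProfilerConfig())
filtered_options = to_profiler_options(SketchProfilerConfig(filter="amount > 0"))
```

Keeping the filter optional with a default of None means existing configuration files would keep working unchanged in this sketch.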

…he ProfilerConfig dataclass so that workflows can use it.

Copilot

Pull Request Overview

Copilot reviewed 11 out of 12 changed files in this pull request and generated 4 comments.

src/databricks/labs/dqx/profiler/profiler.py

src/databricks/labs/dqx/profiler/generator.py

refactor

mwojtyczka

LGTM

* Added support for running checks on multiple tables ([#566](#566)). Added more flexibility and functionality in running data quality checks, allowing users to run checks on multiple tables in a single method call and as part of Workflows execution. Provided options to run checks for all configured run configs or for a specific run config, or for tables/views matching wildcard patterns. The CLI commands for running workflows have been updated to reflect and support these new functionalities. Additionally, new parameters have been added to the configuration file to control the level of parallelism for these operations, such as `profiler_max_parallelism` and `quality_checker_max_parallelism`. A new demo has been added to showcase how to use the profiler and apply checks across multiple tables. The changes aim to improve scalability of DQX.
* Added New Row-level Checks: IPv6 Address Validation ([#578](#578)). DQX now includes 2 new row-level checks: validation of an IPv6 address (`is_valid_ipv6_address` check function), and validation that an IPv6 address is within a provided CIDR block (`is_ipv6_address_in_cidr` check function).
* Added New Dataset-level Check: Schema Validation check ([#568](#568)). The `has_valid_schema` check function has been introduced to validate whether a DataFrame conforms to a specified schema, with results reported at the row level for consistency with other checks. This function can operate in non-strict mode, where it verifies the existence of expected columns with compatible types, or in strict mode, where it enforces an exact schema match, including column order and types. It accepts parameters such as the expected schema, which can be defined as a DDL string or a StructType object, and optional arguments to specify columns to validate and strict mode.
* Added New Row-level Checks: Spatial data validations ([#581](#581)). Specialized data validation checks for geospatial data have been introduced, enabling verification of valid latitude and longitude values, various geometry and geography types, such as points, linestrings, polygons, multipoints, multilinestrings, and multipolygons, as well as checks for Open Geospatial Consortium (OGC) validity, non-empty geometries, and specific dimensions or coordinate ranges. These checks are implemented as check functions, including `is_latitude`, `is_longitude`, `is_geometry`, `is_geography`, `is_point`, `is_linestring`, `is_polygon`, `is_multipoint`, `is_multilinestring`, `is_multipolygon`, `is_ogc_valid`, `is_non_empty_geometry`, `has_dimension`, `has_x_coordinate_between`, and `has_y_coordinate_between`. The addition of these geospatial data validation checks enhances the overall data quality capabilities, allowing for more accurate and reliable geospatial data processing and analysis. Running these checks requires Databricks serverless or a cluster with runtime 17.1 or above.
* Added absolute and relative tolerance to comparison of datasets ([#574](#574)). The `compare_datasets` check has been enhanced with the introduction of absolute and relative tolerance parameters, enabling more flexible comparisons of decimal values. These tolerances can be applied to numeric columns.
* Added detailed telemetry ([#561](#561)). Telemetry has been enhanced across multiple functionalities to provide better visibility into DQX usage, including which features and checks are used most frequently. This will help us focus development efforts on the areas that matter most to our users.
* Allow installation in a custom folder ([#575](#575)). The installation process for the library has been enhanced to offer flexible installation options, allowing users to install the library in a custom workspace folder, in addition to the default user home directory or a global folder. When installing DQX as a workspace tool using the Databricks CLI, users are prompted to optionally specify a custom workspace path for the installation. Allowing a custom installation folder makes it possible to use DQX on a [group assigned cluster](https://docs.databricks.com/aws/en/compute/group-access).
* Profile subset dataframe ([#589](#589)). The data profiling feature has been enhanced to allow users to profile and generate rules on a subset of the input data by introducing a filter option, which is a string SQL expression that can be used to filter the input data. This filter can be specified in the configuration file or when using the profiler, providing more flexibility in analyzing subsets of data. The profiler supports extensive configuration options to customize the profiling process, including sampling, limiting, and computing statistics on the sampled data. The new filter option enables users to generate more targeted and relevant rules, and it can be used to focus on particular segments of the data, such as rows that match certain conditions.
* Added custom exceptions ([#582](#582)). The codebase now utilizes custom exceptions to handle various error scenarios, providing more specific and informative error messages compared to generic exceptions.

BREAKING CHANGES!

* Workflows run by default for all run configs from the configuration file. Previously, the default behaviour was to run them for a specific run config only.
* The following deprecated methods are removed from the `DQEngine`: `load_checks_from_local_file`, `load_checks_from_workspace_file`, `load_checks_from_table`, `load_checks_from_installation`, `save_checks_in_local_file`, `save_checks_in_workspace_file`, `save_checks_in_table`, `save_checks_in_installation`, `load_run_config`. For loading and saving checks, users are advised to use `load_checks` and `save_checks` of the `DQEngine` described [here](https://databrickslabs.github.io/dqx/docs/guide/quality_checks_storage/), which support various storage types.

STEFANOVIVAS added 24 commits September 23, 2025 19:55

First commit: Installing DQX from feature branch

4ffb641

Implement methods in the DQProfiler class to filter dataframes before…

17eb4ac

… submitted to profile.

Refactor _sample method to utilize filtered dataframe

d75b09d

Refactor _sample method to return the filtered dataframe

530b8fb

Implement print statement to test dataset_filter feature

2c21d2f

Modifying DQProfiler class to generate DQProfile class instances with…

e8ee1d2

… dataset_filter_expression as one of its parameters

Modifying DQGenerator class to show filter in the generate_dq_rules m…

0b6a267

…ethod. Added test in the test_rules_generator.py file.

Refactor filter implementation in the profiler class. Changing from i…

7d9251d

…nserting key, value pair in parameter attribute to adding new class attribute called filter.

Refactor filter implementation in the DQGenerator class. Changing fro…

3b9a829

…m inserting key, value pair in parameter attribute to adding new class attribute called filter. Filter is now another key for the check dict.

Creating and running tests for profiler and generator class. Creating…

c79b9bc

… and running test for save and load checks FileChecksStorageConfig, WorkspaceFileChecksStorageConfig, InstallationChecksStorageConfig, TableChecksStorageConfig, VolumeFileChecksStorageConfig.

Formatting code to submit pull request

287df6d

First commit: Installing DQX from feature branch

889a557

Implement methods in the DQProfiler class to filter dataframes before…

b4f29c9

… submitted to profile.

Refactor _sample method to utilize filtered dataframe

6b92cdc

Refactor _sample method to return the filtered dataframe

505d51d

Implement print statement to test dataset_filter feature

0ed896e

Modifying DQProfiler class to generate DQProfile class instances with…

6a17458

… dataset_filter_expression as one of its parameters

Modifying DQGenerator class to show filter in the generate_dq_rules m…

0732a0b

…ethod. Added test in the test_rules_generator.py file.

Refactor filter implementation in the profiler class. Changing from i…

5dc8243

…nserting key, value pair in parameter attribute to adding new class attribute called filter.

Refactor filter implementation in the DQGenerator class. Changing fro…

4494393

…m inserting key, value pair in parameter attribute to adding new class attribute called filter. Filter is now another key for the check dict.

Creating and running tests for profiler and generator class. Creating…

bb5977d

… and running test for save and load checks FileChecksStorageConfig, WorkspaceFileChecksStorageConfig, InstallationChecksStorageConfig, TableChecksStorageConfig, VolumeFileChecksStorageConfig.

resolving merge conflict markers in the code and deleting local files

5f1b185

Refactor DQGenerator class to only return filter key in the rule dict…

ebe714b

… if filter is not None

Formatting code with make fmt

3a3a7a4

STEFANOVIVAS requested a review from a team as a code owner September 26, 2025 23:05

STEFANOVIVAS requested review from gergo-databricks and removed request for a team September 26, 2025 23:05

mwojtyczka requested a review from Copilot September 30, 2025 08:41

Copilot AI reviewed Sep 30, 2025

mwojtyczka requested changes Sep 30, 2025

tests/integration/test_load_checks_from_uc_volume.py Outdated

tests/integration/test_profiler.py Outdated

src/databricks/labs/dqx/profiler/profiler.py Outdated

mwojtyczka reviewed Sep 30, 2025

tests/integration/test_profiler.py Outdated

mwojtyczka reviewed Sep 30, 2025

tests/integration/test_profiler.py Outdated

STEFANOVIVAS requested a review from mwojtyczka September 30, 2025 18:10

mwojtyczka requested a review from Copilot October 1, 2025 07:33

Copilot AI reviewed Oct 1, 2025

mwojtyczka reviewed Oct 1, 2025

tests/integration/test_load_checks_from_uc_volume.py Outdated

mwojtyczka requested changes Oct 1, 2025

Removing filter inside check key in the load checks from uc volume te…

fe9b6e3

…st file

STEFANOVIVAS requested a review from mwojtyczka October 1, 2025 12:23

mwojtyczka requested a review from Copilot October 1, 2025 17:24

Copilot AI reviewed Oct 1, 2025

tests/integration/test_rules_generator.py Outdated

tests/integration/test_rules_generator.py Outdated

mwojtyczka reviewed Oct 1, 2025

tests/integration/test_save_checks_to_workspace_file.py Outdated

mwojtyczka requested changes Oct 1, 2025

mwojtyczka and others added 3 commits October 1, 2025 19:52

Update tests/integration/test_save_checks_to_workspace_file.py

e01420b

Update tests/integration/test_rules_generator.py

586293b

Co-authored-by: Copilot <[email protected]>

Update tests/integration/test_rules_generator.py

f10b3c0

Co-authored-by: Copilot <[email protected]>

mwojtyczka approved these changes Oct 1, 2025

mwojtyczka requested changes Oct 1, 2025

Update doc to add filter feature reference. Add filter attribute in t…

b7c8911

…he ProfilerConfig dataclass so that workflows can use it.

STEFANOVIVAS requested a review from mwojtyczka October 2, 2025 14:31

mwojtyczka requested a review from Copilot October 2, 2025 18:28

Copilot AI reviewed Oct 2, 2025

src/databricks/labs/dqx/profiler/profiler.py

src/databricks/labs/dqx/profiler/generator.py

src/databricks/labs/dqx/profiler/generator.py

src/databricks/labs/dqx/profiler/generator.py

mwojtyczka added 3 commits October 2, 2025 20:52

pass filter to profiler workflow

f64c207

refactor

fmt

92187ca

fixed tests

2c4576c

mwojtyczka approved these changes Oct 2, 2025

mwojtyczka merged commit 1003ab5 into databrickslabs:main Oct 2, 2025
14 checks passed

mwojtyczka mentioned this pull request Oct 3, 2025

Release v0.9.3 #597

Merged

STEFANOVIVAS commented Sep 26, 2025 • edited by mwojtyczka

2 participants
