
Conversation

@STEFANOVIVAS STEFANOVIVAS commented Sep 26, 2025

Changes

  • Added the ability to profile and generate rules on a subset of the input data by introducing a SQL expression filter.

Linked issues

Resolves #569

Tests

  • manually tested
  • added unit tests
  • added integration tests
  • added end-to-end tests
  • added performance tests

… dataset_filter_expression as one of its parameters
…ethod. Added a test in the test_rules_generator.py file.
…nserting a key-value pair in the parameter attribute to adding a new class attribute called filter.
…m inserting a key-value pair in the parameter attribute to adding a new class attribute called filter. Filter is now another key in the check dict.
… and running tests for saving and loading checks with FileChecksStorageConfig, WorkspaceFileChecksStorageConfig, InstallationChecksStorageConfig, TableChecksStorageConfig, and VolumeFileChecksStorageConfig.
@STEFANOVIVAS STEFANOVIVAS requested a review from a team as a code owner September 26, 2025 23:05
@STEFANOVIVAS STEFANOVIVAS requested review from gergo-databricks and removed request for a team September 26, 2025 23:05
@mwojtyczka mwojtyczka requested a review from Copilot September 30, 2025 08:41
@Copilot Copilot AI left a comment

Pull Request Overview

This pull request adds support for profiling subset DataFrames by introducing a filter capability to the data quality profiling system. The changes enable filtering datasets during profiling so that generated rules are specific to data subsets rather than entire datasets.

  • Adds a filter field to the DQProfile class to store filter conditions
  • Enhances the profiler to apply filters during data sampling and propagate them to generated rules
  • Updates the rule generator to include filter conditions in the output data quality rules
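
To make the reviewed behavior concrete, here is a minimal usage sketch (not code from this diff): the `DQProfiler`/`DQGenerator` classes and `profile`/`generate_dq_rules` methods are recalled from the DQX profiler documentation and may not match exactly, and the `options={"filter": ...}` key is an assumption based on the PR and changelog descriptions of the new option.

```python
# Hypothetical usage sketch -- option and method names are assumptions, not the confirmed API.
from databricks.sdk import WorkspaceClient
from databricks.labs.dqx.profiler.profiler import DQProfiler
from databricks.labs.dqx.profiler.generator import DQGenerator

ws = WorkspaceClient()
input_df = spark.table("main.sales.transactions")  # assumes an active Spark session (e.g. a Databricks notebook)

# Profile only the rows matching the SQL expression filter, so the candidate
# rules describe that subset instead of the entire dataset.
profiler = DQProfiler(ws)
summary_stats, profiles = profiler.profile(input_df, options={"filter": "country = 'US'"})

# Each generated rule is expected to carry the filter it was derived under.
generator = DQGenerator(ws)
checks = generator.generate_dq_rules(profiles)
```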

Reviewed Changes

Copilot reviewed 7 out of 8 changed files in this pull request and generated 5 comments.

Summary per file:

| File | Description |
| --- | --- |
| src/databricks/labs/dqx/profiler/profiler.py | Adds a filter field to the DQProfile class and implements the filter logic in the profiler |
| src/databricks/labs/dqx/profiler/generator.py | Updates the rule generator to handle filter conditions in generated rules |
| tests/integration/test_profiler.py | Adds comprehensive test cases for the filter functionality, including a duplicated test |
| tests/integration/test_rules_generator.py | Updates existing tests and adds new tests for filter handling |
| tests/integration/test_save_and_load_checks_from_table.py | Updates test data to include filter fields |
| tests/integration/test_save_checks_to_workspace_file.py | Updates test data to include filter fields |
| tests/integration/test_load_checks_from_uc_volume.py | Updates test data to include filter fields |
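
For orientation only, a rough sketch of what the extended `DQProfile` dataclass might look like after this change; the pre-existing field names are assumptions recalled from the library, and only the new `filter` field comes from this PR's description.

```python
from dataclasses import dataclass
from typing import Any


# Illustrative sketch -- existing field names are assumptions, not copied from the diff.
@dataclass
class DQProfile:
    name: str                                 # profile/rule name, e.g. "is_not_null"
    column: str                               # column the profile applies to
    description: str | None = None            # human-readable description of the profile
    parameters: dict[str, Any] | None = None  # rule-specific parameters
    filter: str | None = None                 # new: SQL expression restricting the rows the profile describes
```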


@mwojtyczka mwojtyczka left a comment

Please format the code before committing: `make fmt`

@Copilot Copilot AI left a comment

Pull Request Overview

Copilot reviewed 7 out of 8 changed files in this pull request and generated 4 comments.



@mwojtyczka mwojtyczka left a comment

Please address remaining comments

@mwojtyczka mwojtyczka requested a review from Copilot October 1, 2025 17:24
@Copilot Copilot AI left a comment

Pull Request Overview

Copilot reviewed 7 out of 8 changed files in this pull request and generated 2 comments.



@mwojtyczka mwojtyczka left a comment

formatting is still wrong

@mwojtyczka mwojtyczka left a comment

LGTM - will merge after testing

@mwojtyczka mwojtyczka left a comment

…he ProfilerConfig dataclass so that workflows can use it.
@mwojtyczka mwojtyczka requested a review from Copilot October 2, 2025 18:28
@Copilot Copilot AI left a comment

Pull Request Overview

Copilot reviewed 11 out of 12 changed files in this pull request and generated 4 comments.



@mwojtyczka mwojtyczka left a comment

LGTM

@mwojtyczka mwojtyczka merged commit 1003ab5 into databrickslabs:main Oct 2, 2025
14 checks passed
mwojtyczka added a commit that referenced this pull request Oct 3, 2025
* Added support for running checks on multiple tables ([#566](#566)). Added more flexibility and functionality in running data quality checks, allowing users to run checks on multiple tables in a single method call and as part of Workflows execution. Provided options to run checks for all configured run configs, for a specific run config, or for tables/views matching wildcard patterns. The CLI commands for running workflows have been updated to reflect and support these new functionalities. Additionally, new parameters have been added to the configuration file to control the level of parallelism for these operations, such as `profiler_max_parallelism` and `quality_checker_max_parallelism`. A new demo has been added to showcase how to use the profiler and apply checks across multiple tables. The changes aim to improve the scalability of DQX.
* Added New Row-level Checks: IPv6 Address Validation ([#578](#578)). DQX now includes two new row-level checks: validation of IPv6 addresses (`is_valid_ipv6_address` check function) and validation that an IPv6 address is within a provided CIDR block (`is_ipv6_address_in_cidr` check function).
* Added New Dataset-level Check: Schema Validation check ([#568](#568)). The `has_valid_schema` check function has been introduced to validate whether a DataFrame conforms to a specified schema, with results reported at the row level for consistency with other checks. This function can operate in non-strict mode, where it verifies the existence of expected columns with compatible types, or in strict mode, where it enforces an exact schema match, including column order and types. It accepts parameters such as the expected schema, which can be defined as a DDL string or a StructType object, and optional arguments to specify which columns to validate and whether to enforce strict mode.
* Added New Row-level Checks: Spatial data validations ([#581](#581)). Specialized data validation checks for geospatial data have been introduced, enabling verification of valid latitude and longitude values, various geometry and geography types, such as points, linestrings, polygons, multipoints, multilinestrings, and multipolygons, as well as checks for Open Geospatial Consortium (OGC) validity, non-empty geometries, and specific dimensions or coordinate ranges. These checks are implemented as check functions, including `is_latitude`, `is_longitude`, `is_geometry`, `is_geography`, `is_point`, `is_linestring`, `is_polygon`, `is_multipoint`, `is_multilinestring`, `is_multipolygon`, `is_ogc_valid`, `is_non_empty_geometry`, `has_dimension`, `has_x_coordinate_between`, and `has_y_coordinate_between`. The addition of these geospatial data validation checks enhances the overall data quality capabilities, allowing for more accurate and reliable geospatial data processing and analysis. Running these checks requires Databricks serverless or a cluster with runtime 17.1 or above.
* Added absolute and relative tolerance to comparison of datasets ([#574](#574)). The `compare_datasets` check has been enhanced with the introduction of absolute and relative tolerance parameters, enabling more flexible comparisons of decimal values. These tolerances can be applied to numeric columns.
* Added detailed telemetry ([#561](#561)). Telemetry has been enhanced across multiple functionalities to provide better visibility into DQX usage, including which features and checks are used most frequently. This will help us focus development efforts on the areas that matter most to our users.
* Allow installation in a custom folder ([#575](#575)). The installation process for the library has been enhanced to offer flexible installation options, allowing users to install the library in a custom workspace folder, in addition to the default user home directory or a global folder. When installing DQX as a workspace tool using the Databricks CLI, users are prompted to optionally specify a custom workspace path for the installation. Allowing a custom installation folder makes it possible to use DQX on a [group assigned cluster](https://docs.databricks.com/aws/en/compute/group-access).
* Profile subset dataframe ([#589](#589)). The data profiling feature has been enhanced to allow users to profile and generate rules on a subset of the input data by introducing a filter option, which is a string SQL expression that can be used to filter the input data. This filter can be specified in the configuration file or when using the profiler, providing more flexibility in analyzing subsets of data. The profiler supports extensive configuration options to customize the profiling process, including sampling, limiting, and computing statistics on the sampled data. The new filter option enables users to generate more targeted and relevant rules, and it can be used to focus on particular segments of the data, such as rows that match certain conditions.
* Added custom exceptions ([#582](#582)). The codebase now utilizes custom exceptions to handle various error scenarios, providing more specific and informative error messages compared to generic exceptions.

BREAKING CHANGES!

* Workflows run by default for all run configs from configuration file. Previously, the default behaviour was to run them for a specific run config only.
* The following deprecated methods have been removed from `DQEngine`: `load_checks_from_local_file`, `load_checks_from_workspace_file`, `load_checks_from_table`, `load_checks_from_installation`, `save_checks_in_local_file`, `save_checks_in_workspace_file`, `save_checks_in_table`, `save_checks_in_installation`, `load_run_config`. For loading and saving checks, users are advised to use `load_checks` and `save_checks` of the `DQEngine`, described [here](https://databrickslabs.github.io/dqx/docs/guide/quality_checks_storage/), which support various storage types.
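
As a quick illustration of the new top-level `filter` key on check definitions (see the Profile subset dataframe entry above), the snippet below is a sketch only: the check function and argument names are assumptions, not taken from this PR.

```python
# Illustrative check definition -- function and argument names are assumptions.
checks = [
    {
        "name": "amount_is_not_null_us_only",
        "criticality": "error",
        "check": {"function": "is_not_null", "arguments": {"column": "amount"}},
        # New: the SQL expression subset this rule was generated from and applies to.
        "filter": "country = 'US'",
    },
]
```

Per the updated integration tests, such a `filter` entry is expected to round-trip through the different checks storage configs (file, workspace file, installation, table, and UC volume).
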
@mwojtyczka mwojtyczka mentioned this pull request Oct 3, 2025
mwojtyczka added a commit that referenced this pull request Oct 3, 2025