DataDefinition should restrict column processing, not just annotate it
Description
When creating a Dataset with an explicit DataDefinition specifying numerical_columns and categorical_columns, Evidently still iterates over and attempts to infer types for all columns in the DataFrame, not just the ones specified. This leads to unexpected errors and defeats the purpose of providing an explicit schema.
Steps to Reproduce
import pandas as pd
from evidently import Dataset
from evidently.core.datasets import DataDefinition

df = pd.DataFrame({
    "feature_a": [1, 2, 3, 4, 5],
    "feature_b": ["x", "y", "z", "x", "y"],
    "internal_col": [None, None, None, None, None],  # Not relevant to analysis
})

# Explicitly define only the columns I care about
definition = DataDefinition(
    numerical_columns=["feature_a"],
    categorical_columns=["feature_b"],
)

# This still fails because Evidently processes "internal_col"
dataset = Dataset.from_pandas(df, data_definition=definition)
Expected Behavior
When I provide a DataDefinition with explicit column lists, Evidently should only process those columns. The DataDefinition should act as a contract/schema that restricts scope, not just a hint that gets merged with auto-inference.
Columns not mentioned in the DataDefinition should be ignored entirely.
Actual Behavior
Evidently iterates over all columns in the DataFrame and runs infer_column_type on each, regardless of whether they appear in the DataDefinition. This causes:
- Unexpected errors from columns the user explicitly chose not to include (e.g., all-null columns, malformed data)
- Wasted computation on columns that won't be used in the analysis
- Confusion about what DataDefinition actually does
Current Workaround
Subset the DataFrame manually before passing to Evidently:
cols = ["feature_a", "feature_b"]
dataset = Dataset.from_pandas(df[cols], data_definition=definition)
This works but is redundant: I'm specifying the columns twice (once in the subset, once in DataDefinition).
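Until this is addressed, the duplication in the workaround can be reduced by deriving the subset from the definition itself. The following is a rough sketch, assuming the definition exposes numerical_columns and categorical_columns as plain lists (as in the constructor call above). declared_columns is a hypothetical helper, not part of Evidently, and a SimpleNamespace stands in for the real DataDefinition so the snippet runs on its own.

```python
from types import SimpleNamespace


def declared_columns(definition, fields=("numerical_columns", "categorical_columns")):
    """Collect column names listed on a definition-like object.

    Names are returned in declaration order, without duplicates.
    The default field names are the two used in this issue; pass others as needed.
    """
    seen = {}
    for field in fields:
        # A field may be missing or None when that column type was not declared.
        for name in getattr(definition, field, None) or []:
            seen.setdefault(name, None)
    return list(seen)


# Stand-in for a DataDefinition (assumption: the real object has the same attributes).
definition = SimpleNamespace(
    numerical_columns=["feature_a"],
    categorical_columns=["feature_b"],
)
cols = declared_columns(definition)
```

With the real objects this would presumably be used as Dataset.from_pandas(df[declared_columns(definition)], data_definition=definition), so the column list lives only in the DataDefinition. I have not verified this against every column type Evidently supports.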
Suggested Behavior
If DataDefinition specifies columns explicitly, only process those columns:
def _generate_data_definition(self, data, reserved_fields, service_columns):
    # If user provided explicit column lists, only iterate over those
    columns_to_process = set()
    if self._data_definition:
        if self._data_definition.numerical_columns:
            columns_to_process.update(self._data_definition.numerical_columns)
        if self._data_definition.categorical_columns:
            columns_to_process.update(self._data_definition.categorical_columns)
        # ... etc for other column types
    # Fall back to all columns only if no explicit definition provided
    if not columns_to_process:
        columns_to_process = set(data.columns)
    for column in columns_to_process:
        ...  # existing logic
Alternatively, add a parameter like strict=True to DataDefinition that enforces this behavior for users who want it.
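To make the strict=True idea concrete, here is a minimal standalone sketch of the selection step. The function name, the strict flag, and the error on missing columns are all assumptions for illustration; none of this is existing Evidently API. It also filters the DataFrame's own column order rather than iterating a set, so the processing order stays deterministic.

```python
def columns_to_process(all_columns, declared, strict=False):
    """Pick the columns to run type inference on.

    all_columns: column names in DataFrame order.
    declared: names listed explicitly in the definition (may be empty or None).
    strict: hypothetical flag; when True, undeclared columns are skipped.
    """
    declared = list(declared or [])
    if not strict or not declared:
        # Behavior as described in this issue: every column is inspected.
        return list(all_columns)

    # Assumed design choice: fail early if the definition names unknown columns.
    missing = [name for name in declared if name not in all_columns]
    if missing:
        raise ValueError(f"Declared columns not found in data: {missing}")

    declared_set = set(declared)
    # Filter in DataFrame order so the result does not depend on set ordering.
    return [name for name in all_columns if name in declared_set]


all_cols = ["feature_a", "feature_b", "internal_col"]
selected = columns_to_process(all_cols, ["feature_a", "feature_b"], strict=True)
```

Defaulting strict to False would keep existing behavior for current users, while those who want the definition to act as a contract could opt in.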