
DataDefinition should restrict column processing, not just annotate it #1765

@yiannimercer

Description

When creating a Dataset with an explicit DataDefinition specifying numerical_columns and categorical_columns, Evidently still iterates over and attempts to infer types for all columns in the DataFrame, not just the ones specified. This leads to unexpected errors and defeats the purpose of providing an explicit schema.

Steps to Reproduce

import pandas as pd
from evidently import Dataset
from evidently.core.datasets import DataDefinition

df = pd.DataFrame({
    "feature_a": [1, 2, 3, 4, 5],
    "feature_b": ["x", "y", "z", "x", "y"],
    "internal_col": [None, None, None, None, None],  # Not relevant to analysis
})

# Explicitly define only the columns I care about
definition = DataDefinition(
    numerical_columns=["feature_a"],
    categorical_columns=["feature_b"],
)

# This still fails because Evidently processes "internal_col"
dataset = Dataset.from_pandas(df, data_definition=definition)

Expected Behavior

When I provide a DataDefinition with explicit column lists, Evidently should only process those columns. The DataDefinition should act as a contract/schema that restricts scope, not just a hint that gets merged with auto-inference.

Columns not mentioned in the DataDefinition should be ignored entirely.

Actual Behavior

Evidently iterates over all columns in the DataFrame and runs infer_column_type on each, regardless of whether they appear in the DataDefinition. This causes:

  1. Unexpected errors from columns the user explicitly chose not to include (e.g., all-null columns, malformed data)
  2. Wasted computation on columns that won't be used in the analysis
  3. Confusion about what DataDefinition actually does

Current Workaround

Subset the DataFrame manually before passing it to Evidently:

cols = ["feature_a", "feature_b"]
dataset = Dataset.from_pandas(df[cols], data_definition=definition)

This works but is redundant: I'm specifying the columns twice (once in the subset, once in DataDefinition).
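Until this is addressed, the duplication can be reduced by deriving the subset from the same lists that go into DataDefinition. The following is a minimal sketch in plain Python; `declared_columns` is a hypothetical helper name of my own, not something Evidently provides, and the column names are taken from the reproduction above.

```python
from typing import Iterable, List, Optional, Sequence


def declared_columns(
    available: Iterable[str], *groups: Optional[Sequence[str]]
) -> List[str]:
    """Hypothetical helper: return the declared columns in the data's own order."""
    # Merge every non-empty group (numerical, categorical, ...) into one set
    declared = {name for group in groups if group for name in group}
    available = list(available)
    # Fail early if the definition names a column the data does not have
    missing = declared.difference(available)
    if missing:
        raise KeyError(f"Declared columns not found in the data: {sorted(missing)}")
    return [name for name in available if name in declared]


numerical = ["feature_a"]
categorical = ["feature_b"]

# A plain list stands in for df.columns here
cols = declared_columns(["feature_a", "feature_b", "internal_col"], numerical, categorical)
print(cols)  # ['feature_a', 'feature_b']
```

With the DataFrame from the reproduction, `declared_columns(df.columns, numerical, categorical)` yields the list for `df[cols]`, and the same `numerical` and `categorical` lists can be passed to DataDefinition, so each column is named only once.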

Suggested Behavior

If DataDefinition specifies columns explicitly, only process those columns:

def _generate_data_definition(self, data, reserved_fields, service_columns):
    # If the user provided explicit column lists, only iterate over those
    explicit_columns = set()
    if self._data_definition:
        if self._data_definition.numerical_columns:
            explicit_columns.update(self._data_definition.numerical_columns)
        if self._data_definition.categorical_columns:
            explicit_columns.update(self._data_definition.categorical_columns)
        # ... etc for other column types

    # Fall back to all columns only if no explicit definition was provided;
    # otherwise keep the DataFrame's column order so iteration is deterministic
    if explicit_columns:
        columns_to_process = [c for c in data.columns if c in explicit_columns]
    else:
        columns_to_process = list(data.columns)

    for column in columns_to_process:
        ...  # existing logic

Alternatively, add a parameter like strict=True to DataDefinition that enforces this behavior for users who want it.
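To make the opt-in variant concrete, here is a rough sketch of the selection rule a `strict` flag could apply. Everything in it is an assumption for illustration: `columns_to_process` is a hypothetical standalone function, and neither it nor a `strict` parameter exists in Evidently today.

```python
from typing import List, Optional, Sequence


def columns_to_process(
    all_columns: Sequence[str],
    numerical_columns: Optional[Sequence[str]] = None,
    categorical_columns: Optional[Sequence[str]] = None,
    strict: bool = False,
) -> List[str]:
    """Hypothetical selection rule; not part of Evidently's API."""
    if not strict:
        # Behaviour described in this issue: every column is considered
        return list(all_columns)
    declared = set(numerical_columns or ()) | set(categorical_columns or ())
    # Strict mode: ignore anything the definition does not mention
    return [name for name in all_columns if name in declared]


frame_columns = ["feature_a", "feature_b", "internal_col"]
default_scope = columns_to_process(frame_columns, ["feature_a"], ["feature_b"])
strict_scope = columns_to_process(
    frame_columns, ["feature_a"], ["feature_b"], strict=True
)
```

Defaulting `strict` to `False` would keep the current behavior for existing users, while `strict=True` would drop `internal_col` from processing in the example above.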
