feat: add sklearn-style input validation to LLMFeatureEngineer #35

Open
P-r-e-m-i-u-m wants to merge 5 commits into RobertoCorti:main from P-r-e-m-i-u-m:feat/input-validation

Conversation

@P-r-e-m-i-u-m

Closes #34

Summary

Added sklearn-style input validation to all public method boundaries in LLMFeatureEngineer.

Changes Made

skfeaturellm/feature_engineer.py

  • __init__(): Validates max_features (positive int or None) and verbose (non-negative int).
  • fit(): Validates that X is a non-empty DataFrame and y is a Series of the same length. Stores n_features_in_ and feature_names_in_.
  • transform(): Validates that X is a DataFrame and raises ValueError if columns present during fit are missing.
  • fit_selective(): Validates X, y, n_rounds >= 1, and the eval_set tuple format. Stores n_features_in_ and feature_names_in_.
  • evaluate_features(is_transformed=True): Raises ValueError if expected generated feature columns are missing.
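
For readers skimming the thread, the checks listed above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the class name SketchFeatureEngineer is hypothetical, the exception types chosen for __init__ are an assumption, and the LLM-related logic is omitted entirely.

```python
import numbers

import numpy as np
import pandas as pd


def _is_int(value):
    # bool is a subclass of int in Python, so it is rejected explicitly.
    return isinstance(value, numbers.Integral) and not isinstance(value, bool)


class SketchFeatureEngineer:
    """Hypothetical stand-in for LLMFeatureEngineer; only validation is shown."""

    def __init__(self, max_features=None, verbose=0):
        # Assumption: invalid constructor arguments raise ValueError.
        if max_features is not None and not (_is_int(max_features) and max_features >= 1):
            raise ValueError("max_features must be a positive int or None")
        if not (_is_int(verbose) and verbose >= 0):
            raise ValueError("verbose must be a non-negative int")
        self.max_features = max_features
        self.verbose = verbose

    def fit(self, X, y):
        if not isinstance(X, pd.DataFrame):
            raise TypeError("X must be a pandas DataFrame")
        if X.empty:
            raise ValueError("X must not be empty")
        if not isinstance(y, pd.Series):
            raise TypeError("y must be a pandas Series")
        if len(X) != len(y):
            raise ValueError("X and y must have the same length")
        # Record what was seen during fit, following the sklearn naming.
        self.n_features_in_ = X.shape[1]
        self.feature_names_in_ = np.asarray(X.columns, dtype=object)
        return self

    def transform(self, X):
        if not isinstance(X, pd.DataFrame):
            raise TypeError("X must be a pandas DataFrame")
        missing = [c for c in self.feature_names_in_ if c not in X.columns]
        if missing:
            raise ValueError(f"Columns seen during fit are missing: {missing}")
        return X  # feature generation omitted in this sketch
```

Extra columns in X at transform time are tolerated here; only columns that were present during fit and are now absent trigger the error, which matches the description of transform() above.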

tests/test_feature_engineer.py

  • Added 11 new tests covering all new validation paths
  • All existing tests pass unchanged

Signed-off-by: 🄂ʏᴇᴅ 🄰ʙᴅᴜʟ 🄰ᴍᴀ🄝 ✧ <amanbaba9404522@gmail.com>
Comment thread skfeaturellm/feature_engineer.py Outdated
self : LLMFeatureEngineer
The fitted transformer
"""
if not isinstance(X, pd.DataFrame):
Owner

Can you abstract this validation logic into a standalone validate_data() function in utils.validation? Similar to how scikit-learn handles it (see here), this would keep fit() cleaner and make the validation reusable across other methods/classes down the line.
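
One possible shape for such a helper is sketched below. The function name validate_data comes from the comment above, but the signature, the reference_columns parameter, and the return value are assumptions made for illustration; the version that landed in skfeaturellm/utils/validation.py may differ.

```python
import pandas as pd


def validate_data(X, y=None, *, reference_columns=None):
    """Validate X (and optionally y); hypothetical signature for illustration.

    reference_columns is an assumed parameter: when given, every name in it
    must be a column of X (useful at transform time).
    """
    if not isinstance(X, pd.DataFrame):
        raise TypeError(f"X must be a pandas DataFrame, got {type(X).__name__}")
    if X.empty:
        raise ValueError("X must not be empty")

    if y is not None:
        if not isinstance(y, pd.Series):
            raise TypeError(f"y must be a pandas Series, got {type(y).__name__}")
        if len(y) != len(X):
            raise ValueError(
                f"X and y have inconsistent lengths: {len(X)} != {len(y)}"
            )

    if reference_columns is not None:
        missing = [c for c in reference_columns if c not in X.columns]
        if missing:
            raise ValueError(f"X is missing columns seen during fit: {missing}")

    return X, y
```

With a helper along these lines, fit() could call validate_data(X, y) and transform() could call validate_data(X, reference_columns=self.feature_names_in_), so each isinstance check lives in one place.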

Comment thread skfeaturellm/feature_engineer.py Outdated
check_is_fitted(self)

# Convert LLM output to executor config and apply prefix to feature names
if not isinstance(X, pd.DataFrame):
Owner

Same comment as above: a general validate_data() could be used here as well, which would avoid duplication.

Comment thread skfeaturellm/feature_engineer.py Outdated
The fitted transformer. Call ``transform()`` to apply the selected
features and ``to_transformer()`` to export them for production.
"""
if not isinstance(X, pd.DataFrame):
Owner

Same comment as above: a general validate_data() could be used here as well, which would avoid duplication.

@RobertoCorti
Owner

RobertoCorti commented Mar 10, 2026

It looks like pre-commit is failing in CI. Could you install it locally and run it before pushing?

Use the command

poetry run pre-commit install

After that, it'll run automatically on each commit and catch any issues before they hit CI.

…methods

Signed-off-by: 🄂ʏᴇᴅ 🄰ʙᴅᴜʟ 🄰ᴍᴀ🄝 ✧ <amanbaba9404522@gmail.com>
@P-r-e-m-i-u-m
Author

Refactored validation logic into a standalone validate_data() function in utils/validation.py and updated fit(), transform(), and fit_selective() to use it. Ready for re-review @RobertoCorti

@RobertoCorti
Owner

Again, it looks like CI/CD is failing. Install the pre-commit hooks with:

poetry run pre-commit install

Then, to run the pre-commit checks, execute:

poetry run pre-commit run --all-files

Owner

@RobertoCorti RobertoCorti left a comment

Changes look good. Please address my one remaining comment and fix CI/CD.

Comment thread skfeaturellm/utils/validation.py Outdated
TypeError
If X is not a DataFrame or y is not a Series.
"""
import pandas as pd
Owner

Please move import pandas as pd to the top of the file, outside the function.

Signed-off-by: 🄂ʏᴇᴅ 🄰ʙᴅᴜʟ 🄰ᴍᴀ🄝 ✧ <amanbaba9404522@gmail.com>
@P-r-e-m-i-u-m
Author

Moved import pandas as pd to the top of validation.py. Pre-commit should pass now. Ready for re-review @RobertoCorti

Signed-off-by: 🄂ʏᴇᴅ 🄰ʙᴅᴜʟ 🄰ᴍᴀ🄝 ✧ <amanbaba9404522@gmail.com>
Signed-off-by: 🄂ʏᴇᴅ 🄰ʙᴅᴜʟ 🄰ᴍᴀ🄝 ✧ <amanbaba9404522@gmail.com>
@P-r-e-m-i-u-m
Author

Fixed feature_names_in_ missing in transform(): added a fallback that sets it from X.columns if not present. Ready for re-review @RobertoCorti 👍
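
The fallback described in this comment might look roughly like the sketch below. It is a guess at the shape of the change, not the committed code: the class name TransformSketch is hypothetical, and the check_is_fitted(self) call and the feature-generation logic from the real transform() are left out.

```python
import numpy as np
import pandas as pd


class TransformSketch:
    """Hypothetical transformer showing only the feature_names_in_ fallback."""

    def transform(self, X):
        if not isinstance(X, pd.DataFrame):
            raise TypeError("X must be a pandas DataFrame")

        # Fallback: if fit did not record the input columns, adopt the
        # columns of the first DataFrame passed to transform().
        if not hasattr(self, "feature_names_in_"):
            self.feature_names_in_ = np.asarray(X.columns, dtype=object)

        missing = [c for c in self.feature_names_in_ if c not in X.columns]
        if missing:
            raise ValueError(f"Columns seen during fit are missing: {missing}")
        return X.copy()  # feature generation omitted in this sketch
```

One consequence of this design is that the missing-column check cannot fail on the call that triggers the fallback, since the reference names are taken from X itself; it only protects later calls.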


Development

Successfully merging this pull request may close these issues.

sklearn-style input validation to LLMFeatureEngineer

2 participants