Feature/add dataset stats to prompt by RobertoCorti · Pull Request #29 · RobertoCorti/skfeaturellm

RobertoCorti · 2026-03-03T10:54:11Z

Summary

Auto-compute dataset statistics in fit() and inject them into the LLM prompt as a {dataset_statistics} block, so the model sees real signal about the data before it suggests transformations. This requires no extra effort from the user.

What's in the stats block

Target statistics

Regression: min, max, mean, std
Classification: class counts + balance %
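For illustration, the target summary described above could be computed roughly as follows. This is a minimal sketch, not the code in this PR: `summarize_target` and the returned dictionary layout are hypothetical, and it assumes `problem_type` is either "regression" or "classification".

```python
import pandas as pd


def summarize_target(y: pd.Series, problem_type: str) -> dict:
    """Hypothetical helper: summarize the target column for the stats block."""
    if problem_type == "regression":
        # Regression: four scalar summaries of the target.
        return {
            "min": float(y.min()),
            "max": float(y.max()),
            "mean": float(y.mean()),
            "std": float(y.std()),  # pandas default: sample std (ddof=1)
        }
    # Classification: absolute class counts plus each class's share in percent.
    counts = y.value_counts()
    balance = (counts / counts.sum() * 100).round(1)
    return {"counts": counts.to_dict(), "balance_pct": balance.to_dict()}


reg_stats = summarize_target(pd.Series([1.0, 2.0, 3.0, 4.0]), "regression")
clf_stats = summarize_target(pd.Series(["a", "a", "a", "b"]), "classification")
```

The actual formatter renders these values into markdown; the sketch stops at the raw numbers.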

Feature statistics

(numeric columns only, one row per feature; always 9 columns, the 8 from describe() plus skewness, regardless of dataset size)

Full describe() output + skewness per feature
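A sketch of how a table with this shape can be built with plain pandas. `summarize_features` is a hypothetical name, and the empty-frame guard for datasets without numeric columns is an assumption about how that case might be handled, not necessarily what the PR does.

```python
import pandas as pd


def summarize_features(X: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: one row per numeric feature, 9 summary columns."""
    numeric = X.select_dtypes(include="number")
    if numeric.shape[1] == 0:
        # Assumption: return an empty frame when there is nothing to describe.
        return pd.DataFrame()
    # describe() yields count, mean, std, min, 25%, 50%, 75%, max as rows;
    # transposing puts features on the rows instead.
    stats = numeric.describe().T
    # skew() returns a Series indexed by column name, so it aligns on the rows.
    stats["skew"] = numeric.skew()
    return stats


X = pd.DataFrame(
    {
        "age": [20, 30, 40, 50],
        "income": [1.0, 2.0, 3.0, 10.0],
        "city": ["a", "b", "c", "d"],  # non-numeric, dropped from the table
    }
)
feature_stats = summarize_features(X)
```

Because the features sit on the rows, the table grows by one row per feature while its width stays fixed, which keeps the prompt block narrow for wide datasets.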

Feature statistics vs target

(only when y is provided to fit())

Regression: Pearson correlation per feature
Classification: class-wise mean per feature (groupby(y).mean().T)
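The two branches above can be sketched as below. `features_vs_target` and the `pearson_corr` column name are hypothetical. The regression branch uses `DataFrame.corrwith`, which is one way to get a per-feature Pearson correlation and may differ from the call used in the PR; the classification branch follows the `groupby(y).mean().T` expression quoted above.

```python
import pandas as pd


def features_vs_target(X: pd.DataFrame, y: pd.Series, problem_type: str) -> pd.DataFrame:
    """Hypothetical helper: relate each numeric feature to the target."""
    numeric = X.select_dtypes(include="number")
    if problem_type == "regression":
        # Pearson correlation of every feature with y (corrwith defaults to pearson).
        return numeric.corrwith(y).to_frame("pearson_corr")
    # Class-wise mean per feature: rows are features, columns are classes.
    return numeric.groupby(y).mean().T


X = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [4.0, 3.0, 2.0, 1.0]})
corr = features_vs_target(X, pd.Series([2.0, 4.0, 6.0, 8.0]), "regression")
class_means = features_vs_target(X, pd.Series([0, 0, 1, 1]), "classification")
```

Both branches assume y shares X's index, since pandas aligns on index labels in `corrwith` and `groupby`.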

Changes

llm_interface.py
- Added _format_dataset_statistics(X, y, problem_type) static method
- Threaded dataset_statistics param through:
  - generate_prompt_context()
  - generate_engineered_features()

prompts.py
- Injected {dataset_statistics} placeholder between target description and additional context

feature_engineer.py
- fit() now accepts y: Optional[pd.Series]
- Stats computed and passed through the call chain
- Falls back to "Not provided." when y is absent

pyproject.toml
- Added tabulate (required by to_markdown())
- Added langchain-anthropic
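To make the placeholder threading concrete, here is a simplified sketch using plain `str.format`. `PROMPT_TEMPLATE` and `build_prompt_context` are hypothetical stand-ins for FEATURE_ENGINEERING_PROMPT and generate_prompt_context(), whose real bodies are not shown here. The "Not provided." fallback for `additional_context` is an assumption made for symmetry.

```python
from typing import Optional

# Hypothetical, shortened template: the statistics block sits between the
# target description and the additional context, as described above.
PROMPT_TEMPLATE = (
    "Target description: {target_description}\n"
    "Dataset statistics:\n{dataset_statistics}\n"
    "Additional context: {additional_context}"
)


def build_prompt_context(
    target_description: str,
    dataset_statistics: Optional[str] = None,
    additional_context: Optional[str] = None,
) -> dict:
    """Hypothetical helper: fill every placeholder, with a textual fallback."""
    return {
        "target_description": target_description,
        "dataset_statistics": dataset_statistics or "Not provided.",
        "additional_context": additional_context or "Not provided.",
    }


default_context = build_prompt_context("House price in USD")
default_prompt = PROMPT_TEMPLATE.format(**default_context)
stats_prompt = PROMPT_TEMPLATE.format(
    **build_prompt_context("House price in USD", dataset_statistics="mean: 2.5")
)
```

Giving every key a default means the template always formats cleanly, so callers that never pass statistics keep working unchanged.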

Tests

8 new tests covering:
- All branches of the stats formatter
- Updated generate_prompt_context
- fit() with and without y

Total tests: 121
Status: ✅ All green

…chment

Adds a static method on LLMInterface that computes and formats dataset statistics as a human-readable markdown block:
- Target stats: min/max/mean/std for regression, class counts + balance % for classification
- Feature stats: transposed describe() + skewness (features as rows, fixed 9 columns)
- Feature vs target: Pearson corr per feature for regression; class-wise mean per feature for classification (groupby mean, transposed)

Uses pandas/tabulate only (no new runtime dependencies beyond tabulate).

- Add {dataset_statistics} placeholder to FEATURE_ENGINEERING_PROMPT after target_description
- Add dataset_statistics parameter to generate_prompt_context() and generate_engineered_features()
- Add y parameter to fit() in LLMFeatureEngineer; compute stats via _format_dataset_statistics() and pass through the call chain
- Falls back to "Not provided." when y is None or dataset_statistics is not supplied

- 4 tests for _format_dataset_statistics (regression, classification, y=None, no numeric cols)
- 2 tests for generate_prompt_context (dataset_statistics key present, default fallback)
- 2 tests for fit() (dataset_statistics forwarded to LLM with and without y)

RobertoCorti added 5 commits March 3, 2026 10:59

add(dependencies) tabulate package to pyproject.toml

4f93a59

add(dependencies) langchain-anthropic to pyproject.toml

2d28dcf

RobertoCorti merged commit 2d1472c into main Mar 3, 2026
12 checks passed

RobertoCorti merged 5 commits into main from feature/add-dataset-stats-to-prompt
