Merged
…chment Adds a static method on LLMInterface that computes and formats dataset statistics as a human-readable markdown block:
- Target stats: min/max/mean/std for regression, class counts + balance % for classification
- Feature stats: transposed describe() + skewness (features as rows, fixed 9 columns)
- Feature vs target: Pearson corr per feature for regression; class-wise mean per feature for classification (groupby mean, transposed)

Uses pandas/tabulate only (no new runtime dependencies beyond tabulate).
- Add {dataset_statistics} placeholder to FEATURE_ENGINEERING_PROMPT after target_description
- Add dataset_statistics parameter to generate_prompt_context() and generate_engineered_features()
- Add y parameter to fit() in LLMFeatureEngineer; compute stats via _format_dataset_statistics() and pass through the call chain
- Falls back to "Not provided." when y is None or dataset_statistics is not supplied
- 4 tests for _format_dataset_statistics (regression, classification, y=None, no numeric cols)
- 2 tests for generate_prompt_context (dataset_statistics key present, default fallback)
- 2 tests for fit() (dataset_statistics forwarded to LLM with and without y)
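The statistics block described in these commits can be approximated with a few pandas calls. What follows is a minimal sketch, not the PR's actual `_format_dataset_statistics`: the function name, section titles, and column labels (`balance_pct`, `pearson_corr`) are illustrative assumptions, and returning only the feature table when `y` is `None` is a guess at the behavior. The renderer falls back to `to_string()` when tabulate is not installed, which is also an addition for the sketch.

```python
from typing import Optional

import pandas as pd


def _render(df: pd.DataFrame) -> str:
    """Render a table as markdown; fall back to plain text without tabulate."""
    try:
        return df.to_markdown()
    except ImportError:  # to_markdown() needs the optional tabulate package
        return df.to_string()


def format_dataset_statistics_sketch(
    X: pd.DataFrame, y: Optional[pd.Series], problem_type: str
) -> str:
    """Hypothetical stand-in for the statistics formatter described above."""
    sections = []
    numeric = X.select_dtypes(include="number")

    if y is not None:
        if problem_type == "regression":
            target = y.agg(["min", "max", "mean", "std"]).to_frame("target")
        else:
            counts = y.value_counts()
            target = pd.DataFrame(
                {"count": counts, "balance_pct": 100 * counts / counts.sum()}
            )
        sections.append("Target statistics\n" + _render(target))

    if not numeric.empty:
        # describe() gives 8 statistics per column; transposing turns them
        # into 8 columns with one row per feature, and skew adds the ninth.
        feats = numeric.describe().T
        feats["skew"] = numeric.skew()
        sections.append("Feature statistics\n" + _render(feats))

        if y is not None:
            if problem_type == "regression":
                vs = numeric.corrwith(y).to_frame("pearson_corr")
            else:
                vs = numeric.groupby(y).mean().T  # class-wise mean per feature
            sections.append("Feature statistics vs target\n" + _render(vs))

    # Assumption: reuse the prompt's fallback text when nothing can be computed.
    return "\n\n".join(sections) if sections else "Not provided."
```

Because the feature table has one row per feature and a fixed set of nine columns, its width stays constant as the dataset gains rows, which keeps the prompt size predictable.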
Summary
Auto-compute dataset statistics in fit() and inject them into the LLM prompt as a {dataset_statistics} block, giving the model real signal about the data before it suggests transformations, with zero extra effort from the user.

What's in the stats block

Target statistics
- Regression: min/max/mean/std
- Classification: class counts + balance %

Feature statistics (numeric columns, features as rows; fixed 9 columns regardless of dataset size)
- describe() output + skewness per feature

Feature statistics vs target (only when y is provided to fit())
- Regression: Pearson correlation per feature
- Classification: class-wise mean per feature (groupby(y).mean().T)

Changes

llm_interface.py
- _format_dataset_statistics(X, y, problem_type) static method
- dataset_statistics param passed through:
  - generate_prompt_context()
  - generate_engineered_features()

prompts.py
- {dataset_statistics} placeholder between target description and additional context

feature_engineer.py
- fit() now accepts y: Optional[pd.Series]
- "Not provided." when y is absent

pyproject.toml
- tabulate (required by to_markdown())
- langchain-anthropic

Tests
- _format_dataset_statistics (regression, classification, y=None, no numeric cols)
- generate_prompt_context
- fit() with and without y

Total tests: 121
Status: ✅ All green
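The placeholder-plus-fallback behavior listed under Changes can be illustrated with plain string formatting. This is a sketch under stated assumptions: the template text and the helper `build_prompt_context_sketch` are hypothetical stand-ins, since the real FEATURE_ENGINEERING_PROMPT and the signature of generate_prompt_context() are not shown here. Only the placeholder name, its position between target description and additional context, and the "Not provided." fallback come from the PR description.

```python
from typing import Dict, Optional

# Hypothetical, shortened template; the real prompt text is not shown in this PR page.
PROMPT_TEMPLATE_SKETCH = (
    "Target description: {target_description}\n"
    "Dataset statistics:\n{dataset_statistics}\n"
    "Additional context: {additional_context}\n"
)

NOT_PROVIDED = "Not provided."


def build_prompt_context_sketch(
    target_description: Optional[str] = None,
    dataset_statistics: Optional[str] = None,
    additional_context: Optional[str] = None,
) -> Dict[str, str]:
    """Fill every placeholder, substituting the fallback for missing values."""
    return {
        "target_description": target_description or NOT_PROVIDED,
        "dataset_statistics": dataset_statistics or NOT_PROVIDED,
        "additional_context": additional_context or NOT_PROVIDED,
    }


# Without statistics, the block degrades to the fallback text.
prompt_without_stats = PROMPT_TEMPLATE_SKETCH.format(
    **build_prompt_context_sketch("House price in USD")
)

# With statistics, the pre-formatted block is inserted verbatim.
prompt_with_stats = PROMPT_TEMPLATE_SKETCH.format(
    **build_prompt_context_sketch(
        "House price in USD", dataset_statistics="Target statistics\n..."
    )
)
```

Supplying a default for every key means the template can always be formatted, so callers that do not pass y to fit() still get a well-formed prompt.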