Feature/add dataset stats to prompt #29

Merged: RobertoCorti merged 5 commits into main from feature/add-dataset-stats-to-prompt on Mar 3, 2026

@RobertoCorti (Owner)

Summary

Auto-compute dataset statistics in fit() and inject them into the LLM prompt as a {dataset_statistics} block, giving the model real signal about the data before it suggests transformations — with zero extra effort from the user.


What's in the stats block

Target statistics

  • Regression: min, max, mean, std
  • Classification: class counts + balance %
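A minimal sketch of how the target block above could be computed with pandas. The helper name `target_statistics` and the column labels are hypothetical and are not taken from this PR's `_format_dataset_statistics`; the PR renders its tables with `to_markdown()`, which is left out here so the sketch does not depend on tabulate.

```python
import pandas as pd


def target_statistics(y: pd.Series, problem_type: str) -> pd.DataFrame:
    """Hypothetical helper: summarize the target column for the prompt."""
    if problem_type == "regression":
        # One row holding min, max, mean and std of the target.
        stats = y.agg(["min", "max", "mean", "std"])
        return stats.to_frame(name="target").T
    # Classification: per-class counts plus each class's share in percent.
    counts = y.value_counts()
    balance = (counts / counts.sum() * 100).round(1)
    return pd.DataFrame({"count": counts, "balance_%": balance})
```

Calling `target_statistics(y, "classification").to_string()` would give a plain-text table that could be embedded in a prompt.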

Feature statistics

(numeric columns, features as rows — fixed 9 columns regardless of dataset size)

  • Full describe() output + skewness per feature
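The "fixed 9 columns" figure follows from pandas itself: `describe()` on numeric data yields 8 statistics (count, mean, std, min, 25%, 50%, 75%, max), and skewness adds a ninth. A rough sketch, assuming a hypothetical helper name `feature_statistics` (the PR's actual formatter may differ in details):

```python
import pandas as pd


def feature_statistics(X: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: one row per numeric feature, 9 stat columns."""
    numeric = X.select_dtypes(include="number")
    if numeric.empty:
        # describe() raises on a frame with no columns, so return early.
        return pd.DataFrame()
    table = numeric.describe().T   # features become rows, 8 stat columns
    table["skew"] = numeric.skew() # ninth column: per-feature skewness
    return table
```

Because the features are rows, a wider dataset makes the table longer, not wider, which keeps the column layout stable in the prompt.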

Feature statistics vs target

(only when y is provided to fit())

  • Regression: Pearson correlation per feature
  • Classification: class-wise mean per feature (groupby(y).mean().T)
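The two branches above could look roughly like the following. This is a sketch under assumptions: `feature_vs_target` is a hypothetical name, and `X` and `y` are assumed to share the same index, since both `corrwith` and `groupby` align on it.

```python
import pandas as pd


def feature_vs_target(X: pd.DataFrame, y: pd.Series, problem_type: str) -> pd.DataFrame:
    """Hypothetical helper: relate each numeric feature to the target."""
    numeric = X.select_dtypes(include="number")
    if problem_type == "regression":
        # Pearson correlation (the pandas default) of each feature with y.
        return numeric.corrwith(y).to_frame(name="pearson_corr")
    # Classification: mean of each feature within each class,
    # transposed so that features are rows and classes are columns.
    return numeric.groupby(y).mean().T
```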

Changes

  • llm_interface.py

    • Added _format_dataset_statistics(X, y, problem_type) static method
    • Threaded dataset_statistics param through:
      • generate_prompt_context()
      • generate_engineered_features()
  • prompts.py

    • Injected {dataset_statistics} placeholder between target description and additional context
  • feature_engineer.py

    • fit() now accepts y: Optional[pd.Series]
    • Stats computed and passed through the call chain
    • Falls back to "Not provided." when y is absent
  • pyproject.toml

    • Added tabulate (required by to_markdown())
    • Added langchain-anthropic
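To illustrate the placeholder and its fallback, here is a simplified sketch. `PROMPT_TEMPLATE` and `build_prompt` are stand-ins invented for this example; the real template is `FEATURE_ENGINEERING_PROMPT` in prompts.py, and its wording is not reproduced here.

```python
from typing import Optional

# Stand-in template (assumption): only the placeholder order mirrors the PR,
# with {dataset_statistics} between the target description and the context.
PROMPT_TEMPLATE = (
    "Target description:\n{target_description}\n\n"
    "Dataset statistics:\n{dataset_statistics}\n\n"
    "Additional context:\n{additional_context}\n"
)


def build_prompt(
    target_description: str,
    dataset_statistics: Optional[str] = None,
    additional_context: Optional[str] = None,
) -> str:
    """Hypothetical helper: fill the template, using the fallback text."""
    return PROMPT_TEMPLATE.format(
        target_description=target_description,
        dataset_statistics=dataset_statistics or "Not provided.",
        additional_context=additional_context or "Not provided.",
    )
```

With this shape, callers that never pass statistics still get a well-formed prompt, which is presumably why the parameter is optional throughout the call chain.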

Tests

  • 8 new tests covering:
    • All branches of the stats formatter
    • Updated generate_prompt_context
    • fit() with and without y

Total tests: 121
Status: ✅ All green

…chment

Adds a static method on LLMInterface that computes and formats dataset
statistics as a human-readable markdown block:
- Target stats: min/max/mean/std for regression, class counts + balance % for classification
- Feature stats: transposed describe() + skewness (features as rows, fixed 9 columns)
- Feature vs target: Pearson corr per feature for regression; class-wise mean per feature for classification (groupby mean, transposed)

Uses pandas/tabulate only (no new runtime dependencies beyond tabulate).
- Add {dataset_statistics} placeholder to FEATURE_ENGINEERING_PROMPT after target_description
- Add dataset_statistics parameter to generate_prompt_context() and generate_engineered_features()
- Add y parameter to fit() in LLMFeatureEngineer; compute stats via _format_dataset_statistics() and pass through the call chain
- Falls back to "Not provided." when y is None or dataset_statistics is not supplied
- 4 tests for _format_dataset_statistics (regression, classification, y=None, no numeric cols)
- 2 tests for generate_prompt_context (dataset_statistics key present, default fallback)
- 2 tests for fit() (dataset_statistics forwarded to LLM with and without y)
@RobertoCorti RobertoCorti merged commit 2d1472c into main Mar 3, 2026
12 checks passed

1 participant