enh(skore): Enable data_source="all" in metrics accessor #1446

@sylvaincom

Is your feature request related to a problem? Please describe.

As a data scientist, I would like to check whether my model is overfitting by comparing its score on the train set with its score on the test set.

As of v0.9.1, to get the summary of metrics on the test set, I can do:

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from skore import EstimatorReport

X, y = fetch_california_housing(return_X_y=True, as_frame=True)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

ridge_report = EstimatorReport(
    make_pipeline(StandardScaler(), Ridge()),
    X_train=X_train,
    X_test=X_test,
    y_train=y_train,
    y_test=y_test,
)

ridge_report.metrics.summarize().frame()  # defaults to the test set

and on the train set:

ridge_report.metrics.summarize(data_source="train").frame() 

But I cannot get the metrics for both the train and test sets in the same dataframe.
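
For illustration, here is a minimal sketch of the kind of manual workaround this currently requires. It assumes that both summaries are plain pandas dataframes indexed by metric name; `combine_summaries` is a hypothetical helper that is not part of skore, and the numbers are placeholders, not real results.

```python
import pandas as pd


def combine_summaries(train: pd.DataFrame, test: pd.DataFrame) -> pd.DataFrame:
    """Place train and test metric summaries side by side.

    Hypothetical helper, not part of skore: it only assumes that both
    frames share the same index (one row per metric).
    """
    # keys= adds an outer column level naming the data source.
    return pd.concat([train, test], axis=1, keys=["train", "test"])


# Placeholder values for illustration only, not real results.
train_summary = pd.DataFrame({"Ridge": [0.5, 0.25]}, index=["R2", "RMSE"])
test_summary = pd.DataFrame({"Ridge": [0.75, 1.0]}, index=["R2", "RMSE"])

combined = combine_summaries(train_summary, test_summary)
print(combined)
```

With the report above, the two inputs would presumably be `ridge_report.metrics.summarize(data_source="train").frame()` and `ridge_report.metrics.summarize().frame()`, provided they return dataframes with matching indexes.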

Describe the solution you'd like

Have something like:

ridge_report.metrics.summarize(data_source="all").frame()

for both train and test sets, which would return this kind of dataframe:

[Image: example of the resulting dataframe]

Metadata

Labels

API 🧑‍💻: Improvement of the API facing users

ready for dev 💻: Issue specified enough and ready to be implemented
