From 806e7f7fa43e49b2cb51795e1c4eed0d2ac9517e Mon Sep 17 00:00:00 2001
From: Pavan
Date: Sat, 8 Jun 2024 16:04:15 +0530
Subject: [PATCH 1/6] updated the cross_validation.rst with k-fold cross
 validation method example that replaces example based on train_test_split()
 method

---
 docs/source/cross_validation.rst | 38 +++++++++++++++++++++++++-------
 1 file changed, 30 insertions(+), 8 deletions(-)

diff --git a/docs/source/cross_validation.rst b/docs/source/cross_validation.rst
index da8a0df97..bd9de70db 100644
--- a/docs/source/cross_validation.rst
+++ b/docs/source/cross_validation.rst
@@ -4,25 +4,47 @@ Cross Validation
 See the `scikit-learn cross validation documentation`_ for a fuller discussion
 of cross validation. This document only describes the extensions made to
 support Dask arrays.
 
-The simplest way to split one or more Dask arrays is with :func:`dask_ml.model_selection.train_test_split`:
-
+The simpler way to split a dataset into k-fold is with :func:`https://ml.dask.org/modules/generated/dask_ml.model_selection.KFold.html`
 .. ipython:: python
 
     import dask.array as da
+    from dask_ml.model_selection import KFold
     from dask_ml.datasets import make_regression
-    from dask_ml.model_selection import train_test_split
+    from dask_ml.linear_model import LinearRegression
+    from statistics import mean
+
+    X, y = make_regression(n_samples=200,  # choosing number of observations
+                           n_features=5,   # number of features
+                           random_state=0, # random seed
+                           chunks=20)      # partitions to be made
+
+The Dask kFold method splits the data into k consecutive sets of data. Here we specify k to be 5, hence, 5-fold cross validation
+.. ipython:: python
+
+    kf = KFold(n_splits=5)
+
+    train_scores: list[float] = []
+    test_scores: list[float] = []
 
-    X, y = make_regression(n_samples=125, n_features=4, random_state=0, chunks=50)
-    X
+    for i, j in kf.split(X):
+      X_train, X_test = X[i], X[j]
+      y_train, y_test = y[i], y[j]
+
+      model.fit(X_train, y_train)
+
+      train_score = model.score(X_train, y_train)
+      test_score = model.score(X_test, y_test)
+
+      train_scores.append(train_score)
+      test_scores.append(test_score)
 
 The interface for splitting Dask arrays is the same as scikit-learn's version.
 
 .. ipython:: python
 
-    X_train, X_test, y_train, y_test = train_test_split(X, y)
-    X_train  # A dask Array
+    print("mean training score:", mean(train_scores))
+    print("mean testing score:", mean(test_scores))
 
-    X_train.compute()[:3]
 
 While it's possible to pass dask arrays to :func:`sklearn.model_selection.train_test_split`, we recommend

From 47b3d31c2491824759fb47491a7cd122d0c933f7 Mon Sep 17 00:00:00 2001
From: Pavan
Date: Sat, 8 Jun 2024 16:20:14 +0530
Subject: [PATCH 2/6] updated the cross_validation.rst with k-fold cross
 validation method example

---
 docs/source/cross_validation.rst | 39 ++++++++++++++++++++++++--------
 1 file changed, 29 insertions(+), 10 deletions(-)

diff --git a/docs/source/cross_validation.rst b/docs/source/cross_validation.rst
index bd9de70db..89664b2f6 100644
--- a/docs/source/cross_validation.rst
+++ b/docs/source/cross_validation.rst
@@ -4,7 +4,28 @@ Cross Validation
 See the `scikit-learn cross validation documentation`_ for a fuller discussion
 of cross validation. This document only describes the extensions made to
 support Dask arrays.
 
-The simpler way to split a dataset into k-fold is with :func:`https://ml.dask.org/modules/generated/dask_ml.model_selection.KFold.html`
+The simplest way to split one or more Dask arrays is with :func:`dask_ml.model_selection.train_test_split`:
+
+.. ipython:: python
+
+    import dask.array as da
+    from dask_ml.datasets import make_regression
+    from dask_ml.model_selection import train_test_split
+
+    X, y = make_regression(n_samples=125, n_features=4, random_state=0, chunks=50)
+    X
+
+The interface for splitting Dask arrays is the same as scikit-learn's version.
+
+.. ipython:: python
+
+    X_train, X_test, y_train, y_test = train_test_split(X, y)
+    X_train  # A dask Array
+
+    X_train.compute()[:3]
+
+Here is another illustration of performing k-fold cross validation purely in Dask. Here a link to gather more information on k-fold cross validation :func:`https://ml.dask.org/modules/generated/dask_ml.model_selection.KFold.html`:
+
 .. ipython:: python
 
     import dask.array as da
     from dask_ml.model_selection import KFold
     from dask_ml.datasets import make_regression
     from dask_ml.linear_model import LinearRegression
     from statistics import mean
@@ -17,15 +38,17 @@ The simpler way to split a dataset into k-fold is with :func:`https://ml.dask.or
                            n_features=5,   # number of features
                            random_state=0, # random seed
                            chunks=20)      # partitions to be made
-
-The Dask kFold method splits the data into k consecutive sets of data. Here we specify k to be 5, hence, 5-fold cross validation
-.. ipython:: python
-
-    kf = KFold(n_splits=5)
 
     train_scores: list[float] = []
     test_scores: list[float] = []
 
+    model = LinearRegression()
+
+The Dask ``KFold`` method splits the data into k consecutive subsets of data. Here we specify k to be 5, hence 5-fold cross validation.
+
+.. ipython:: python
+    kf = KFold(n_splits=5)
+
     for i, j in kf.split(X):
       X_train, X_test = X[i], X[j]
       y_train, y_test = y[i], y[j]
@@ -38,10 +61,6 @@ The Dask kFold method splits the data into k consecutive sets of data. Here we s
       train_scores.append(train_score)
       test_scores.append(test_score)
 
-The interface for splitting Dask arrays is the same as scikit-learn's version.
-
-.. ipython:: python
-
     print("mean training score:", mean(train_scores))
     print("mean testing score:", mean(test_scores))

From 5a80184648ce627d1e9a1c44fa16f482e50d69fe Mon Sep 17 00:00:00 2001
From: Pavan
Date: Sun, 9 Jun 2024 10:58:37 +0530
Subject: [PATCH 3/6] Update docs/source/cross_validation.rst

Co-authored-by: Tom Augspurger
---
 docs/source/cross_validation.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/cross_validation.rst b/docs/source/cross_validation.rst
index 89664b2f6..06515fcfa 100644
--- a/docs/source/cross_validation.rst
+++ b/docs/source/cross_validation.rst
@@ -24,7 +24,7 @@ The interface for splitting Dask arrays is the same as scikit-learn's version.
 
     X_train.compute()[:3]
 
-Here is another illustration of performing k-fold cross validation purely in Dask. Here a link to gather more information on k-fold cross validation :func:`https://ml.dask.org/modules/generated/dask_ml.model_selection.KFold.html`:
+Here is another illustration of performing k-fold cross validation purely in Dask. For more information on k-fold cross validation, see :class:`dask_ml.model_selection.KFold`:
 
 .. ipython:: python

From c89ac72046ee5343bf241190339d520590a5cca7 Mon Sep 17 00:00:00 2001
From: Pavan
Date: Sun, 9 Jun 2024 10:58:49 +0530
Subject: [PATCH 4/6] Update docs/source/cross_validation.rst

Co-authored-by: Tom Augspurger
---
 docs/source/cross_validation.rst | 1 +
 1 file changed, 1 insertion(+)

diff --git a/docs/source/cross_validation.rst b/docs/source/cross_validation.rst
index 06515fcfa..afffaa431 100644
--- a/docs/source/cross_validation.rst
+++ b/docs/source/cross_validation.rst
@@ -46,6 +46,7 @@ Here is another illustration of performing k-fold cross validation purely in Das
 
 The Dask ``KFold`` method splits the data into k consecutive subsets of data. Here we specify k to be 5, hence 5-fold cross validation.
 
+
 .. ipython:: python
     kf = KFold(n_splits=5)

From 7c5099b7ad070c6cd6160d3db29fc03d401f0e4a Mon Sep 17 00:00:00 2001
From: Pavan
Date: Sun, 9 Jun 2024 11:33:27 +0530
Subject: [PATCH 5/6] incorporating the suggestions to cross_validation.rst

---
 docs/source/cross_validation.rst | 35 ++++++++++++++++----------------
 1 file changed, 18 insertions(+), 17 deletions(-)

diff --git a/docs/source/cross_validation.rst b/docs/source/cross_validation.rst
index afffaa431..6fa07001f 100644
--- a/docs/source/cross_validation.rst
+++ b/docs/source/cross_validation.rst
@@ -24,6 +24,17 @@ The interface for splitting Dask arrays is the same as scikit-learn's version.
 
     X_train.compute()[:3]
 
+While it's possible to pass dask arrays to :func:`sklearn.model_selection.train_test_split`, we recommend
+using the Dask version for performance reasons: the Dask version is faster
+for two reasons:
+
+First, **the Dask version shuffles blockwise**.
+In a distributed setting, shuffling *between* blocks may require sending large amounts of data between machines, which can be slow.
+However, if there's a strong pattern in your data, you'll want to perform a full shuffle.
+
+Second, the Dask version avoids allocating large intermediate NumPy arrays storing the indexes for slicing.
+For very large datasets, creating and transmitting ``np.arange(n_samples)`` can be expensive.
+
 Here is another illustration of performing k-fold cross validation purely in Dask. For more information on k-fold cross validation, see :class:`dask_ml.model_selection.KFold`:
 
 .. ipython:: python
@@ -51,31 +62,21 @@ The Dask ``KFold`` method splits the data into k consecutive subsets of data. He
     kf = KFold(n_splits=5)
 
     for i, j in kf.split(X):
-      X_train, X_test = X[i], X[j]
-      y_train, y_test = y[i], y[j]
+        X_train, X_test = X[i], X[j]
+        y_train, y_test = y[i], y[j]
 
-      model.fit(X_train, y_train)
+        model.fit(X_train, y_train)
 
-      train_score = model.score(X_train, y_train)
-      test_score = model.score(X_test, y_test)
+        train_score = model.score(X_train, y_train)
+        test_score = model.score(X_test, y_test)
 
-      train_scores.append(train_score)
-      test_scores.append(test_score)
+        train_scores.append(train_score)
+        test_scores.append(test_score)
 
     print("mean training score:", mean(train_scores))
     print("mean testing score:", mean(test_scores))
 
-While it's possible to pass dask arrays to :func:`sklearn.model_selection.train_test_split`, we recommend
-using the Dask version for performance reasons: the Dask version is faster
-for two reasons:
-
-First, **the Dask version shuffles blockwise**.
-In a distributed setting, shuffling *between* blocks may require sending large amounts of data between machines, which can be slow.
-However, if there's a strong pattern in your data, you'll want to perform a full shuffle.
-
-Second, the Dask version avoids allocating large intermediate NumPy arrays storing the indexes for slicing.
-For very large datasets, creating and transmitting ``np.arange(n_samples)`` can be expensive.
 
 
 .. _scikit-learn cross validation documentation: http://scikit-learn.org/stable/modules/cross_validation.html

From ec84a314047f633ff7e8d909d8ee5ce0721139d4 Mon Sep 17 00:00:00 2001
From: Pavan
Date: Tue, 18 Jun 2024 18:44:28 +0530
Subject: [PATCH 6/6] incorporated the suggestions

---
 docs/source/cross_validation.rst | 1 +
 1 file changed, 1 insertion(+)

diff --git a/docs/source/cross_validation.rst b/docs/source/cross_validation.rst
index 6fa07001f..511dd0419 100644
--- a/docs/source/cross_validation.rst
+++ b/docs/source/cross_validation.rst
@@ -59,6 +59,7 @@ The Dask ``KFold`` method splits the data into k consecutive subsets of data. He
 
 .. ipython:: python
+
     kf = KFold(n_splits=5)
 
     for i, j in kf.split(X):
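The documentation text added by this series says that ``KFold`` splits the data into k consecutive subsets. That index arithmetic can be illustrated without Dask at all: with shuffling disabled, the sample range is sliced into k contiguous blocks, and (per scikit-learn's documented behaviour, which dask-ml mirrors) the first ``n_samples % n_splits`` folds receive one extra sample. A minimal pure-Python sketch — the helper ``kfold_indices`` is hypothetical, for illustration only, not part of dask-ml:

```python
def kfold_indices(n_samples: int, n_splits: int):
    """Yield (train, test) index lists over k consecutive folds,
    mimicking KFold(shuffle=False)."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # the first `extra` folds get one extra sample
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

# With n_samples=200 and n_splits=5, as in the example above, every
# fold holds 40 test samples and the remaining 160 training samples.
folds = list(kfold_indices(200, 5))
print(len(folds), len(folds[0][1]), len(folds[0][0]))  # 5 40 160
```

Each ``(train, test)`` pair plays the role of the ``(i, j)`` index arrays that ``kf.split(X)`` yields in the example, and together the two lists always cover every sample exactly once.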