From 806e7f7fa43e49b2cb51795e1c4eed0d2ac9517e Mon Sep 17 00:00:00 2001
From: Pavan
Date: Sat, 8 Jun 2024 16:04:15 +0530
Subject: [PATCH 1/6] updated the cross_validation.rst with k-fold cross
 validation method example that replaces example based on train_test_split()
 method

---
 docs/source/cross_validation.rst | 38 +++++++++++++++++++++++++-------
 1 file changed, 30 insertions(+), 8 deletions(-)

diff --git a/docs/source/cross_validation.rst b/docs/source/cross_validation.rst
index da8a0df97..bd9de70db 100644
--- a/docs/source/cross_validation.rst
+++ b/docs/source/cross_validation.rst
@@ -4,25 +4,47 @@ Cross Validation
 See the `scikit-learn cross validation documentation`_ for a fuller discussion
 of cross validation. This document only describes the extensions made to
 support Dask arrays.
 
-The simplest way to split one or more Dask arrays is with :func:`dask_ml.model_selection.train_test_split`:
-
+The simpler way to split a dataset into k-fold is with :func:`https://ml.dask.org/modules/generated/dask_ml.model_selection.KFold.html`
 .. ipython:: python
 
     import dask.array as da
+    from dask_ml.model_selection import KFold
     from dask_ml.datasets import make_regression
-    from dask_ml.model_selection import train_test_split
+    from dask_ml.linear_model import LinearRegression
+    from statistics import mean
+
+    X, y = make_regression(n_samples=200,  # choosing number of observations
+                           n_features=5,   # number of features
+                           random_state=0, # random seed
+                           chunks=20)      # partitions to be made
+
+The Dask kFold method splits the data into k consecutive sets of data. Here we specify k to be 5, hence, 5-fold cross validation
+.. ipython:: python
+
+    kf = KFold(n_splits=5)
+
+    train_scores: list[float] = []
+    test_scores: list[float] = []
 
-    X, y = make_regression(n_samples=125, n_features=4, random_state=0, chunks=50)
-    X
+    for i, j in kf.split(X):
+      X_train, X_test = X[i], X[j]
+      y_train, y_test = y[i], y[j]
+
+      model.fit(X_train, y_train)
+
+      train_score = model.score(X_train, y_train)
+      test_score = model.score(X_test, y_test)
+
+      train_scores.append(train_score)
+      test_scores.append(test_score)
 
 The interface for splitting Dask arrays is the same as scikit-learn's version.
 
 .. ipython:: python
 
-    X_train, X_test, y_train, y_test = train_test_split(X, y)
-    X_train  # A dask Array
+    print("mean training score:", mean(train_scores))
+    print("mean testing score:", mean(test_scores))
 
-    X_train.compute()[:3]
 
 While it's possible to pass dask arrays to :func:`sklearn.model_selection.train_test_split`, we recommend

From 47b3d31c2491824759fb47491a7cd122d0c933f7 Mon Sep 17 00:00:00 2001
From: Pavan
Date: Sat, 8 Jun 2024 16:20:14 +0530
Subject: [PATCH 2/6] updated the cross_validation.rst with k-fold cross
 validation method example

---
 docs/source/cross_validation.rst | 39 ++++++++++++++++++++++++--------
 1 file changed, 29 insertions(+), 10 deletions(-)

diff --git a/docs/source/cross_validation.rst b/docs/source/cross_validation.rst
index bd9de70db..89664b2f6 100644
--- a/docs/source/cross_validation.rst
+++ b/docs/source/cross_validation.rst
@@ -4,7 +4,28 @@ Cross Validation
 See the `scikit-learn cross validation documentation`_ for a fuller discussion
 of cross validation. This document only describes the extensions made to
 support Dask arrays.
 
-The simpler way to split a dataset into k-fold is with :func:`https://ml.dask.org/modules/generated/dask_ml.model_selection.KFold.html`
+The simplest way to split one or more Dask arrays is with :func:`dask_ml.model_selection.train_test_split`:
+
+.. ipython:: python
+
+    import dask.array as da
+    from dask_ml.datasets import make_regression
+    from dask_ml.model_selection import train_test_split
+
+    X, y = make_regression(n_samples=125, n_features=4, random_state=0, chunks=50)
+    X
+
+The interface for splitting Dask arrays is the same as scikit-learn's version.
+
+.. ipython:: python
+
+    X_train, X_test, y_train, y_test = train_test_split(X, y)
+    X_train  # A dask Array
+
+    X_train.compute()[:3]
+
+Here is another illustration of performing k-fold cross validation purely in Dask. Here a link to gather more information on k-fold cross validation :func:`https://ml.dask.org/modules/generated/dask_ml.model_selection.KFold.html`:
+
 .. ipython:: python
 
     import dask.array as da
     from dask_ml.model_selection import KFold
     from dask_ml.datasets import make_regression
     from dask_ml.linear_model import LinearRegression
     from statistics import mean
@@ -17,15 +38,17 @@ The simpler way to split a dataset into k-fold is with :func:`https://ml.dask.or
                            n_features=5,   # number of features
                            random_state=0, # random seed
                            chunks=20)      # partitions to be made
-
-The Dask kFold method splits the data into k consecutive sets of data. Here we specify k to be 5, hence, 5-fold cross validation
-.. ipython:: python
-
-    kf = KFold(n_splits=5)
 
     train_scores: list[float] = []
     test_scores: list[float] = []
 
+    model = LinearRegression()
+
+The Dask ``KFold`` method splits the data into k consecutive subsets of data. Here we specify k to be 5, hence 5-fold cross validation.
+
+.. ipython:: python
+    kf = KFold(n_splits=5)
+
     for i, j in kf.split(X):
       X_train, X_test = X[i], X[j]
       y_train, y_test = y[i], y[j]
@@ -38,10 +61,6 @@ The Dask kFold method splits the data into k consecutive sets of data. Here we s
       train_scores.append(train_score)
       test_scores.append(test_score)
 
-The interface for splitting Dask arrays is the same as scikit-learn's version.
-
-.. ipython:: python
-
     print("mean training score:", mean(train_scores))
     print("mean testing score:", mean(test_scores))

From 5a80184648ce627d1e9a1c44fa16f482e50d69fe Mon Sep 17 00:00:00 2001
From: Pavan
Date: Sun, 9 Jun 2024 10:58:37 +0530
Subject: [PATCH 3/6] Update docs/source/cross_validation.rst

Co-authored-by: Tom Augspurger
---
 docs/source/cross_validation.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/cross_validation.rst b/docs/source/cross_validation.rst
index 89664b2f6..06515fcfa 100644
--- a/docs/source/cross_validation.rst
+++ b/docs/source/cross_validation.rst
@@ -24,7 +24,7 @@ The interface for splitting Dask arrays is the same as scikit-learn's version.
 
     X_train.compute()[:3]
 
-Here is another illustration of performing k-fold cross validation purely in Dask. Here a link to gather more information on k-fold cross validation :func:`https://ml.dask.org/modules/generated/dask_ml.model_selection.KFold.html`:
+Here is another illustration of performing k-fold cross validation purely in Dask. For more information on k-fold cross validation, see :class:`dask_ml.model_selection.KFold`:
 
 .. ipython:: python

From c89ac72046ee5343bf241190339d520590a5cca7 Mon Sep 17 00:00:00 2001
From: Pavan
Date: Sun, 9 Jun 2024 10:58:49 +0530
Subject: [PATCH 4/6] Update docs/source/cross_validation.rst

Co-authored-by: Tom Augspurger
---
 docs/source/cross_validation.rst | 1 +
 1 file changed, 1 insertion(+)

diff --git a/docs/source/cross_validation.rst b/docs/source/cross_validation.rst
index 06515fcfa..afffaa431 100644
--- a/docs/source/cross_validation.rst
+++ b/docs/source/cross_validation.rst
@@ -46,6 +46,7 @@ Here is another illustration of performing k-fold cross validation purely in Das
 
 The Dask ``KFold`` method splits the data into k consecutive subsets of data. Here we specify k to be 5, hence 5-fold cross validation.
 
+
 .. ipython:: python
     kf = KFold(n_splits=5)

From 7c5099b7ad070c6cd6160d3db29fc03d401f0e4a Mon Sep 17 00:00:00 2001
From: Pavan
Date: Sun, 9 Jun 2024 11:33:27 +0530
Subject: [PATCH 5/6] incorporating the suggestions to cross_validation.rst

---
 docs/source/cross_validation.rst | 35 ++++++++++++++++----------------
 1 file changed, 18 insertions(+), 17 deletions(-)

diff --git a/docs/source/cross_validation.rst b/docs/source/cross_validation.rst
index afffaa431..6fa07001f 100644
--- a/docs/source/cross_validation.rst
+++ b/docs/source/cross_validation.rst
@@ -24,6 +24,17 @@ The interface for splitting Dask arrays is the same as scikit-learn's version.
 
     X_train.compute()[:3]
 
+While it's possible to pass dask arrays to :func:`sklearn.model_selection.train_test_split`, we recommend
+using the Dask version for performance reasons: the Dask version is faster
+for two reasons:
+
+First, **the Dask version shuffles blockwise**.
+In a distributed setting, shuffling *between* blocks may require sending large amounts of data between machines, which can be slow.
+However, if there's a strong pattern in your data, you'll want to perform a full shuffle.
+
+Second, the Dask version avoids allocating large intermediate NumPy arrays storing the indexes for slicing.
+For very large datasets, creating and transmitting ``np.arange(n_samples)`` can be expensive.
+
 Here is another illustration of performing k-fold cross validation purely in Dask. For more information on k-fold cross validation, see :class:`dask_ml.model_selection.KFold`:
 
 .. ipython:: python
@@ -51,31 +62,21 @@ The Dask ``KFold`` method splits the data into k consecutive subsets of data. He
     kf = KFold(n_splits=5)
 
     for i, j in kf.split(X):
-      X_train, X_test = X[i], X[j]
-      y_train, y_test = y[i], y[j]
+        X_train, X_test = X[i], X[j]
+        y_train, y_test = y[i], y[j]
 
-      model.fit(X_train, y_train)
+        model.fit(X_train, y_train)
 
-      train_score = model.score(X_train, y_train)
-      test_score = model.score(X_test, y_test)
+        train_score = model.score(X_train, y_train)
+        test_score = model.score(X_test, y_test)
 
-      train_scores.append(train_score)
-      test_scores.append(test_score)
+        train_scores.append(train_score)
+        test_scores.append(test_score)
 
     print("mean training score:", mean(train_scores))
     print("mean testing score:", mean(test_scores))
 
-While it's possible to pass dask arrays to :func:`sklearn.model_selection.train_test_split`, we recommend
-using the Dask version for performance reasons: the Dask version is faster
-for two reasons:
-
-First, **the Dask version shuffles blockwise**.
-In a distributed setting, shuffling *between* blocks may require sending large amounts of data between machines, which can be slow.
-However, if there's a strong pattern in your data, you'll want to perform a full shuffle.
-
-Second, the Dask version avoids allocating large intermediate NumPy arrays storing the indexes for slicing.
-For very large datasets, creating and transmitting ``np.arange(n_samples)`` can be expensive.
 
 
 .. _scikit-learn cross validation documentation: http://scikit-learn.org/stable/modules/cross_validation.html

From ec84a314047f633ff7e8d909d8ee5ce0721139d4 Mon Sep 17 00:00:00 2001
From: Pavan
Date: Tue, 18 Jun 2024 18:44:28 +0530
Subject: [PATCH 6/6] incorporated the suggestions

---
 docs/source/cross_validation.rst | 1 +
 1 file changed, 1 insertion(+)

diff --git a/docs/source/cross_validation.rst b/docs/source/cross_validation.rst
index 6fa07001f..511dd0419 100644
--- a/docs/source/cross_validation.rst
+++ b/docs/source/cross_validation.rst
@@ -59,6 +59,7 @@ The Dask ``KFold`` method splits the data into k consecutive subsets of data. He
 
 .. ipython:: python
+
     kf = KFold(n_splits=5)
 
     for i, j in kf.split(X):
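The documentation text added by this series says that ``KFold`` splits the data into k consecutive subsets. That index arithmetic can be illustrated without Dask at all: with shuffling disabled, the sample range is sliced into k contiguous blocks, and (per scikit-learn's documented behaviour, which dask-ml mirrors) the first ``n_samples % n_splits`` folds receive one extra sample. A minimal pure-Python sketch — the helper ``kfold_indices`` is hypothetical, for illustration only, not part of dask-ml:

```python
def kfold_indices(n_samples: int, n_splits: int):
    """Yield (train, test) index lists over k consecutive folds,
    mimicking KFold(shuffle=False)."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # the first `extra` folds get one extra sample
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

# With n_samples=200 and n_splits=5, as in the example above, every
# fold holds 40 test samples and the remaining 160 training samples.
folds = list(kfold_indices(200, 5))
print(len(folds), len(folds[0][1]), len(folds[0][0]))  # 5 40 160
```

Each ``(train, test)`` pair plays the role of the ``(i, j)`` index arrays that ``kf.split(X)`` yields in the example, and together the two lists always cover every sample exactly once.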