diff --git a/about.md b/about.md index aa89f2c..7b95222 100644 --- a/about.md +++ b/about.md @@ -36,7 +36,7 @@ SmartCore is developed and maintained by Smartcore developers. Our goal is to bu ### Version 0.1.0 -This is our first realease, enjoy! In this version you'll find: +This is our first release, enjoy! In this version you'll find: - KNN + distance metrics (Euclidian, Minkowski, Manhattan, Hamming, Mahalanobis) - Linear Regression (OLS) - Logistic Regression @@ -53,4 +53,4 @@ This is our first realease, enjoy! In this version you'll find: - LU, QR, SVD, EVD - Evaluation Metrics -Please let us know if you found a problem. The best way to report it is to [open an issue](https://github.com/smartcorelib/smartcore/issues) on GitHub. \ No newline at end of file +Please let us know if you found a problem. The best way to report it is to [open an issue](https://github.com/smartcorelib/smartcore/issues) on GitHub. diff --git a/user_guide/developer.md b/user_guide/developer.md index ee56c30..73dd36b 100644 --- a/user_guide/developer.md +++ b/user_guide/developer.md @@ -26,7 +26,7 @@ If you found a bug or problem please do not hesitate to report it by [opening an The best way to request a new feature is by [opening an issue](https://github.com/smartcorelib/smartcore/issues) in GitHub. When you submit your idea, please keep in mind these recommendations: -* If you are requesting new algorithm, please add references to papers describing this algorithm. If you have a particular implementation in mind, feel free to share references to it as well. If not, we will do our best to find the best implementation available ourselves. +* If you are requesting a new algorithm, please add references to papers describing this algorithm. If you have a particular implementation in mind, feel free to share references to it as well. If not, we will do our best to find the best implementation available ourselves. * Please tell us why this feature is important to you. ## Contributing code @@ -43,6 +43,6 @@ To make sure your PR is swiftly approved and merged, please make sure new featur ## Changes to documentation -If you found a problem in documentation please do not hesitate to correct it and submit your proposed change as a [pull request](https://github.com/smartcorelib/smartcore/pulls) (PR) in GutHub. At this moment documentation is found in several places: [API](https://github.com/smartcorelib/smartcore), [website](https://github.com/smartcorelib/smartcorelib.org) and [examples](https://github.com/smartcorelib/smartcore-examples). Please submit your pull request to a corresponding repository. If your change is a minor correction (e.g. misspelling or grammar error) there is no need to open a separate issue describing what you've found, just correct it and submit your PR! +If you found a problem in documentation please do not hesitate to correct it and submit your proposed change as a [pull request](https://github.com/smartcorelib/smartcore/pulls) (PR) in GitHub. At this moment documentation is found in several places: [API](https://github.com/smartcorelib/smartcore), [website](https://github.com/smartcorelib/smartcorelib.org) and [examples](https://github.com/smartcorelib/smartcore-examples). Please submit your pull request to a corresponding repository. If your change is a minor correction (e.g. misspelling or grammar error) there is no need to open a separate issue describing what you've found, just correct it and submit your PR! 
Another way to make a change in documentation is to [open an issue](https://github.com/smartcorelib/smartcore/issues) in GitHub. diff --git a/user_guide/model_selection.md b/user_guide/model_selection.md index 75a9a99..3a92b94 100644 --- a/user_guide/model_selection.md +++ b/user_guide/model_selection.md @@ -8,7 +8,7 @@ description: Tools for model selection and evaluation. K-fold cross validation, *SmartCore* comes with a lot of easy-to-use algorithms and it is straightforward to fit many different machine learning models to a given dataset. Once you have many algorithms to choose from the question becomes how to choose the best machine learning model among a range of different models that you can use for your data. The problem of choosing the right model becomes even harder if you consider many different combinations of hyperparameters for each algorithm. -Model selection is the process of selecting one final machine learning model from among a collection of candidate models for you problem at hand. The process of assessing a model’s performance is known as model evaluation. +Model selection is the process of selecting one final machine learning model from among a collection of candidate models for the problem at hand. The process of assessing a model’s performance is known as model evaluation. [K-fold Cross-Validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics)) (k-fold CV) is a commonly used technique for model selection and evaluation. Another alternative is to split your data into three separate sets: _training_, _validation_, _test_. You use the _training_ set to train your model and _validation_ set for model selection and hyperparameter tuning. The _test_ set can be used to get an unbiased estimate of model performance. @@ -34,12 +34,12 @@ let y = boston_data.target; let (x_train, x_test, y_train, y_test) = train_test_split(&x, &y, 0.2, true); ``` -While a simple test/train split method is good for a very large dataset, the test score dependents on how the data is split into train and test sets. To get a better indication of how well your model performs on new data use k-fold CV. +While a simple test/train split method is good for a very large dataset, the test score depends on how the data is split into train and test sets. To get a better indication of how well your model performs on new data use k-fold CV. To evaluate performance of your model with k-fold CV use [`cross_validate`]({{site.api_base_url}}/model_selection/fn.cross_validate.html) function. -This function splits datasets up into k groups. One of the groups is used as the test set and the rest are used as the training set. The model is trained on the training set and evaluated on the test set. Then the process is repeated until each unique group as been used as the test set. +This function splits datasets up into k groups. One of the groups is used as the test set and the rest are used as the training set. The model is trained on the training set and evaluated on the test set. Then the process is repeated until each unique group has been used as the test set. -For example, when you split your dataset into 3 folds, as in Figure 1, `cross_validate` will fit and evaluate your model 3 times. First, the function will use folds 2 and 3 to train your model and fold 1 to evaluate its performance. On the second run, the function will take folds 1 and 3 for trainig and fold 2 for evaluation. 
+For example, when you split your dataset into 3 folds, as in Figure 1, `cross_validate` will fit and evaluate your model 3 times. First, the function will use folds 2 and 3 to train your model and fold 1 to evaluate its performance. On the second run, the function will take folds 1 and 3 for training and fold 2 for evaluation.
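If it helps to see the bookkeeping, here is a minimal, library-free sketch of how k-fold splitting rotates samples through the test role. The `k_fold_indices` helper below is hypothetical and written only for illustration; the real `cross_validate` function performs the splitting, fitting and scoring for you.

```rust
// A minimal, library-free sketch of the k-fold bookkeeping described above.
// `k_fold_indices` is a hypothetical helper written for illustration only;
// SmartCore's `cross_validate` does the splitting, fitting and scoring for you.
fn k_fold_indices(n_samples: usize, k: usize) -> Vec<(Vec<usize>, Vec<usize>)> {
    (0..k)
        .map(|fold| {
            // every k-th sample, starting at `fold`, is held out for testing
            let test: Vec<usize> = (0..n_samples).filter(|i| i % k == fold).collect();
            let train: Vec<usize> = (0..n_samples).filter(|i| i % k != fold).collect();
            (train, test)
        })
        .collect()
}

fn main() {
    // 9 samples, 3 folds: every sample is used for evaluation exactly once
    for (fold, (train, test)) in k_fold_indices(9, 3).iter().enumerate() {
        println!("fold {}: train on {:?}, evaluate on {:?}", fold, train, test);
    }
}
```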
k-fold CV @@ -95,7 +95,7 @@ We also keep toy datasets behind the `datasets` feature flag. Feature `datasets` smartcore = { version = "0.1.0", default-features = false} ``` -When feature flag `datasets` is enabled you'l get these datasets: +When the feature flag `datasets` is enabled you'll get these datasets: {:.table .table-striped .table-bordered} | Dataset | Description | Samples | Attributes | Type | diff --git a/user_guide/quick_start.md b/user_guide/quick_start.md index 6fff153..7be2534 100644 --- a/user_guide/quick_start.md +++ b/user_guide/quick_start.md @@ -19,7 +19,7 @@ All of these algorithms are implemented in Rust. Why another machine learning library for Rust, you might ask? While there are at least three [general-purpose ML libraries](http://www.arewelearningyet.com/) for Rust, most of these libraries either do not support all of the algorithms that are implemented in *SmartCore* or aren't integrated with [nalgebra](https://nalgebra.org/) and [ndarray](https://github.com/rust-ndarray/ndarray). -All algorithms in *SmartCore* works well with both libraries. You can also use standard Rust vectors with all of the algorithms implemented here if you prefer to have minimum number of dependencies in your code. +All algorithms in *SmartCore* work well with both libraries. You can also use standard Rust vectors with all of the algorithms implemented here if you prefer to have a minimum number of dependencies in your code. We developed *SmartCore* to promote scientific computing in Rust. Our goal is to build an open-source library that has accurate, numerically stable, and well-documented implementations of the most well-known and widely used machine learning methods. @@ -96,9 +96,9 @@ Our performance metric (accuracy) went up two percentage points! Nice work! ## High-level overview -Majority of machine learning algorithms rely on linear algebra routines and optimization methods to fit a model to a dataset or to make a prediction from new data. There are many crates for linear algebra and optimization in Rust but SmartCore does not has a hard dependency on any of these crates. Instead, machine learning algorithms in *SmartCore* use an abstraction layer where operations on multidimensional arrays and maximization/minimization routines are defined. This approach allow us to quickly integrate with any new type of matrix or vector as long as it implements all abstract methods from this layer. +The majority of machine learning algorithms rely on linear algebra routines and optimization methods to fit a model to a dataset or to make a prediction from new data. There are many crates for linear algebra and optimization in Rust but SmartCore does not have a hard dependency on any of these crates. Instead, machine learning algorithms in *SmartCore* use an abstraction layer where operations on multidimensional arrays and maximization/minimization routines are defined. This approach allows us to quickly integrate with any new type of matrix or vector as long as it implements all abstract methods from this layer. -Functions from optimization module are not available directly but we plan to make optimization library public once it is mature enough. +Functions from the optimization module are not available directly but we plan to make the optimization library public once it is mature enough. While functions from [linear algebra module]({{site.api_base_url}}/linalg/index.html) are public you should not use them directly because this module is still unstable.
We keep this interface open to let anyone add implementations of other types of matrices that are currently not supported by *SmartCore*. Please see [Developer's Guide]({{ site.baseurl }}/user_guide/developer.html) if you want to add your favourite matrix type to *SmartCore*. @@ -111,9 +111,9 @@ Figure 1 shows 3 layers with abstract linear algebra and optimization functions ### API -All algorithms in *SmartCore* implement the same inrefrace when it comes to fitting an algorithm to your dataset or making a prediction from new data. All core interfaces are defined in the [api module]({{site.api_base_url}}/api/index.html). +All algorithms in *SmartCore* implement the same interface when it comes to fitting an algorithm to your dataset or making a prediction from new data. All core interfaces are defined in the [api module]({{site.api_base_url}}/api/index.html). -There is a static function `fit` that fits an algorithm to your data. This function is defined in two places, [`SupervisedEstimator`]({{ site.api_base_url }}/api/trait.SupervisedEstimator.html) and [`UnsupervisedEstimator`]({{ site.api_base_url }}/api/trait.UnsupervisedEstimator.html), one is used for supervised learning and another for unsupervised learning. Both estimators takes you training data and hyperparameters for the algorithm and produce a fully trained instance of the estimator. The only difference between these two traits is that `SupervisedEstimator` requires training target values in addition to training predictors to fit an algorithm to your data. +There is a static function `fit` that fits an algorithm to your data. This function is defined in two places, [`SupervisedEstimator`]({{ site.api_base_url }}/api/trait.SupervisedEstimator.html) and [`UnsupervisedEstimator`]({{ site.api_base_url }}/api/trait.UnsupervisedEstimator.html), one is used for supervised learning and another for unsupervised learning. Both estimators take your training data and hyperparameters for the algorithm and produce a fully trained instance of the estimator. The only difference between these two traits is that `SupervisedEstimator` requires training target values in addition to training predictors to fit an algorithm to your data. A function `predict` is defined in the [`Predictor`]({{ site.api_base_url }}/api/trait.Predictor.html) trait and is used to predict labels or target values from new data. All mandatory parameters of the model are declared as parameters of function `fit`. All optional parameters are hidden behind `Default::default()`. @@ -208,4 +208,4 @@ If you are done reading through this page we would recommend to go to a specific * [Supervised Learning]({{ site.baseurl }}/user_guide/supervised.html), in this section you will find tree-based, linear and KNN models. * [Unsupervised Learning]({{ site.baseurl }}/user_guide/unsupervised.html), unsupervised methods like clustering and matrix decomposition methods. * [Model Selection]({{ site.baseurl }}/user_guide/model_selection.html), varios metrics for model evaluation. -* [Developer's Guide]({{ site.baseurl }}/user_guide/developer.html), would you like to contribute? Here you will find useful guidelines and rubrics to consider. \ No newline at end of file +* [Developer's Guide]({{ site.baseurl }}/user_guide/developer.html), would you like to contribute? Here you will find useful guidelines and rubrics to consider. 
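Before moving on to the supervised learning guide, the `fit`/`predict` contract described in the API section above can be pictured with a deliberately simplified sketch. The traits below are hypothetical stand-ins, not the real `SupervisedEstimator` and `Predictor` definitions (which are generic over matrix and scalar types); they only show the shape of the interface.

```rust
// Hypothetical, heavily simplified stand-ins for the fit/predict contract.
// The real SmartCore traits are generic over matrix and scalar types;
// this sketch only shows the shape of the interface.
trait SupervisedEstimatorSketch<X, Y, P> {
    fn fit(x: &X, y: &Y, parameters: P) -> Self;
}

trait PredictorSketch<X, Y> {
    fn predict(&self, x: &X) -> Y;
}

// A toy model that always predicts the mean of the training targets.
struct MeanModel {
    mean: f64,
}

impl SupervisedEstimatorSketch<Vec<Vec<f64>>, Vec<f64>, ()> for MeanModel {
    fn fit(_x: &Vec<Vec<f64>>, y: &Vec<f64>, _parameters: ()) -> Self {
        MeanModel {
            mean: y.iter().sum::<f64>() / y.len() as f64,
        }
    }
}

impl PredictorSketch<Vec<Vec<f64>>, Vec<f64>> for MeanModel {
    fn predict(&self, x: &Vec<Vec<f64>>) -> Vec<f64> {
        vec![self.mean; x.len()]
    }
}

fn main() {
    let x = vec![vec![1.0, 2.0], vec![3.0, 4.0]];
    let y = vec![10.0, 20.0];
    // same call shape as the real API: fit(x, y, parameters), then predict(x)
    let model = MeanModel::fit(&x, &y, ());
    println!("{:?}", model.predict(&x)); // [15.0, 15.0]
}
```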
diff --git a/user_guide/supervised.md b/user_guide/supervised.md index 9c3c316..6cb10fa 100644 --- a/user_guide/supervised.md +++ b/user_guide/supervised.md @@ -6,12 +6,12 @@ description: Supervised learning with Smartcore, including, but not limited to K # Supervised Learning -Most machine learning problems falls into one of two categories: _supervised_ and _unsupervised_. +Most machine learning problems fall into one of two categories: _supervised_ and _unsupervised_. This page describes supervised learning algorithms implemented in *SmartCore*. In supervised learning we build a model which can be written in the very general form as \\[Y = f(X) + \epsilon\\] -\\(X = (X_1,X_2,...,X_p)\\) are observations that consist of \\(p\\) different predictors, \\(Y\\) is an associated target values and +\\(X = (X_1,X_2,...,X_p)\\) are observations that consist of \\(p\\) different predictors, \\(Y\\) is an associated target value and \\(\epsilon\\) is a random error term, which is independent of \\(X\\) and has zero mean. We fit an unknown function \\(f\\) to our data to predict the response for future observations or better understand the relationship between the response and the predictors. @@ -26,7 +26,7 @@ To make a prediction use `predict` method that takes new observations as `x` and ## K Nearest Neighbors -K-nearest neighbors (KNN) is one of the simplest and best-known non-parametric classification and regression method. +K-nearest neighbors (KNN) is one of the simplest and best-known non-parametric classification and regression methods. KNN does not require training. The algorithm simply stores the entire dataset and then uses this dataset to make predictions. More formally, @@ -34,11 +34,11 @@ given a positive integer \\(K\\) and a test observation \\(x_0\\), the KNN class \\[ Pr(Y=j \vert X=x_0) = \frac{1}{K} \sum_{i \in N_0} I(y_i=j) \\] -KNN Regressor is closely related to the KNN classifier. It estimates target value using the average of all the reponses in \\(N_0\\), i.e. +KNN Regressor is closely related to the KNN classifier. It estimates a target value using the average of all the responses in \\(N_0\\), i.e. \\[ \hat{y} = \frac{1}{K} \sum_{i \in N_0} y_i \\] -The choice of \\(K\\) is very important. \\(K\\) can be found by tuning algorithm on a holdout dataset. It is a good idea to try many different values for \\(K\\) (e.g. values from 1 to 21) and see which value gives the best test error rate. +The choice of \\(K\\) is very important. \\(K\\) can be found by tuning the algorithm on a holdout dataset. It is a good idea to try many different values for \\(K\\) (e.g. values from 1 to 21) and see which value gives the best test error rate. To determine which of the \\(K\\) instances in the training dataset are most similar to a new input a [distance metric]({{site.api_base_url}}/math/distance/index.html) is used. For real-valued input variables, the most popular distance metric is [Euclidean distance]({{site.api_base_url}}/math/distance/euclidian/index.html). You can choose the best distance metric based on the properties of your data. If you are unsure, you can experiment with different distance metrics and different values of \\(K\\) together and see which mix results in the most accurate models. @@ -78,11 +78,11 @@ let y_hat_knn = KNNClassifier::fit( println!("AUC: {}", roc_auc_score(&y_test, &y_hat_knn)); ``` -Default value of \\(K\\) is 3. 
If you want to change value of this and other parameters replace `Default::default()` with an instance of [`KNNClassifierParameters`]({{site.api_base_url}}/neighbors/knn_classifier/struct.KNNClassifierParameters.html). +The default value of \\(K\\) is 3. If you want to change this value or other parameters, replace `Default::default()` with an instance of [`KNNClassifierParameters`]({{site.api_base_url}}/neighbors/knn_classifier/struct.KNNClassifierParameters.html). ### Nearest Neighbors Regression -KNN Regressor, implemented in [`KNNClassifier`]({{site.api_base_url}}/neighbors/knn_regressor/struct.KNNRegressor.html) is very similar to KNN Classifier, the only difference is that returned value is a real value instead of class label. To fit `KNNRegressor` to [Boston Housing]({{site.api_base_url}}/dataset/boston/index.html) dataset: +KNN Regressor, implemented in [`KNNRegressor`]({{site.api_base_url}}/neighbors/knn_regressor/struct.KNNRegressor.html), is very similar to KNN Classifier; the only difference is that the returned value is a real value instead of a class label. To fit `KNNRegressor` to the [Boston Housing]({{site.api_base_url}}/dataset/boston/index.html) dataset: ```rust use smartcore::dataset::*; @@ -120,19 +120,19 @@ As with KNN Classifier you can change value of k and other parameters by passing ### Nearest Neighbor Algorithms -The computational complexity of KNN increases with the size of the training dataset. This is because every time prediction is made algorithm has to search through all stored samples to find K nearest neighbors. Efficient implementation of KNN requires special data structure, like [CoverTree](https://en.wikipedia.org/wiki/Cover_tree) to speed up look-up of nearest neighbors during prediction. +The computational complexity of KNN increases with the size of the training dataset. This is because every time a prediction is made the algorithm has to search through all stored samples to find the K nearest neighbors. Efficient implementations of KNN require a special data structure, like [CoverTree](https://en.wikipedia.org/wiki/Cover_tree), to speed up the look-up of nearest neighbors during prediction. -Cover Tree is the default algorithm for KNN regressor and classifier. Change value of `algorithm` field of the `KNNRegressorParameters` or `KNNClassifierParameters` if you want to switch to brute force search method. +Cover Tree is the default algorithm for KNN regressor and classifier. Change the value of the `algorithm` field of the `KNNRegressorParameters` or `KNNClassifierParameters` if you want to switch to the brute force search method. #### Brute Force -The brute force nearest neighbor search is the simplest algorithm that calculates the distance from the query point to every other point in the dataset while maintaining a list of K nearest items in a [Binary Heap](https://en.wikipedia.org/wiki/Binary_heap#Search). This algorithms does not maintain any search data structure and results in \\(O(n)\\) search time, where \\(n\\) is number of samples. Brute force search algorithm is implemented in [LinearKNNSearch]({{site.api_base_url}}/algorithm/neighbour/linear_search/index.html). +The brute force nearest neighbor search is the simplest algorithm that calculates the distance from the query point to every other point in the dataset while maintaining a list of K nearest items in a [Binary Heap](https://en.wikipedia.org/wiki/Binary_heap#Search). This algorithm does not maintain any search data structure and results in \\(O(n)\\) search time, where \\(n\\) is the number of samples. 
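The scan described above can be sketched with plain Rust and the standard library's `BinaryHeap`. This is an illustrative, dependency-free example, not the actual `LinearKNNSearch` implementation:

```rust
use std::cmp::Ordering;
use std::collections::BinaryHeap;

// A candidate neighbor, ordered by distance so that the max-heap always
// exposes the *worst* of the current K candidates at its top.
struct Neighbor {
    dist: f64,
    idx: usize,
}

impl PartialEq for Neighbor {
    fn eq(&self, other: &Self) -> bool {
        self.dist == other.dist
    }
}
impl Eq for Neighbor {}
impl PartialOrd for Neighbor {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        self.dist.partial_cmp(&other.dist)
    }
}
impl Ord for Neighbor {
    fn cmp(&self, other: &Self) -> Ordering {
        self.partial_cmp(other).unwrap_or(Ordering::Equal)
    }
}

fn euclidean(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| (x - y).powi(2)).sum::<f64>().sqrt()
}

// O(n) scan over all samples, keeping only the K closest in a heap of size K.
fn brute_force_knn(data: &[Vec<f64>], query: &[f64], k: usize) -> Vec<usize> {
    let mut heap: BinaryHeap<Neighbor> = BinaryHeap::new();
    for (idx, sample) in data.iter().enumerate() {
        heap.push(Neighbor { dist: euclidean(sample, query), idx });
        if heap.len() > k {
            heap.pop(); // discard the farthest of the k + 1 candidates
        }
    }
    heap.into_iter().map(|n| n.idx).collect()
}

fn main() {
    let data = vec![vec![0.0, 0.0], vec![1.0, 1.0], vec![5.0, 5.0], vec![0.5, 0.2]];
    println!("3 nearest to the origin: {:?}", brute_force_knn(&data, &[0.0, 0.0], 3));
}
```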
Brute force search algorithm is implemented in [LinearKNNSearch]({{site.api_base_url}}/algorithm/neighbour/linear_search/index.html). #### Cover Tree -Although Brute Force algorithms is very simple approach it outperforms a lot of space partitioning approaches like [k-d tree](https://en.wikipedia.org/wiki/K-d_tree) on higher dimensional spaces. However, the brute-force approach quickly becomes infeasible as the dataset grows in size. To address inefficiencies of Brute Force other data structures are used that reduce the required number of distance calculations by efficiently encoding aggregate distance information for the sample. +Although the Brute Force algorithm is a very simple approach it outperforms a lot of space partitioning approaches like [k-d tree](https://en.wikipedia.org/wiki/K-d_tree) on higher dimensional spaces. However, the brute-force approach quickly becomes infeasible as the dataset grows in size. To address inefficiencies of Brute Force other data structures are used that reduce the required number of distance calculations by efficiently encoding aggregate distance information for the sample. -A [Cover Tree]({{site.api_base_url}}/algorithm/neighbour/cover_tree/index.html) is a tree data structure used for the partitiong of metric spaces to speed up nearest neighbor operations. Cover trees are fast in practice and have great theoretical properties: +A [Cover Tree]({{site.api_base_url}}/algorithm/neighbour/cover_tree/index.html) is a tree data structure used for the partitioning of metric spaces to speed up nearest neighbor operations. Cover trees are fast in practice and have great theoretical properties: * Construction: \\(O(c^6n\log n)\\) * Query: \\(O(c^{12}\log n)\\), @@ -142,7 +142,7 @@ where \\(n\\) is number of samples in a dataset and \\(c\\) denotes the expansio ### Distance Metrics -The choice of distance metric for KNN algorithm largely depends on properties of your data. If you don't know which distance to use go with Euclidean distance function or choose metric that gives you the best performance on a hold out test set. +The choice of distance metric for the KNN algorithm largely depends on properties of your data. If you don't know which distance to use go with Euclidean distance function or choose a metric that gives you the best performance on a hold out test set. There are many other distance measures that can be used with KNN in *SmartCore* {:.table .table-striped .table-bordered} @@ -200,7 +200,7 @@ let y_hat_lr = LinearRegression::fit(&x_train, &y_train, Default::default()) println!("MSE: {}", mean_squared_error(&y_test, &y_hat_lr)); ``` -By default, *SmartCore* uses [SVD Decomposition](https://en.wikipedia.org/wiki/Singular_value_decomposition) to find estimates of \\(\beta_i\\) that minimizes the sum of the squared residuals. While SVD Decomposition provides the most stable solution, you might decide to go with [QR Decomposition](https://en.wikipedia.org/wiki/QR_decomposition) since this approach is more computationally efficient than SVD Decomposition. For comparison, runtime complexity of SVD Decomposition is \\(O(mn^2 + n^3)\\) vs \\(O(mn^2 + n^3/3)\\) for QR decomposition, where \\(n\\) and \\(m\\) are dimentions of input matrix \\(X\\). Use `solver` attribute of the [`LinearRegressionParameters`]({{site.api_base_url}}/linear/linear_regression/struct.LinearRegressionParameters.html) to choose between decomposition methods. 
+By default, *SmartCore* uses [SVD Decomposition](https://en.wikipedia.org/wiki/Singular_value_decomposition) to find estimates of \\(\beta_i\\) that minimize the sum of the squared residuals. While the SVD provides the most stable solution, you might decide to go with [QR Decomposition](https://en.wikipedia.org/wiki/QR_decomposition) since this approach is more computationally efficient than the SVD. For comparison, runtime complexity of the SVD is \\(O(mn^2 + n^3)\\) vs \\(O(mn^2 + n^3/3)\\) for QR decomposition, where \\(n\\) and \\(m\\) are dimensions of the input matrix \\(X\\). Use the `solver` attribute of the [`LinearRegressionParameters`]({{site.api_base_url}}/linear/linear_regression/struct.LinearRegressionParameters.html) to choose between decomposition methods. ### Shrinkage Methods @@ -208,11 +208,11 @@ One way to avoid overfitting when you fit a linear model to your dataset is to u #### Ridge Regression -Ridge Regression is a regularized version of linear regression that adds L2 regularization term to the cost function: +Ridge Regression is a regularized version of linear regression that adds an L2 regularization term to the cost function: \\[\lambda \sum_{i=i}^n \beta_i^2\\] -where \\(\lambda \geq 0\\) is a tuning hyperparameter. If \\(\lambda\\) is close to 0, then it has no effects because Ridge Regression is similar to plain linear regression. As \\(\lambda\\) gets larger the shrinking effect on the weights gets stronger and the weights approach zero. +where \\(\lambda \geq 0\\) is a tuning hyperparameter. If \\(\lambda\\) is close to 0, the penalty has almost no effect and Ridge Regression behaves like plain linear regression. As \\(\lambda\\) gets larger the shrinking effect on the weights gets stronger and the weights approach zero. To fit Ridge Regression use structs from the [`ridge_regression`]({{site.api_base_url}}/linear/ridge_regression/index.html) module: @@ -252,7 +252,7 @@ println!( #### LASSO -LASSO stands for Least Absolute Shrinkage and Selection Operator. It is analogous to Ridge Regression but uses L1 regularization term instead of L2 regularization term: +LASSO stands for Least Absolute Shrinkage and Selection Operator. It is analogous to Ridge Regression but uses an L1 regularization term instead of an L2 regularization term: \\[\lambda \sum_{i=i}^n \mid \beta_i \mid \\] @@ -297,7 +297,7 @@ Elastic net linear regression uses the penalties from both the lasso and ridge t where \\(\lambda_1 = \\alpha l_{1r}\\), \\(\lambda_2 = \\alpha (1 - l_{1r})\\) and \\(l_{1r}\\) is the l1 ratio, elastic net mixing parameter. -elastic net combines both the L1 and L2 penalties during training, which can result in better performance than a model with either one or the other penalty on some problems. +Elastic net combines both the L1 and L2 penalties during training, which can result in better performance than a model with either one or the other penalty on some problems. ```rust use smartcore::dataset::*; @@ -337,7 +337,7 @@ println!( ### Logistic Regression -Logistic regression uses linear model to represent relashionship between dependent and explanatory variables. Unlike linear regression, output in logistic regression is modeled as a binary value (0 or 1) rather than a numeric value. to squish output between 0 and 1 [Sigmoid function](https://en.wikipedia.org/wiki/Sigmoid_function) is used. +Logistic regression uses a linear model to represent a relationship between dependent and explanatory variables. 
Unlike linear regression, the output in logistic regression is modeled as a binary value (0 or 1) rather than a numeric value. To squish the output between 0 and 1, the [Sigmoid function](https://en.wikipedia.org/wiki/Sigmoid_function) is used. In *SmartCore* Logistic Regression is represented by [`LogisticRegression`]({{site.api_base_url}}/linear/logistic_regression/index.html) struct that has methods `fit` and `predict`. @@ -369,11 +369,11 @@ let y_hat_lr = LogisticRegression::fit(&x_train, &y_train, Default::default()) println!("AUC: {}", roc_auc_score(&y_test, &y_hat_lr)); ``` -*SmartCore* uses [Limited-memory BFGS](https://en.wikipedia.org/wiki/Limited-memory_BFGS) routine to find optimal combination of \\(\beta_i\\) parameters. +*SmartCore* uses [Limited-memory BFGS](https://en.wikipedia.org/wiki/Limited-memory_BFGS) routine to find an optimal combination of \\(\beta_i\\) parameters. ## Support Vector Machines -Support Vector Machines (SVM) is perhaps one of the most popular machine learning algorithms. SVMs have been shown to perform well in a variety of settings, and is often considered one of the best "out of the box" classifiers. The support vector machines is a generalization of a simple and intuitive classifier called the maximal margin classifier. +Support Vector Machines (SVM) are perhaps one of the most popular machine learning algorithms. SVMs have been shown to perform well in a variety of settings, and are often considered one of the best "out of the box" classifiers. The support vector machine is a generalization of a simple and intuitive classifier called the maximal margin classifier. The maximal margin classifier is a hypothetical classifier that best explains how SVM works in practice. This classifier is based on the idea of a hyperplane, a flat affine subspace of dimension \\(p-1\\) that divides p-dimensional space into two halves. A hyperplane is defined by the equation @@ -431,7 +431,7 @@ Pre-defined kernel functions: ### Support Vector Classifier To fit a support vector classifier to your dataset use [`SVC`]({{site.api_base_url}}/svm/svc/index.html). -*SmartCore* uses an [approximate SVM solver](https://leon.bottou.org/projects/lasvm) to solve SMV optimization problem. +*SmartCore* uses an [approximate SVM solver](https://leon.bottou.org/projects/lasvm) to solve the SVM optimization problem. The solver reaches accuracies similar to that of a real SVM after performing two passes through the training examples. You can choose the number of passes through the data that the algorithm takes by changing the `epoch` parameter of the classifier. @@ -507,7 +507,7 @@ println!( Naive Bayes (NB) is a probabilistic machine learning algorithm based on the Bayes Theorem that assumes conditional independence between features given the value of the class variable. -Bayes Theorem states following relashionship between class label and data: +Bayes Theorem states the following relationship between class label and data: \\[ P(y \mid X) = \frac{P(y)P(X \mid y)}{P(X)} \\] @@ -573,7 +573,7 @@ println!("accuracy: {}", accuracy(&y, &y_hat)); // Prints 0.96 Classification and Regression Trees (CART) and its modern variant Random Forest are among the most powerful algorithms available in machine learning. -CART models relationship between predictor and explanatory variables as a binary tree. Each node of the tree represents a decision that is made based on an outcome of a single attribute. +CART models relationships between predictor and explanatory variables as a binary tree. 
Each node of the tree represents a decision that is made based on an outcome of a single attribute. The leaf nodes of the tree represent an outcome. To make a prediction we take the mean of the training observations belonging to the leaf node for regression and the mode of observations for classification. Given a dataset with just three explanatory variables and a qualitative dependent variable the tree might look like an example below. @@ -583,7 +583,7 @@ Given a dataset with just three explanatory variables and a qualitative dependen
Figure 4. An example of Decision Tree where target is a class.
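To make the node-by-node reading of such a tree concrete, here is a tiny hand-written sketch in Rust. The features, thresholds and class labels below are invented purely for illustration; they are not the tree shown in Figure 4:

```rust
// A fitted classification tree is just a nest of single-attribute tests.
// These features, thresholds and class labels are invented for illustration;
// they are not the tree shown in Figure 4.
fn classify(age: f64, income: f64, owns_home: bool) -> &'static str {
    if age < 30.0 {
        if income < 40_000.0 { "class A" } else { "class B" }
    } else if owns_home {
        "class B"
    } else {
        "class C"
    }
}

fn main() {
    println!("{}", classify(25.0, 55_000.0, false)); // prints "class B"
}
```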
-CART model is simple and useful for interpretation. However, they typically are not competitive with the best supervised learning approaches, like Logistic and Linear Regression, especially when the response can be well approximated by a linear model. Tree-based method is also non-robust which means that a small change in the data can cause a large change in the final estimated tree. That's why it is a common practice to combine prediction from multiple trees in ensemble to estimate predicted values. +CART models are simple and useful for interpretation. However, they are typically not competitive with the best supervised learning approaches, like Logistic and Linear Regression, especially when the response can be well approximated by a linear model. Tree-based methods are also non-robust, which means that a small change in the data can cause a large change in the final estimated tree. That's why it is a common practice to combine predictions from multiple trees in an ensemble to estimate predicted values. In *SmartCore* both, decision and regression trees can be found in the [`tree`]({{site.api_base_url}}/tree/index.html) module. Use [`DecisionTreeClassifier`]({{site.api_base_url}}/tree/decision_tree_classifier/index.html) to fit decision tree and [`DecisionTreeRegressor`]({{site.api_base_url}}/tree/decision_tree_regressor/index.html) for regression. @@ -621,11 +621,11 @@ Here we have used default parameter values but in practice you will almost alway ## Ensemble methods -In ensemble learning we combine predictions from multiple base models to reduce the variance of predictions and decrease generalization error. Base models are assumed to be independent from each other. [Bagging](https://en.wikipedia.org/wiki/Bootstrap_aggregating) is one of the most streightforward ways to reduce correlation between base models in the ensemble. It works by taking repeated samples from the same training data set. As a result we generate _K_ different training data sets (bootstraps) that overlap but are not the same. We then train our base model on the each bootstrapped training set and average predictions for regression or use majority voting scheme for classification. +In ensemble learning we combine predictions from multiple base models to reduce the variance of predictions and decrease generalization error. Base models are assumed to be independent from each other. [Bagging](https://en.wikipedia.org/wiki/Bootstrap_aggregating) is one of the most straightforward ways to reduce correlation between base models in the ensemble. It works by taking repeated samples from the same training data set. As a result we generate _K_ different training data sets (bootstraps) that overlap but are not the same. We then train our base model on each bootstrapped training set and average predictions for regression or use a majority voting scheme for classification. ### Random Forest -Random forest is an extension of bagging that also randomly selects a subset of features when training a tree. This improvement decorrelated the trees and hence decreases prediction error even more. Random forests have proven effective on a wide range of different predictive modeling problems. +Random forest is an extension of bagging that also randomly selects a subset of features when training a tree. This improvement decorrelates the trees and hence decreases prediction error even more. Random forests have proven effective on a wide range of different predictive modeling problems. 
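The bootstrap resampling that bagging (and therefore Random Forest) relies on can be sketched in a few lines of dependency-free Rust. The tiny linear congruential generator below is only a stand-in for a proper random number generator:

```rust
// A dependency-free sketch of bootstrap sampling: draw n row indices *with
// replacement* from an n-row dataset. The tiny linear congruential generator
// below is only a stand-in for a proper random number generator.
fn bootstrap_indices(n: usize, seed: &mut u64) -> Vec<usize> {
    (0..n)
        .map(|_| {
            *seed = seed.wrapping_mul(6364136223846793005).wrapping_add(1442695040888963407);
            (*seed >> 33) as usize % n
        })
        .collect()
}

fn main() {
    let mut seed = 42_u64;
    // three bootstrapped "training sets" over a 10-row dataset;
    // each base tree of the ensemble would be fit on one of them
    for b in 0..3 {
        println!("bootstrap {}: {:?}", b, bootstrap_indices(10, &mut seed));
    }
}
```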
Let's fit [Random Forest regressor]({{site.api_base_url}}/ensemble/random_forest_regressor/index.html) to Boston Housing dataset: @@ -659,7 +659,7 @@ println!("MSE: {}", mean_squared_error(&y_test, &y_hat_rf)); You should get lower mean squared error here when compared to other methods from this manual. This is because by default Random Forest fits 100 independent trees to different bootstrapped training sets and calculates target value by averaging predictions from these trees. -[Random Forest classifier]({{site.api_base_url}}/ensemble/random_forest_classifier/index.html) works in a similar manner. The only difference is that you prediction targets should be nominal or ordinal values (class label). +[Random Forest classifier]({{site.api_base_url}}/ensemble/random_forest_classifier/index.html) works in a similar manner. The only difference is that your prediction targets should be nominal or ordinal values (class label). ## References * ["Nearest Neighbor Pattern Classification" Cover, T.M., IEEE Transactions on Information Theory (1967)](http://ssg.mit.edu/cal/abs/2000_spring/np_dens/classification/cover67.pdf) diff --git a/user_guide/unsupervised.md b/user_guide/unsupervised.md index 30b0ee1..f016a22 100644 --- a/user_guide/unsupervised.md +++ b/user_guide/unsupervised.md @@ -6,7 +6,7 @@ description: Unsupervised learning with Smartcore, including, but not limited to # Unsupervised Learning -In unsupervised learning we do not have labeled dataset. In other words, for every observation \\(i = 1,...,n\\), we observe a vector of measurements \\(x_i\\) but no associated response \\(y_i\\). The goal for unsupervised learning is to model the underlying structure or distribution in the data in order to learn more about the problem at hand. +In unsupervised learning we do not have a labeled dataset. In other words, for every observation \\(i = 1,...,n\\), we observe a vector of measurements \\(x_i\\) but no associated response \\(y_i\\). The goal for unsupervised learning is to model the underlying structure or distribution in the data in order to learn more about the problem at hand. In *SmartCore*, we use the same set of functions to fit unsupervised algorithms to your data as in supervised learning. The only difference is that the method `fit` does not need labels to learn from your data. Similar to supervised learning, optional parameters of method `fit` are hidden behind `Default::default()`. To make predictions use `predict` method that takes new data and predicts estimated class labels. @@ -20,7 +20,7 @@ Clustering can be a helpful tool in your toolbox to learn more about the problem There are many types of clustering algorithms but at this moment *SmartCore* supports only [K-means](https://en.wikipedia.org/wiki/K-means_clustering) and [DBSCAN](https://en.wikipedia.org/wiki/DBSCAN). -To fit K-means to your data use `fit` method from the [`KMeans`]({{site.api_base_url}}/cluster/kmeans/index.html) struct. Method `fit` takes a _NxM_ matrix with your data where _N_ is the number of samples and _M_ is the number of features. Another parameter of this function, _K_, is the number of clusters. If you don't know how many clusters are there in your data use [Elbow Method](https://en.wikipedia.org/wiki/Elbow_method_(clustering)) to estimate it. +To fit K-means to your data use `fit` method from the [`KMeans`]({{site.api_base_url}}/cluster/kmeans/index.html) struct. Method `fit` takes a _NxM_ matrix with your data where _N_ is the number of samples and _M_ is the number of features. 
Another parameter of this function, _K_, is the number of clusters. If you don't know how many clusters are in your data use [Elbow Method](https://en.wikipedia.org/wiki/Elbow_method_(clustering)) to estimate it. ```rust // Load datasets API @@ -53,7 +53,7 @@ println!("V Measure: {}", v_measure_score(&true_labels, &labels)); By default, `KMeans` terminates when it reaches 100 iterations without converging to a stable set of clusters. Pass an instance of [`KMeansParameters`]({{site.api_base_url}}/cluster/kmeans/struct.KMeansParameters.html) instead of `Default::default()` into method `fit` if you want to change value of this parameter. -DBSCAN implementation can be found in the [dbscan]({{site.api_base_url}}/cluster/dbscan/index.html) module. To fit DBSCAN to your dataset: +The DBSCAN implementation can be found in the [dbscan]({{site.api_base_url}}/cluster/dbscan/index.html) module. To fit DBSCAN to your dataset: ```rust // Load datasets API @@ -95,7 +95,7 @@ utils::scatterplot( .unwrap(); ``` -DBSCAN is good for data which contains clusters of similar density. If you visualize results using scatter plot you will see that each concentrical circle is assigned to a separate cluster. +DBSCAN is good for data which contains clusters of similar density. If you visualize results using a scatter plot you will see that each concentric circle is assigned to a separate cluster.
DBSCAN @@ -104,7 +104,7 @@ DBSCAN is good for data which contains clusters of similar density. If you visua ## Dimensionality Reduction -Large number of correlated variables in the feature space can dramatically impact the performance of machine learning algorithms (see [curse of dimensionality](https://en.wikipedia.org/wiki/Curse_of_dimensionality)). Therefore, it is often desirable to reduce the dimensionality of the feature space. +A large number of correlated variables in the feature space can dramatically impact the performance of machine learning algorithms (see [curse of dimensionality](https://en.wikipedia.org/wiki/Curse_of_dimensionality)). Therefore, it is often desirable to reduce the dimensionality of the feature space. Principal component analysis (PCA) is a popular approach to dimensionality reduction from the field of linear algebra. PCA is often called "feature projection" and the algorithms used are referred to as "projection methods". @@ -112,7 +112,7 @@ PCA is an unsupervised approach, since it involves only a set of \\(n\\) feature In PCA, the set of features \\(x_i\\) is re-expressed in terms of a set of an equal number of principal component variables. Whereas the features might be intercorrelated, the principal component variables are not. Each of the principal components found by PCA is a linear combination of the \\(n\\) features. The first principal component has the largest variance, the second component has the second largest variance, and so on. -In *SmartCore*, PCA is declared in [`pca`]({{site.api_base_url}}/decomposition/pca/index.html) module. Here is how you can calculate first two principal components for the [Digits]({{site.api_base_url}}/dataset/digits/index.html) dataset: +In *SmartCore*, PCA is declared in [`pca`]({{site.api_base_url}}/decomposition/pca/index.html) module. Here is how you can calculate the first two principal components for the [Digits]({{site.api_base_url}}/dataset/digits/index.html) dataset: ```rust use smartcore::dataset::*; @@ -136,7 +136,7 @@ let pca = PCA::fit(&x, PCAParameters::default().with_n_components(2)).unwrap(); let x_transformed = pca.transform(&x).unwrap(); ``` -Once you've reduced the set of input features to first two principal components you can visualize your data using scatter plot, similar to Figure 2. +Once you've reduced the set of input features to the first two principal components you can visualize your data using a scatter plot, similar to Figure 2.
PCA @@ -170,7 +170,7 @@ let x_transformed = svd.transform(&x).unwrap(); ## Matrix Factorization -Many complex matrix operations cannot be solved efficiently or with stability using the limited precision of computers. One way to solve this problem is to use matrix decompositions methods (or matrix factorization methods) that reduce a matrix into its constituent parts. +Many complex matrix operations cannot be solved efficiently or with stability using the limited precision of computers. One way to solve this problem is to use matrix decomposition methods (or matrix factorization methods) that reduce a matrix into its constituent parts. Matrix decomposition methods are at the foundation of basic operations such as solving systems of linear equations, calculating the inverse, and calculating the determinant of a matrix.
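As a small, hand-checkable illustration of how a matrix can be reduced into its constituent parts, here is an LU factorization of a 2x2 matrix; the numbers are chosen only for this example. Once \\(A = LU\\) is known, a system \\(Ax = b\\) can be solved with one forward substitution against \\(L\\) followed by one backward substitution against \\(U\\):

\\[ A = \begin{bmatrix} 4 & 3 \\\\ 6 & 3 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\\\ 1.5 & 1 \end{bmatrix} \begin{bmatrix} 4 & 3 \\\\ 0 & -1.5 \end{bmatrix} = LU \\]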