Sequential Bayesian Linear Regression

Example Usage:

    # Assumes the class lives in bayesian_linear_regressor.py (see below)
    from bayesian_linear_regressor import BayesLinearRegressor

    bayes_linear_regression = BayesLinearRegressor(n_features)

    # Learn from one (input, output) pair at a time
    for input_vector, output_value in zip(list_of_input_vectors, list_of_output_values):
        bayes_linear_regression.partial_fit(input_vector, output_value)

    prediction = bayes_linear_regression.predict(prediction_input_vector)

Introduction

Linear regressors are the workhorse of data science and machine learning. Nearly every problem can use a linear regression as a baseline, and they can be used in tandem with other algorithms, such as deep learning neural networks and gradient-boosted trees, to add prediction uncertainties and to provide an ideal balance of computational efficiency and interpretability. This is in large part due to the fact that linear regressions are shallow neural networks -- literally, neural nets with only one layer. This simplicity means that their solutions are analytically solvable, making linear regressions ridiculously efficient and easy to work with, and giving us access to exact prediction uncertainties without needing to resort to expensive Bayesian inference techniques. Whether used as a standard candle to confirm other Bayesian engines (such as bootstrapping for advanced models) or as the engine and model itself, these properties make the Bayesian linear regressor a critical element in applications for sequential decision processes like games, recommendation systems, and interface optimization.

However, our efforts to find a reliable pre-made Python package implementing a Bayesian linear regressor came up short. Scikit-learn is the most trustworthy and tested, but it doesn't offer an estimate of the parameter covariance matrix, meaning that we lose the prediction uncertainties that are critical in the bandit setting. Statsmodels offers these uncertainty estimates, but is far slower. Both packages store all the training data in memory. Other repositories on GitHub that copy the key equations from Christopher Bishop's "Pattern Recognition and Machine Learning" require that the noise parameter (the sum of squared residuals) be supplied as a tunable hyperparameter, because that is how Bishop presents it, and this breaks the ability to generalize those packages to new data sets.

Description of Repository

For these reasons, we provide a minimal (<200 lines of code) and complete Python implementation of a Bayesian Linear Regression / Bayesian Linear Regressor, made to mimic the relevant methods in statsmodels and scikit-learn. No parameters need to be set, although the ridge regularization constant can be changed from its default value. All internal parameters are learned in an online/streaming/incremental/sequential fashion, including the noise parameter, i.e., the sum of squared residuals divided by the number of degrees of freedom.

This repository contains two files: bayesian_linear_regressor.py and test_suite.py. They work with the BanditoAPI (https://github.com/KoyoteScience/BanditoAPI) to test different approaches to performing a linear regression with correct uncertainties:

  • Using linear algebra identities (doesn't support online/streaming/incremental/sequential)
  • Using the statsmodels library (as a ground truth, and doesn't support online/streaming/incremental/sequential)
  • Using the Bandito API (supports online/streaming/incremental/sequential)
  • Using the included BayesianLinearRegressor module (supports online/streaming/incremental/sequential)
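
For illustration, the sketch below compares the first two approaches on synthetic data; it is not the repository's test_suite.py, and the data, variable names, and tolerances are assumptions:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    n, d = 500, 3
    X = rng.normal(size=(n, d))
    y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=n)

    # Approach 1: linear algebra identities (batch, closed form, with a tiny ridge term)
    X1 = np.column_stack([np.ones(n), X])           # prepend the intercept column
    ridge = 1e-6 * np.eye(d + 1)
    ridge[0, 0] = 0.0                               # do not regularize the intercept
    beta = np.linalg.solve(X1.T @ X1 + ridge, X1.T @ y)
    residuals = y - X1 @ beta
    sigma2 = residuals @ residuals / (n - d - 1)    # noise estimate from the degrees of freedom
    param_cov = sigma2 * np.linalg.inv(X1.T @ X1 + ridge)

    # Approach 2: statsmodels as ground truth
    ols = sm.OLS(y, X1).fit()
    print(np.allclose(beta, ols.params, atol=1e-4))             # coefficients agree
    print(np.allclose(param_cov, ols.cov_params(), atol=1e-4))  # parameter covariances agree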

At Koyote Science, LLC, we use this code as the basis for larger and more stringent integration tests that ensure the accuracy and efficiency of Bandito.

Update Rules for Bayesian Linear Regression

Given a set of $n$ observations with input values $X_n$ and output values $y_n$, we concatenate these arrays to yield the moment matrix $M_n = [X_n \;\; y_n]^\top [X_n \;\; y_n]$. Thus after $n$ observations, we set $s_n$ as the sum of squared residuals, $\beta_n$ as the model coefficient means, $\Sigma_n$ as the parameter covariance matrix, and $\lambda$ as the ridge regularization constant (also known as the Tikhonov regularization constant; we generally set it to 1e-6, expecting order unity in our input and output value distributions). Note that we do not regularize the intercept term, so that the intercept entry of the regularization matrix $\Lambda = \lambda I$ is set to zero. The update rules are then as follows:

  • $M_{n+1} = M_n + z_{n+1} z_{n+1}^\top$, where $z_{n+1} = [x_{n+1} \;\; y_{n+1}]^\top$ is the new row of concatenated inputs and output, which is equivalent to:
  • $X_{n+1}^\top X_{n+1} = X_n^\top X_n + x_{n+1} x_{n+1}^\top$, $\;X_{n+1}^\top y_{n+1} = X_n^\top y_n + x_{n+1} y_{n+1}$, and $\;y_{n+1}^\top y_{n+1} = y_n^\top y_n + y_{n+1}^2$
  • $s_{n+1} = s_n + (y_{n+1} - x_{n+1}^\top \beta_n)(y_{n+1} - x_{n+1}^\top \beta_{n+1})$
  • $\beta_{n+1} = (X_{n+1}^\top X_{n+1} + \Lambda)^{-1} X_{n+1}^\top y_{n+1}$
  • $\Sigma_{n+1} = \dfrac{s_{n+1}}{\nu_{n+1}} \left(X_{n+1}^\top X_{n+1} + \Lambda\right)^{-1}$

Note that we only need to update the moment matrix and the sum of squared residuals; the rest of the update rules follow downstream. We can either predict using the model parameter means $\beta_n$, or we can sample from the distributions of our model parameters using a multivariate normal with those means and covariance matrix $\Sigma_n$. Here, $\nu_n = n - d$ is the number of degrees of freedom, or the number of updates $n$ minus the number of features $d$.

The update rules for the sum of squared residuals arise from linear algebra identities and the definition $s_n = (y_n - X_n \beta_n)^\top (y_n - X_n \beta_n)$. For more information, see this derivation. Note that this is the most numerically unstable calculation among the update rules, and it often drifts off after the fourth decimal place (assuming all other distributions have order-unity variance). For small residuals, this value can even become negative. However, in that case we can take the sum of squared residuals to be effectively zero, so we take the absolute value when sampling from the multivariate Gaussian to obtain our regression parameters in a probabilistic (bandit) setting. Note that these updates should only be performed once we have as many training samples as we have parameters, due to numerical instability that is exaggerated when the system is underdetermined.
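
As a brief sketch of where this recursion comes from (assuming the ridge term is negligible, so that $A_n = X_n^\top X_n$, $b_n = X_n^\top y_n$, and $\beta_n = A_n^{-1} b_n$; the linked derivation remains the authoritative reference), the definition expands to $s_n = y_n^\top y_n - b_n^\top \beta_n$, and therefore

$$s_{n+1} - s_n = y_{n+1}^2 - \left(b_{n+1}^\top \beta_{n+1} - b_n^\top \beta_n\right) = (y_{n+1} - x_{n+1}^\top \beta_n)\,(y_{n+1} - x_{n+1}^\top \beta_{n+1}),$$

where the last equality follows from $b_{n+1} = b_n + x_{n+1} y_{n+1}$ and $b_n = (A_{n+1} - x_{n+1} x_{n+1}^\top)\,\beta_n$.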

Using these update rules, we can train BayesianLinearRegressor in a streaming/online/incremental/sequential fashion, meaning that at no point do we need to store more than one row of training data in memory, and we achieve the same results as using linear algebra identities or tested code like the scikit-learn and statsmodels modules, with the full covariance matrix of the parameters (which only statsmodels offers, at a substantial computational cost). These properties make the BayesianLinearRegressor ideal for the bandit setting. Storage and per-update computational requirements for BayesianLinearRegressor are $O(d^2)$ and $O(d^3)$, respectively, which is ideal for most cases where the number of features $d$ is reasonable. Note that standard linear regressors that keep all the training data in memory enjoy lower total computational complexity, which is why they don't use sequential methods under the hood.
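
To make the flow of these updates concrete, here is a minimal NumPy sketch of the rules above (this is not the repository's bayesian_linear_regressor.py; the class and attribute names are invented for illustration). It updates only the moment matrix and the sum of squared residuals, solves for the coefficients and covariance downstream, defers SSR updates until the system is no longer underdetermined, and takes the absolute value of the SSR when sampling parameters:

    import numpy as np

    class SequentialBayesSketch:
        """Minimal sketch of the update rules above; not the repository's BayesLinearRegressor."""

        def __init__(self, n_features, ridge=1e-6):
            self.d = n_features + 1                            # +1 for the intercept term
            self.moment = np.zeros((self.d + 1, self.d + 1))   # moment matrix of [1, x, y]
            self.ssr = 0.0                                     # running sum of squared residuals
            self.n = 0                                         # number of observations seen
            self.ridge = ridge * np.eye(self.d)
            self.ridge[0, 0] = 0.0                             # do not regularize the intercept

        def _coefficients(self):
            XtX = self.moment[:self.d, :self.d]
            Xty = self.moment[:self.d, self.d]
            return np.linalg.solve(XtX + self.ridge, Xty)

        def partial_fit(self, x, y):
            z = np.concatenate([[1.0], np.asarray(x, float), [y]])
            # Only update the SSR once the system is no longer underdetermined
            beta_prior = self._coefficients() if self.n >= self.d else None
            self.moment += np.outer(z, z)                      # rank-1 moment-matrix update
            self.n += 1
            if beta_prior is not None:
                beta_post = self._coefficients()
                xi = z[:self.d]
                self.ssr += (y - xi @ beta_prior) * (y - xi @ beta_post)

        def predict(self, x, sample=False):
            # Assumes at least d observations have been seen
            beta = self._coefficients()
            xi = np.concatenate([[1.0], np.asarray(x, float)])
            if sample:                                         # probabilistic (bandit) prediction
                dof = max(self.n - self.d, 1)
                cov = abs(self.ssr) / dof * np.linalg.inv(
                    self.moment[:self.d, :self.d] + self.ridge)
                beta = np.random.default_rng().multivariate_normal(beta, cov)
            return float(xi @ beta)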

"Unlearning" Data

The update rules only have to be slightly tweaked to allow us to "unlearn" data, making it possible to create a trailing-window linear regression, by keeping a log of the trailing input and output data and incrementally unlearning and removing the tail end of the log after a desired expiration period. The changed update rules for unlearning are:

  • $M_{n-1} = M_n - z_j z_j^\top$, where $z_j = [x_j \;\; y_j]^\top$ is the observation being removed, which is equivalent to subtracting the corresponding contributions from $X^\top X$, $X^\top y$, and $y^\top y$
  • $s_{n-1} = s_n - (y_j - x_j^\top \beta_n)(y_j - x_j^\top \beta_{n-1})$
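
For example, a trailing-window regression can be built with a simple log of recent observations (a hedged sketch; `partial_unfit` is a hypothetical method name, not necessarily the one the repository uses):

    from collections import deque

    WINDOW_SIZE = 1000
    window = deque()

    def learn_with_trailing_window(model, input_vector, output_value):
        model.partial_fit(input_vector, output_value)     # learn the new observation
        window.append((input_vector, output_value))       # log it for later expiration
        if len(window) > WINDOW_SIZE:
            old_x, old_y = window.popleft()               # expire the oldest observation
            model.partial_unfit(old_x, old_y)             # "unlearn" it with the rules above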

Weighted Linear Regression / Weighted Least Squares

In the bandit setting, it will also be necessary to implement weights for our training data. This occurs because in the sequential decision process, each action is selected according to the outputs of a model, and therefore each action has a different propensity (or likelihood) of being selected. We need to weight the training data by the inverse of this propensity to get an accurate representation of the true data distribution when it isn't being meddled with by the bandit algorithm.

When including data weights in our training procedure, all we have to do is adjust the moment matrix update rules. We use the weight matrix $W$, a diagonal matrix of size $n \times n$ with the weight $w_i$ for each training sample on the corresponding diagonal entry. This gives us the update rules:

  • $M_{n+1} = M_n + w_{n+1} \, z_{n+1} z_{n+1}^\top$, which is equivalent to the batch moment matrix $M_n = [X_n \;\; y_n]^\top W_n [X_n \;\; y_n]$
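
In the sequential setting this reduces to scaling each rank-1 moment-matrix update by its weight, as in the following hedged sketch (the function and argument names are assumptions for illustration):

    import numpy as np

    def weighted_moment_update(moment, x, y, weight):
        """Sketch of the weighted rank-1 moment-matrix update; `weight` is the
        inverse propensity of the action that produced this training row."""
        z = np.concatenate([[1.0], np.asarray(x, float), [y]])  # [intercept, features, target]
        return moment + weight * np.outer(z, z)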
