diff --git a/.gitignore b/.gitignore
index c40841695..0617b67d1 100644
--- a/.gitignore
+++ b/.gitignore
@@ -25,3 +25,4 @@ site
 venv
 requirements-dev.lock
 requirements.lock
+.eggs/
diff --git a/docs/examples.md b/docs/examples.md
index 754875e7c..58f11f35a 100644
--- a/docs/examples.md
+++ b/docs/examples.md
@@ -523,7 +523,55 @@
 Note that this expression has a large dynamic range so may be difficult to find.
 
 Note that you can also search for exclusively dimensionless constants by setting
 `dimensionless_constants_only` to `true`.
 
-## 11. Additional features
+## 11. Sequences
+
+Note that most of the functionality of `PySRSequenceRegressor`
+is inherited from [PySRRegressor](options.md).
+
+### 1. Simple Search
+
+Here's a simple example where we recover the Fibonacci recurrence
+$f(n) = f(n-1) + f(n-2)$.
+
+```python
+import numpy as np
+from pysr import PySRSequenceRegressor
+
+X = np.array([1, 1])
+for i in range(20):
+    X = np.append(X, X[-1] + X[-2])
+X = X.reshape(-1, 1)  # shape (n_times, 1): many time steps, one feature each
+model = PySRSequenceRegressor(
+    recursive_history_length=2,
+    binary_operators=["+", "-", "*", "/"],
+)
+model.fit(X)  # no y needed
+print(model)
+```
+
+### 2. Multidimensionality
+
+Here we find a 2D recurrence relation, i.e., two coupled series:
+
+$f_0(n) = f_0(n-1) + f_1(n-2)$
+
+$f_1(n) = f_1(n-1) - f_0(n-2)$
+
+```python
+X = np.array([[1, 2], [3, 4]])
+for i in range(100):
+    X = np.append(X, [[
+        X[-1][0] + X[-2][1],
+        X[-1][1] - X[-2][0],
+    ]], axis=0)
+
+model = PySRSequenceRegressor(
+    recursive_history_length=2,
+    binary_operators=["+", "-"],
+)
+
+model.fit(X)
+print(model)
+```
+
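+### 3. Prediction
+
+A fitted model can also extrapolate the sequence. As a rough sketch,
+reusing the model fitted in the first example (the exact values depend
+on the equation found by the search):
+
+```python
+# A single prediction, computed from the first history window in X:
+pred = model.predict(X)
+
+# Multiple predictions; once the history windows in X are exhausted,
+# each new prediction is fed back in as history for the next:
+future = model.predict(X, num_predictions=30)
+```
+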
+ "The `recursive_history_length` parameter must be greater than 0 (otherwise it's not recursion)." + ) + if len(X.shape) > 2: + raise ValueError( + "Recursive symbolic regression only supports up to 2D data; please flatten your data first" + ) + if len(X) <= recursive_history_length + 1: + raise ValueError( + f"Recursive symbolic regression with a history length of {recursive_history_length} requires at least {recursive_history_length + 2} datapoints." + ) + if isinstance(weights, np.ndarray) and len(weights) != len(X): + raise ValueError("The length of `weights` must have shape (n_times,).") + if isinstance(variable_names, list) and len(variable_names) != X.shape[1]: + raise ValueError( + "The length of `variable_names` must be equal to the number of features in `X`." + ) + if isinstance(X_units, list) and len(X_units) != X.shape[1]: + raise ValueError( + "The length of `X_units` must be equal to the number of features in `X`." + ) + + +class PySRSequenceRegressor(BaseEstimator): + """ + High performance symbolic regression for recurrent sequences. + Based off of the `PySRRegressor` class, but with a preprocessing step for recurrence relations. + + Parameters + ---------- + recursive_history_length : int + The number of previous time points to use as input features. + For example, if `recursive_history_length=2`, then the input features + will be `[X[0], X[1]]` and the output will be `X[2]`. + This continues on for all X: [X[n-1], X[n-2]] to predict X[n]. + Must be greater than 0. + Other parameters and attributes are inherited from `PySRRegressor`. + """ + + def __init__( + self, + *, + recursive_history_length: int = 0, + **kwargs, + ): + super().__init__() + self._regressor = PySRRegressor(**kwargs) + self.recursive_history_length = recursive_history_length + + def _construct_variable_names( + self, n_features: int, variable_names: Optional[List[str]] + ) -> Tuple[List[str], List[str]]: + if not isinstance(variable_names, list): + if n_features == 1: + variable_names = ["x"] + display_variable_names = ["x"] + else: + variable_names = [f"x{i}" for i in range(n_features)] + display_variable_names = [ + f"x{_subscriptify(i)}" for i in range(n_features) + ] + else: + display_variable_names = variable_names + + # e.g., `x0_tm1` + variable_names_with_time = [ + f"{var}_tm{j}" + for j in range(self.recursive_history_length, 0, -1) + for var in variable_names + ] + # e.g., `x₀[t-1]` + display_variable_names_with_time = [ + f"{var}[t-{j}]" + for j in range(self.recursive_history_length, 0, -1) + for var in display_variable_names + ] + + return variable_names_with_time, display_variable_names_with_time + + def fit( + self, + X, + *, + weights=None, + variable_names: Optional[List[str]] = None, + complexity_of_variables: Optional[ + Union[int, float, List[Union[int, float]]] + ] = None, + X_units: Optional[ArrayLike[str]] = None, + ) -> "PySRSequenceRegressor": + """ + Search for equations to fit the sequence and store them in `self.equations_`. + + Parameters + ---------- + X : ndarray | pandas.DataFrame + Sequence of shape (n_times, n_features) or (n_times,) + weights : ndarray | pandas.DataFrame + Weight array of the same shape as `X`. + Each element is how to weight the mean-square-error loss + for that particular element of `X`. Alternatively, + if a custom `loss` was set, it can be used + in custom ways. + variable_names : list[str] + A list of names for the variables, rather than "x0t_1", "x1t_2", etc. 
+    """
+
+    def __init__(
+        self,
+        *,
+        recursive_history_length: int = 0,
+        **kwargs,
+    ):
+        super().__init__()
+        self._regressor = PySRRegressor(**kwargs)
+        self.recursive_history_length = recursive_history_length
+
+    def _construct_variable_names(
+        self, n_features: int, variable_names: Optional[List[str]]
+    ) -> Tuple[List[str], List[str]]:
+        if not isinstance(variable_names, list):
+            if n_features == 1:
+                variable_names = ["x"]
+                display_variable_names = ["x"]
+            else:
+                variable_names = [f"x{i}" for i in range(n_features)]
+                display_variable_names = [
+                    f"x{_subscriptify(i)}" for i in range(n_features)
+                ]
+        else:
+            display_variable_names = variable_names
+
+        # e.g., `x0_tm1`
+        variable_names_with_time = [
+            f"{var}_tm{j}"
+            for j in range(self.recursive_history_length, 0, -1)
+            for var in variable_names
+        ]
+        # e.g., `x₀[t-1]`
+        display_variable_names_with_time = [
+            f"{var}[t-{j}]"
+            for j in range(self.recursive_history_length, 0, -1)
+            for var in display_variable_names
+        ]
+
+        return variable_names_with_time, display_variable_names_with_time
+
+    def fit(
+        self,
+        X,
+        *,
+        weights=None,
+        variable_names: Optional[List[str]] = None,
+        complexity_of_variables: Optional[
+            Union[int, float, List[Union[int, float]]]
+        ] = None,
+        X_units: Optional[ArrayLike[str]] = None,
+    ) -> "PySRSequenceRegressor":
+        """
+        Search for equations to fit the sequence and store them in `self.equations_`.
+
+        Parameters
+        ----------
+        X : ndarray | pandas.DataFrame
+            Sequence of shape (n_times, n_features) or (n_times,).
+        weights : ndarray | pandas.DataFrame
+            Weight array of the same shape as `X`.
+            Each element is how to weight the mean-square-error loss
+            for that particular element of `X`. Alternatively,
+            if a custom `loss` was set, it can be used
+            in custom ways.
+        variable_names : list[str]
+            A list of base names for the variables, rather than the
+            defaults "x0", "x1", etc. A time-lag suffix (e.g., "_tm1"
+            for a lag of 1) is appended to each name automatically.
+            If `X` is a pandas dataframe, the column names will be used
+            instead of `variable_names`. Cannot contain spaces or special
+            characters. Avoid variable names which are also
+            function names in `sympy`, such as "N".
+            The length of `variable_names` must be equal to n_features.
+        complexity_of_variables : int | float | list[int] | list[float]
+            The complexity of each variable in `X`. If a single value is
+            passed, it will be used for all variables. If a list is
+            passed, its length must equal the number of lagged features,
+            i.e. `recursive_history_length * n_features`.
+        X_units : list[str]
+            A list of units for each variable in `X`. Each unit should be
+            a string representing a Julia expression. See DynamicQuantities.jl
+            https://symbolicml.org/DynamicQuantities.jl/dev/units/ for more
+            information.
+            Length should be equal to n_features.
+
+        Returns
+        -------
+        self : object
+            Fitted estimator.
+        """
+        X = self._validate_data(X, ensure_2d=False)
+        if X.ndim == 1:
+            X = X.reshape(-1, 1)
+        assert X.ndim == 2
+        _check_assertions(
+            X,
+            self.recursive_history_length,
+            weights,
+            variable_names,
+            X_units,
+        )
+        self.variable_names = variable_names  # stored for latex_table()
+        self.n_features = X.shape[1]  # stored for latex_table()
+
+        # Regress each time step on the flattened window of lagged values:
+        current_X = X[self.recursive_history_length :]
+        historical_X = self._sliding_window(X)[: -1 : current_X.shape[1], :]
+        y_units = X_units
+        if isinstance(weights, np.ndarray):
+            weights = weights[self.recursive_history_length :]
+        variable_names, display_variable_names = self._construct_variable_names(
+            current_X.shape[1], variable_names
+        )
+
+        self._regressor.fit(
+            X=historical_X,
+            y=current_X,
+            weights=weights,
+            variable_names=variable_names,
+            display_variable_names=display_variable_names,
+            X_units=X_units,
+            y_units=y_units,
+            complexity_of_variables=complexity_of_variables,
+        )
+        return self
+
+    def predict(self, X, index=None, num_predictions=1):
+        """
+        Predict future data from input X using the equation chosen by `model_selection`.
+
+        You may see what equation is used by printing this object. X should
+        have the same columns as the training data.
+
+        Parameters
+        ----------
+        X : ndarray | pandas.DataFrame
+            Data of shape `(n_times, n_features)`.
+        index : int | list[int]
+            If you want to compute the output of an expression using a
+            particular row of `self.equations_`, you may specify the index here.
+            For multiple output equations, you must pass a list of indices
+            in the same order.
+        num_predictions : int
+            How many predictions to make. If `num_predictions` is less than
+            `(n_times - recursive_history_length + 1)`,
+            some input data at the end will be ignored.
+            Default is `1`.
+
+        Returns
+        -------
+        x_predicted : ndarray of shape (num_predictions, n_features)
+            Values predicted by substituting `X` into the fitted sequence symbolic
+            regression model and rolling it out for `num_predictions` steps.
+
+        Raises
+        ------
+        ValueError
+            Raises if the `best_equation` cannot be evaluated.
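+
+        Examples
+        --------
+        A rough sketch, assuming `model` has already been fit on `X`:
+
+        >>> model.predict(X)  # one value, from the first history window
+        >>> model.predict(X, num_predictions=100)  # rolls the recurrence out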
+ """ + X = self._validate_data(X, ensure_2d=False) + if X.ndim == 1: + X = X.reshape(-1, 1) + assert X.ndim == 2 + _check_assertions(X, recursive_history_length=self.recursive_history_length) + historical_X = self._sliding_window(X)[:: X.shape[1], :] + if num_predictions < 1: + raise ValueError("num_predictions must be greater than 0.") + if num_predictions < len(historical_X): + historical_X = historical_X[:num_predictions] + return self._regressor.predict(X=historical_X, index=index) + else: + extra_predictions = num_predictions - len(historical_X) + pred = self._regressor.predict(X=historical_X, index=index) + for _ in range(extra_predictions): + pred_data = [pred[-self.recursive_history_length :].flatten()] + pred = np.concatenate( + [pred, self._regressor.predict(X=pred_data, index=index)], axis=0 + ) + return pred + + def _sliding_window(self, X): + return np.lib.stride_tricks.sliding_window_view( + X.flatten(), self.recursive_history_length * np.prod(X.shape[1]) + ) + + @classmethod + def from_file( + cls, + *args, + recursive_history_length: int, + **kwargs, + ): + assert recursive_history_length is not None and recursive_history_length > 0 + + model = cls(recursive_history_length=recursive_history_length) + model._regressor = PySRRegressor.from_file(*args, **kwargs) + return model + + def __repr__(self): + return self._regressor.__repr__().replace( + "PySRRegressor", "PySRSequenceRegressor", 1 + ) + + def get_best(self, *args, **kwargs): + return self._regressor.get_best(*args, **kwargs) + + def refresh(self, *args, **kwargs): + return self._regressor.refresh(*args, **kwargs) + + def sympy(self, *args, **kwargs): + return self._regressor.sympy(*args, **kwargs) + + def latex(self, *args, **kwargs): + return self._regressor.latex(*args, **kwargs) + + def get_hof(self): + return self._regressor.get_hof() + + def latex_table( + self, + *args, + **kwargs, + ): + """ + Generates LaTeX variable names, then creates a LaTeX table of the best equation(s). + Refer to `PySRRegressor.latex_table` for information. 
+ """ + if self.variable_names is not None: + if len(self.variable_names) == 1: + variable_names = self.variable_names[0] + "_{tm}" + else: + variable_names = [ + variable_name + "_{tm}" for variable_name in self.variable_names + ] + else: + if self.n_features == 1: + variable_names = "x_{tm}" + else: + variable_names = [f"x_{{{i} tm}}" for i in range(self.n_features)] + return self._regressor.latex_table( + *args, **kwargs, output_variable_names=variable_names + ) + + @property + def equations_(self): + return self._regressor.equations_ diff --git a/pysr/sr.py b/pysr/sr.py index 0054ce502..791955bae 100644 --- a/pysr/sr.py +++ b/pysr/sr.py @@ -138,6 +138,7 @@ def _check_assertions( X, use_custom_variable_names, variable_names, + display_variable_names, complexity_of_variables, weights, y, @@ -153,6 +154,7 @@ def _check_assertions( assert X.shape[0] == weights.shape[0] if use_custom_variable_names: assert len(variable_names) == X.shape[1] + assert len(display_variable_names) == X.shape[1] # Check none of the variable names are function names: for var_name in variable_names: # Check if alphanumeric only: @@ -1361,6 +1363,7 @@ def _validate_and_set_fit_params( Xresampled, weights, variable_names, + display_variable_names, complexity_of_variables, X_units, y_units, @@ -1370,6 +1373,7 @@ def _validate_and_set_fit_params( Optional[ndarray], Optional[ndarray], ArrayLike[str], + Optional[ArrayLike[str]], Union[int, float, List[Union[int, float]]], Optional[ArrayLike[str]], Optional[Union[str, ArrayLike[str]]], @@ -1395,6 +1399,8 @@ def _validate_and_set_fit_params( for that particular element of y. variable_names : ndarray of length n_features Names of each feature in the training dataset, `X`. + display_variable_names : ndarray of length n_features + Custom variable names to display in the progress bar output. complexity_of_variables : int | float | list[int | float] Complexity of each feature in the training dataset, `X`. X_units : list[str] of length n_features @@ -1412,12 +1418,21 @@ def _validate_and_set_fit_params( Validated resampled training data used for denoising. variable_names_validated : list[str] of length n_features Validated list of variable names for each feature in `X`. + display_variable_names_validated : list[str] of length n_features + Validated list of variable names to display in the progress bar output. X_units : list[str] of length n_features Validated units for `X`. y_units : str | list[str] of length n_out Validated units for `y`. """ + if display_variable_names is not None: + assert ( + variable_names is not None + ), "`variable_names` must be provided if `display_variable_names` is provided." + assert len(display_variable_names) == len( + variable_names + ), "`display_variable_names` must be the same length as `variable_names`." 
         if isinstance(X, pd.DataFrame):
             if variable_names:
                 variable_names = None
@@ -1478,9 +1493,14 @@ def _validate_and_set_fit_params(
                 [f"x{_subscriptify(i)}" for i in range(X.shape[1])]
             )
             variable_names = self.feature_names_in_
+            display_variable_names = self.display_feature_names_in_
         else:
-            self.display_feature_names_in_ = self.feature_names_in_
+            if display_variable_names is None:
+                self.display_feature_names_in_ = self.feature_names_in_
+            else:
+                self.display_feature_names_in_ = display_variable_names
             variable_names = self.feature_names_in_
+            display_variable_names = self.display_feature_names_in_
 
         # Handle multioutput data
         if len(y.shape) == 1 or (len(y.shape) == 2 and y.shape[1] == 1):
@@ -1500,6 +1520,7 @@ def _validate_and_set_fit_params(
             Xresampled,
             weights,
             variable_names,
+            display_variable_names,
             complexity_of_variables,
             X_units,
             y_units,
@@ -1519,6 +1540,7 @@ def _pre_transform_training_data(
         y: ndarray,
         Xresampled: Union[ndarray, None],
         variable_names: ArrayLike[str],
+        display_variable_names: ArrayLike[str],
         complexity_of_variables: Union[int, float, List[Union[int, float]]],
         X_units: Union[ArrayLike[str], None],
         y_units: Union[ArrayLike[str], str, None],
@@ -1542,6 +1564,9 @@ def _pre_transform_training_data(
         variable_names : list[str]
             Names of each variable in the training dataset, `X`.
             Of length `n_features`.
+        display_variable_names : list[str]
+            Custom variable names to display in the progress bar output.
+            Of length `n_features`.
         complexity_of_variables : int | float | list[int | float]
             Complexity of each variable in the training dataset, `X`.
         X_units : list[str]
@@ -1569,6 +1594,8 @@ def _pre_transform_training_data(
         variable_names_transformed : list[str] of length n_features
             Names of each variable in the transformed dataset,
             `X_transformed`.
+        display_variable_names_transformed : list[str] of length n_features
+            Custom variable names to display in the progress bar output.
         X_units_transformed : list[str] of length n_features
             Units of each variable in the transformed dataset.
         y_units_transformed : str | list[str] of length n_out
@@ -1593,6 +1620,14 @@ def _pre_transform_training_data(
                     if selection_mask[i]
                 ],
             )
+            display_variable_names = cast(
+                ArrayLike[str],
+                [
+                    display_variable_names[i]
+                    for i in range(len(display_variable_names))
+                    if selection_mask[i]
+                ],
+            )
 
         if isinstance(complexity_of_variables, list):
             complexity_of_variables = [
@@ -1614,7 +1649,7 @@ def _pre_transform_training_data(
             # Update feature names with selected variable names
             self.selection_mask_ = selection_mask
             self.feature_names_in_ = _check_feature_names_in(self, variable_names)
-            self.display_feature_names_in_ = self.feature_names_in_
+            self.display_feature_names_in_ = display_variable_names
             print(f"Using features {self.feature_names_in_}")
 
         # Denoising transformation
@@ -1626,7 +1661,15 @@ def _pre_transform_training_data(
         else:
             X, y = denoise(X, y, Xresampled=Xresampled, random_state=random_state)
 
-        return X, y, variable_names, complexity_of_variables, X_units, y_units
+        return (
+            X,
+            y,
+            variable_names,
+            display_variable_names,
+            complexity_of_variables,
+            X_units,
+            y_units,
+        )
 
     def _run(
         self,
@@ -1934,6 +1977,7 @@ def fit(
         Xresampled=None,
         weights=None,
         variable_names: Optional[ArrayLike[str]] = None,
+        display_variable_names: Optional[ArrayLike[str]] = None,
         complexity_of_variables: Optional[
             Union[int, float, List[Union[int, float]]]
         ] = None,
@@ -1966,6 +2010,11 @@ def fit(
             instead of `variable_names`. Cannot contain spaces or special
             characters.
             Avoid variable names which are also
             function names in `sympy`, such as "N".
+        display_variable_names : list[str]
+            Custom variable names to display in the progress bar output, if
+            different from `variable_names`. For example, if you want to print
+            specific unicode characters which are not allowed in `variable_names`,
+            you can use `display_variable_names` to specify the names.
         X_units : list[str]
             A list of units for each variable in `X`. Each unit should be
             a string representing a Julia expression. See DynamicQuantities.jl
             https://symbolicml.org/DynamicQuantities.jl/dev/units/ for more
             information.
@@ -2011,6 +2060,7 @@ def fit(
             Xresampled,
             weights,
             variable_names,
+            display_variable_names,
             complexity_of_variables,
             X_units,
             y_units,
@@ -2020,6 +2070,7 @@ def fit(
             Xresampled,
             weights,
             variable_names,
+            display_variable_names,
             complexity_of_variables,
             X_units,
             y_units,
@@ -2040,17 +2091,24 @@ def fit(
         seed = cast(int, random_state.randint(0, 2**31 - 1))  # For julia random
 
         # Pre transformations (feature selection and denoising)
-        X, y, variable_names, complexity_of_variables, X_units, y_units = (
-            self._pre_transform_training_data(
-                X,
-                y,
-                Xresampled,
-                variable_names,
-                complexity_of_variables,
-                X_units,
-                y_units,
-                random_state,
-            )
+        (
+            X,
+            y,
+            variable_names,
+            display_variable_names,
+            complexity_of_variables,
+            X_units,
+            y_units,
+        ) = self._pre_transform_training_data(
+            X,
+            y,
+            Xresampled,
+            variable_names,
+            cast(ArrayLike[str], display_variable_names),
+            complexity_of_variables,
+            X_units,
+            y_units,
+            random_state,
         )
 
         # Warn about large feature counts (still warn if feature count is large
@@ -2071,6 +2129,7 @@ def fit(
             X,
             use_custom_variable_names,
             variable_names,
+            display_variable_names,
             complexity_of_variables,
             weights,
             y,
@@ -2493,6 +2552,7 @@ def latex_table(
         indices=None,
         precision=3,
         columns=["equation", "complexity", "loss", "score"],
+        output_variable_names=None,
     ):
         """Create a LaTeX/booktabs table for all, or some, of the equations.
@@ -2525,7 +2585,11 @@ def latex_table(
             assert len(indices) == self.nout_
 
             table_string = sympy2multilatextable(
-                self.equations_, indices=indices, precision=precision, columns=columns
+                self.equations_,
+                indices=indices,
+                precision=precision,
+                columns=columns,
+                output_variable_names=output_variable_names,
             )
         elif isinstance(self.equations_, pd.DataFrame):
             if indices is not None:
                 assert isinstance(indices[0], int)
 
             table_string = sympy2latextable(
-                self.equations_, indices=indices, precision=precision, columns=columns
+                self.equations_,
+                indices=indices,
+                precision=precision,
+                columns=columns,
+                output_variable_name=output_variable_names,
             )
         else:
             raise ValueError(
diff --git a/pysr/test/test.py b/pysr/test/test.py
index c641e9f66..290a932f5 100644
--- a/pysr/test/test.py
+++ b/pysr/test/test.py
@@ -12,7 +12,7 @@
 import sympy  # type: ignore
 from sklearn.utils.estimator_checks import check_estimator
 
-from pysr import PySRRegressor, install, jl, load_all_packages
+from pysr import PySRRegressor, PySRSequenceRegressor, install, jl, load_all_packages
 from pysr.export_latex import sympy2latex
 from pysr.feature_selection import _handle_feature_selection, run_feature_selection
 from pysr.julia_helpers import init_julia
@@ -513,6 +513,344 @@ def test_jl_function_error(self):
         )
 
 
+class TestSequenceRegressor(unittest.TestCase):
+    def setUp(self):
+        # Using inspect,
+        # get default niterations from PySRRegressor, and double them:
+        self.default_test_kwargs = dict(
+            progress=False,
+            model_selection="accuracy",
+            niterations=DEFAULT_NITERATIONS * 2,
+            populations=DEFAULT_POPULATIONS * 2,
+            temp_equation_file=True,
+            recursive_history_length=3,
+        )
+
+    def test_sequence(self):
+        # simple tribonacci sequence
+        X = [1, 1, 1]
+        for i in range(3, 30):
+            X.append(X[i - 1] + X[i - 2] + X[i - 3])
+        X = np.asarray(X).reshape(-1, 1)
+        model = PySRSequenceRegressor(
+            **self.default_test_kwargs,
+            binary_operators=["+"],
+            early_stop_condition="stop_if(loss, complexity) = loss < 1e-4 && complexity == 1",
+        )
+        model.fit(X)
+        print(model.equations_)
+        self.assertLessEqual(model.get_best()["loss"], 1e-4)
+        self.assertIn("x_{tm}", model.latex_table())
+
+    def test_sequence_named(self):
+        X = [1, 1, 1]
+        for i in range(3, 30):
+            X.append(X[i - 1] + X[i - 2] + X[i - 3])
+        X = np.asarray(X).reshape(-1, 1)
+        model = PySRSequenceRegressor(
+            **self.default_test_kwargs,
+            early_stop_condition="stop_if(loss, complexity) = loss < 1e-4 && complexity == 1",
+        )
+        model.fit(X, variable_names=["c1"])
+        self.assertIn("c1_tm1", model.equations_.iloc[-1]["equation"])
+        self.assertIn("c1_{tm}", model.latex_table())
+
+    def test_sequence_custom_variable_complexity(self):
+        for outer in (True, False):
+            for case in (1, 2):
+                X = [1, 1]
+                for i in range(2, 30):
+                    X.append(X[i - 1] + X[i - 2])
+                X = np.asarray(X).reshape(-1, 1)
+                if case == 1:
+                    kwargs = dict(complexity_of_variables=[2, 3, 2])
+                elif case == 2:
+                    kwargs = dict(complexity_of_variables=2)
+
+                if outer:
+                    outer_kwargs = kwargs
+                    inner_kwargs = dict()
+                else:
+                    outer_kwargs = dict()
+                    inner_kwargs = kwargs
+
+                model = PySRSequenceRegressor(
+                    binary_operators=["+"],
+                    verbosity=0,
+                    **self.default_test_kwargs,
+                    early_stop_condition=(
+                        f"stop_if_{case}(l, c) = l < 1e-8 && c <= {3 if case == 1 else 2}"
+                    ),
+                    **outer_kwargs,
+                )
+                model.fit(X, **inner_kwargs)
+                self.assertLessEqual(model.get_best()["loss"], 1e-8)
+
+    def test_sequence_error_message_custom_variable_complexity(self):
+        X = [1, 1]
+        for i in range(2, 100):
+            X.append(X[i - 1] + X[i - 2])
+        X = np.asarray(X).reshape(-1, 1)
+        model = PySRSequenceRegressor(recursive_history_length=3)
+        with self.assertRaises(ValueError) as cm:
+            model.fit(X, complexity_of_variables=[1])
+
+        self.assertIn(
+            "number of elements in `complexity_of_variables`", str(cm.exception)
+        )
+
+    def test_sequence_multidimensional_data_error(self):
+        X = np.zeros((10, 1, 1))
+        model = PySRSequenceRegressor(
+            **self.default_test_kwargs,
+        )
+        with self.assertRaises(ValueError) as cm:
+            model.fit(X)
+        self.assertIn(
+            "Recursive symbolic regression only supports up to 2D data; please flatten your data first",
+            str(cm.exception),
+        )
+
+    def test_sequence_2D_data(self):
+        X = [[1, 2], [2, 3]]
+        for i in range(2, 10):
+            X.append(
+                [
+                    X[i - 1][1] + X[i - 2][0],
+                    X[i - 1][0] - X[i - 2][1],
+                ]
+            )
+        X = np.asarray(X)
+        model = PySRSequenceRegressor(
+            progress=False,
+            model_selection="accuracy",
+            niterations=DEFAULT_NITERATIONS * 2,
+            populations=DEFAULT_POPULATIONS * 2,
+            temp_equation_file=True,
+            recursive_history_length=2,
+        )
+        model.fit(X)
+        self.assertLessEqual(model.get_best()[0]["loss"], 1e-4)
+        self.assertIn("x_{1 tm}", model.latex_table(indices=[[0, 1], [1, 1]]))
+        self.assertListEqual(model.predict(X).tolist(), [[4.0, 0.0]])
+        self.assertListEqual(
+            model.predict(X, num_predictions=9).tolist(),
+            [
+                [4.0, 0.0],
+                [2.0, 1.0],
+                [5.0, 2.0],
+                [4.0, 4.0],
+                [9.0, 2.0],
+                [6.0, 5.0],
+                [14.0, 4.0],
+                [10.0, 9.0],
+                [23.0, 6.0],
+            ],
+        )
+        self.assertListEqual(
+            model.predict(X, num_predictions=14).tolist(),
+            [
+                [4.0, 0.0],
+                [2.0, 1.0],
+                [5.0, 2.0],
+                [4.0, 4.0],
+                [9.0, 2.0],
+                [6.0, 5.0],
+                [14.0, 4.0],
+                [10.0, 9.0],
+                [23.0, 6.0],
+                [16.0, 14.0],
+                [37.0, 10.0],
+                [26.0, 23.0],
+                [60.0, 16.0],
+                [42.0, 37.0],
+            ],
+        )
+
+    def test_sequence_named_2D_data(self):
+        X = [
+            [1, 2, 3],
+            [8, 7, 6],
+            [3, 6, 4],
+        ]
+        for i in range(3, 20):
+            X.append(
+                [
+                    X[i - 1][2] * X[i - 2][1],
+                    X[i - 2][1] - X[i - 3][0],
+                    X[i - 3][2] / X[i - 1][0],
+                ]
+            )
+        X = np.asarray(X)
+        model = PySRSequenceRegressor(
+            **self.default_test_kwargs,
+        )
+        model.fit(X, variable_names=["a", "b", "c"])
+        self.assertLessEqual(model.get_best()[0]["loss"], 1e-4)
+        self.assertIn("a_{tm}", model.latex_table())
+        self.assertIn("b_{tm}", model.latex_table())
+        self.assertIn("c_{tm}", model.latex_table())
+        self.assertIn("a_{tm1}", model.latex()[2])
+
+    def test_sequence_variable_names(self):
+        model = PySRSequenceRegressor(
+            **self.default_test_kwargs,
+        )
+        sequence_variable_names = model._construct_variable_names(
+            3, variable_names=None
+        )
+        self.assertListEqual(
+            list(sequence_variable_names),
+            [
+                [
+                    "x0_tm3",
+                    "x1_tm3",
+                    "x2_tm3",
+                    "x0_tm2",
+                    "x1_tm2",
+                    "x2_tm2",
+                    "x0_tm1",
+                    "x1_tm1",
+                    "x2_tm1",
+                ],
+                [
+                    "x₀[t-3]",
+                    "x₁[t-3]",
+                    "x₂[t-3]",
+                    "x₀[t-2]",
+                    "x₁[t-2]",
+                    "x₂[t-2]",
+                    "x₀[t-1]",
+                    "x₁[t-1]",
+                    "x₂[t-1]",
+                ],
+            ],
+        )
+
+    def test_sequence_custom_variable_names(self):
+        model = PySRSequenceRegressor(
+            **self.default_test_kwargs,
+        )
+        variable_names = ["a", "b", "c"]
+        sequence_variable_names = model._construct_variable_names(3, variable_names)
+        self.assertListEqual(
+            list(sequence_variable_names),
+            [
+                [
+                    "a_tm3",
+                    "b_tm3",
+                    "c_tm3",
+                    "a_tm2",
+                    "b_tm2",
+                    "c_tm2",
+                    "a_tm1",
+                    "b_tm1",
+                    "c_tm1",
+                ],
+                [
+                    "a[t-3]",
+                    "b[t-3]",
+                    "c[t-3]",
+                    "a[t-2]",
+                    "b[t-2]",
+                    "c[t-2]",
+                    "a[t-1]",
+                    "b[t-1]",
+                    "c[t-1]",
+                ],
+            ],
+        )
+
+    def test_sequence_unused_variables(self):
+        X = [1, 1]
+        for i in range(2, 30):
+            X.append(X[i - 1] + X[i - 2])
+        X = np.asarray(X).reshape(-1, 1)
+        y = np.asarray([1] * len(X))
+        model = PySRSequenceRegressor(
+            **self.default_test_kwargs,
+            early_stop_condition="stop_if(loss, complexity) = loss < 1e-4 && complexity == 1",
+        )
+        with self.assertRaises(TypeError):
+            model.fit(X, y, Xresampled=X, y_units=["doesn't matter"])
+
+    def test_sequence_0_recursive_history_length_error(self):
+        model = PySRSequenceRegressor(recursive_history_length=0)
+        with self.assertRaises(ValueError):
+            model.fit([[1, 2, 3]])
+
+    def test_sequence_short_data_error(self):
+        X = [1]
+        model = PySRSequenceRegressor(
+            **self.default_test_kwargs,
+        )
+        with self.assertRaises(ValueError):
+            model.fit(X)
+
+    def test_sequence_bad_weight_length_error(self):
+        X = np.zeros((10, 1))
+        model = PySRSequenceRegressor(
+            **self.default_test_kwargs,
+        )
+        with self.assertRaises(ValueError):
+            model.fit(X, weights=np.zeros(9))
+
+    def test_sequence_weights(self):
+        X = np.ones((100, 1))
+        weights = np.ones((100,))
+        model = PySRSequenceRegressor(
+            recursive_history_length=2,
+            early_stop_condition="stop_if(loss, complexity) = loss < 1e-4 && complexity == 1",
+        )
+        model.fit(X, weights=weights)
+        self.assertLessEqual(model.get_best()["loss"], 1e-4)
+
+    def test_sequence_repr(self):
+        model = PySRSequenceRegressor(
+            **self.default_test_kwargs,
+        )
+        self.assertIn("PySRSequenceRegressor", model.__repr__())
+
+    def test_sequence_from_file(self):
+        X = [1, 1]
+        for i in range(2, 100):
+            X.append(X[i - 1] + X[i - 2])
+        X = np.asarray(X).reshape(-1, 1)
+
+        temp_dir = Path(tempfile.mkdtemp())
+        equation_file = str(temp_dir / "equation_file.csv")
+        model = PySRSequenceRegressor(
+            recursive_history_length=2,
+            equation_file=equation_file,
+            niterations=10,
+        )
+
+        pkl_file = str(temp_dir / "equation_file.pkl")
+        model.fit(X)
+
+        model2 = PySRSequenceRegressor.from_file(pkl_file, recursive_history_length=2)
+        self.assertIn("x_tm1", model2.get_best()["equation"])
+
+        os.remove(pkl_file)
+        model3 = PySRSequenceRegressor.from_file(
+            equation_file,
+            binary_operators=["+"],
+            n_features_in=2,
+            recursive_history_length=2,
+        )
+        self.assertIn("x_tm1", model3.get_best()["equation"])
+
+        model4 = PySRSequenceRegressor.from_file(
+            equation_file,
+            binary_operators=["+"],
+            n_features_in=2,
+            recursive_history_length=2,
+            feature_names_in=["xt_1", "xt_2"],
+            selection_mask=np.ones(2, dtype=np.bool_),
+        )
+        self.assertIn("x_tm1", model4.get_best()["equation"])
+
+
 def manually_create_model(equations, feature_names=None):
     if feature_names is None:
         feature_names = ["x0", "x1"]
@@ -1160,11 +1498,13 @@ def test_unit_checks(self):
         """This just checks the number of units passed"""
         use_custom_variable_names = False
         variable_names = None
+        display_variable_names = None
         complexity_of_variables = 1
         weights = None
         args = (
             use_custom_variable_names,
             variable_names,
+            display_variable_names,
             complexity_of_variables,
             weights,
         )
@@ -1272,6 +1612,7 @@ def runtests(just_tests=False):
     """Run all tests in test.py."""
     test_cases = [
         TestPipeline,
+        TestSequenceRegressor,
         TestBest,
         TestFeatureSelection,
         TestMiscellaneous,
diff --git a/requirements.txt b/requirements.txt
index aa92aaf13..b316b8b9c 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -1,6 +1,6 @@
 sympy>=1.0.0,<2.0.0
 pandas>=0.21.0,<3.0.0
-numpy>=1.13.0,<3.0.0
+numpy>=1.20.0,<3.0.0
 scikit_learn>=1.0.0,<2.0.0
 juliacall==0.9.23
 click>=7.0.0,<9.0.0