Confidence intervals #947

TiptoC01 · 2025-05-29T11:06:53Z

TiptoC01
May 29, 2025

Background

One of the graduate students we are sponsoring has introduced me to this library and I am intrigued by it. My background is in Experimental Nonlinear Dynamics - though that was a long time ago. I have played with the library on some simple model data and I am impressed by the outcomes. However, I have this nagging question about error-bars on the coefficients that are reported. I have some ideas how to answer it via Monte-Carlo approaches etc, or even from the definition of score. But feel I can't be the only one to be thinking about this. I have done some background reading but feel like I am either just missing something obvious.

Trivial Example code

This is a trivial code example of what I am talking about

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.pyplot import style
from scipy.integrate import solve_ivp
from scipy.optimize import curve_fit
from pysr import PySRRegressor

lina = 2
cons = 5

N = 100


x = np.linspace(0, 1, N)
testy = lina*x + cons + np.random.normal(0, 0.1, N)

testx = x.reshape(-1, 1)




model = PySRRegressor(
    maxsize=20,
    niterations=100,  
    binary_operators=["*","+"], 
    elementwise_loss="loss(prediction, target) = (prediction - target)^2",
    model_selection="best"
)


model.fit(testx, testy)


plt.plot(testx,testy, 'o', label='data')
plt.plot(testx, model.predict(testx), label='prediction $ y='+model.latex() + '$')
plt.xlabel('x')
plt.ylabel('y')
plt.legend()

This gives an almost perfect estimate of the equation I put in, despite the random scatter of the data. Which to be honest I was shocked by and made me dig deeper. Why choose a linear equation when a 100th order polynomial would give a 0 loss and was available ? The answer as you will all know is complexity bias. Brilliant !

But... How do we sensibly discuss the error of the coefficients in this brave new world (well it is new to me :))

Traditional Regression approach

I could of course fit the final equation using traditional regression techniques

def func(x, a, b):
    return a * x + b 
 
popt, pcov = curve_fit(func, x, testy)

print(popt)
print(np.sqrt(np.diag(pcov)))

Which gives me a estimates of $a = 1.99 \pm 0.04$ and $b = 5.00 \pm 0.02$ respectively which is certainly a credible result. But... philosophically I feel like this approach has not quite captured the uncertainty (if that is the right word) in the complexity of the chosen model. What if there is a nearby model with the form $y = ax^2 + bx + c$ with a low loss. If I repeat the traditional analysis with a quadratic equation.

def func2(x, a, b,c):
    return a * x*x + b*x + c 
 
popt, pcov = curve_fit(func2, x, testy)
print(popt)
print(np.sqrt(np.diag(pcov)))

I get
$a = 0.06 \pm 0.13 b = 1.93 \pm 0.14 c = 5.00 \pm 0.03$

the errors in first coefficient that quadratic is too many degrees of freedom for this regression. A similar analysis for y = c gives $c=5.99 \pm 0.06$.

The traditional regression would choose the linear based on various test, like reduced chi-squared.

Finally the Question!

What is the correct way of describing confidence using symbolic regression? I guess I want to be able to say the answer is of a certain form with a given level of confidence.

MilesCranmer · 2025-05-29T11:39:17Z

MilesCranmer
May 29, 2025
Maintainer

There is really no good way to do this. I think ultimately you should just go with something that seems reasonable to you. Even the "score" metric that looks for drops in log-loss-vs-complexity is pretty arbitrary and nowhere near universally-applicable.

A lot of the traditional approaches to getting uncertainty begin to fail when you bring in the symbolic search space and combine it with the space of real numbers. For example, take Bayesian evidence. Bayesian evidence integrates the likelihood of the model against the prior over your coefficients. But... what should the prior over coefficients be? Because, if you take any prior (even a uniform prior!), it will result in nonsensical biases towards particular models. For example, x * [constant] might be preferred over x / [constant], even though they are symbolically identical – purely because an arbitrary prior landed more probability mass on one of these models – giving an arbitrary result.

(By the way, Bayesian evidence is how a lot of traditional measures of complexity are derived, like AIC/BIC https://en.wikipedia.org/wiki/Bayesian_information_criterion. These are basically just Bayesian evidence, Taylor expanded with Gaussian priors – arbitrary! Using # of parameters as a measure of complexity is also totally arbitrary, because you can do things like this: https://arxiv.org/abs/1904.12320.)

So, picking some empirically-backed heuristic is actually the best we can do. There are a bunch of pseudo-mathematical approaches one could try to make an argument for, but in the end, it's all arbitrary. I really think the best thing to do is simply mock up some toy problems for your domain area, generate realistic noise and dataset sizes, and see if your heuristic gives you something reasonable in terms of model selection or confidence interval. I don't think there is any universal heuristic or measure. (I mean, if we want to get philosophical, even mathematical operators themselves are quite arbitrary! The only reason + seems simple to us is because it is familiar (and its familiar because it is widely useful). But it is actually quite an abstract concept.)

You can also try conformal prediction. There's an example here linking backend of PySR (SymbolicRegression.jl) to a conformal prediction package in Julia: https://github.com/JuliaTrustworthyAI/ConformalPrediction.jl (scroll down). I think it's a nice approach. Very empirical.

So, in summary: just pick something that works by experimenting on toy problems close to your domain.

0 replies

pukpr · 2025-05-29T22:27:27Z

pukpr
May 29, 2025

It seems that confidence intervals are being replaced by cross-validation approaches in many machine learning applications -- as it replaces a questionable/debatable numerical value with something that directly tells you that you are going in the right direction.

Consider the idea of confidence intervals in modeling some behavior. At one time a rule-of-thumb was that a normal C.I. estimate could be generated by considering the degrees of freedom. Yet, consider the analysis of tides for tidal prediction, a purely fitted model. A handful of tidal factor periodicities does a fairly decent job of calibrating a time-series, but it's possible to add a couple hundred more tidal factors from the table of tidal cycles. This will indeed make the tidal analysis more accurate but it becomes increasingly difficult to calculate a realistic error estimate, as it could also lead to over-fitting if there is some noise in the data.

How does ChatGPT respond to the prompt per the above paragraph: https://chatgpt.com/share/6838dab1-c474-8010-a942-87912469b0db

In summary it does respond with alternatives such as the aforementioned Bayesian metric, bootstrapped MC approach, regularization techniques such as Lasso, and the cross-validation that I mention.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Confidence intervals #947

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Replies: 2 comments

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

Confidence intervals #947

Uh oh!

Uh oh!

TiptoC01 May 29, 2025

Background

Trivial Example code

Traditional Regression approach

Finally the Question!

Replies: 2 comments

Uh oh!

Uh oh!

MilesCranmer May 29, 2025 Maintainer

Uh oh!

pukpr May 29, 2025

TiptoC01
May 29, 2025

MilesCranmer
May 29, 2025
Maintainer

pukpr
May 29, 2025