Interpreting Functional Form as Novel vs. Reproduced #1010

shehan807 · 2025-08-17T19:25:16Z

I have a domain-specific question, but would like to gain general insight from anyone who might've encountered a similar puzzle. Put simply: how do I know when I’ve sufficiently reproduced a known functional form versus when I’ve truly discovered a new one?

Concretely, I have a sparse dataset (24 samples, each from molecular simulations) of calculated excess enthalpies of mixing, $\Delta H_{mix}$, as a function of the mole fraction $\chi_{A}$ of component A in a mixture of A and B. Several semi-empirical models capture the mole-fraction dependence of $\Delta H_{mix}$, such as the Redlich-Kister polynomial series, van Laar, the two-parameter Margules model, Huron-Vidal, etc. These relations can work fine for weakly polar or non-polar mixtures, but for the novel systems of interest to me, they (presumably) aren't as well established.

Nonetheless, there is a list of well-defined traits of an expressive mixture model, as outlined nicely by Focke et al. For example, we generally want the enthalpy of mixing to follow

$$\Delta H_{mix}(\chi_{A}) = \chi_{A} (1 - \chi_{A}) G(\chi_{A})$$

so that it "reduces to the pure component value when any mole or mass fraction approaches unity". Similarly, I've tried to ensure that my functional search space encompasses existing models while remaining expansive enough to explore new forms as well. I've attached my data and a few of these model fits, plus a functional form generated by symbolic regression (PySR inputs are included at the bottom).
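For orientation, the classical models mentioned earlier can all be written as choices of $G(\chi_{A})$ in this factored form. The sketch below is illustrative only: the function and parameter names are hypothetical, the parameter values are arbitrary placeholders, and the textbook expressions (usually stated for excess Gibbs energy) are assumed to carry over to $\Delta H_{mix}$. Huron-Vidal is omitted.

```python
import numpy as np


def h_mix(x, g):
    """Enthalpy of mixing in the factored form x * (1 - x) * G(x)."""
    x = np.asarray(x, dtype=float)
    return x * (1.0 - x) * g(x)


def g_margules(x, a12, a21):
    # Two-parameter Margules: G is linear in x.
    return a21 * x + a12 * (1.0 - x)


def g_van_laar(x, a, b):
    # van Laar: G is the reciprocal of a linear function of x.
    return a * b / (a * x + b * (1.0 - x))


def g_redlich_kister(x, coeffs):
    # Redlich-Kister: polynomial in (x_A - x_B) = 2x - 1.
    z = 2.0 * x - 1.0
    return sum(c * z**k for k, c in enumerate(coeffs))


# Placeholder parameters, chosen only to show the call pattern.
x = np.linspace(0.0, 1.0, 5)
print(h_mix(x, lambda t: g_margules(t, -90.0, -40.0)))
print(h_mix(x, lambda t: g_van_laar(t, -90.0, -40.0)))
print(h_mix(x, lambda t: g_redlich_kister(t, [-65.0, 25.0])))
```

Because every model shares the $\chi_{A}(1-\chi_{A})$ prefactor, each one vanishes at both pure-component limits regardless of the parameters, so the differences between models live entirely in the shape of $G$.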

Note: I weight my data by inverse uncertainties (not plotted above); the uncertainties grow with increasing mole fraction (up to ~10%).
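A minimal sketch of such a weighting, assuming the per-point uncertainties are available as an array (`sigma` is a hypothetical name) and that weights proportional to $1/\sigma$ are wanted. Rescaling to unit mean is an added convenience here, not something stated in the post.

```python
import numpy as np


def inverse_uncertainty_weights(sigma, floor=1e-12):
    """Return weights proportional to 1 / sigma, rescaled to unit mean."""
    # Guard against zero uncertainties before inverting.
    sigma = np.maximum(np.asarray(sigma, dtype=float), floor)
    w = 1.0 / sigma
    return w / w.mean()


# Hypothetical uncertainties that grow with mole fraction.
sigma = np.array([0.5, 1.0, 2.0, 4.0])
print(inverse_uncertainty_weights(sigma))
```

Whether $1/\sigma$ or $1/\sigma^{2}$ is the better choice depends on the loss being minimized, so this is worth checking against the loss actually used.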

The more challenging task is computing derived properties from $\Delta H_{mix}$. For instance, the partial molar enthalpy, $h$, is defined as

$$h = \Delta H_{mix} - \chi_{A}\frac{\partial\Delta H_{mix}}{\partial\chi_{A}}$$

and I've computed these below:
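A minimal numerical sketch of this calculation, assuming the factored form $\chi_{A}(1-\chi_{A})G(\chi_{A})$ and a central finite difference for the derivative. The helper names and the step size are hypothetical choices, and the constant $G$ in the example is only a stand-in. For a binary mixture, this tangent-intercept expression is conventionally the partial molar quantity of component B.

```python
import numpy as np


def h_mix(x, g):
    """Enthalpy of mixing in the factored form x * (1 - x) * G(x)."""
    return x * (1.0 - x) * g(x)


def partial_molar_h(x, g, eps=1e-6):
    """h = H - x * dH/dx, with dH/dx from a central finite difference."""
    x = np.asarray(x, dtype=float)
    dh_dx = (h_mix(x + eps, g) - h_mix(x - eps, g)) / (2.0 * eps)
    return h_mix(x, g) - x * dh_dx


# Example with a constant G (stand-in value, arbitrary units).
x = np.linspace(0.0, 1.0, 5)
print(partial_molar_h(x, lambda t: -86.088))
```

With a constant $G = A$, the result reduces to $A\chi_{A}^{2}$, which gives a quick sanity check before applying the function to fitted forms.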

What I’m observing is a delicate balance between keeping the search space tractable and still allowing exploration beyond existing models. Interestingly, my PySR output identifies a form that can be rewritten into the van Laar model, while also suggesting a slightly better-performing alternative:

Complexity  Loss       Score      Equation
1           9.119e-09  0.000e+00  g = -86.088
3           8.949e-09  9.426e-03  g = -86.861 + #1
5           8.784e-09  9.313e-03  g = (#1 + -86.861) + #1
6           1.276e-09  1.929e+00  g = (1.6952 - #1) * -55.827 <--- CAN BE REWRITTEN IN van Laar FORM -->
9           1.275e-09  2.668e-04  g = (-55.827 * (1.6952 - #1)) * 1.0005
10          8.500e-10  4.056e-01  g = (0.50133 ^ #1) * -95.293 <--"BEST" equation plotted above-->

Best equation (original scale): g = (0.50133234 ^ #1) * -95.29277
Score: 0.405575, Complexity: 10
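One way to probe whether the complexity-10 form is structurally new is to ask how well a truncated polynomial series in $(\chi_{A} - \chi_{B})$, i.e. a Redlich-Kister-style expansion of $G$, reproduces it over $0 \le \chi_{A} \le 1$. The sketch below assumes that `#1` in the output stands for $\chi_{A}$ and uses the constants exactly as printed; the function names and the grid are hypothetical choices.

```python
import numpy as np
from numpy.polynomial import polynomial as P


# Candidate g(x) forms copied from the PySR output above.
def g_linear(x):
    return (1.6952 - x) * -55.827


def g_exponential(x):
    return (0.50133 ** x) * -95.293


x = np.linspace(0.0, 1.0, 201)
z = 2.0 * x - 1.0  # Redlich-Kister variable x_A - x_B

# Largest gap between the two candidates anywhere on the composition range.
max_gap = float(np.max(np.abs(g_linear(x) - g_exponential(x))))
print(f"max |g_linear - g_exponential| = {max_gap:.3f}")

# Least-squares fit of a truncated series in z to the exponential candidate.
max_errors = {}
for degree in (1, 2, 3):
    coeffs = P.polyfit(z, g_exponential(x), degree)
    fitted = P.polyval(z, coeffs)
    max_errors[degree] = float(np.max(np.abs(fitted - g_exponential(x))))
    print(f"degree {degree}: max abs deviation = {max_errors[degree]:.4f}")
```

If a low-order series tracks the exponential form to within the data uncertainty, the "new" expression may be better read as a compact reparameterization of a known family than as a distinct functional form.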

If symbolic regression reproduces an existing model, that suggests the model is trustworthy. Conversely, if the search space is so restricted that few competitive alternatives can be expressed, then reproducing known models isn’t surprising. I’d greatly appreciate any rules of thumb or holistic insights on how one may interpret these results!

PySR inputs:

My general PySR setup is below:

from pysr import PySRRegressor, TemplateExpressionSpec

PYSR_CONFIG = {
    'niterations': 7500,  
    'populations': 30,    
    'population_size': 200,  
    'maxsize': 30,       
    'binary_operators': ["+", "-", "*", "/", "^"],
    'complexity_of_operators': {
        "+": 1, "-": 1, "*": 2, "/": 4, "^": 5  
    },
    'constraints': {
        "^": (9, 1)  
    },
    'model_selection': "best",
    'parsimony': 0.01,    
    'adaptive_parsimony_scaling': 2000,
    'weight_optimize': 0.001,
    'turbo': True,        
    'loss': "L1DistLoss()",  
    'warmup_maxsize_by': 0.1,  
    'fraction_replaced': 0.2, 
    'random_state': 42
}

template_spec = TemplateExpressionSpec(
    expressions=["g"],
    variable_names=["x0"],
    combine="x0 * (1 - x0) * g(x0)"
)

model = PySRRegressor(
    **PYSR_CONFIG,  
    procs=0,
    deterministic=True,
    parallelism="serial",
    verbosity=2,
    progress=False,
    optimizer_algorithm="BFGS",
    optimizer_iterations=10,
    expression_spec=template_spec
)

...
model.fit(X, y_norm, weights=weights)

gm89uk · 2025-08-18T10:00:15Z

Just some comments unrelated to your question about interpretability: perhaps your parsimony (0.01) is too high relative to your loss?
You could also consider using DynamicDiff to solve directly for the partial molar enthalpy within the template expression structure.
Lastly, the latest version of SymbolicRegression.jl allows you to input guesses, which could represent existing domain knowledge; SR could evolve from there. It will probably come to PySR soon.
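For reference, if the factored form $\Delta H_{mix} = \chi_{A}(1-\chi_{A})G(\chi_{A})$ from the original post is substituted into the definition of $h$, the derivative of $G$ appears explicitly. The following is a derivation sketch of the expression a derivative-aware template would have to evaluate, with $G' = \partial G/\partial\chi_{A}$:

$$h = \chi_{A}(1-\chi_{A})G - \chi_{A}\left[(1-2\chi_{A})G + \chi_{A}(1-\chi_{A})G'\right] = \chi_{A}^{2}\left[G - (1-\chi_{A})G'\right]$$

For a constant $G$ this collapses to $h = G\chi_{A}^{2}$, so any structure beyond a quadratic in $h$ comes from how $G$ varies with composition.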
