regmfit-toolkit

regmfit-toolkit is the repository for regmfit, a Python package for large-scale linear modeling with an emphasis on multi-target workloads, repeatable evaluation, and practical memory control.

The package includes:

SVD-based ridge regression with cross-validated alpha selection
A legacy ridge backend for simpler control flow and faster single-pass workloads
PCR and PLS utilities
Baseline wrappers built on scikit-learn
Split generators for k-fold, bootstrap, chunked, and grouped validation
Preprocessing and scoring helpers for reusable modeling pipelines

What `regmfit` is good at

regmfit is designed for workflows where one or more of the following are true:

many targets are fit against the same feature matrix (tens to thousands of outputs)
cross-validation is repeated many times (large bootstrap counts, nested-CV outer loops)
bootstrap evaluation is part of the model-selection workflow
memory use matters as much as raw solver speed (large feature matrices or target counts)
you want one package that covers the full workflow from splits to scoring to coefficient extraction

The main ridge path is built around a single-SVD-per-fold design with:

duplicate-aware bootstrap handling: collapses repeated bootstrap indices into integer count weights, reducing per-fold SVD cost without changing the solution
controlled fold parallelism: thread-pool limiting prevents BLAS oversubscription when running many folds in parallel
adaptive target batching: the full U_T @ Y projection is materialised when it fits the memory budget; target chunking only engages for genuinely large matrices
streaming fold accumulation: per-fold score surfaces are summed on the fly — no (n_folds, n_alphas, n_targets) cube is ever held in RAM
structured results: a frozen RidgeCVResult dataclass carries scores, weights, intercepts, and normalisation state together
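
For intuition, the single-SVD-per-fold idea and the streaming score sum can be sketched in plain NumPy. This is a minimal illustration, not the package's implementation: the helper names `ridge_path_predictions` and `mean_cv_mse` are hypothetical, centering and scaling are omitted, and MSE stands in for the configurable scoring metric.

```python
import numpy as np

def ridge_path_predictions(X_tr, Y_tr, X_val, alphas):
    """Yield (alpha_index, validation predictions) from one thin SVD of X_tr."""
    U, s, Vt = np.linalg.svd(X_tr, full_matrices=False)
    UtY = U.T @ Y_tr              # (rank, n_targets), shared by every alpha
    XvalV = X_val @ Vt.T          # (n_val, rank), shared by every alpha
    for i, alpha in enumerate(alphas):
        shrink = s / (s**2 + alpha)           # ridge filter per singular value
        yield i, XvalV @ (shrink[:, None] * UtY)

def mean_cv_mse(X, Y, splits, alphas):
    """Sum per-fold (n_alphas, n_targets) surfaces in place, then average."""
    total = np.zeros((len(alphas), Y.shape[1]))
    for train_idx, val_idx in splits:
        for i, pred in ridge_path_predictions(
            X[train_idx], Y[train_idx], X[val_idx], alphas
        ):
            total[i] += np.mean((Y[val_idx] - pred) ** 2, axis=0)
    return total / len(splits)

rng = np.random.default_rng(0)
X = rng.standard_normal((60, 8))
Y = X @ rng.standard_normal((8, 3)) + 0.1 * rng.standard_normal((60, 3))
alphas = np.logspace(-3, 3, 7)

idx = np.arange(60)
splits = [(idx[idx % 3 != k], idx[idx % 3 == k]) for k in range(3)]
surface = mean_cv_mse(X, Y, splits, alphas)
print(surface.shape)  # (7, 3)
```

Only one `(n_alphas, n_targets)` array is alive at any time, which is the point of accumulating on the fly rather than stacking per-fold results.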

Installation

The package itself can be installed directly from the repository root:

pip install -e .

For development:

pip install -e .
pip install pytest

Requires Python >= 3.10 (tested on 3.10 – 3.14). Runtime dependencies: numpy, scipy, scikit-learn, joblib, threadpoolctl.

Environment Setup Notes

For reproducible timing and model-fitting benchmarks, it is often better to create a clean environment just for regmfit rather than reusing a large general-purpose environment.

Recommended clean conda environment

conda create -n mfit -c conda-forge python pip numpy scipy scikit-learn joblib threadpoolctl
conda activate mfit
python -m pip install -e .

# optional utilities for examples / benchmarks
conda install -c conda-forge tqdm natsort pandas openpyxl matplotlib

Optional Intel-optimised environment

If you are running on Intel CPUs, it can also be worth benchmarking a separate environment based on Intel's Python distribution. On some systems this can improve BLAS-backed model-fitting workloads, but the benefit is machine-dependent, so it should be treated as an optional performance path rather than the default install route.

Reference:

https://www.intel.com/content/www/us/en/developer/tools/oneapi/distribution-python-download.html

Example setup:

conda create -n idp -c https://software.repos.intel.com/python/conda -c conda-forge --override-channels intelpython3_full python=3.12
conda activate idp
conda install -c conda-forge pandas natsort openpyxl
python -m pip install -e .

If you care about throughput, benchmark both environments on your own workload. The best-performing stack can vary with BLAS linkage, CPU model, thread settings, and problem size.
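
A quick way to compare two environments is to time the same BLAS-heavy operation in each. The snippet below is a generic sketch using only NumPy and the standard library; `best_of` is a hypothetical helper, and a thin SVD is used as a stand-in for the real workload.

```python
import time
import numpy as np

def best_of(fn, repeats=5):
    """Return the fastest wall-clock time in seconds over several runs."""
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        times.append(time.perf_counter() - start)
    return min(times)

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 64))

elapsed = best_of(lambda: np.linalg.svd(X, full_matrices=False))
print(f"thin SVD of {X.shape}: {elapsed * 1e3:.3f} ms (best of 5)")
```

Run it once per environment with the same thread settings; the scripts in benchmarks/ remain the better measure for real model-fitting workloads.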

Quick Start

import numpy as np
from regmfit import ridge_cv_svd, make_bootstrap_splits

# n_samples × n_features and n_samples × n_targets
X_train = np.random.randn(200, 64)
X_test  = np.random.randn(50,  64)
y_train = np.random.randn(200, 128)
y_test  = np.random.randn(50,  128)

# build 500 bootstrap inner-CV splits
cv = make_bootstrap_splits(X_train.shape[0], n_splits=500, random_state=0)

result = ridge_cv_svd(
    X_train, X_test, y_train, y_test,
    alphas=np.logspace(-3, 5, 40),
    cv=cv,
    scoring_in="pearson",        # metric used to select alpha in inner CV
    scoring_out="pearson",       # metric used to evaluate on the test set
    with_std_x=True,             # optional: enable feature standardization
    with_std_y=False,            # optional: enable target standardization
    n_jobs=-1,                   # parallel folds
    collapse_duplicates=True,    # compress repeated bootstrap indices (default)
    memory_budget_mb=256.0,      # controls when target chunking kicks in (default 128)
    # return_weights=False       # set to skip weight storage (saves memory in null loops)
)

# --- Access the result ---
print(result.test_scores)      # (n_targets,)            test-set score per target
print(result.best_alphas)      # (n_targets,)            selected alpha per target
print(result.coef_scaled)      # (n_features, n_targets) weights in standardised space
print(result.coef_raw)         # (n_features, n_targets) weights in original space
print(result.intercept_raw)    # (n_targets,)            intercepts in original space
print(result.ols_test_scores)  # (n_targets,)            OLS (alpha=0) test-set scores
# coef_scaled, coef_raw, intercept_raw are None when return_weights=False

By default, ridge_cv_svd centers X and y (with_mean_x=True, with_mean_y=True) and leaves variance scaling off (with_std_x=False, with_std_y=False) unless you enable it explicitly.
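
To make those defaults concrete, here is a small NumPy sketch of centering without variance scaling, with statistics estimated on the training set and reused on the test set. `center_scale_fit` is a hypothetical helper written for this illustration; the package's own `fit_center_scale` and `scale_train_test` may differ in signature and detail.

```python
import numpy as np

def center_scale_fit(A, with_mean=True, with_std=False):
    """Return per-column (mean, scale) estimated from A."""
    mean = A.mean(axis=0) if with_mean else np.zeros(A.shape[1])
    if with_std:
        scale = A.std(axis=0)
        scale[scale == 0] = 1.0   # leave constant columns untouched
    else:
        scale = np.ones(A.shape[1])
    return mean, scale

rng = np.random.default_rng(0)
X_train = rng.normal(5.0, 2.0, size=(200, 4))
X_test = rng.normal(5.0, 2.0, size=(50, 4))

# Defaults described above: center, but do not rescale
mean, scale = center_scale_fit(X_train, with_mean=True, with_std=False)
X_train_c = (X_train - mean) / scale
X_test_c = (X_test - mean) / scale    # test data reuses the training statistics

print(X_train_c.mean(axis=0))   # approximately zero in every column
print(X_train_c.std(axis=0))    # unchanged from X_train.std(axis=0)
```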

CV Split Types

| Split | When to use |
| --- | --- |
| `make_bootstrap_splits(n_samples, n_splits)` | Repeated bootstrap inner CV; pairs well with `collapse_duplicates=True` |
| `make_kfold_splits(n_samples, n_splits)` | Standard k-fold inner CV |
| `LeaveOneRunOut(groups)` | Leave-one-group-out by run, session, or block label |
| `ChunkedSplit(n_splits, chunklen, ...)` | Contiguous block splits for temporally or sequentially structured data |

Pass any list of (train_indices, val_indices) tuples to cv — the normalisation step accepts both pre-built lists and sklearn splitter objects.
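
A pre-built split list is just a Python list of index-array pairs. The sketch below builds two such lists by hand with NumPy. Both helpers are hypothetical and are not the package's generators; in particular, using the out-of-bag samples as the validation set is an assumption made for this example.

```python
import numpy as np

def bootstrap_oob_splits(n_samples, n_splits, seed=0):
    """Bootstrap training indices paired with their out-of-bag samples."""
    rng = np.random.default_rng(seed)
    all_idx = np.arange(n_samples)
    splits = []
    while len(splits) < n_splits:
        train_idx = rng.integers(0, n_samples, size=n_samples)  # with replacement
        val_idx = np.setdiff1d(all_idx, train_idx)              # never drawn
        if val_idx.size:             # skip the rare draw with no held-out rows
            splits.append((train_idx, val_idx))
    return splits

def contiguous_block_splits(n_samples, n_blocks):
    """Hold out one contiguous block at a time."""
    blocks = np.array_split(np.arange(n_samples), n_blocks)
    return [
        (np.concatenate(blocks[:k] + blocks[k + 1:]), blocks[k])
        for k in range(n_blocks)
    ]

cv_boot = bootstrap_oob_splits(n_samples=20, n_splits=5)
cv_block = contiguous_block_splits(n_samples=10, n_blocks=5)
print(len(cv_boot), len(cv_block))
print(cv_block[0])
```

Either list has the `(train_indices, val_indices)` shape described above and could be passed as `cv`.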

Scoring Metrics

Supported values for scoring_in and scoring_out:

| Value | Description | Notes |
| --- | --- | --- |
| `"r2"` | Coefficient of determination | |
| `"pearson"` | Pearson correlation | recommended for inner CV selection |
| `"spearman"` | Spearman rank correlation | |
| `"mse"` | Mean squared error | lower is better |
| `"rmse"` | Root mean squared error | lower is better |
| `"explained_variance"` | Explained variance score | |

For inner CV selection (scoring_in), "pearson" and "r2" are the most common choices. "mse" works well when target variances are comparable across targets, because it is not scale-normalised.
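
With many targets, each metric is computed column by column. The following NumPy sketch shows how per-target r2 and Pearson scores can be vectorised; the function names are hypothetical and are not the package's `get_score` API.

```python
import numpy as np

def r2_per_target(y_true, y_pred):
    """Coefficient of determination for each column."""
    ss_res = np.sum((y_true - y_pred) ** 2, axis=0)
    ss_tot = np.sum((y_true - y_true.mean(axis=0)) ** 2, axis=0)
    return 1.0 - ss_res / ss_tot

def pearson_per_target(y_true, y_pred):
    """Pearson correlation for each column."""
    yt = y_true - y_true.mean(axis=0)
    yp = y_pred - y_pred.mean(axis=0)
    denom = np.sqrt(np.sum(yt**2, axis=0) * np.sum(yp**2, axis=0))
    return np.sum(yt * yp, axis=0) / denom

rng = np.random.default_rng(0)
y_true = rng.standard_normal((100, 3))
y_mean = y_true.mean(axis=0)

# Predictions with the right shape but half the amplitude around the mean
y_pred = y_mean + 0.5 * (y_true - y_mean)

print(pearson_per_target(y_true, y_pred))  # 1.0 for every target
print(r2_per_target(y_true, y_pred))       # 0.75 for every target
```

The example also shows why the two metrics can disagree: correlation ignores amplitude errors, while r2 penalises them.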

Recommended Modeling Defaults

For general-purpose regression, the best default pair is:

scoring_in="r2"
scoring_out="r2"

Why this is the safest default:

r2 is scale-normalised, so targets with different raw variances are still comparable during inner-CV model selection
it matches the default behaviour of the package and common sklearn conventions
it keeps the inner and outer metrics aligned, which makes model comparison easier to interpret

Use scoring_in="mse" when:

targets are already on comparable scales, or
you enable with_std_y=True, or
you explicitly want alpha selection to optimise squared-error calibration

Use scoring_out="pearson" when:

you care more about directional agreement than explained variance, or
target amplitudes are not directly comparable across tasks, datasets, or preprocessing choices

In short:

general default: r2 / r2
correlation-focused evaluation: r2 / pearson
error-focused selection on standardised targets: mse / r2 or mse / pearson
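
The scale sensitivity behind these recommendations is easy to check numerically. In this NumPy sketch the same target is stored at two scales: mse differs by a factor of 100, r2 is identical, and dividing by the target variance (the effect of standardising the target) makes the mse values comparable again. The data are synthetic and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
signal = rng.standard_normal(200)
noise = 0.5 * rng.standard_normal(200)

# Same target and same relative error, stored at two different scales
y_true = np.column_stack([signal, 10.0 * signal])
y_pred = np.column_stack([signal + noise, 10.0 * (signal + noise)])

mse = np.mean((y_true - y_pred) ** 2, axis=0)
ss_tot = np.sum((y_true - y_true.mean(axis=0)) ** 2, axis=0)
r2 = 1.0 - np.sum((y_true - y_pred) ** 2, axis=0) / ss_tot

# MSE measured in units of each target's variance
mse_standardised = mse / y_true.var(axis=0)

print(mse[1] / mse[0])       # about 100: raw mse is dominated by the larger target
print(r2)                    # identical for both columns
print(mse_standardised)      # identical for both columns
```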

Alpha Selection Modes

The alpha_mode parameter controls how the best regularisation strength is chosen from the inner-CV score surface:

| Value | Behaviour |
| --- | --- |
| `"per_target"` (default) | Each target independently selects the alpha with the best inner-CV score for that target. Best when targets have different regularisation scales. |
| `"global"` | A single alpha is selected by aggregating across all targets. Useful when targets share the same underlying signal-to-noise structure. |
| `"both"` | Fits with per-target alphas and also records `result.best_alpha_global` for reference. |
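
Given a mean inner-CV score surface of shape `(n_alphas, n_targets)`, the two selection rules reduce to a pair of argmax calls. This is a sketch under two assumptions that the text above does not confirm: the metric is higher-is-better (use argmin for mse or rmse), and the global rule aggregates with a plain mean across targets.

```python
import numpy as np

alphas = np.array([0.1, 1.0, 10.0, 100.0])

# Rows are alphas, columns are targets (made-up scores for illustration)
cv_scores = np.array([
    [0.20, 0.50, 0.10],
    [0.30, 0.45, 0.15],
    [0.25, 0.30, 0.40],
    [0.10, 0.10, 0.35],
])

# Per-target: best row in each column
best_per_target = alphas[np.argmax(cv_scores, axis=0)]

# Global: best row after averaging over targets (assumed aggregation)
best_global = alphas[np.argmax(cv_scores.mean(axis=1))]

print(best_per_target)  # [ 1.   0.1 10. ]
print(best_global)      # 10.0
```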

`RidgeCVResult` Fields

ridge_cv_svd returns a frozen RidgeCVResult dataclass:

| Field | Shape | Description |
| --- | --- | --- |
| `test_scores` | `(n_targets,)` | Test-set score per target |
| `best_alphas` | `(n_targets,)` | Selected alpha per target |
| `coef_scaled` | `(n_features, n_targets)` | Weights in standardised feature/target space |
| `coef_raw` | `(n_features, n_targets)` | Weights applicable to un-scaled features |
| `intercept_raw` | `(n_targets,)` | Intercept in original target space |
| `ols_test_scores` | `(n_targets,)` | OLS (alpha=0) test-set scores |
| `ols_coef_scaled` | `(n_features, n_targets)` | OLS weights in standardised space |
| `ols_coef_raw` | `(n_features, n_targets)` | OLS weights in original space |
| `cv_scores` | `(n_alphas, n_targets)` | Mean inner-CV score surface (populated when `return_cv_scores=True`) |
| `best_alpha_global` | scalar | Single global alpha (populated when `alpha_mode` is `"global"` or `"both"`) |
| `x_state` | `CenterScaleState` | Feature centering/scaling state |
| `y_state` | `CenterScaleState` | Target centering/scaling state |

Weight and intercept fields are None when return_weights=False.
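
The practical consequences of a frozen dataclass with optional fields can be seen with a stripped-down stand-in. `MiniResult` below is hypothetical and carries only three of the fields listed above; it is not the package's `RidgeCVResult`.

```python
from dataclasses import dataclass, FrozenInstanceError
from typing import Optional

import numpy as np

@dataclass(frozen=True)
class MiniResult:
    test_scores: np.ndarray
    best_alphas: np.ndarray
    coef_raw: Optional[np.ndarray] = None   # None when weights are not stored

res = MiniResult(
    test_scores=np.array([0.4, 0.6]),
    best_alphas=np.array([1.0, 10.0]),
)

# Frozen: attributes cannot be rebound after construction
try:
    res.test_scores = np.zeros(2)
except FrozenInstanceError:
    print("fields cannot be reassigned")

# Optional fields should be checked before use
if res.coef_raw is None:
    print("weights were not stored")
```

Note that `frozen=True` only blocks attribute reassignment; the NumPy arrays held by the fields can still be modified in place.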

Main API Surface

Core ridge interfaces:

ridge_cv_svd(X_train, X_test, y_train, y_test, alphas, cv, ...) — functional nested-CV ridge fitting; returns RidgeCVResult
RidgeSVDCV — scikit-learn compatible estimator backed by the same SVD path
ridge_cv_svd_legacy, RidgeSVDCVLegacy — lighter dense-SVD path (see below)

Supporting modules:

| Module | Contents |
| --- | --- |
| `model_selection` | `make_bootstrap_splits`, `make_kfold_splits`, `LeaveOneRunOut`, `ChunkedSplit`, `normalize_cv_splits` |
| `preprocessing` | `fit_center_scale`, `scale_train_test`, `scaled_coef_to_raw`, `CenterScaleState` |
| `metrics` | `get_score`, `score_from_context`, `score_many_from_context`, `SUPPORTED_SCORINGS` |
| `baselines` | `ridge_cv_sklearn`, `ridge_cv_sklearn_rk`, `poisson_cv_sklearn`, `poisson_sklearn`. These are sklearn-backed reference implementations and return tuples, not `RidgeCVResult`; see each function's docstring for field order |
| `pcr` | `pcr_cv`, `pcr_cv_sk`: cross-validated principal components regression over a component-count grid |
| `pls` | `pls_cv`: cross-validated partial least squares over a latent-component grid |
| `stats_utils` | `get_bootstrappval_cols`, `get_permutationpval_cols`, `get_maxTpval_cols`: vectorised bootstrap and permutation p-values with two-sided / greater / less alternatives |
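
As an illustration of what a vectorised, column-wise permutation p-value looks like, here is a generic NumPy sketch. `permutation_pvalues` is a hypothetical function, not the signature of `get_permutationpval_cols`; the add-one correction and the assumption that the null is centred on zero for the two-sided case are choices made for this example.

```python
import numpy as np

def permutation_pvalues(observed, null, alternative="two-sided"):
    """Column-wise p-values.

    observed: (n_cols,) observed statistics
    null:     (n_perm, n_cols) statistics under permutation
    """
    if alternative == "greater":
        exceed = null >= observed
    elif alternative == "less":
        exceed = null <= observed
    elif alternative == "two-sided":
        exceed = np.abs(null) >= np.abs(observed)
    else:
        raise ValueError(f"unknown alternative: {alternative!r}")
    # Add-one correction keeps p-values strictly above zero
    return (exceed.sum(axis=0) + 1) / (null.shape[0] + 1)

observed = np.array([0.25, 0.1])
null = np.array([
    [0.1, -0.5],
    [0.2, 0.3],
    [-0.3, 0.1],
    [0.0, 0.2],
])

print(permutation_pvalues(observed, null, "greater"))    # [0.2 0.8]
print(permutation_pvalues(observed, null, "less"))       # [1.  0.6]
print(permutation_pvalues(observed, null, "two-sided"))  # [0.4 1. ]
```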

Choosing Between `ridge` and `ridge_legacy`

Use the main ridge backend (ridge_cv_svd) when you want:

the default, fully-featured path
better scaling on repeated bootstrap workloads (deduplication, streaming sums)
memory-budget-aware batching for large target counts
the most complete and actively maintained implementation
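
The deduplication mentioned above rests on a simple identity: a ridge fit on bootstrap rows with repeats equals a weighted fit on the unique rows, with the repeat counts as weights. The NumPy sketch below checks this for the uncentered case; it is an illustration only, and a centered fit would additionally need count-weighted means.

```python
import numpy as np

def ridge(Xm, Ym, alpha):
    """Closed-form ridge solution without intercept."""
    n_features = Xm.shape[1]
    return np.linalg.solve(Xm.T @ Xm + alpha * np.eye(n_features), Xm.T @ Ym)

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 6))
y = rng.standard_normal((50, 2))
alpha = 3.0

# Bootstrap sample: 50 draws with replacement, so many rows repeat
boot_idx = rng.integers(0, 50, size=50)
unique_idx, counts = np.unique(boot_idx, return_counts=True)

# Fit on all drawn rows, repeats included
w_full = ridge(X[boot_idx], y[boot_idx], alpha)

# Fit on unique rows, each scaled by sqrt(count)
sw = np.sqrt(counts)[:, None]
w_collapsed = ridge(sw * X[unique_idx], sw * y[unique_idx], alpha)

print(len(boot_idx), "rows collapsed to", len(unique_idx))
print(np.allclose(w_full, w_collapsed))  # True
```

The collapsed design matrix has fewer rows, so any per-fold factorisation of it is cheaper while the solution stays the same.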

Use ridge_legacy (ridge_cv_svd_legacy) when you want:

a simpler dense-SVD path with fewer moving parts
faster single-pass performance on small-to-medium data where the overhead of deduplication and adaptive batching is not needed
a reference implementation for debugging or comparison

import numpy as np
from regmfit.ridge_legacy import ridge_cv_svd_legacy, RidgeSVDCVLegacy

# Reuses X_train, X_test, y_train, y_test, and cv from the Quick Start example
result = ridge_cv_svd_legacy(
    X_train, X_test, y_train, y_test,
    alphas=np.logspace(-3, 4, 32),
    cv=cv,
    scoring_in="pearson",
    scoring_out="pearson",
)
# Returns the same RidgeCVResult dataclass as ridge_cv_svd

Both backends return the same RidgeCVResult structure, so they are drop-in alternatives at the call site.

Model Fitting Details

regmfit is designed so the main solver choice and preprocessing choices stay explicit:

ridge_cv_svd is the default backend for repeated CV, large target counts, or memory-sensitive workloads
ridge_cv_svd_legacy is the simpler dense-SVD reference path and is often useful for debugging, benchmarking, or small-to-medium problems
with_mean_x=True and with_mean_y=True are the default centering choices
with_std_x=False and with_std_y=False are the default scaling choices
alpha_mode="per_target" is the default because different targets often prefer different regularisation strengths
alpha_mode="global" is useful when you want one shared alpha across all outputs for simplicity or comparability
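
For memory-sensitive workloads, the idea of a memory budget can be pictured as a rule that turns a byte limit into a number of target columns per batch. The function below is a hypothetical sketch of such a rule, assuming float64 arrays and a dense `(n_rows, n_targets)` intermediate; the package's actual batching logic may be different.

```python
def target_chunks(n_rows, n_targets, memory_budget_mb=128.0, itemsize=8):
    """Split target columns into slices whose dense block fits the budget."""
    budget_bytes = memory_budget_mb * 1024**2
    max_cols = max(1, int(budget_bytes // (n_rows * itemsize)))
    if max_cols >= n_targets:
        return [slice(0, n_targets)]      # everything fits: no chunking
    return [
        slice(start, min(start + max_cols, n_targets))
        for start in range(0, n_targets, max_cols)
    ]

# Small problem: a single batch covers all targets
print(target_chunks(n_rows=64, n_targets=128))

# Large problem: the targets are processed in several batches
chunks = target_chunks(n_rows=2000, n_targets=100_000)
print(len(chunks), chunks[0], chunks[-1])
```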

The weight-space outputs reflect the chosen preprocessing:

coef_scaled lives in the centered/scaled space used internally by the fit
coef_raw is converted back to the original feature scale and is the version to use for direct prediction on raw inputs
ols_* fields are the unregularised alpha=0 reference fit evaluated with the same scoring choices as the ridge solution
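
The relationship between scaled and raw weights follows from undoing the centering and scaling. The sketch below assumes the internal model is `(y - y_mean) / y_scale = ((x - x_mean) / x_scale) @ coef_scaled`; `scaled_to_raw` is a hypothetical helper and is not guaranteed to match the signature of the package's `scaled_coef_to_raw`.

```python
import numpy as np

def scaled_to_raw(coef_scaled, x_mean, x_scale, y_mean, y_scale):
    """Map weights from centered/scaled space back to the original units."""
    coef_raw = coef_scaled * (y_scale[None, :] / x_scale[:, None])
    intercept_raw = y_mean - x_mean @ coef_raw
    return coef_raw, intercept_raw

rng = np.random.default_rng(0)
n_features, n_targets = 5, 3
X = rng.normal(2.0, 3.0, size=(20, n_features))
coef_scaled = rng.standard_normal((n_features, n_targets))
x_mean, x_scale = X.mean(axis=0), X.std(axis=0)
y_mean = rng.standard_normal(n_targets)
y_scale = rng.uniform(0.5, 2.0, size=n_targets)

coef_raw, intercept_raw = scaled_to_raw(coef_scaled, x_mean, x_scale, y_mean, y_scale)

# Prediction through the scaled space, then mapped back to original units
pred_scaled_path = ((X - x_mean) / x_scale) @ coef_scaled * y_scale + y_mean
# Direct prediction on raw inputs
pred_raw_path = X @ coef_raw + intercept_raw

print(np.allclose(pred_scaled_path, pred_raw_path))  # True
```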

sklearn Estimator Interface

RidgeSVDCV wraps the same SVD backend in a scikit-learn BaseEstimator:

import numpy as np
from regmfit import RidgeSVDCV
from sklearn.model_selection import KFold

# Reuses X_train, y_train, X_test, y_test from the Quick Start example
est = RidgeSVDCV(
    alphas=np.logspace(-3, 4, 32),
    cv=KFold(5),
    scoring_in="pearson",
)
est.fit(X_train, y_train)
scores = est.score(X_test, y_test)   # (n_targets,)
coef   = est.coef_                   # (n_features, n_targets)

Benchmarking

Timing comparisons live in benchmarks/, not in tests/:

tests/ is for correctness, validation, and API behaviour
benchmarks/ is for timing-sensitive scripts whose results depend on hardware, BLAS backend, and thread settings — these are run manually

The repository includes two benchmark scripts:

benchmark_ridge_baselines.py — compact synthetic comparison of ridge_cv_svd, ridge_cv_svd_legacy, and ridge_cv_sklearn
benchmark_ridge_vs_sklearn.py — focused comparison of per-target and global alpha selection against sklearn RidgeCV

Run from the repository root:

python benchmarks/benchmark_ridge_baselines.py
python benchmarks/benchmark_ridge_vs_sklearn.py

# scale up to stress-test on your hardware
python benchmarks/benchmark_ridge_baselines.py --n-targets 256 --n-inner 20
python benchmarks/benchmark_ridge_vs_sklearn.py --scoring-in pearson --n-samples 1000 --n-targets 128

Both scripts accept --n-samples, --n-features, --n-targets, --n-inner, --n-outer, --scoring-in, --scoring-out, --with-std-x, and --with-std-y flags.

Examples

The repository includes runnable examples under examples/:

run_mfit_demo_basics.py: self-contained synthetic or sample-data demo
sampledata/: small bundled arrays used by the demo script

Repository Layout

regmfit/: package source
benchmarks/: timing-oriented comparison scripts
examples/: example scripts and sample data helpers
tests/: automated tests

Development

Run the test suite with:

pytest

The package is fully typed (py.typed) and tested against Python 3.10, 3.11, 3.12, 3.13, and 3.14 via the CI workflow in .github/workflows/tests.yml.
