regmfit-toolkit is the repository for regmfit, a Python package for
large-scale linear modeling with an emphasis on multi-target workloads,
repeatable evaluation, and practical memory control.
The package includes:
- SVD-based ridge regression with cross-validated alpha selection
- A legacy ridge backend for simpler control flow and faster single-pass workloads
- PCR and PLS utilities
- Baseline wrappers built on scikit-learn
- Split generators for k-fold, bootstrap, chunked, and grouped validation
- Preprocessing and scoring helpers for reusable modeling pipelines
regmfit is designed for workflows where one or more of the following are
true:
- many targets are fit against the same feature matrix (tens to thousands of outputs)
- cross-validation is repeated many times (large bootstrap counts, nested-CV outer loops)
- bootstrap evaluation is part of the model-selection workflow
- memory use matters as much as raw solver speed (large feature matrices or target counts)
- you want one package that covers the full workflow from splits to scoring to coefficient extraction
The main ridge path is built around a single-SVD-per-fold design with:
- duplicate-aware bootstrap handling: collapses repeated bootstrap indices into integer count weights, reducing per-fold SVD cost without changing the solution
- controlled fold parallelism: thread-pool limiting prevents BLAS oversubscription when running many folds in parallel
- adaptive target batching: the full U_T @ Y projection is materialised when it fits the memory budget; target chunking only engages for genuinely large matrices
- streaming fold accumulation: per-fold score surfaces are summed on the fly, so no (n_folds, n_alphas, n_targets) cube is ever held in RAM
- structured results: a frozen RidgeCVResult dataclass carries scores, weights, intercepts, and normalisation state together
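
The claim that collapsing duplicates does not change the solution can be checked with a small NumPy sketch. This is not regmfit's implementation: the `ridge` helper is hypothetical, uses a direct normal-equations solve instead of an SVD, and assumes no centering and no intercept. It compares a ridge fit on a full bootstrap sample with a fit on the unique rows, each weighted by the square root of its count.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, t = 30, 5, 3
X = rng.standard_normal((n, p))
Y = rng.standard_normal((n, t))
alpha = 2.0


def ridge(X, Y, alpha):
    """Hypothetical helper: closed-form ridge without intercept."""
    gram = X.T @ X + alpha * np.eye(X.shape[1])
    return np.linalg.solve(gram, X.T @ Y)


# A bootstrap sample draws n row indices with replacement, so many repeat.
idx = rng.integers(0, n, size=n)

# Reference: fit on the fully materialised bootstrap sample.
W_full = ridge(X[idx], Y[idx], alpha)

# Collapsed: keep each distinct row once and weight it by sqrt(count).
# sum_i c_i * x_i x_i^T equals (sqrt(c) * X)^T (sqrt(c) * X), so the
# normal equations are identical.
uniq, counts = np.unique(idx, return_counts=True)
w = np.sqrt(counts)[:, None]
W_collapsed = ridge(w * X[uniq], w * Y[uniq], alpha)

print(uniq.size, "unique rows out of", n)
print(np.allclose(W_full, W_collapsed))  # agreement up to floating-point error
```

With centering enabled, the means would also have to be count-weighted for the equivalence to hold; the sketch leaves that step out.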
The package itself can be installed directly from the repository root:

pip install -e .

For development:

pip install -e .
pip install pytest

Requires Python >= 3.10 (tested on 3.10 – 3.14). Runtime dependencies: numpy,
scipy, scikit-learn, joblib, threadpoolctl.
For reproducible timing and model-fitting benchmarks, it is often better to
create a clean environment just for regmfit rather than reusing a large
general-purpose environment.
conda create -n mfit -c conda-forge python pip numpy scipy scikit-learn joblib threadpoolctl
conda activate mfit
python -m pip install -e .
# optional utilities for examples / benchmarks
conda install -c conda-forge tqdm natsort pandas openpyxl matplotlib

If you are running on Intel CPUs, it can also be worth benchmarking a separate environment based on Intel's Python distribution. On some systems this can improve BLAS-backed model-fitting workloads, but the benefit is machine-dependent, so treat it as an optional performance path rather than the default install route.
Reference:
Example setup:
conda create -n idp -c https://software.repos.intel.com/python/conda -c conda-forge --override-channels intelpython3_full python=3.12
conda activate idp
conda install -c conda-forge pandas natsort openpyxl
python -m pip install -e .

If you care about throughput, benchmark both environments on your own workload. The best-performing stack can vary with BLAS linkage, CPU model, thread settings, and problem size.
import numpy as np
from regmfit import ridge_cv_svd, make_bootstrap_splits
# n_samples × n_features and n_samples × n_targets
X_train = np.random.randn(200, 64)
X_test = np.random.randn(50, 64)
y_train = np.random.randn(200, 128)
y_test = np.random.randn(50, 128)
# build 500 bootstrap inner-CV splits
cv = make_bootstrap_splits(X_train.shape[0], n_splits=500, random_state=0)
result = ridge_cv_svd(
    X_train, X_test, y_train, y_test,
    alphas=np.logspace(-3, 5, 40),
    cv=cv,
    scoring_in="pearson",       # metric used to select alpha in inner CV
    scoring_out="pearson",      # metric used to evaluate on the test set
    with_std_x=True,            # optional: enable feature standardization
    with_std_y=False,           # optional: enable target standardization
    n_jobs=-1,                  # parallel folds
    collapse_duplicates=True,   # compress repeated bootstrap indices (default)
    memory_budget_mb=256.0,     # controls when target chunking kicks in (default 128)
    # return_weights=False      # set to skip weight storage (saves memory in null loops)
)
# --- Access the result ---
print(result.test_scores) # (n_targets,) test-set score per target
print(result.best_alphas) # (n_targets,) selected alpha per target
print(result.coef_scaled) # (n_features, n_targets) weights in standardised space
print(result.coef_raw) # (n_features, n_targets) weights in original space
print(result.intercept_raw) # (n_targets,) intercepts in original space
print(result.ols_test_scores) # (n_targets,) OLS (alpha=0) test-set scores
# coef_scaled, coef_raw, intercept_raw are None when return_weights=False

By default, ridge_cv_svd centers X and y (with_mean_x=True,
with_mean_y=True) and leaves variance scaling off
(with_std_x=False, with_std_y=False) unless you enable it explicitly.
| Split | When to use |
|---|---|
| make_bootstrap_splits(n_samples, n_splits) | Repeated bootstrap inner CV; pairs well with collapse_duplicates=True |
| make_kfold_splits(n_samples, n_splits) | Standard k-fold inner CV |
| LeaveOneRunOut(groups) | Leave-one-group-out by run, session, or block label |
| ChunkedSplit(n_splits, chunklen, ...) | Contiguous block splits for temporally or sequentially structured data |
Pass any list of (train_indices, val_indices) tuples to cv — the
normalisation step accepts both pre-built lists and sklearn splitter objects.
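
As an illustration of the pre-built list format, the sketch below builds (train_indices, val_indices) tuples from contiguous blocks using only NumPy. The helper name `contiguous_block_splits` is hypothetical and is not part of regmfit; it only shows the shape of data that a cv argument of this kind is described as accepting.

```python
import numpy as np


def contiguous_block_splits(n_samples, n_blocks):
    """Hypothetical helper: one (train, val) tuple per contiguous block."""
    indices = np.arange(n_samples)
    splits = []
    for block in np.array_split(indices, n_blocks):
        # Everything outside the held-out block is used for training.
        train = np.setdiff1d(indices, block)
        splits.append((train, block))
    return splits


splits = contiguous_block_splits(n_samples=10, n_blocks=3)
for train_idx, val_idx in splits:
    print("train:", train_idx, "val:", val_idx)
```

Each sample lands in exactly one validation block, and a list like `splits` could then be passed as cv in place of one of the built-in generators.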
Supported values for scoring_in and scoring_out:
| Value | Description | Notes |
|---|---|---|
| "r2" | Coefficient of determination | |
| "pearson" | Pearson correlation | recommended for inner CV selection |
| "spearman" | Spearman rank correlation | |
| "mse" | Mean squared error | lower is better |
| "rmse" | Root mean squared error | lower is better |
| "explained_variance" | Explained variance score | |
For inner CV selection (scoring_in), "pearson" and "r2" are the most
common choices. "mse" works well when target variances are comparable across
targets.
For general-purpose regression, the best default pair is:
- scoring_in="r2"
- scoring_out="r2"
Why this is the safest default:
- r2 is scale-normalised, so targets with different raw variances are still comparable during inner-CV model selection
- it matches the default behaviour of the package and common sklearn conventions
- it keeps the inner and outer metrics aligned, which makes model comparison easier to interpret
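
The scale-normalisation point can be made concrete with a short NumPy sketch. The two scoring helpers below are hypothetical stand-ins written for this example, not the functions in regmfit's metrics module. Rescaling one target by a factor of 10 leaves its R² unchanged but multiplies its MSE by 100.

```python
import numpy as np


def r2_per_target(y_true, y_pred):
    """Hypothetical helper: column-wise coefficient of determination."""
    ss_res = ((y_true - y_pred) ** 2).sum(axis=0)
    ss_tot = ((y_true - y_true.mean(axis=0)) ** 2).sum(axis=0)
    return 1.0 - ss_res / ss_tot


def mse_per_target(y_true, y_pred):
    """Hypothetical helper: column-wise mean squared error."""
    return ((y_true - y_pred) ** 2).mean(axis=0)


rng = np.random.default_rng(0)
y = rng.standard_normal((100, 2))
pred = y + 0.5 * rng.standard_normal((100, 2))

# Rescale the second target (and its prediction) by a factor of 10.
scale = np.array([1.0, 10.0])
y_scaled, pred_scaled = y * scale, pred * scale

r2_before, r2_after = r2_per_target(y, pred), r2_per_target(y_scaled, pred_scaled)
mse_before, mse_after = mse_per_target(y, pred), mse_per_target(y_scaled, pred_scaled)

print("r2 :", r2_before, r2_after)    # unchanged by the rescaling
print("mse:", mse_before, mse_after)  # second column grows by 10**2
```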
Use scoring_in="mse" when:
- targets are already on comparable scales, or
- you enable with_std_y=True, or
- you explicitly want alpha selection to optimise squared-error calibration
Use scoring_out="pearson" when:
- you care more about directional agreement than explained variance, or
- target amplitudes are not directly comparable across tasks, datasets, or preprocessing choices
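
To show what "directional agreement" means in practice, the sketch below shrinks a prediction to a tenth of its amplitude. Both helpers are hypothetical and written only for this example. Pearson correlation is unaffected by the shrinkage, whereas R² drops because it also penalises the amplitude mismatch.

```python
import numpy as np


def pearson_per_target(y_true, y_pred):
    """Hypothetical helper: column-wise Pearson correlation."""
    yt = y_true - y_true.mean(axis=0)
    yp = y_pred - y_pred.mean(axis=0)
    denom = np.sqrt((yt ** 2).sum(axis=0) * (yp ** 2).sum(axis=0))
    return (yt * yp).sum(axis=0) / denom


def r2_per_target(y_true, y_pred):
    """Hypothetical helper: column-wise coefficient of determination."""
    ss_res = ((y_true - y_pred) ** 2).sum(axis=0)
    ss_tot = ((y_true - y_true.mean(axis=0)) ** 2).sum(axis=0)
    return 1.0 - ss_res / ss_tot


rng = np.random.default_rng(1)
y = rng.standard_normal((200, 3))
pred = y + 0.3 * rng.standard_normal((200, 3))
pred_small = 0.1 * pred  # same shape over samples, much smaller amplitude

print("pearson:", pearson_per_target(y, pred), pearson_per_target(y, pred_small))
print("r2     :", r2_per_target(y, pred), r2_per_target(y, pred_small))
```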
In short:
- general default: r2 / r2
- correlation-focused evaluation: r2 / pearson
- error-focused selection on standardised targets: mse / r2 or mse / pearson
The alpha_mode parameter controls how the best regularisation strength is
chosen from the inner-CV score surface:
| Value | Behaviour |
|---|---|
| "per_target" (default) | Each target independently selects the alpha that maximised its inner-CV score. Best when targets have different regularisation scales. |
| "global" | A single alpha is selected by aggregating across all targets. Useful when targets share the same underlying signal-to-noise structure. |
| "both" | Fits with per-target alphas and also records result.best_alpha_global for reference. |
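
The difference between the two selection modes can be sketched on a toy score surface of shape (n_alphas, n_targets). This is an illustration only: it assumes a higher-is-better metric and assumes that the global mode aggregates by taking the mean across targets, which may not match how regmfit aggregates.

```python
import numpy as np

alphas = np.array([0.1, 1.0, 10.0])

# Toy mean inner-CV score surface: rows are alphas, columns are targets.
cv_scores = np.array([
    [0.2, 0.5],
    [0.6, 0.4],
    [0.3, 0.1],
])

# Per-target: each column picks the alpha with its own best score.
best_alphas = alphas[np.argmax(cv_scores, axis=0)]

# Global (assumed aggregation: mean over targets), then one argmax.
best_alpha_global = alphas[np.argmax(cv_scores.mean(axis=1))]

print(best_alphas)        # [1.  0.1]
print(best_alpha_global)  # 1.0
```

For a lower-is-better metric such as mse, the same logic would use argmin instead.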
ridge_cv_svd returns a frozen RidgeCVResult dataclass:
| Field | Shape | Description |
|---|---|---|
| test_scores | (n_targets,) | Test-set score per target |
| best_alphas | (n_targets,) | Selected alpha per target |
| coef_scaled | (n_features, n_targets) | Weights in standardised feature/target space |
| coef_raw | (n_features, n_targets) | Weights applicable to un-scaled features |
| intercept_raw | (n_targets,) | Intercept in original target space |
| ols_test_scores | (n_targets,) | OLS (alpha=0) test-set scores |
| ols_coef_scaled | (n_features, n_targets) | OLS weights in standardised space |
| ols_coef_raw | (n_features, n_targets) | OLS weights in original space |
| cv_scores | (n_alphas, n_targets) | Mean inner-CV score surface (return_cv_scores=True) |
| best_alpha_global | scalar | Single global alpha (alpha_mode includes "global") |
| x_state | CenterScaleState | Feature centering/scaling state |
| y_state | CenterScaleState | Target centering/scaling state |
Weight and intercept fields are None when return_weights=False.
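
For readers unfamiliar with frozen dataclasses, the sketch below shows the general behaviour with a hypothetical `MiniResult` class. It is not the real RidgeCVResult and carries only three of the fields listed above; it is meant to show what "frozen" implies when consuming such a result.

```python
from dataclasses import FrozenInstanceError, dataclass
from typing import Optional

import numpy as np


@dataclass(frozen=True)
class MiniResult:
    """Hypothetical stand-in with a subset of the documented fields."""
    test_scores: np.ndarray
    best_alphas: np.ndarray
    coef_raw: Optional[np.ndarray] = None  # None mimics return_weights=False


res = MiniResult(
    test_scores=np.array([0.4, 0.7]),
    best_alphas=np.array([1.0, 10.0]),
)

# Rebinding a field on a frozen dataclass raises FrozenInstanceError.
try:
    res.best_alphas = np.array([0.1, 0.1])
    reassigned = True
except FrozenInstanceError:
    reassigned = False

# Reading fields works as usual, and optional fields should be checked.
best_target = int(np.argmax(res.test_scores))
has_weights = res.coef_raw is not None
print(reassigned, best_target, has_weights)
```

Note that freezing a dataclass blocks attribute rebinding only; the NumPy arrays it holds can still be modified in place, so copy them before editing.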
Core ridge interfaces:
- ridge_cv_svd(X_train, X_test, y_train, y_test, alphas, cv, ...): functional nested-CV ridge fitting; returns RidgeCVResult
- RidgeSVDCV: scikit-learn compatible estimator backed by the same SVD path
- ridge_cv_svd_legacy, RidgeSVDCVLegacy: lighter dense-SVD path (see below)
Supporting modules:
| Module | Contents |
|---|---|
| model_selection | make_bootstrap_splits, make_kfold_splits, LeaveOneRunOut, ChunkedSplit, normalize_cv_splits |
| preprocessing | fit_center_scale, scale_train_test, scaled_coef_to_raw, CenterScaleState |
| metrics | get_score, score_from_context, score_many_from_context, SUPPORTED_SCORINGS |
| baselines | ridge_cv_sklearn, ridge_cv_sklearn_rk, poisson_cv_sklearn, poisson_sklearn: sklearn-backed reference implementations that return tuples, not RidgeCVResult; see each function's docstring for field order |
| pcr | pcr_cv, pcr_cv_sk: cross-validated principal components regression over a component-count grid |
| pls | pls_cv: cross-validated partial least squares over a latent-component grid |
| stats_utils | get_bootstrappval_cols, get_permutationpval_cols, get_maxTpval_cols: vectorised bootstrap and permutation p-values with two-sided / greater / less alternatives |
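
As a rough picture of what a vectorised permutation p-value with the three alternatives looks like, here is a minimal NumPy sketch. The function `permutation_pvalues` is hypothetical; its name, signature, and the add-one correction are assumptions made for this example and may differ from the stats_utils functions.

```python
import numpy as np


def permutation_pvalues(observed, null, alternative="two-sided"):
    """Hypothetical helper: column-wise permutation p-values.

    observed: (n_targets,) observed statistics
    null:     (n_permutations, n_targets) statistics under the null
    """
    observed = np.asarray(observed, dtype=float)
    null = np.asarray(null, dtype=float)
    if alternative == "two-sided":
        exceed = np.abs(null) >= np.abs(observed)
    elif alternative == "greater":
        exceed = null >= observed
    elif alternative == "less":
        exceed = null <= observed
    else:
        raise ValueError(f"unknown alternative: {alternative!r}")
    # Add-one correction so that a p-value is never exactly zero.
    return (1.0 + exceed.sum(axis=0)) / (1.0 + null.shape[0])


null = np.array([
    [0.1, -0.5],
    [-0.3, 0.2],
    [0.2, 0.1],
    [0.0, -0.1],
])
observed = np.array([0.25, 0.05])

p_two_sided = permutation_pvalues(observed, null, "two-sided")
p_greater = permutation_pvalues(observed, null, "greater")
p_less = permutation_pvalues(observed, null, "less")
print(p_two_sided, p_greater, p_less)
```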
Use the main ridge backend (ridge_cv_svd) when you want:
- the default, fully-featured path
- better scaling on repeated bootstrap workloads (deduplication, streaming sums)
- memory-budget-aware batching for large target counts
- the most complete and actively maintained implementation
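
The "streaming sums" idea mentioned in the list above can be sketched as a running accumulator. This is an illustration rather than regmfit's code: `fold_scores` is a hypothetical stand-in that returns a random (n_alphas, n_targets) score surface per fold. The full cube is built here only to verify the result.

```python
import numpy as np

n_folds, n_alphas, n_targets = 50, 8, 16


def fold_scores(fold):
    """Hypothetical stand-in for scoring one fold over the alpha grid."""
    return np.random.default_rng(fold).standard_normal((n_alphas, n_targets))


# Streaming: memory stays at one (n_alphas, n_targets) buffer.
running_sum = np.zeros((n_alphas, n_targets))
for fold in range(n_folds):
    running_sum += fold_scores(fold)
mean_surface = running_sum / n_folds

# Reference: stacking all folds needs n_folds times more memory.
cube = np.stack([fold_scores(fold) for fold in range(n_folds)])
print(cube.shape, mean_surface.shape)
print(np.allclose(mean_surface, cube.mean(axis=0)))
```

The trade-off is that per-fold surfaces are no longer available afterwards, so statistics other than the mean (for example a per-fold variance) would need their own accumulators.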
Use ridge_legacy (ridge_cv_svd_legacy) when you want:
- a simpler dense-SVD path with fewer moving parts
- faster single-pass performance on small-to-medium data where the overhead of deduplication and adaptive batching is not needed
- a reference implementation for debugging or comparison
from regmfit.ridge_legacy import ridge_cv_svd_legacy, RidgeSVDCVLegacy
result = ridge_cv_svd_legacy(
    X_train, X_test, y_train, y_test,
    alphas=np.logspace(-3, 4, 32),
    cv=cv,
    scoring_in="pearson",
    scoring_out="pearson",
)
# Returns the same RidgeCVResult dataclass as ridge_cv_svd

Both backends return the same RidgeCVResult structure, so they are drop-in
alternatives at the call site.
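
For context on what a dense-SVD ridge path does, the sketch below computes ridge weights for a whole alpha grid from a single SVD and checks them against a direct solve. It is a textbook formulation, not the code of either backend, and it assumes X and Y are already centered.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((40, 6))
Y = rng.standard_normal((40, 4))
alphas = np.logspace(-2, 2, 5)

# One thin SVD of X is shared by every alpha and every target.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
UtY = U.T @ Y  # (k, n_targets)

# Ridge shrinks each singular direction by s / (s**2 + alpha).
shrink = s / (s ** 2 + alphas[:, None])  # (n_alphas, k)

# W[a] = V @ diag(shrink[a]) @ U.T @ Y, for all alphas at once.
W = np.einsum("fk,ak,kt->aft", Vt.T, shrink, UtY)  # (n_alphas, n_features, n_targets)

# Reference: solve the regularised normal equations for each alpha.
W_direct = np.stack([
    np.linalg.solve(X.T @ X + a * np.eye(X.shape[1]), X.T @ Y)
    for a in alphas
])
print(W.shape, np.allclose(W, W_direct))
```

The expensive factorisation happens once, and each additional alpha costs only a rescaling and a matrix product.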
regmfit is designed so the main solver choice and preprocessing choices stay
explicit:
- ridge_cv_svd is the default backend for repeated CV, large target counts, or memory-sensitive workloads
- ridge_cv_svd_legacy is the simpler dense-SVD reference path and is often useful for debugging, benchmarking, or small-to-medium problems
- with_mean_x=True and with_mean_y=True are the default centering choices
- with_std_x=False and with_std_y=False are the default scaling choices
- alpha_mode="per_target" is the default because different targets often prefer different regularisation strengths
- alpha_mode="global" is useful when you want one shared alpha across all outputs for simplicity or comparability
The weight-space outputs reflect the chosen preprocessing:
- coef_scaled lives in the centered/scaled space used internally by the fit
- coef_raw is converted back to the original feature scale and is the version to use for direct prediction on raw inputs
- ols_* fields are the unregularised alpha=0 reference fit evaluated with the same scoring choices as the ridge solution
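
The relationship between the scaled and raw weight spaces can be sketched with plain NumPy. This is a generic derivation, not the package's scaled_coef_to_raw function: it assumes both X and Y were centered and divided by their standard deviations before fitting, and it uses a direct ridge solve for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((60, 4)) * np.array([1.0, 5.0, 0.2, 3.0]) + 2.0
Y = rng.standard_normal((60, 2)) * np.array([10.0, 0.5]) - 1.0

# Centering/scaling state (assumed here: mean and std for both X and Y).
x_mean, x_std = X.mean(axis=0), X.std(axis=0)
y_mean, y_std = Y.mean(axis=0), Y.std(axis=0)
Xs = (X - x_mean) / x_std
Ys = (Y - y_mean) / y_std

# Fit in the standardised space.
alpha = 1.0
coef_scaled = np.linalg.solve(Xs.T @ Xs + alpha * np.eye(X.shape[1]), Xs.T @ Ys)

# Map back: undo the feature scaling per row and the target scaling per column.
coef_raw = coef_scaled * y_std[None, :] / x_std[:, None]
intercept_raw = y_mean - x_mean @ coef_raw

# Both routes give the same predictions on raw inputs.
pred_via_scaled = (Xs @ coef_scaled) * y_std + y_mean
pred_via_raw = X @ coef_raw + intercept_raw
print(np.allclose(pred_via_scaled, pred_via_raw))
```

With variance scaling off, the standard deviations are effectively 1, so coef_raw equals coef_scaled and only the intercept differs from zero.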
RidgeSVDCV wraps the same SVD backend in a scikit-learn BaseEstimator:
from regmfit import RidgeSVDCV
from sklearn.model_selection import KFold
est = RidgeSVDCV(
    alphas=np.logspace(-3, 4, 32),
    cv=KFold(5),
    scoring_in="pearson",
)
est.fit(X_train, y_train)
scores = est.score(X_test, y_test) # (n_targets,)
coef = est.coef_ # (n_features, n_targets)

Timing comparisons live in benchmarks/, not in tests/:
- tests/ is for correctness, validation, and API behaviour
- benchmarks/ is for timing-sensitive scripts whose results depend on hardware, BLAS backend, and thread settings; these are run manually
The repository includes two benchmark scripts:
- benchmark_ridge_baselines.py: compact synthetic comparison of ridge_cv_svd, ridge_cv_svd_legacy, and ridge_cv_sklearn
- benchmark_ridge_vs_sklearn.py: focused comparison of per-target and global alpha selection against sklearn RidgeCV
Run from the repository root:
python benchmarks/benchmark_ridge_baselines.py
python benchmarks/benchmark_ridge_vs_sklearn.py
# scale up to stress-test on your hardware
python benchmarks/benchmark_ridge_baselines.py --n-targets 256 --n-inner 20
python benchmarks/benchmark_ridge_vs_sklearn.py --scoring-in pearson --n-samples 1000 --n-targets 128

Both scripts accept --n-samples, --n-features, --n-targets, --n-inner,
--n-outer, --scoring-in, --scoring-out, --with-std-x, and
--with-std-y flags.
The repository includes runnable examples under examples/:
- run_mfit_demo_basics.py: self-contained synthetic or sample-data demo
- sampledata/: small bundled arrays used by the demo script

Repository layout:
- regmfit/: package source
- benchmarks/: timing-oriented comparison scripts
- examples/: example scripts and sample data helpers
- tests/: automated tests
Run the test suite with:
pytest

The package is fully typed (py.typed) and tested against Python 3.10, 3.11,
3.12, 3.13, and 3.14 via the CI workflow in .github/workflows/tests.yml.