regmfit-toolkit

regmfit-toolkit is the repository for regmfit, a Python package for large-scale linear modeling with an emphasis on multi-target workloads, repeatable evaluation, and practical memory control.

The package includes:

  • SVD-based ridge regression with cross-validated alpha selection
  • A legacy ridge backend for simpler control flow and faster single-pass workloads
  • PCR and PLS utilities
  • Baseline wrappers built on scikit-learn
  • Split generators for k-fold, bootstrap, chunked, and grouped validation
  • Preprocessing and scoring helpers for reusable modeling pipelines

What regmfit is good at

regmfit is designed for workflows where one or more of the following are true:

  • many targets are fit against the same feature matrix (tens to thousands of outputs)
  • cross-validation is repeated many times (large bootstrap counts, nested-CV outer loops)
  • bootstrap evaluation is part of the model-selection workflow
  • memory use matters as much as raw solver speed (large feature matrices or target counts)
  • you want one package that covers the full workflow from splits to scoring to coefficient extraction

The main ridge path is built around a single-SVD-per-fold design with:

  • duplicate-aware bootstrap handling: collapses repeated bootstrap indices into integer count weights, reducing per-fold SVD cost without changing the solution
  • controlled fold parallelism: thread-pool limiting prevents BLAS oversubscription when running many folds in parallel
  • adaptive target batching: the full U_T @ Y projection is materialised when it fits the memory budget; target chunking only engages for genuinely large matrices
  • streaming fold accumulation: per-fold score surfaces are summed on the fly — no (n_folds, n_alphas, n_targets) cube is ever held in RAM
  • structured results: a frozen RidgeCVResult dataclass carries scores, weights, intercepts, and normalisation state together
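The single-SVD-per-fold idea and the streaming accumulation described above can be sketched in plain NumPy. This is a minimal illustration under stated assumptions, not the package's implementation: the helper names `ridge_path_predictions` and `r2_columns`, the 4-fold split, and the omission of centering are all choices made for this example.

```python
import numpy as np

def ridge_path_predictions(X_tr, Y_tr, X_val, alphas):
    """Yield validation predictions for every alpha from ONE SVD of X_tr."""
    U, s, Vt = np.linalg.svd(X_tr, full_matrices=False)
    UtY = U.T @ Y_tr            # (k, n_targets), computed once per fold
    XvalV = X_val @ Vt.T        # (n_val, k), computed once per fold
    for alpha in alphas:
        shrink = s / (s**2 + alpha)   # ridge filter factor per singular value
        yield XvalV @ (shrink[:, None] * UtY)

def r2_columns(Y_true, Y_pred):
    """Coefficient of determination, one value per target column."""
    ss_res = ((Y_true - Y_pred) ** 2).sum(axis=0)
    ss_tot = ((Y_true - Y_true.mean(axis=0)) ** 2).sum(axis=0)
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(0)
X = rng.standard_normal((120, 10))
Y = X @ rng.standard_normal((10, 4)) + 0.5 * rng.standard_normal((120, 4))
alphas = np.logspace(-2, 3, 6)

# Hypothetical 4-fold split over shuffled row indices
n = X.shape[0]
folds = np.array_split(rng.permutation(n), 4)

# Running sum of per-fold score surfaces; no (n_folds, n_alphas, n_targets) cube
score_sum = np.zeros((len(alphas), Y.shape[1]))
for val_idx in folds:
    train_idx = np.setdiff1d(np.arange(n), val_idx)
    path = ridge_path_predictions(X[train_idx], Y[train_idx], X[val_idx], alphas)
    for i, pred in enumerate(path):
        score_sum[i] += r2_columns(Y[val_idx], pred)

mean_scores = score_sum / len(folds)   # (n_alphas, n_targets)
```

The SVD is the expensive step, so reusing it across the whole alpha grid means each extra alpha costs only a diagonal rescale and one matrix product.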

Installation

The package itself can be installed directly from the repository root:

pip install -e .

For development:

pip install -e .
pip install pytest

Requires Python >= 3.10 (tested on 3.10 – 3.14). Runtime dependencies: numpy, scipy, scikit-learn, joblib, threadpoolctl.

Environment Setup Notes

For reproducible timing and model-fitting benchmarks, it is often better to create a clean environment just for regmfit rather than reusing a large general-purpose environment.

Recommended clean conda environment

conda create -n mfit -c conda-forge python pip numpy scipy scikit-learn joblib threadpoolctl
conda activate mfit
python -m pip install -e .

# optional utilities for examples / benchmarks
conda install -c conda-forge tqdm natsort pandas openpyxl matplotlib

Optional Intel-optimised environment

If you are running on Intel CPUs, it can also be worth benchmarking a separate environment based on Intel's Python distribution. On some systems this can improve BLAS-backed model-fitting workloads, but the benefit is machine-dependent, so it should be treated as an optional performance path rather than the default install route.

Example setup:

conda create -n idp -c https://software.repos.intel.com/python/conda -c conda-forge --override-channels intelpython3_full python=3.12
conda activate idp
conda install -c conda-forge pandas natsort openpyxl
python -m pip install -e .

If you care about throughput, benchmark both environments on your own workload. The best-performing stack can vary with BLAS linkage, CPU model, thread settings, and problem size.
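As a rough first check before running the full benchmarks, a small timing probe can be run once in each environment. The sketch below times a thin SVD with NumPy only; the function name `median_svd_seconds` and the matrix sizes are arbitrary choices for illustration, not part of regmfit.

```python
import time
import numpy as np

def median_svd_seconds(n_samples=400, n_features=100, repeats=5, seed=0):
    """Median wall-clock time of a thin SVD on a random matrix."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n_samples, n_features))
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        np.linalg.svd(X, full_matrices=False)
        timings.append(time.perf_counter() - start)
    return float(np.median(timings))

elapsed = median_svd_seconds()
print(f"median thin-SVD time: {elapsed * 1e3:.2f} ms")
```

A probe like this only indicates how the BLAS/LAPACK stack behaves on one operation; the scripts in benchmarks/ remain the better guide for end-to-end throughput.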

Quick Start

import numpy as np
from regmfit import ridge_cv_svd, make_bootstrap_splits

# n_samples × n_features and n_samples × n_targets
X_train = np.random.randn(200, 64)
X_test  = np.random.randn(50,  64)
y_train = np.random.randn(200, 128)
y_test  = np.random.randn(50,  128)

# build 500 bootstrap inner-CV splits
cv = make_bootstrap_splits(X_train.shape[0], n_splits=500, random_state=0)

result = ridge_cv_svd(
    X_train, X_test, y_train, y_test,
    alphas=np.logspace(-3, 5, 40),
    cv=cv,
    scoring_in="pearson",        # metric used to select alpha in inner CV
    scoring_out="pearson",       # metric used to evaluate on the test set
    with_std_x=True,             # optional: enable feature standardization
    with_std_y=False,            # optional: enable target standardization
    n_jobs=-1,                   # parallel folds
    collapse_duplicates=True,    # compress repeated bootstrap indices (default)
    memory_budget_mb=256.0,      # controls when target chunking kicks in (default 128)
    # return_weights=False       # set to skip weight storage (saves memory in null loops)
)

# --- Access the result ---
print(result.test_scores)      # (n_targets,)            test-set score per target
print(result.best_alphas)      # (n_targets,)            selected alpha per target
print(result.coef_scaled)      # (n_features, n_targets) weights in standardised space
print(result.coef_raw)         # (n_features, n_targets) weights in original space
print(result.intercept_raw)    # (n_targets,)            intercepts in original space
print(result.ols_test_scores)  # (n_targets,)            OLS (alpha=0) test-set scores
# coef_scaled, coef_raw, intercept_raw are None when return_weights=False

By default, ridge_cv_svd centers X and y (with_mean_x=True, with_mean_y=True) and leaves variance scaling off (with_std_x=False, with_std_y=False) unless you enable it explicitly.
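To make the centering and scaling flags concrete, here is a minimal sketch of what centering on by default and scaling off by default means for a train/test pair. The helper `center_scale_train_test` is hypothetical and is not the package's `fit_center_scale` or `scale_train_test`; it assumes that statistics are estimated on the training rows only and then applied to the test rows.

```python
import numpy as np

def center_scale_train_test(A_train, A_test, with_mean=True, with_std=False):
    """Center/scale both arrays using statistics from the training array only."""
    n_cols = A_train.shape[1]
    mean = A_train.mean(axis=0) if with_mean else np.zeros(n_cols)
    if with_std:
        scale = A_train.std(axis=0)
        scale[scale == 0] = 1.0      # leave constant columns unscaled
    else:
        scale = np.ones(n_cols)
    return (A_train - mean) / scale, (A_test - mean) / scale, mean, scale

rng = np.random.default_rng(0)
X_train = 3.0 + 2.0 * rng.standard_normal((200, 5))
X_test = 3.0 + 2.0 * rng.standard_normal((50, 5))

# Defaults mirror with_mean_x=True, with_std_x=False
Xtr_c, Xte_c, x_mean, x_scale = center_scale_train_test(X_train, X_test)

# Opting in to variance scaling, as with_std_x=True would
Xtr_s, Xte_s, _, x_scale_std = center_scale_train_test(X_train, X_test, with_std=True)
```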

CV Split Types

| Split | When to use |
| --- | --- |
| make_bootstrap_splits(n_samples, n_splits) | Repeated bootstrap inner CV; pairs well with collapse_duplicates=True |
| make_kfold_splits(n_samples, n_splits) | Standard k-fold inner CV |
| LeaveOneRunOut(groups) | Leave-one-group-out by run, session, or block label |
| ChunkedSplit(n_splits, chunklen, ...) | Contiguous block splits for temporally or sequentially structured data |

Pass either a pre-built list of (train_indices, val_indices) tuples or a scikit-learn splitter object to cv; the split-normalisation step accepts both.
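A custom split list is just a Python list of index-array pairs. The sketch below builds bootstrap splits by hand, under the assumption that the rows not drawn in a resample (the out-of-bag rows) serve as the validation set; `bootstrap_splits_sketch` is a hypothetical helper and `make_bootstrap_splits` may construct its splits differently.

```python
import numpy as np

def bootstrap_splits_sketch(n_samples, n_splits, seed=0):
    """Return a list of (train_indices, val_indices) tuples."""
    rng = np.random.default_rng(seed)
    all_idx = np.arange(n_samples)
    splits = []
    while len(splits) < n_splits:
        # Training rows are drawn with replacement, so duplicates are expected
        train_idx = rng.integers(0, n_samples, size=n_samples)
        # Validation rows are the ones never drawn (out-of-bag)
        val_idx = np.setdiff1d(all_idx, train_idx)
        if val_idx.size:             # skip a draw that leaves nothing out
            splits.append((train_idx, val_idx))
    return splits

splits = bootstrap_splits_sketch(n_samples=50, n_splits=20, seed=0)
train_idx, val_idx = splits[0]
print(len(splits), train_idx.shape, val_idx.shape)
```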

Scoring Metrics

Supported values for scoring_in and scoring_out:

| Value | Description | Notes |
| --- | --- | --- |
| "r2" | Coefficient of determination | |
| "pearson" | Pearson correlation | recommended for inner CV selection |
| "spearman" | Spearman rank correlation | |
| "mse" | Mean squared error | lower is better |
| "rmse" | Root mean squared error | lower is better |
| "explained_variance" | Explained variance score | |

For inner CV selection (scoring_in), "pearson" and "r2" are the most common choices. "mse" works well when target variances are comparable across targets.
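For multi-target work, each metric is evaluated once per target column. The following sketch shows column-wise Pearson correlation and R² in NumPy; the function names are illustrative and are not the package's `get_score` interface.

```python
import numpy as np

def pearson_columns(Y_true, Y_pred):
    """Pearson correlation between matching columns, shape (n_targets,)."""
    yt = Y_true - Y_true.mean(axis=0)
    yp = Y_pred - Y_pred.mean(axis=0)
    denom = np.sqrt((yt**2).sum(axis=0) * (yp**2).sum(axis=0))
    return (yt * yp).sum(axis=0) / denom

def r2_columns(Y_true, Y_pred):
    """Coefficient of determination per column, shape (n_targets,)."""
    ss_res = ((Y_true - Y_pred) ** 2).sum(axis=0)
    ss_tot = ((Y_true - Y_true.mean(axis=0)) ** 2).sum(axis=0)
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(0)
Y_true = rng.standard_normal((30, 3))
Y_pred = Y_true + 0.5 * rng.standard_normal((30, 3))

pearson_scores = pearson_columns(Y_true, Y_pred)   # (3,)
r2_scores = r2_columns(Y_true, Y_pred)             # (3,)
```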

Recommended Modeling Defaults

For general-purpose regression, the best default pair is:

  • scoring_in="r2"
  • scoring_out="r2"

Why this is the safest default:

  • r2 is scale-normalised, so targets with different raw variances are still comparable during inner-CV model selection
  • it matches the default behaviour of the package and common sklearn conventions
  • it keeps the inner and outer metrics aligned, which makes model comparison easier to interpret
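The scale-normalisation point can be checked numerically. In this small sketch (plain NumPy, independent of regmfit), multiplying a target and its prediction by 10 multiplies the MSE by 100 and leaves R² unchanged, which is why raw MSE is harder to compare across targets with different variances.

```python
import numpy as np

def mse(y, pred):
    return float(np.mean((y - pred) ** 2))

def r2(y, pred):
    return float(1.0 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2))

rng = np.random.default_rng(0)
y = rng.standard_normal(200)
pred = y + 0.3 * rng.standard_normal(200)

scale = 10.0
mse_ratio = mse(scale * y, scale * pred) / mse(y, pred)   # grows with scale**2
r2_diff = r2(scale * y, scale * pred) - r2(y, pred)       # unaffected by scale
print(mse_ratio, r2_diff)
```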

Use scoring_in="mse" when:

  • targets are already on comparable scales, or
  • you enable with_std_y=True, or
  • you explicitly want alpha selection to optimise squared-error calibration

Use scoring_out="pearson" when:

  • you care more about directional agreement than explained variance, or
  • target amplitudes are not directly comparable across tasks, datasets, or preprocessing choices

In short:

  • general default: r2 / r2
  • correlation-focused evaluation: r2 / pearson
  • error-focused selection on standardised targets: mse / r2 or mse / pearson

Alpha Selection Modes

The alpha_mode parameter controls how the best regularisation strength is chosen from the inner-CV score surface:

| Value | Behaviour |
| --- | --- |
| "per_target" (default) | Each target independently selects the alpha with its best inner-CV score. Best when targets have different regularisation scales. |
| "global" | A single alpha is selected by aggregating across all targets. Useful when targets share the same underlying signal-to-noise structure. |
| "both" | Fits with per-target alphas and also records result.best_alpha_global for reference. |
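The difference between the modes can be seen on a toy score surface. This sketch assumes a higher-is-better metric and assumes that the global mode aggregates with a plain mean across targets; the package's actual aggregation rule is not stated here, so treat that choice as illustrative.

```python
import numpy as np

alphas = np.array([0.1, 1.0, 10.0, 100.0])

# Toy mean inner-CV score surface, shape (n_alphas, n_targets)
cv_scores = np.array([
    [0.10, 0.40, 0.20],
    [0.30, 0.35, 0.25],
    [0.50, 0.20, 0.30],
    [0.45, 0.05, 0.10],
])

# per_target: each column picks its own best row
best_per_target = alphas[np.argmax(cv_scores, axis=0)]

# global: one row is picked for all columns (mean across targets assumed)
best_global = alphas[np.argmax(cv_scores.mean(axis=1))]

print(best_per_target)   # [10.   0.1 10. ]
print(best_global)       # 10.0
```

For lower-is-better metrics such as "mse", the same logic would use argmin instead of argmax.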

RidgeCVResult Fields

ridge_cv_svd returns a frozen RidgeCVResult dataclass:

| Field | Shape | Description |
| --- | --- | --- |
| test_scores | (n_targets,) | Test-set score per target |
| best_alphas | (n_targets,) | Selected alpha per target |
| coef_scaled | (n_features, n_targets) | Weights in standardised feature/target space |
| coef_raw | (n_features, n_targets) | Weights in original (unscaled) feature space |
| intercept_raw | (n_targets,) | Intercept in original target space |
| ols_test_scores | (n_targets,) | OLS (alpha=0) test-set scores |
| ols_coef_scaled | (n_features, n_targets) | OLS weights in standardised space |
| ols_coef_raw | (n_features, n_targets) | OLS weights in original space |
| cv_scores | (n_alphas, n_targets) | Mean inner-CV score surface; populated when return_cv_scores=True |
| best_alpha_global | scalar | Single global alpha; set when alpha_mode is "global" or "both" |
| x_state | CenterScaleState | Feature centering/scaling state |
| y_state | CenterScaleState | Target centering/scaling state |

Weight and intercept fields are None when return_weights=False.
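For readers unfamiliar with frozen dataclasses, the sketch below shows the general behaviour using a cut-down stand-in. `MiniResult` is hypothetical and carries only three of the fields listed above; it is not the package's RidgeCVResult definition.

```python
from dataclasses import dataclass, FrozenInstanceError
from typing import Optional

import numpy as np

@dataclass(frozen=True)
class MiniResult:
    """Hypothetical stand-in with a subset of RidgeCVResult-style fields."""
    test_scores: np.ndarray
    best_alphas: np.ndarray
    coef_raw: Optional[np.ndarray] = None   # None mimics return_weights=False

res = MiniResult(
    test_scores=np.array([0.4, 0.6]),
    best_alphas=np.array([1.0, 10.0]),
)

# Rebinding a field on a frozen dataclass raises FrozenInstanceError
try:
    res.test_scores = np.zeros(2)
    reassigned = True
except FrozenInstanceError:
    reassigned = False

print(reassigned, res.coef_raw is None)
```

Freezing prevents fields from being rebound, but the arrays they hold are still ordinary mutable NumPy arrays, so copy them before modifying values in place.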

Main API Surface

Core ridge interfaces:

  • ridge_cv_svd(X_train, X_test, y_train, y_test, alphas, cv, ...) — functional nested-CV ridge fitting; returns RidgeCVResult
  • RidgeSVDCV — scikit-learn compatible estimator backed by the same SVD path
  • ridge_cv_svd_legacy, RidgeSVDCVLegacy — lighter dense-SVD path (see below)

Supporting modules:

| Module | Contents |
| --- | --- |
| model_selection | make_bootstrap_splits, make_kfold_splits, LeaveOneRunOut, ChunkedSplit, normalize_cv_splits |
| preprocessing | fit_center_scale, scale_train_test, scaled_coef_to_raw, CenterScaleState |
| metrics | get_score, score_from_context, score_many_from_context, SUPPORTED_SCORINGS |
| baselines | ridge_cv_sklearn, ridge_cv_sklearn_rk, poisson_cv_sklearn, poisson_sklearn: sklearn-backed reference implementations that return tuples, not RidgeCVResult; see each function's docstring for field order |
| pcr | pcr_cv, pcr_cv_sk: cross-validated principal components regression over a component-count grid |
| pls | pls_cv: cross-validated partial least squares over a latent-component grid |
| stats_utils | get_bootstrappval_cols, get_permutationpval_cols, get_maxTpval_cols: vectorised bootstrap and permutation p-values with two-sided / greater / less alternatives |
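As background for the stats_utils entry, a vectorised permutation p-value over many targets can be written in a few lines. This is a generic sketch, not the package's `get_permutationpval_cols`: the function name, the add-one correction, and the use of absolute values for the two-sided case (which assumes a null distribution roughly symmetric around zero) are all assumptions of this example.

```python
import numpy as np

def permutation_pvalues(observed, null, alternative="greater"):
    """observed: (n_targets,); null: (n_perm, n_targets). Returns (n_targets,)."""
    if alternative == "greater":
        exceed = null >= observed
    elif alternative == "less":
        exceed = null <= observed
    elif alternative == "two-sided":
        exceed = np.abs(null) >= np.abs(observed)
    else:
        raise ValueError(f"unknown alternative: {alternative!r}")
    # Add-one correction keeps p-values strictly positive
    return (1 + exceed.sum(axis=0)) / (1 + null.shape[0])

# Deterministic toy null: 99 permutation scores per target, from -49 to 49
null = np.repeat(np.arange(-49, 50, dtype=float)[:, None], 2, axis=1)
observed = np.array([0.0, 60.0])

p_greater = permutation_pvalues(observed, null, "greater")
p_less = permutation_pvalues(observed, null, "less")
p_two_sided = permutation_pvalues(observed, null, "two-sided")
print(p_greater, p_less, p_two_sided)
```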

Choosing Between ridge and ridge_legacy

Use the main ridge backend (ridge_cv_svd) when you want:

  • the default, fully-featured path
  • better scaling on repeated bootstrap workloads (deduplication, streaming sums)
  • memory-budget-aware batching for large target counts
  • the most complete and actively maintained implementation
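The deduplication mentioned above rests on a simple identity: a row that appears c times in a bootstrap sample contributes the same normal-equation terms as a single copy scaled by sqrt(c). The sketch below verifies this for an uncentered ridge fit; `ridge_weights` is a hypothetical helper, and handling centering (which would need count-weighted means) is left out for brevity.

```python
import numpy as np

def ridge_weights(X, Y, alpha):
    """Ridge weights via a thin SVD, no intercept."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt.T @ ((s / (s**2 + alpha))[:, None] * (U.T @ Y))

rng = np.random.default_rng(0)
X = rng.standard_normal((60, 8))
Y = rng.standard_normal((60, 3))
boot_idx = rng.integers(0, 60, size=60)    # bootstrap draw with repeats

# Reference: fit on the full resample, duplicates included
W_full = ridge_weights(X[boot_idx], Y[boot_idx], alpha=5.0)

# Collapsed: keep each distinct row once, scaled by sqrt(count)
unique_idx, counts = np.unique(boot_idx, return_counts=True)
w = np.sqrt(counts)[:, None]
W_collapsed = ridge_weights(w * X[unique_idx], w * Y[unique_idx], alpha=5.0)

print(boot_idx.size, unique_idx.size, np.allclose(W_full, W_collapsed))
```

Because a bootstrap resample typically contains only about 63% distinct rows, the collapsed SVD runs on a noticeably smaller matrix while giving the same weights.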

Use ridge_legacy (ridge_cv_svd_legacy) when you want:

  • a simpler dense-SVD path with fewer moving parts
  • faster single-pass performance on small-to-medium data where the overhead of deduplication and adaptive batching is not needed
  • a reference implementation for debugging or comparison

import numpy as np
from regmfit.ridge_legacy import ridge_cv_svd_legacy, RidgeSVDCVLegacy

# X_train, X_test, y_train, y_test, and cv are defined as in Quick Start

result = ridge_cv_svd_legacy(
    X_train, X_test, y_train, y_test,
    alphas=np.logspace(-3, 4, 32),
    cv=cv,
    scoring_in="pearson",
    scoring_out="pearson",
)
# Returns the same RidgeCVResult dataclass as ridge_cv_svd

Both backends return the same RidgeCVResult structure, so they are drop-in alternatives at the call site.

Model Fitting Details

regmfit is designed so the main solver choice and preprocessing choices stay explicit:

  • ridge_cv_svd is the default backend for repeated CV, large target counts, or memory-sensitive workloads
  • ridge_cv_svd_legacy is the simpler dense-SVD reference path and is often useful for debugging, benchmarking, or small-to-medium problems
  • with_mean_x=True and with_mean_y=True are the default centering choices
  • with_std_x=False and with_std_y=False are the default scaling choices
  • alpha_mode="per_target" is the default because different targets often prefer different regularisation strengths
  • alpha_mode="global" is useful when you want one shared alpha across all outputs for simplicity or comparability

The weight-space outputs reflect the chosen preprocessing:

  • coef_scaled lives in the centered/scaled space used internally by the fit
  • coef_raw is converted back to the original feature scale and is the version to use for direct prediction on raw inputs
  • ols_* fields are the unregularised alpha=0 reference fit evaluated with the same scoring choices as the ridge solution
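The relationship between the scaled and raw weight spaces follows from substituting the standardisation into the prediction formula. The sketch below shows that algebra with a hypothetical helper, `scaled_to_raw`; the package ships its own `scaled_coef_to_raw` in preprocessing, whose signature may differ from what is assumed here.

```python
import numpy as np

def scaled_to_raw(coef_scaled, x_mean, x_scale, y_mean, y_scale):
    """Map weights fitted on standardised data back to the original units."""
    coef_raw = coef_scaled * y_scale[None, :] / x_scale[:, None]
    intercept_raw = y_mean - x_mean @ coef_raw
    return coef_raw, intercept_raw

rng = np.random.default_rng(0)
X = 5.0 + 3.0 * rng.standard_normal((80, 4))
Y = X @ rng.standard_normal((4, 2)) + 2.0

# Standardise features and targets
x_mean, x_scale = X.mean(axis=0), X.std(axis=0)
y_mean, y_scale = Y.mean(axis=0), Y.std(axis=0)
Xs, Ys = (X - x_mean) / x_scale, (Y - y_mean) / y_scale

# Any linear fit in the scaled space works for the demonstration
coef_scaled = np.linalg.lstsq(Xs, Ys, rcond=None)[0]
coef_raw, intercept_raw = scaled_to_raw(coef_scaled, x_mean, x_scale, y_mean, y_scale)

# Both routes give the same predictions in original target units
pred_via_scaled = (Xs @ coef_scaled) * y_scale + y_mean
pred_via_raw = X @ coef_raw + intercept_raw
```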

sklearn Estimator Interface

RidgeSVDCV wraps the same SVD backend in a scikit-learn BaseEstimator:

import numpy as np
from regmfit import RidgeSVDCV
from sklearn.model_selection import KFold

est = RidgeSVDCV(
    alphas=np.logspace(-3, 4, 32),
    cv=KFold(5),
    scoring_in="pearson",
)
est.fit(X_train, y_train)
scores = est.score(X_test, y_test)   # (n_targets,)
coef   = est.coef_                   # (n_features, n_targets)

Benchmarking

Timing comparisons live in benchmarks/, not in tests/:

  • tests/ is for correctness, validation, and API behaviour
  • benchmarks/ is for timing-sensitive scripts whose results depend on hardware, BLAS backend, and thread settings — these are run manually

The repository includes two benchmark scripts:

  • benchmark_ridge_baselines.py — compact synthetic comparison of ridge_cv_svd, ridge_cv_svd_legacy, and ridge_cv_sklearn
  • benchmark_ridge_vs_sklearn.py — focused comparison of per-target and global alpha selection against sklearn RidgeCV

Run from the repository root:

python benchmarks/benchmark_ridge_baselines.py
python benchmarks/benchmark_ridge_vs_sklearn.py

# scale up to stress-test on your hardware
python benchmarks/benchmark_ridge_baselines.py --n-targets 256 --n-inner 20
python benchmarks/benchmark_ridge_vs_sklearn.py --scoring-in pearson --n-samples 1000 --n-targets 128

Both scripts accept --n-samples, --n-features, --n-targets, --n-inner, --n-outer, --scoring-in, --scoring-out, --with-std-x, and --with-std-y flags.

Examples

The repository includes runnable examples under examples/:

  • run_mfit_demo_basics.py: self-contained synthetic or sample-data demo
  • sampledata/: small bundled arrays used by the demo script

Repository Layout

  • regmfit/: package source
  • benchmarks/: timing-oriented comparison scripts
  • examples/: example scripts and sample data helpers
  • tests/: automated tests

Development

Run the test suite with:

pytest

The package is fully typed (py.typed) and tested against Python 3.10, 3.11, 3.12, 3.13, and 3.14 via the CI workflow in .github/workflows/tests.yml.
