Context Anchor: This component is a pure statistical logic engine of the Macro System. For the overarching architecture and vision, see SYSTEM_ARCHITECTURE and MISSION.
A high-performance Rust library for performing Oaxaca-Blinder decomposition, designed for economists, data scientists, and HR analysts. It decomposes the gap in an outcome variable (like wage) between two groups into "explained" (characteristics) and "unexplained" (discrimination/coefficients) components.
The library includes Python bindings exposed via pyo3.
Note for Python 3.13+ users:
If you are building from source on Python 3.13 or newer, you may need to enable ABI3 forward compatibility if you encounter build errors. We have enabled this via .cargo/config.toml for development, but for distribution builds, ensure:

```bash
export PYO3_USE_ABI3_FORWARD_COMPATIBILITY=1
```

To build and install the Python package:

```bash
maturin develop --features python
```

The Python API supports standard Oaxaca-Blinder mean decomposition, Quantile Decomposition (RIF & Machado-Mata), AKM (Abowd-Kramarz-Margolis) models, Propensity Score Matching, DFL Reweighting, and Budget Optimization for policy simulation.
🚀 Feature Support
| Feature | Support |
|---|---|
| OLS Mean Decomposition | ✅ |
| Quantile Decomposition (Machado-Mata) | ✅ |
| Quantile Decomposition (RIF Regression) | ✅ |
| Categorical Normalization (Yun) | ✅ |
| Bootstrapped Standard Errors | ✅ |
| Budget Optimization ("Cheapest Fix") | ✅ |
| Wage Scale Design (Grade/Step) | ✅ |
| JMP Decomposition (Time Series) | ✅ |
| DFL Reweighting (Counterfactuals) | ✅ |
| Sample Weights | ✅ |
| Heckman Correction (Selection Bias) | ✅ |
| AKM (Worker-Firm Fixed Effects) | ✅ |
| Matching (Euclidean, Mahalanobis, PSM) | ✅ |
🏆 Why Use This Library?
Most economists rely on the oaxaca R package or statsmodels in Python. While excellent, they have limitations that this library addresses:
- 🚀 Speed: Written in Rust with parallelized bootstrapping (Rayon). It is 20-30x faster than R and 10x faster than Python for large datasets (see Benchmarks).
- 📦 All-in-One Toolkit: In R, you need
oaxacafor decomposition,rifregfor quantiles,MatchItfor matching, andlfefor AKM. In Python,statsmodelslacks built-in RIF, Matching, and AKM. This library unifies all of them into a single, consistent API. - 🛡️ Type Safety: Rust's strict type system prevents common data errors (like silent
NaNpropagation) that can plague dynamic languages. - 🧠 Unique Features: Includes the "Cheapest Fix" budget optimization solver, a tool specifically designed for HR departments to close pay gaps efficiently—something no other standard library offers.
- 🐍 Python & CLI Support: You don't need to know Rust. Use the high-performance engine directly from Python or the command line.
- ⚡ Parallelized Inference: Bootstrapping standard errors for Oaxaca decompositions is computationally intensive. This library uses Rayon to parallelize this across all CPU cores, reducing wait times from minutes to seconds.
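The pattern looks roughly like this (an illustrative sketch, not the library's internals; `bootstrap_se` and its signature are invented here for illustration):

```rust
use rand::Rng;
use rayon::prelude::*;

// Hypothetical helper, not part of the library's API: each Rayon task draws
// one resample with replacement and recomputes the statistic; replicates run
// across all CPU cores, and the standard error is the spread of the
// replicate estimates.
fn bootstrap_se(data: &[f64], reps: usize, stat: fn(&[f64]) -> f64) -> f64 {
    let estimates: Vec<f64> = (0..reps)
        .into_par_iter()
        .map(|_| {
            let mut rng = rand::thread_rng();
            // Resample n observations with replacement.
            let resample: Vec<f64> = (0..data.len())
                .map(|_| data[rng.gen_range(0..data.len())])
                .collect();
            stat(&resample)
        })
        .collect();
    let mean = estimates.iter().sum::<f64>() / reps as f64;
    (estimates.iter().map(|e| (e - mean).powi(2)).sum::<f64>() / (reps as f64 - 1.0)).sqrt()
}
```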
🖥️ Command Line Interface (CLI)
Don't want to write Rust code? You can use the oaxaca-cli tool directly from your terminal to analyze CSV files.
```bash
cargo install oaxaca_blinder --features cli
```

Basic Decomposition:

```bash
oaxaca-cli --data wage.csv --outcome wage --group gender --reference F \
    --predictors education experience --categorical sector
```

Using R-style Formula:

```bash
oaxaca-cli --data wage.csv --group gender --reference F \
    --formula "wage ~ education + experience + C(sector)"
```

With Sample Weights (WLS):

```bash
oaxaca-cli --data wage.csv --outcome wage --group gender --reference F \
    --predictors education experience \
    --weights sampling_weight
```

With Heckman Correction (Selection Bias):

```bash
oaxaca-cli --data wage.csv --outcome wage --group gender --reference F \
    --predictors education experience \
    --selection-outcome employed \
    --selection-predictors education experience age marital_status
```

Export Results:

```bash
oaxaca-cli --data wage.csv ... --output-json results.json --output-markdown report.md
```

Generate HTML Report:

```bash
oaxaca-cli report \
    --data wage.csv \
    --outcome wage \
    --group gender \
    --reference F \
    --predictors education,experience \
    --categorical sector \
    --output report.html
```

Supports both `--analysis-type mean` (default) and `--analysis-type quantile`.
⚡ Quick Start
Add to Cargo.toml:

```toml
[dependencies]
oaxaca_blinder = "0.1.0"
polars = { version = "0.38", features = ["lazy", "csv"] }
```

```rust
use polars::prelude::*;
use oaxaca_blinder::{OaxacaBuilder, ReferenceCoefficients};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let df = df!(
        "wage" => &[25.0, 30.0, 35.0, 40.0, 45.0, 20.0, 22.0, 28.0, 32.0, 38.0],
        "education" => &[16.0, 18.0, 14.0, 20.0, 16.0, 12.0, 14.0, 16.0, 12.0, 18.0],
        "gender" => &["M", "M", "M", "M", "M", "F", "F", "F", "F", "F"]
    )?;

    let results = OaxacaBuilder::new(df, "wage", "gender", "F")
        .predictors(&["education"])
        .reference_coefficients(ReferenceCoefficients::Pooled)
        .run()?;

    results.summary();
    Ok(())
}
```

```python
import polars as pl
from oaxaca_blinder import OaxacaBlinder

# Load data with Polars
df = pl.read_csv("wage.csv")

# Initialize the Estimator
model = OaxacaBlinder(
    dataframe=df,
    outcome="wage",
    group="gender",
    reference_group="F",
    predictors=["education", "experience"],
    categorical_predictors=["sector"]
)

# Run Mean Decomposition
results = model.fit()
print(f"Total Gap: {results.total_gap:.4f}")
print(f"Explained: {results.two_fold.aggregate.explained.estimate:.4f}")
print(f"Unexplained: {results.two_fold.aggregate.unexplained.estimate:.4f}")

# Run Quantile Decomposition (RIF)
q_results = model.fit_quantile(0.9)
print(f"90th Percentile Gap: {q_results.total_gap:.4f}")

# Run Budget Optimization
# "How do we close the gap to 0 with a $100,000 budget?"
adjustments = model.optimize_budget(budget=100000.0, target_gap=0.0)
```

💰 Policy Simulation: Budget Optimization
"The Cheapest Fix"
This unique feature is designed for HR analytics. It answers: "Given a limited budget, how can we reduce the pay gap as much as possible?"
It identifies individuals in the disadvantaged group with the largest negative unexplained residuals (i.e., the most "underpaid" relative to their qualifications) and calculates the optimal raises.
```rust
// Scenario: You have $200,000 to reduce the gap to 5%
let adjustments = results.optimize_budget(200_000.0, 0.05);

for adj in adjustments {
    println!("Give ${:.2} raise to employee #{}", adj.adjustment, adj.index);
}
```

📐 Policy Simulation: Wage Scale Design
"The Structural Fix"
Sometimes, you can't just give individual raises. You need to redesign the entire pay scale (Grades & Steps) to be fair, subject to budget constraints.
This optimization engine finds the optimal min_step_diff (increment between steps) and min_grade_diff (increment between grades) to minimize the total cost while ensuring no current employee takes a pay cut.
```rust
use optimization_engine::WageScaleProblem;

let problem = WageScaleProblem::new(dataframe, budget, grades, steps, min_wage);

// Solves for optimal structural parameters
let solution = problem.solve()?;
```

📊 Quantile Decomposition Strategies
The library supports two robust methods for decomposing the wage gap across the distribution:
| Method | Best For... | Builder |
|---|---|---|
| Machado-Mata (Simulation) | Constructing full counterfactual distributions and "glass ceiling" analysis. | QuantileDecompositionBuilder |
| RIF Regression (Analytical) | Fast, detailed decomposition of specific quantiles (e.g., "Why is the 90th percentile gap so large?"). | OaxacaBuilder::decompose_quantile(0.9) |
```rust
// Fast decomposition of the 90th percentile gap
let results = OaxacaBuilder::new(df, "wage", "gender", "F")
    .predictors(&["education", "experience"])
    .decompose_quantile(0.9)?;
```

```bash
oaxaca-cli --data wage.csv --outcome wage --group gender --reference F \
    --predictors education experience \
    --analysis-type quantile --quantiles 0.1,0.5,0.9
```

Note: Python bindings for quantile decomposition are coming soon.
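For the Machado-Mata route, only the builder's name appears in the table above; the sketch below assumes it mirrors `OaxacaBuilder`'s fluent style, so every method shown is an assumption and may differ from the actual API:

```rust
use oaxaca_blinder::QuantileDecompositionBuilder;

// Hedged sketch: method names below are assumed by analogy with
// OaxacaBuilder and may not match the actual API.
let mm_results = QuantileDecompositionBuilder::new(df, "wage", "gender", "F")
    .predictors(&["education", "experience"]) // assumed method
    .quantiles(&[0.1, 0.5, 0.9])              // assumed method
    .run()?;                                  // assumed method
```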
📈 Visualizing DFL Reweighting
DiNardo-Fortin-Lemieux (DFL) reweighting (Rust Only) is a non-parametric alternative that allows you to visualize what the wage distribution of Group B would look like if they had the characteristics of Group A.
The run_dfl function returns density vectors perfect for plotting in Python (matplotlib) or Rust (plotters).
```rust
use oaxaca_blinder::run_dfl;

let dfl = run_dfl(&df, "wage", "gender", "F", &["education", "experience"])?;

// dfl.grid                     <- X-axis (Wage levels)
// dfl.density_a                <- Actual Group A Density
// dfl.density_b                <- Actual Group B Density
// dfl.density_b_counterfactual <- "What B would earn with A's characteristics"
```

Tip: Plot density_b vs density_b_counterfactual to visualize the "explained" gap.
⏱️ Benchmarks
Designed for performance, utilizing Rust's speed and parallelization (Rayon) for bootstrapping.
Performance vs Python (statsmodels) vs R (oaxaca)
Dataset: 100k rows, 10 predictors
| Reps | Rust (oaxaca_blinder) | Python (statsmodels) | R (oaxaca) |
|---|---|---|---|
| 1 (Raw) | 0.14s 🚀 | 0.15s | ? |
| 100 | 0.76s 🚀 | N/A | ? |
| 500 | 3.11s 🚀 | N/A | ~119.4s |
Rust's raw decomposition is on par with statsmodels for a single run, and its parallelized bootstrap is orders of magnitude faster than R.
Matching Engine
The library includes a high-performance Matching Engine for causal inference, supporting Euclidean, Mahalanobis, and Propensity Score Matching (PSM).
```rust
use oaxaca_blinder::MatchingEngine;
use polars::prelude::*;

// Load data...
let engine = MatchingEngine::new(df, "treatment", "outcome", &["age", "education"]);

// 1-Nearest Neighbor Matching with Mahalanobis distance
let weights = engine.run_matching(1, true)?;
```

```python
import oaxaca_blinder

# Match units
weights = oaxaca_blinder.match_units(
    "data.csv",
    treatment="treatment",
    outcome="wage",
    covariates=["education", "experience"],
    k=1,
    method="mahalanobis"  # or "euclidean", "psm"
)
```

```bash
oaxaca-cli --data wage.csv --outcome wage --group treatment --reference 0 \
    --predictors education,experience \
    --analysis-type match --matching-method mahalanobis --k-neighbors 1
```

Deep Dive: The Indexing Problem & Reference Groups
The decomposition depends on the choice of the non-discriminatory coefficient vector $\beta^*$.
This library supports:
- Group A / Group B: Uses $\beta_A$ or $\beta_B$ as the reference.
- Pooled (Neumark): Uses $\beta^*$ from a pooled regression of both groups.
- Weighted (Cotton): Uses a weighted average: $\beta^* = w\beta_A + (1-w)\beta_B$.
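Whichever reference is chosen, it enters the standard twofold decomposition as:

$$\bar{y}_A - \bar{y}_B = \underbrace{(\bar{X}_A - \bar{X}_B)'\beta^*}_{\text{explained}} + \underbrace{\bar{X}_A'(\beta_A - \beta^*) + \bar{X}_B'(\beta^* - \beta_B)}_{\text{unexplained}}$$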
Deep Dive: Categorical Variables (Yun Normalization)
Standard detailed decomposition is sensitive to the choice of the omitted base category for dummy variables. This library implements Yun's normalization, which transforms coefficients to be invariant to the base category choice:
$$\tilde{\beta}_j = \beta_j - \bar{\beta}, \qquad \tilde{\beta}_0 = \beta_0 + \bar{\beta}, \qquad \bar{\beta} = \frac{1}{J}\sum_{j=1}^{J}\beta_j$$

where $\beta_0$ is the intercept and the $\beta_j$ are the coefficients of the $J$ categories (the omitted base entering with $\beta_j = 0$). The normalized coefficients $\tilde{\beta}_j$ sum to zero, so the detailed decomposition no longer depends on which category was omitted.
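A self-contained sketch of the arithmetic (plain Rust, independent of the library's API; `yun_normalize` is invented here for illustration):

```rust
// Shift each dummy coefficient by the mean over all J categories (base
// included with an implicit coefficient of 0) and absorb that mean into
// the intercept.
fn yun_normalize(intercept: f64, dummy_betas: &[f64]) -> (f64, Vec<f64>) {
    let j = (dummy_betas.len() + 1) as f64; // +1 for the omitted base category
    let mean = dummy_betas.iter().sum::<f64>() / j;
    let new_intercept = intercept + mean;
    // Normalized coefficients: the base becomes -mean, the rest beta_j - mean.
    let mut normalized = vec![-mean];
    normalized.extend(dummy_betas.iter().map(|b| b - mean));
    (new_intercept, normalized)
}

fn main() {
    // Example: sector dummies with "public" as the omitted base.
    let (b0, betas) = yun_normalize(2.0, &[0.30, 0.60]);
    println!("intercept: {b0:.2}, normalized: {betas:?}");
    // The normalized coefficients sum to zero, so the detailed decomposition
    // is invariant to the choice of base category.
    assert!(betas.iter().sum::<f64>().abs() < 1e-12);
}
```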
Deep Dive: JMP Decomposition
The Juhn-Murphy-Pierce (JMP) method decomposes the change in the gap over time (or between distributions) into three components:
- Quantity Effect: Changes in observable characteristics ($X$).
- Price Effect: Changes in returns to characteristics ($\beta$).
- Gap Effect: Changes in the distribution of unobserved residuals.
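Schematically (one of several equivalent orderings in the literature), writing $D_t$ for the between-group gap, $\Delta X_t$ for the between-group difference in mean characteristics, and $\Delta u_t$ for the between-group residual gap in period $t$:

$$D_t - D_s = \underbrace{(\Delta X_t - \Delta X_s)'\beta_s}_{\text{quantity}} + \underbrace{\Delta X_t'(\beta_t - \beta_s)}_{\text{price}} + \underbrace{(\Delta u_t - \Delta u_s)}_{\text{gap}}$$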
Deep Dive: Abowd-Kramarz-Margolis (AKM) Model
The AKM model decomposes wage variation into individual and firm-specific components:

$$y_{it} = \alpha_i + \psi_{J(i,t)} + x_{it}'\beta + \varepsilon_{it}$$

- $\alpha_i$: Person fixed effect (unobserved ability).
- $\psi_{J(i,t)}$: Firm fixed effect (pay premium).
- $x_{it}$: Time-varying covariates.
Identification: The model is identified only within the Largest Connected Set (LCS) of workers and firms linked by mobility. This library automatically extracts the LCS using a graph-based approach (BFS) before estimation.
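The LCS step can be sketched as a plain graph traversal (an illustrative reimplementation, not the library's internal code): treat workers and firms as nodes of one bipartite graph, BFS each component, and keep the largest.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

// Keep only the (worker, firm) observations inside the largest connected
// set of the worker-firm mobility graph. Nodes are tagged with a bool to
// keep the worker and firm ID spaces disjoint.
fn largest_connected_set(links: &[(u32, u32)]) -> HashSet<(u32, u32)> {
    // Build adjacency: an edge exists whenever a worker is observed at a firm.
    let mut adj: HashMap<(bool, u32), Vec<(bool, u32)>> = HashMap::new();
    for &(w, f) in links {
        adj.entry((true, w)).or_default().push((false, f));
        adj.entry((false, f)).or_default().push((true, w));
    }
    let mut seen: HashSet<(bool, u32)> = HashSet::new();
    let mut best: HashSet<(bool, u32)> = HashSet::new();
    for &start in adj.keys() {
        if seen.contains(&start) {
            continue;
        }
        // BFS over one connected component.
        let mut comp = HashSet::new();
        let mut queue = VecDeque::from([start]);
        seen.insert(start);
        while let Some(node) = queue.pop_front() {
            comp.insert(node);
            for &next in &adj[&node] {
                if seen.insert(next) {
                    queue.push_back(next);
                }
            }
        }
        if comp.len() > best.len() {
            best = comp;
        }
    }
    // An observation survives only if both its worker AND firm are in the LCS.
    links
        .iter()
        .copied()
        .filter(|&(w, f)| best.contains(&(true, w)) && best.contains(&(false, f)))
        .collect()
}
```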
Deep Dive: Propensity Score Matching (PSM)
PSM estimates the Average Treatment Effect on the Treated (ATT) by matching treated units to control units with similar probabilities of treatment:
- Propensity Score: $e(x) = P(D=1|X=x)$, estimated via Logistic Regression.
- Matching: Nearest Neighbor matching on the logit of the propensity score.
- Balance: Ensures that the distribution of covariates is similar between treated and matched control groups.
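A minimal sketch of the matching step (plain Rust, not the library's API; `nearest_neighbor` is a hypothetical helper), assuming the propensity scores have already been fitted:

```rust
// Log-odds transform of a propensity score.
fn logit(p: f64) -> f64 {
    (p / (1.0 - p)).ln()
}

// For each treated unit's propensity score, find the index of the control
// unit whose logit-propensity is nearest (1-NN matching with replacement).
fn nearest_neighbor(treated: &[f64], control: &[f64]) -> Vec<usize> {
    treated
        .iter()
        .map(|&pt| {
            let lt = logit(pt);
            control
                .iter()
                .enumerate()
                .min_by(|(_, a), (_, b)| {
                    (logit(**a) - lt)
                        .abs()
                        .partial_cmp(&(logit(**b) - lt).abs())
                        .unwrap()
                })
                .map(|(i, _)| i)
                .expect("control group must be non-empty")
        })
        .collect()
}
```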
Deep Dive: DFL Reweighting
DiNardo, Fortin, and Lemieux (1996) proposed a non-parametric method to decompose the entire distribution of wages. It constructs a counterfactual density for Group B (e.g., women) as if they had the characteristics of Group A (e.g., men) by applying a reweighting factor $\Psi(x)$:

$$\Psi(x) = \frac{P(A \mid x)}{1 - P(A \mid x)} \cdot \frac{1 - P(A)}{P(A)}$$

- $P(A|x)$: Probability of belonging to Group A given characteristics $x$ (estimated via Probit/Logit).
- $\Psi(x)$: The weight applied to each observation in Group B.
This allows for visual comparison of the "explained" gap across the entire distribution (e.g., via Kernel Density Estimation).
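A minimal sketch of the weight computation (plain Rust, not the library's API; `dfl_weights` is a hypothetical helper), assuming $P(A|x)$ has already been estimated for each Group B observation:

```rust
// DFL weight for each Group B observation, given the fitted probability
// P(A|x) for that observation and the unconditional share p_a of Group A.
fn dfl_weights(p_a_given_x: &[f64], p_a: f64) -> Vec<f64> {
    p_a_given_x
        .iter()
        .map(|&p| (p / (1.0 - p)) * ((1.0 - p_a) / p_a))
        .collect()
}
```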
Deep Dive: Heckman Correction (Selection Bias)
Wage data is often observed only for those who are employed. If employment status is not random—for instance, if highly educated people are more likely to work—standard OLS decomposition will be biased (Selection Bias).
The Heckman Two-Step Correction addresses this:
- Selection Equation (Probit): Estimates the probability of employment ($Z\gamma$) and calculates the Inverse Mills Ratio $\lambda = \phi(Z\gamma)/\Phi(Z\gamma)$.
- Outcome Equation (OLS): Includes $\lambda$ as an additional regressor to correct for the selection bias.
Rust Example
```rust
// Requires a selection variable (e.g., "employed") and predictors
builder.heckman_selection("employed", &["age", "education", "marital_status"]);
```

CLI Example

```bash
oaxaca-cli --data wage.csv --outcome wage --group gender --reference F \
    --predictors education experience \
    --selection-outcome employed \
    --selection-predictors age education marital_status
```

This project is licensed under the MIT License.