ENH: Add plot_overlap_common_support() to DoubleMLIRM #389

Open
akihiroshimoda wants to merge 2 commits into DoubleML:main from akihiroshimoda:feature/add-overlap-diagnostic-plot

@akihiroshimoda

Description

Adds a new plot_overlap_common_support() method to DoubleMLIRM that visualizes the distribution of the estimated propensity scores $\hat{m}_0(X) = \hat{E}[D|X]$, shown separately for the treated and control groups.

This is a diagnostic tool for assessing the positivity (overlap) assumption, which is critical for the validity of IPW-based estimators. When propensity scores cluster near 0 or 1, the inverse probability weights become extreme, leading to inflated variance of the treatment effect estimate.
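
To make the variance argument concrete, here is a minimal sketch of how inverse probability weights behave near the boundaries. The helper name `ipw_weights` and the example scores are illustrative assumptions, not code from this PR.

```python
import numpy as np


def ipw_weights(propensity_score, treatment):
    """Inverse probability weights: 1/m(X) for treated, 1/(1 - m(X)) for controls."""
    ps = np.asarray(propensity_score, dtype=float)
    d = np.asarray(treatment, dtype=int)
    return np.where(d == 1, 1.0 / ps, 1.0 / (1.0 - ps))


# Illustrative scores only: two central values and two close to the boundaries.
ps = np.array([0.50, 0.40, 0.01, 0.99])
d = np.array([1, 0, 1, 0])
weights = ipw_weights(ps, d)
print(weights)
```

The treated unit with a score of 0.01 and the control unit with a score of 0.99 each receive a weight of roughly 100, so a handful of such observations can dominate the estimate.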

Changes made:
  • Interactive plotly visualization with KDE curves for treated and control groups (consistent with the existing sensitivity_plot() API).
  • Positivity danger zones: shaded regions and dashed threshold lines near 0 and 1.
  • Built-in diagnostics annotation: displays the percentage of observations in violation zones.
  • Automatic UserWarning when >5% of observations have propensity scores outside the safe range, with actionable guidance.
  • Added comprehensive unit tests in test_irm_overlap_plot.py.
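
The warning rule can be sketched roughly as follows. This is not the code from the PR: the function name `check_overlap`, the thresholds 0.05 and 0.95, and the wording of the message are assumptions chosen for illustration; only the 5% rule and the UserWarning category come from the description above.

```python
import warnings

import numpy as np


def check_overlap(propensity_score, lower=0.05, upper=0.95, max_share=0.05):
    """Return the share of scores outside [lower, upper]; warn if it exceeds max_share.

    lower and upper are hypothetical defaults; max_share mirrors the 5% rule.
    """
    ps = np.asarray(propensity_score, dtype=float)
    share_outside = float(np.mean((ps < lower) | (ps > upper)))
    if share_outside > max_share:
        warnings.warn(
            f"{share_outside:.1%} of propensity scores lie outside "
            f"[{lower}, {upper}]; consider trimming or revisiting the propensity model.",
            UserWarning,
        )
    return share_outside


# Illustrative scores: three of ten fall outside the assumed safe range.
scores = np.array([0.01, 0.02, 0.30, 0.50, 0.60, 0.70, 0.80, 0.97, 0.40, 0.55])
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    share = check_overlap(scores)
```

Returning the share alongside the warning lets callers reuse the number, for example in a plot annotation.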

Reference to Issues or PRs

None

Comments

Here is an example of the generated plot:
[Image: overlap_plot_screenshot]

PR Checklist

Please fill out this PR checklist (see our contributing guidelines for details).

  • The title of the pull request summarizes the changes made.
  • The PR contains a detailed description of all changes and additions.
  • References to related issues or PRs are added.
  • The code passes all (unit) tests.
  • Enhancements or new features are equipped with unit tests.
  • The changes adhere to the PEP8 standards.

@SvenKlaassen
Member

Thank you very much for this.
I have some general comments:

  • In general, I think histograms are a more robust visualization than densities.
  • Since propensity scores are common in many other models, I think a simple utils function that takes propensity_score and treatment as input would be more generally helpful.
  • I would also add some type of calibration plot (I have added a suggestion below); this can also help to evaluate the propensity score fit.

Details on Proposed changes

  • Replace the IRM-specific public method with a generic public plotting entry point in doubleml.utils.
  • Add a new public module doubleml/utils/plots.py.
  • Export the public plotting functions from doubleml/utils/__init__.py.
  • Prefer array-based plotting functions (handling only the single-treatment design) whose signature is based on:
    • propensity_score
    • treatment
    • optional plotting arguments such as bins and density
  • Use histogram-based diagnostics instead of KDE:
    • more robust on bounded support [0, 1]
    • easier to interpret when scores are clipped
  • Move tests mainly to doubleml/utils/tests:
    • input validation
    • bin handling
    • boundary values at 0 and 1
    • empty-bin behavior
    • return type and basic plot structure
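
The point about bounded support can be illustrated with a small numerical sketch. It is not part of the proposal: the helper `gaussian_kde_mass_inside`, the fixed bandwidth of 0.05, and the simulated scores are assumptions. It computes how much probability mass a plain Gaussian KDE places inside [0, 1] when scores pile up near 1, and compares that with a histogram on fixed edges.

```python
import math

import numpy as np


def gaussian_kde_mass_inside(samples, bandwidth, low=0.0, high=1.0):
    """Probability mass that a Gaussian KDE places inside [low, high]."""
    x = np.asarray(samples, dtype=float)

    def norm_cdf(z):
        # Standard normal CDF via the error function.
        return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

    # Each sample contributes one Gaussian kernel; average the mass
    # that each kernel keeps inside the interval.
    per_kernel = [
        norm_cdf((high - xi) / bandwidth) - norm_cdf((low - xi) / bandwidth)
        for xi in x
    ]
    return float(np.mean(per_kernel))


rng = np.random.default_rng(0)
# Simulated scores concentrated near 1; purely illustrative data.
scores = rng.beta(8, 1, size=500)

kde_mass = gaussian_kde_mass_inside(scores, bandwidth=0.05)
counts, edges = np.histogram(scores, bins=np.linspace(0.0, 1.0, 11))
hist_mass = counts.sum() / scores.size

print(f"KDE mass inside [0, 1]: {kde_mass:.3f}")
print(f"Histogram mass inside [0, 1]: {hist_mass:.3f}")
```

The KDE leaks part of its mass beyond 1, which understates the density exactly where overlap problems occur, whereas the histogram on [0, 1] accounts for every observation.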

Suggested public API

  • doubleml.utils.plot_propensity_score_calibration

calibration plot sketch

import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns


def plot_propensity_score_calibration(
    propensity_score,
    treatment,
    bins=10,
    density=False,
    palette="colorblind",
):
    """
    Plot propensity score distributions and binned calibration curves.

    Parameters
    ----------
    propensity_score : array-like
        Predicted propensity scores of shape (n_samples,).
    treatment : array-like
        Binary treatment indicator of shape (n_samples,).
    bins : int or array-like
        Number of bins or explicit bin edges.
    density : bool
        If True, histogram heights are normalized.
    palette : str or sequence
        Seaborn palette name or explicit colors.

    Returns
    -------
    fig, axes
        Matplotlib figure and 2x2 axes array.
    """
    ps = np.asarray(propensity_score, dtype=float).reshape(-1)
    tr = np.asarray(treatment).reshape(-1)

    if ps.shape != tr.shape:
        raise ValueError("propensity_score and treatment must have the same shape.")
    if ps.ndim != 1:
        raise ValueError("propensity_score and treatment must be one-dimensional.")
    if not np.isin(tr, [0, 1]).all():
        raise ValueError("treatment must be binary with values 0 and 1.")
    if not np.all((ps >= 0) & (ps <= 1)):  # the comparison is False for NaN, so NaN is rejected too
        raise ValueError("propensity_score must lie in [0, 1].")

    tr = tr.astype(int)

    if isinstance(bins, int):
        if bins < 2:
            raise ValueError("bins must be at least 2.")
        bins = np.linspace(0.0, 1.0, bins + 1)
    else:
        bins = np.asarray(bins, dtype=float)
        if bins.ndim != 1 or len(bins) < 2:
            raise ValueError("bins must contain at least two edges.")
        if np.any(np.diff(bins) <= 0):
            raise ValueError("bins must be strictly increasing.")

    x_min, x_max = float(bins[0]), float(bins[-1])
    centers = 0.5 * (bins[:-1] + bins[1:])
    widths = np.diff(bins)

    treated_frac = []
    control_frac = []

    for i in range(len(bins) - 1):
        if i < len(bins) - 2:
            mask = (ps >= bins[i]) & (ps < bins[i + 1])
        else:
            mask = (ps >= bins[i]) & (ps <= bins[i + 1])

        if np.sum(mask) == 0:
            treated_frac.append(np.nan)
            control_frac.append(np.nan)
        else:
            p_treated = np.mean(tr[mask] == 1)
            treated_frac.append(p_treated)
            control_frac.append(1.0 - p_treated)

    colors = sns.color_palette(palette, n_colors=2)
    fig, axes = plt.subplots(2, 2, figsize=(12, 10), gridspec_kw={"height_ratios": [2, 1]})

    sns.histplot(
        ps[tr == 1],
        bins=bins,
        stat="density" if density else "count",
        kde=False,
        color=colors[0],
        ax=axes[0, 0],
        label="Treated",
    )
    axes[0, 0].set_title("Treated: Propensity Score Distribution")
    axes[0, 0].set_xlim(x_min, x_max)
    axes[0, 0].set_ylabel("Density" if density else "Count")
    axes[0, 0].legend()

    sns.histplot(
        ps[tr == 0],
        bins=bins,
        stat="density" if density else "count",
        kde=False,
        color=colors[1],
        ax=axes[0, 1],
        label="Control",
    )
    axes[0, 1].set_title("Control: Propensity Score Distribution")
    axes[0, 1].set_xlim(x_min, x_max)
    axes[0, 1].set_ylabel("Density" if density else "Count")
    axes[0, 1].legend()

    axes[1, 0].bar(centers, treated_frac, width=widths, color=colors[0], alpha=0.7)
    axes[1, 0].plot([x_min, x_max], [x_min, x_max], "k--", label="Ideal calibration")
    axes[1, 0].set_title("Treated: Calibration")
    axes[1, 0].set_xlabel("Predicted propensity score")
    axes[1, 0].set_ylabel("Observed treatment fraction")
    axes[1, 0].set_xlim(x_min, x_max)
    axes[1, 0].set_ylim(0, 1)
    axes[1, 0].legend()

    axes[1, 1].bar(centers, control_frac, width=widths, color=colors[1], alpha=0.7)
    axes[1, 1].plot([x_min, x_max], [1 - x_min, 1 - x_max], "k--", label="Ideal calibration")
    axes[1, 1].set_title("Control: Calibration")
    axes[1, 1].set_xlabel("Predicted propensity score")
    axes[1, 1].set_ylabel("Observed control fraction")
    axes[1, 1].set_xlim(x_min, x_max)
    axes[1, 1].set_ylim(0, 1)
    axes[1, 1].legend()

    fig.suptitle("Propensity Score Calibration")
    plt.tight_layout()
    return fig, axes
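
As a possible simplification of the binning loop, the per-bin treated fraction could also be computed with np.histogram, which uses the same convention (half-open bins, with the last bin closed on the right). This is a sketch under that assumption; the helper name `binned_treated_fraction` is hypothetical and does not appear in the sketch above.

```python
import numpy as np


def binned_treated_fraction(propensity_score, treatment, bins):
    """Observed treated fraction per bin; NaN for bins without observations."""
    ps = np.asarray(propensity_score, dtype=float)
    tr = np.asarray(treatment, dtype=int)
    edges = np.asarray(bins, dtype=float)

    # Count all observations and treated observations per bin.
    total, _ = np.histogram(ps, bins=edges)
    treated, _ = np.histogram(ps[tr == 1], bins=edges)

    # Divide only where the bin is non-empty; empty bins stay NaN.
    frac = np.full(total.shape, np.nan)
    np.divide(treated, total, out=frac, where=total > 0)
    return frac


# Illustrative data: one populated bin at each end, three empty bins between.
ps = np.array([0.05, 0.15, 0.15, 0.95, 1.0])
tr = np.array([0, 1, 0, 1, 1])
frac = binned_treated_fraction(ps, tr, np.linspace(0.0, 1.0, 6))
print(frac)
```

The control fraction then follows as `1.0 - frac`, with NaN propagating for empty bins as in the loop version.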

- Replace KDE overlap plot with histogram-based calibration plot
- Generic array-based API: propensity_score, treatment, bins, density, palette
- 2x2 matplotlib figure: histograms + binned calibration curves
- Move tests to doubleml/utils/tests/
- Address review feedback from PR DoubleML#389
@akihiroshimoda
Author

Thank you for the detailed feedback, @SvenKlaassen! I've addressed all your suggestions in the latest commit:

1. Histograms instead of KDE densities

  • Replaced the plotly KDE-based visualization with matplotlib/seaborn histograms, which are more robust on the bounded [0, 1] support.

2. Generic utility function

  • Extracted the plotting logic into a standalone public function doubleml.utils.plots.plot_propensity_score_calibration(propensity_score, treatment, bins, density, palette).
  • This accepts raw arrays directly, making it reusable across any model that produces propensity scores.
  • Exported from doubleml/utils/__init__.py.

3. Calibration plot

  • Implemented the 2×2 figure layout you sketched: histograms (top row) + binned calibration curves (bottom row) with ideal calibration reference lines.

4. Tests moved to doubleml/utils/tests/

  • Added comprehensive tests covering: input validation, bin handling, boundary values (0 and 1), empty-bin behavior, return type, and plot structure.
  • Removed the old IRM-specific test file.

The DoubleMLIRM.plot_overlap_common_support() method now delegates to this utility function, keeping the model-level convenience of extracting predictions automatically.
Here is an example of the updated plot:
[Image: calibration_plot]

All 19 tests pass. Looking forward to your review!

@SvenKlaassen
Member

Thank you.

Could you first fix the minor issues identified by Codacy?
Most of them concern formatting or unused code elements.
Further, I would suggest renaming the propensity score method in the IRM class accordingly, so that it does not focus solely on overlap, e.g. plot_propensity_score.
