Add pytorch deepiv implementation #87

Status: Draft; wants to merge 1 commit into main

10 changes: 5 additions & 5 deletions .github/workflows/ci.yml
@@ -118,7 +118,7 @@ jobs:
kind: [except-customer-scenarios, customer-scenarios]
include:
- kind: "except-customer-scenarios"
extras: "[plt,ray]"
extras: "[nn,plt,ray]"
pattern: "(?!CustomerScenarios)"
install_graphviz: true
version: '3.12'
@@ -223,16 +223,16 @@ jobs:
extras: ""
- kind: other
opts: '-m "cate_api and not ray" -n auto'
extras: "[plt]"
extras: "[nn,plt]"
- kind: dml
opts: '-m "dml and not ray"'
extras: "[plt]"
extras: "[nn,plt]"
- kind: main
opts: '-m "not (notebook or automl or dml or serial or cate_api or treatment_featurization or ray)" -n 2'
extras: "[plt,dowhy]"
extras: "[nn,plt,dowhy]"
- kind: treatment
opts: '-m "treatment_featurization and not ray" -n auto'
extras: "[plt]"
extras: "[nn,plt]"
- kind: ray
opts: '-m "ray"'
extras: "[ray]"
30 changes: 30 additions & 0 deletions README.md
@@ -415,6 +415,36 @@ lb, ub = est.effect_interval(X_test, alpha=0.05) # OLS confidence intervals
```
</details>

<details>
<summary>Deep Instrumental Variables (click to expand)</summary>

```Python
import keras
from econml.iv.nnet import DeepIV

treatment_model = keras.Sequential([keras.layers.Dense(128, activation='relu', input_shape=(2,)),
keras.layers.Dropout(0.17),
keras.layers.Dense(64, activation='relu'),
keras.layers.Dropout(0.17),
keras.layers.Dense(32, activation='relu'),
keras.layers.Dropout(0.17)])
response_model = keras.Sequential([keras.layers.Dense(128, activation='relu', input_shape=(2,)),
keras.layers.Dropout(0.17),
keras.layers.Dense(64, activation='relu'),
keras.layers.Dropout(0.17),
keras.layers.Dense(32, activation='relu'),
keras.layers.Dropout(0.17),
keras.layers.Dense(1)])
est = DeepIV(n_components=10, # Number of Gaussians in the mixture density network
m=lambda z, x: treatment_model(keras.layers.concatenate([z, x])), # Treatment model
h=lambda t, x: response_model(keras.layers.concatenate([t, x])), # Response model
n_samples=1 # Number of samples used to estimate the response
)
est.fit(Y, T, X=X, Z=Z) # Z -> instrumental variables
treatment_effects = est.effect(X_test)
```
</details>

See the <a href="#references">References</a> section for more details.

### Interpretability
4 changes: 4 additions & 0 deletions doc/map.svg
(binary SVG change; diff not rendered)
10 changes: 10 additions & 0 deletions doc/reference.rst
@@ -86,6 +86,16 @@ Doubly Robust (DR) IV
econml.iv.dr.IntentToTreatDRIV
econml.iv.dr.LinearIntentToTreatDRIV

.. _deepiv_api:

DeepIV
^^^^^^

.. autosummary::
:toctree: _autosummary

econml.iv.nnet.DeepIV

.. _tsls_api:

Sieve Methods
2 changes: 2 additions & 0 deletions doc/spec/comparison.rst
@@ -9,6 +9,8 @@ Detailed estimator comparison
+=============================================+==============+==============+==================+=============+=================+============+==============+====================+
| :class:`.SieveTSLS` | Any | Yes | | Yes | Assumed | Yes | Yes | |
+---------------------------------------------+--------------+--------------+------------------+-------------+-----------------+------------+--------------+--------------------+
| :class:`.DeepIV` | Any | Yes | | | | Yes | Yes | |
+---------------------------------------------+--------------+--------------+------------------+-------------+-----------------+------------+--------------+--------------------+
| :class:`.SparseLinearDML` | Any | | Yes | Yes | Assumed | Yes | Yes | Yes |
+---------------------------------------------+--------------+--------------+------------------+-------------+-----------------+------------+--------------+--------------------+
| :class:`.SparseLinearDRLearner` | Categorical | | Yes | | Projected | | Yes | Yes |
86 changes: 86 additions & 0 deletions doc/spec/estimation/deepiv.rst
@@ -0,0 +1,86 @@
Deep Instrumental Variables
===========================

Instrumental variables (IV) methods are an approach for estimating causal effects despite the presence of confounding latent variables.
The assumptions made are weaker than the unconfoundedness assumption needed in DML.
The cost is that when unconfoundedness holds, IV estimators will be less efficient than DML estimators.
What is required is a vector of instruments :math:`Z`, assumed to causally affect the distribution of the treatment :math:`T`,
and to have no direct causal effect on the expected value of the outcome :math:`Y`. The package offers two IV methods for
estimating heterogeneous treatment effects: deep instrumental variables [Hartford2017]_
and the two-stage basis expansion approach of [Newey2003]_.

The setup of the model is as follows:

.. math::

Y = g(T, X, W) + \epsilon

where :math:`\E[\epsilon|X,W,Z] = h(X,W)`, so that the expected value of :math:`Y` depends only on :math:`(T,X,W)`.
This is known as the *exclusion restriction*.
We assume that the conditional distribution :math:`F(T|X,W,Z)` varies with :math:`Z`.
This is known as the *relevance condition*.
We want to learn the heterogeneous treatment effects:

.. math::

\tau(\vec{t}_0, \vec{t}_1, \vec{x}) = \E[g(\vec{t}_1,\vec{x},W) - g(\vec{t}_0,\vec{x},W)]

where the expectation is taken with respect to the conditional distribution of :math:`W|\vec{x}`.
If the function :math:`g` is truly non-parametric, then in the special case where :math:`T`, :math:`Z` and :math:`X` are discrete,
the probability matrix giving the distribution of :math:`T` for each value of :math:`Z` needs to be invertible pointwise at :math:`\vec{x}`
in order for this quantity to be identified for arbitrary :math:`\vec{t}_0` and :math:`\vec{t}_1`.
In practice though we will place some parametric structure on the function :math:`g` which will make learning easier.
In deep IV, this takes the form of assuming :math:`g` is a neural net with a given architecture; in the sieve based approaches,
this amounts to assuming that :math:`g` is a weighted sum of a fixed set of basis functions. [1]_

As explained in [Hartford2017]_, the Deep IV module learns the heterogeneous causal effects by minimizing the "reduced-form" prediction error:

.. math::

\hat{g}(T,X,W) \equiv \argmin_{g \in \mathcal{G}} \sum_i \left(y_i - \int g(t,x_i,w_i)\,dF(t|x_i,w_i,z_i)\right)^2

where the hypothesis class :math:`\mathcal{G}` is a set of neural nets with a given architecture.
The distribution :math:`F(T|x_i,w_i,z_i)` is unknown and so to make the objective feasible it must be replaced by an estimate
:math:`\hat{F}(T|x_i,w_i,z_i)`.
This estimate is obtained by modeling :math:`F` as a mixture of normal distributions, where the parameters of the mixture model are
the output of a "first-stage" neural net whose inputs are :math:`(x_i,w_i,z_i)`.
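
As a concrete illustration, a minimal mixture density network of this kind might look as follows in PyTorch. This is a
hedged sketch only: the class name ``MixtureDensityNet``, the architecture, and the assumption of a scalar treatment are
illustrative and are not taken from this PR's code.

.. code-block:: python

    import torch
    import torch.nn as nn

    class MixtureDensityNet(nn.Module):
        """Map first-stage features (x, w, z) to the parameters of a
        K-component Gaussian mixture over a scalar treatment t."""

        def __init__(self, d_in, n_components, d_hidden=64):
            super().__init__()
            self.backbone = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU())
            self.pi_logits = nn.Linear(d_hidden, n_components)   # mixture weights (logits)
            self.mu = nn.Linear(d_hidden, n_components)          # component means
            self.log_sigma = nn.Linear(d_hidden, n_components)   # component log-scales

        def forward(self, feats):
            h = self.backbone(feats)
            return self.pi_logits(h), self.mu(h), self.log_sigma(h)

    def mdn_nll(pi_logits, mu, log_sigma, t):
        """Negative log-likelihood of observed treatments t under the predicted
        mixture; minimizing this by SGD is the first-stage objective."""
        log_pi = torch.log_softmax(pi_logits, dim=-1)
        comp = torch.distributions.Normal(mu, log_sigma.exp())
        # log p(t) = logsumexp_k [ log pi_k + log N(t | mu_k, sigma_k) ]
        return -torch.logsumexp(log_pi + comp.log_prob(t.unsqueeze(-1)), dim=-1).mean()
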
Optimization of the "first-stage" neural net is done by stochastic gradient descent on the (mixture-of-normals)
negative log-likelihood, while optimization of the "second-stage" model for the treatment effects is done by
stochastic gradient descent, with three different options for the loss (sketched in code after the list):

* Estimating the two integrals that appear in the true gradient by averages over independent sets of samples,
  which yields an unbiased estimate of the gradient.
* Using the modified objective function

.. math::

\sum_i \sum_d \left(y_i - g(t_d,x_i,w_i)\right)^2

where :math:`t_d \sim \hat{F}(t|x_i,w_i,z_i)` are draws from the estimated first-stage neural net. This modified
objective function is not guaranteed to lead to consistent estimates of :math:`g`, but has the advantage of requiring
only a single set of samples from the distribution, and can be interpreted as regularizing the loss with a
variance penalty. [2]_
* Using a single set of samples to compute the gradient of the loss; this will only be an unbiased estimate of the
gradient in the limit as the number of samples goes to infinity.
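
The following hedged sketch shows how the three loss variants could be written in PyTorch; it assumes the illustrative
``MixtureDensityNet`` above, a response network ``g(t, x)``, and a scalar treatment, and is not the code added by this PR:

.. code-block:: python

    import torch

    def sample_treatments(mdn, feats, n_samples):
        """Draw treatment samples from the estimated first stage F_hat(t|x, w, z)."""
        pi_logits, mu, log_sigma = mdn(feats)
        mix = torch.distributions.MixtureSameFamily(
            torch.distributions.Categorical(logits=pi_logits),
            torch.distributions.Normal(mu, log_sigma.exp()))
        return mix.sample((n_samples,))  # shape (n_samples, batch)

    # Option 1: two independent samples t1, t2 give an unbiased estimate
    # -2 (y - g(t1)) * d g(t2)/d theta of the true gradient; detaching the
    # first factor makes the surrogate's gradient equal exactly that estimate.
    def unbiased_surrogate(g, y, x, t1, t2):
        return (-2.0 * (y - g(t1, x)).detach() * g(t2, x)).mean()

    # Option 2: the modified objective sum_d (y - g(t_d, x))^2, i.e. the true
    # loss plus a variance penalty (see footnote 2).
    def variance_penalized_loss(g, y, x, t_draws):
        return torch.stack([(y - g(t, x)) ** 2 for t in t_draws]).mean()

    # Option 3: plug the sample average of g into the squared loss; its
    # gradient is biased for any finite number of samples.
    def plugin_loss(g, y, x, t_draws):
        g_bar = torch.stack([g(t, x) for t in t_draws]).mean(dim=0)
        return ((y - g_bar) ** 2).mean()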

Training proceeds by splitting the data into a training set and a test set; training is stopped when test-set performance
(on the reduced-form prediction error) starts to degrade.

The output is an estimated function :math:`\hat{g}`. To obtain an estimate of :math:`\tau`, we difference the estimated
function at :math:`\vec{t}_1` and :math:`\vec{t}_0`, replacing the expectation with the empirical average over all
observations with the specified :math:`\vec{x}`.
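
In terms of the package API, this differencing is what the estimator's ``effect`` method computes; for example (assuming
a fitted ``est`` as in the README snippet and treatment values ``t0`` and ``t1``):

.. code-block:: python

    # average of g_hat(t1, x, w) - g_hat(t0, x, w) over the observed data
    treatment_effects = est.effect(X_test, T0=t0, T1=t1)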


.. rubric:: Footnotes

.. [1]
   Asymptotic arguments about non-parametric consistency require that the neural net architecture (respectively, the set
   of basis functions) be allowed to grow at some rate so that arbitrary functions can be approximated, but this will not
   be our concern here.
.. [2]
   Writing :math:`\hat{F}` as shorthand for :math:`\hat{F}(t|x_i,w_i,z_i)`:

   .. math::

      & \int \left(y_i - g(t,x_i,w_i)\right)^2 d\hat{F} \\
      =~& y_i^2 - 2 y_i \int g(t,x_i,w_i)\,d\hat{F} + \int g(t,x_i,w_i)^2\,d\hat{F} \\
      =~& y_i^2 - 2 y_i \int g(t,x_i,w_i)\,d\hat{F} + \left(\int g(t,x_i,w_i)\,d\hat{F}\right)^2 + \int g(t,x_i,w_i)^2\,d\hat{F} - \left(\int g(t,x_i,w_i)\,d\hat{F}\right)^2 \\
      =~& \left(y_i - \int g(t,x_i,w_i)\,d\hat{F}\right)^2 + \left(\int g(t,x_i,w_i)^2\,d\hat{F} - \left(\int g(t,x_i,w_i)\,d\hat{F}\right)^2\right) \\
      =~& \left(y_i - \int g(t,x_i,w_i)\,d\hat{F}\right)^2 + \Var_t\, g(t,x_i,w_i)
1 change: 1 addition & 0 deletions doc/spec/estimation_iv.rst
@@ -14,5 +14,6 @@ of [Newey2003]_.
.. toctree::
:maxdepth: 2

estimation/deepiv.rst
estimation/two_sls.rst
estimation/orthoiv.rst
2 changes: 1 addition & 1 deletion econml/iv/__init__.py
@@ -1,4 +1,4 @@
# Copyright (c) PyWhy contributors. All rights reserved.
# Licensed under the MIT License.

__all__ = ["dml", "dr", "sieve"]
__all__ = ["dml", "dr", "nnet", "sieve"]
6 changes: 6 additions & 0 deletions econml/iv/nnet/__init__.py
@@ -0,0 +1,6 @@
# Copyright (c) PyWhy contributors. All rights reserved.
# Licensed under the MIT License.

from ._deepiv import DeepIV, MixtureOfGaussiansModule

__all__ = ["DeepIV, MixtureOfGaussiansModule"]