GH-16583: Access GLM Variance-Covariance Matrix with vcov #16586
manh4wk wants to merge 10 commits into h2oai:master from vcov

Conversation
@tomasfryda Do you know if there's anything else I can do right now to see whether this passes all the tests? I saw you mention in another thread that the team is pretty busy at the moment.
Hi @maurever or @valenad1, can one of you take a look at this? Having the GLM's variance-covariance matrix available lets us do things like run a Wald test on two levels of a categorical variable to see whether they are statistically different or should be combined into a single category.
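As a quick illustration of the use case mentioned above, here is a minimal plain-Python sketch of a Wald test comparing two coefficient estimates. The numbers are hypothetical, not H2O output; the point is that the statistic needs exactly the variances and the covariance that the vcov matrix exposes:

```python
# Hypothetical coefficient estimates for two levels of a categorical variable.
b1, b2 = 0.80, 0.65

# Entries that would come from the model's variance-covariance matrix
# (made-up values for illustration).
var_b1, var_b2, cov_b12 = 0.010, 0.012, 0.004

# Wald statistic for H0: b1 == b2 (chi-square with 1 degree of freedom).
wald = (b1 - b2) ** 2 / (var_b1 + var_b2 - 2 * cov_b12)
print(round(wald, 4))
```

Without the off-diagonal covariance term, this test cannot be computed from the coefficients table alone, which is what motivates exposing the full matrix.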
tomasfryda left a comment
I think it's a good idea to expose the variance-covariance matrix, but that probably depends on @valenad1's decision.
If he agrees, I would suggest fixing the R tests and making sure the column names and row names match: currently the column names are always lowercased (IIRC this can be caused by TwoDimTableV3, so I would consider choosing a different data structure, e.g. H2OFrame), while the row names aren't.
For example, Intercept vs intercept:

Note that I didn't do a complete review; I just looked at the R part of the PR.
```r
manualYear <- mFV@model$coefficients_table$year

# compare values from model and obtained manually
for (ind in c(1:length(manuelPValues)))
```
manuelPValues doesn't seem to be defined anywhere. Also, I would recommend using seq_along(x) instead of 1:length(x); when x is empty, the latter produces c(1, 0).
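The pitfall behind this recommendation is that R's colon operator counts downward when its left bound exceeds its right bound, so with an empty x, 1:length(x) evaluates to 1:0, i.e. c(1, 0). A small Python sketch that emulates R's colon semantics (the r_colon helper is hypothetical, purely for illustration) shows the difference:

```python
def r_colon(a, b):
    """Emulate R's a:b, which counts down when a > b."""
    step = 1 if b >= a else -1
    return list(range(a, b + step, step))

x = []                        # an empty vector
bad = r_colon(1, len(x))      # like 1:length(x) -> [1, 0]: two bogus indices
good = list(range(len(x)))    # like seq_along(x) -> []: loop body never runs
print(bad, good)
```

With 1:length(x), a loop over an empty vector would still execute twice with invalid indices, whereas seq_along(x) correctly yields nothing to iterate over.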
Thanks, I updated the R test; that one wasn't working. I also switched to seq_along as you recommended.
I do think the test would run faster if the data were pulled into a data frame and then compared, if that's allowed. Right now it goes through the matrix cell by cell, which took a little while when I ran it.
if that's allowed
It's allowed but I don't think it's necessary.
```r
doTest("GLM: make sure error is generated when a gbm model calls glm functions", testGBMvcov)
doTest("GLM: make sure error is generated when compute_p_values=FALSE", testGLMvcovcomputePValueFALSE)
doTest("GLM: test variance-covariance values", testGLMPValZValStdError)
```
I would prefer something like:

```r
doSuite("GLM: VCOV support", makeSuite(testGBMvcov, testGLMvcovcomputePValueFALSE, testGLMPValZValStdError))
```

Thanks, I changed it to doSuite.
There is no test that checks whether the implementation actually works. The tests only check that an error is thrown when vcov is used where it's unsupported.
The last R test was supposed to, but it looks like I must have pushed the wrong thing. It should now check that the covariance matrix returned by the function matches the one obtained by fetching the H2O frame by its reference name.
I thought a little about adding a test that the actual values are calculated correctly, but since the vcov matrix was already in the code and the p-values all rely on it, I assume that's already tested somewhere; I'm just making it accessible. I did verify that the values match what I got from statsmodels in Python, though. It looks like they're calculated from the Observed Information Matrix rather than the Expected Information Matrix, so that was a little tricky.
The last R test was supposed to, but it looks like I must have pushed the wrong thing.
No worries!
I thought a little bit about adding in a test that the actual values themselves were calculated correctly, but since the vcov matrix was already in the code, and the p-values all rely on it already, I assume that's already tested somewhere. I'm just making it accessible. I did do testing to make sure they matched what I got with the statsmodels in python though. It looks like they're calculated based on the Observed Information Matrix instead of the Expected Information Matrix, so it was a little tricky.
Thank you for being so vigilant! Unfortunately, the person who implemented this no longer works at H2O, so I can't easily ask (and I'm not familiar with this piece of code), but I hope we have the test somewhere.
Sorry it took so long, @tomasfryda. I got tied up with work for a bit, and then some silly scoping issues took me too long to figure out. I made the changes you recommended. The matrix is now returned as an H2OFrame, with the first column holding the variable names, and the casing of the variable names now matches between the columns and the rows. I think we also get more precision behind the scenes with the numbers in the H2OFrame, which I like.
tomasfryda left a comment
Everything appears to be working correctly, but a few minor changes are still needed. Thank you!
```java
double[][] vcov = impl.vcov();
double[][] vcov_reordered = new double[vcov.length][vcov.length];

long[] vcov_indices = new long[vcov.length];
for (int i = 0; i < vcov.length; ++i)
    vcov_indices[i] = i;

// move intercept from last row and column to first row and column to match reordering done with coefficients
vcov_reordered[0][0] = vcov[vcov.length - 1][vcov.length - 1];
for (int i = 1; i < vcov.length; ++i) {
    vcov_reordered[0][i] = vcov[vcov.length - 1][i - 1];
}
for (int i = 1; i < vcov.length; ++i) {
    vcov_reordered[i][0] = vcov[i - 1][vcov.length - 1];
}
for (int i = 1; i < vcov.length; ++i) {
    for (int j = 1; j < vcov.length; ++j) {
        vcov_reordered[i][j] = vcov[i - 1][j - 1];
    }
}

String[] vcov_colnames = ArrayUtils.append(new String[]{"Names"}, Arrays.copyOf(ns, ns.length));
Vec[] vec_arr = new Vec[vcov.length + 1]; // one extra vec for column names
Vec.VectorGroup group = new Vec.VectorGroup();
Key<Vec> vec_key = group.addVec();

// load vcov info into vec_arr
vec_arr[0] = Vec.makeVec(vcov_indices, ns, vec_key);
for (int i = 0; i < vcov.length; ++i) {
    vec_key = group.addVec();
    vec_arr[i + 1] = Vec.makeVec(vcov_reordered[i], vec_key);
}

Key<Frame> frameKey = Key.make();
Frame frame = new Frame(frameKey, vcov_colnames, vec_arr);
DKV.put(frameKey, frame);
vcov_table = new KeyV3.FrameKeyV3();
vcov_table.fillFromImpl(frameKey);
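The intercept reordering in the Java above can be sketched more compactly; a minimal Python version (hypothetical 3x3 values, nested lists standing in for double[][]) makes the row/column rotation easy to check by eye:

```python
def move_intercept_first(vcov):
    """Move the last row and column (the intercept) to the front,
    mirroring the reordering done for the GLM coefficients."""
    n = len(vcov)
    out = [[0.0] * n for _ in range(n)]
    out[0][0] = vcov[n - 1][n - 1]
    for i in range(1, n):
        out[0][i] = vcov[n - 1][i - 1]   # intercept row
        out[i][0] = vcov[i - 1][n - 1]   # intercept column
    for i in range(1, n):
        for j in range(1, n):
            out[i][j] = vcov[i - 1][j - 1]  # remaining block shifts by one
    return out

# Symmetric toy matrix with the intercept in the last position.
v = [[1, 2, 3],
     [2, 4, 5],
     [3, 5, 6]]
print(move_intercept_first(v))  # -> [[6, 3, 5], [3, 1, 2], [5, 2, 4]]
```

A useful property to check: since a variance-covariance matrix is symmetric, the reordered matrix must stay symmetric, which the toy output above does.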
Could you deterministically derive the frameKey from the model_id (e.g. model_id + "vcov_frame") and enclose the frame creation in a condition like if (DKV.getGet(frameKey) == null) { ... } so we don't create the frame every time?
Since the frame is in the DKV, this would otherwise lead to a memory leak.
I tried using model_id, but I got the error below about referencing it from a static context. I did find that the model id is saved in _training_metrics, though, and it can be accessed like this. Do you know if there's a better approach, or is this good enough?

```java
Key<Frame> frameKey = Key.make(impl._training_metrics.model()._key.toString() + "_vcov_frame");
```

```
h2o-3\h2o-algos\src\main\java\hex\schemas\GLMModelV3.java:302: error: non-static variable model_id cannot be referenced from a static context
        Key<Frame> frameKey = Key.make(model_id + "_vcov_frame");
                                       ^
```
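The reviewer's create-once suggestion is the usual get-or-create pattern against a key-value store. A minimal Python sketch (a dict standing in for H2O's DKV; all names here are hypothetical, for illustration only) shows why deriving the key deterministically from the model id lets repeated calls reuse one frame instead of leaking a new one each time:

```python
dkv = {}  # stands in for H2O's distributed key-value store

def get_or_create_vcov_frame(model_id, build):
    """Derive the frame key from the model id and build the frame only
    if it is not already stored, so repeated calls don't leak frames."""
    key = model_id + "_vcov_frame"
    if key not in dkv:          # analogous to DKV.getGet(frameKey) == null
        dkv[key] = build()
    return dkv[key]

calls = []
build = lambda: calls.append(1) or [[0.5]]  # counts how often we build
frame1 = get_or_create_vcov_frame("glm_1", build)
frame2 = get_or_create_vcov_frame("glm_1", build)
print(len(calls), frame1 is frame2)  # built once, same object returned
```

With a random Key.make() instead, every call to the schema would put a fresh frame into the store with no handle to remove the old one, which is exactly the leak the reviewer is pointing at.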
```python
def vcov(self):
    """
    Return an H2OFrame with the variance-covariance matrix for a GLM (requires ``compute_p_values=True``).
```
Suggested change:

```diff
-    Return an H2OFrame with the variance-covariance matrix for a GLM (requires ``compute_p_values=True``).
+    :returns: Return an H2OFrame with the variance-covariance matrix for a GLM (requires ``compute_p_values=True``).
```

Made the variance-covariance matrix for GLMs part of the model_output results so it's accessible from Python and R. The matrix is rearranged in `h2o-algos/src/main/java/hex/schemas/GLMModelV3.java` so that the Intercept is both the first row and the first column, similar to how it's done for the GLM coefficient results in the same area of the code.

This matrix is now accessible with the `glm_model_object.vcov()` function in Python and with `h2o.vcov(glm_model_object)` in R.

This change fixes #16583