Skip to content

GH-16583: Access GLM Variance-Covariance Matrix with vcov#16586

Open
manh4wk wants to merge 10 commits intoh2oai:masterfrom
manh4wk:gh-16583_vcov
Open

GH-16583: Access GLM Variance-Covariance Matrix with vcov#16586
manh4wk wants to merge 10 commits intoh2oai:masterfrom
manh4wk:gh-16583_vcov

Conversation

@manh4wk
Copy link

@manh4wk manh4wk commented Mar 6, 2025

Made the variance-covariance matrix for GLMs part of the model_output results so they're accessible by Python and R. The matrix is rearranged in h2o-algos/src/main/java/hex/schemas/GLMModelV3.java so that the Intercept is both the first row and the first column, similar to how it's done for the GLM coefficient results in the same area of the code.

This matrix is now accessible with the glm_model_object.vcov() function in Python and with h2o.vcov(glm_model_object) in R.

This change fixes #16583

@manh4wk
Copy link
Author

manh4wk commented Mar 18, 2025

@tomasfryda Do you know if there's anything else I can do right now to see if this passes all the tests, etc.? I saw you mentioned in another thread the team is pretty busy at the moment.

@tomasfryda
Copy link
Contributor

@manh4wk I don't think so but I'm no longer active in h2o-3 development so it'd be better to ask @valenad1 or @maurever . Personally, I would like to thank you for taking time and contributing to open-source but I have no idea when or if it will get merged.

@manh4wk
Copy link
Author

manh4wk commented Dec 29, 2025

Hi @maurever or @valenad1, Can one of you take a look at this? Having the GLM's variance-covariance matrix available will let us do things like run a Wald test on two different levels of a categorical variable to see if they should be treated as statistically different, or if they should be combined into a single category.

Copy link
Contributor

@tomasfryda tomasfryda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's a good idea to expose variance-covariance matrix but that probably depends on @valenad1's decision.
If he agrees, I would suggest fixing R tests and making sure column names and row names are the same - currently column names are always lowercased (IIRC this can be caused by the TwoDimTableV3 so I would consider choosing different data structure (e.g. H2OFrame).), row names aren't.
For example the Intercept vs intercept:
Screenshot 2026-01-02 at 13 41 54

Note that I didn't do complete review, I just looked at the R part of the PR.

manualYear <- mFV@model$coefficients_table$year

# compare values from model and obtained manually
for (ind in c(1:length(manuelPValues)))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

manuelPValues doesn't seem to be defined anywhere. Also, I would recommend to use seq_along(x) instead of 1:length(x) (when the x is empty, the latter will produce c(1, 0)).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I updated the R test, that one wasn't working. I also switched to using seq_along as you recommended.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do think the test would run faster if the data was pulled into a data frame or something and then run, if that's allowed. Right now it goes cell by cell through the matrix, which took a little bit of time when I was running it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if that's allowed

It's allowed but I don't think it's necessary.

Comment on lines +64 to +66
doTest("GLM: make sure error is generated when a gbm model calls glm functions", testGBMvcov)
doTest("GLM: make sure error is generated when compute_p_values=FALSE", testGLMvcovcomputePValueFALSE)
doTest("GLM: test variance-covariance values", testGLMPValZValStdError)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer something like:

doSuite("GLM: VCOV support", makeSuite(testGBMvcov, testGLMvcovcomputePValueFALSE, testGLMPValZValStdError))

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I changed it to doSuite.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no test that would test if the implementation is working. It just tests if it throws an error if used when unsupported.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The last R test was supposed to, but it looks like I must have pushed the wrong thing. It should now check if the covariance matrix returned by the function matches the one returned by going and grabbing the h2o frame by it's reference name.

I thought a little bit about adding in a test that the actual values themselves were calculated correctly, but since the vcov matrix was already in the code, and the p-values all rely on it already, I assume that's already tested somewhere. I'm just making it accessible. I did do testing to make sure they matched what I got with the statsmodels in python though. It looks like they're calculated based on the Observed Information Matrix instead of the Expected Information Matrix, so it was a little tricky.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The last R test was supposed to, but it looks like I must have pushed the wrong thing.

No worries!

I thought a little bit about adding in a test that the actual values themselves were calculated correctly, but since the vcov matrix was already in the code, and the p-values all rely on it already, I assume that's already tested somewhere. I'm just making it accessible. I did do testing to make sure they matched what I got with the statsmodels in python though. It looks like they're calculated based on the Observed Information Matrix instead of the Expected Information Matrix, so it was a little tricky.

Thank you for being so vigilant! Unfortunately, the person who implemented this is no longer working in H2O so I can't easily ask (and I'm not familiar with this piece of code) but I hope we have the test somewhere.

@manh4wk
Copy link
Author

manh4wk commented Feb 22, 2026

Sorry it took so long @tomasfryda, I got tied up with work for a bit, and then I had some silly scoping issues it took me too long to figure out.

I made the suggestions you recommended. The matrix is now returned as an H2OFrame, with the first column being the names of the variables. The casing of the variable names now matches between the columns and the rows, too. I think we get more precision behind the scenes with the numbers in the H2OFrame, which I also like.

image

Copy link
Contributor

@tomasfryda tomasfryda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everything appears to be working correctly but there are still some minor necessary changes. Thank you!

Comment on lines +301 to +339
double [][] vcov;
vcov = impl.vcov();
double [][] vcov_reordered = new double[vcov.length][vcov.length];

long[] vcov_indices = new long[vcov.length];
for (int i = 0; i < vcov.length; ++i)
vcov_indices[i] = i;

// move intercept from last row and column to first row and column to match reordering done with coefficients
vcov_reordered[0][0] = vcov[vcov.length - 1][vcov.length - 1];
for(int i = 1; i < vcov.length; ++i) {
vcov_reordered[0][i] = vcov[vcov.length-1][i-1];
}
for(int i = 1; i < vcov.length; ++i) {
vcov_reordered[i][0] = vcov[i-1][vcov.length-1];
}
for(int i = 1; i < vcov.length; ++i) {
for(int j = 1; j < vcov.length; ++j) {
vcov_reordered[i][j] = vcov[i - 1][j - 1];
}
}

String [] vcov_colnames = ArrayUtils.append(new String[]{"Names"},Arrays.copyOf(ns,ns.length));
Vec [] vec_arr = new Vec[vcov.length+1]; // one extra vec for column names
Vec.VectorGroup group = new Vec.VectorGroup();
Key<Vec> vec_key = group.addVec();

// load vcov info into vec_arr
vec_arr[0] = Vec.makeVec(vcov_indices, ns, vec_key);
for(int i = 0; i < vcov.length; ++i) {
vec_key = group.addVec();
vec_arr[i+1] = Vec.makeVec(vcov_reordered[i], vec_key);
}

Key<Frame> frameKey = Key.make();
Frame frame = new Frame(frameKey, vcov_colnames, vec_arr);
DKV.put(frameKey, frame);
vcov_table = new KeyV3.FrameKeyV3();
vcov_table.fillFromImpl(frameKey);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you deterministically derive the frameKey from the model_id (e.g. model_id + "vcov_frame"), and enclose the frame creation with a condition like if (DKV.getGet(frameKey) == null) {} so we don't create the frame every time?

Since the frame is in the DKV, this would otherwise lead to memory leak.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried using model_id, but I got the error below about being in a static context. I did find that the model id is saved in the _training_metrics though, and can be accessed like this. Do you know if there's a better approach, or is this good enough?
Key<Frame> frameKey = Key.make(impl._training_metrics.model()._key.toString() + "_vcov_frame");

h2o-3\h2o-algos\src\main\java\hex\schemas\GLMModelV3.java:302: error: non-static variable model_id cannot be referenced from a static context
Key frameKey = Key.make(model_id + "_vcov_frame");
^


def vcov(self):
"""
Return an H2OFrame with the variance-covariance matrix for a GLM (requires ``compute_p_values=True``).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Return an H2OFrame with the variance-covariance matrix for a GLM (requires ``compute_p_values=True``).
:returns: Return an H2OFrame with the variance-covariance matrix for a GLM (requires ``compute_p_values=True``).

manualYear <- mFV@model$coefficients_table$year

# compare values from model and obtained manually
for (ind in c(1:length(manuelPValues)))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if that's allowed

It's allowed but I don't think it's necessary.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The last R test was supposed to, but it looks like I must have pushed the wrong thing.

No worries!

I thought a little bit about adding in a test that the actual values themselves were calculated correctly, but since the vcov matrix was already in the code, and the p-values all rely on it already, I assume that's already tested somewhere. I'm just making it accessible. I did do testing to make sure they matched what I got with the statsmodels in python though. It looks like they're calculated based on the Observed Information Matrix instead of the Expected Information Matrix, so it was a little tricky.

Thank you for being so vigilant! Unfortunately, the person who implemented this is no longer working in H2O so I can't easily ask (and I'm not familiar with this piece of code) but I hope we have the test somewhere.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Expose GLM Variance-Covariance Matrix

3 participants