GH-16583: Access GLM Variance-Covariance Matrix with vcov #16586
manh4wk wants to merge 10 commits into h2oai:master from vcov

Conversation
@tomasfryda Do you know if there's anything else I can do right now to see whether this passes all the tests? I saw you mention in another thread that the team is pretty busy at the moment.
Hi @maurever or @valenad1, can one of you take a look at this? Having the GLM's variance-covariance matrix available lets us do things like run a Wald test on two levels of a categorical variable to see whether they are statistically different or should be combined into a single category.
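As a quick illustration of the use case mentioned above, here is a minimal plain-Python sketch of a Wald test comparing two coefficient estimates. The numbers are hypothetical, not H2O output; the point is that the statistic needs exactly the variances and the covariance that the vcov matrix exposes:

```python
# Hypothetical coefficient estimates for two levels of a categorical variable.
b1, b2 = 0.80, 0.65

# Entries that would come from the model's variance-covariance matrix
# (made-up values for illustration).
var_b1, var_b2, cov_b12 = 0.010, 0.012, 0.004

# Wald statistic for H0: b1 == b2 (chi-square with 1 degree of freedom).
wald = (b1 - b2) ** 2 / (var_b1 + var_b2 - 2 * cov_b12)
print(round(wald, 4))
```

Without the off-diagonal covariance term, this test cannot be computed from the coefficients table alone, which is what motivates exposing the full matrix.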
tomasfryda left a comment
I think it's a good idea to expose the variance-covariance matrix, but that probably depends on @valenad1's decision.
If he agrees, I would suggest fixing the R tests and making sure the column names and row names match: currently the column names are always lowercased (IIRC this can be caused by TwoDimTableV3, so I would consider choosing a different data structure, e.g. H2OFrame), while the row names aren't.
For example, Intercept vs intercept:

Note that I didn't do a complete review; I just looked at the R part of the PR.
```r
manualYear <- mFV@model$coefficients_table$year

# compare values from model and obtained manually
for (ind in c(1:length(manuelPValues)))
```
manuelPValues doesn't seem to be defined anywhere. Also, I would recommend using seq_along(x) instead of 1:length(x); when x is empty, the latter produces c(1, 0).
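The pitfall behind this recommendation is that R's colon operator counts downward when its left bound exceeds its right bound, so with an empty x, 1:length(x) evaluates to 1:0, i.e. c(1, 0). A small Python sketch that emulates R's colon semantics (the r_colon helper is hypothetical, purely for illustration) shows the difference:

```python
def r_colon(a, b):
    """Emulate R's a:b, which counts down when a > b."""
    step = 1 if b >= a else -1
    return list(range(a, b + step, step))

x = []                        # an empty vector
bad = r_colon(1, len(x))      # like 1:length(x) -> [1, 0]: two bogus indices
good = list(range(len(x)))    # like seq_along(x) -> []: loop body never runs
print(bad, good)
```

With 1:length(x), a loop over an empty vector would still execute twice with invalid indices, whereas seq_along(x) correctly yields nothing to iterate over.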
Thanks, I updated the R test; that one wasn't working. I also switched to seq_along as you recommended.
I do think the test would run faster if the data were pulled into a data frame and then compared, if that's allowed. Right now it goes through the matrix cell by cell, which took a little while when I ran it.
if that's allowed
It's allowed but I don't think it's necessary.
```r
doTest("GLM: make sure error is generated when a gbm model calls glm functions", testGBMvcov)
doTest("GLM: make sure error is generated when compute_p_values=FALSE", testGLMvcovcomputePValueFALSE)
doTest("GLM: test variance-covariance values", testGLMPValZValStdError)
```
I would prefer something like:

```r
doSuite("GLM: VCOV support", makeSuite(testGBMvcov, testGLMvcovcomputePValueFALSE, testGLMPValZValStdError))
```

Thanks, I changed it to doSuite.
There is no test that checks whether the implementation actually works. The tests only check that an error is thrown when vcov is used where it's unsupported.
The last R test was supposed to, but it looks like I must have pushed the wrong thing. It should now check that the covariance matrix returned by the function matches the one obtained by fetching the H2O frame by its reference name.
I thought a little about adding a test that the actual values are calculated correctly, but since the vcov matrix was already in the code and the p-values all rely on it, I assume that's already tested somewhere; I'm just making it accessible. I did verify that the values match what I got from statsmodels in Python, though. It looks like they're calculated from the Observed Information Matrix rather than the Expected Information Matrix, so that was a little tricky.
The last R test was supposed to, but it looks like I must have pushed the wrong thing.
No worries!
I thought a little bit about adding in a test that the actual values themselves were calculated correctly, but since the vcov matrix was already in the code, and the p-values all rely on it already, I assume that's already tested somewhere. I'm just making it accessible. I did do testing to make sure they matched what I got with the statsmodels in python though. It looks like they're calculated based on the Observed Information Matrix instead of the Expected Information Matrix, so it was a little tricky.
Thank you for being so vigilant! Unfortunately, the person who implemented this no longer works at H2O, so I can't easily ask (and I'm not familiar with this piece of code), but I hope we have the test somewhere.
Sorry it took so long, @tomasfryda. I got tied up with work for a bit, and then some silly scoping issues took me too long to figure out. I made the changes you recommended. The matrix is now returned as an H2OFrame, with the first column holding the variable names, and the casing of the variable names now matches between the columns and the rows. I think we also get more precision behind the scenes with the numbers in the H2OFrame, which I like.
tomasfryda left a comment
Everything appears to be working correctly, but a few minor changes are still needed. Thank you!
```java
double[][] vcov = impl.vcov();
double[][] vcov_reordered = new double[vcov.length][vcov.length];

long[] vcov_indices = new long[vcov.length];
for (int i = 0; i < vcov.length; ++i)
    vcov_indices[i] = i;

// move intercept from last row and column to first row and column to match reordering done with coefficients
vcov_reordered[0][0] = vcov[vcov.length - 1][vcov.length - 1];
for (int i = 1; i < vcov.length; ++i) {
    vcov_reordered[0][i] = vcov[vcov.length - 1][i - 1];
}
for (int i = 1; i < vcov.length; ++i) {
    vcov_reordered[i][0] = vcov[i - 1][vcov.length - 1];
}
for (int i = 1; i < vcov.length; ++i) {
    for (int j = 1; j < vcov.length; ++j) {
        vcov_reordered[i][j] = vcov[i - 1][j - 1];
    }
}

String[] vcov_colnames = ArrayUtils.append(new String[]{"Names"}, Arrays.copyOf(ns, ns.length));
Vec[] vec_arr = new Vec[vcov.length + 1]; // one extra vec for column names
Vec.VectorGroup group = new Vec.VectorGroup();
Key<Vec> vec_key = group.addVec();

// load vcov info into vec_arr
vec_arr[0] = Vec.makeVec(vcov_indices, ns, vec_key);
for (int i = 0; i < vcov.length; ++i) {
    vec_key = group.addVec();
    vec_arr[i + 1] = Vec.makeVec(vcov_reordered[i], vec_key);
}

Key<Frame> frameKey = Key.make();
Frame frame = new Frame(frameKey, vcov_colnames, vec_arr);
DKV.put(frameKey, frame);
vcov_table = new KeyV3.FrameKeyV3();
vcov_table.fillFromImpl(frameKey);
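The intercept reordering in the Java above can be sketched more compactly; a minimal Python version (hypothetical 3x3 values, nested lists standing in for double[][]) makes the row/column rotation easy to check by eye:

```python
def move_intercept_first(vcov):
    """Move the last row and column (the intercept) to the front,
    mirroring the reordering done for the GLM coefficients."""
    n = len(vcov)
    out = [[0.0] * n for _ in range(n)]
    out[0][0] = vcov[n - 1][n - 1]
    for i in range(1, n):
        out[0][i] = vcov[n - 1][i - 1]   # intercept row
        out[i][0] = vcov[i - 1][n - 1]   # intercept column
    for i in range(1, n):
        for j in range(1, n):
            out[i][j] = vcov[i - 1][j - 1]  # remaining block shifts by one
    return out

# Symmetric toy matrix with the intercept in the last position.
v = [[1, 2, 3],
     [2, 4, 5],
     [3, 5, 6]]
print(move_intercept_first(v))  # -> [[6, 3, 5], [3, 1, 2], [5, 2, 4]]
```

A useful property to check: since a variance-covariance matrix is symmetric, the reordered matrix must stay symmetric, which the toy output above does.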
Could you deterministically derive the frameKey from the model_id (e.g. model_id + "vcov_frame") and enclose the frame creation in a condition like if (DKV.getGet(frameKey) == null) { ... } so we don't create the frame every time?
Since the frame is in the DKV, this would otherwise lead to a memory leak.
I tried using model_id, but I got the error below about referencing it from a static context. I did find that the model id is saved in _training_metrics, though, and it can be accessed like this. Do you know if there's a better approach, or is this good enough?

```java
Key<Frame> frameKey = Key.make(impl._training_metrics.model()._key.toString() + "_vcov_frame");
```

```
h2o-3\h2o-algos\src\main\java\hex\schemas\GLMModelV3.java:302: error: non-static variable model_id cannot be referenced from a static context
        Key<Frame> frameKey = Key.make(model_id + "_vcov_frame");
                                       ^
```
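The reviewer's create-once suggestion is the usual get-or-create pattern against a key-value store. A minimal Python sketch (a dict standing in for H2O's DKV; all names here are hypothetical, for illustration only) shows why deriving the key deterministically from the model id lets repeated calls reuse one frame instead of leaking a new one each time:

```python
dkv = {}  # stands in for H2O's distributed key-value store

def get_or_create_vcov_frame(model_id, build):
    """Derive the frame key from the model id and build the frame only
    if it is not already stored, so repeated calls don't leak frames."""
    key = model_id + "_vcov_frame"
    if key not in dkv:          # analogous to DKV.getGet(frameKey) == null
        dkv[key] = build()
    return dkv[key]

calls = []
build = lambda: calls.append(1) or [[0.5]]  # counts how often we build
frame1 = get_or_create_vcov_frame("glm_1", build)
frame2 = get_or_create_vcov_frame("glm_1", build)
print(len(calls), frame1 is frame2)  # built once, same object returned
```

With a random Key.make() instead, every call to the schema would put a fresh frame into the store with no handle to remove the old one, which is exactly the leak the reviewer is pointing at.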
```python
def vcov(self):
    """
    Return an H2OFrame with the variance-covariance matrix for a GLM (requires ``compute_p_values=True``).
```
Suggested change:

```diff
-    Return an H2OFrame with the variance-covariance matrix for a GLM (requires ``compute_p_values=True``).
+    :returns: Return an H2OFrame with the variance-covariance matrix for a GLM (requires ``compute_p_values=True``).
```

Made the variance-covariance matrix for GLMs part of the model_output results so it's accessible from Python and R. The matrix is rearranged in `h2o-algos/src/main/java/hex/schemas/GLMModelV3.java` so that the Intercept is both the first row and the first column, similar to how it's done for the GLM coefficient results in the same area of the code.

This matrix is now accessible with the `glm_model_object.vcov()` function in Python and with `h2o.vcov(glm_model_object)` in R.

This change fixes #16583