Conversation
jaclyn-taroni
left a comment
Thanks for doing this! We should talk more about this at our meeting before you make any changes (if you make changes).
You've introduced additional variability by not using the same training/testing split. That is, it's harder to tell whether a difference in performance is because you're using targeted gene sets or because you happen to be testing on different datasets (probably not an issue for RNA-seq in practice, and may not be an issue for arrays either). You could use the same splits and drop delta Kappa as the measure – a Wilcoxon test for paired samples would work instead.
I also think that having different sample sizes for the full/targeted vs. random gene sets is a bit of an issue for display – a boxplot obscures that information, and (this is a personal problem) I'm reading too much into the differences in statistics.
I agree that we're dealing with two questions here:
- Given a targeted gene set of cancer-related genes, how does model performance compare between targeted gene set models and full gene set models?
- Does the identity of the genes in the targeted gene set matter? In other words, can a random gene set of the same size perform just as well as the targeted gene set?
I think the first question is best answered through a paired test. I think the second question is best addressed via permutation testing, as I say in one of my comments. I'm finding one series of displays for both questions confusing.
Again, let's chat before you change anything!
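For concreteness, here is a minimal sketch of the two tests suggested above; the data frame, column names, and all values are made up for illustration and are not objects from this PR.

```r
# Minimal sketch; all objects and numbers here are made up for illustration.
set.seed(2024)
kappa_df <- data.frame(
  kappa_full     = runif(10, 0.70, 0.95),  # full gene set, one value per split
  kappa_targeted = runif(10, 0.65, 0.95)   # targeted gene set, same splits
)

# Question 1: paired test of targeted vs. full gene set performance,
# where each row corresponds to the same training/testing split
wilcox.test(kappa_df$kappa_targeted, kappa_df$kappa_full, paired = TRUE)

# Question 2: permutation-style p-value -- the fraction of random gene sets
# whose Kappa is at least as large as the targeted gene set's observed Kappa
observed_kappa <- 0.85
random_kappas  <- runif(1000, 0.50, 0.90)  # stand-in for 1000+ random gene sets
mean(random_kappas >= observed_kappa)
```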
predict/predict_targeted_gene_set.R
Outdated
I don't think you need this?
    targeted_list <- purrr::map2(targeted_kTSP_RF_weighted_models_list,
                                 targeted_kTSP_RF_lasso_unweighted_models_list,
                                 c) |>
      purrr::map(\(x) x[!duplicated(names(x))]) # remove duplicate list items
What duplicated list items would we expect?
Each list contains some metadata that is the same, so this removes those redundant items.
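For illustration, a toy example of the kind of duplication described here; the list names are made up:

```r
# Toy example: both model lists carry an identically named metadata element,
# so concatenating them with c() yields two entries with the same name
a <- list(model_A = "fit A", metadata = "shared info")
b <- list(model_B = "fit B", metadata = "shared info")

combined <- c(a, b)
names(combined)                          # "model_A" "metadata" "model_B" "metadata"
combined[!duplicated(names(combined))]   # keeps only the first "metadata"
```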
predict/predict_targeted_gene_set.R
Outdated
What number of pathways is it?
I moved this to the analysis notebook and it gets printed out.
    dplyr::filter(p.adj < 0.05) |>
      dplyr::arrange(p.adj) |>
      knitr::kable()
So I take it there's no difference in LASSO performance between full and targeted gene sets?
Let's add more text about these results 😸
    dplyr::arrange(p.adj) |>
      knitr::kable()
I'm just going to say that my intuition when I look at this is to think "well, the effect sizes are probably very different" by looking at the statistic column, but I don't think I can tell that because of the difference in sample size. It might be nice to calculate an effect size and display it?
This is now part of the analysis notebook!
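As a sketch of what an effect size calculation could look like with rstatix (which this PR adds to renv.lock); the data frame, column names, and values below are made up:

```r
library(rstatix)

# Made-up long-format Kappa values; gene_set and kappa are hypothetical columns
set.seed(42)
kappa_long <- data.frame(
  gene_set = rep(c("full", "targeted"), each = 10),
  kappa    = c(runif(10, 0.70, 0.95), runif(10, 0.65, 0.95))
)

kappa_long |>
  wilcox_effsize(kappa ~ gene_set) |>  # effect size r for the Wilcoxon test
  knitr::kable()
```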
predict/predict_targeted_gene_set.R
Outdated
I know this would probably take too long, but I think the more interesting framing of this question is, "Does the targeted gene set yield a better classifier than random gene sets of the same size?" In my opinion, that is best answered by generating 1000+ random gene sets and calculating a p-value.
This got implemented using tryCatch() to allow for inexplicable random gene set model failures.
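A rough sketch of that tryCatch() pattern; fit_random_gene_set_model() below is a dummy stand-in for the repository's model-fitting code, not a function from this PR:

```r
library(purrr)

# fit_random_gene_set_model() is a dummy stand-in for the repository's
# model-fitting code; it fails occasionally to exercise the tryCatch() pattern
fit_random_gene_set_model <- function(seed) {
  set.seed(seed)
  if (runif(1) < 0.05) stop("inexplicable random gene set model failure")
  list(seed = seed, kappa = runif(1, 0.5, 0.9))
}

random_results <- map(seq_len(1000), \(i) {
  tryCatch(
    fit_random_gene_set_model(seed = i),
    error = function(e) NULL  # drop failed fits instead of aborting the run
  )
})

# Keep only the random gene sets whose models fit successfully
random_results <- compact(random_results)
length(random_results)
```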
This is ready for the second round review! Key changes:
Questions for reviewer:
Thanks!
jaclyn-taroni
left a comment
Thank you, @envest. This is a great improvement on the previous analysis! I would like the notebook to include a little more interpretation/explanation throughout, particularly around the calculation of the permuted p-values, before I approve. But it's close!
predict/predict_targeted_gene_set.R
Outdated
    # set subgroups analyzed in this notebook (canonical MB subgroups)
    # set subgroups (canonical MB subgroups)
    initial_seed = seed,
    n_repeats = n_repeats,
    n_cores = n_cores,
    ktsp_featureNo = 1000,
A lot of these values are repeated in 4(?) places. It might be helpful to gather them all at the top of the script.
Yea, that's a lot of repetition... the values are either set at the top of the script or are the defaults in run_many_models(). The main differences between the four calls are the input gene set (passed via genex_df =) and which models / weighting are being used.
I haven't made any changes yet -- do you think it's better to just delete the values that are kept as default, or explicitly state them once at the top?
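One possible way to cut the repetition, sketched below; run_many_models() and genex_df are from this PR, while shared_args and targeted_genex_df are hypothetical names, and the other values come from the script's existing variables:

```r
# Gather the arguments shared by all four run_many_models() calls in one place;
# shared_args and targeted_genex_df are hypothetical names
shared_args <- list(
  initial_seed   = seed,
  n_repeats      = n_repeats,
  n_cores        = n_cores,
  ktsp_featureNo = 1000
)

# Each call then only states what actually differs: the gene set (genex_df)
# and the models / weighting being used
targeted_models <- do.call(
  run_many_models,
  c(list(genex_df = targeted_genex_df), shared_args)
)
```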
We probably need to cover medulloPackage as well, which I expect might have the same issue (i.e., not enough relevant genes captured by the assay)?
Oh, I'm noticing you have one sentence on that. Let's update the header and add a bit more detail.
Partially addressed this -- need to dig a bit more on the why
I was not familiar with dplyr::percent_rank(). Reading the documentation, it appears that this will give us what we're looking for, but I don't love how non-explicit this is. There should at least be a comment explaining why this calculation is appropriate.
Do we ever get permuted p=0? Doesn't look like it from the HTML!
Added a comment for explanation... it's also possible to write it as
    dplyr::mutate(perm_pvalue = dplyr::percent_rank(dplyr::desc(value)))
but I don't think that code is any more interpretable than 1 - percent_rank().
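A toy illustration of why the two formulations agree (all values made up):

```r
library(dplyr)

# Made-up Kappa values; with distinct values, 1 - percent_rank(x) and
# percent_rank(desc(x)) produce the same result
tibble(value = c(0.42, 0.55, 0.61, 0.70, 0.88)) |>
  mutate(
    perm_pvalue_a = 1 - percent_rank(value),
    perm_pvalue_b = percent_rank(desc(value))
  )
# Both formulations give 0 for the largest value in the column, which is when
# a permuted p of exactly 0 would appear
```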
I think this might be leftover from earlier. (My enthusiasm for this visualization has not increased!)
    dplyr::filter(p.adj < 0.05) |>
      dplyr::arrange(p.adj) |>
      knitr::kable()
Let's add more text about these results 😸
I might arrange this by model_type and platform so it's easy to ascertain whether it's all subgroups or not. I might also expect more text with interpretation after this chunk.
Interpretation added. Let's discuss more about table sorting synchronously 👍
Should we add paired points with geom_point() and geom_line(aes(group = {whatever gives us the pairs}), color = "#000000")?
Paired points and lines added!
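For reference, a minimal sketch of the paired display; kappa_long, split_id, gene_set, and kappa are hypothetical names standing in for the notebook's plotting data:

```r
library(ggplot2)

# Made-up paired Kappa values; split_id identifies the training/testing split
# shared by the full and targeted gene set models
set.seed(42)
kappa_long <- data.frame(
  split_id = rep(seq_len(10), times = 2),
  gene_set = rep(c("full", "targeted"), each = 10),
  kappa    = runif(20, 0.60, 0.95)
)

ggplot(kappa_long, aes(x = gene_set, y = kappa)) +
  geom_line(aes(group = split_id), color = "#000000") +  # connect each pair
  geom_point() +
  theme_bw()
```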
    # Interpretation
I think this is a great addition! Since it doesn't fully capture the results (e.g., LASSO models do not perform differently when using a targeted gene set), I still would lean towards adding more interpretation throughout.
    ggplot2::theme_bw() +
      ggplot2::theme(legend.position = "bottom", legend.direction = "horizontal")

    targeted_perm_test_plot
I like this visualization!
…r for deeper explanation
…e, add interpretation
More interpretation, better figures, and an explanation about medulloPackage not working... thank you for your months of discussion and reviews (and patience!)
I'm going to merge this so the image gets rebuilt!
Closes #45
Here we explore whether a targeted gene set with clinical applicability performs as well as the universal gene set, and how a targeted gene set compares to a random gene set of the same size. We can run this experiment using kTSP, RF, and lasso models; MM2S and medulloPackage did not work in this context.
The main points of this PR:
- predict/predict_targeted_gene_set.R
- analysis_notebooks/targeted_gene_set.Rmd
- processed_data/NS_IO_360_v1.0_Genes.tsv
- renv.lock with rstatix and friends

Notes:
A previous version of this used the same training/testing study selection seed across the different gene sets. This allowed for pairwise comparison between gene sets that used the same studies, and I calculated "delta kappa" values to show the difference in performance. However, I found that measure confusing, and using the same seed was an unnecessary constraint, especially after the introduction of medulloPackage forced certain limitations on study selection. So... I just used different starting seeds to split training/testing, and now compare Kappa values directly instead of calculating a pairwise delta kappa. I think this also allows for a more intuitive statistical comparison between the gene sets.
Questions for reviewer:
- I used renv::snapshot(type = "all") to create a new renv.lock file since some of the existing packages in renv.lock were not explicitly referenced in our codebase. I wonder if that's advised, or if using the "implicit" default is better.
- 06_targeted_gene_set.sh for both model prediction and the analysis notebook?

Thank you!