
Envest/45 reduced gene sets #128

Merged
jaclyn-taroni merged 54 commits into main from envest/45-reduced_gene_sets
Jan 1, 2026

Conversation

@envest
Collaborator

@envest envest commented Jun 16, 2025

Closes #45

Here we explore whether a targeted gene set with clinical applicability performs as well as the universal gene set, and how a targeted gene set performs relative to a random gene set of the same size. We can run this experiment using kTSP, RF, and lasso models; MM2S and medulloPackage did not work in this context.

The main points of this PR:

  • Train and test models using targeted and random gene sets in predict/predict_targeted_gene_set.R
  • Compare kappa and accuracy values between the full, random, and targeted gene sets in analysis_notebooks/targeted_gene_set.Rmd
  • Add the targeted gene set list in processed_data/NS_IO_360_v1.0_Genes.tsv
  • Update renv.lock with rstatix and friends

Notes:

A previous version of this used the same training/testing study selection seed across the different gene sets. This allowed for pairwise comparison between gene sets that used the same studies, and I calculated "delta kappa" values to show the difference in performance. However, I found that measure confusing, and using the same seed was an unnecessary constraint, especially after the introduction of medulloPackage forced certain limitations on study selection. So I just used different starting seeds to split training/testing, and now compare Kappa values directly instead of calculating a pairwise delta kappa. I think this also allows for more intuitive statistical comparison between the gene sets.

Questions for reviewer:

  1. I used renv::snapshot(type = "all") to create a new renv.lock file since some of the existing packages in renv.lock were not explicitly referenced in our codebase. I wonder if that's advised, or if using the "implicit" default is better.
  2. How best to run these scripts? Add an 06_targeted_gene_set.sh for both model prediction and the analysis notebook?

Thank you!

@envest envest requested a review from jaclyn-taroni June 16, 2025 19:35
Member

@jaclyn-taroni jaclyn-taroni left a comment

Thanks for doing this! We should talk more about this at our meeting before you make any changes (if you make changes).

You've introduced additional variability by not using the same training/testing split. As in, it's harder to tell if there's a difference in performance because you're using targeted gene sets or because you happen to be testing on different datasets (probably not an issue for RNA-seq in practice and may not be an issue for arrays either). You could use the same splits and not delta Kappa as a measure – you could use a Wilcoxon test for paired samples instead.
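A minimal sketch of the suggestion, assuming Kappa values are paired by a shared split seed (the data frame and column names here are invented for illustration):

```r
# Hypothetical sketch: with identical training/testing splits per seed,
# Kappa values pair up by seed, so a Wilcoxon signed-rank test for paired
# samples applies. All values and column names below are made up.
kappa_df <- data.frame(
  seed           = 1:8,
  kappa_full     = c(0.910, 0.885, 0.930, 0.902, 0.874, 0.921, 0.893, 0.940),
  kappa_targeted = c(0.900, 0.870, 0.912, 0.880, 0.850, 0.893, 0.861, 0.905)
)

# paired = TRUE is the key difference from an unpaired comparison:
# the test ranks within-seed differences rather than pooling the two groups
wilcox.test(kappa_df$kappa_full, kappa_df$kappa_targeted, paired = TRUE)
```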

I also think that having different sample sizes for the full/targeted vs. random gene sets is a bit of an issue for display – a boxplot obscures that information, and (this is a personal problem) I'm reading too much into the differences in statistics.

I agree that we're dealing with two questions here:

  1. Given a targeted gene set of cancer-related genes, how does model performance compare between targeted gene set models and full gene set models?
  2. Does the identity of the genes in the targeted gene set matter? In other words, can a random gene set of the same size perform just as well as the targeted gene set?

I think the first question is best answered through a paired test. I think the second question is best addressed via permutation testing, as I say in one of my comments. I'm finding one series of displays for both questions confusing.

Again, let's chat before you change anything!

Member

I don't think you need this?

targeted_list <- purrr::map2(
  targeted_kTSP_RF_weighted_models_list,
  targeted_kTSP_RF_lasso_unweighted_models_list,
  c
) |>
  purrr::map(\(x) x[!duplicated(names(x))]) # remove duplicate list items
Member

What duplicated list items would we expect?

Collaborator Author

Each list contains some metadata that is the same, so this removes those redundant items.

Member

What number of pathways is it?

Collaborator Author

I moved this to the analysis notebook and it gets printed out.

dplyr::filter(p.adj < 0.05) |>
dplyr::arrange(p.adj) |>
knitr::kable()

Member

So I take it there's no difference in LASSO performance between full and targeted gene sets?

Member

Let's add more text about these results 😸

dplyr::arrange(p.adj) |>
knitr::kable()

Member

I'm just going to say that my intuition when I look at this is to think "well, the effect sizes are probably very different" by looking at the statistic column, but I don't think I can tell that because of the difference in sample size. It might be nice to calculate an effect size and display it?

Collaborator Author

This is now part of the analysis notebook!

Comment on lines 122 to 127
Member

I know this would probably take too long, but I think the more interesting framing of this question is, "Does the targeted gene set yield a better classifier than random gene sets of the same size?" In my opinion, that is best answered by generating 1000+ random gene sets and calculating a p-value.

Collaborator Author

This got implemented using tryCatch() to allow for inexplicable random gene set model failures.
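A sketch of the pattern described here, assuming each random gene set's model fit is wrapped so a failure yields NA rather than aborting the whole permutation loop (`fit_and_score()` is an invented stand-in for the real modeling call):

```r
# Hypothetical sketch: tolerate inexplicable random gene set model failures
# inside a permutation loop. fit_and_score() is invented for illustration.
fit_and_score <- function(gene_set) {
  if (length(gene_set) < 3) stop("model failed") # simulate a failure mode
  mean(gene_set) # stand-in for the model's Kappa
}

set.seed(45)
random_sets <- replicate(
  5,
  sample(seq(0, 1, 0.01), sample(2:10, 1)),
  simplify = FALSE
)

perm_kappas <- vapply(random_sets, function(gs) {
  tryCatch(fit_and_score(gs), error = function(e) NA_real_)
}, numeric(1))

# failed fits become NA and are dropped before computing the permuted p-value
perm_kappas <- perm_kappas[!is.na(perm_kappas)]
```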

@envest
Collaborator Author

envest commented Jul 16, 2025

This is ready for the second round review! Key changes:

  • Use the same training/test split when comparing full gene set vs. targeted gene set
  • Use the same training/test split when comparing random gene set vs. targeted gene set
  • Fully separate comparisons of full gene set vs. targeted gene set and random gene set vs. targeted gene set
  • Use paired Wilcoxon test and effect size to compare full gene set vs. targeted gene set
  • Use permutation test to compare random gene set vs. targeted gene set
  • Reverted renv.lock and updated it the right way
  • Added 06-targeted_gene_set.sh to run modeling script and analysis notebook

Questions for reviewer:

  • How to restructure/combine statistics and visualizations in analysis notebook
  • rstatix didn't automatically apply p-value correction as it has done previously when there were multiple comparisons. Should I add this back?

Thanks!

@envest envest requested a review from jaclyn-taroni July 16, 2025 19:48
Member

@jaclyn-taroni jaclyn-taroni left a comment

Thank you, @envest. This is a great improvement on the previous analysis! I would like the notebook to include a little more interpretation/explanation throughout, particularly around the calculation of the permuted p-values, before I approve. But it's close!

Member

Suggested change
# set subgroups analyzed in this notebook (canonical MB subgroups)
# set subgroups (canonical MB subgroups)

initial_seed = seed,
n_repeats = n_repeats,
n_cores = n_cores,
ktsp_featureNo = 1000,
Member

A lot of these values are repeated in 4(?) places. It might be helpful to gather them all at the top of the script.

Collaborator Author

Yeah, that's a lot of repetition... the values are either set at the top of the script or are the default in run_many_models(). The main differences between the four calls are the input gene set via genex_df = and which models / weighting are being used.

I haven't made any changes yet -- do you think it's better to just delete the values that are kept as default, or explicitly state them once at the top?

Member

We probably need to cover medulloPackage as well, which I expect might have the same issue (i.e., not enough relevant genes captured by the assay)?

Member

Oh, I'm noticing you have one sentence on that. Let's update the header and add a bit more detail.

Collaborator Author

Partially addressed this -- need to dig a bit more on the why

Member

I was not familiar with dplyr::percent_rank(). Reading the documentation, it appears that this will give us what we're looking for, but I don't love how non-explicit this is. There should at least be a comment explaining why this calculation is appropriate.

Do we ever get permuted p=0? Doesn't look like it from the HTML!

Collaborator Author

Added a comment for explanation... also possible to write it as

dplyr::mutate(perm_pvalue = dplyr::percent_rank(dplyr::desc(value)))

but I don't think that code is any more interpretable than 1 - percent_rank().
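A small check of the equivalence discussed here, on invented data: without ties, 1 - percent_rank(x) and percent_rank(desc(x)) agree exactly; with ties they can differ, because percent_rank() is built on minimum ranks.

```r
# Sketch comparing the two spellings on made-up permutation statistics.
# With distinct values the two columns are identical.
perm_df <- data.frame(value = c(0.42, 0.87, 0.65, 0.91, 0.50))

perm_df |>
  dplyr::mutate(
    p_a = 1 - dplyr::percent_rank(value),
    p_b = dplyr::percent_rank(dplyr::desc(value))
  )
```

Note that either spelling assigns the largest observed value a permuted p of exactly 0, which is worth a comment in the notebook.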

Member

I think this might be leftover from earlier. (My enthusiasm for this visualization has not increased!)

Collaborator Author

Removed

dplyr::filter(p.adj < 0.05) |>
dplyr::arrange(p.adj) |>
knitr::kable()

Member

Let's add more text about these results 😸

Member

I might arrange this by model_type and platform so it's easy to ascertain whether it's all subgroups or not. I might also expect more text with interpretation after this chunk.

Collaborator Author

Interpretation added. Let's discuss more about table sorting synchronously 👍

Member

Should we add paired points with geom_point() and geom_line(aes(group = {whatever gives us the pairs}), color = "#000000")?

Collaborator Author

Paired points and lines added!
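A sketch of the suggested display, assuming pairs are identified by a shared split seed (the data frame, column names, and values here are invented):

```r
# Hypothetical sketch of the reviewer's suggestion: overlay the boxplot with
# paired points and connect each pair with a line grouped by the split seed.
plot_df <- data.frame(
  seed     = rep(1:5, 2),
  gene_set = rep(c("full", "targeted"), each = 5),
  kappa    = c(0.91, 0.88, 0.93, 0.90, 0.87,
               0.89, 0.87, 0.92, 0.88, 0.86)
)

ggplot2::ggplot(plot_df, ggplot2::aes(x = gene_set, y = kappa)) +
  ggplot2::geom_boxplot() +
  ggplot2::geom_point() +
  ggplot2::geom_line(ggplot2::aes(group = seed), color = "#000000") + # seed gives the pairs
  ggplot2::theme_bw()
```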



# Interpretation
Member

I think this is a great addition! Since it doesn't fully capture the results (e.g., LASSO models do not perform differently when using a targeted gene set), I still would lean towards adding more interpretation throughout.

ggplot2::theme_bw() +
ggplot2::theme(legend.position = "bottom", legend.direction = "horizontal")

targeted_perm_test_plot
Member

I like this visualization!

@envest
Collaborator Author

envest commented Dec 24, 2025

More interpretation, better figures, and an explanation about medulloPackage not working... thank you for your months of discussion and reviews (and patience!)

@envest envest requested a review from jaclyn-taroni December 24, 2025 19:04
Member

@jaclyn-taroni jaclyn-taroni left a comment

LGTM 🚀

@jaclyn-taroni
Member

I'm going to merge this so the image gets rebuilt!

@jaclyn-taroni jaclyn-taroni merged commit eae0e40 into main Jan 1, 2026
1 check passed
@envest envest deleted the envest/45-reduced_gene_sets branch January 21, 2026 14:58


Successfully merging this pull request may close these issues.

Targeted gene set models
