
Envest/45 reduced gene sets #128

Merged
jaclyn-taroni merged 54 commits into main from envest/45-reduced_gene_sets
Jan 1, 2026

Conversation

@envest
Collaborator

@envest envest commented Jun 16, 2025

Closes #45

Here we explore whether a targeted gene set with clinical applicability performs as well as the universal gene set, and how a targeted gene set performs relative to a random gene set of the same size. We can run this experiment using kTSP, RF, and lasso models; MM2S and medulloPackage did not work in this context.

The main points of this PR:

  • Train and test models using targeted and random gene sets in predict/predict_targeted_gene_set.R
  • Compare kappa and accuracy values between the full, random, and targeted gene sets in analysis_notebooks/targeted_gene_set.Rmd
  • Add the targeted gene set list in processed_data/NS_IO_360_v1.0_Genes.tsv
  • Update renv.lock with rstatix and friends

Notes:

A previous version of this used the same training/testing study selection seed across the different gene sets. This allowed for pairwise comparison between gene sets that used the same studies, and I calculated "delta kappa" values to show the difference in performance. However, I found that measure confusing, and using the same seed was an unnecessary constraint, especially after the introduction of medulloPackage forced certain limitations on study selection. So I just used different starting seeds to split training/testing, and now compare Kappa values directly instead of calculating a pairwise delta kappa. I think this also allows for more intuitive statistical comparison between the gene sets.

Questions for reviewer:

  1. I used renv::snapshot(type = "all") to create a new renv.lock file since some of the existing packages in renv.lock were not explicitly referenced in our codebase. I wonder if that's advised, or if using the "implicit" default is better.
  2. How best to run these scripts? Add an 06_targeted_gene_set.sh for both model prediction and the analysis notebook?

Thank you!

@envest envest requested a review from jaclyn-taroni June 16, 2025 19:35
Member

@jaclyn-taroni jaclyn-taroni left a comment

Thanks for doing this! We should talk more about this at our meeting before you make any changes (if you make changes).

You've introduced additional variability by not using the same training/testing split. As in, it's harder to tell if there's a difference in performance because you're using targeted gene sets or because you happen to be testing on different datasets (probably not an issue for RNA-seq in practice and may not be an issue for arrays either). You could use the same splits and not delta Kappa as a measure – you could use a Wilcoxon test for paired samples instead.
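A minimal sketch of the suggestion, assuming Kappa values are paired by a shared split seed (the data frame and column names here are invented for illustration):

```r
# Hypothetical sketch: with identical training/testing splits per seed,
# Kappa values pair up by seed, so a Wilcoxon signed-rank test for paired
# samples applies. All values and column names below are made up.
kappa_df <- data.frame(
  seed           = 1:8,
  kappa_full     = c(0.910, 0.885, 0.930, 0.902, 0.874, 0.921, 0.893, 0.940),
  kappa_targeted = c(0.900, 0.870, 0.912, 0.880, 0.850, 0.893, 0.861, 0.905)
)

# paired = TRUE is the key difference from an unpaired comparison:
# the test ranks within-seed differences rather than pooling the two groups
wilcox.test(kappa_df$kappa_full, kappa_df$kappa_targeted, paired = TRUE)
```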

I also think that having different sample sizes for the full/targeted vs. random gene sets is a bit of an issue for display – a boxplot obscures that information, and (this is a personal problem) I'm reading too much into the differences in statistics.

I agree that we're dealing with two questions here:

  1. Given a targeted gene set of cancer-related genes, how does model performance compare between targeted gene set models and full gene set models?
  2. Does the identity of the genes in the targeted gene set matter? In other words, can a random gene set of the same size perform just as well as the targeted gene set?

I think the first question is best answered through a paired test. I think the second question is best addressed via permutation testing, as I say in one of my comments. I'm finding one series of displays for both questions confusing.

Again, let's chat before you change anything!

Member

I don't think you need this?

targeted_list <- purrr::map2(
  targeted_kTSP_RF_weighted_models_list,
  targeted_kTSP_RF_lasso_unweighted_models_list,
  c
) |>
  purrr::map(\(x) x[!duplicated(names(x))]) # remove duplicate list items
Member

What duplicated list items would we expect?

Collaborator Author

Each list contains some metadata that is the same, so this removes those redundant items.

Member

What number of pathways is it?

Collaborator Author

I moved this to the analysis notebook and it gets printed out.

dplyr::filter(p.adj < 0.05) |>
dplyr::arrange(p.adj) |>
knitr::kable()

Member

So I take it there's no difference in LASSO performance between full and targeted gene sets?

Member

Let's add more text about these results 😸

dplyr::arrange(p.adj) |>
knitr::kable()

Member

I'm just going to say that my intuition when I look at this is to think "well, the effect sizes are probably very different" by looking at the statistic column, but I don't think I can tell that because of the difference in sample size. It might be nice to calculate an effect size and display it?

Collaborator Author

This is now part of the analysis notebook!

Comment on lines 122 to 127
Member

I know this would probably take too long, but I think the more interesting framing of this question is, "Does the targeted gene set yield a better classifier than random gene sets of the same size?" In my opinion, that is best answered by generating 1000+ random gene sets and calculating a p-value.

Collaborator Author

This got implemented using tryCatch() to allow for inexplicable random gene set model failures.
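A sketch of the pattern described here, assuming each random gene set's model fit is wrapped so a failure yields NA rather than aborting the whole permutation loop (`fit_and_score()` is an invented stand-in for the real modeling call):

```r
# Hypothetical sketch: tolerate inexplicable random gene set model failures
# inside a permutation loop. fit_and_score() is invented for illustration.
fit_and_score <- function(gene_set) {
  if (length(gene_set) < 3) stop("model failed") # simulate a failure mode
  mean(gene_set) # stand-in for the model's Kappa
}

set.seed(45)
random_sets <- replicate(
  5,
  sample(seq(0, 1, 0.01), sample(2:10, 1)),
  simplify = FALSE
)

perm_kappas <- vapply(random_sets, function(gs) {
  tryCatch(fit_and_score(gs), error = function(e) NA_real_)
}, numeric(1))

# failed fits become NA and are dropped before computing the permuted p-value
perm_kappas <- perm_kappas[!is.na(perm_kappas)]
```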

@envest
Collaborator Author

envest commented Jul 16, 2025

This is ready for the second round review! Key changes:

  • Use the same training/test split when comparing full gene set vs. targeted gene set
  • Use the same training/test split when comparing random gene set vs. targeted gene set
  • Fully separate comparisons of full gene set vs. targeted gene set and random gene set vs. targeted gene set
  • Use paired Wilcoxon test and effect size to compare full gene set vs. targeted gene set
  • Use permutation test to compare random gene set vs. targeted gene set
  • Reverted renv.lock and updated it the right way
  • Added 06-targeted_gene_set.sh to run modeling script and analysis notebook

Questions for reviewer:

  • How to restructure/combine statistics and visualizations in analysis notebook
  • rstatix didn't automatically apply p-value correction as it has done previously when there were multiple comparisons. Should I add this back?

Thanks!

@envest envest requested a review from jaclyn-taroni July 16, 2025 19:48
Member

@jaclyn-taroni jaclyn-taroni left a comment

Thank you, @envest. This is a great improvement on the previous analysis! I would like the notebook to include a little more interpretation/explanation throughout, particularly around the calculation of the permuted p-values, before I approve. But it's close!

Member

Suggested change
# set subgroups analyzed in this notebook (canonical MB subgroups)
# set subgroups (canonical MB subgroups)

initial_seed = seed,
n_repeats = n_repeats,
n_cores = n_cores,
ktsp_featureNo = 1000,
Member

A lot of these values are repeated in 4(?) places. It might be helpful to gather them all at the top of the script.

Collaborator Author

Yeah, that's a lot of repetition... the values are either set at the top of the script or are the default in run_many_models(). The main differences between the four calls are the input gene set via genex_df = and which models / weighting are being used.

I haven't made any changes yet -- do you think it's better to just delete the values that are kept as default, or explicitly state them once at the top?

Member

We probably need to cover medulloPackage as well, which I expect might have the same issue (i.e., not enough relevant genes captured by the assay)?

Member

Oh, I'm noticing you have one sentence on that. Let's update the header and add a bit more detail.

Collaborator Author

Partially addressed this -- need to dig a bit more on the why

Member

I was not familiar with dplyr::percent_rank(). Reading the documentation, it appears that this will give us what we're looking for, but I don't love how non-explicit this is. There should at least be a comment explaining why this calculation is appropriate.

Do we ever get permuted p=0? Doesn't look like it from the HTML!

Collaborator Author

Added a comment for explanation... also possible to write it as

dplyr::mutate(perm_pvalue = dplyr::percent_rank(dplyr::desc(value)))

but I don't think that code is any more interpretable than 1 - percent_rank().
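A small check of the equivalence discussed here, on invented data: without ties, 1 - percent_rank(x) and percent_rank(desc(x)) agree exactly; with ties they can differ, because percent_rank() is built on minimum ranks.

```r
# Sketch comparing the two spellings on made-up permutation statistics.
# With distinct values the two columns are identical.
perm_df <- data.frame(value = c(0.42, 0.87, 0.65, 0.91, 0.50))

perm_df |>
  dplyr::mutate(
    p_a = 1 - dplyr::percent_rank(value),
    p_b = dplyr::percent_rank(dplyr::desc(value))
  )
```

Note that either spelling assigns the largest observed value a permuted p of exactly 0, which is worth a comment in the notebook.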

Member

I think this might be leftover from earlier. (My enthusiasm for this visualization has not increased!)

Collaborator Author

Removed

dplyr::filter(p.adj < 0.05) |>
dplyr::arrange(p.adj) |>
knitr::kable()

Member

Let's add more text about these results 😸

Member

I might arrange this by model_type and platform so it's easy to ascertain whether it's all subgroups or not. I might also expect more text with interpretation after this chunk.

Collaborator Author

Interpretation added. Let's discuss more about table sorting synchronously 👍

Member

Should we add paired points with geom_point() and geom_line(aes(group = {whatever gives us the pairs}), color = "#000000")?

Collaborator Author

Paired points and lines added!
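A sketch of the suggested display, assuming pairs are identified by a shared split seed (the data frame, column names, and values here are invented):

```r
# Hypothetical sketch of the reviewer's suggestion: overlay the boxplot with
# paired points and connect each pair with a line grouped by the split seed.
plot_df <- data.frame(
  seed     = rep(1:5, 2),
  gene_set = rep(c("full", "targeted"), each = 5),
  kappa    = c(0.91, 0.88, 0.93, 0.90, 0.87,
               0.89, 0.87, 0.92, 0.88, 0.86)
)

ggplot2::ggplot(plot_df, ggplot2::aes(x = gene_set, y = kappa)) +
  ggplot2::geom_boxplot() +
  ggplot2::geom_point() +
  ggplot2::geom_line(ggplot2::aes(group = seed), color = "#000000") + # seed gives the pairs
  ggplot2::theme_bw()
```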



# Interpretation
Member

I think this is a great addition! Since it doesn't fully capture the results (e.g., LASSO models do not perform differently when using a targeted gene set), I still would lean towards adding more interpretation throughout.

ggplot2::theme_bw() +
ggplot2::theme(legend.position = "bottom", legend.direction = "horizontal")

targeted_perm_test_plot
Member

I like this visualization!

@envest
Collaborator Author

envest commented Dec 24, 2025

More interpretation, better figures, and an explanation about medulloPackage not working... thank you for your months of discussion and reviews (and patience!)

@envest envest requested a review from jaclyn-taroni December 24, 2025 19:04
Member

@jaclyn-taroni jaclyn-taroni left a comment

LGTM 🚀

@jaclyn-taroni
Member

I'm going to merge this so the image gets rebuilt!

@jaclyn-taroni jaclyn-taroni merged commit eae0e40 into main Jan 1, 2026
1 check passed
@envest envest deleted the envest/45-reduced_gene_sets branch January 21, 2026 14:58


Successfully merging this pull request may close these issues.

Targeted gene set models
