Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
54 commits
Select commit Hold shift + click to select a range
efedbb6
Nanostring IO 360 gene list
envest May 10, 2023
1952a18
Initial code for reduced gene sets notebook
envest May 10, 2023
0f9a8c3
Change analysis name from reduced gene sets to targeted gene set
envest May 10, 2023
73d99db
Convert Nanostring excel file to tsv
envest May 10, 2023
4c97f46
Selected targeted and random genes for modeling, then model
envest May 10, 2023
c8949f1
Save random output to random filepath
envest May 10, 2023
410e474
Why no MM2S
envest May 10, 2023
345ed16
Plot delta kappa
envest May 11, 2023
fa2fbf2
Merge branch 'envest/return_model_metrics' into envest/45-reduced_gen…
envest May 11, 2023
ef92d39
Merge branch 'main' into envest/45-reduced_gene_sets
envest May 11, 2023
49f9f77
More rationale and explanation throughout notebook
envest Sep 6, 2023
5971c2b
More rationale and explanation throughout notebook
envest Sep 6, 2023
a27f88e
Merge branch 'main' into envest/45-reduced_gene_sets
envest Oct 2, 2024
efc5678
Merge branch 'main' into envest/45-reduced_gene_sets
envest Apr 16, 2025
6bd1e01
Merge branch 'main' into envest/45-reduced_gene_sets
envest May 14, 2025
aad33df
Merge branch 'main' into envest/45-reduced_gene_sets
envest Jun 10, 2025
7add33b
Train/test targeted and random gene set models
envest Jun 11, 2025
18169cb
Compare kappa values across gene sets
envest Jun 11, 2025
1addf30
Plot target, random, and full kappa and accuracy
envest Jun 12, 2025
6ba9d30
Remove old file version name
envest Jun 16, 2025
7198581
No longer need delta kappa
envest Jun 16, 2025
73cc7e1
Add rstatix
envest Jun 16, 2025
5f521f2
Wilcoxon test comparison kappa and accuracy
envest Jun 16, 2025
b5d7d16
Run model on kTSP (w/unw), RF (w/unw), lasso (unw)
envest Jun 16, 2025
147ddc9
Add all model types in notebook
envest Jun 16, 2025
756476f
Fix renaming random list
envest Jun 16, 2025
560fa50
Fix combining random models
envest Jun 16, 2025
7f277c8
Add interpretation of results
envest Jun 16, 2025
7fbadfe
Revert "Add rstatix"
envest Jul 7, 2025
8443668
Add rstatix to renv.lock with vanilla renv::snapshot()
envest Jul 7, 2025
3709d56
Script to run targeted gene set modeling and render analysis notebook
envest Jul 7, 2025
2735f1f
Remove unneeded patchwork
envest Jul 7, 2025
9534b96
Save random gene sets separately
envest Jul 8, 2025
37b6cd5
Add note about perm_repeat 38
envest Jul 8, 2025
9cf496c
Restructure random gene set code to use tryCatch allowing for soft fa…
envest Jul 11, 2025
1176fb9
Final wording updat to tryCatch, move MM2S test over to notebook
envest Jul 15, 2025
b2c60d9
Read in random model metrics and do pairwise wilcox and effect size
envest Jul 15, 2025
e8e236d
Focus comparisons: targeted v full, targeted v random
envest Jul 16, 2025
a582d58
Add number of genes, MM2S, and more interpretation
envest Jul 16, 2025
5fbdf21
And the html file
envest Jul 16, 2025
41570ac
Merge branch 'main' into envest/45-reduced_gene_sets
envest Sep 3, 2025
62cdf8e
Reduce wordiness
envest Sep 3, 2025
5c0ae87
Partially addressed need for medulloPackage explanation -- placeholde…
envest Sep 3, 2025
e124c5a
Add explanation about permutation test p-value
envest Sep 3, 2025
09f3e96
Remove plots/targeted_random_full_kappa_accuracy.pdf
envest Sep 3, 2025
3107bad
Add interpretation to Kappa results table
envest Sep 3, 2025
7ad4256
Add ggridges
envest Oct 28, 2025
41e7e6b
Add lines connecting paired points, rearrange presentation of evidenc…
envest Oct 29, 2025
eb4c112
Re-render after updates
envest Oct 29, 2025
baf2c75
Merge from main, accepting all 'their' conflicting changes in renv.lock
envest Dec 23, 2025
5415945
renv snapshot
envest Dec 24, 2025
468ada3
Add coin to dependencies.R, install, and re-snapshot renv.lock
envest Dec 24, 2025
7cdac3c
Add seed for position_jitterdodge
envest Dec 24, 2025
ac6d012
Comment on medulloPackage gene pairs not overlapping with targeted ge…
envest Dec 24, 2025
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
21 changes: 21 additions & 0 deletions 06-targeted_gene_set.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,21 @@
#!/bin/bash
#
# Steven Foltz
# 2025
#
# Run targeted gene set experiments

set -euo pipefail

# Set the working directory to the directory of this file
cd "$(dirname "${BASH_SOURCE[0]}")"

# Set up directories
predict_dir="predict"
analysis_notebooks_dir="analysis_notebooks"

# Run targeted gene set model prediction
Rscript "${predict_dir}/predict_targeted_gene_set.R"

# Render notebook for targeted gene set analysis
Rscript -e "rmarkdown::render('${analysis_notebooks_dir}/targeted_gene_set.Rmd')"
464 changes: 464 additions & 0 deletions analysis_notebooks/targeted_gene_set.Rmd

Large diffs are not rendered by default.

4,185 changes: 4,185 additions & 0 deletions analysis_notebooks/targeted_gene_set.html

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions dependencies.R
Original file line number Diff line number Diff line change
Expand Up @@ -6,3 +6,4 @@ library(optparse)
library(medulloPackage)
library(tidyverse)
library(boot)
library(coin)
Binary file modified plots/baseline_accuracy.pdf
Binary file not shown.
Binary file modified plots/baseline_binary_TSP.pdf
Binary file not shown.
Binary file modified plots/baseline_figure.pdf
Binary file not shown.
Binary file modified plots/baseline_kappa.pdf
Binary file not shown.
311 changes: 311 additions & 0 deletions predict/predict_targeted_gene_set.R
Original file line number Diff line number Diff line change
@@ -0,0 +1,311 @@
# Train and test models to predict MB subgroups using a targeted gene set
# Steven Foltz, updated 2025-06

# set some variables that guide what downstream steps are followed
create_models <- TRUE # train new models (if FALSE, reads existing models from file)
overwrite <- TRUE # if create_models is also TRUE, overwrite existing models file
seed <- 44 # set initial seed for run_many_models()
n_repeats <- 10 # set number of times to repeat in run_many_models()
n_cores <- 10 # set number of cores to use in run_many_models()
ah_date <- "2022-10-30"

# Rationale

# Experiments that generate RNA expression data may use a targeted gene panel as a more cost-effective approach to analyze a subset of genes thought to have greater relevance for disease, especially in clinical settings.
# However, when a model is trained using whole transcriptome data, the genes selected for that model will generally show only partial or no overlap with a targeted gene panel.
# Even though kTSP and RF models can tolerate missing genes, prediction performance on new samples will likely suffer as a result of having fewer gene:gene comparisons to score.
# A model trained using a targeted panel of genes from the start could be applied to sample data that includes more genes, however there may be a decrease in model performance due to starting with a smaller feature space.
# This leads to two key questions:

# 1. Given a targeted gene set of cancer-related genes, how does model performance compare between targeted gene set models and full gene set models?
# 2. Does the identity of the genes in the targeted gene set matter? In other words, can a random gene set of the same size perform just as well as the targeted gene set?

# We will use the [Nanostring nCounter PanCancer IO 360 Panel](https://nanostring.com/products/ncounter-assays-panels/oncology/pancancer-io-360/) targeted gene set published by Nanostring, available for download [here](https://nanostring.com/products/ncounter-assays-panels/panel-selection-tool/) and found in this repository at `processed_data/NS_IO_360_v1.0_Genes.tsv`.

# Setup

# Source code and libraries
source(here::here("utils/convert_gene_names.R"))
source(here::here("utils/modeling.R"))

# Define directories and input/output file paths
processed_data_dir <- here::here("processed_data")
models_dir <- here::here("models")
random_dir <- here::here(models_dir, "targeted_gene_set_random")

bulk_genex_filepath <- file.path(processed_data_dir, "bulk_genex.tsv")
bulk_metadata_filepath <- file.path(processed_data_dir, "bulk_metadata.tsv")

baseline_filepath <- file.path(models_dir, "baseline.rds")
targeted_filepath <- file.path(models_dir, "targeted_gene_set.rds")

# check that existing targeted gene set model file will not be overwritten if overwrite is FALSE
if (create_models &
(file.exists(targeted_filepath)) &
!overwrite) {

stop("Targeted gene set model output file already exists and overwrite is set to FALSE in predict/predict_targeted_gene_set.R.")

}

# check that targeted file exist if create_models is FALSE
if (!create_models & !file.exists(targeted_filepath)) {

stop("Targeted model output file does not exist and create_models is set to FALSE in predict/predict_targeted_gene_set.R.")

}

# We will need this random gene set function to run inside tryCatch()
# Given a random index (this sets the seed), we pick a random
train_and_write_random_model <- function(random_index) {

# The targeted gene panel comprises n_targeted_genes_original genes.
# After converting from SYMBOL to ENSEMBL gene IDs and overlapping with our gene expression data, there are n_targeted_genes_overlap genes available for training using the targeted gene set.
# We use the same number of genes when selecting genes for the random gene set models.

# # select random genes from bulk genex df
set.seed(random_index) # randomly chosen seed
random_gene_set <- sample(row.names(bulk_genex_df),
size = n_targeted_genes_overlap,
replace = FALSE)

# subset bulk_genex_df to random gene set
random_gene_set_bulk_genex_df <- bulk_genex_df[random_gene_set, ]

# train and test models using random gene set
random_kTSP_RF_weighted_models_list <- run_many_models(
genex_df = random_gene_set_bulk_genex_df,
metadata_df = bulk_metadata_df,
labels = mb_subgroups,
model_types = c("ktsp", "rf"),
array_studies_for_training = "GSE37418",
initial_seed = seed,
n_repeats = n_repeats,
n_cores = n_cores,
ktsp_featureNo = 1000,
ktsp_n_rules_min = 5,
ktsp_n_rules_max = 50,
ktsp_weighted = TRUE,
rf_num.trees = 500,
rf_genes_altogether = 50,
rf_genes_one_vs_rest = 50,
rf_gene_repetition = 1,
rf_rules_altogether = 50,
rf_rules_one_vs_rest = 50,
rf_weighted = TRUE)

# Add _weighted to names of kTSP and RF list elements
random_kTSP_RF_weighted_models_list <- random_kTSP_RF_weighted_models_list |>
purrr::map(\(x) setNames(x,
dplyr::case_match(names(x),
"ktsp" ~ "ktsp_weighted",
"rf" ~ "rf_weighted",
.default = names(x))))

# Random gene sets for unweighted kTSP, RF, and lasso
random_kTSP_RF_lasso_unweighted_models_list <- run_many_models(
genex_df = random_gene_set_bulk_genex_df,
metadata_df = bulk_metadata_df,
labels = mb_subgroups,
model_types = c("ktsp", "rf", "lasso"),
array_studies_for_training = "GSE37418",
initial_seed = seed,
n_repeats = n_repeats,
n_cores = n_cores,
ktsp_featureNo = 1000,
ktsp_n_rules_min = 5,
ktsp_n_rules_max = 50,
ktsp_weighted = FALSE,
rf_num.trees = 500,
rf_genes_altogether = 50,
rf_genes_one_vs_rest = 50,
rf_gene_repetition = 1,
rf_rules_altogether = 50,
rf_rules_one_vs_rest = 50,
rf_weighted = FALSE)

# Add _unweighted to names of kTSP and RF list elements
random_kTSP_RF_lasso_unweighted_models_list <- random_kTSP_RF_lasso_unweighted_models_list |>
purrr::map(\(x) setNames(x,
dplyr::case_match(names(x),
"ktsp" ~ "ktsp_unweighted",
"rf" ~ "rf_unweighted",
.default = names(x))))

# combine random lists
random_list <- purrr::map2(
random_kTSP_RF_weighted_models_list,
random_kTSP_RF_lasso_unweighted_models_list,
c) |>
purrr::map(\(y) y[!duplicated(names(y))]) # remove duplicate list items

# write random list to random folder
random_filepath <- here::here(random_dir,
glue::glue("targeted_gene_set.random_",
random_index, ".rds"))

readr::write_rds(x = random_list,
file = random_filepath)

}

# set subgroups (canonical MB subgroups)
mb_subgroups <- c("G3", "G4", "SHH", "WNT")

# Read in essential data
bulk_genex_df <- readr::read_tsv(bulk_genex_filepath,
show_col_types = FALSE) |>
tibble::column_to_rownames(var = "gene")

bulk_metadata_df <- readr::read_tsv(bulk_metadata_filepath,
show_col_types = FALSE) |>
dplyr::filter(sample_accession %in% names(bulk_genex_df),
subgroup %in% mb_subgroups)

bulk_genex_df <- bulk_genex_df |>
dplyr::select(dplyr::all_of(bulk_metadata_df$sample_accession))

check_input_files(genex_df = bulk_genex_df,
metadata_df = bulk_metadata_df)

# targeted genes (Nanostring IO 360 gene panel)
targeted_genes_filepath <- file.path(processed_data_dir,
"NS_IO_360_v1.0_Genes.tsv")

targeted_genes_df <- readr::read_tsv(file = targeted_genes_filepath,
show_col_types = FALSE)

n_targeted_genes_original <- nrow(targeted_genes_df)

# convert targeted gene set from SYMBOL to ENSEMBL
targeted_genes_df <- targeted_genes_df |>
convert_gene_names(gene_column_before = "Gene",
gene_column_after = "gene",
map_from = "SYMBOL",
map_to = "ENSEMBL",
ah_date = ah_date)

# reduce bulk genex df to overlap with targeted genes
bulk_genex_df_targeted <- bulk_genex_df[row.names(bulk_genex_df) %in% targeted_genes_df$gene, ]

n_targeted_genes_overlap <- nrow(bulk_genex_df_targeted)

# convert overlapping targeted gene set from ENSEMBL to ENTREZID for MM2S
targeted_genes_df_ENTREZID <- bulk_genex_df_targeted |>
tibble::rownames_to_column(var = "gene") |>
convert_gene_names(gene_column_before = "gene",
gene_column_after = "gene",
map_from = "ENSEMBL",
map_to = "ENTREZID",
ah_date = ah_date)

# Train and test MB subgroup models using targeted gene set

### Create kTSP, RF, and LASSO models

# The following code creates kTSP (weighted and unweighted), RF (weighted and unweighted), and LASSO models using default settings.
# Supplying the same `initial_seed` to each instance of `run_many_models()` ensures the training/testing sets of each repeat are the same for each model.

if (create_models) {

# Targeted gene set for weighted kTSP and RF
targeted_kTSP_RF_weighted_models_list <- run_many_models(
genex_df = bulk_genex_df_targeted,
metadata_df = bulk_metadata_df,
labels = mb_subgroups,
model_types = c("ktsp", "rf"),
array_studies_for_training = "GSE37418",
initial_seed = seed,
n_repeats = n_repeats,
n_cores = n_cores,
ktsp_featureNo = 1000,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A lot of these values are repeated in 4(?) places. It might be helpful to gather them all at the top of the script.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea that's a lot of repetition... the values are either set at the top of the script or are the default in run_many_models(). The main differences between the four calls is the input genes set via genex_df = and what models / weighting are being used.

I haven't made any changes yet -- do you think it's better to just delete the values that are kept as default, or explicitly state them once at the top?

ktsp_n_rules_min = 5,
ktsp_n_rules_max = 50,
ktsp_weighted = TRUE,
rf_num.trees = 500,
rf_genes_altogether = 50,
rf_genes_one_vs_rest = 50,
rf_gene_repetition = 1,
rf_rules_altogether = 50,
rf_rules_one_vs_rest = 50,
rf_weighted = TRUE)

# Add _weighted to names of kTSP and RF list elements
targeted_kTSP_RF_weighted_models_list <- targeted_kTSP_RF_weighted_models_list |>
purrr::map(\(x) setNames(x,
dplyr::case_match(names(x),
"ktsp" ~ "ktsp_weighted",
"rf" ~ "rf_weighted",
.default = names(x))))

# Targeted gene set for unweighted kTSP, RF, and lasso
targeted_kTSP_RF_lasso_unweighted_models_list <- run_many_models(
genex_df = bulk_genex_df_targeted,
metadata_df = bulk_metadata_df,
labels = mb_subgroups,
model_types = c("ktsp", "rf", "lasso"),
array_studies_for_training = "GSE37418",
initial_seed = seed,
n_repeats = n_repeats,
n_cores = n_cores,
ktsp_featureNo = 1000,
ktsp_n_rules_min = 5,
ktsp_n_rules_max = 50,
ktsp_weighted = FALSE,
rf_num.trees = 500,
rf_genes_altogether = 50,
rf_genes_one_vs_rest = 50,
rf_gene_repetition = 1,
rf_rules_altogether = 50,
rf_rules_one_vs_rest = 50,
rf_weighted = FALSE)

# Add _unweighted to names of kTSP and RF list elements
targeted_kTSP_RF_lasso_unweighted_models_list <- targeted_kTSP_RF_lasso_unweighted_models_list |>
purrr::map(\(x) setNames(x,
dplyr::case_match(names(x),
"ktsp" ~ "ktsp_unweighted",
"rf" ~ "rf_unweighted",
.default = names(x))))

# combine targeted lists
targeted_list <- purrr::map2(targeted_kTSP_RF_weighted_models_list,
targeted_kTSP_RF_lasso_unweighted_models_list,
c) |>
purrr::map(\(x) x[!duplicated(names(x))]) # remove duplicate list items
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What duplicated list items would we expect?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Each list contains some metadata that is the same, so this removes those redundant items.


# write models to file
readr::write_rds(x = targeted_list,
file = targeted_filepath)

# Random gene sets for weighted kTSP and RF

# run train_and_write_random_model() until it succeeds 1000 times
# tryCatch() allows the random gene set to fail and then tries the next random index
# after 1000 successes, exit the while loop

n_success <- 0
random_index <- 0

while(n_success < 1000) {

random_index <- random_index + 1

tryCatch(
{

train_and_write_random_model(random_index)
n_success <- n_success + 1
print(glue::glue("Random index ", random_index, ": success"))

},
error = function(msg) {

print(glue::glue("Random index ", random_index, ": failure"))

}
)

}

}
Loading