-
Notifications
You must be signed in to change notification settings - Fork 0
Envest/45 reduced gene sets #128
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
efedbb6
1952a18
0f9a8c3
73d99db
4c97f46
c8949f1
410e474
345ed16
fa2fbf2
ef92d39
49f9f77
5971c2b
a27f88e
efc5678
6bd1e01
aad33df
7add33b
18169cb
1addf30
6ba9d30
7198581
73cc7e1
5f521f2
b5d7d16
147ddc9
756476f
560fa50
7f277c8
7fbadfe
8443668
3709d56
2735f1f
9534b96
37b6cd5
9cf496c
1176fb9
b2c60d9
e8e236d
a582d58
5fbdf21
41570ac
62cdf8e
5c0ae87
e124c5a
09f3e96
3107bad
7ad4256
41e7e6b
eb4c112
baf2c75
5415945
468ada3
7cdac3c
ac6d012
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,21 @@ | ||
| #!/bin/bash | ||
| # | ||
| # Steven Foltz | ||
| # 2025 | ||
| # | ||
| # Run targeted gene set experiments | ||
|
|
||
| set -euo pipefail | ||
|
|
||
| # Set the working directory to the directory of this file | ||
| cd "$(dirname "${BASH_SOURCE[0]}")" | ||
|
|
||
| # Set up directories | ||
| predict_dir="predict" | ||
| analysis_notebooks_dir="analysis_notebooks" | ||
|
|
||
| # Run targeted gene set model prediction | ||
| Rscript "${predict_dir}/predict_targeted_gene_set.R" | ||
|
|
||
| # Render notebook for targeted gene set analysis | ||
| Rscript -e "rmarkdown::render('${analysis_notebooks_dir}/targeted_gene_set.Rmd')" |
Large diffs are not rendered by default.
Large diffs are not rendered by default.
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -6,3 +6,4 @@ library(optparse) | |
| library(medulloPackage) | ||
| library(tidyverse) | ||
| library(boot) | ||
| library(coin) | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,311 @@ | ||
| # Train and test models to predict MB subgroups using a targeted gene set | ||
| # Steven Foltz, updated 2025-06 | ||
|
|
||
| # set some variables that guide what downstream steps are followed | ||
| create_models <- TRUE # train new models (if FALSE, reads existing models from file) | ||
| overwrite <- TRUE # if create_models is also TRUE, overwrite existing models file | ||
| seed <- 44 # set initial seed for run_many_models() | ||
| n_repeats <- 10 # set number of times to repeat in run_many_models() | ||
| n_cores <- 10 # set number of cores to use in run_many_models() | ||
| ah_date <- "2022-10-30" | ||
|
|
||
| # Rationale | ||
|
|
||
| # Experiments that generate RNA expression data may use a targeted gene panel as a more cost-effective approach to analyze a subset of genes thought to have greater relevance for disease, especially in clinical settings. | ||
| # However, when a model is trained using whole transcriptome data, the genes selected for that model will generally show only partial or no overlap with a targeted gene panel. | ||
| # Even though kTSP and RF models can tolerate missing genes, prediction performance on new samples will likely suffer as a result of having fewer gene:gene comparisons to score. | ||
| # A model trained using a targeted panel of genes from the start could be applied to sample data that includes more genes, however there may be a decrease in model performance due to starting with a smaller feature space. | ||
| # This leads to two key questions: | ||
|
|
||
| # 1. Given a targeted gene set of cancer-related genes, how does model performance compare between targeted gene set models and full gene set models? | ||
| # 2. Does the identity of the genes in the targeted gene set matter? In other words, can a random gene set of the same size perform just as well as the targeted gene set? | ||
|
|
||
| # We will use the [Nanostring nCounter PanCancer IO 360 Panel](https://nanostring.com/products/ncounter-assays-panels/oncology/pancancer-io-360/) targeted gene set published by Nanostring, available for download [here](https://nanostring.com/products/ncounter-assays-panels/panel-selection-tool/) and found in this repository at `processed_data/NS_IO_360_v1.0_Genes.tsv`. | ||
|
|
||
| # Setup | ||
|
|
||
| # Source code and libraries | ||
| source(here::here("utils/convert_gene_names.R")) | ||
| source(here::here("utils/modeling.R")) | ||
|
|
||
| # Define directories and input/output file paths | ||
| processed_data_dir <- here::here("processed_data") | ||
| models_dir <- here::here("models") | ||
| random_dir <- here::here(models_dir, "targeted_gene_set_random") | ||
|
|
||
| bulk_genex_filepath <- file.path(processed_data_dir, "bulk_genex.tsv") | ||
| bulk_metadata_filepath <- file.path(processed_data_dir, "bulk_metadata.tsv") | ||
|
|
||
| baseline_filepath <- file.path(models_dir, "baseline.rds") | ||
| targeted_filepath <- file.path(models_dir, "targeted_gene_set.rds") | ||
|
|
||
| # check that existing targeted gene set model file will not be overwritten if overwrite is FALSE | ||
| if (create_models & | ||
| (file.exists(targeted_filepath)) & | ||
| !overwrite) { | ||
|
|
||
| stop("Targeted gene set model output file already exists and overwrite is set to FALSE in predict/predict_targeted_gene_set.R.") | ||
|
|
||
| } | ||
|
|
||
| # check that targeted file exist if create_models is FALSE | ||
| if (!create_models & !file.exists(targeted_filepath)) { | ||
|
|
||
| stop("Targeted model output file does not exist and create_models is set to FALSE in predict/predict_targeted_gene_set.R.") | ||
|
|
||
| } | ||
|
|
||
| # We will need this random gene set function to run inside tryCatch() | ||
| # Given a random index (this sets the seed), we pick a random | ||
| train_and_write_random_model <- function(random_index) { | ||
|
|
||
| # The targeted gene panel comprises n_targeted_genes_original genes. | ||
| # After converting from SYMBOL to ENSEMBL gene IDs and overlapping with our gene expression data, there are n_targeted_genes_overlap genes available for training using the targeted gene set. | ||
| # We use the same number of genes when selecting genes for the random gene set models. | ||
|
|
||
| # # select random genes from bulk genex df | ||
| set.seed(random_index) # randomly chosen seed | ||
| random_gene_set <- sample(row.names(bulk_genex_df), | ||
| size = n_targeted_genes_overlap, | ||
| replace = FALSE) | ||
|
|
||
| # subset bulk_genex_df to random gene set | ||
| random_gene_set_bulk_genex_df <- bulk_genex_df[random_gene_set, ] | ||
|
|
||
| # train and test models using random gene set | ||
| random_kTSP_RF_weighted_models_list <- run_many_models( | ||
| genex_df = random_gene_set_bulk_genex_df, | ||
| metadata_df = bulk_metadata_df, | ||
| labels = mb_subgroups, | ||
| model_types = c("ktsp", "rf"), | ||
| array_studies_for_training = "GSE37418", | ||
| initial_seed = seed, | ||
| n_repeats = n_repeats, | ||
| n_cores = n_cores, | ||
| ktsp_featureNo = 1000, | ||
| ktsp_n_rules_min = 5, | ||
| ktsp_n_rules_max = 50, | ||
| ktsp_weighted = TRUE, | ||
| rf_num.trees = 500, | ||
| rf_genes_altogether = 50, | ||
| rf_genes_one_vs_rest = 50, | ||
| rf_gene_repetition = 1, | ||
| rf_rules_altogether = 50, | ||
| rf_rules_one_vs_rest = 50, | ||
| rf_weighted = TRUE) | ||
|
|
||
| # Add _weighted to names of kTSP and RF list elements | ||
| random_kTSP_RF_weighted_models_list <- random_kTSP_RF_weighted_models_list |> | ||
| purrr::map(\(x) setNames(x, | ||
| dplyr::case_match(names(x), | ||
| "ktsp" ~ "ktsp_weighted", | ||
| "rf" ~ "rf_weighted", | ||
| .default = names(x)))) | ||
|
|
||
| # Random gene sets for unweighted kTSP, RF, and lasso | ||
| random_kTSP_RF_lasso_unweighted_models_list <- run_many_models( | ||
| genex_df = random_gene_set_bulk_genex_df, | ||
| metadata_df = bulk_metadata_df, | ||
| labels = mb_subgroups, | ||
| model_types = c("ktsp", "rf", "lasso"), | ||
| array_studies_for_training = "GSE37418", | ||
| initial_seed = seed, | ||
| n_repeats = n_repeats, | ||
| n_cores = n_cores, | ||
| ktsp_featureNo = 1000, | ||
| ktsp_n_rules_min = 5, | ||
| ktsp_n_rules_max = 50, | ||
| ktsp_weighted = FALSE, | ||
| rf_num.trees = 500, | ||
| rf_genes_altogether = 50, | ||
| rf_genes_one_vs_rest = 50, | ||
| rf_gene_repetition = 1, | ||
| rf_rules_altogether = 50, | ||
| rf_rules_one_vs_rest = 50, | ||
| rf_weighted = FALSE) | ||
|
|
||
| # Add _unweighted to names of kTSP and RF list elements | ||
| random_kTSP_RF_lasso_unweighted_models_list <- random_kTSP_RF_lasso_unweighted_models_list |> | ||
| purrr::map(\(x) setNames(x, | ||
| dplyr::case_match(names(x), | ||
| "ktsp" ~ "ktsp_unweighted", | ||
| "rf" ~ "rf_unweighted", | ||
| .default = names(x)))) | ||
|
|
||
| # combine random lists | ||
| random_list <- purrr::map2( | ||
| random_kTSP_RF_weighted_models_list, | ||
| random_kTSP_RF_lasso_unweighted_models_list, | ||
| c) |> | ||
| purrr::map(\(y) y[!duplicated(names(y))]) # remove duplicate list items | ||
|
|
||
| # write random list to random folder | ||
| random_filepath <- here::here(random_dir, | ||
| glue::glue("targeted_gene_set.random_", | ||
| random_index, ".rds")) | ||
|
|
||
| readr::write_rds(x = random_list, | ||
| file = random_filepath) | ||
|
|
||
| } | ||
|
|
||
| # set subgroups (canonical MB subgroups) | ||
| mb_subgroups <- c("G3", "G4", "SHH", "WNT") | ||
|
|
||
| # Read in essential data | ||
| bulk_genex_df <- readr::read_tsv(bulk_genex_filepath, | ||
| show_col_types = FALSE) |> | ||
| tibble::column_to_rownames(var = "gene") | ||
|
|
||
| bulk_metadata_df <- readr::read_tsv(bulk_metadata_filepath, | ||
| show_col_types = FALSE) |> | ||
| dplyr::filter(sample_accession %in% names(bulk_genex_df), | ||
| subgroup %in% mb_subgroups) | ||
|
|
||
| bulk_genex_df <- bulk_genex_df |> | ||
| dplyr::select(dplyr::all_of(bulk_metadata_df$sample_accession)) | ||
|
|
||
| check_input_files(genex_df = bulk_genex_df, | ||
| metadata_df = bulk_metadata_df) | ||
|
|
||
| # targeted genes (Nanostring IO 360 gene panel) | ||
| targeted_genes_filepath <- file.path(processed_data_dir, | ||
| "NS_IO_360_v1.0_Genes.tsv") | ||
|
|
||
| targeted_genes_df <- readr::read_tsv(file = targeted_genes_filepath, | ||
| show_col_types = FALSE) | ||
|
|
||
| n_targeted_genes_original <- nrow(targeted_genes_df) | ||
|
|
||
| # convert targeted gene set from SYMBOL to ENSEMBL | ||
| targeted_genes_df <- targeted_genes_df |> | ||
| convert_gene_names(gene_column_before = "Gene", | ||
| gene_column_after = "gene", | ||
| map_from = "SYMBOL", | ||
| map_to = "ENSEMBL", | ||
| ah_date = ah_date) | ||
|
|
||
| # reduce bulk genex df to overlap with targeted genes | ||
| bulk_genex_df_targeted <- bulk_genex_df[row.names(bulk_genex_df) %in% targeted_genes_df$gene, ] | ||
|
|
||
| n_targeted_genes_overlap <- nrow(bulk_genex_df_targeted) | ||
|
|
||
| # convert overlapping targeted gene set from ENSEMBL to ENTREZID for MM2S | ||
| targeted_genes_df_ENTREZID <- bulk_genex_df_targeted |> | ||
| tibble::rownames_to_column(var = "gene") |> | ||
| convert_gene_names(gene_column_before = "gene", | ||
| gene_column_after = "gene", | ||
| map_from = "ENSEMBL", | ||
| map_to = "ENTREZID", | ||
| ah_date = ah_date) | ||
|
|
||
| # Train and test MB subgroup models using targeted gene set | ||
|
|
||
| ### Create kTSP, RF, and LASSO models | ||
|
|
||
| # The following code creates kTSP (weighted and unweighted), RF (weighted and unweighted), and LASSO models using default settings. | ||
| # Supplying the same `initial_seed` to each instance of `run_many_models()` ensures the training/testing sets of each repeat are the same for each model. | ||
|
|
||
| if (create_models) { | ||
|
|
||
| # Targeted gene set for weighted kTSP and RF | ||
| targeted_kTSP_RF_weighted_models_list <- run_many_models( | ||
| genex_df = bulk_genex_df_targeted, | ||
| metadata_df = bulk_metadata_df, | ||
| labels = mb_subgroups, | ||
| model_types = c("ktsp", "rf"), | ||
| array_studies_for_training = "GSE37418", | ||
| initial_seed = seed, | ||
| n_repeats = n_repeats, | ||
| n_cores = n_cores, | ||
| ktsp_featureNo = 1000, | ||
| ktsp_n_rules_min = 5, | ||
| ktsp_n_rules_max = 50, | ||
| ktsp_weighted = TRUE, | ||
| rf_num.trees = 500, | ||
| rf_genes_altogether = 50, | ||
| rf_genes_one_vs_rest = 50, | ||
| rf_gene_repetition = 1, | ||
| rf_rules_altogether = 50, | ||
| rf_rules_one_vs_rest = 50, | ||
| rf_weighted = TRUE) | ||
|
|
||
| # Add _weighted to names of kTSP and RF list elements | ||
| targeted_kTSP_RF_weighted_models_list <- targeted_kTSP_RF_weighted_models_list |> | ||
| purrr::map(\(x) setNames(x, | ||
| dplyr::case_match(names(x), | ||
| "ktsp" ~ "ktsp_weighted", | ||
| "rf" ~ "rf_weighted", | ||
| .default = names(x)))) | ||
|
|
||
| # Targeted gene set for unweighted kTSP, RF, and lasso | ||
| targeted_kTSP_RF_lasso_unweighted_models_list <- run_many_models( | ||
| genex_df = bulk_genex_df_targeted, | ||
| metadata_df = bulk_metadata_df, | ||
| labels = mb_subgroups, | ||
| model_types = c("ktsp", "rf", "lasso"), | ||
| array_studies_for_training = "GSE37418", | ||
| initial_seed = seed, | ||
| n_repeats = n_repeats, | ||
| n_cores = n_cores, | ||
| ktsp_featureNo = 1000, | ||
| ktsp_n_rules_min = 5, | ||
| ktsp_n_rules_max = 50, | ||
| ktsp_weighted = FALSE, | ||
| rf_num.trees = 500, | ||
| rf_genes_altogether = 50, | ||
| rf_genes_one_vs_rest = 50, | ||
| rf_gene_repetition = 1, | ||
| rf_rules_altogether = 50, | ||
| rf_rules_one_vs_rest = 50, | ||
| rf_weighted = FALSE) | ||
|
|
||
| # Add _unweighted to names of kTSP and RF list elements | ||
| targeted_kTSP_RF_lasso_unweighted_models_list <- targeted_kTSP_RF_lasso_unweighted_models_list |> | ||
| purrr::map(\(x) setNames(x, | ||
| dplyr::case_match(names(x), | ||
| "ktsp" ~ "ktsp_unweighted", | ||
| "rf" ~ "rf_unweighted", | ||
| .default = names(x)))) | ||
|
|
||
| # combine targeted lists | ||
| targeted_list <- purrr::map2(targeted_kTSP_RF_weighted_models_list, | ||
| targeted_kTSP_RF_lasso_unweighted_models_list, | ||
| c) |> | ||
| purrr::map(\(x) x[!duplicated(names(x))]) # remove duplicate list items | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What duplicated list items would we expect?
Collaborator
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Each list contains some metadata that is the same, so this removes those redundant items. |
||
|
|
||
| # write models to file | ||
| readr::write_rds(x = targeted_list, | ||
| file = targeted_filepath) | ||
|
|
||
| # Random gene sets for weighted kTSP and RF | ||
|
|
||
| # run train_and_write_random_model() until it succeeds 1000 times | ||
| # tryCatch() allows the random gene set to fail and then tries the next random index | ||
| # after 1000 successes, exit the while loop | ||
|
|
||
| n_success <- 0 | ||
| random_index <- 0 | ||
|
|
||
| while(n_success < 1000) { | ||
|
|
||
| random_index <- random_index + 1 | ||
|
|
||
| tryCatch( | ||
| { | ||
|
|
||
| train_and_write_random_model(random_index) | ||
| n_success <- n_success + 1 | ||
| print(glue::glue("Random index ", random_index, ": success")) | ||
|
|
||
| }, | ||
| error = function(msg) { | ||
|
|
||
| print(glue::glue("Random index ", random_index, ": failure")) | ||
|
|
||
| } | ||
| ) | ||
|
|
||
| } | ||
|
|
||
| } | ||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A lot of these values are repeated in 4(?) places. It might be helpful to gather them all at the top of the script.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea that's a lot of repetition... the values are either set at the top of the script or are the default in
run_many_models(). The main differences between the four calls is the input genes set viagenex_df =and what models / weighting are being used.I haven't made any changes yet -- do you think it's better to just delete the values that are kept as default, or explicitly state them once at the top?