Skip to content

Conversation

ShixiangWang
Copy link

I was trying to obtain STAR gene expression data from multiple datasets including TCGA, TARGET, and CPTAC. Some errors happened when preparing some projects, and they are mainly related to duplicated records, especially in the CPTAC-3. I tested the code and made a workaround. This PR also amended some data checks to make sure the data preparation process went smoothly.

o Preparing output
-------------------
Downloading data for project CPTAC-3
Of the 2340 files for download 2340 already exist.
All samples have been already downloaded
Removing duplicated cases (with older updated time)
 => 41 records removed

Code to query the data.

library(TCGAbiolinks)
#devtools::load_all("/Volumes/EPan/Repos/TCGAbiolinks/")
#install.packages("/Volumes/EPan/Repos/TCGAbiolinks/", repos = NULL, type = "source")

prjs = TCGAbiolinks::getGDCprojects()
tcga_prjs = prjs$project_id[startsWith(prjs$project_id, "TCGA")]
targ_prjs = prjs$project_id[startsWith(prjs$project_id, "TARGET")]
cpta_prjs = prjs$project_id[startsWith(prjs$project_id, "CPTAC")]

# mRNA pipeline: https://gdc-docs.nci.nih.gov/Data/Bioinformatics_Pipelines/Expression_mRNA_Pipeline/

dir.create("data")
dir.create("data/count")
dir.create("data/tpm")

# GDC data download and prepare ------------------------------------------------

for (i in c(tcga_prjs, targ_prjs, cpta_prjs)) {
  message("Handling ", i)
  # if (file.exists(glue::glue("data/tpm/{i}.exp.tpm.rds"))) {
  #   message("handled already, skipping...")
  #   next()
  # }
  query.exp.hg38 <- GDCquery(
    project = i,
    data.category = "Transcriptome Profiling",
    data.type = "Gene Expression Quantification",
    workflow.type = "STAR - Counts"
  )

  GDCdownload(query.exp.hg38)
  #debug(GDCprepare)
  #debug(readTranscriptomeProfiling)
  #debug(colDataPrepare)
  expdat <- GDCprepare(
    query = query.exp.hg38
  )
  gc()

  message("saving...")
  saveRDS(SummarizedExperiment::assays(expdat)[["unstranded"]], file = glue::glue("data/count/{i}.exp.count.rds"))
  saveRDS(SummarizedExperiment::assays(expdat)[["tpm_unstrand"]], file = glue::glue("data/tpm/{i}.exp.tpm.rds"))
  message("done")
  gc()
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant