Skip to content

This project demonstrates step-by-step analysis of Bulk RNA-Seq data, starting from raw SRA files to obtaining differentially expressed genes (DEGs).

Notifications You must be signed in to change notification settings

SamantGayatri2002/Bulk-RNA-Seq-Data-Analysis

Repository files navigation

Bulk RNA-Seq Data Analysis – Hypoxia Response in LNCaP and PC3 Cell Lines

🎯 Project Overview

This project demonstrates step-by-step analysis of Bulk RNA-Seq data, starting from raw SRA files to obtaining differentially expressed genes (DEGs).
The dataset help us to investigate how hypoxia (low oxygen conditions) alters gene expression in two prostate cancer cell lines:

1.LNCaP (Lymph Node Carcinoma of the Prostate)
Origin: Derived from a lymph node metastasis of a human prostate adenocarcinoma (prostate cancer spread to the lymph node).

2.PC3 (Prostate Cancer 3)
Origin: Derived from a bone metastasis of a grade IV prostate adenocarcinoma.

Here we comparing Normoxia vs Hypoxia in each cell line.

πŸ“Œ Objectives

  • To learn the basic workflow of Bulk RNA-Seq analysis.
  • To perform quality control, alignment, quantification, and differential expression analysis.
  • To visualize results using plots such as PCA, heatmaps, and volcano plots and pathway enrichment analysis of DEGs.
  • To document the process in a way that is clear for beginners to reproduce.

πŸ“‚ Dataset Information

Dataset Source: GSE106305 (NCBI GEO) (Guo et al., Nature Communications, 2019)

How the raw data is organized in the database

  • Each biological sample (e.g., LNCaP Normoxia replicate 1) is not stored as one file.
  • Instead, it is split into multiple technical runs (SRR IDs) on the SRA database.

For example:
LNCaP Normoxia Replicate 1 (GSM3145509) is split across 4 SRR files: SRR7179504, SRR7179505, SRR7179506, SRR7179507.

This is common in sequencing experiments, where one sample is sequenced in multiple lanes/runs.

What we downloaded

  • In total, we downloaded 20 SRR files.
SRR7179504, SRR7179505, SRR7179506, SRR7179507
SRR7179508, SRR7179509, SRR7179510, SRR7179511
SRR7179520, SRR7179521, SRR7179522, SRR7179523
SRR7179524, SRR7179525, SRR7179526, SRR7179527
SRR7179536, SRR7179537, SRR7179540, SRR7179541
  • These represent 8 biological samples (2 replicates for each condition).

βš™οΈ Steps to be followed

1️⃣ Data Download & Conversion

  • Install SRA Toolkit
  • Download raw .sra files from NCBI using accession numbers
  • Convert .sra β†’ compressed .fastq.gz files
  • (Optional) Automate multiple downloads with a Python script

Output: FASTQ files (raw sequencing reads) image


2️⃣ Quality Control (QC)

  • Run FastQC on all FASTQ files to check:
    • Per-base quality scores
    • GC content
    • Adapter contamination
    • Overrepresented sequences
  • Use MultiQC to summarize results into one report

Output: HTML reports showing sequencing quality Screenshot 2025-08-24 151055


3️⃣ Read Trimming (if needed)

  • Use Trimmomatic or fastp to remove:
    • Low-quality bases
    • Adapter contamination
  • Re-run FastQC to confirm improvements

Output: Cleaned FASTQ files

image

4️⃣ Sample Preparation

  • Merge multiple technical replicates (several SRR files per sample) into one file
  • Rename files clearly (e.g., LNCAP_Normoxia_S1.fastq.gz)

Output: 8 final FASTQ files (one per replicate) 22222222222222


5️⃣ Alignment to Reference Genome

  • Download HISAT2 prebuilt genome index for GRCh38 (Human)
  • Install HISAT2 + Samtools
  • Align each FASTQ to the reference genome β†’ SAM/BAM files
  • Sort & Index BAM files for downstream use

Output: Aligned, sorted, and indexed BAM files Screenshot 2025-08-24 152117


6️⃣ Quantification of Gene Expression

  • Use featureCounts (Subread package) to assign reads to genes
  • Input: BAM files + GRCh38 GTF annotation
  • Output: Count matrix (genes Γ— samples)

Result: raw counts of all the samples which is then used to generate final count matrix using python script. Screenshot 2025-08-24 152222


7️⃣ Post-alignment Quality Check

  • Use Qualimap RNA-seq to evaluate:
    • Mapping statistics
    • Coverage
    • Strand specificity
  • Ensure all samples pass QC before moving forward

Output: QC reports for aligned reads Screenshot 2025-08-23 202801


8️⃣ Differential Expression Analysis (in R)

  • Import counts matrix into R (DESeq2 workflow)
  • Normalise read counts (to remove library size bias)
  • Perform differential expression testing:
    • Normoxia vs Hypoxia (for each cell line separately)
  • Visualisation and pathway enrichment

Output:

  • List of significantly up/downregulated genes
  • plots (PCA plots,Volcano plots,Heatmaps)
  • Pathway enrichment analysis

πŸ—‚οΈ Recommended Folder Layout

Bulk_RNA_Seq_Analysis/
β”œβ”€ fastq/ # FASTQs (raw + merged/renamed)
β”œβ”€ fastqc_results/ # FastQC HTMLs
β”œβ”€ multiqc_report/ # MultiQC summary
β”œβ”€ alignedreads/ # BAM + BAI
β”œβ”€ quants/ # featureCounts outputs
β”œβ”€ rnaseq_qc_results/ # Qualimap outputs

🧭 End-to-End Workflow β€” Commands + Explanations

Step 0 : Creating environment and downloading SRA Toolkit

It is recommended to work in a clean environment so that all tools are properly installed and version conflicts are avoided.
You can use Conda (preferred) or your system package manager (APT) to install the required tools.

πŸ“¦ Using Conda (recommended)

# (optional) create and activate a dedicated environment
conda create -n rnaseq python=3.9 -y
conda activate rnaseq

# go to your project directory
mkdir Bulk_RNA_Seq_Analysis
cd Bulk_RNA_Seq_Analysis

# install SRA Toolkit
conda install -c conda-forge -c bioconda sra-tools=3.2.1

# verify installation
which prefetch
which fastq-dump

πŸ–₯️ Using APT (if Conda is not available)

sudo apt update
sudo apt install sra-toolkit

ℹ️ Why this step?

  • prefetch β†’ downloads .sra files from NCBI
  • fastq-dump β†’ converts .sra β†’ .fastq.gz for downstream analysis

Step 1: Download and Convert SRA Files to FASTQ

In this step, we will download sequencing data from the NCBI SRA database and convert them to FASTQ format for downstream analysis.

1.1 Manual Download Example (Single SRR)

To download and convert a single SRA run (e.g., SRR7179504), use the following commands:

# Download the SRA file
prefetch SRR7179504

# Convert SRA to FASTQ format
fastq-dump --outdir fastq --gzip --skip-technical --readids \
--read-filter pass --dumpbase --split-3 --clip \
SRR7179504/SRR7179504.sra

Note: This command generates compressed FASTQ files in the fastq/ directory and ensures only high-quality reads are retained.

1.2 Automate for Multiple SRR IDs

Processing multiple runs manually is time-consuming. Use a Python script to automate the download and conversion for all SRR IDs.

  1. Check Python installation:
python3 --version || (sudo apt update && sudo apt install python3 -y)
  1. Run the automation script:
python3 fastq_download.py

Why automation matters: Looping through all SRR IDs saves time and ensures that all sequencing runs (e.g., 20 runs) are processed efficiently without manual intervention.

Step 2: Raw Read Quality Control (FastQC β†’ MultiQC)

Before alignment, it’s critical to check the quality of raw reads to identify issues such as adapter contamination, low-quality bases, or GC bias.

2.1 Install Tools

I recommend installing via Conda (preferred), but APT or pip can also be used.

Option 1: Conda (Recommended)

conda install -c bioconda fastqc
conda install -c bioconda multiqc

Option 2: APT (System-wide)

sudo apt update && sudo apt install fastqc multiqc

Option 3: pip (Latest MultiQC)

pip install multiqc

2.2 Run Quality Control

  1. Run FastQC on all FASTQ files
mkdir -p fastqc_results
fastqc fastq/*.fastq.gz -o fastqc_results/ --threads 8
  1. Aggregate results with MultiQC
multiqc fastqc_results/ -o multiqc_report/

Why this step?

  • βœ… Confirms sequencing quality
  • βœ… Detects adapter contamination
  • βœ… Identifies GC bias and other systematic issues
  • βœ… Provides both per-sample (FastQC) and summarized (MultiQC) reports

Step 3: (Optional) Read Trimming with Trimmomatic

Trimming can remove low-quality bases or adapter contamination. In this project, trimming was tested but since the aligner used here can better handle this things,untrimmed reads were used for downstream analysis.

3.1 Install Trimmomatic

Trimmomatic requires Java. Install Java first, then Trimmomatic:

# Check Java or install if missing
java -version || (sudo apt update && sudo apt install default-jre -y)

# Install Trimmomatic via Conda
conda install -c bioconda trimmomatic

3.2 Example: Trim a Single File

Here I demonstrate trimming on one FASTQ file (SRR7179504_pass.fastq.gz):

trimmomatic SE -threads 4 -phred33 \
  fastq/SRR7179504_pass.fastq.gz \
  fastq/SRR7179504_trimmed.fastq.gz \
  TRAILING:10

3.3 Re-check QC

After trimming, re-run FastQC on the trimmed file:

fastqc fastq/SRR7179504_trimmed.fastq.gz

Why this step?

  • πŸ”Ή Removes low-quality bases from read ends
  • πŸ”Ή Can reduce adapter contamination
  • πŸ”Ή Ensures cleaner input for alignment

Note: In our dataset, trimming did not materially improve QC

Step 4: Merge Technical Runs & Rename Samples

Concatenate LNCaP technical runs β†’ replicates and rename PC3 runs:

cat SRR7179504_pass.fastq.gz SRR7179505_pass.fastq.gz SRR7179506_pass.fastq.gz SRR7179507_pass.fastq.gz > LNCAP_Normoxia_S1.fastq.gz
cat SRR7179508_pass.fastq.gz SRR7179509_pass.fastq.gz SRR7179510_pass.fastq.gz SRR7179511_pass.fastq.gz > LNCAP_Normoxia_S2.fastq.gz
cat SRR7179520_pass.fastq.gz SRR7179521_pass.fastq.gz SRR7179522_pass.fastq.gz SRR7179523_pass.fastq.gz > LNCAP_Hypoxia_S1.fastq.gz
cat SRR7179524_pass.fastq.gz SRR7179525_pass.fastq.gz SRR7179526_pass.fastq.gz SRR7179527_pass.fastq.gz > LNCAP_Hypoxia_S2.fastq.gz
mv SRR7179536_pass.fastq.gz PC3_Normoxia_S1.fastq.gz
mv SRR7179537_pass.fastq.gz PC3_Normoxia_S2.fastq.gz
mv SRR7179540_pass.fastq.gz PC3_Hypoxia_S1.fastq.gz
mv SRR7179541_pass.fastq.gz PC3_Hypoxia_S2.fastq.gz
# After verifying the new files exist and sizes look correct:
rm -rf SRR*

πŸ“‘ Merging 20 SRR Runs β†’ 8 Final FASTQ Samples gives below result

Cell line Condition Replicate SRR runs merged Final FASTQ
LNCaP Normoxia Rep1 504, 505, 506, 507 LNCAP_Normoxia_S1.fastq.gz
LNCaP Normoxia Rep2 508, 509, 510, 511 LNCAP_Normoxia_S2.fastq.gz
LNCaP Hypoxia Rep1 520, 521, 522, 523 LNCAP_Hypoxia_S1.fastq.gz
LNCaP Hypoxia Rep2 524, 525, 526, 527 LNCAP_Hypoxia_S2.fastq.gz
PC3 Normoxia Rep1 536 PC3_Normoxia_S1.fastq.gz
PC3 Normoxia Rep2 537 PC3_Normoxia_S2.fastq.gz
PC3 Hypoxia Rep1 540 PC3_Hypoxia_S1.fastq.gz
PC3 Hypoxia Rep2 541 PC3_Hypoxia_S2.fastq.gz

Why merging?

  • βœ… Ensures that all reads from the same biological replicate are analyzed together
  • βœ… Prevents splitting of replicate data across multiple files
  • βœ… Simplifies downstream processing (alignment, quantification, etc.)

After merging, we also renamed the files for clarity and consistency.

Step 5: Reference Genome & Annotation

To perform alignment and gene quantification, we need a reference genome index (for HISAT2) and a gene annotation file (GTF).

5.1 Download Prebuilt HISAT2 Index

HISAT2 requires a prebuilt index of the reference genome. Download and extract:

wget https://genome-idx.s3.amazonaws.com/hisat/grch38_genome.tar.gz
tar -xvzf grch38_genome.tar.gz

5.2 Install Alignment & BAM Tools

Install HISAT2 (aligner) and Samtools (for handling BAM files):

sudo apt install hisat2
sudo apt install samtools

5.3 Download Gene Annotation (Ensembl GTF)

Download and extract the Ensembl GTF annotation file:

wget https://ftp.ensembl.org/pub/release-114/gtf/homo_sapiens/Homo_sapiens.GRCh38.114.gtf.gz
gunzip Homo_sapiens.GRCh38.114.gtf.gz

Why this step?

  • βœ… HISAT2 requires a reference genome index to align reads
  • βœ… featureCounts and other quantification tools need a gene annotation (GTF) to assign reads to genes

Step 6: Alignment (HISAT2 β†’ Samtools)

In this step, we align the quality-checked FASTQ reads to the reference genome using HISAT2, and then process the results into sorted and indexed BAM files with Samtools.

6.1 Single-Sample Example

The following command aligns one FASTQ file (LNCAP_Hypoxia_S1.fastq.gz) and outputs a sorted and indexed BAM file:

hisat2 -q -x grch38/genome -U fastq/LNCAP_Hypoxia_S1.fastq.gz | \
  samtools sort -o alignedreads/LNCAP_Hypoxia_S1.bam

samtools index alignedreads/LNCAP_Hypoxia_S1.bam

6.2 Automating the Alignment (Multiple Samples)

For multiple samples, use the provided helper script:

chmod +x hisat2_alignment1.sh
./hisat2_alignment1.sh

This script automates alignment for 7 samples; one additional sample was processed manually.

Notes

⚠️ If samtools index fails when piped, run it separately after sorting completes. βœ… Result: Sorted and indexed BAM files stored in the alignedreads/ directory, ready for downstream analysis.

Step 7: Quantification (featureCounts)

After alignment, we need to count how many reads map to each gene.
We use featureCounts (part of the Subread package) to generate a gene-by-sample count matrix, which is essential for differential expression (DE) analysis.

7.1 Install Subread

Install Subread and check that featureCounts is available:

sudo apt-get update
sudo apt-get install subread
featureCounts -v

7.2 Prepare Output Folder

Create a folder for storing quantification results:

mkdir -p quants

7.3 Count Reads (Single BAM Example)

Run featureCounts on one BAM file (tmp.bam) with strand-specific mode (-S 2):

featureCounts -S 2 -a Homo_sapiens.GRCh38.114.gtf \
  -o quants/featurecounts.txt tmp.bam

7.4 Automate for All BAM Files

To automate the process of quantification, use the provided script:

chmod +x featurecounts.sh
./featurecounts.sh

Why this step?

  • βœ… Converts aligned reads into per-gene read counts
  • βœ… Produces a gene Γ— sample count matrix
  • βœ… Provides the featurecounts which can be used to create count matrix ,which is the input required for downstream differential expression analysis (DEA)

Step 8: QC of Aligned Reads (Qualimap RNA-seq)

After alignment and quantification, it is important to assess the quality of aligned reads.
We use Qualimap to evaluate mapping quality, coverage, and biases.

8.1 Install Qualimap

Download manually or install via Conda:

Option 1: Manual Download

wget https://bitbucket.org/kokonech/qualimap/downloads/qualimap_v2.2.2.zip
unzip qualimap_v2.2.2.zip
cd qualimap_v2.2.2
chmod +x qualimap

Option 2: Conda Install

conda install bioconda::qualimap

8.2 Per-Sample QC Example

Run QC on one BAM file (LNCAP_Hypoxia_S1.bam):

./qualimap_v2.3/qualimap rnaseq \
  -bam alignedreads/LNCAP_Hypoxia_S1.bam \
  -gtf Homo_sapiens.GRCh38.114.gtf \
  -outdir rnaseq_qc_results \
  --java-mem-size=8G

8.3 Automate QC for All Samples

For batch QC, use the helper script:

chmod +x run_qualimap.sh
./run_qualimap.sh

Why this step?

  • βœ… Validates mapping percentage
  • βœ… Checks coverage distribution
  • βœ… Detects gene-body bias
  • βœ… Confirms library strandedness

Step 9: Build the Counts Matrix

After quantification, each sample has its own featureCounts output.
In this step, we merge all per-sample counts into a single counts matrix (genes Γ— samples).

9.1 Merge Counts

Use the provided Jupyter Notebook to combine featureCounts outputs:

countsmatrix.ipynb

This notebook consolidates all per-sample counts into one matrix where:

Rows = genes Columns = samples


9.2 DESeq2_analysis using R/Rstudio

Once the counts matrix is built, proceed to R / RStudio, and folllow up the .Rmd file provided for downstream analysis with DESeq2:

  • βœ… Normalization
  • βœ… Differentially expressed genes (DEGs)
  • βœ… PCA plots, Volcano plots, Heatmap
  • βœ… Pathway ebnrichment analysis

πŸ› οΈ Tools & Software Used

  • SRA Toolkit – Download sequencing data
  • FastQC + MultiQC – Quality control
  • Trimmomatic / fastp – Read trimming (optional)
  • HISAT2 – Alignment to reference genome
  • Samtools – BAM sorting and indexing
  • featureCounts – Read quantification
  • Qualimap – QC for aligned reads
  • R (DESeq2, ggplot2, pheatmap) – Statistical analysis ,and visualization

πŸ“š Learning Outcomes

  • Learned how to process raw RNA-Seq data step by step
  • Understood the importance of QC at every stage
  • Experienced handling large datasets and merging technical replicates
  • Performed differential expression analysis using DESeq2
  • Gained insights into how hypoxia impacts prostate cancer cell lines

πŸ‘©β€πŸ« Mentor

This work was carried out under the guidance of "Smriti Arora"

For questions or collaborations:

πŸ“¬ Contact
Author: Gayatri Sunil Samant
Email: [email protected]

About

This project demonstrates step-by-step analysis of Bulk RNA-Seq data, starting from raw SRA files to obtaining differentially expressed genes (DEGs).

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages