This project demonstrates step-by-step analysis of Bulk RNA-Seq data, starting from raw SRA files to obtaining differentially expressed genes (DEGs).
The dataset help us to investigate how hypoxia (low oxygen conditions) alters gene expression in two prostate cancer cell lines:
1.LNCaP (Lymph Node Carcinoma of the Prostate)
Origin: Derived from a lymph node metastasis of a human prostate adenocarcinoma (prostate cancer spread to the lymph node).
2.PC3 (Prostate Cancer 3)
Origin: Derived from a bone metastasis of a grade IV prostate adenocarcinoma.
Here we comparing Normoxia vs Hypoxia in each cell line.
- To learn the basic workflow of Bulk RNA-Seq analysis.
- To perform quality control, alignment, quantification, and differential expression analysis.
- To visualize results using plots such as PCA, heatmaps, and volcano plots and pathway enrichment analysis of DEGs.
- To document the process in a way that is clear for beginners to reproduce.
Dataset Source: GSE106305 (NCBI GEO) (Guo et al., Nature Communications, 2019)
- Each biological sample (e.g., LNCaP Normoxia replicate 1) is not stored as one file.
- Instead, it is split into multiple technical runs (SRR IDs) on the SRA database.
For example:
LNCaP Normoxia Replicate 1 (GSM3145509) is split across 4 SRR files:
SRR7179504, SRR7179505, SRR7179506, SRR7179507.
This is common in sequencing experiments, where one sample is sequenced in multiple lanes/runs.
What we downloaded
- In total, we downloaded 20 SRR files.
SRR7179504, SRR7179505, SRR7179506, SRR7179507
SRR7179508, SRR7179509, SRR7179510, SRR7179511
SRR7179520, SRR7179521, SRR7179522, SRR7179523
SRR7179524, SRR7179525, SRR7179526, SRR7179527
SRR7179536, SRR7179537, SRR7179540, SRR7179541
- These represent 8 biological samples (2 replicates for each condition).
- Install SRA Toolkit
- Download raw
.srafiles from NCBI using accession numbers - Convert
.sraβ compressed.fastq.gzfiles - (Optional) Automate multiple downloads with a Python script
Output: FASTQ files (raw sequencing reads)

- Run FastQC on all FASTQ files to check:
- Per-base quality scores
- GC content
- Adapter contamination
- Overrepresented sequences
- Use MultiQC to summarize results into one report
Output: HTML reports showing sequencing quality

- Use Trimmomatic or fastp to remove:
- Low-quality bases
- Adapter contamination
- Re-run FastQC to confirm improvements
Output: Cleaned FASTQ files
- Merge multiple technical replicates (several SRR files per sample) into one file
- Rename files clearly (e.g.,
LNCAP_Normoxia_S1.fastq.gz)
Output: 8 final FASTQ files (one per replicate)

- Download HISAT2 prebuilt genome index for GRCh38 (Human)
- Install HISAT2 + Samtools
- Align each FASTQ to the reference genome β SAM/BAM files
- Sort & Index BAM files for downstream use
Output: Aligned, sorted, and indexed BAM files

- Use featureCounts (Subread package) to assign reads to genes
- Input: BAM files + GRCh38 GTF annotation
- Output: Count matrix (genes Γ samples)
Result: raw counts of all the samples which is then used to generate final count matrix using python script.

- Use Qualimap RNA-seq to evaluate:
- Mapping statistics
- Coverage
- Strand specificity
- Ensure all samples pass QC before moving forward
Output: QC reports for aligned reads

- Import counts matrix into R (DESeq2 workflow)
- Normalise read counts (to remove library size bias)
- Perform differential expression testing:
- Normoxia vs Hypoxia (for each cell line separately)
- Visualisation and pathway enrichment
Output:
- List of significantly up/downregulated genes
- plots (PCA plots,Volcano plots,Heatmaps)
- Pathway enrichment analysis
Bulk_RNA_Seq_Analysis/
ββ fastq/ # FASTQs (raw + merged/renamed)
ββ fastqc_results/ # FastQC HTMLs
ββ multiqc_report/ # MultiQC summary
ββ alignedreads/ # BAM + BAI
ββ quants/ # featureCounts outputs
ββ rnaseq_qc_results/ # Qualimap outputs
It is recommended to work in a clean environment so that all tools are properly installed and version conflicts are avoided.
You can use Conda (preferred) or your system package manager (APT) to install the required tools.
# (optional) create and activate a dedicated environment
conda create -n rnaseq python=3.9 -y
conda activate rnaseq
# go to your project directory
mkdir Bulk_RNA_Seq_Analysis
cd Bulk_RNA_Seq_Analysis
# install SRA Toolkit
conda install -c conda-forge -c bioconda sra-tools=3.2.1
# verify installation
which prefetch
which fastq-dumpsudo apt update
sudo apt install sra-toolkitprefetchβ downloads.srafiles from NCBIfastq-dumpβ converts.sraβ.fastq.gzfor downstream analysis
In this step, we will download sequencing data from the NCBI SRA database and convert them to FASTQ format for downstream analysis.
To download and convert a single SRA run (e.g., SRR7179504), use the following commands:
# Download the SRA file
prefetch SRR7179504
# Convert SRA to FASTQ format
fastq-dump --outdir fastq --gzip --skip-technical --readids \
--read-filter pass --dumpbase --split-3 --clip \
SRR7179504/SRR7179504.sraNote: This command generates compressed FASTQ files in the
fastq/directory and ensures only high-quality reads are retained.
Processing multiple runs manually is time-consuming. Use a Python script to automate the download and conversion for all SRR IDs.
- Check Python installation:
python3 --version || (sudo apt update && sudo apt install python3 -y)- Run the automation script:
python3 fastq_download.pyWhy automation matters: Looping through all SRR IDs saves time and ensures that all sequencing runs (e.g., 20 runs) are processed efficiently without manual intervention.
Before alignment, itβs critical to check the quality of raw reads to identify issues such as adapter contamination, low-quality bases, or GC bias.
I recommend installing via Conda (preferred), but APT or pip can also be used.
Option 1: Conda (Recommended)
conda install -c bioconda fastqc
conda install -c bioconda multiqcOption 2: APT (System-wide)
sudo apt update && sudo apt install fastqc multiqcOption 3: pip (Latest MultiQC)
pip install multiqc- Run FastQC on all FASTQ files
mkdir -p fastqc_results
fastqc fastq/*.fastq.gz -o fastqc_results/ --threads 8- Aggregate results with MultiQC
multiqc fastqc_results/ -o multiqc_report/- β Confirms sequencing quality
- β Detects adapter contamination
- β Identifies GC bias and other systematic issues
- β Provides both per-sample (FastQC) and summarized (MultiQC) reports
Trimming can remove low-quality bases or adapter contamination. In this project, trimming was tested but since the aligner used here can better handle this things,untrimmed reads were used for downstream analysis.
Trimmomatic requires Java. Install Java first, then Trimmomatic:
# Check Java or install if missing
java -version || (sudo apt update && sudo apt install default-jre -y)
# Install Trimmomatic via Conda
conda install -c bioconda trimmomaticHere I demonstrate trimming on one FASTQ file (SRR7179504_pass.fastq.gz):
trimmomatic SE -threads 4 -phred33 \
fastq/SRR7179504_pass.fastq.gz \
fastq/SRR7179504_trimmed.fastq.gz \
TRAILING:10After trimming, re-run FastQC on the trimmed file:
fastqc fastq/SRR7179504_trimmed.fastq.gz- πΉ Removes low-quality bases from read ends
- πΉ Can reduce adapter contamination
- πΉ Ensures cleaner input for alignment
Note: In our dataset, trimming did not materially improve QC
Concatenate LNCaP technical runs β replicates and rename PC3 runs:
cat SRR7179504_pass.fastq.gz SRR7179505_pass.fastq.gz SRR7179506_pass.fastq.gz SRR7179507_pass.fastq.gz > LNCAP_Normoxia_S1.fastq.gz
cat SRR7179508_pass.fastq.gz SRR7179509_pass.fastq.gz SRR7179510_pass.fastq.gz SRR7179511_pass.fastq.gz > LNCAP_Normoxia_S2.fastq.gz
cat SRR7179520_pass.fastq.gz SRR7179521_pass.fastq.gz SRR7179522_pass.fastq.gz SRR7179523_pass.fastq.gz > LNCAP_Hypoxia_S1.fastq.gz
cat SRR7179524_pass.fastq.gz SRR7179525_pass.fastq.gz SRR7179526_pass.fastq.gz SRR7179527_pass.fastq.gz > LNCAP_Hypoxia_S2.fastq.gz
mv SRR7179536_pass.fastq.gz PC3_Normoxia_S1.fastq.gz
mv SRR7179537_pass.fastq.gz PC3_Normoxia_S2.fastq.gz
mv SRR7179540_pass.fastq.gz PC3_Hypoxia_S1.fastq.gz
mv SRR7179541_pass.fastq.gz PC3_Hypoxia_S2.fastq.gz
# After verifying the new files exist and sizes look correct:
rm -rf SRR*
| Cell line | Condition | Replicate | SRR runs merged | Final FASTQ |
|---|---|---|---|---|
| LNCaP | Normoxia | Rep1 | 504, 505, 506, 507 | LNCAP_Normoxia_S1.fastq.gz |
| LNCaP | Normoxia | Rep2 | 508, 509, 510, 511 | LNCAP_Normoxia_S2.fastq.gz |
| LNCaP | Hypoxia | Rep1 | 520, 521, 522, 523 | LNCAP_Hypoxia_S1.fastq.gz |
| LNCaP | Hypoxia | Rep2 | 524, 525, 526, 527 | LNCAP_Hypoxia_S2.fastq.gz |
| PC3 | Normoxia | Rep1 | 536 | PC3_Normoxia_S1.fastq.gz |
| PC3 | Normoxia | Rep2 | 537 | PC3_Normoxia_S2.fastq.gz |
| PC3 | Hypoxia | Rep1 | 540 | PC3_Hypoxia_S1.fastq.gz |
| PC3 | Hypoxia | Rep2 | 541 | PC3_Hypoxia_S2.fastq.gz |
- β Ensures that all reads from the same biological replicate are analyzed together
- β Prevents splitting of replicate data across multiple files
- β Simplifies downstream processing (alignment, quantification, etc.)
After merging, we also renamed the files for clarity and consistency.
To perform alignment and gene quantification, we need a reference genome index (for HISAT2) and a gene annotation file (GTF).
HISAT2 requires a prebuilt index of the reference genome. Download and extract:
wget https://genome-idx.s3.amazonaws.com/hisat/grch38_genome.tar.gz
tar -xvzf grch38_genome.tar.gzInstall HISAT2 (aligner) and Samtools (for handling BAM files):
sudo apt install hisat2
sudo apt install samtoolsDownload and extract the Ensembl GTF annotation file:
wget https://ftp.ensembl.org/pub/release-114/gtf/homo_sapiens/Homo_sapiens.GRCh38.114.gtf.gz
gunzip Homo_sapiens.GRCh38.114.gtf.gz- β HISAT2 requires a reference genome index to align reads
- β featureCounts and other quantification tools need a gene annotation (GTF) to assign reads to genes
In this step, we align the quality-checked FASTQ reads to the reference genome using HISAT2, and then process the results into sorted and indexed BAM files with Samtools.
The following command aligns one FASTQ file (LNCAP_Hypoxia_S1.fastq.gz) and outputs a sorted and indexed BAM file:
hisat2 -q -x grch38/genome -U fastq/LNCAP_Hypoxia_S1.fastq.gz | \
samtools sort -o alignedreads/LNCAP_Hypoxia_S1.bam
samtools index alignedreads/LNCAP_Hypoxia_S1.bamFor multiple samples, use the provided helper script:
chmod +x hisat2_alignment1.sh
./hisat2_alignment1.shThis script automates alignment for 7 samples; one additional sample was processed manually.
samtools index fails when piped, run it separately after sorting completes.
β
Result: Sorted and indexed BAM files stored in the alignedreads/ directory, ready for downstream analysis.
After alignment, we need to count how many reads map to each gene.
We use featureCounts (part of the Subread package) to generate a gene-by-sample count matrix, which is essential for differential expression (DE) analysis.
Install Subread and check that featureCounts is available:
sudo apt-get update
sudo apt-get install subread
featureCounts -vCreate a folder for storing quantification results:
mkdir -p quantsRun featureCounts on one BAM file (tmp.bam) with strand-specific mode (-S 2):
featureCounts -S 2 -a Homo_sapiens.GRCh38.114.gtf \
-o quants/featurecounts.txt tmp.bamTo automate the process of quantification, use the provided script:
chmod +x featurecounts.sh
./featurecounts.sh- β Converts aligned reads into per-gene read counts
- β Produces a gene Γ sample count matrix
- β Provides the featurecounts which can be used to create count matrix ,which is the input required for downstream differential expression analysis (DEA)
After alignment and quantification, it is important to assess the quality of aligned reads.
We use Qualimap to evaluate mapping quality, coverage, and biases.
Download manually or install via Conda:
Option 1: Manual Download
wget https://bitbucket.org/kokonech/qualimap/downloads/qualimap_v2.2.2.zip
unzip qualimap_v2.2.2.zip
cd qualimap_v2.2.2
chmod +x qualimapOption 2: Conda Install
conda install bioconda::qualimapRun QC on one BAM file (LNCAP_Hypoxia_S1.bam):
./qualimap_v2.3/qualimap rnaseq \
-bam alignedreads/LNCAP_Hypoxia_S1.bam \
-gtf Homo_sapiens.GRCh38.114.gtf \
-outdir rnaseq_qc_results \
--java-mem-size=8GFor batch QC, use the helper script:
chmod +x run_qualimap.sh
./run_qualimap.sh- β Validates mapping percentage
- β Checks coverage distribution
- β Detects gene-body bias
- β Confirms library strandedness
After quantification, each sample has its own featureCounts output.
In this step, we merge all per-sample counts into a single counts matrix (genes Γ samples).
Use the provided Jupyter Notebook to combine featureCounts outputs:
countsmatrix.ipynbThis notebook consolidates all per-sample counts into one matrix where:
Rows = genes Columns = samples
Once the counts matrix is built, proceed to R / RStudio, and folllow up the .Rmd file provided for downstream analysis with DESeq2:
- β Normalization
- β Differentially expressed genes (DEGs)
- β PCA plots, Volcano plots, Heatmap
- β Pathway ebnrichment analysis
- SRA Toolkit β Download sequencing data
- FastQC + MultiQC β Quality control
- Trimmomatic / fastp β Read trimming (optional)
- HISAT2 β Alignment to reference genome
- Samtools β BAM sorting and indexing
- featureCounts β Read quantification
- Qualimap β QC for aligned reads
- R (DESeq2, ggplot2, pheatmap) β Statistical analysis ,and visualization
- Learned how to process raw RNA-Seq data step by step
- Understood the importance of QC at every stage
- Experienced handling large datasets and merging technical replicates
- Performed differential expression analysis using DESeq2
- Gained insights into how hypoxia impacts prostate cancer cell lines
This work was carried out under the guidance of "Smriti Arora"
π¬ Contact
Author: Gayatri Sunil Samant
Email: [email protected]