ISYSLAB-HUST/ncRNAHD

ncRNAHD: Non-coding RNA Homolog Detection

ncRNAHD is a tool for detecting homologous non-coding RNA sequences using deep learning embeddings and efficient similarity search.

Features

  • Deep learning-based RNA sequence embedding using ncRNABert
  • Efficient similarity search with FAISS indexing
  • Multiple sequence alignment (MSA) generation support
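To illustrate the idea behind embedding-based homolog search, here is a minimal sketch of nearest-neighbor retrieval over sequence embeddings. It is not the ncRNAHD implementation: it uses a brute-force scan in pure Python where the tool uses a FAISS index, it assumes cosine similarity as the metric (this README does not state which metric the index uses), and the toy vectors and the names `cosine_similarity` and `top_k_neighbors` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def top_k_neighbors(query, database, k=2):
    """Return (sequence_id, score) for the k most similar entries, best first."""
    scored = [(cosine_similarity(query, vec), seq_id)
              for seq_id, vec in database.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(seq_id, score) for score, seq_id in scored[:k]]


# Toy 3-dimensional "embeddings" (assumption); real ones would come from ncRNABert.
database = {
    "seq_a": [1.0, 0.0, 0.0],
    "seq_b": [0.9, 0.1, 0.0],
    "seq_c": [0.0, 1.0, 0.0],
}
hits = top_k_neighbors([1.0, 0.0, 0.0], database, k=2)
print(hits)
```

A brute-force scan like this is linear in the database size, which is why an index structure is needed for a database of RNAcentral's scale.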

Installation

1. Clone the repository

git clone https://github.com/ISYSLAB-HUST/ncRNAHD
cd ncRNAHD

2. Create conda environment

conda env create -f environment.yml
conda activate ncRNAHD

3. Download and process the RNAcentral database

bash setup/download_data.sh
python setup/process_rna_sequences.py

4. Generate embeddings for the database

python embedding/generate_embeddings.py

5. Build FAISS index

python indexing/build_faiss_index.py

6. Set up custom rMSA

# Install rMSA
bash msa/setup_rmsa.sh

# Replace the stock rMSA.pl with the custom version
bash msa/replace_rmsa.sh

Usage

Step 1: Homolog search

python homolog_search.py --query_fasta your_query.fasta --output_dir results

# Example 1: single query
python homolog_search.py --query_fasta examples/5kh8.fasta --output_dir results
# Example 2: batch query
python homolog_search.py --query_fasta examples/batch_query.fasta --output_dir results
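For batch queries it can help to know in advance which result files to expect. The sketch below lists the candidate files implied by the `Homologs_{sequence_id}.fasta` naming shown under File Structure. It assumes the sequence ID is the first whitespace-delimited token of each FASTA header, which this README does not confirm; `fasta_ids` and `expected_outputs` are hypothetical helpers, not part of ncRNAHD.

```python
import os
import tempfile


def fasta_ids(path):
    """Collect sequence IDs from a FASTA file (first token of each header)."""
    ids = []
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                header = line[1:].strip()
                if header:
                    ids.append(header.split()[0])
    return ids


def expected_outputs(query_fasta, output_dir="results"):
    """Paths of the per-query candidate files, one per sequence ID."""
    return [os.path.join(output_dir, f"Homologs_{seq_id}.fasta")
            for seq_id in fasta_ids(query_fasta)]


# Demo on a throwaway two-sequence FASTA file.
with tempfile.TemporaryDirectory() as tmp:
    demo = os.path.join(tmp, "batch.fasta")
    with open(demo, "w") as handle:
        handle.write(">rna1 example\nGGCUAGC\n>rna2\nAUGC\n")
    demo_paths = expected_outputs(demo)
print(demo_paths)
```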

Step 2: MSA generation

cd rMSA
# 1. Format the candidate database
database/script/makeblastdb -in ../results/Homologs_your_query.fasta -parse_seqids -hash_index -dbtype nucl
# 2. Generate MSA
perl rMSA.pl your_query.fasta -db1=../results/Homologs_your_query.fasta -cpu=16
# 3. Convert to A3M format (optional)
# perl ${WORK_DIR}/bin/reformat.pl fas a3m -l 10000 your_query.afa your_query.a3m

# Complete example:
database/script/makeblastdb -in ../results/Homologs_5kh8.fasta -parse_seqids -hash_index -dbtype nucl
perl rMSA.pl 5kh8.fasta -db1=../results/Homologs_5kh8.fasta -cpu=16
# perl ${WORK_DIR}/bin/reformat.pl fas a3m -l 10000 5kh8.afa 5kh8.a3m
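When running the MSA step for many queries, the two commands above can be generated from a query ID. The following is a minimal sketch that only builds the argument lists, using the same flags as the commands above. `build_msa_commands` is a hypothetical helper, and it assumes the query file is named `<query_id>.fasta` and that the commands run from inside the `rMSA` directory, as in the complete example.

```python
import shlex


def build_msa_commands(query_id, results_dir="../results", cpu=16):
    """Build the makeblastdb and rMSA.pl commands for one query ID."""
    db = f"{results_dir}/Homologs_{query_id}.fasta"
    return [
        # 1. Format the candidate database
        ["database/script/makeblastdb", "-in", db,
         "-parse_seqids", "-hash_index", "-dbtype", "nucl"],
        # 2. Generate the MSA against the candidate database
        ["perl", "rMSA.pl", f"{query_id}.fasta", f"-db1={db}", f"-cpu={cpu}"],
    ]


commands = build_msa_commands("5kh8")
for cmd in commands:
    print(shlex.join(cmd))
```

Each list can then be passed to `subprocess.run(cmd, check=True, cwd="rMSA")` so that a failing step stops the pipeline instead of being silently ignored.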

File Structure

ncRNAHD/
├── homolog_search.py          # Main search tool
├── environment.yml            # Conda environment configuration
├── README.md                  # This file
├── setup/                     # Data download and preprocessing scripts
│   ├── download_data.sh
│   └── process_rna_sequences.py
├── embedding/                 # Embedding generation
│   └── generate_embeddings.py
├── indexing/                  # FAISS index building
│   └── build_faiss_index.py
├── search/                    # Search components
│   ├── embedding_generator.py
│   └── faiss_searcher.py
├── msa/                       # MSA generation tools
│   ├── setup_rmsa.sh
│   ├── setup_trrosetta.sh
│   ├── replace_rmsa.sh
│   └── rMSA.pl
├── data/                      # Generated data files (created during setup)
│   ├── rnacentral_active.fasta
│   ├── rnacentral_active_processed.fasta
│   ├── rna_embeddings.npy
│   ├── trained_index.faiss
│   ├── whiten_params.npz
│   └── sequence_index.json
├── results/                   # Search results (created during search)
│   └── Homologs_{sequence_id}.fasta    # Candidate sequences for each query
└── examples/                  # Example query files
    ├── batch_query.fasta
    ├── 5kh8.fasta
    └── Homologs_5kh8.fasta
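Before running a search, it may be useful to confirm that the installation steps produced every file listed under `data/` in the tree above. This is a minimal sketch of such a check; `missing_data_files` is a hypothetical helper, not a script shipped with the repository, and the file names are taken from the tree as shown.

```python
import tempfile
from pathlib import Path

# File names as listed under data/ in the File Structure tree.
EXPECTED_DATA_FILES = [
    "rnacentral_active.fasta",
    "rnacentral_active_processed.fasta",
    "rna_embeddings.npy",
    "trained_index.faiss",
    "whiten_params.npz",
    "sequence_index.json",
]


def missing_data_files(repo_root="."):
    """Return the expected data files that are absent under <repo_root>/data."""
    data_dir = Path(repo_root) / "data"
    return [name for name in EXPECTED_DATA_FILES
            if not (data_dir / name).is_file()]


# Demo on a throwaway directory holding only two of the six files.
with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp) / "data"
    data_dir.mkdir()
    (data_dir / "rna_embeddings.npy").touch()
    (data_dir / "trained_index.faiss").touch()
    demo_missing = missing_data_files(tmp)
print(demo_missing)
```

Calling `missing_data_files()` from the repository root returns an empty list once setup is complete.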
    


Requirements

  • Python 3.12
  • PyTorch
  • BioPython
  • FAISS
  • ncRNABert
  • See environment.yml for complete dependencies
