ncRNAHD is a tool for detecting homologous non-coding RNA sequences using deep learning embeddings and efficient similarity search.
- Deep learning-based RNA sequence embedding using ncRNABert
- Efficient similarity search with FAISS indexing
- Multiple sequence alignment (MSA) generation support
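
The embedding and similarity-search features above can be illustrated with a toy stand-in. The sketch below is not how ncRNAHD works internally: it replaces the learned ncRNABert embedding with normalised 3-mer counts and the FAISS index with a brute-force cosine search, both assumptions made purely so the example runs with the standard library alone. The names `toy_embed` and `top_k` are hypothetical.

```python
from collections import Counter
from itertools import product
from math import sqrt

# All 64 RNA 3-mers; each one is a dimension of the toy embedding.
KMERS = ["".join(p) for p in product("ACGU", repeat=3)]


def toy_embed(seq):
    """Hypothetical stand-in for a learned embedding: L2-normalised 3-mer counts."""
    seq = seq.upper().replace("T", "U")
    counts = Counter(seq[i:i + 3] for i in range(len(seq) - 2))
    vec = [float(counts.get(kmer, 0)) for kmer in KMERS]
    norm = sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def top_k(query, database, k=2):
    """Brute-force cosine search over a dict of {sequence_id: sequence}.

    An index library such as FAISS serves the same purpose at database scale.
    """
    q = toy_embed(query)
    scored = []
    for seq_id, seq in database.items():
        d = toy_embed(seq)
        scored.append((sum(a * b for a, b in zip(q, d)), seq_id))
    scored.sort(reverse=True)
    return [(seq_id, round(score, 3)) for score, seq_id in scored[:k]]
```

Calling `top_k(query, database)` returns the `k` most similar sequence IDs with their cosine scores, highest first. In the real pipeline the candidate set found this way is what gets written out for MSA generation.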
git clone https://github.com/ISYSLAB-HUST/ncRNAHD
cd ncRNAHD
conda env create -f environment.yml
conda activate ncRNAHD
bash setup/download_data.sh
python setup/process_rna_sequences.py
python embedding/generate_embeddings.py
python indexing/build_faiss_index.py
# Setup rMSA
bash msa/setup_rmsa.sh
# Replace the stock rMSA.pl with the custom version
bash msa/replace_rmsa.sh
python homolog_search.py --query_fasta your_query.fasta --output_dir results
# Example 1:
python homolog_search.py --query_fasta examples/5kh8.fasta --output_dir results
# Example 2:
python homolog_search.py --query_fasta examples/batch_query.fasta --output_dir results
cd rMSA
# 1. Format the candidate database
database/script/makeblastdb -in ../results/Homologs_your_query.fasta -parse_seqids -hash_index -dbtype nucl
# 2. Generate MSA
perl rMSA.pl your_query.fasta -db1=../results/Homologs_your_query.fasta -cpu=16
# 3. Convert to A3M format (optional)
# perl ${WORK_DIR}/bin/reformat.pl fas a3m -l 10000 your_query.afa your_query.a3m
# Complete example:
database/script/makeblastdb -in ../results/Homologs_5kh8.fasta -parse_seqids -hash_index -dbtype nucl
perl rMSA.pl 5kh8.fasta -db1=../results/Homologs_5kh8.fasta -cpu=16
# perl ${WORK_DIR}/bin/reformat.pl fas a3m -l 10000 5kh8.afa 5kh8.a3m
ncRNAHD/
├── homolog_search.py           # Main search tool
├── environment.yml             # Conda environment configuration
├── README.md                   # This file
├── setup/                      # Data download and preprocessing scripts
│   ├── download_data.sh
│   └── process_rna_sequences.py
├── embedding/                  # Embedding generation
│   └── generate_embeddings.py
├── indexing/                   # FAISS index building
│   └── build_faiss_index.py
├── search/                     # Search components
│   ├── embedding_generator.py
│   └── faiss_searcher.py
├── msa/                        # MSA generation tools
│   ├── setup_rmsa.sh
│   ├── setup_trrosetta.sh
│   ├── replace_rmsa.sh
│   └── rMSA.pl
├── data/                       # Generated data files (created during setup)
│   ├── rnacentral_active.fasta
│   ├── rnacentral_active_processed.fasta
│   ├── rna_embeddings.npy
│   ├── trained_index.faiss
│   ├── whiten_params.npz
│   └── sequence_index.json
├── results/                    # Search results (created during search)
│   └── Homologs_{sequence_id}.fasta  # Candidate sequences for each query
└── examples/                   # Example query files
    ├── batch_query.fasta
    ├── 5kh8.fasta
    └── Homologs_5kh8.fasta
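
Because the files under `data/` are created during setup, a quick check before searching can catch an incomplete setup. The sketch below takes the file names from the layout above; the helper name `missing_data_files` is hypothetical, and whether the search strictly needs every one of these files is an assumption, not something this README states.

```python
from pathlib import Path

# File names copied from the data/ directory in the project layout.
EXPECTED_DATA_FILES = (
    "rnacentral_active.fasta",
    "rnacentral_active_processed.fasta",
    "rna_embeddings.npy",
    "trained_index.faiss",
    "whiten_params.npz",
    "sequence_index.json",
)


def missing_data_files(repo_root="."):
    """Return the expected data/ files that are absent under repo_root."""
    data_dir = Path(repo_root) / "data"
    return [name for name in EXPECTED_DATA_FILES if not (data_dir / name).is_file()]


if __name__ == "__main__":
    missing = missing_data_files()
    if missing:
        print("Missing data files:", ", ".join(missing))
    else:
        print("All expected data files are present.")
```

Run from the repository root, it lists whichever generated files are still missing, which points to the setup step that needs to be rerun.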
- Python 3.12
- PyTorch
- BioPython
- FAISS
- ncRNABert
- See environment.yml for complete dependencies
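
To confirm that the activated environment matches the requirements listed above, a small check along the following lines may help. It is a sketch under assumptions: `torch`, `Bio` and `faiss` are the usual import names for PyTorch, Biopython and FAISS, while the import name of ncRNABert is not given here and is therefore left out. The helper names are hypothetical.

```python
import sys
from importlib.util import find_spec

# Assumed import names for PyTorch, Biopython and FAISS.
# ncRNABert is omitted because its import name is not stated in this README.
DEFAULT_MODULES = ("torch", "Bio", "faiss")


def python_matches(required=(3, 12)):
    """True if the running interpreter has the required major.minor version."""
    return tuple(sys.version_info[:2]) == tuple(required)


def check_modules(names=DEFAULT_MODULES):
    """Map each top-level module name to whether it can be found, without importing it."""
    return {name: find_spec(name) is not None for name in names}


if __name__ == "__main__":
    print("Python 3.12:", python_matches())
    for name, found in check_modules().items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

`find_spec` only locates the modules, so the check stays fast even for heavy packages such as PyTorch.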