TinyDNABERT is a lightweight genomic language model built from scratch, employing a BPE tokenizer and a RoBERTa architecture. It is pre-trained on the human reference genome GRCh38.p14 and evaluated using the NT Benchmark. Training is performed using only two NVIDIA RTX 4090 GPUs. All processed datasets and pre-trained models are publicly available on Hugging Face.
Most existing genomic language models open-source only their inference components; complete from-scratch training pipelines are rarely provided. This lack of transparency makes it difficult for beginners to learn and experiment. We hope TinyDNABERT can serve as a valuable resource for newcomers interested in genomic language modeling.
During the preprocessing stage, we remove all "N" characters from the FASTA file and randomly split the remaining sequences into segments ranging from 256 to 1024 base pairs, resulting in a total of 4,904,969 training samples. To accommodate memory constraints, we uniformly sample one-tenth of these segments to train the BPE tokenizer. The vocabulary size is set to 1,024, and the minimum frequency threshold is set to 4.
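For reference, this tokenizer step can be sketched with the Hugging Face `tokenizers` library roughly as below. This is only an illustrative sketch: the input file `grch38_segments.txt`, the sampling seed, and the special-token list are assumptions, and `ByteLevelBPETokenizer` is one common BPE implementation for RoBERTa-style models rather than necessarily the exact class used here.

```python
import random
from tokenizers import ByteLevelBPETokenizer

# Hypothetical corpus file: one cleaned 256-1024 bp segment per line.
with open("grch38_segments.txt") as f:
    segments = [line.strip() for line in f if line.strip()]

# Uniformly sample one-tenth of the segments to keep tokenizer training within memory.
random.seed(42)
subset = random.sample(segments, len(segments) // 10)

# Train a BPE tokenizer with the vocabulary size and frequency threshold stated above.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train_from_iterator(
    subset,
    vocab_size=1024,
    min_frequency=4,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
tokenizer.save_model("tinydnabert_tokenizer")
```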
TinyDNABERT is designed with a hidden size of 512, 6 transformer layers, 8 attention heads, and an intermediate size of 2048. In total, the model contains 19,968,512 parameters.
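These hyperparameters map onto a Hugging Face `RobertaConfig` roughly as sketched below; values not stated above (e.g. `max_position_embeddings`) are assumptions based on common RoBERTa defaults.

```python
from transformers import RobertaConfig, RobertaForMaskedLM

config = RobertaConfig(
    vocab_size=1024,               # matches the BPE vocabulary size
    hidden_size=512,
    num_hidden_layers=6,
    num_attention_heads=8,
    intermediate_size=2048,
    max_position_embeddings=514,   # assumption: 512 tokens plus RoBERTa's position offset
)
model = RobertaForMaskedLM(config)
print(model.num_parameters())      # roughly 20M, in line with the count reported above
```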
We pretrain TinyDNABERT using the masked language modeling (MLM) objective with a masking rate of 15%. The model is trained with a batch size of 128 and a maximum sequence length of 512, using the AdamW optimizer for 200,000 steps.
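A minimal sketch of this pretraining setup with the `Trainer` API is shown below. The dataset path, checkpointing and logging intervals, and the per-GPU batch split are assumptions, and `model` refers to the `RobertaForMaskedLM` instance sketched above.

```python
from datasets import load_dataset
from transformers import (
    DataCollatorForLanguageModeling,
    RobertaTokenizerFast,
    Trainer,
    TrainingArguments,
)

# Load the trained BPE vocabulary and tokenize the segments to at most 512 tokens.
tokenizer = RobertaTokenizerFast.from_pretrained("tinydnabert_tokenizer", model_max_length=512)
raw = load_dataset("text", data_files="grch38_segments.txt")["train"]
dataset = raw.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

# Dynamic masking at the 15% rate used by the MLM objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="tinydnabert-pretrain",
    per_device_train_batch_size=64,  # assumption: 64 per GPU x 2 GPUs = 128 total
    max_steps=200_000,
    optim="adamw_torch",
    save_steps=10_000,
    logging_steps=500,
)

trainer = Trainer(
    model=model,                     # the RobertaForMaskedLM instance from the config sketch
    args=args,
    train_dataset=dataset,
    data_collator=collator,
)
trainer.train()
```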
The evaluation phase follows a procedure similar to pretraining. We fine-tune with a batch size of 64, using the AdamW optimizer for 5,000 steps with a weight decay of 1e-5, and compare TinyDNABERT against two baselines:
- Scratch: the same model architecture trained directly on the downstream tasks without any pretraining
- SOTA: the state-of-the-art results reported in the Nucleotide Transformer paper
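A hedged sketch of this fine-tuning setup for a single downstream task is shown below; the checkpoint path, label count, per-GPU batch split, and the `task_train`/`task_test` datasets are placeholders rather than the project's actual configuration.

```python
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

# Load the pretrained checkpoint with a fresh classification head;
# num_labels depends on the specific NT Benchmark task.
model = AutoModelForSequenceClassification.from_pretrained(
    "tinydnabert-pretrain", num_labels=2
)

args = TrainingArguments(
    output_dir="tinydnabert-finetune",
    per_device_train_batch_size=32,  # assumption: 32 per GPU x 2 GPUs = 64 total
    max_steps=5_000,
    weight_decay=1e-5,
    optim="adamw_torch",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=task_train,  # placeholder: tokenized NT Benchmark training split
    eval_dataset=task_test,    # placeholder: tokenized NT Benchmark test split
)
trainer.train()
print(trainer.evaluate())
```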
Compared to training from scratch, TinyDNABERT achieves a 19.77% performance improvement, reaching 89.38% of the SOTA level.
Judging from the pretraining loss, TinyDNABERT has not yet reached its full performance potential. With longer training and more data, it would likely achieve even better results. However, since the objective of this tutorial project has already been fulfilled and our resources are limited, we decided to stop pretraining here.
This project is developed with reference to the following works:
- Transformers GitHub
- Fine-tune a Language Model
- Build a RoBERTa Model from Scratch
- Create a Tokenizer and Train a Huggingface RoBERTa Model from Scratch
If you have any questions, please contact [email protected].

