DNABERT-2: Efficient Foundation Model and Benchmark For Multi-Species Genomes

Authors: Zhihan Zhou, Yanrong Ji, Weijian Li, Pratik Dutta, Ramana V Davuluri, Han Liu

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through comprehensive experiments on the GUE benchmark, we demonstrate that DNABERT-2 achieves comparable performance to the state-of-the-art model with 21× fewer parameters and approximately 92× less GPU time in pre-training.
Researcher Affiliation | Academia | Department of Computer Science, Northwestern University, Evanston, IL, USA; Department of Biomedical Informatics, Stony Brook University, Stony Brook, NY, USA. {zhihanzhou, yanrongji, weijianli}@u.northwestern.edu; pratik.dutta@stonybrook.edu; Ramana.Davuluri@stonybrookmedicine.edu; hanliu@northwestern.edu
Pseudocode | No | The paper describes algorithms and methods but does not include any figure, block, or section explicitly labeled "Pseudocode" or "Algorithm".
Open Source Code | Yes | The code, data, and pre-trained model are available at https://github.com/MAGICS-LAB/DNABERT_2 (a loading sketch follows the table).
Open Datasets | Yes | To investigate the impact of species diversity on genome foundation models, we compile and make publicly available two datasets for foundation model pre-training: the human genome and the multi-species genome.
Dataset Splits | Yes | We explicitly define evaluation metrics for each task and split each dataset into training, validation, and test data for a fair comparison across different models.
Hardware Specification | Yes | About 14 days on 8 NVIDIA RTX 2080Ti vs. 28 days on 128 NVIDIA A100, estimated with the "Method 2: GPU Time" accounting introduced by OpenAI at https://openai.com/research/ai-and-compute. Also: The pre-training stage takes approximately 14 days using eight Nvidia RTX 2080Ti GPUs. (A GPU-time tally follows the table.)
Software Dependencies | No | The paper mentions using the "Transformers library (Wolf et al., 2020) and the Composer library (Team, 2021)" but does not provide specific version numbers for these software dependencies.
Experiment Setup | Yes | We pre-train DNABERT-2 with the Masked Language Modeling (MLM) loss with a mask ratio of 15%. We use a batch size of 4096 and a max sequence length of 128. We train the model for 500000 steps using the AdamW (Loshchilov & Hutter, 2019) optimizer with β1 = 0.9, β2 = 0.98, ϵ = 1e-6 and weight decay of 1e-5. The learning rate linearly increases from 0 to 5e-4 during the first 30000 steps while linearly decreasing to 0 in the last 470000 steps. Also: We keep most of the other hyperparameters the same for all the models across all the datasets, including a batch size of 32, a warmup step of 50, and a weight decay of 0.01. For DNABERT and DNABERT-2, we perform standard fine-tuning with a learning rate of 3e-5, while for the Nucleotide Transformers, we perform parameter-efficient fine-tuning (PEFT) using Low-Rank Adaptation (LoRA) with a learning rate of 1e-4, a LoRA alpha of 16, a LoRA dropout of 0.05, and a LoRA r of 8. (A configuration sketch follows the table.)
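
For readers who want to try the released checkpoint referenced in the Open Source Code row, the sketch below shows one way to load it through the Hugging Face Transformers API. The Hub model ID (zhihan1996/DNABERT-2-117M) and the trust_remote_code flag are assumptions based on the linked repository's usage notes, not details stated in the paper.

```python
# Minimal loading sketch for the released DNABERT-2 checkpoint.
# The Hub model ID and trust_remote_code are assumptions taken from the
# linked repository, not from the paper itself.
import torch
from transformers import AutoTokenizer, AutoModel

MODEL_ID = "zhihan1996/DNABERT-2-117M"  # assumed Hub ID of the released model

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModel.from_pretrained(MODEL_ID, trust_remote_code=True)

dna = "ACGTAGCATCGGATCTATCTATCGACACTTGGTTATCGATCTACGAGCATCTCGTTAGC"
input_ids = tokenizer(dna, return_tensors="pt")["input_ids"]

with torch.no_grad():
    hidden_states = model(input_ids)[0]  # (1, sequence_length, hidden_size)

# Mean-pool token embeddings into a single sequence-level embedding.
embedding = hidden_states.mean(dim=1)
print(embedding.shape)
```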
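To make the hardware comparison concrete, here is a back-of-the-envelope tally in the spirit of OpenAI's "Method 2: GPU Time" accounting. It uses only the GPU counts and wall-clock days quoted above; the paper's roughly 92× figure presumably also weights each GPU-day by per-device throughput, which this sketch deliberately omits.

```python
# Back-of-the-envelope GPU-time comparison for the two pre-training setups
# quoted above. Converting GPU-days into compute (as OpenAI's Method 2 does)
# would additionally require per-GPU peak throughput and a utilization
# assumption, which is presumably how the paper reaches its ~92x figure.

def gpu_days(num_gpus: int, days: float) -> float:
    """Total GPU time in GPU-days: number of devices x wall-clock days."""
    return num_gpus * days

dnabert2 = gpu_days(num_gpus=8, days=14)       # 8 x RTX 2080 Ti for ~14 days
baseline = gpu_days(num_gpus=128, days=28)     # 128 x A100 for ~28 days

print(f"DNABERT-2 pre-training : {dnabert2:.0f} GPU-days")   # 112
print(f"Compared model         : {baseline:.0f} GPU-days")   # 3584
print(f"Raw GPU-day ratio      : {baseline / dnabert2:.0f}x")  # ~32x
```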
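Below is a minimal sketch of the pre-training optimizer/schedule and the fine-tuning settings quoted in the Experiment Setup row, written against the PyTorch, Transformers, and peft APIs. The hyperparameter values come from the quoted text; the placeholder model, the LoRA target modules, and the epoch count are illustrative assumptions, not the authors' released configuration.

```python
# Sketch of the quoted training settings. Anything not stated in the quote
# (placeholder model, LoRA target modules, epoch count) is an assumption.
import torch
from transformers import TrainingArguments, get_linear_schedule_with_warmup
from peft import LoraConfig, TaskType

# Pre-training: AdamW with a linear warmup to 5e-4 over the first 30k steps,
# then linear decay to 0 over the remaining 470k of the 500k total steps.
model = torch.nn.Linear(768, 768)  # placeholder module standing in for DNABERT-2
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=5e-4,
    betas=(0.9, 0.98),
    eps=1e-6,
    weight_decay=1e-5,
)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=30_000,
    num_training_steps=500_000,
)

# Fine-tuning (shared settings): batch size 32, 50 warmup steps, weight decay
# 0.01; learning rate 3e-5 for standard fine-tuning of DNABERT / DNABERT-2.
training_args = TrainingArguments(
    output_dir="dnabert2_finetune",
    per_device_train_batch_size=32,
    learning_rate=3e-5,
    warmup_steps=50,
    weight_decay=0.01,
    num_train_epochs=3,  # assumption: the epoch count is not quoted above
)

# LoRA settings used for the Nucleotide Transformer baselines:
# learning rate 1e-4, alpha 16, dropout 0.05, r 8.
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["query", "value"],  # assumption: not specified in the quote
)
```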