DNABERT-2: Efficient Foundation Model and Benchmark For Multi-Species Genomes
Authors: Zhihan Zhou, Yanrong Ji, Weijian Li, Pratik Dutta, Ramana V Davuluri, Han Liu
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through comprehensive experiments on the GUE benchmark, we demonstrate that DNABERT-2 achieves comparable performance to the state-of-the-art model with 21× fewer parameters and approximately 92× less GPU time in pre-training. |
| Researcher Affiliation | Academia | Department of Computer Science, Northwestern University, Evanston, IL, USA; Department of Biomedical Informatics, Stony Brook University, Stony Brook, NY, USA. Emails: {zhihanzhou, yanrongji, weijianli}@u.northwestern.edu; pratik.dutta@stonybrook.edu; Ramana.Davuluri@stonybrookmedicine.edu; hanliu@northwestern.edu |
| Pseudocode | No | The paper describes algorithms and methods but does not include any figure, block, or section explicitly labeled "Pseudocode" or "Algorithm". |
| Open Source Code | Yes | The code, data, and pre-trained model are available at https://github.com/MAGICS-LAB/DNABERT_2. |
| Open Datasets | Yes | To investigate the impact of species diversity on genome foundation models, we compile and make publicly available two datasets for foundation model pre-training: the human genome and the multi-species genome. |
| Dataset Splits | Yes | We explicitly define evaluation metrics for each task and split each dataset into training, validation, and test data for a fair comparison across different models. |
| Hardware Specification | Yes | About 14 days on 8 NVIDIA RTX 2080Ti GPUs vs. 28 days on 128 NVIDIA A100 GPUs, estimated with the "Method 2: GPU Time" accounting introduced by OpenAI (https://openai.com/research/ai-and-compute); see the arithmetic sketch after the table. Also: "The pre-training stage takes approximately 14 days using eight Nvidia RTX 2080Ti GPUs." |
| Software Dependencies | No | The paper mentions using the "Transformers library (Wolf et al., 2020) and the Composer library (Team, 2021)" but does not provide specific version numbers for these software dependencies. |
| Experiment Setup | Yes | We pre-train DNABERT-2 with the Masked Language Modeling (MLM) loss with a mask ratio of 15%. We use a batch size of 4096 and a max sequence length of 128. We train the model for 500,000 steps using the AdamW (Loshchilov & Hutter, 2019) optimizer with β1 = 0.9, β2 = 0.98, ϵ = 1e-6, and a weight decay of 1e-5. The learning rate linearly increases from 0 to 5e-4 during the first 30,000 steps and then linearly decreases to 0 over the last 470,000 steps. For fine-tuning: We keep most of the other hyperparameters the same for all the models across all the datasets, including a batch size of 32, a warmup step of 50, and a weight decay of 0.01. For DNABERT and DNABERT-2, we perform standard fine-tuning with a learning rate of 3e-5, while for the Nucleotide Transformers, we perform parameter-efficient fine-tuning (PEFT) using Low-Rank Adaptation (LoRA) with a learning rate of 1e-4, a LoRA alpha of 16, a LoRA dropout of 0.05, and a LoRA r of 8. See the configuration sketch after the table. |
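The roughly 92× GPU-time gap cited in the Research Type and Hardware Specification rows follows the spirit of OpenAI's "Method 2" accounting (GPU count × training days × peak throughput). The minimal sketch below reproduces that arithmetic; the peak FP16 tensor throughput figures (~312 TFLOPS for the A100, ~108 TFLOPS for the RTX 2080 Ti) are assumptions used for illustration and are not taken from the paper.

```python
# Rough GPU-time comparison in the spirit of OpenAI's "Method 2" accounting:
# GPU-time ~ number of GPUs x training days x assumed peak throughput.
# The peak TFLOPS values below are assumptions for illustration only.

A100_PEAK_TFLOPS = 312.0       # assumed dense FP16 tensor peak for NVIDIA A100
RTX2080TI_PEAK_TFLOPS = 108.0  # assumed FP16 tensor peak for NVIDIA RTX 2080 Ti

nt_gpu_days = 128 * 28       # Nucleotide Transformer: 128 A100s for 28 days
dnabert2_gpu_days = 8 * 14   # DNABERT-2: 8 RTX 2080Ti GPUs for 14 days

nt_compute = nt_gpu_days * A100_PEAK_TFLOPS
dnabert2_compute = dnabert2_gpu_days * RTX2080TI_PEAK_TFLOPS

print(f"Approximate GPU-time ratio: {nt_compute / dnabert2_compute:.0f}x")
# -> roughly 92x under these assumed peak-throughput numbers
```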
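For readers who want to mirror the reported experiment setup, the sketch below assembles the stated pre-training optimizer and learning-rate schedule, plus the LoRA settings reported for fine-tuning the Nucleotide Transformers, using Hugging Face `transformers` and `peft`. It is a minimal illustration, not the authors' training script: a vanilla BERT stands in for the DNABERT-2 architecture, data loading and the training loop are omitted, and `target_modules` is a hypothetical choice since the paper does not name the adapted layers.

```python
import torch
from transformers import BertConfig, BertForMaskedLM, get_linear_schedule_with_warmup
from peft import LoraConfig, TaskType

# Stand-in masked-language model; the real run uses the DNABERT-2 architecture
# with a BPE tokenizer rather than vanilla BERT.
model = BertForMaskedLM(BertConfig())

# Pre-training optimizer as reported: AdamW with beta1=0.9, beta2=0.98,
# eps=1e-6, weight decay 1e-5, and a peak learning rate of 5e-4.
optimizer = torch.optim.AdamW(
    model.parameters(), lr=5e-4, betas=(0.9, 0.98), eps=1e-6, weight_decay=1e-5
)

# Linear warmup to the peak LR over the first 30,000 of 500,000 steps,
# then linear decay to 0 (batch size 4096, max sequence length 128, 15% masking).
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=30_000, num_training_steps=500_000
)

# LoRA settings reported for parameter-efficient fine-tuning of the
# Nucleotide Transformers (learning rate 1e-4); target_modules is a
# hypothetical choice, since the paper does not name the adapted layers.
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["query", "value"],
)
```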