BEND: Benchmarking DNA Language Models on Biologically Meaningful Tasks
Authors: Frederikke Isa Marin, Felix Teufel, Marc Horlacher, Dennis Madsen, Dennis Pultz, Ole Winther, Wouter Boomsma
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this study, we introduce BEND, a Benchmark for DNA language models, featuring a collection of realistic and biologically meaningful downstream tasks defined on the human genome. We find that embeddings from current DNA LMs can approach the performance of expert methods on some tasks, but capture only limited information about long-range features. |
| Researcher Affiliation | Collaboration | 1Bioinformatics & Design, Enzyme Research, Novozymes A/S 2Department of Computer Science, University of Copenhagen 3Digital Science & Innovation, Novo Nordisk A/S 4Department of Biology, University of Copenhagen 5Computational Health Center, Helmholtz Center Munich 6DTU Compute, Technical University of Denmark |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | BEND is available at https://github.com/frederikkemarin/BEND. Code to extract DNA sequences from the reference genome with the bed coordinates, dataloaders, models, and config files is available on GitHub (https://anonymous.4open.science/r/BEND-8C42/README.md). |
| Open Datasets | Yes | All data is available at https://sid.erda.dk/cgi-sid/ls.py?share_id=aNQa0Oz2lY and mentions sources like "GENCODE (Frankish et al., 2021)", "ENCODE Project Consortium (2012)", "DeepSEA dataset (Zhou & Troyanskaya, 2015)", "ClinVar (Landrum et al., 2020)". |
| Dataset Splits | Yes | GraphPart (Teufel et al., 2023) with Needleman-Wunsch global sequence alignments was used to split the data at an 80% sequence identity threshold into train (80% of the data), validation (10%), and test (10%) partitions. |
| Hardware Specification | Yes | One model was trained on a single NVIDIA RTX 6000 GPU on a local cluster for 35 days; another was trained using 4 NVIDIA A40 GPUs for 14 days. |
| Software Dependencies | No | The paper mentions software tools like 'GraphPart (Teufel et al., 2023)', 'Ensembl VEP (McLaren et al., 2016)', and 'AUGUSTUS', but does not provide specific version numbers for these or other key software components or libraries. |
| Experiment Setup | Yes | CNN models were trained using AdamW with a learning rate of 0.003 and a weight decay of 0.01 for 100 epochs with a batch size of 64. CNN models with channel size 2 were trained using AdamW with a learning rate of 0.001 and a weight decay of 0.01 for 100 epochs with a batch size of 8. |
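The Experiment Setup row above can be sketched in code. The following is a minimal, hypothetical PyTorch reconstruction of the first reported configuration (AdamW, lr 0.003, weight decay 0.01, batch size 64); the `ConvProbe` architecture, embedding dimension, and sequence length are illustrative assumptions, not taken from the paper.

```python
# Hypothetical sketch of the reported CNN probe training setup.
# Only the optimizer hyperparameters and batch size come from the table;
# the architecture and tensor shapes below are assumptions.
import torch
import torch.nn as nn

class ConvProbe(nn.Module):
    """Small 1D CNN over per-nucleotide LM embeddings (assumed layout)."""
    def __init__(self, embed_dim: int, n_classes: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(embed_dim, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden, n_classes, kernel_size=3, padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, embed_dim, seq_len) -> per-position logits
        return self.net(x)

model = ConvProbe(embed_dim=512, n_classes=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-3, weight_decay=0.01)

# One illustrative step of the 100-epoch loop, batch size 64 as reported.
x = torch.randn(64, 512, 128)            # dummy embeddings
y = torch.randint(0, 2, (64, 128))       # dummy per-position labels
logits = model(x)                        # (64, 2, 128)
loss = nn.functional.cross_entropy(logits, y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

The second reported configuration would differ only in `lr=1e-3` and a batch size of 8.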