BEND: Benchmarking DNA Language Models on Biologically Meaningful Tasks

Authors: Frederikke Isa Marin, Felix Teufel, Marc Horlacher, Dennis Madsen, Dennis Pultz, Ole Winther, Wouter Boomsma

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this study, we introduce BEND, a Benchmark for DNA language models, featuring a collection of realistic and biologically meaningful downstream tasks defined on the human genome. We find that embeddings from current DNA LMs can approach performance of expert methods on some tasks, but only capture limited information about long-range features.
Researcher Affiliation | Collaboration | (1) Bioinformatics & Design, Enzyme Research, Novozymes A/S; (2) Department of Computer Science, University of Copenhagen; (3) Digital Science & Innovation, Novo Nordisk A/S; (4) Department of Biology, University of Copenhagen; (5) Computational Health Center, Helmholtz Center Munich; (6) DTU Compute, Technical University of Denmark
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | BEND is available at https://github.com/frederikkemarin/BEND. Code to extract DNA sequences from the reference genome using the BED coordinates, along with dataloaders, models, and config files, is available on GitHub (https://anonymous.4open.science/r/BEND-8C42/README.md). A hedged sketch of this extraction step appears after this table.
Open Datasets | Yes | All data is available at https://sid.erda.dk/cgi-sid/ls.py?share_id=aNQa0Oz2lY. The paper cites data sources such as GENCODE (Frankish et al., 2021), the ENCODE Project Consortium (2012), the DeepSEA dataset (Zhou & Troyanskaya, 2015), and ClinVar (Landrum et al., 2020).
Dataset Splits | Yes | GraphPart (Teufel et al., 2023) with Needleman-Wunsch global sequence alignments was used to partition the data at an 80% sequence identity threshold into train (80% of the data), test, and validation (10% each); a simplified illustration of such a homology-aware split appears after this table.
Hardware Specification | Yes | One model was trained on a single NVIDIA RTX 6000 GPU on a local cluster for 35 days; another was trained using 4 NVIDIA A40 GPUs for 14 days.
Software Dependencies | No | The paper mentions software tools such as GraphPart (Teufel et al., 2023), Ensembl VEP (McLaren et al., 2016), and AUGUSTUS, but does not provide specific version numbers for these or for other key software components or libraries.
Experiment Setup | Yes | CNN models were trained using AdamW with a learning rate of 0.003 and a weight decay of 0.01 for 100 epochs with a batch size of 64. CNN models with channel size 2 were trained using AdamW with a learning rate of 0.001 and a weight decay of 0.01 for 100 epochs with a batch size of 8. A minimal training-loop sketch with the first set of hyperparameters follows below.
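
As referenced in the Open Source Code row, the repository includes code to extract DNA sequences from the reference genome using BED coordinates. The following is a minimal sketch of that step, not BEND's actual implementation: it assumes pysam is installed, and the file names GRCh38.fa and regions.bed are hypothetical placeholders.

```python
# Minimal sketch of BED-driven sequence extraction (not BEND's own code).
# BED coordinates are 0-based, half-open, which matches pysam's fetch().
import pysam

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]

def extract_sequences(fasta_path: str, bed_path: str):
    """Yield the genomic sequence for each BED interval, strand-corrected."""
    fasta = pysam.FastaFile(fasta_path)
    with open(bed_path) as bed:
        for line in bed:
            fields = line.rstrip("\n").split("\t")
            chrom, start, end = fields[0], int(fields[1]), int(fields[2])
            strand = fields[5] if len(fields) > 5 else "+"  # column 6, if present
            seq = fasta.fetch(chrom, start, end)
            yield reverse_complement(seq) if strand == "-" else seq

# Hypothetical file names for illustration only.
for seq in extract_sequences("GRCh38.fa", "regions.bed"):
    print(seq[:50])
```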
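Likewise, the homology-aware split described in the Dataset Splits row can be illustrated with a simplified sketch. This is not GraphPart itself: difflib's ratio() stands in for Needleman-Wunsch global alignment identity, and the greedy cluster assignment is an assumption made here for brevity; in practice one would use the graphpart package directly.

```python
# Simplified illustration of a homology-aware 80/10/10 split.
# The paper used GraphPart with Needleman-Wunsch alignments; this sketch
# substitutes difflib's crude similarity ratio for true alignment identity.
import difflib
import random

def identity(a: str, b: str) -> float:
    return difflib.SequenceMatcher(None, a, b).ratio()

def cluster_by_identity(seqs, threshold=0.8):
    """Single-linkage clustering via union-find: sequence pairs above the
    identity threshold share a cluster, so homologs never cross splits."""
    parents = list(range(len(seqs)))
    def find(i):
        while parents[i] != i:
            parents[i] = parents[parents[i]]  # path compression
            i = parents[i]
        return i
    for i in range(len(seqs)):
        for j in range(i + 1, len(seqs)):
            if identity(seqs[i], seqs[j]) >= threshold:
                parents[find(i)] = find(j)
    clusters = {}
    for i in range(len(seqs)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())

def split_80_10_10(seqs, seed=0):
    """Return (train, val, test) index lists; whole clusters are assigned
    greedily to whichever split is furthest below its quota."""
    clusters = cluster_by_identity(seqs)
    random.Random(seed).shuffle(clusters)
    train, val, test = [], [], []
    for cluster in clusters:
        target = min((train, 0.8), (val, 0.1), (test, 0.1),
                     key=lambda p: len(p[0]) - p[1] * len(seqs))[0]
        target.extend(cluster)
    return train, val, test
```

Assigning whole clusters, rather than individual sequences, is what prevents near-identical sequences from leaking between train and test.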
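Finally, a minimal PyTorch sketch of the first reported training configuration (AdamW, learning rate 0.003, weight decay 0.01, 100 epochs, batch size 64). The one-layer CNN and random data below are placeholders, not the paper's actual architecture or tasks.

```python
# Minimal sketch of the reported optimizer settings; model and data are toys.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

model = nn.Sequential(                      # toy CNN over one-hot DNA (4 channels)
    nn.Conv1d(4, 64, kernel_size=9, padding=4),
    nn.ReLU(),
    nn.AdaptiveAvgPool1d(1),
    nn.Flatten(),
    nn.Linear(64, 2),                       # binary task head
)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-3, weight_decay=0.01)
criterion = nn.CrossEntropyLoss()

x = torch.randn(512, 4, 128)                # placeholder for one-hot sequences
y = torch.randint(0, 2, (512,))             # placeholder binary labels
loader = DataLoader(TensorDataset(x, y), batch_size=64, shuffle=True)

for epoch in range(100):                    # 100 epochs, as reported
    for xb, yb in loader:
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)
        loss.backward()
        optimizer.step()
```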