HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide Resolution

Authors: Eric Nguyen, Michael Poli, Marjan Faizi, Armin Thomas, Michael Wornow, Callum Birch-Sykes, Stefano Massaroli, Aman Patel, Clayton Rabideau, Yoshua Bengio, Stefano Ermon, Christopher Ré, Stephen Baccus

NeurIPS 2023

Each entry below lists a reproducibility variable, the assessed result, and the supporting LLM response (quoted excerpts from the paper).
Research Type: Experimental. "HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide Resolution... On fine-tuned benchmarks from the Nucleotide Transformer, HyenaDNA reaches state-of-the-art (SotA) on 12 of 18 datasets... On the GenomicBenchmarks, HyenaDNA surpasses SotA on 7 of 8 datasets... Code available at https://github.com/HazyResearch/hyenadna."
Researcher Affiliation: Collaboration. (1) Stanford University; (2) Harvard University; (3) SynTensor; (4) Mila and Université de Montréal.
Pseudocode: No. The paper illustrates the model structure with block diagrams and architectural figures (e.g., Figure 1.3, Figure 3.1), but it contains no formal pseudocode or algorithm blocks.
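
Since the paper provides no pseudocode, the sketch below may help orient readers: a minimal, illustrative version of the gated FFT-based long convolution at the heart of Hyena-style blocks. It is an approximation under our own assumptions, not the authors' implementation; the function names are hypothetical, and in HyenaDNA the filter is parameterized implicitly by a small network rather than stored explicitly.

```python
import torch

def fft_long_conv(u: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
    """Causal long convolution of u (batch, length, dim) with a filter
    k (length, dim), computed in O(L log L) time via the FFT."""
    L = u.shape[1]
    k_f = torch.fft.rfft(k, n=2 * L, dim=0)         # filter spectrum
    u_f = torch.fft.rfft(u, n=2 * L, dim=1)         # input spectrum
    y = torch.fft.irfft(u_f * k_f, n=2 * L, dim=1)  # zero-padding makes this a linear (non-circular) conv
    return y[:, :L]                                 # keep the causal part

def gated_long_conv_block(u: torch.Tensor, k: torch.Tensor, gate: torch.Tensor) -> torch.Tensor:
    """One gated long-convolution step: convolve, then gate elementwise.
    The actual Hyena operator interleaves several such steps with
    learned input projections."""
    return gate * fft_long_conv(u, k)
```
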
Open Source Code: Yes. "Code available at https://github.com/HazyResearch/hyenadna."
Open Datasets: Yes. "We pretrain HyenaDNA on the human reference genome [22] using next nucleotide (token) prediction... Genome Reference Consortium. Genome reference consortium human build 38 (grch38). National Center for Biotechnology Information, 2013. URL https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.26/."
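
As a hedged illustration of the pretraining objective, next nucleotide (token) prediction at single-character resolution, the snippet below tokenizes a DNA string and computes a shifted cross-entropy loss. The five-symbol vocabulary, the `model` interface, and the helper names are assumptions made for this sketch, not the paper's exact tokenizer.

```python
import torch
import torch.nn.functional as F

# Hypothetical single-nucleotide vocabulary; the real tokenizer also
# includes special tokens (padding, etc.).
VOCAB = {"A": 0, "C": 1, "G": 2, "T": 3, "N": 4}

def tokenize(seq: str) -> torch.Tensor:
    return torch.tensor([VOCAB[c] for c in seq.upper()], dtype=torch.long)

def next_token_loss(model, seq: str) -> torch.Tensor:
    """Cross-entropy of predicting token t+1 from tokens <= t."""
    ids = tokenize(seq).unsqueeze(0)      # (1, L)
    logits = model(ids[:, :-1])           # (1, L-1, vocab_size)
    targets = ids[:, 1:]                  # labels shifted by one position
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))
```
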
Dataset Splits: Yes. "For pretraining, we use a single human reference genome [22], and leverage the training and validation intervals (start and end) from [1]. During training, we sample an interval and obtain a sequence of length L by adjusting the intervals on both ends. For the test set, we use chromosomes 14 and X, exclusively, and sample non-overlapping sequences of length L... For the Nucleotide Transformer, we generated our own train-test splits using a 90:10 ratio... We stop training early if the model's validation loss does not improve for two epochs."
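
A minimal sketch of the split logic quoted above, assuming the genome is a dict mapping chromosome names to sequences and intervals are (chromosome, start, end) tuples; the recentering rule is our assumption, since the paper only says intervals are "adjusted on both ends".

```python
import random

def sample_training_window(intervals, L, genome):
    """Pick a training interval at random and cut a length-L sequence
    centered on the interval midpoint (an assumed adjustment rule)."""
    chrom, start, end = random.choice(intervals)
    mid = (start + end) // 2
    lo = max(0, mid - L // 2)
    return genome[chrom][lo:lo + L]

def test_windows(genome, L, chroms=("chr14", "chrX")):
    """Non-overlapping length-L windows from the held-out chromosomes."""
    for c in chroms:
        seq = genome[c]
        for i in range(0, len(seq) - L + 1, L):
            yield seq[i:i + L]
```
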
Hardware Specification: Yes. "At sequence length 1M, HyenaDNA is 160x faster than its Transformer counterpart as shown in Fig. 4.1... A100 80GB... pretrained using up to 8 Nvidia A100 (80GB) GPUs. We train on a mix of Nvidia GPUs with A100s, V100s, and T4s."
Software Dependencies: No. The paper states, "Across all experiments, we use PyTorch and PyTorch Lightning," but it does not pin version numbers for either dependency, which exact reproduction would require.
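
Since the versions are unpinned, reproducers should record their own environment. The sketch below logs the installed versions and wires up the two-epoch early-stopping rule quoted in the Dataset Splits row; the Trainer settings (device count aside) and the commented-out `model`/`datamodule` are placeholders, not the authors' configuration.

```python
import torch
import pytorch_lightning as pl
from pytorch_lightning.callbacks import EarlyStopping

# Record the exact versions in use, since the paper does not pin them.
print("torch", torch.__version__, "| lightning", pl.__version__)

# Stop if validation loss fails to improve for two epochs, per the paper.
early_stop = EarlyStopping(monitor="val_loss", mode="min", patience=2)

trainer = pl.Trainer(
    accelerator="gpu",
    devices=8,                # up to 8 A100s were used for pretraining
    precision="16-mixed",     # an assumption; the paper does not state this
    callbacks=[early_stop],
)
# trainer.fit(model, datamodule)  # model and datamodule are placeholders
```
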
Experiment Setup: Yes. "Table A.1: Hyperparameter settings for HyenaDNA pretraining (select models)... Table A.3: Genomic Benchmarks hyperparameters for HyenaDNA and the baseline Transformer (GPT from 4.1), which uses FlashAttention [11]... Table A.5: Hyperparameter ranges used to fine-tune HyenaDNA for all Nucleotide Transformer datasets... Table A.7: Optimization settings for in-context learning... Table A.8: Chromatin profile model settings. HyenaDNA hyperparameter settings used in the chromatin profile prediction experiments (fine-tuning)."
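
One practical way to keep such table-driven setups reproducible is to mirror each appendix table as a version-controlled config; every field and value below is an illustrative placeholder, not a number taken from the paper's tables.

```python
# Skeleton mirroring the appendix hyperparameter tables; all values
# are placeholders, not figures from the paper.
PRETRAIN_CONFIG = {
    "seq_len": 1024,     # pretraining context length
    "d_model": 128,      # model width
    "n_layers": 2,       # depth
    "optimizer": "adamw",
    "lr": 6e-4,
    "batch_size": 256,
}

FINETUNE_SWEEP = {       # ranges swept during fine-tuning
    "lr": [1e-5, 1e-4, 1e-3],
    "dropout": [0.0, 0.1, 0.2],
    "epochs": [10, 20],
}
```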