Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.

JanusDNA: A Powerful Bi-directional Hybrid DNA Foundation Model

Authors: Qihao Duan, Bingding Huang, Zhenqiao Song, Irina Lehmann, Lei Gu, Roland Eils, Benjamin Wild

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments and ablation studies demonstrate that JanusDNA achieves new state-of-the-art performance on three genomic representation benchmarks.
Researcher Affiliation | Academia | Qihao Duan (1,4,5), Bingding Huang (2), Zhenqiao Song (3), Irina Lehmann (1), Lei Gu (4), Roland Eils (1,5,6,7), Benjamin Wild (1). (1) Berlin Institute of Health, Charité Universitätsmedizin Berlin; (2) College of Big Data and Internet, Shenzhen Technology University; (3) Language Technologies Institute, Carnegie Mellon University; (4) Epigenetics Laboratory, Max Planck Institute for Heart and Lung Research; (5) Department of Mathematics and Computer Science, Freie Universität Berlin; (6) Health Data Science Unit, Heidelberg University Hospital and BioQuant; (7) Intelligent Medicine Institute, Fudan University
Pseudocode | No | The paper describes the architecture and modeling approach through text and figures (Figure 1, Figure 2) but does not contain explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Code available at https://github.com/Qihao-Duan/JanusDNA.
Open Datasets | Yes | To ensure a fair comparison with prior work, we pre-train our model on only the human reference genome (HG38 [32]) following the training setup described in [8].
Dataset Splits | Yes | We perform 5-fold cross-validation for each task using the same seed as [8]. Models are fine-tuned for 10 epochs with a batch size of 256. For the Nucleotide Transformer tasks, we perform 10-fold cross-validation for each task, adhering to the same experimental settings as [8]. [...] splitting the dataset into 90/10 train/validation subsets and applying early stopping based on validation performance.
Hardware Specification | Yes | Notably, JanusDNA can process up to 1 million base pairs at single-nucleotide resolution on a single 80GB GPU using its hybrid architecture.
Software Dependencies | No | Optimization is performed using AdamW [45] with a weight decay of 0.1, β1 = 0.9, and β2 = 0.95. A cosine learning rate scheduler is applied, incorporating a warmup phase for 10% of the training steps. The mid-attention layer is implemented with FlexAttention [44].
Experiment Setup | Yes | Pre-training setup: To ensure a fair comparison with prior work, we pre-train our model on the human reference genome (HG38 [32]) following the training setup described in [8]. We use cross-entropy loss for pre-training. The model is trained with a learning rate of 8 × 10^-3, maintaining a constant token count of 2^20 tokens per batch. Two sequence lengths are used: 1024 and 131072, with corresponding batch sizes of 128 and 1, respectively, across 8 GPUs. Optimization is performed using AdamW [45] with a weight decay of 0.1, β1 = 0.9, and β2 = 0.95. A cosine learning rate scheduler is applied, incorporating a warmup phase for 10% of the training steps. The learning rate starts at 1 × 10^-6 and peaks at 1 × 10^-4. The coefficient for the MoE auxiliary loss is set to 0.2. The gradient clipping threshold is set to 1.0.
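
The pre-training numbers quoted in the Experiment Setup row can be sanity-checked with a short sketch. It is illustrative only and is not taken from the JanusDNA repository. The function names, the linear warmup shape, the decay floor (set equal to the starting rate), and the reading of the quoted batch sizes as per-GPU values are all assumptions. Under that per-GPU reading, both sequence-length settings give the same token count per batch, 2^20.

```python
import math

# Values quoted in the Experiment Setup row.
START_LR = 1e-6          # learning rate at step 0
PEAK_LR = 1e-4           # learning rate reached at the end of warmup
WARMUP_FRACTION = 0.10   # warmup covers 10% of the training steps
NUM_GPUS = 8

# Assumption (not stated in the source): the cosine phase decays back to START_LR.
FINAL_LR = START_LR


def tokens_per_batch(seq_len: int, per_gpu_batch: int, num_gpus: int = NUM_GPUS) -> int:
    """Tokens per optimizer step, assuming the quoted batch size is per GPU."""
    return seq_len * per_gpu_batch * num_gpus


def learning_rate(step: int, total_steps: int) -> float:
    """Warmup followed by cosine decay (hypothetical helper, linear warmup assumed)."""
    warmup_steps = max(1, int(WARMUP_FRACTION * total_steps))
    if step < warmup_steps:
        # Linear ramp from START_LR to PEAK_LR.
        return START_LR + (PEAK_LR - START_LR) * step / warmup_steps
    # Cosine decay from PEAK_LR down to FINAL_LR over the remaining steps.
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return FINAL_LR + 0.5 * (PEAK_LR - FINAL_LR) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for seq_len, batch in [(1024, 128), (131072, 1)]:
        print(seq_len, batch, tokens_per_batch(seq_len, batch))
    total = 1000  # hypothetical number of training steps
    for step in (0, 50, 100, 550, 1000):
        print(step, learning_rate(step, total))
```

The sketch uses only the start and peak learning rates from the row. The separately quoted learning rate of 8 × 10^-3 is left out because the row does not say how it relates to the scheduler.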