Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

BioReason: Incentivizing Multimodal Biological Reasoning within a DNA-LLM Model

Authors: Adibvafa Fallahpour, Andrew Magnuson, Purav Gupta, Shihao Ma, Jack Naimer, Arnav Shah, Haonan Duan, Omar Ibrahim, Hani Goodarzi, Chris J Maddison, Bo Wang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Across biological reasoning benchmarks, BIOREASON significantly improves performance, raising accuracy on KEGG-based disease pathway prediction from 86% to 98% and delivering an average 15% gain over strong single-modality baselines in variant effect prediction tasks.
Researcher Affiliation	Collaboration	(1) University of Toronto; (2) Vector Institute; (3) University Health Network (UHN); (4) Arc Institute; (5) Cohere; (6) University of California, San Francisco; (7) Google DeepMind
Pseudocode	No	The paper describes algorithms like Group Relative Policy Optimization (GRPO) and its formal objective in Appendix A.4, but it does not present a step-by-step pseudocode or algorithm block.
Open Source Code	Yes	Data, code, and checkpoints are publicly available at https://github.com/bowang-lab/BioReason.
Open Datasets	Yes	We curated three datasets: one novel dataset specifically designed to incentivize reasoning and two adapted from established benchmarks. The adapted datasets are derived from ClinVar [24] and OMIM [1]... Our novel dataset is based on KEGG Network Variants data [20] and enhanced with cross-linked metadata from several public variant repositories including ClinVar [24], OMIM [1], dbSNP [37], and COSMIC [38].
Dataset Splits	Yes	Split: Chromosomes (Chr) 1–7, 9–22, X, Y for train/validation; Chr 8 for testing. ... We used stratified train/test splits to ensure balanced disease representation. ... D. Distribution of train/test splits across the three curated datasets. 10% of train dataset was used for validation.
Hardware Specification	Yes	We conducted experiments using multiple GPU clusters equipped with NVIDIA A100 and H100 GPUs. A100 systems were equipped with Intel Xeon Silver CPUs, featuring 16-24 CPU cores, 24-32 threads, and 188-251 GB of RAM. We used 4 A100 GPUs for reinforcement learning, while other experiments were performed on single H100 GPUs with Slurm-based orchestration and Deepspeed.
Software Dependencies	No	The paper mentions software like AdamW optimizer and Deepspeed strategy but does not provide specific version numbers for these or other key software components (e.g., Python, PyTorch, CUDA).
Experiment Setup	Yes	Optimizer: AdamW; Learning rate: 5 × 10^-5; Weight decay: 1 × 10^-2; Gradient accumulation: 8 steps; Random seed: 23; Devices: 1. LoRA adapters (SFT). Rank: 32, Alpha: 64, Dropout: 0.05 ... GRPO Parameters. Number of generations: 8; Per device batch size: 8; Steps: 1000 (7 epochs); Devices: 2; Temperature (4B parameters): 0.7; Temperature (1.7B parameters): 1; Top p: 0.95; Top k: 20; Beta: 0.0; Epsilon: 0.2 ... Task-specific settings: KEGG pathway reasoning: Batch size: 1; Epochs: 5; Max length DNA: 2048; Max text length: 1024 (for LLM-only, increases to 8192 to fit the raw DNA sequences). Variant effect prediction (coding & non-SNV): Batch size: 2; Epochs: 3; Max length DNA: 2048; Max text length: 1024 (for LLM-only, increases to 8192 to fit the raw DNA sequences).
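
The chromosome-based split quoted in the Dataset Splits row (Chr 8 held out for testing, the remaining chromosomes for train/validation, 10% of the training data used for validation) can be sketched as follows. This is a minimal illustration and not the authors' code: the record layout (a list of dicts with a "chrom" key), the function name split_by_chromosome, and the shuffle seed are all assumptions made for the example. The quoted text also mentions stratified splits for other datasets, which this sketch does not cover.

```python
import random

# Chromosome held out for testing, as quoted in the Dataset Splits row.
TEST_CHROMOSOMES = {"8"}
# Chr 1-7, 9-22, X, Y form the train/validation pool.
TRAIN_CHROMOSOMES = {str(i) for i in range(1, 23) if i != 8} | {"X", "Y"}


def split_by_chromosome(records, val_fraction=0.1, seed=0):
    """Return (train, val, test) lists split by chromosome.

    `records` is a hypothetical list of dicts, each with a "chrom" key
    such as "1", "8", or "X". The seed is arbitrary, not from the paper.
    """
    test = [r for r in records if r["chrom"] in TEST_CHROMOSOMES]
    pool = [r for r in records if r["chrom"] in TRAIN_CHROMOSOMES]

    # Shuffle a copy of the pool deterministically, then carve off validation.
    rng = random.Random(seed)
    rng.shuffle(pool)
    n_val = int(len(pool) * val_fraction)
    return pool[n_val:], pool[:n_val], test
```

Splitting on chromosome rather than on individual variants keeps all variants from the held-out chromosome out of training, so no test-chromosome sequence is seen during training.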