Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Evolutionary Reasoning Does Not Arise in Standard Usage of Protein Language Models

Authors: Yasha Ektefaie, Andrew Shen, Lavik Jain, Maha Farhat, Marinka Zitnik

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We test this capability by evaluating whether standard PLM usage, frozen or fine-tuned embeddings with distance-based comparison, supports evolutionary reasoning. Existing PLMs consistently fail to recover phylogenetic structure, despite strong performance on sequence-level tasks such as masked-token and contact prediction. We present PHYLA, a hybrid state-space and transformer model that jointly processes multiple sequences and is trained using a tree-based objective across 3,000 phylogenies spanning diverse protein families.
Researcher Affiliation	Academia	Yasha Ektefaie* Eric and Wendy Schmidt Center Broad Institute of Harvard and MIT Cambridge, MA 02142 EMAIL Andrew Shen* Department of Biomedical Data Science Stanford University Stanford, CA 94305 EMAIL Lavik Jain Department of Biomedical Informatics Harvard Medical School Boston, MA 02115 EMAIL Maha Farhat Department of Biomedical Informatics Harvard Medical School Boston, MA 02115 EMAIL Marinka Zitnik Department of Biomedical Informatics Harvard Medical School Boston, MA 02115 EMAIL
Pseudocode	No	The paper describes the model architecture and training procedure in sections 4.1 and 4.2 using prose and mathematical equations (e.g., Equation 1) and diagrams (Figure 2), but does not include explicit pseudocode or algorithm blocks.
Open Source Code	Yes	Code to run PHYLA can be found in the project GitHub.
Open Datasets	Yes	For phylogenetic tree reconstruction, we use two held-out datasets: TreeBase, which includes 1,533 curated phylogenetic trees across diverse species (Piel & Tannen, 2009), and TreeFam, which contains 9,586 gene-family trees spanning a wide evolutionary range (Li et al., 2006). To assess taxonomic classification, we use bacterial isolate sequences from the Genome Taxonomy Database (GTDB) (Parks et al., 2021)... To evaluate performance beyond evolutionary structure, we also assess performance on functional prediction using the ProteinGym benchmark, which consists of 83 protein mutation effect datasets (Notin et al., 2023a)... PHYLA was trained on distances derived from 3,321 high-quality multiple sequence alignments (MSAs) curated from the OpenProteinSet (Ahdritz et al., 2023).
Dataset Splits	Yes	For phylogenetic tree reconstruction, we use two held-out datasets: TreeBase, which includes 1,533 curated phylogenetic trees across diverse species (Piel & Tannen, 2009), and TreeFam, which contains 9,586 gene-family trees spanning a wide evolutionary range (Li et al., 2006).
Hardware Specification	Yes	The current 24M parameter model was trained on a single 80GB H100 GPU for 3 days with the AdamW optimizer using a 10,000-step linear warmup up to a learning rate of 1e-5 (Loshchilov & Hutter, 2019).
Software Dependencies	No	The paper mentions "ETE3 Toolkit" for normalized Robinson-Foulds distance and "scikit-learn" for k-means clustering, but does not provide specific version numbers for these software components.
Experiment Setup	Yes	The current 24M parameter model was trained on a single 80GB H100 GPU for 3 days with the AdamW optimizer using a 10,000-step linear warmup up to a learning rate of 1e-5 (Loshchilov & Hutter, 2019). We employ an adaptive batch sizing approach to efficiently utilize GPU memory and avoid overfitting to a specific tree topology. We determine the largest subtree t ⊆ T at every training step that can fit within the available GPU memory. Next, we randomly sample a subtree size n such that 10 ≤ n ≤ |t|, where |t| is the number of sequences in t. Finally, we identify how many subtrees of the sampled size |t| can be accommodated within the GPU memory. If the model encounters an out-of-memory (OOM) error during this process, the subtrees are resampled with both the subtree size and the number of subtrees halved. We empirically determined that PHYLA can process input lengths up to 213,350 tokens on a 32 GB GPU and up to 302,350 tokens on a 48 GB GPU. For other GPU memory sizes, we used a linear model to estimate the maximum allowable input length. Given the length of the longest protein in the input, we computed the maximum number of sequences that could fit within the memory limit.
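
The adaptive batch sizing described in the Experiment Setup row can be sketched roughly as follows. This is an illustrative reading of the quoted text, not the authors' code. It assumes the linear model passes exactly through the two reported points (32 GB, 213,350 tokens) and (48 GB, 302,350 tokens), that every sequence is budgeted at the length of the longest protein, and that subtree sizes are sampled uniformly. All function names are hypothetical.

```python
import random

# The two (GPU memory in GB, max input tokens) points quoted in the table.
REPORTED_POINTS = ((32, 213_350), (48, 302_350))


def max_tokens_for_gpu(memory_gb: float) -> int:
    """Estimate the token budget with a line through the two reported points.

    Assumption: the paper only says "a linear model"; fitting it through
    these two points is our guess.
    """
    (m0, t0), (m1, t1) = REPORTED_POINTS
    slope = (t1 - t0) / (m1 - m0)  # tokens per additional GB
    return int(t0 + slope * (memory_gb - m0))


def max_sequences(memory_gb: float, longest_protein_len: int) -> int:
    """Sequences that fit if each is budgeted at the longest protein's length."""
    return max_tokens_for_gpu(memory_gb) // longest_protein_len


def plan_batch(tree_size: int, longest_protein_len: int, memory_gb: float,
               rng: random.Random, min_size: int = 10) -> tuple[int, int]:
    """Return (subtree_size, num_subtrees) for one training step."""
    capacity = max_sequences(memory_gb, longest_protein_len)
    largest = min(tree_size, capacity)  # |t|: largest subtree that fits
    if largest < min_size:
        raise ValueError("tree or memory budget too small for min_size")
    n = rng.randint(min_size, largest)  # sample n with 10 <= n <= |t|
    num_subtrees = max(1, capacity // n)  # subtrees of size n that fit
    return n, num_subtrees


def halve_on_oom(subtree_size: int, num_subtrees: int) -> tuple[int, int]:
    """OOM fallback: halve both quantities (floor of 1 is our assumption)."""
    return max(1, subtree_size // 2), max(1, num_subtrees // 2)
```

Under these assumptions, `max_tokens_for_gpu(80)` extrapolates to 480,350 tokens for the 80 GB H100 mentioned in the table. In a real training loop, `halve_on_oom` would be called from the handler that catches the framework's out-of-memory error before resampling subtrees.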