Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.

Prot2Text-V2: Protein Function Prediction with Multimodal Contrastive Alignment

Authors: Xiao Fei, Michail Chatzianastasis, Sarah Carneiro, Hadi Abdine, Lawrence Petalidis, Michalis Vazirgiannis

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We train Prot2Text-V2 on about 250K curated entries from SwissProt and evaluate it under low-homology conditions, where test sequences have low similarity with training samples. Our evaluation includes automated metrics, LLM-as-a-judge scoring for semantic fidelity, and expert human assessment, providing a multi-faceted understanding of model quality and robustness.
Researcher Affiliation	Collaboration	(1) École Polytechnique, Institut Polytechnique de Paris, France; (2) Mohamed bin Zayed University of Artificial Intelligence, United Arab Emirates; (3) M42 Health, United Arab Emirates
Pseudocode	No	The paper includes figures illustrating the model architecture and data flow, and describes the methodology in text and mathematical equations, but does not present any structured pseudocode or algorithm blocks.
Open Source Code	Yes	The source code for this project is available at https://github.com/ColinFX/Prot2Text-V2/
Open Datasets	Yes	The dataset used during our experiments was the one proposed by Abdine et al. [2024]. This dataset is a multimodal dataset with 256,690 proteins, filtered originally from SwissProt (see Appendix C) [Bairoch and Apweiler, 2000], with their respective corresponding sequence, the predicted structure from AlphaFold [Jumper et al., 2021], and the textual description. This curated dataset provides aligned sequence, structure, and text modalities, with rigorous filtering and low redundancy across splits, making it better suited for our multimodal learning tasks. Notably, to the best of our knowledge, this is the only publicly available dataset that minimizes protein sequence similarity between training and test sets.
Dataset Splits	Yes	The resulting split includes 248,312 proteins for training, 4,172 for validation and 4,203 for testing. We use this established split to allow direct comparison with prior work and to assess our model's ability to transfer knowledge beyond close homologs.
Hardware Specification	Yes	Our model is implemented using PyTorch and trained on a single node with 8 NVIDIA A100 80GB GPUs.
Software Dependencies	No	The paper mentions 'PyTorch' and 'AdamW optimizer' but does not specify version numbers for these or any other key software components.
Experiment Setup	Yes	We use the AdamW optimizer [Loshchilov and Hutter, 2017] with ϵ = 1 × 10⁻⁶, β1 = 0.9, and β2 = 0.999. The learning rate starts at 0.0002 and decays to zero following a cosine scheduler, with a warm-up period covering 6% of the total training steps. For the LoRA adapter, we apply it to the self-attention modules in both the ESM encoder and LLaMA decoder, using a rank of 32 and an α value of 64. Training lasts for 12 epochs in contrastive learning and 24 epochs for supervised fine-tuning. During the contrastive learning stage, the batch size per device is 1024, which is further divided into 8 chunks to accommodate memory constraints. The batch size is set to 4 per GPU, and gradient accumulation is applied every 8 forward passes, resulting in an effective batch size of 256.
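
The arithmetic in the Experiment Setup row can be checked with a short sketch. It is illustrative only and is not taken from the authors' repository. It rests on two assumptions that the quoted text does not state: that the warm-up ramps linearly from zero, and that 0.0002 is the peak learning rate reached at the end of warm-up. The function names and the step count of 1,000 in the usage note are hypothetical. The GPU count comes from the Hardware Specification row.

```python
import math

# Values quoted in the table above.
PEAK_LR = 2e-4           # learning rate of 0.0002
WARMUP_FRACTION = 0.06   # warm-up covers 6% of total training steps
PER_GPU_BATCH = 4        # batch size per GPU
NUM_GPUS = 8             # single node with 8 A100 GPUs
GRAD_ACCUM_STEPS = 8     # optimizer update every 8 forward passes


def effective_batch_size(per_gpu: int, num_gpus: int, accum_steps: int) -> int:
    """Number of samples contributing to one optimizer update."""
    return per_gpu * num_gpus * accum_steps


def lr_at_step(step: int, total_steps: int,
               peak_lr: float = PEAK_LR,
               warmup_fraction: float = WARMUP_FRACTION) -> float:
    """Warm-up followed by cosine decay to zero.

    Assumption: the warm-up is linear from zero; the source only
    gives its length, not its shape.
    """
    warmup_steps = max(1, int(round(warmup_fraction * total_steps)))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    # Fraction of the post-warm-up phase that has elapsed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))
```

Calling `effective_batch_size(PER_GPU_BATCH, NUM_GPUS, GRAD_ACCUM_STEPS)` reproduces the reported effective batch size of 256 (4 × 8 × 8). With a hypothetical run of 1,000 steps, `lr_at_step` reaches the peak at step 60 and returns to zero at step 1,000.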