Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Protein Inverse Folding From Structure Feedback

Authors: Junde Xu, Zijun Gao, Xinyi Zhou, hujie, Xingyi Cheng, Le Song, Guangyong Chen, Pheng-Ann Heng, Jiezhong Qiu

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Our results on the CATH 4.2 test set demonstrate that DPO fine-tuning not only improves sequence recovery of baseline models but also leads to a significant improvement in average TM-Score from 0.77 to 0.81, indicating enhanced structure similarity. Furthermore, iterative application of our DPO-based method on challenging protein structures yields substantial gains, with an average TM-Score increase of 79.5% with regard to the baseline model.
Researcher Affiliation	Academia	1 CUHK; 2 Hangzhou Institute of Medicine, CAS; 3 Zhejiang Lab; 4 MBZUAI
Pseudocode	No	The paper includes mathematical formulas and textual descriptions of processes, but no explicitly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code	Yes	Corresponding author. Code available at https://github.com/Eikor/iplm-rl
Open Datasets	Yes	Our results on the CATH 4.2 test set demonstrate that DPO fine-tuning...Our DPO models consistently achieve higher sequence recovery across all 3 datasets (CATH 4.2 [32], TS50 and TS500 [12]) compared to their baselines.
Dataset Splits	No	Specifically, we first generate sequences and train models on the CATH 4.2 training set and investigate the performance on the test set. To further probe generalization, we also evaluate on two additional benchmarks, TS50 and TS500 [12], which comprise 50 and 470 diverse proteins, and are often employed as additional benchmarks to further test generalization capability [53, 14, 15].
Hardware Specification	Yes	All experiments are done with 8*NVIDIA A100 GPUs.
Software Dependencies	No	The paper mentions software components like 'AdamW' as an optimizer and 'LoRA' as an adaptation technique, and models like 'ESMfold' and 'AlphaFold 3', but does not specify their version numbers or other key software dependencies with specific versions.
Experiment Setup	Yes	We use two different training configs for single-round training and multi-round training. The main difference lies in the generation setting and training steps. Specifically, for the construction of the single-round dataset, we generate 20 responses for each structure, with temperature at 1.0 and top p at 0.9. For multi-round, in order to encourage model exploration, we set the temperature at 1.1 and top p at 1 and generate 200 responses for each structure. For DPO training, we use AdamW [26] as optimizer. We set β to 0.5 for all experiments. We train our model on 8 * Nvidia A100 GPUs with a batch size of 128. Other parameters are summarized in Tab. 7.
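
The settings quoted in the Experiment Setup row can be collected into a small configuration sketch. The snippet below is illustrative only and is not taken from the authors' repository: the names `SamplingConfig`, `SINGLE_ROUND`, `MULTI_ROUND`, and `dpo_loss` are hypothetical, and the loss is the standard DPO objective for a single preference pair, which is assumed here and may differ in detail from what the paper implements. Only the numeric values (20 and 200 responses, temperature 1.0 and 1.1, top p 0.9 and 1, β = 0.5, batch size 128) come from the quoted text.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class SamplingConfig:
    """Generation settings for building a preference dataset (hypothetical container)."""
    responses_per_structure: int
    temperature: float
    top_p: float


# Values quoted in the Experiment Setup row; the variable names are illustrative.
SINGLE_ROUND = SamplingConfig(responses_per_structure=20, temperature=1.0, top_p=0.9)
MULTI_ROUND = SamplingConfig(responses_per_structure=200, temperature=1.1, top_p=1.0)

BETA = 0.5        # DPO beta reported for all experiments
BATCH_SIZE = 128  # reported batch size


def dpo_loss(policy_chosen_logp: float,
             policy_rejected_logp: float,
             ref_chosen_logp: float,
             ref_rejected_logp: float,
             beta: float = BETA) -> float:
    """Standard DPO loss for one (chosen, rejected) pair of sequence log-probabilities.

    Assumption: this is the textbook DPO objective, -log(sigmoid(margin)),
    not necessarily the exact variant used in the paper.
    """
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(margin)) equals log(1 + exp(-margin)); branch for numerical stability.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy and the reference model assign identical log-probabilities, the margin is zero and the loss equals log 2; the loss falls below that value as the policy shifts probability toward the chosen sequence relative to the reference.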