Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Learning to design protein-protein interactions with enhanced generalization
Authors: Anton Bushuiev, Roman Bushuiev, Petr Kouba, Anatolii Filkin, Marketa Gabrielova, Michal Gabriel, Jiri Sedlar, Tomas Pluskal, Jiri Damborsky, Stanislav Mazurenko, Josef Sivic
ICLR 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | First, we construct PPIRef, the largest and non-redundant dataset of 3D protein-protein interactions, enabling effective large-scale learning. Second, we leverage the PPIRef dataset to pre-train PPIformer, a new SE(3)-equivariant model generalizing across diverse protein-binder variants. We finetune PPIformer to predict effects of mutations on protein-protein interactions via a thermodynamically motivated adjustment of the pre-training loss function. Finally, we demonstrate the enhanced generalization of our new PPIformer approach by outperforming other state-of-the-art methods on new, non-leaking splits of standard labeled PPI mutational data and independent case studies optimizing a human antibody against SARS-CoV-2 and increasing the thrombolytic activity of staphylokinase. |
| Researcher Affiliation | Academia | (1) Czech Institute of Informatics, Robotics and Cybernetics, Czech Technical University; (2) Loschmidt Laboratories, Department of Experimental Biology and RECETOX, Masaryk University; (3) International Clinical Research Center, St. Anne's University Hospital Brno; (4) Institute of Organic Chemistry and Biochemistry of the Czech Academy of Sciences |
| Pseudocode | Yes | Algorithm 1 iDist EMBED |
| Open Source Code | Yes | (1) https://github.com/anton-bushuiev/PPIRef (2) https://github.com/anton-bushuiev/PPIformer (3) https://huggingface.co/spaces/anton-bushuiev/PPIformer |
| Open Datasets | Yes | First, we construct PPIRef, the largest and non-redundant dataset of 3D protein-protein interactions, enabling effective large-scale learning. |
| Dataset Splits | Yes | Therefore, to ensure effective evaluation of generalization and mitigate the risk of overfitting, we divide SKEMPI v2.0 into 3 cross-validation folds based on the Hold_out_proteins feature, as originally proposed by the dataset authors. Additionally, we stratify the ΔΔG distribution across the folds to ensure balanced labels. Before constructing the cross-validation split, we reserve 5 distinct PPIs to create 5 test folds. |
| Hardware Specification | Yes | We pre-train our model on four AMD MI250X GPUs (8 PyTorch devices) in a distributed data parallel (DDP) mode. |
| Software Dependencies | No | The paper mentions software like PyTorch, PyTorch Geometric, PyTorch Lightning, Graphein, and Equiformer, but does not provide specific version numbers for these dependencies. |
| Experiment Setup | Yes | We partially explore the grid of hyper-parameters given by Table 5, and select the best model according to the performance on zero-shot ΔΔG inference on the training set of SKEMPI v2.0. We further fine-tune the model on the same data with the learning rate of 3 × 10⁻⁴ and sampling 32 mutations per GPU in a single training step, such that each mutation is from a different PPI. |
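
The Dataset Splits row describes a leakage-free split: whole PPIs are reserved for testing, and the remaining hold-out groups are divided into 3 folds with balanced ΔΔG labels. The sketch below illustrates that idea only. It is not the authors' procedure: the function name `grouped_stratified_folds`, the toy group IDs and values, and the round-robin balancing heuristic (sort groups by mean label, then deal them across folds) are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean


def grouped_stratified_folds(records, n_folds=3, test_groups=()):
    """Split (group_id, ddg) records into folds without splitting any group.

    Hypothetical helper: an approximate, group-aware stratification,
    not the split used in the paper.
    """
    by_group = defaultdict(list)
    for group_id, ddg in records:
        by_group[group_id].append(ddg)

    # Reserve whole groups for testing so no test PPI leaks into a fold.
    test = {g: by_group.pop(g) for g in test_groups if g in by_group}

    # Order groups by mean label, then deal them out round-robin so that
    # every fold receives groups from across the label range.
    ordered = sorted(by_group, key=lambda g: mean(by_group[g]))
    folds = [dict() for _ in range(n_folds)]
    for i, g in enumerate(ordered):
        folds[i % n_folds][g] = by_group[g]
    return folds, test


# Toy data: group IDs stand in for hold-out protein groups (made-up values).
records = [
    ("A", 0.1), ("A", 0.3), ("B", 1.2), ("C", -0.5),
    ("D", 2.0), ("E", 0.7), ("F", 0.0), ("G", 1.5),
]
folds, test = grouped_stratified_folds(records, n_folds=3, test_groups=["G"])
```

Because assignment happens at the group level, every mutation of a given hold-out group ends up in exactly one fold. Exact label balance is not guaranteed by this heuristic; a library routine such as scikit-learn's `StratifiedGroupKFold` would be a reasonable alternative for discretized labels.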