reproducibilityindex.ai

Learning to design protein-protein interactions with enhanced generalization

Authors: Anton Bushuiev, Roman Bushuiev, Petr Kouba, Anatolii Filkin, Marketa Gabrielova, Michal Gabriel, Jiri Sedlar, Tomas Pluskal, Jiri Damborsky, Stanislav Mazurenko, Josef Sivic

ICLR 2024

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	First, we construct PPIRef, the largest and non-redundant dataset of 3D protein-protein interactions, enabling effective large-scale learning. Second, we leverage the PPIRef dataset to pre-train PPIformer, a new SE(3)-equivariant model generalizing across diverse protein-binder variants. We finetune PPIformer to predict effects of mutations on protein-protein interactions via a thermodynamically motivated adjustment of the pre-training loss function. Finally, we demonstrate the enhanced generalization of our new PPIformer approach by outperforming other state-of-the-art methods on new, non-leaking splits of standard labeled PPI mutational data and independent case studies optimizing a human antibody against SARS-CoV-2 and increasing the thrombolytic activity of staphylokinase.
Researcher Affiliation	Academia	1 Czech Institute of Informatics, Robotics and Cybernetics, Czech Technical University; 2 Loschmidt Laboratories, Department of Experimental Biology and RECETOX, Masaryk University; 3 International Clinical Research Center, St. Anne's University Hospital Brno; 4 Institute of Organic Chemistry and Biochemistry of the Czech Academy of Sciences
Pseudocode	Yes	Algorithm 1: iDistEMBED
Open Source Code	Yes	https://github.com/anton-bushuiev/PPIRef https://github.com/anton-bushuiev/PPIformer https://huggingface.co/spaces/anton-bushuiev/PPIformer
Open Datasets	Yes	First, we construct PPIRef, the largest and non-redundant dataset of 3D protein-protein interactions, enabling effective large-scale learning.
Dataset Splits	Yes	Therefore, to ensure effective evaluation of generalization and mitigate the risk of overfitting, we divide SKEMPI v2.0 into 3 cross-validation folds based on the Hold_out_proteins feature, as originally proposed by the dataset authors. Additionally, we stratify the ΔΔG distribution across the folds to ensure balanced labels. Before constructing the cross-validation split, we reserve 5 distinct PPIs to create 5 test folds.
Hardware Specification	Yes	We pre-train our model on four AMD MI250X GPUs (8 PyTorch devices) in a distributed data parallel (DDP) mode.
Software Dependencies	No	The paper mentions software like PyTorch, PyTorch Geometric, PyTorch Lightning, Graphein, and Equiformer, but does not provide specific version numbers for these dependencies.
Experiment Setup	Yes	We partially explore the grid of hyper-parameters given by Table 5, and select the best model according to the performance on zero-shot ΔΔG inference on the training set of SKEMPI v2.0. We further fine-tune the model on the same data with the learning rate of 3 × 10⁻⁴ and sampling 32 mutations per GPU in a single training step, such that each mutation is from a different PPI.
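
The fine-tuning batch constraint quoted under Experiment Setup (32 mutations per GPU, each from a different PPI) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name sample_batch, the (ppi_id, mutation) pair format, and the toy data are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def sample_batch(mutations, batch_size=32, rng=None):
    """Sample `batch_size` mutations so that each comes from a different PPI.

    `mutations` is a list of (ppi_id, mutation) pairs. This data layout is
    hypothetical and chosen only to keep the sketch self-contained.
    """
    rng = rng or random.Random()

    # Group the available mutations by the PPI they belong to.
    by_ppi = defaultdict(list)
    for ppi_id, mutation in mutations:
        by_ppi[ppi_id].append(mutation)

    if len(by_ppi) < batch_size:
        raise ValueError("fewer distinct PPIs than batch_size")

    # Pick distinct PPIs first, then one mutation from each chosen PPI.
    chosen = rng.sample(sorted(by_ppi), batch_size)
    return [(ppi_id, rng.choice(by_ppi[ppi_id])) for ppi_id in chosen]


# Toy data: 40 hypothetical PPIs with 10 mutations each.
data = [(f"ppi_{i % 40}", f"mut_{i}") for i in range(400)]
batch = sample_batch(data, batch_size=32, rng=random.Random(0))
```

Choosing the PPIs before the mutations guarantees that no PPI appears twice in a batch, whatever the number of mutations each PPI has. In a distributed run, each device would presumably draw its own batch in this way, but the paper excerpt does not specify how sampling is coordinated across GPUs.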