Learning to design protein-protein interactions with enhanced generalization

Authors: Anton Bushuiev, Roman Bushuiev, Petr Kouba, Anatolii Filkin, Marketa Gabrielova, Michal Gabriel, Jiri Sedlar, Tomas Pluskal, Jiri Damborsky, Stanislav Mazurenko, Josef Sivic

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility assessment (Variable | Result | LLM Response):
Research Type: Experimental. LLM response: "First, we construct PPIRef, the largest and non-redundant dataset of 3D protein-protein interactions, enabling effective large-scale learning. Second, we leverage the PPIRef dataset to pre-train PPIformer, a new SE(3)-equivariant model generalizing across diverse protein-binder variants. We finetune PPIformer to predict effects of mutations on protein-protein interactions via a thermodynamically motivated adjustment of the pre-training loss function. Finally, we demonstrate the enhanced generalization of our new PPIformer approach by outperforming other state-of-the-art methods on new, non-leaking splits of standard labeled PPI mutational data and independent case studies optimizing a human antibody against SARS-CoV-2 and increasing the thrombolytic activity of staphylokinase."
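At inference time, scoring in the spirit of this thermodynamically motivated objective reduces to a log-odds ratio between wild-type and mutant amino acids under the masked model. Below is a minimal sketch of that zero-shot ΔΔG scoring, assuming the model has already produced per-position log-probabilities over the 20 amino-acid types at the mutated positions; the function name and tensor shapes are illustrative, and exact masking and normalization details may differ from the paper.

```python
import torch

def zero_shot_ddg(log_probs: torch.Tensor, wt_idx: torch.Tensor, mut_idx: torch.Tensor) -> torch.Tensor:
    """Log-odds scoring of a (multi-point) mutation from masked predictions.

    log_probs: (k, 20) log-probabilities over amino-acid types at the k
        masked (mutated) positions.
    wt_idx, mut_idx: (k,) integer indices of the wild-type and substituted
        amino acids at those positions.
    """
    wt_lp = log_probs.gather(1, wt_idx.unsqueeze(1)).squeeze(1)
    mut_lp = log_probs.gather(1, mut_idx.unsqueeze(1)).squeeze(1)
    # Negative scores mean the model prefers the mutant over the wild type,
    # i.e. a predicted stabilizing effect on binding.
    return (wt_lp - mut_lp).sum()
```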
Researcher Affiliation: Academia. LLM response: (1) Czech Institute of Informatics, Robotics and Cybernetics, Czech Technical University; (2) Loschmidt Laboratories, Department of Experimental Biology and RECETOX, Masaryk University; (3) International Clinical Research Center, St. Anne's University Hospital Brno; (4) Institute of Organic Chemistry and Biochemistry of the Czech Academy of Sciences.
Pseudocode: Yes. LLM response: Algorithm 1 (iDist EMBED).
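iDist detects near-duplicate PPI interfaces by embedding each interface into a fixed-size vector and thresholding the Euclidean distance between embeddings. The sketch below only illustrates that interface-to-vector-to-threshold pipeline; the embedding function and the threshold value are stand-ins, not the paper's actual EMBED routine or cutoff.

```python
import numpy as np

def embed_interface(coords: np.ndarray) -> np.ndarray:
    """Stand-in for the paper's EMBED: map an (N, 3) array of interface
    residue coordinates to a fixed-size, rotation/translation-invariant
    descriptor (here, a histogram of distances to the centroid)."""
    centered = coords - coords.mean(axis=0)
    dists = np.linalg.norm(centered, axis=1)
    hist, _ = np.histogram(dists, bins=32, range=(0.0, 40.0), density=True)
    return hist

def is_near_duplicate(a: np.ndarray, b: np.ndarray, threshold: float = 0.04) -> bool:
    """iDist-style decision: two interfaces count as near-duplicates when
    their embeddings lie within a small Euclidean distance of each other
    (the threshold value here is illustrative)."""
    return float(np.linalg.norm(embed_interface(a) - embed_interface(b))) < threshold
```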
Open Source Code: Yes. LLM response: (1) https://github.com/anton-bushuiev/PPIRef; (2) https://github.com/anton-bushuiev/PPIformer; (3) https://huggingface.co/spaces/anton-bushuiev/PPIformer.
Open Datasets: Yes. LLM response: "First, we construct PPIRef, the largest and non-redundant dataset of 3D protein-protein interactions, enabling effective large-scale learning."
Dataset Splits: Yes. LLM response: "Therefore, to ensure effective evaluation of generalization and mitigate the risk of overfitting, we divide SKEMPI v2.0 into 3 cross-validation folds based on the 'Hold out proteins' feature, as originally proposed by the dataset authors. Additionally, we stratify the ΔΔG distribution across the folds to ensure balanced labels. Before constructing the cross-validation split, we reserve 5 distinct PPIs to create 5 test folds."
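A minimal sketch of how such a split could be built, assuming the reserved test PPIs have already been dropped from the dataframe: mutations sharing a hold-out protein must land in the same fold while per-fold ΔΔG histograms stay balanced. The column names ('Hold_out_proteins', 'ddG') and the greedy heuristic are illustrative assumptions, not the authors' exact procedure.

```python
import numpy as np
import pandas as pd

def stratified_group_folds(df: pd.DataFrame, n_folds: int = 3, n_bins: int = 10) -> pd.Series:
    """Greedy fold assignment: all mutations of a hold-out protein group stay
    in one fold, while the ddG label distribution stays roughly balanced."""
    folds = pd.Series(-1, index=df.index)
    bins = pd.qcut(df["ddG"], q=n_bins, labels=False, duplicates="drop").astype(int)
    fold_hist = np.zeros((n_folds, n_bins))
    # Place larger groups first; each group goes to the fold where it adds
    # the least squared mass per label bin (keeps the folds stratified).
    for _, grp in sorted(df.groupby("Hold_out_proteins"), key=lambda g: -len(g[1])):
        counts = np.bincount(bins.loc[grp.index].to_numpy(), minlength=n_bins)
        best = min(range(n_folds), key=lambda f: ((fold_hist[f] + counts) ** 2).sum())
        fold_hist[best] += counts
        folds.loc[grp.index] = best
    return folds
```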
Hardware Specification: Yes. LLM response: "We pre-train our model on four AMD MI250X GPUs (8 PyTorch devices) in a distributed data parallel (DDP) mode."
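Since the paper also mentions PyTorch Lightning, the quoted setup could be reproduced with a Trainer configuration like the sketch below; the model and dataloader are placeholders and the exact training entry point is an assumption.

```python
import pytorch_lightning as pl

trainer = pl.Trainer(
    accelerator="gpu",  # MI250X GPUs are driven through PyTorch's ROCm build
    devices=8,          # four dual-die MI250X cards expose 8 PyTorch devices
    strategy="ddp",     # distributed data parallel, matching the quoted setup
)
# trainer.fit(model, train_dataloaders=train_loader)  # model/loader omitted here
```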
Software Dependencies: No. LLM response: The paper mentions software like PyTorch, PyTorch Geometric, PyTorch Lightning, Graphein, and Equiformer, but does not provide specific version numbers for these dependencies.
Experiment Setup: Yes. LLM response: "We partially explore the grid of hyper-parameters given by Table 5, and select the best model according to the performance on zero-shot ΔΔG inference on the training set of SKEMPI v2.0. We further fine-tune the model on the same data with the learning rate of 3×10⁻⁴ and sampling 32 mutations per GPU in a single training step, such that each mutation is from a different PPI."
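A hedged sketch of the batch construction implied by this quote (32 mutations per GPU per step, each from a distinct PPI, learning rate 3×10⁻⁴); the mutations_by_ppi data structure and the optimizer choice are assumptions for illustration.

```python
import random

LR = 3e-4               # fine-tuning learning rate from the quoted setup
MUTATIONS_PER_GPU = 32  # mutations sampled per GPU in a single training step

def sample_step(mutations_by_ppi: dict[str, list]) -> list:
    """Draw 32 mutations for one step such that each comes from a different
    PPI (the mutations_by_ppi mapping is a hypothetical data structure)."""
    ppis = random.sample(sorted(mutations_by_ppi), k=MUTATIONS_PER_GPU)
    return [random.choice(mutations_by_ppi[p]) for p in ppis]

# optimizer = torch.optim.Adam(model.parameters(), lr=LR)  # optimizer choice assumed
```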