Learning to design protein-protein interactions with enhanced generalization
Authors: Anton Bushuiev, Roman Bushuiev, Petr Kouba, Anatolii Filkin, Marketa Gabrielova, Michal Gabriel, Jiri Sedlar, Tomas Pluskal, Jiri Damborsky, Stanislav Mazurenko, Josef Sivic
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | First, we construct PPIRef, the largest and non-redundant dataset of 3D protein-protein interactions, enabling effective large-scale learning. Second, we leverage the PPIRef dataset to pre-train PPIformer, a new SE(3)-equivariant model generalizing across diverse protein-binder variants. We finetune PPIformer to predict effects of mutations on protein-protein interactions via a thermodynamically motivated adjustment of the pre-training loss function. Finally, we demonstrate the enhanced generalization of our new PPIformer approach by outperforming other state-of-the-art methods on new, non-leaking splits of standard labeled PPI mutational data and independent case studies optimizing a human antibody against SARS-CoV-2 and increasing the thrombolytic activity of staphylokinase. |
| Researcher Affiliation | Academia | (1) Czech Institute of Informatics, Robotics and Cybernetics, Czech Technical University; (2) Loschmidt Laboratories, Department of Experimental Biology and RECETOX, Masaryk University; (3) International Clinical Research Center, St. Anne's University Hospital Brno; (4) Institute of Organic Chemistry and Biochemistry of the Czech Academy of Sciences |
| Pseudocode | Yes | Algorithm 1: iDist EMBED |
| Open Source Code | Yes | (1) https://github.com/anton-bushuiev/PPIRef (2) https://github.com/anton-bushuiev/PPIformer (3) https://huggingface.co/spaces/anton-bushuiev/PPIformer |
| Open Datasets | Yes | First, we construct PPIRef, the largest and non-redundant dataset of 3D protein-protein interactions, enabling effective large-scale learning. |
| Dataset Splits | Yes | Therefore, to ensure effective evaluation of generalization and mitigate the risk of overfitting, we divide SKEMPI v2.0 into 3 cross-validation folds based on the "Hold out proteins" feature, as originally proposed by the dataset authors. Additionally, we stratify the ΔΔG distribution across the folds to ensure balanced labels. Before constructing the cross-validation split, we reserve 5 distinct PPIs to create 5 test folds. |
| Hardware Specification | Yes | We pre-train our model on four AMD MI250X GPUs (8 PyTorch devices) in a distributed data parallel (DDP) mode. |
| Software Dependencies | No | The paper mentions software like PyTorch, PyTorch Geometric, PyTorch Lightning, Graphein, and Equiformer, but does not provide specific version numbers for these dependencies. |
| Experiment Setup | Yes | We partially explore the grid of hyper-parameters given by Table 5, and select the best model according to the performance on zero-shot ΔΔG inference on the training set of SKEMPI v2.0. We further fine-tune the model on the same data with the learning rate of 3×10⁻⁴ and sampling 32 mutations per GPU in a single training step, such that each mutation is from a different PPI. |
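The Dataset Splits row describes stratifying the ΔΔG label distribution across 3 cross-validation folds so that each fold sees a balanced spread of labels. A minimal sketch of one way to achieve such balancing (not the authors' code; the function name and the sort-then-round-robin scheme are illustrative assumptions, and the paper additionally groups by the "Hold out proteins" feature, which is omitted here):

```python
# Hypothetical sketch: balance a ddG label distribution across folds by
# sorting mutations by ddG and assigning them round-robin, so each fold
# receives a similar spread of low, medium, and high labels.
def stratified_folds(mutations, n_folds=3):
    """mutations: list of (mutation_id, ddg) pairs."""
    ordered = sorted(mutations, key=lambda m: m[1])  # ascending ddG
    folds = [[] for _ in range(n_folds)]
    for i, (mutation_id, _ddg) in enumerate(ordered):
        folds[i % n_folds].append(mutation_id)
    return folds

# Toy example with six mutations and made-up ddG values
data = [("m1", -2.0), ("m2", 0.5), ("m3", 1.7),
        ("m4", -0.3), ("m5", 3.1), ("m6", 0.0)]
print(stratified_folds(data))  # → [['m1', 'm2'], ['m4', 'm3'], ['m6', 'm5']]
```

Each fold here receives one mutation from the lower and one from the upper half of the ΔΔG range; the paper's actual split operates at the level of held-out proteins rather than individual mutations.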