End-to-End Learning on 3D Protein Structure for Interface Prediction

Authors: Raphael Townshend, Rishi Bedi, Patricia Suriana, Ron Dror

NeurIPS 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We built a training dataset, the Database of Interacting Protein Structures (DIPS), that contains biases but is two orders of magnitude larger than those used previously. We found that these biases significantly degrade the performance of existing methods on gold-standard data. Hypothesizing that assumptions baked into the hand-crafted features on which these methods depend were the source of the problem, we developed the first end-to-end learning model for protein interface prediction, the Siamese Atomic Surfacelet Network (SASNet). Using only spatial coordinates and identities of atoms, SASNet outperforms state-of-the-art methods trained on gold-standard structural data, even when trained on only 3% of our new dataset. Code and data available at https://github.com/drorlab/DIPS.
Researcher Affiliation | Academia | Raphael J. L. Townshend, Stanford University, raphael@cs.stanford.edu; Rishi Bedi, Stanford University, rbedi@stanford.edu; Patricia A. Suriana, Stanford University, psuriana@stanford.edu; Ron O. Dror, Stanford University, rondror@cs.stanford.edu
Pseudocode | No | The paper describes the SASNet architecture textually and with a diagram (Figure 2F), but does not include any explicit pseudocode or algorithm blocks.
Open Source Code | Yes | Code and data available at https://github.com/drorlab/DIPS.
Open Datasets | Yes | We built a training dataset, the Database of Interacting Protein Structures (DIPS), that contains biases but is two orders of magnitude larger than those used previously.
Dataset Splits | Yes | State-of-the-art methods [23], [24] further split DB5 into a training/validation set of 175 complexes, DB5-train, corresponding to DB4 (the complexes from the previous version, Docking Benchmark 4) and a test set, DB5-test, of 55 complexes (the complexes added in the update from DB4 to DB5). [See the split sketch after this table.]
Hardware Specification | Yes | All models were trained across 4 Titan X GPUs using data-level parallelism, and the best model took 12 hours to train. [See the multi-GPU sketch after this table.]
Software Dependencies | No | The paper mentions using the RMSProp optimizer and convolutional neural networks, but does not specify versions for any programming languages or software libraries (e.g., Python, TensorFlow, PyTorch, scikit-learn).
Experiment Setup | Yes | Our model with the best validation performance involved training on 163840 examples, featurizing a grid of edge length 41 Å with voxel resolution of 1 Å (thus starting at a cube size of 41x41x41), and then applying 6 layers of convolution (each of size 3x3x3, with the 6 layers having 32, 32, 64, 64, 128, 128 convolutional filters, respectively) and 2 layers of max pooling... A fully connected layer with 512 parameters lays at the top of each tower, and the outputs of both towers are concatenated and passed through two more fully connected layers with 512 parameters each, leading to the final prediction. The number of filters used in each convolutional layer is doubled every other layer to allow for an increase of the specificity of the filters as the spatial resolution decreases. We use the RMSProp optimizer with a learning rate of 0.0001. The positive-negative class imbalance was set to 1:1. [See the architecture sketch after this table.]
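
Split sketch. The Dataset Splits row describes the DB5-train / DB5-test split as membership in Docking Benchmark 4: complexes already present in DB4 form the 175-complex training/validation set, and the 55 complexes added in the DB5 update form the test set. Below is a minimal sketch of that rule; the PDB codes shown are placeholders, not the actual benchmark lists.

# Hypothetical helper for the DB5 split quoted above: complexes already in
# Docking Benchmark 4 (DB4) go to DB5-train, newly added complexes go to DB5-test.
def split_db5(db5_ids, db4_ids):
    db4_ids = set(db4_ids)
    db5_train = [c for c in db5_ids if c in db4_ids]      # 175 complexes in the paper
    db5_test = [c for c in db5_ids if c not in db4_ids]   # 55 complexes added in DB5
    return db5_train, db5_test

# Placeholder PDB codes, for illustration only.
train, test = split_db5(["1AHW", "1BVK", "3EO1"], ["1AHW", "1BVK"])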
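
Multi-GPU sketch. The Hardware Specification row states that training used data-level parallelism across 4 Titan X GPUs but does not say which framework was used. The sketch below illustrates the general idea with PyTorch's DataParallel, which replicates a model on each device and splits every minibatch across them; the stand-in model, channel count, and batch shape are assumptions.

import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Stand-in 3D CNN; any nn.Module is handled the same way by DataParallel.
model = nn.Sequential(nn.Conv3d(4, 32, kernel_size=3, padding=1), nn.ReLU()).to(device)
if torch.cuda.device_count() >= 4:
    # Replicate the model on GPUs 0-3; each forward pass scatters the batch across them.
    model = nn.DataParallel(model, device_ids=[0, 1, 2, 3])
batch = torch.randn(16, 4, 41, 41, 41, device=device)  # voxel grids; channel count assumed
out = model(batch)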
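
Architecture sketch. The Experiment Setup row specifies the filter counts, the 512-unit tower and head layers, the optimizer, and the learning rate, but not the activation functions, the number of input channels, or exactly where the two max-pooling layers sit. The following is one plausible PyTorch reading of that description; the ReLU activations, the 4 atom-type input channels, the pooling placement, and the single-logit output are assumptions rather than details taken from the paper.

import torch
import torch.nn as nn

class SASNetSketch(nn.Module):
    """Rough siamese 3D-CNN following the Experiment Setup description above."""
    def __init__(self, in_channels=4, grid_size=41):
        super().__init__()
        filters = [32, 32, 64, 64, 128, 128]          # six 3x3x3 convolutions
        layers, prev = [], in_channels
        for i, f in enumerate(filters):
            layers += [nn.Conv3d(prev, f, kernel_size=3, padding=1), nn.ReLU()]
            if i in (1, 3):                           # two max-pooling layers; placement assumed
                layers.append(nn.MaxPool3d(2))
            prev = f
        self.tower = nn.Sequential(*layers, nn.Flatten())
        with torch.no_grad():                         # infer flattened size from a dummy grid
            flat = self.tower(torch.zeros(1, in_channels, grid_size, grid_size, grid_size)).shape[1]
        self.tower_fc = nn.Sequential(nn.Linear(flat, 512), nn.ReLU())
        self.head = nn.Sequential(                    # concatenated towers -> two 512-unit layers
            nn.Linear(2 * 512, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, 1),                        # interface / non-interface logit
        )

    def forward(self, grid_a, grid_b):
        # Siamese towers: the same weights process the surfacelet grids from both partners.
        za = self.tower_fc(self.tower(grid_a))
        zb = self.tower_fc(self.tower(grid_b))
        return self.head(torch.cat([za, zb], dim=1))

model = SASNetSketch()
optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-4)  # RMSProp, learning rate 0.0001

Placing the poolings after the second and fourth convolutions is consistent with the quoted statement that filter counts double every other layer as spatial resolution decreases, but the paper does not state the placement explicitly.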