Learning Protein Structure with a Differentiable Simulator
Authors: John Ingraham, Adam Riesselman, Chris Sander, Debora Marks
ICLR 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We train and evaluate the model on a set of ~67,000 protein structures (domains) that are hierarchically and temporally split. We compare it to a strong RNN baseline model and demonstrate its ability to generalize to unobserved protein fold types. Table 1: Test set performance across different levels of generalization. For each of the 10,381 protein structures in our test set, we sampled 100 models from NEMO, clustered them by structural similarity, and selected a representative structure by a standard consensus algorithm (Ginalski et al., 2003). For evaluation of performance we focus on the TM-score (Zhang & Skolnick, 2005). |
| Researcher Affiliation | Academia | John Ingraham (1), Adam Riesselman (1), Chris Sander (1,2,3), Debora Marks (1,3); (1) Harvard Medical School, (2) Dana-Farber Cancer Institute, (3) Broad Institute of Harvard and MIT |
| Pseudocode | Yes | Algorithm 1: Direct integrator, Algorithm 2: Transform integrator, Algorithm 3: Mixed Integrator, Algorithm 4: Damped Backpropagation Through Time |
| Open Source Code | No | The paper does not provide any explicit statement about releasing the source code or a link to a code repository for their described methodology. |
| Open Datasets | Yes | We train and evaluate the model on a set of ~67,000 protein structures (domains) that are hierarchically and temporally split. To test these various levels of generalization systematically across many different protein families, we built a dataset on top of the CATH hierarchical classification of protein folds (Orengo et al., 1997). We collected protein domains from CATH releases 4.1 and 4.2 up to length 200 and hierarchically and temporally split this set (§B.1) into training (~35k folds), validation (~21k folds), and test sets (~10k folds). The Protein Data Bank (Berman et al., 2000) |
| Dataset Splits | Yes | We collected protein domains from CATH releases 4.1 and 4.2 up to length 200 and hierarchically and temporally split this set (§B.1) into training (~35k folds), validation (~21k folds), and test sets (~10k folds). For a training and validation set, we downloaded all protein domains of length L ≤ 200 from Classes α, β, and α/β in CATH release 4.1 (2015), and then hierarchically purged a randomly selected set of A, T, and H categories. This created three validation sets of increasing levels of difficulty: H, which contains domains with superfamilies that are excluded from train (but fold topologies may be present); T, which contains fold topologies that were excluded from train (fold generalization); and A, which contains secondary structure architectures that were excluded from train. |
| Hardware Specification | Yes | "2 months on 2 M40 GPUs"; "on a single Tesla M40 GPU with 12GB memory and 20 cores" |
| Software Dependencies | No | The paper mentions using TensorFlow and the Adam optimizer, along with techniques such as dropout and Batch Renormalization, but does not provide version numbers for any of these software components. For example, "In TensorFlow this operation is stop_gradient." and "We optimized all models for 200,000 iterations with Adam (Kingma & Ba, 2014)." |
| Experiment Setup | Yes | We optimized all models for 200,000 iterations with Adam (Kingma & Ba, 2014). We use dropout with p = 0.9 and Batch Renormalization (Ioffe, 2017) on all convolutional layers. The TM-score... requires iterative optimization, which we implemented with a sign gradient descent with 100 iterations... |
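The hierarchical hold-out described under Dataset Splits — purging randomly selected A, T, and H categories of the CATH hierarchy from training to form validation sets of increasing difficulty — can be sketched as follows. This is an illustrative reconstruction, not the authors' code; the function name, record layout, and hold-out counts are all hypothetical:

```python
# Sketch of a CATH-style hierarchical hold-out: purge randomly chosen
# A (architecture), T (topology), and H (superfamily) categories from
# training, so each validation set contains a level of the hierarchy
# never seen during training. All names here are illustrative.
import random

def hierarchical_split(domains, n_holdout=1, seed=0):
    """domains: list of dicts with keys 'id', 'C', 'A', 'T', 'H'."""
    rng = random.Random(seed)
    held = {}
    for level in ("H", "T", "A"):
        categories = sorted({d[level] for d in domains})
        held[level] = set(rng.sample(categories, n_holdout))
    train, val = [], {"H": [], "T": [], "A": []}
    for d in domains:
        if d["A"] in held["A"]:
            val["A"].append(d)   # unseen architecture (hardest)
        elif d["T"] in held["T"]:
            val["T"].append(d)   # unseen fold topology
        elif d["H"] in held["H"]:
            val["H"].append(d)   # unseen superfamily (easiest)
        else:
            train.append(d)
    return train, val
```

Checking the A level first ensures each domain lands in the hardest validation set it qualifies for, and the training set contains none of the held-out categories at any level.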
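The "sign gradient descent with 100 iterations" quoted under Experiment Setup refers to the generic update x ← x − η · sign(∇f(x)). A minimal toy sketch on a 1-D objective with a central-difference gradient, assuming nothing about the authors' actual TM-score objective or implementation:

```python
# Toy sketch of sign gradient descent (not the authors' code): the
# parameter moves a fixed step lr in the direction opposite the
# gradient's sign, for a fixed number of iterations.
def sign_gd(f, x0, lr=0.01, iters=100, eps=1e-6):
    x = x0
    for _ in range(iters):
        g = (f(x + eps) - f(x - eps)) / (2 * eps)  # central-difference gradient
        x -= lr * (1 if g > 0 else -1 if g < 0 else 0)
    return x
```

Because the step size is fixed, the iterate approaches the minimizer at a constant rate and then oscillates within one step of it, which is why a fixed iteration budget (here 100, as in the paper) is a natural stopping rule.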