Learning inverse folding from millions of predicted structures
Authors: Chloe Hsu, Robert Verkuil, Jason Liu, Zeming Lin, Brian Hie, Tom Sercu, Adam Lerer, Alexander Rives
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We augment training data by nearly three orders of magnitude by predicting structures for 12M protein sequences using AlphaFold2. Trained with this additional data, a sequence-to-sequence transformer with invariant geometric input processing layers achieves 51% native sequence recovery on structurally held-out backbones with 72% recovery for buried residues, an overall improvement of almost 10 percentage points over existing methods. (A sketch of the sequence-recovery metric appears after the table.) |
| Researcher Affiliation | Collaboration | Chloe Hsu (1), Robert Verkuil (2), Jason Liu (2), Zeming Lin (2,3), Brian Hie (2), Tom Sercu (2), Adam Lerer* (2), Alexander Rives* (2). *Equal contribution. (1) University of California, Berkeley; work done during internship at Facebook AI Research. (2) Facebook AI Research. (3) New York University. |
| Pseudocode | No | The paper describes the model architectures and training procedures in text, but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code and weights are available at https://github.com/facebookresearch/esm. (A model-loading sketch appears after the table.) |
| Open Datasets | Yes | We evaluate models on a structurally held-out subset of CATH (Orengo et al., 1997). ... predict structures for 12 million sequences in UniRef50 (Suzek et al., 2015) using AlphaFold2. |
| Dataset Splits | Yes | We partition CATH at the topology level with an 80/10/10 split, resulting in 16153 structures assigned to the training set, 1457 to the validation set, and 1797 to the test set. (A topology-level split sketch appears after the table.) |
| Hardware Specification | Yes | We profile the sampling speed with PyTorch Profiler, averaging over the sampling time for 30 sequences in each sequence length bucket on a Quadro RTX 8000 GPU with 48GB memory. (A generic profiling sketch appears after the table.) |
| Software Dependencies | No | The paper mentions software such as PyTorch Profiler and fairseq, but does not specify version numbers or other key software dependencies. |
| Experiment Setup | Yes | The GVP-GNN, GVP-GNN-large, and GVP-Transformer models used in the evaluations in this manuscript are all trained to convergence, with detailed hyperparameters listed in Table A.1. |
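
The headline number in the Research Type row is native sequence recovery. The snippet below is a minimal, illustrative sketch of that metric as a per-position match fraction between a designed and a native sequence; the function name and example sequences are hypothetical and do not reproduce the paper's exact evaluation pipeline.

```python
# Hypothetical sketch: native sequence recovery as the fraction of designed
# positions that match the native residue. Names and sequences are illustrative.
def sequence_recovery(native_seq: str, designed_seq: str) -> float:
    assert len(native_seq) == len(designed_seq), "sequences must be aligned"
    matches = sum(n == d for n, d in zip(native_seq, designed_seq))
    return matches / len(native_seq)

# 7 of 10 positions match, so recovery is 0.7 here (the paper reports 51%
# average recovery on structurally held-out backbones).
print(sequence_recovery("ACDEFGHIKL", "ACDEFGHAAA"))
```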
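For the Open Source Code row, a minimal loading-and-sampling sketch is shown below. It assumes the inverse-folding interface documented in the facebookresearch/esm repository (names such as `esm.pretrained.esm_if1_gvp4_t16_142M_UR50`, `esm.inverse_folding.util.load_coords`, and `model.sample` follow that repository's examples and may differ across releases).

```python
# Sketch assuming the esm package's documented inverse-folding API;
# exact names and signatures may vary by release.
import esm
import esm.inverse_folding

# Load the released GVP-Transformer inverse-folding model and its alphabet.
model, alphabet = esm.pretrained.esm_if1_gvp4_t16_142M_UR50()
model = model.eval()

# Extract backbone coordinates (N, CA, C) and the native sequence for chain A.
coords, native_seq = esm.inverse_folding.util.load_coords("example.pdb", "A")

# Sample a sequence conditioned on the backbone coordinates.
sampled_seq = model.sample(coords, temperature=1.0)
print(native_seq)
print(sampled_seq)
```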
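The Dataset Splits row describes an 80/10/10 partition of CATH at the topology level. The sketch below illustrates one way such a split can be formed so that all structures sharing a topology land in the same partition; the data structures, function name, and seed are hypothetical and do not reproduce the paper's actual split files.

```python
# Hypothetical sketch of an 80/10/10 split at the CATH topology level.
import random
from collections import defaultdict

def split_by_topology(entries, seed=0):
    """entries: iterable of (structure_id, topology_code) pairs."""
    by_topology = defaultdict(list)
    for structure_id, topology in entries:
        by_topology[topology].append(structure_id)

    topologies = sorted(by_topology)
    random.Random(seed).shuffle(topologies)

    n_train = int(0.8 * len(topologies))
    n_valid = int(0.1 * len(topologies))
    buckets = {
        "train": topologies[:n_train],
        "valid": topologies[n_train:n_train + n_valid],
        "test": topologies[n_train + n_valid:],
    }
    # Every structure with a given topology falls into exactly one split,
    # so validation/test backbones are structurally held out from training.
    return {name: [sid for t in tops for sid in by_topology[t]]
            for name, tops in buckets.items()}
```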
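For the Hardware Specification row, sampling speed was measured with the PyTorch Profiler. Below is a generic `torch.profiler` sketch; `sample_batch` is a hypothetical placeholder workload, not the paper's benchmarking script.

```python
# Generic torch.profiler sketch; sample_batch() is a hypothetical stand-in
# for one round of sequence sampling on the GPU.
import torch
from torch.profiler import profile, ProfilerActivity

def sample_batch():
    # Placeholder workload standing in for a model.sample(...) call.
    x = torch.randn(64, 512, device="cuda")
    return x @ x.T

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(30):  # the paper averages over 30 sequences per length bucket
        sample_batch()
    torch.cuda.synchronize()

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```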