Learning inverse folding from millions of predicted structures

Authors: Chloe Hsu, Robert Verkuil, Jason Liu, Zeming Lin, Brian Hie, Tom Sercu, Adam Lerer, Alexander Rives

ICML 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We augment training data by nearly three orders of magnitude by predicting structures for 12M protein sequences using AlphaFold2. Trained with this additional data, a sequence-to-sequence transformer with invariant geometric input processing layers achieves 51% native sequence recovery on structurally held-out backbones with 72% recovery for buried residues, an overall improvement of almost 10 percentage points over existing methods.
Researcher Affiliation | Collaboration | Chloe Hsu (1), Robert Verkuil (2), Jason Liu (2), Zeming Lin (2,3), Brian Hie (2), Tom Sercu (2), Adam Lerer* (2), Alexander Rives* (2). *Equal contribution. (1) University of California, Berkeley; work done during an internship at Facebook AI Research. (2) Facebook AI Research. (3) New York University.
Pseudocode | No | The paper describes the model architectures and training procedures in text, but does not include any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code and weights are available at https://github.com/facebookresearch/esm; a usage sketch follows the table.
Open Datasets | Yes | We evaluate models on a structurally held-out subset of CATH (Orengo et al., 1997). ... predict structures for 12 million sequences in UniRef50 (Suzek et al., 2015) using AlphaFold2.
Dataset Splits | Yes | We partition CATH at the topology level with an 80/10/10 split, resulting in 16153 structures assigned to the training set, 1457 to the validation set, and 1797 to the test set.
Hardware Specification | Yes | We profile the sampling speed with PyTorch Profiler, averaging over the sampling time for 30 sequences in each sequence length bucket, on a Quadro RTX 8000 GPU with 48GB memory.
Software Dependencies | No | The paper mentions software such as the PyTorch Profiler and fairseq, but does not pin version numbers or otherwise specify key software dependencies.
Experiment Setup | Yes | The GVP-GNN, GVP-GNN-large, and GVP-Transformer models used in the evaluations in this manuscript are all trained to convergence, with detailed hyperparameters listed in Table A.1.
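
The headline metric quoted in the Research Type row is native sequence recovery. As a point of reference only (this is not taken from the paper's code), the sketch below computes recovery as the fraction of positions where a designed sequence matches the native one; restricting the comparison to a mask of buried positions would give the buried-residue figure.

```python
# Illustrative sketch (not from the released code): native sequence recovery
# is the fraction of positions where the designed sequence matches the native.
def sequence_recovery(native_seq: str, designed_seq: str, mask=None) -> float:
    assert len(native_seq) == len(designed_seq)
    positions = range(len(native_seq)) if mask is None else [i for i, m in enumerate(mask) if m]
    matches = sum(native_seq[i] == designed_seq[i] for i in positions)
    return matches / len(positions)

# Example: 9 of 10 residues match, so recovery is 0.9.
print(sequence_recovery("ACDEFGHIKL", "ACDEFGHIKV"))
```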
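The Open Source Code row points to the esm repository. A minimal usage sketch is shown below; it assumes the pretrained-model loader and inverse-folding utilities documented in that repository (esm.pretrained.esm_if1_gvp4_t16_142M_UR50 and esm.inverse_folding.util), and "example.pdb" with chain "A" is a placeholder input.

```python
# Sketch of sampling a sequence for a fixed backbone with the released
# ESM-IF1 weights; entry points follow the repository's inverse-folding
# documentation, and "example.pdb" / chain "A" are placeholders.
import esm
import esm.inverse_folding

model, alphabet = esm.pretrained.esm_if1_gvp4_t16_142M_UR50()
model = model.eval()

structure = esm.inverse_folding.util.load_structure("example.pdb", "A")
coords, native_seq = esm.inverse_folding.util.extract_coords_from_structure(structure)

# Autoregressively sample a sequence conditioned on the backbone coordinates.
sampled_seq = model.sample(coords, temperature=1.0)
print(native_seq)
print(sampled_seq)
```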
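For the Dataset Splits row, the sketch below is a hypothetical illustration of a topology-level partition: all domains sharing a CATH topology code land in the same split, so held-out backbones are structurally distinct from the training data. The function name, grouping logic, and random shuffle are assumptions for illustration; the paper's actual split files are released with the code.

```python
# Hypothetical topology-level 80/10/10 split: domains with the same CATH
# topology code are never separated across train/validation/test.
import random
from collections import defaultdict

def split_by_topology(domain_to_topology, seed=0):
    groups = defaultdict(list)
    for domain, topology in domain_to_topology.items():
        groups[topology].append(domain)
    topologies = sorted(groups)
    random.Random(seed).shuffle(topologies)
    n_train = int(0.8 * len(topologies))
    n_valid = int(0.1 * len(topologies))
    train = [d for t in topologies[:n_train] for d in groups[t]]
    valid = [d for t in topologies[n_train:n_train + n_valid] for d in groups[t]]
    test = [d for t in topologies[n_train + n_valid:] for d in groups[t]]
    return train, valid, test
```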
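The Hardware Specification row describes profiling sampling speed with the PyTorch profiler on a Quadro RTX 8000. A sketch of that kind of measurement using the standard torch.profiler API might look like the following; `model` and `coords` are placeholders for an inverse-folding model and a backbone, not objects defined by the paper.

```python
# Sketch of timing sequence sampling with the PyTorch profiler; the paper
# averages over 30 samples per sequence-length bucket on a 48GB Quadro RTX 8000.
import torch
from torch.profiler import profile, ProfilerActivity

def profile_sampling(model, coords, n_samples=30, temperature=1.0):
    activities = [ProfilerActivity.CPU]
    if torch.cuda.is_available():
        activities.append(ProfilerActivity.CUDA)
    with profile(activities=activities) as prof:
        for _ in range(n_samples):
            model.sample(coords, temperature=temperature)
    print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10))
```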