Transformer protein language models are unsupervised structure learners
Authors: Roshan Rao, Joshua Meier, Tom Sercu, Sergey Ovchinnikov, Alexander Rives
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We compare ESM-1b (Rives et al., 2020), a large-scale (650M-parameter) Transformer model trained on UniRef50 (Suzek et al., 2007), to the Gremlin (Kamisetty et al., 2013) pipeline, which implements a log-linear model trained with pseudolikelihood (Balakrishnan et al., 2011; Ekeberg et al., 2013). Contacts can be extracted from the attention maps of the Transformer model by a sparse linear combination of attention heads identified by logistic regression (a hedged sketch of this readout appears after the table). ESM-1b contacts have higher precision than Gremlin contacts. When ESM-1b and Gremlin are compared with access to the same set of sequences, the precision gain from the protein language model is significant; the advantage holds on average even when Gremlin is given access to an optimized set of multiple sequence alignments incorporating metagenomics data. |
| Researcher Affiliation | Collaboration | Roshan Rao (UC Berkeley, rmrao@berkeley.edu); Joshua Meier (Facebook AI Research, jmeier@fb.com); Tom Sercu (Facebook AI Research, tsercu@fb.com); Sergey Ovchinnikov (Harvard University, so@g.harvard.edu); Alexander Rives (Facebook AI Research & New York University, arives@cs.nyu.edu) |
| Pseudocode | Yes | Algorithm 1 presents the algorithm used to generate pseudo-MSAs from ESM-1b. |
| Open Source Code | Yes | Weights for all ESM-1 and ESM-1b models, as well as regressions trained on these models, can be found at https://github.com/facebookresearch/esm (a short loading example follows the table). |
| Open Datasets | Yes | We compare ESM-1b (Rives et al., 2020), a large-scale (650M-parameter) Transformer model trained on UniRef50 (Suzek et al., 2007), to the Gremlin (Kamisetty et al., 2013) pipeline, which implements a log-linear model trained with pseudolikelihood (Balakrishnan et al., 2011; Ekeberg et al., 2013). We evaluate models on the 15051 proteins in the trRosetta training dataset (Yang et al., 2019). |
| Dataset Splits | Yes | We reserve 20 sequences for training, 20 sequences for validation, and 14842 sequences for testing. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models, memory, or cloud instance types used for running experiments. |
| Software Dependencies | No | The paper mentions 'We fit the parameters β via scikit-learn (Pedregosa et al., 2011)' but does not provide specific version numbers for scikit-learn or any other software dependencies crucial for replication. |
| Experiment Setup | Yes | For the limited supervision setting, we use the same 20 proteins used to train the sparse logistic regression model. For the full supervision setting, we generate a 95%/5% random training/validation split of the 15008 trRosetta proteins with sequence length <= 1024. For the n = 20 setting, we found that a learning rate of 0.001, weight decay of 10.0, and projection size of 512 had the best performance on the validation set. For the n = 14257 setting, we found that a learning rate of 0.001, weight decay of 0.01, and projection size of 512 had the best performance on the validation set. All models were trained to convergence to maximize validation long-range P@L with a patience of 10. The n = 20 models were trained with a batch size of 20 (i.e., 1 batch = 1 epoch) and the n = 14257 models were trained with a batch size of 128 (a configuration sketch for the n = 20 setting follows the table). |
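
The contact readout described in the Research Type row (a sparse linear combination of attention heads selected by logistic regression) can be sketched as follows. This is not the authors' released code: the feature construction (symmetrization and average product correction of each attention map) follows the paper's description, while the array shapes, the `attentions` input, and the regularization strength `C` are assumptions made for illustration.

```python
# Minimal sketch: predicting contacts from Transformer attention maps with a
# sparse (L1-penalized) logistic regression over attention heads.
import numpy as np
from sklearn.linear_model import LogisticRegression

def apc(x):
    """Average product correction applied to one L x L attention map."""
    row = x.sum(axis=0, keepdims=True)
    col = x.sum(axis=1, keepdims=True)
    return x - row * col / x.sum()

def pair_features(attentions):
    """attentions: iterable of (L, L) head maps -> (L*L, n_heads) feature matrix."""
    feats = []
    for a in attentions:
        a = (a + a.T) / 2.0            # symmetrize each head's attention map
        feats.append(apc(a).reshape(-1))
    return np.stack(feats, axis=1)

# X: pair features pooled over the 20 training proteins; y: 1 if the residue
# pair is a contact (Cb-Cb distance < 8 angstroms), else 0.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.15)  # C is an assumed value
# clf.fit(X, y)
# contact_scores = clf.predict_proba(pair_features(attentions))[:, 1].reshape(L, L)
```
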
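Loading the released ESM-1b weights and obtaining contact predictions follows the usage pattern documented in the facebookresearch/esm repository; the names below (`esm.pretrained.esm1b_t33_650M_UR50S`, `get_batch_converter`, `return_contacts`) come from that repository's README, but the exact API may vary between releases, so treat this as a usage sketch rather than a pinned recipe.

```python
# Usage sketch: load ESM-1b from the public repo and read out a contact map.
import torch
import esm

model, alphabet = esm.pretrained.esm1b_t33_650M_UR50S()
model.eval()
batch_converter = alphabet.get_batch_converter()

# Example input sequence (any protein sequence works here).
data = [("protein1", "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG")]
_, _, tokens = batch_converter(data)

with torch.no_grad():
    out = model(tokens, repr_layers=[33], return_contacts=True)
contact_map = out["contacts"][0]   # (L, L) predicted contact probabilities
```
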
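The supervised setup in the Experiment Setup row can be summarized by the configuration sketch below for the n = 20 setting. Only the learning rate, weight decay, projection size, batch size, and early-stopping criterion (validation long-range P@L with patience 10) are taken from the paper's text; the optimizer choice (Adam), the placeholder head architecture, and the training/evaluation calls are assumptions for illustration.

```python
# Configuration sketch for the n = 20 supervised contact head (not the authors' code).
import torch

head = torch.nn.Sequential(
    torch.nn.Linear(1280, 512),   # ESM-1b embedding dim -> projection size 512 (placeholder head)
)
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3, weight_decay=10.0)  # Adam is assumed

best_score, patience, bad_epochs = -1.0, 10, 0
for epoch in range(1000):
    # train_one_epoch(head, optimizer, train_loader)        # hypothetical; batch size 20 => 1 batch per epoch
    # score = long_range_precision_at_L(head, valid_loader)  # hypothetical validation long-range P@L
    score = 0.0  # placeholder value so the sketch runs as-is
    if score > best_score:
        best_score, bad_epochs = score, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:   # stop after 10 epochs without improvement
            break
```
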