Tranception: Protein Fitness Prediction with Autoregressive Transformers and Inference-time Retrieval
Authors: Pascal Notin, Mafalda Dias, Jonathan Frazer, Javier Marchena-Hurtado, Aidan N Gomez, Debora Marks, Yarin Gal
ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We introduce Tranception, a novel transformer architecture leveraging autoregressive predictions and retrieval of homologous sequences at inference to achieve state-of-the-art fitness prediction performance. Given its markedly higher performance on multiple mutants, robustness to shallow alignments and ability to score indels, our approach offers significant gain of scope over existing approaches. To enable more rigorous model testing across a broader range of protein families, we develop ProteinGym, an extensive set of multiplexed assays of variant effects, substantially increasing both the number and diversity of assays compared to existing benchmarks. |
| Researcher Affiliation | Collaboration | ¹OATML Group, Department of Computer Science, University of Oxford, Oxford, UK; ²Marks Group, Department of Systems Biology, Harvard Medical School, Boston, MA, USA; ³Cohere, Toronto, Canada. |
| Pseudocode | No | The paper describes the Tranception architecture and its mechanisms using prose, figures (e.g., Figure 1), and mathematical equations, but it does not include any explicit pseudocode or algorithm blocks. |
| Open Source Code | No | The paper states, 'The ProteinGym benchmarks are made publicly available (both raw and processed assay data) on our GitHub repository.' (Section F) and references performance tables and a reference file on GitHub. However, it does not explicitly state that the source code for the Tranception methodology itself is being released or provide a link to it. Mentions of code for baselines (e.g., EVE, Wavenet) refer to third-party implementations, not the authors' own Tranception code. |
| Open Datasets | Yes | Our models are trained on UniRef (Suzek et al., 2014), a large scale protein sequence database. We therefore train our final model (700M parameters) on UniRef100 which, after preprocessing (Appendix B.2), leads to a training dataset of 250 million protein sequences. The ProteinGym benchmarks are made publicly available (both raw and processed assay data) on our GitHub repository. |
| Dataset Splits | Yes | We use 99% of the data (~249 million sequences) for training and set aside 1% of the data for validation (~2.5 million sequences). To decide between different architecture options while not overfitting these decisions to our benchmark, we selected a small yet representative subset of DMS assays in the ProteinGym substitution benchmark (10 out of 87 substitution DMS assays)... Downstream performance of the different ablations on this validation set, and the overall substitution set are reported in Table 6. We perform a linearly-spaced grid search for α on our validation DMS set and obtain an optimal rate of 0.6. |
| Hardware Specification | Yes | In terms of computing resources, small architectures are trained on 8 V100 GPUs for 1 week, medium architectures with 32 V100 GPUs for 1 week, and our largest model, Tranception L, is trained on 64 A100 GPUs for 2 weeks. |
| Software Dependencies | No | The paper mentions software like the 'AdamW optimizer (Loshchilov & Hutter, 2019)', 'Jackhmmer (Eddy, 2011)', and a 'PyTorch implementation' for baselines, but it does not specify version numbers for these or other key software components (e.g., Python, CUDA, specific libraries). |
| Experiment Setup | Yes | All model variants are trained for 150k steps, with a batch size of 1,024 sequences. We train with the AdamW optimizer (Loshchilov & Hutter, 2019), with a learning rate schedule annealed over the first 10k steps up to the maximum value (3 × 10⁻⁴), and then linearly decreased until the end of training. Other training hyperparameters are summarized in Table 8. (Table 8: Training steps: 150k; Batch size: 1,024; Peak learning rate: 3 × 10⁻⁴; Weight decay: 10⁻⁴; Optimizer: AdamW). We use the default dropout value of 0.1 in all variants. |
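
The Research Type row quotes the abstract's description of blending autoregressive predictions with retrieval of homologous sequences at inference. As a minimal illustrative sketch, not the authors' implementation, the snippet below shows one way such a blend could be scored, assuming the retrieval term is a per-position amino-acid frequency profile computed from a retrieved MSA and that the autoregressive and retrieval log-probabilities are mixed with a scalar weight α (0.6, per the grid search quoted in the Dataset Splits row). Function names, array shapes, and the pseudocount handling are assumptions.

```python
import numpy as np

# Illustrative only -- not the authors' code. `ar_log_probs` stands in for the
# autoregressive model's per-position log-probabilities over the 20 amino acids.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
AA_TO_IDX = {aa: i for i, aa in enumerate(ALPHABET)}

def retrieval_log_prior(msa_sequences, pseudocount=1.0):
    """Per-position amino-acid log-frequencies from aligned homologs (assumed form)."""
    length = len(msa_sequences[0])
    counts = np.full((length, len(ALPHABET)), pseudocount)
    for seq in msa_sequences:
        for pos, aa in enumerate(seq):
            if aa in AA_TO_IDX:              # skip gaps / non-standard characters
                counts[pos, AA_TO_IDX[aa]] += 1.0
    return np.log(counts / counts.sum(axis=1, keepdims=True))

def blended_sequence_score(ar_log_probs, sequence, log_prior, alpha=0.6):
    """Sum over positions of (1 - alpha) * log P_AR + alpha * log P_retrieval."""
    score = 0.0
    for pos, aa in enumerate(sequence):
        idx = AA_TO_IDX[aa]                  # assumes standard amino acids only
        score += (1.0 - alpha) * ar_log_probs[pos, idx] + alpha * log_prior[pos, idx]
    return score
```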
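
The Dataset Splits row also mentions a linearly-spaced grid search for α over a validation set of 10 DMS assays. Building on the sketch above, the following hypothetical helper illustrates such a search, assuming the selection criterion is the average Spearman correlation between blended model scores and measured fitness (Spearman being the paper's headline metric); `validation_assays` and its tuple layout are invented for illustration.

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical grid search for alpha; reuses blended_sequence_score from the
# sketch above. `validation_assays` is an assumed list of tuples:
# (ar_log_probs, log_prior, mutant_sequences, measured_fitness), one per assay.
def select_alpha(validation_assays, grid=np.linspace(0.0, 1.0, 21)):
    best_alpha, best_rho = None, -np.inf
    for alpha in grid:
        rhos = []
        for ar_log_probs, log_prior, mutants, measured in validation_assays:
            scores = [blended_sequence_score(ar_log_probs, m, log_prior, alpha)
                      for m in mutants]
            rhos.append(spearmanr(scores, measured).correlation)
        mean_rho = float(np.mean(rhos))
        if mean_rho > best_rho:
            best_alpha, best_rho = alpha, mean_rho
    return best_alpha, best_rho
```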
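
The Experiment Setup row specifies 150k training steps at batch size 1,024 with AdamW, a learning rate warmed up to 3 × 10⁻⁴ over the first 10k steps and then linearly decayed, weight decay 10⁻⁴, and dropout 0.1. The snippet below is a minimal PyTorch reconstruction of that schedule only, not the released training code; the model is a placeholder module (dropout 0.1 would live in the real architecture's configuration).

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

# Hyperparameters as quoted from Table 8 of the paper.
TOTAL_STEPS = 150_000
WARMUP_STEPS = 10_000
PEAK_LR = 3e-4
WEIGHT_DECAY = 1e-4
BATCH_SIZE = 1_024  # sequences per step; unused in this standalone sketch

model = torch.nn.Linear(128, 128)  # placeholder for the Tranception network
optimizer = AdamW(model.parameters(), lr=PEAK_LR, weight_decay=WEIGHT_DECAY)

def lr_lambda(step: int) -> float:
    """Multiplier on PEAK_LR: linear warmup to 1.0, then linear decay to 0."""
    if step < WARMUP_STEPS:
        return step / WARMUP_STEPS
    return max(0.0, (TOTAL_STEPS - step) / (TOTAL_STEPS - WARMUP_STEPS))

# During training, scheduler.step() would be called once per optimizer step.
scheduler = LambdaLR(optimizer, lr_lambda=lr_lambda)

# Spot-check the schedule at the start, end of warmup, and end of training.
for s in (0, WARMUP_STEPS, TOTAL_STEPS):
    print(s, PEAK_LR * lr_lambda(s))
```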