Evaluating Protein Transfer Learning with TAPE
Authors: Roshan Rao, Nicholas Bhattacharya, Neil Thomas, Yan Duan, Xi Chen, John Canny, Pieter Abbeel, Yun S. Song
NeurIPS 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We benchmark a range of approaches to semi-supervised protein representation learning, which span recent work as well as canonical sequence learning techniques. We find that self-supervised pretraining is helpful for almost all models on all tasks, more than doubling performance in some cases. Table 2 contains results for all benchmarked architectures and training procedures on all downstream tasks in TAPE. |
| Researcher Affiliation | Collaboration | Roshan Rao* (UC Berkeley, roshan_rao@berkeley.edu); Nicholas Bhattacharya* (UC Berkeley, nick_bhat@berkeley.edu); Neil Thomas* (UC Berkeley, nthomas@berkeley.edu); Yan Duan (covariant.ai, rocky@covariant.ai); Xi Chen (covariant.ai, peter@covariant.ai); John Canny (UC Berkeley, canny@berkeley.edu); Pieter Abbeel (UC Berkeley, pabbeel@berkeley.edu); Yun S. Song (UC Berkeley, yss@berkeley.edu) |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Toward this end, all data and code used to run these experiments are available at https://github.com/songlab-cal/tape. |
| Open Datasets | Yes | Toward this end, all data and code used to run these experiments are available at https://github.com/songlab-cal/tape. We use Pfam [33], a database of thirty-one million protein domains used extensively in bioinformatics, as the pretraining corpus for TAPE. The data are from the ProteinNet dataset [25]. |
| Dataset Splits | Yes | We curate tasks into specific training, validation, and test splits to ensure that each task tests biologically relevant generalization that transfers to real-life scenarios. For the remaining data we construct training and test sets using a random 95/5% split. (A split sketch appears below the table.) |
| Hardware Specification | Yes | All self-supervised models are trained on four NVIDIA V100 GPUs for one week. |
| Software Dependencies | No | The paper mentions software components and architectures like LSTM, Transformer, ResNet, and ELMo, but does not provide specific version numbers for any programming languages, libraries, or frameworks used for implementation. |
| Experiment Setup | Yes | We use a 12-layer Transformer with a hidden size of 512 units and 8 attention heads, leading to a 38M-parameter model. Hyperparameters for the other models were chosen to approximately match the number of parameters in the Transformer. Our LSTM consists of two three-layer LSTMs with 1024 hidden units corresponding to the forward and backward language models, whose outputs are concatenated in the final layer, similar to ELMo [5]. For the ResNet we use 35 residual blocks, each containing two convolutional layers with 256 filters, kernel size 9, and dilation rate 2. (A configuration sketch appears below the table.) |
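
The random 95/5% split quoted in the Dataset Splits row can be sketched in a few lines. This is an illustration only, not the authors' released code (which is in the linked TAPE repository); the function name `split_train_test`, the fixed seed, and the placeholder sequence IDs are assumptions.

```python
import random

def split_train_test(sequence_ids, test_fraction=0.05, seed=0):
    """Randomly partition IDs into train/test sets (95/5 by default).

    Minimal sketch of the random 95/5% split described in the paper;
    names and the seed are illustrative, not TAPE's actual code.
    """
    rng = random.Random(seed)          # fixed seed for a reproducible split
    ids = list(sequence_ids)
    rng.shuffle(ids)
    n_test = int(len(ids) * test_fraction)
    return ids[n_test:], ids[:n_test]  # (train_ids, test_ids)

# Example usage with placeholder IDs
train_ids, test_ids = split_train_test([f"seq_{i}" for i in range(1000)])
print(len(train_ids), len(test_ids))   # 950 50
```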
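
For the Experiment Setup row, a minimal PyTorch sketch of the stated hyperparameters helps sanity-check the reported 38M-parameter Transformer and illustrates one ResNet block. The feed-forward width of 2048 (4x the hidden size), the vocabulary of roughly 30 amino-acid tokens, and the module names are assumptions not stated in the quoted text; this is not the TAPE implementation.

```python
import torch
import torch.nn as nn

# Transformer matching the quoted configuration: 12 layers, hidden size 512,
# 8 attention heads. dim_feedforward=2048 and vocab_size=30 are assumptions.
layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, dim_feedforward=2048)
encoder = nn.TransformerEncoder(layer, num_layers=12)
embedding = nn.Embedding(30, 512)  # assumed amino-acid vocabulary size

n_params = sum(p.numel() for p in encoder.parameters()) \
         + sum(p.numel() for p in embedding.parameters())
print(f"{n_params / 1e6:.1f}M parameters")  # ~37.8M under these assumptions


class DilatedResidualBlock(nn.Module):
    """One of the 35 ResNet blocks: two 1D convolutions with 256 filters,
    kernel size 9, and dilation rate 2, plus a residual connection.
    The activation placement is an assumption (sketch only)."""

    def __init__(self, channels=256, kernel_size=9, dilation=2):
        super().__init__()
        padding = dilation * (kernel_size - 1) // 2  # preserve sequence length
        self.conv1 = nn.Conv1d(channels, channels, kernel_size,
                               padding=padding, dilation=dilation)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size,
                               padding=padding, dilation=dilation)

    def forward(self, x):
        h = torch.relu(self.conv1(x))
        h = self.conv2(h)
        return torch.relu(x + h)
```

Under the assumed 2048-wide feed-forward layers, the encoder comes out near 37.8M parameters, consistent with the 38M figure quoted from the paper.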