Evaluating Protein Transfer Learning with TAPE

Authors: Roshan Rao, Nicholas Bhattacharya, Neil Thomas, Yan Duan, Xi Chen, John Canny, Pieter Abbeel, Yun S. Song

NeurIPS 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We benchmark a range of approaches to semi-supervised protein representation learning, which span recent work as well as canonical sequence learning techniques. We find that self-supervised pretraining is helpful for almost all models on all tasks, more than doubling performance in some cases. Table 2 contains results for all benchmarked architectures and training procedures on all downstream tasks in TAPE.
Researcher Affiliation | Collaboration | Roshan Rao* (UC Berkeley, roshan_rao@berkeley.edu); Nicholas Bhattacharya* (UC Berkeley, nick_bhat@berkeley.edu); Neil Thomas* (UC Berkeley, nthomas@berkeley.edu); Yan Duan (covariant.ai, rocky@covariant.ai); Xi Chen (covariant.ai, peter@covariant.ai); John Canny (UC Berkeley, canny@berkeley.edu); Pieter Abbeel (UC Berkeley, pabbeel@berkeley.edu); Yun S. Song (UC Berkeley, yss@berkeley.edu)
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Toward this end, all data and code used to run these experiments are available at https://github.com/songlab-cal/tape. (See the installation sketch after this table.)
Open Datasets | Yes | Toward this end, all data and code used to run these experiments are available at https://github.com/songlab-cal/tape. We use Pfam [33], a database of thirty-one million protein domains used extensively in bioinformatics, as the pretraining corpus for TAPE. The data are from the ProteinNet dataset [25].
Dataset Splits | Yes | We curate tasks into specific training, validation, and test splits to ensure that each task tests biologically relevant generalization that transfers to real-life scenarios. For the remaining data we construct training and test sets using a random 95/5% split. (See the split sketch after this table.)
Hardware Specification | Yes | All self-supervised models are trained on four NVIDIA V100 GPUs for one week.
Software Dependencies | No | The paper mentions software components and architectures like LSTM, Transformer, ResNet, and ELMo, but does not provide specific version numbers for any programming languages, libraries, or frameworks used for implementation.
Experiment Setup | Yes | We use a 12-layer Transformer with a hidden size of 512 units and 8 attention heads, leading to a 38M-parameter model. Hyperparameters for the other models were chosen to approximately match the number of parameters in the Transformer. Our LSTM consists of two three-layer LSTMs with 1024 hidden units, corresponding to the forward and backward language models, whose outputs are concatenated in the final layer, similar to ELMo [5]. For the ResNet we use 35 residual blocks, each containing two convolutional layers with 256 filters, kernel size 9, and dilation rate 2. (See the configuration sketch after this table.)
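
Since the code and data are released at the GitHub URL cited above, here is a minimal sketch of fetching and installing the repository. The install command is an assumption and may differ from the actual setup steps; the repository's README is the authoritative source.

```python
# Hedged sketch: fetch and install the released TAPE code.
# Assumes git and pip are on PATH and that the repository is pip-installable;
# consult the repository's README for the authoritative instructions.
import subprocess

subprocess.run(
    ["git", "clone", "https://github.com/songlab-cal/tape.git"], check=True
)
subprocess.run(["pip", "install", "-e", "tape"], check=True)
```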
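
For the Dataset Splits row, the sketch below illustrates a random 95/5% train/test split like the one the paper describes for the remaining data. The seed, shuffling, and any sequence-identity filtering the authors may have applied are assumptions, not details taken from the paper.

```python
# Illustrative random 95/5% train/test split (not the authors' exact procedure).
import random

def split_95_5(examples, seed=0):
    """Shuffle the examples and return (train, test) lists with a 95/5 split."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    cut = int(0.95 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

train, test = split_95_5([f"seq_{i}" for i in range(1000)])
print(len(train), len(test))  # 950 50
```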
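
For the Experiment Setup row, the sketch below instantiates a 12-layer, 512-unit, 8-head Transformer encoder in PyTorch to make the reported configuration concrete. It is not the authors' implementation: the vocabulary size, feed-forward width, and dropout defaults are assumptions, although with a 2048-unit feed-forward layer the encoder alone comes to roughly 38M parameters, in line with the reported model size.

```python
# Hedged sketch of a Transformer encoder with the reported shape
# (12 layers, hidden size 512, 8 attention heads). Vocabulary size and
# feed-forward width are assumptions, not values from the paper.
import torch
import torch.nn as nn

vocab_size = 30                      # assumed: amino acids plus special tokens
d_model, n_heads, n_layers = 512, 8, 12

embedding = nn.Embedding(vocab_size, d_model)
encoder_layer = nn.TransformerEncoderLayer(
    d_model=d_model, nhead=n_heads, dim_feedforward=2048, batch_first=True
)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)

tokens = torch.randint(0, vocab_size, (2, 128))   # (batch, sequence length)
hidden = encoder(embedding(tokens))               # -> (2, 128, 512)
print(sum(p.numel() for p in encoder.parameters()))  # roughly 38M parameters
```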