Learning protein sequence embeddings using information from structure

Authors: Tristan Bepler, Bonnie Berger

ICLR 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We show empirically that our multi-task framework outperforms other sequence-based methods and even a top-performing structure-based alignment method when predicting structural similarity, our goal. Finally, we demonstrate that our learned embeddings can be transferred to other protein sequence problems, improving the state-of-the-art in transmembrane domain prediction."
Researcher Affiliation | Academia | Tristan Bepler, Computational and Systems Biology, Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139, USA (tbepler@mit.edu); Bonnie Berger, Computer Science and Artificial Intelligence Laboratory, Department of Mathematics, Massachusetts Institute of Technology, Cambridge, MA 02139, USA (bab@mit.edu)
Pseudocode | No | The paper describes methods through narrative and diagrams but does not include formal pseudocode or algorithm blocks.
Open Source Code | Yes | "Source code and datasets are available at https://github.com/tbepler/protein-sequence-embedding-iclr2019"
Open Datasets | Yes | "We first benchmark our model's ability to correctly predict structural similarity between pairs of sequences using the SCOPe ASTRAL dataset [5]" and "8-class secondary structure prediction on a 40% sequence identity filtered dataset containing 22,086 protein sequences from the protein data bank (PDB) [37]"
Dataset Splits | Yes | "The SCOP benchmark datasets are formed by splitting the SCOPe ASTRAL 2.06 dataset, filtered to a maximum sequence identity of 95%, into 22,408 train and 5,602 heldout sequences. From the heldout sequences, we randomly sample 100,000 pairs as the ASTRAL 2.06 structural similarity test set. For these experiments, we hold out 2,240 random sequences from the 22,408 sequences of the training set. From these held out sequences, we randomly sample 100,000 pairs as the validation set." (A sketch of this split procedure follows the table.)
Hardware Specification | Yes | "All models were implemented in PyTorch and trained on a single NVIDIA Tesla V100 GPU. Each model took roughly 3 days to train and required 16 GB of GPU RAM."
Software Dependencies | No | "All models were implemented in PyTorch and trained on a single NVIDIA Tesla V100 GPU" and "Sequence embedding models are trained for 100 epochs using ADAM with a learning rate of 0.001 and otherwise default parameters provided by PyTorch." No specific version numbers for PyTorch or other libraries are given.
Experiment Setup | Yes | "Our encoder consists of 3 biLSTM layers with 512 hidden units each and a final output embedding dimension of 100... Language model hidden states are projected into a 512 dimension vector... In the contact prediction module, we use a hidden layer with dimension 50... Sequence embedding models are trained for 100 epochs using ADAM with a learning rate of 0.001 and otherwise default parameters provided by PyTorch. Each epoch consists of 100,000 examples sampled from the SCOP structural similarity training set... The structural similarity component of the loss is estimated with minibatches of 64 pairs of sequences. When using the full multitask objective, the contact prediction component uses minibatches of 10 sequences and λ = 0.1. Furthermore, during training we apply a small perturbation to the sequences by resampling the amino acid at each position from the uniform distribution with probability 0.05." (A PyTorch sketch of this configuration follows the table.)
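
The split described in the Dataset Splits row can be expressed compactly. The sketch below is our reading of that description, not the authors' code: the `load_astral_206_filtered` loader and the random seed are hypothetical, and we assume the 95% identity filtered SCOPe ASTRAL 2.06 sequences are already available as a list.

```python
# Minimal sketch of the SCOPe ASTRAL 2.06 split quoted above.
# The loader and seed are assumptions; only the split sizes and
# pair counts come from the paper.
import random

random.seed(0)  # seed is our choice, not specified in the paper

def sample_pairs(sequences, n_pairs):
    """Randomly sample n_pairs of distinct (seq_a, seq_b) tuples."""
    return [tuple(random.sample(sequences, 2)) for _ in range(n_pairs)]

# Stand-in for the 95% identity filtered ASTRAL 2.06 sequences.
astral_sequences = load_astral_206_filtered()  # hypothetical helper

random.shuffle(astral_sequences)
heldout = astral_sequences[:5602]    # 5,602 heldout sequences
train = astral_sequences[5602:]      # 22,408 training sequences

# 100,000 heldout pairs form the structural similarity test set.
test_pairs = sample_pairs(heldout, 100_000)

# 2,240 sequences held out from the training set supply the
# 100,000 validation pairs.
val_sequences = train[:2240]
train_sequences = train[2240:]
val_pairs = sample_pairs(val_sequences, 100_000)
```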
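
The encoder dimensions and optimizer settings quoted in the Experiment Setup row translate into the following minimal PyTorch sketch. It is an illustration under stated assumptions rather than the authors' implementation (which is available at the linked repository): the module name, the 21-dimensional one-hot input, and the constant names are ours.

```python
# Sketch of the quoted setup: 3 biLSTM layers with 512 hidden units,
# a 100-dimensional output embedding, and ADAM with lr=0.001.
import torch
import torch.nn as nn

class BiLSTMEncoder(nn.Module):
    def __init__(self, input_dim=21, hidden_dim=512, num_layers=3, embed_dim=100):
        super().__init__()
        # 3 biLSTM layers with 512 hidden units each
        self.rnn = nn.LSTM(input_dim, hidden_dim, num_layers=num_layers,
                           batch_first=True, bidirectional=True)
        # final projection to a 100-dimensional per-position embedding
        self.proj = nn.Linear(2 * hidden_dim, embed_dim)

    def forward(self, x):        # x: (batch, length, input_dim)
        h, _ = self.rnn(x)
        return self.proj(h)      # (batch, length, embed_dim)

encoder = BiLSTMEncoder()
# ADAM with learning rate 0.001, otherwise PyTorch defaults
optimizer = torch.optim.Adam(encoder.parameters(), lr=0.001)

# Training constants taken directly from the quoted setup
NUM_EPOCHS = 100
EXAMPLES_PER_EPOCH = 100_000   # examples sampled from the SCOP similarity training set
SIMILARITY_BATCH = 64          # minibatch of sequence pairs for the similarity loss
CONTACT_BATCH = 10             # minibatch of sequences for contact prediction
LAMBDA = 0.1                   # weight on the contact prediction loss term
RESAMPLE_PROB = 0.05           # per-position amino acid resampling perturbation
```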