Learning protein sequence embeddings using information from structure

Authors: Tristan Bepler, Bonnie Berger

ICLR 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We show empirically that our multi-task framework outperforms other sequence-based methods and even a top-performing structure-based alignment method when predicting structural similarity, our goal. Finally, we demonstrate that our learned embeddings can be transferred to other protein sequence problems, improving the state-of-the-art in transmembrane domain prediction."
Researcher Affiliation | Academia | Tristan Bepler, Computational and Systems Biology, Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139, USA (tbepler@mit.edu); Bonnie Berger, Computer Science and Artificial Intelligence Laboratory, Department of Mathematics, Massachusetts Institute of Technology, Cambridge, MA 02139, USA (bab@mit.edu)
Pseudocode | No | The paper describes methods through narrative and diagrams but does not include formal pseudocode or algorithm blocks.
Open Source Code | Yes | "Source code and datasets are available at https://github.com/tbepler/protein-sequence-embedding-iclr2019"
Open Datasets | Yes | "We first benchmark our model's ability to correctly predict structural similarity between pairs of sequences using the SCOPe ASTRAL dataset [5]" and "8-class secondary structure prediction on a 40% sequence identity filtered dataset containing 22,086 protein sequences from the protein data bank (PDB) [37]"
Dataset Splits | Yes | "The SCOP benchmark datasets are formed by splitting the SCOPe ASTRAL 2.06 dataset, filtered to a maximum sequence identity of 95%, into 22,408 train and 5,602 heldout sequences. From the heldout sequences, we randomly sample 100,000 pairs as the ASTRAL 2.06 structural similarity test set. For these experiments, we hold out 2,240 random sequences from the 22,408 sequences of the training set. From these held out sequences, we randomly sample 100,000 pairs as the validation set." (A sketch of this split procedure follows the table.)
Hardware Specification | Yes | "All models were implemented in PyTorch and trained on a single NVIDIA Tesla V100 GPU. Each model took roughly 3 days to train and required 16 GB of GPU RAM."
Software Dependencies | No | "All models were implemented in PyTorch and trained on a single NVIDIA Tesla V100 GPU" and "Sequence embedding models are trained for 100 epochs using ADAM with a learning rate of 0.001 and otherwise default parameters provided by PyTorch." No specific version numbers for PyTorch or other libraries are given.
Experiment Setup | Yes | "Our encoder consists of 3 biLSTM layers with 512 hidden units each and a final output embedding dimension of 100... Language model hidden states are projected into a 512 dimension vector... In the contact prediction module, we use a hidden layer with dimension 50... Sequence embedding models are trained for 100 epochs using ADAM with a learning rate of 0.001 and otherwise default parameters provided by PyTorch. Each epoch consists of 100,000 examples sampled from the SCOP structural similarity training set... The structural similarity component of the loss is estimated with minibatches of 64 pairs of sequences. When using the full multitask objective, the contact prediction component uses minibatches of 10 sequences and λ = 0.1. Furthermore, during training we apply a small perturbation to the sequences by resampling the amino acid at each position from the uniform distribution with probability 0.05." (A PyTorch sketch of this configuration follows the table.)
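
The split described in the Dataset Splits row can be expressed compactly. The sketch below is our reading of that description, not the authors' code: the `load_astral_206_filtered` loader and the random seed are hypothetical, and we assume the 95% identity filtered SCOPe ASTRAL 2.06 sequences are already available as a list.

```python
# Minimal sketch of the SCOPe ASTRAL 2.06 split quoted above.
# The loader and seed are assumptions; only the split sizes and
# pair counts come from the paper.
import random

random.seed(0)  # seed is our choice, not specified in the paper

def sample_pairs(sequences, n_pairs):
    """Randomly sample n_pairs of distinct (seq_a, seq_b) tuples."""
    return [tuple(random.sample(sequences, 2)) for _ in range(n_pairs)]

# Stand-in for the 95% identity filtered ASTRAL 2.06 sequences.
astral_sequences = load_astral_206_filtered()  # hypothetical helper

random.shuffle(astral_sequences)
heldout = astral_sequences[:5602]    # 5,602 heldout sequences
train = astral_sequences[5602:]      # 22,408 training sequences

# 100,000 heldout pairs form the structural similarity test set.
test_pairs = sample_pairs(heldout, 100_000)

# 2,240 sequences held out from the training set supply the
# 100,000 validation pairs.
val_sequences = train[:2240]
train_sequences = train[2240:]
val_pairs = sample_pairs(val_sequences, 100_000)
```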
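
The encoder dimensions and optimizer settings quoted in the Experiment Setup row translate into the following minimal PyTorch sketch. It is an illustration under stated assumptions rather than the authors' implementation (which is available at the linked repository): the module name, the 21-dimensional one-hot input, and the constant names are ours.

```python
# Sketch of the quoted setup: 3 biLSTM layers with 512 hidden units,
# a 100-dimensional output embedding, and ADAM with lr=0.001.
import torch
import torch.nn as nn

class BiLSTMEncoder(nn.Module):
    def __init__(self, input_dim=21, hidden_dim=512, num_layers=3, embed_dim=100):
        super().__init__()
        # 3 biLSTM layers with 512 hidden units each
        self.rnn = nn.LSTM(input_dim, hidden_dim, num_layers=num_layers,
                           batch_first=True, bidirectional=True)
        # final projection to a 100-dimensional per-position embedding
        self.proj = nn.Linear(2 * hidden_dim, embed_dim)

    def forward(self, x):        # x: (batch, length, input_dim)
        h, _ = self.rnn(x)
        return self.proj(h)      # (batch, length, embed_dim)

encoder = BiLSTMEncoder()
# ADAM with learning rate 0.001, otherwise PyTorch defaults
optimizer = torch.optim.Adam(encoder.parameters(), lr=0.001)

# Training constants taken directly from the quoted setup
NUM_EPOCHS = 100
EXAMPLES_PER_EPOCH = 100_000   # examples sampled from the SCOP similarity training set
SIMILARITY_BATCH = 64          # minibatch of sequence pairs for the similarity loss
CONTACT_BATCH = 10             # minibatch of sequences for contact prediction
LAMBDA = 0.1                   # weight on the contact prediction loss term
RESAMPLE_PROB = 0.05           # per-position amino acid resampling perturbation
```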