Fold2Seq: A Joint Sequence(1D)-Fold(3D) Embedding-based Generative Model for Protein Design

Authors: Yue Cao, Payel Das, Vijil Chenthamarakshan, Pin-Yu Chen, Igor Melnyk, Yang Shen

ICML 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | On test sets with single, high-resolution and complete structure inputs for individual folds, our experiments demonstrate improved or comparable performance of Fold2Seq in terms of speed, coverage, and reliability for sequence design, when compared to existing state-of-the-art methods that include data-driven deep generative models and physics-based Rosetta Design.
Researcher Affiliation | Collaboration | Work primarily done during Yue Cao's internship at IBM Research. Affiliations: IBM Research and Texas A&M University.
Pseudocode | No | The paper describes the model architecture and training strategy in detail with text and diagrams, but it does not include pseudocode or an algorithm block.
Open Source Code | Yes | Source code and data are available at https://github.com/IBM/fold2seq.
Open Datasets | Yes | We used protein structure data from CATH 4.2 (Sillitoe et al., 2019) filtered by 100% sequence identity.
Dataset Splits | Yes | We randomly split the dataset at the fold level into 95%, 2.5%, 2.5% as dataset (a), (b) and (c), respectively, which means that the three datasets have non-overlapping folds. We further randomly split the dataset (a) at the structure level into 95%, 2.5% and 2.5% as dataset (a1), (a2) and (a3), respectively. Datasets (a1), (a2), and (a3) have overlapping folds. We use dataset (a1) as the training set, (b)+(a2) as the validation set, (a3) as the In-Distribution (ID) test set and (c) as the Out-of-Distribution (OD) test set. (A split sketch follows the table.)
Hardware Specification | Yes | We train our model on 2 Tesla K80 GPUs, with batch size 128. ... CPU: Intel Xeon E5-2680 v4, 2.40 GHz; GPU: Nvidia Tesla K80.
Software Dependencies | No | We implement our model in PyTorch (Paszke et al., 2019). The learning rate schedule follows the original transformer paper (Vaswani et al., 2017). (A schedule sketch follows the table.)
Experiment Setup | Yes | Each transformer block has 4 layers and d = 256 latent dimensions. ... We use the exponential decay (Blundell et al., 2015) for λ5 = 1/2^(#epoch − e) in the loss function, while λ1 through λ4 and e are tuned based on the validation set, resulting in λ1 = 1.0, λ2 = 1.0, λ3 = 0.02, λ4 = 1.0, e = 3. We train our model on 2 Tesla K80 GPUs, with batch size 128. In every training stage we train up to 200 epochs with an early stopping strategy based on the validation loss. ... Top-k sampling strategy (Fan et al., 2018) is used for sequence generation, where k is tuned to be 5 based on the validation set. (Sketches of the λ5 schedule and top-k sampling follow the table.)
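
The "Dataset Splits" row describes a two-level random split: a fold-level 95/2.5/2.5 split into datasets (a), (b), (c) with disjoint folds, followed by a structure-level 95/2.5/2.5 split of (a) into (a1), (a2), (a3). Below is a minimal sketch of that procedure, not the authors' released code; the function name, the `entries` input of (fold_id, structure_id) pairs, and the random seed are illustrative assumptions.

```python
# Minimal sketch of the two-level split; `entries` is a hypothetical list of
# (fold_id, structure_id) pairs built from the CATH 4.2 data.
import random
from collections import defaultdict

def split_fold2seq_style(entries, seed=0):
    rng = random.Random(seed)

    # Group structures by fold so the first split is fold-disjoint.
    by_fold = defaultdict(list)
    for fold_id, structure_id in entries:
        by_fold[fold_id].append(structure_id)

    # Fold-level 95 / 2.5 / 2.5 split -> datasets (a), (b), (c).
    folds = list(by_fold)
    rng.shuffle(folds)
    n = len(folds)
    a_folds = folds[: int(0.95 * n)]
    b_folds = folds[int(0.95 * n): int(0.975 * n)]
    c_folds = folds[int(0.975 * n):]

    # Structure-level 95 / 2.5 / 2.5 split inside (a) -> (a1), (a2), (a3);
    # these three parts share folds, unlike (a), (b), (c).
    a_structs = [(f, s) for f in a_folds for s in by_fold[f]]
    rng.shuffle(a_structs)
    m = len(a_structs)
    a1 = a_structs[: int(0.95 * m)]                  # training set
    a2 = a_structs[int(0.95 * m): int(0.975 * m)]    # joins (b) in the validation set
    a3 = a_structs[int(0.975 * m):]                  # In-Distribution (ID) test set

    b = [(f, s) for f in b_folds for s in by_fold[f]]  # rest of the validation set
    c = [(f, s) for f in c_folds for s in by_fold[f]]  # Out-of-Distribution (OD) test set
    return a1, a2, a3, b, c
```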
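The "Software Dependencies" row states that the learning-rate schedule follows the original transformer paper. The sketch below is the published Vaswani et al. (2017) warmup-then-decay rule, lr = d_model^-0.5 · min(step^-0.5, step · warmup_steps^-1.5); d_model = 256 matches the reported latent dimension, but warmup_steps = 4000 is the default from that paper and is an assumption here.

```python
# Sketch of the Vaswani et al. (2017) learning-rate schedule.
def transformer_lr(step, d_model=256, warmup_steps=4000):
    step = max(step, 1)  # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```

In PyTorch this kind of rule is commonly attached via torch.optim.lr_scheduler.LambdaLR with a base learning rate of 1.0, so the returned value becomes the effective learning rate.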
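The "Experiment Setup" row quotes an exponentially decayed weight for λ5 and top-k sampling with k = 5. The sketch below assumes the reading λ5 = 1/2^(#epoch − e) with e = 3 and shows a generic top-k sampling step over next-token logits; both functions are illustrative and not taken from the released code.

```python
import torch

def lambda5_weight(epoch, e=3):
    # Exponential decay: halves every epoch and equals 1.0 when epoch == e.
    # Assumes the form lambda_5 = 1 / 2**(epoch - e).
    return 1.0 / (2.0 ** (epoch - e))

def top_k_sample(logits, k=5):
    # Keep the k largest logits, renormalize, and sample one token id
    # (Fan et al., 2018); `logits` is a 1-D tensor over the vocabulary.
    topk_vals, topk_idx = torch.topk(logits, k)
    probs = torch.softmax(topk_vals, dim=-1)
    return topk_idx[torch.multinomial(probs, 1)].item()
```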