Fold2Seq: A Joint Sequence(1D)-Fold(3D) Embedding-based Generative Model for Protein Design

Authors: Yue Cao, Payel Das, Vijil Chenthamarakshan, Pin-Yu Chen, Igor Melnyk, Yang Shen

ICML 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | On test sets with single, high-resolution and complete structure inputs for individual folds, our experiments demonstrate improved or comparable performance of Fold2Seq in terms of speed, coverage, and reliability for sequence design, when compared to existing state-of-the-art methods that include data-driven deep generative models and physics-based Rosetta Design.
Researcher Affiliation | Collaboration | Work primarily done during Yue Cao's internship at IBM Research. Affiliations: IBM Research and Texas A&M University.
Pseudocode | No | The paper describes the model architecture and training strategy in detail with text and diagrams, but it does not include pseudocode or an algorithm block.
Open Source Code | Yes | Source code and data are available at https://github.com/IBM/fold2seq.
Open Datasets | Yes | We used protein structure data from CATH 4.2 (Sillitoe et al., 2019) filtered by 100% sequence identity.
Dataset Splits | Yes | We randomly split the dataset at the fold level into 95%, 2.5%, 2.5% as dataset (a), (b) and (c), respectively, which means that the three datasets have non-overlapping folds. We further randomly split the dataset (a) at the structure level into 95%, 2.5% and 2.5% as dataset (a1), (a2) and (a3), respectively. Datasets (a1), (a2), and (a3) have overlapping folds. We use dataset (a1) as the training set, (b)+(a2) as the validation set, (a3) as the In-Distribution (ID) test set and (c) as the Out-of-Distribution (OD) test set. (A split sketch follows the table.)
Hardware Specification | Yes | We train our model on 2 Tesla K80 GPUs, with batch size 128. ... CPU: Intel Xeon E5-2680 v4, 2.40 GHz; GPU: Nvidia Tesla K80.
Software Dependencies | No | We implement our model in PyTorch (Paszke et al., 2019). The learning rate schedule follows the original transformer paper (Vaswani et al., 2017). (A schedule sketch follows the table.)
Experiment Setup | Yes | Each transformer block has 4 layers and d = 256 latent dimensions. ... We use the exponential decay (Blundell et al., 2015) for λ5 = 1/2^(#epoch − e) in the loss function, while λ1 through λ4 and e are tuned based on the validation set, resulting in λ1 = 1.0, λ2 = 1.0, λ3 = 0.02, λ4 = 1.0, e = 3. We train our model on 2 Tesla K80 GPUs, with batch size 128. In every training stage we train up to 200 epochs with an early stopping strategy based on the validation loss. ... Top-k sampling strategy (Fan et al., 2018) is used for sequence generation, where k is tuned to be 5 based on the validation set. (Sketches of the λ5 schedule and top-k sampling follow the table.)
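
The "Dataset Splits" row describes a two-level random split: a fold-level 95/2.5/2.5 split into datasets (a), (b), (c) with disjoint folds, followed by a structure-level 95/2.5/2.5 split of (a) into (a1), (a2), (a3). Below is a minimal sketch of that procedure, not the authors' released code; the function name, the `entries` input of (fold_id, structure_id) pairs, and the random seed are illustrative assumptions.

```python
# Minimal sketch of the two-level split; `entries` is a hypothetical list of
# (fold_id, structure_id) pairs built from the CATH 4.2 data.
import random
from collections import defaultdict

def split_fold2seq_style(entries, seed=0):
    rng = random.Random(seed)

    # Group structures by fold so the first split is fold-disjoint.
    by_fold = defaultdict(list)
    for fold_id, structure_id in entries:
        by_fold[fold_id].append(structure_id)

    # Fold-level 95 / 2.5 / 2.5 split -> datasets (a), (b), (c).
    folds = list(by_fold)
    rng.shuffle(folds)
    n = len(folds)
    a_folds = folds[: int(0.95 * n)]
    b_folds = folds[int(0.95 * n): int(0.975 * n)]
    c_folds = folds[int(0.975 * n):]

    # Structure-level 95 / 2.5 / 2.5 split inside (a) -> (a1), (a2), (a3);
    # these three parts share folds, unlike (a), (b), (c).
    a_structs = [(f, s) for f in a_folds for s in by_fold[f]]
    rng.shuffle(a_structs)
    m = len(a_structs)
    a1 = a_structs[: int(0.95 * m)]                  # training set
    a2 = a_structs[int(0.95 * m): int(0.975 * m)]    # joins (b) in the validation set
    a3 = a_structs[int(0.975 * m):]                  # In-Distribution (ID) test set

    b = [(f, s) for f in b_folds for s in by_fold[f]]  # rest of the validation set
    c = [(f, s) for f in c_folds for s in by_fold[f]]  # Out-of-Distribution (OD) test set
    return a1, a2, a3, b, c
```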
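The "Software Dependencies" row states that the learning-rate schedule follows the original transformer paper. The sketch below is the published Vaswani et al. (2017) warmup-then-decay rule, lr = d_model^-0.5 · min(step^-0.5, step · warmup_steps^-1.5); d_model = 256 matches the reported latent dimension, but warmup_steps = 4000 is the default from that paper and is an assumption here.

```python
# Sketch of the Vaswani et al. (2017) learning-rate schedule.
def transformer_lr(step, d_model=256, warmup_steps=4000):
    step = max(step, 1)  # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```

In PyTorch this kind of rule is commonly attached via torch.optim.lr_scheduler.LambdaLR with a base learning rate of 1.0, so the returned value becomes the effective learning rate.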
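The "Experiment Setup" row quotes an exponentially decayed weight for λ5 and top-k sampling with k = 5. The sketch below assumes the reading λ5 = 1/2^(#epoch − e) with e = 3 and shows a generic top-k sampling step over next-token logits; both functions are illustrative and not taken from the released code.

```python
import torch

def lambda5_weight(epoch, e=3):
    # Exponential decay: halves every epoch and equals 1.0 when epoch == e.
    # Assumes the form lambda_5 = 1 / 2**(epoch - e).
    return 1.0 / (2.0 ** (epoch - e))

def top_k_sample(logits, k=5):
    # Keep the k largest logits, renormalize, and sample one token id
    # (Fan et al., 2018); `logits` is a 1-D tensor over the vocabulary.
    topk_vals, topk_idx = torch.topk(logits, k)
    probs = torch.softmax(topk_vals, dim=-1)
    return topk_idx[torch.multinomial(probs, 1)].item()
```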