Importance Weighted Expectation-Maximization for Protein Sequence Design

Authors: Zhenqiao Song, Lei Li

ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response (evidence quoted from the paper)
Research Type | Experimental | "Experiments on eight protein sequence design tasks show that our IsEM-Pro outperforms the previous best methods by at least 55% on average fitness score and generates more diverse and novel protein sequences. We carry out extensive experiments on eight protein sequence design tasks and compare the proposed method with previous strong baselines."
Researcher Affiliation | Academia | "Zhenqiao Song, Lei Li (Department of Computer Science, University of California, Santa Barbara, California, the United States). Correspondence to: Zhenqiao Song <zhenqiao@ucsb.edu>, Lei Li <leili@cs.ucsb.edu>."
Pseudocode | Yes | "Algorithm 1: Importance Sampling based Expectation-Maximization Training" (a toy sketch of this training scheme follows the table)
Open Source Code | Yes | "The code is available at https://github.com/JocelynSong/IsEM-Pro.git."
Open Datasets | Yes | "The detailed data statistics, including protein sequence length, data size and data source are provided in Appendix A." Table 5 (detailed statistics of the eight protein datasets) lists, for example: avGFP, length 237, 49,855 sequences, source https://figshare.com/articles/dataset/Local_fitness_landscape_of_the_green_fluorescent_protein
Dataset Splits | Yes | "We randomly split each dataset into training/validation sets with the ratio of 9:1." (a minimal split snippet follows the table)
Hardware Specification | Yes | "The model is trained with 1 NVIDIA RTX A6000 GPU card."
Software Dependencies | No | Same passage as the Experiment Setup row below: it names Transformer (Vaswani et al., 2017), ESM-2 (Lin et al., 2022), and the Adam optimizer (Kingma & Ba, 2014), but gives no library or framework versions.
Experiment Setup | Yes | "Our model is built based on Transformer (Vaswani et al., 2017) with 6-layer encoder initialized by ESM-2 (Lin et al., 2022) and 2-layer decoder with random initialization, of which the encoder parameters are fixed during training process. Thus the MRFs features are only incorporated in decoder. The model hidden size and feed-forward hidden size are set to 320 and 1280 respectively as ESM-2. We use the [CLS] representation from the last layer of encoder to calculate the mean and variance vectors of the latent variable through single-layer mapping. Then the sampled latent vector is used as the first token input of decoder. The latent vector size is correspondingly set to 320. We first train a VAE model as Pθ for 30 epochs and ϕ(0) is initialized by θ. The number of iterations in the importance sampling based VEM is set to 10. The protein combinatorial structure constraints ε are learned on the training sequences for each dataset instead of real multiple sequence alignments (MSAs) to keep a fair comparison. The mini-batch size and learning rate are set to 4,096 tokens and 1e-5 respectively. The model is trained with 1 NVIDIA RTX A6000 GPU card. We apply Adam algorithm (Kingma & Ba, 2014) as the optimizer with a linear warm-up over the first 4,000 steps and linear decay for later steps. We randomly split each dataset into training/validation sets with the ratio of 9:1. We run all the experiments five times and report the average scores. More experimental settings are given in Appendix B.1." (a configuration sketch of these settings follows the table)
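The Pseudocode row refers to Algorithm 1 (Importance Sampling based Expectation-Maximization Training), which the paper applies to a latent-variable sequence model. The toy below is a minimal, self-contained sketch of the general importance-sampling EM idea on a categorical "model" over a fixed candidate pool; the fitness function, temperature, smoothing, and pool construction are illustrative assumptions, not the paper's Algorithm 1.

```python
import math
import random

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def random_seq(length=8):
    return "".join(random.choice(ALPHABET) for _ in range(length))

def toy_fitness(seq):
    # Stand-in fitness oracle: fraction of hydrophobic residues (purely illustrative).
    return sum(seq.count(a) for a in "AILMFVW") / len(seq)

def sample_from(probs):
    seqs, weights = zip(*probs.items())
    return random.choices(seqs, weights=weights, k=1)[0]

def isem_toy(n_iters=10, pool_size=200, n_samples=500, temperature=5.0):
    # "Model": a categorical distribution over a fixed pool of candidate sequences.
    pool = [random_seq() for _ in range(pool_size)]
    probs = {s: 1.0 / pool_size for s in pool}

    for _ in range(n_iters):
        proposal = dict(probs)  # proposal q = current model

        # E-step: draw samples from the proposal and compute self-normalized
        # importance weights. With q equal to the current model, the weight for a
        # fitness-tilted target p(x) * exp(T * f(x)) reduces to exp(T * f(x)).
        samples = [sample_from(proposal) for _ in range(n_samples)]
        log_w = [temperature * toy_fitness(x) for x in samples]
        m = max(log_w)
        w = [math.exp(lw - m) for lw in log_w]
        z = sum(w)

        # M-step: weighted maximum likelihood for a categorical model is just the
        # normalized weighted empirical distribution (with tiny smoothing).
        new_probs = {s: 1e-9 for s in pool}
        for x, wi in zip(samples, w):
            new_probs[x] += wi / z
        total = sum(new_probs.values())
        probs = {s: p / total for s, p in new_probs.items()}

    return probs

if __name__ == "__main__":
    final = isem_toy()
    best = max(final, key=final.get)
    print("top sequence:", best, "fitness:", round(toy_fitness(best), 3))
```

Each iteration tilts the sampling distribution toward higher-fitness sequences, which is the qualitative behavior the paper's importance-sampling EM is designed to achieve; the real method operates on a VAE with an ESM-2 encoder rather than a categorical pool.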
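The Dataset Splits row states only a random 9:1 training/validation split. A minimal sketch of such a split; the seed and shuffling details are my assumptions:

```python
import random

def split_90_10(records, seed=0):
    # Shuffle and cut at 90%; the paper states only a random 9:1
    # training/validation split, so the seed and shuffling are assumptions.
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = int(0.9 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

train, valid = split_90_10([f"seq_{i}" for i in range(1000)])
print(len(train), len(valid))  # 900 100
```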
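The Experiment Setup row packs the reported hyper-parameters into one paragraph. The sketch below collects them into a config object and reproduces the described Adam schedule (linear warm-up over 4,000 steps, then linear decay) in PyTorch; the field names, the decay horizon, and the use of LambdaLR are my assumptions, not the authors' code.

```python
from dataclasses import dataclass

import torch

@dataclass
class ExperimentConfig:
    # Values transcribed from the quoted experiment setup; the field names
    # and this grouping are mine, not the authors'.
    encoder_layers: int = 6        # ESM-2-initialized, frozen during training
    decoder_layers: int = 2        # randomly initialized
    hidden_size: int = 320
    ffn_hidden_size: int = 1280
    latent_size: int = 320
    vae_pretrain_epochs: int = 30
    em_iterations: int = 10
    batch_tokens: int = 4096
    learning_rate: float = 1e-5
    warmup_steps: int = 4000

def adam_with_linear_warmup(params, cfg: ExperimentConfig, total_steps: int = 100_000):
    # Adam with linear warm-up over the first cfg.warmup_steps steps and linear
    # decay afterwards. The decay horizon (total_steps) is an assumption; the
    # paper only says "linear decay for later steps".
    optimizer = torch.optim.Adam(params, lr=cfg.learning_rate)

    def lr_lambda(step: int) -> float:
        if step < cfg.warmup_steps:
            return step / max(1, cfg.warmup_steps)
        return max(0.0, (total_steps - step) / max(1, total_steps - cfg.warmup_steps))

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler

# Example usage on a placeholder module standing in for the decoder parameters:
model = torch.nn.Linear(320, 320)
optimizer, scheduler = adam_with_linear_warmup(model.parameters(), ExperimentConfig())
```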