Importance Weighted Expectation-Maximization for Protein Sequence Design
Authors: Zhenqiao Song, Lei Li
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on eight protein sequence design tasks show that our IsEM-Pro outperforms the previous best methods by at least 55% on average fitness score and generates more diverse and novel protein sequences. We carry out extensive experiments on eight protein sequence design tasks and compare the proposed method with previous strong baselines. |
| Researcher Affiliation | Academia | Zhenqiao Song, Lei Li. Department of Computer Science, University of California, Santa Barbara, California, the United States. Correspondence to: Zhenqiao Song <zhenqiao@ucsb.edu>, Lei Li <leili@cs.ucsb.edu>. |
| Pseudocode | Yes | Algorithm 1 Importance Sampling based Expectation-Maximization Training (an illustrative sketch follows the table) |
| Open Source Code | Yes | The code is available at https://github.com/JocelynSong/IsEM-Pro.git. |
| Open Datasets | Yes | The detailed data statistics, including protein sequence length, data size and data source are provided in Appendix A. Table 5. Detailed statistics of the eight protein datasets (excerpt): avGFP, length 237, data size 49,855, source https://figshare.com/articles/dataset/Local_fitness_landscape_of_the_green_fluorescent_protein |
| Dataset Splits | Yes | We randomly split each dataset into training/validation sets with the ratio of 9:1. (A split sketch follows the table.) |
| Hardware Specification | Yes | The model is trained with 1 NVIDIA RTX A6000 GPU card. |
| Software Dependencies | No | Our model is built based on Transformer (Vaswani et al., 2017) with 6-layer encoder initialized by ESM-2 (Lin et al., 2022) and 2-layer decoder with random initialization ... We apply Adam algorithm (Kingma & Ba, 2014) as the optimizer with a linear warm-up over the first 4,000 steps and linear decay for later steps. |
| Experiment Setup | Yes | Our model is built based on Transformer (Vaswani et al., 2017) with 6-layer encoder initialized by ESM-2 (Lin et al., 2022) and 2-layer decoder with random initialization, of which the encoder parameters are fixed during training process. Thus the MRFs features are only incorporated in decoder. The model hidden size and feed-forward hidden size are set to 320 and 1280 respectively as ESM-2. We use the [CLS] representation from the last layer of encoder to calculate the mean and variance vectors of the latent variable through single-layer mapping. Then the sampled latent vector is used as the first token input of decoder. The latent vector size is correspondingly set to 320. We first train a VAE model as Pθ for 30 epochs and ϕ(0) is initialized by θ. The number of iterative process in the importance sampling based VEM is set to 10. The protein combinatorial structure constraints ε are learned on the training sequences for each dataset instead of real multiple sequence alignments (MSAs) to keep a fair comparison. The mini-batch size and learning rate are set to 4,096 tokens and 1e-5 respectively. The model is trained with 1 NVIDIA RTX A6000 GPU card. We apply Adam algorithm (Kingma & Ba, 2014) as the optimizer with a linear warm-up over the first 4,000 steps and linear decay for later steps. We randomly split each dataset into training/validation sets with the ratio of 9:1. We run all the experiments for five times and report the average scores. More experimental settings are given in Appendix B.1. (A configuration sketch based on this excerpt follows the table.) |
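The Pseudocode row above cites Algorithm 1, an importance-sampling-based expectation-maximization training loop. The paper applies it to a latent-variable Transformer with MRF-derived structure constraints; as a much smaller illustration of the same reweight-then-refit pattern, the sketch below runs importance-sampling-based EM on a toy per-position categorical model. Everything here (the alphabet size, `toy_fitness`, `TEMPERATURE`, the sample counts) is an assumption for illustration, not the authors' model or settings.

```python
# Minimal, self-contained sketch (NOT the authors' code) of an
# importance-sampling-based EM loop in the spirit of Algorithm 1.
import numpy as np

rng = np.random.default_rng(0)
ALPHABET = 4        # toy residue alphabet (real proteins use 20 amino acids)
LENGTH = 8          # toy sequence length
N_SAMPLES = 2048    # sequences drawn per E-step
N_ITERS = 10        # the paper reports 10 EM iterations
TEMPERATURE = 0.5   # assumed temperature for the fitness weighting

def toy_fitness(seqs):
    """Hypothetical stand-in for a fitness oracle: rewards residue 0."""
    return (seqs == 0).mean(axis=1)

# phi[j] holds the categorical proposal over residues at position j;
# it stands in for the latent-variable Transformer generator.
phi = np.full((LENGTH, ALPHABET), 1.0 / ALPHABET)

for it in range(N_ITERS):
    # E-step: sample sequences from the current proposal and compute
    # self-normalized importance weights proportional to exp(fitness / T).
    seqs = np.stack(
        [rng.choice(ALPHABET, size=N_SAMPLES, p=phi[j]) for j in range(LENGTH)],
        axis=1,
    )
    weights = np.exp(toy_fitness(seqs) / TEMPERATURE)
    weights /= weights.sum()

    # M-step: weighted maximum-likelihood refit of the proposal.
    new_phi = np.zeros_like(phi)
    for j in range(LENGTH):
        for a in range(ALPHABET):
            new_phi[j, a] = weights[seqs[:, j] == a].sum()
    phi = (new_phi + 1e-6) / (new_phi + 1e-6).sum(axis=1, keepdims=True)

    print(f"iter {it}: mean sampled fitness = {toy_fitness(seqs).mean():.3f}")
```

Each E-step reweights sampled sequences by exp(fitness/T) and each M-step refits the generator to the weighted samples, which mirrors, at toy scale, the structure that Algorithm 1 describes for the full latent-variable model.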
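The Dataset Splits row reports a random 9:1 training/validation split for each dataset. A minimal sketch of such a split follows; the function name and seed are illustrative, not from the released code.

```python
import random

def split_9_to_1(records, seed=0):
    """Shuffle and split a dataset into 90% training / 10% validation,
    matching the 9:1 ratio reported in the table above."""
    rng = random.Random(seed)
    shuffled = records[:]            # copy so the input is left untouched
    rng.shuffle(shuffled)
    cut = int(0.9 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

train, valid = split_9_to_1([f"seq_{i}" for i in range(100)])
print(len(train), len(valid))        # 90 10
```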
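The Experiment Setup row describes the pieces of the configuration most readers will want to re-create: a 320-dimensional hidden size, a single-layer mapping from the encoder's [CLS] representation to the latent mean and variance, a 320-dimensional latent fed as the first decoder token, and Adam at 1e-5 with linear warm-up over the first 4,000 steps followed by linear decay. The PyTorch sketch below mirrors only the latent head and the learning-rate schedule; `LatentHead`, `total_steps`, and the placeholder module are assumptions for illustration, not the released implementation.

```python
import torch
import torch.nn as nn

class LatentHead(nn.Module):
    """Maps an encoder [CLS] vector to a sampled latent that would serve
    as the decoder's first token input (reparameterization trick)."""

    def __init__(self, hidden_size: int = 320, latent_size: int = 320):
        super().__init__()
        # single-layer mapping to the mean and log-variance of the latent
        self.to_stats = nn.Linear(hidden_size, 2 * latent_size)

    def forward(self, cls_repr: torch.Tensor) -> torch.Tensor:
        mean, logvar = self.to_stats(cls_repr).chunk(2, dim=-1)
        return mean + torch.exp(0.5 * logvar) * torch.randn_like(mean)

model = LatentHead()                  # placeholder for the full encoder-decoder
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)

total_steps = 100_000                 # assumed; not stated in the excerpt
warmup_steps = 4_000

def lr_lambda(step: int) -> float:
    # linear warm-up over the first 4,000 steps, then linear decay
    if step < warmup_steps:
        return step / warmup_steps
    return max(0.0, (total_steps - step) / (total_steps - warmup_steps))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

z = LatentHead()(torch.randn(2, 320))  # two fake [CLS] vectors
print(z.shape)                         # torch.Size([2, 320])
```

In the actual model, the [CLS] vector would come from the frozen 6-layer ESM-2-initialized encoder and the sampled latent would be prepended to the 2-layer decoder's input, per the quoted setup.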