Improving protein optimization with smoothed fitness landscapes
Authors: Andrew Kirjner, Jason Yim, Raman Samusevich, Shahar Bracha, Tommi S. Jaakkola, Regina Barzilay, Ila R. Fiete
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We find optimizing in this smoothed landscape leads to improved performance across multiple methods in the GFP and AAV benchmarks. Second, we achieve state-of-the-art results utilizing discrete energy-based models and MCMC in the smoothed landscape. Our method, called Gibbs sampling with Graph-based Smoothing (GGS), demonstrates a unique ability to achieve 2.5 fold fitness improvement (with in-silico evaluation) over its training set. GGS demonstrates potential to optimize proteins in the limited data regime. |
| Researcher Affiliation | Collaboration | Andrew Kirjner, Massachusetts Institute of Technology, kirjner@mit.edu; Jason Yim, Massachusetts Institute of Technology, jyim@mit.edu; Raman Samusevich, IOCB Prague, Czech Academy of Sciences / CIIRC, Czech Technical University in Prague / University of Chemistry and Technology, Prague, raman.samusevich@uochb.cas.cz; Shahar Bracha, Massachusetts Institute of Technology, shaharbr@mit.edu; Tommi Jaakkola, Massachusetts Institute of Technology, tommi@csail.mit.edu; Regina Barzilay, Massachusetts Institute of Technology, regina@csail.mit.edu; Ila Fiete, Massachusetts Institute of Technology, fiete@mit.edu |
| Pseudocode | Yes | The full algorithm for smoothing and clustered sampling is provided in Algorithm 1. The smoothing algorithm is in Algorithm 2. We provide the GWG algorithm in Algorithm 3. The graph construction algorithm can be found in Algorithm 4. (Hedged sketches of the smoothing and GWG steps appear after this table.) |
| Open Source Code | Yes | Code: https://github.com/kirjner/GGS |
| Open Datasets | Yes | To evaluate our method, we introduce a set of tasks using the well studied Green Fluorescent Proteins (GFP) (Sarkisyan et al., 2016) and Adeno-Associated Virus (AAV) (Bryant et al., 2021). We chose GFP and AAV because of their real-world importance and availability of large mutational data. |
| Dataset Splits | No | The paper discusses 'Train MAE' and 'Holdout MAE' in Appendix C.2.3, implying a split used to evaluate the smoothed model's extrapolation. However, it does not give specific percentages or sample counts for training, validation, or test splits, either for the evaluator model or for the data behind the Fitness, Diversity, and Novelty metrics reported on generated sequences. |
| Hardware Specification | Yes | Training is performed with batch size 1024, ADAM optimizer (Kingma & Ba, 2014) (with β1 = 0.9, β2 = 0.999), learning rate 0.0001, and 50 epochs, using a single A6000 Nvidia GPU. |
| Software Dependencies | No | The paper mentions 'ADAM optimizer (Kingma & Ba, 2014)' but does not provide specific version numbers for it or any other software libraries or frameworks used (e.g., Python, PyTorch, TensorFlow). |
| Experiment Setup | Yes | We use τ = 0.1, R = 15 rounds and Nprop = 100 proposals per round during GWG, at which point sequences would converge and more sampling did not give improvements. We choose the smoothing weight γ = 1.0 through grid search. Training is performed with batch size 1024, ADAM optimizer (Kingma & Ba, 2014) (with β1 = 0.9, β2 = 0.999), learning rate 0.0001, and 50 epochs. (A hedged training-loop sketch appears after this table.) |
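
To make the Pseudocode row concrete, here is a minimal sketch of graph-based label smoothing in the spirit of Algorithm 2. It assumes the standard Tikhonov form consistent with a single "smoothing weight γ": smoothed labels minimize `||s - y||² + γ sᵀLs` over the sequence graph, giving the closed-form solve below. The function name, the use of an unnormalized Laplacian, and the dense adjacency are our assumptions, not the repository's code.

```python
import torch

def smooth_labels(y: torch.Tensor, adj: torch.Tensor, gamma: float = 1.0) -> torch.Tensor:
    """Graph-based fitness smoothing (a sketch, not the paper's exact code).

    Solves  argmin_s ||s - y||^2 + gamma * s^T L s,  where L is the graph
    Laplacian of the sequence graph, via the closed form s = (I + gamma*L)^{-1} y.

    y:   (N,) noisy fitness labels, one per sequence node.
    adj: (N, N) adjacency matrix of the sequence graph (e.g., from Algorithm 4).
    """
    deg = adj.sum(dim=1)
    laplacian = torch.diag(deg) - adj          # unnormalized graph Laplacian
    eye = torch.eye(adj.shape[0], dtype=y.dtype)
    return torch.linalg.solve(eye + gamma * laplacian, y)
```

The paper's grid-searched γ = 1.0 would simply be passed as `gamma=1.0` here; larger γ pulls each node's label further toward its graph neighbors.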
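The GWG step (Algorithm 3) is Gibbs with Gradients (Grathwohl et al., 2021) applied to one-hot protein sequences: a first-order Taylor estimate scores every single-point mutation, and candidate mutations are sampled from a softmax over those scores at temperature τ. The sketch below uses the paper's τ = 0.1 and Nprop = 100, but omits the Metropolis-Hastings correction and the clustering of Algorithm 1; all names are ours.

```python
import torch
import torch.nn.functional as F

def gwg_round(model, x, tau: float = 0.1, n_prop: int = 100) -> torch.Tensor:
    """One round of Gibbs-with-Gradients proposals (a sketch of Algorithm 3).

    x:     (B, L, V) one-hot sequences; model maps one-hot -> scalar fitness.
    Returns (B * n_prop, L, V) single-mutation proposals per input sequence.
    """
    x = x.detach().requires_grad_(True)
    grad, = torch.autograd.grad(model(x).sum(), x)
    # d[b, i, j] ~= f(x with position i mutated to token j) - f(x):
    # the gradient at (i, j) minus the gradient at the current token of i.
    d = grad - (grad * x).sum(dim=-1, keepdim=True)
    B, L, V = x.shape
    probs = F.softmax(d.reshape(B, L * V) / tau, dim=-1)
    picks = torch.multinomial(probs, n_prop, replacement=True)       # (B, n_prop)
    pos = torch.div(picks, V, rounding_mode="floor")
    tok = picks % V
    # Materialize n_prop copies of each sequence and apply the sampled mutations.
    proposals = x.detach().repeat_interleave(n_prop, dim=0).reshape(B, n_prop, L, V)
    b_idx = torch.arange(B).unsqueeze(1).expand(B, n_prop)
    p_idx = torch.arange(n_prop).unsqueeze(0).expand(B, n_prop)
    proposals[b_idx, p_idx, pos] = F.one_hot(tok, V).to(x.dtype)
    return proposals.reshape(B * n_prop, L, V)
```

In the full method this round would be repeated R = 15 times, with an accept/reject step between rounds; the paper reports that more sampling beyond this gave no further improvement.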
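Finally, the Experiment Setup row translates directly into a training configuration. The sketch below fixes only what the paper states (Adam with β1 = 0.9, β2 = 0.999, learning rate 1e-4, batch size 1024, 50 epochs); the model architecture, MSE loss, sequence dimensions, and random stand-in data are illustrative assumptions, since the paper does not pin down framework versions.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

SEQ_LEN, VOCAB = 237, 20                      # GFP-like dimensions (illustrative)

# Placeholder fitness regressor; the paper's architecture may differ.
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(SEQ_LEN * VOCAB, 256),
    nn.ReLU(),
    nn.Linear(256, 1),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999))

# Dummy one-hot-style sequences and fitness labels in place of the GFP/AAV data.
x = torch.randn(4096, SEQ_LEN, VOCAB)
y = torch.randn(4096, 1)
loader = DataLoader(TensorDataset(x, y), batch_size=1024, shuffle=True)

for epoch in range(50):
    for seqs, fitness in loader:
        optimizer.zero_grad()
        loss = nn.functional.mse_loss(model(seqs), fitness)
        loss.backward()
        optimizer.step()
```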