Proximal Exploration for Model-guided Protein Sequence Design
Authors: Zhizhou Ren, Jiahan Li, Fan Ding, Yuan Zhou, Jianzhu Ma, Jian Peng
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In experiments, we extensively evaluate our method on a suite of in-silico protein sequence design tasks and demonstrate substantial improvement over baseline algorithms. |
| Researcher Affiliation | Collaboration | 1Helixon Limited 2Department of Computer Science, University of Illinois at Urbana-Champaign 3Institute for Artificial Intelligence, Peking University 4Yau Mathematical Sciences Center, Tsinghua University 5Institute for Industry AI Research, Tsinghua University. |
| Pseudocode | Yes | Algorithm 1 Proximal Exploration (PEX) |
| Open Source Code | Yes | The source code of our algorithm implementation and oracle landscape simulation models are available at https://github.com/HeliXonProtein/proximal-exploration. |
| Open Datasets | Yes | We collect several large-scale datasets from previous experimental studies of protein landscape and use TAPE (Rao et al., 2019) as an oracle model to simulate the landscape. |
| Dataset Splits | No | The paper describes an online learning and batch optimization setting where the model is iteratively refined, rather than using a static validation split for hyperparameter tuning or early stopping. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for experiments, such as CPU or GPU models, or memory specifications. |
| Software Dependencies | No | The paper mentions tools like TAPE and ESM-1b but does not provide specific version numbers for these or other software dependencies (e.g., Python, PyTorch, TensorFlow). |
| Experiment Setup | Yes | The exploration procedure contains 10 rounds of black-box queries. Each batch contains 100 sequences. The context window size is Lc = 21 (i.e., radius = 10). We use the Adam optimizer with a learning rate of 10^-3 to train the fitness prediction. The loss function is the mean squared error. We stop network training when the training loss does not decrease in 10 epochs. We consider an ensemble of three CNN models as the default configuration for these model-guided approaches. |
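The experiment-setup row above lists the concrete hyperparameters reported in the paper. As an illustration, the sketch below collects those values as constants and implements the stated early-stopping rule (halt when the training loss has not decreased for 10 epochs). All function and variable names here are hypothetical; the authors' actual implementation is in their released repository.

```python
# Hyperparameters as reported in the paper's experiment setup.
NUM_ROUNDS = 10        # rounds of black-box queries
BATCH_SIZE = 100       # sequences proposed per round
CONTEXT_WINDOW = 21    # Lc = 21, i.e. radius = 10
LEARNING_RATE = 1e-3   # Adam learning rate for the fitness predictor
ENSEMBLE_SIZE = 3      # default: ensemble of three CNN models

def train_with_early_stopping(epoch_losses, patience=10):
    """Illustrative early-stopping rule: return the number of epochs
    actually trained, stopping once the training loss has failed to
    improve for `patience` consecutive epochs."""
    best = float("inf")
    epochs_without_improvement = 0
    epochs_run = 0
    for epochs_run, loss in enumerate(epoch_losses, start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return epochs_run
```

For example, a loss curve that improves for three epochs and then plateaus would stop after exactly ten stale epochs, i.e. 13 epochs in total.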