Proximal Exploration for Model-guided Protein Sequence Design

Authors: Zhizhou Ren, Jiahan Li, Fan Ding, Yuan Zhou, Jianzhu Ma, Jian Peng

ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | In experiments, we extensively evaluate our method on a suite of in-silico protein sequence design tasks and demonstrate substantial improvement over baseline algorithms. |
| Researcher Affiliation | Collaboration | 1. HeliXon Limited; 2. Department of Computer Science, University of Illinois at Urbana-Champaign; 3. Institute for Artificial Intelligence, Peking University; 4. Yau Mathematical Sciences Center, Tsinghua University; 5. Institute for AI Industry Research, Tsinghua University |
| Pseudocode | Yes | Algorithm 1: Proximal Exploration (PEX). A hedged sketch of the loop follows the table. |
| Open Source Code | Yes | The source code of our algorithm implementation and oracle landscape simulation models are available at https://github.com/HeliXonProtein/proximal-exploration |
| Open Datasets | Yes | We collect several large-scale datasets from previous experimental studies of protein landscapes and use TAPE (Rao et al., 2019) as an oracle model to simulate the landscape. |
| Dataset Splits | No | The paper describes an online-learning, batch-optimization setting in which the model is iteratively refined, rather than a static validation split for hyperparameter tuning or early stopping. |
| Hardware Specification | No | The paper does not specify the hardware used for the experiments (e.g., CPU or GPU models, memory). |
| Software Dependencies | No | The paper mentions tools such as TAPE and ESM-1b but does not give version numbers for them or for other software dependencies (e.g., Python, PyTorch, TensorFlow). |
| Experiment Setup | Yes | The exploration procedure contains 10 rounds of black-box queries. Each batch contains 100 sequences. The context window size is Lc = 21 (i.e., radius = 10). We use the Adam optimizer with a learning rate of 10⁻³ to train the fitness prediction model. The loss function is the mean squared error. We stop network training when the training loss does not decrease in 10 epochs. We consider an ensemble of three CNN models as the default configuration for these model-guided approaches. A sketch of this training configuration follows the table. |