Proximal Exploration for Model-guided Protein Sequence Design

Authors: Zhizhou Ren, Jiahan Li, Fan Ding, Yuan Zhou, Jianzhu Ma, Jian Peng

ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | In experiments, we extensively evaluate our method on a suite of in-silico protein sequence design tasks and demonstrate substantial improvement over baseline algorithms. |
| Researcher Affiliation | Collaboration | 1. HeliXon Limited; 2. Department of Computer Science, University of Illinois at Urbana-Champaign; 3. Institute for Artificial Intelligence, Peking University; 4. Yau Mathematical Sciences Center, Tsinghua University; 5. Institute for AI Industry Research, Tsinghua University |
| Pseudocode | Yes | Algorithm 1: Proximal Exploration (PEX). A hedged sketch of the loop follows the table. |
| Open Source Code | Yes | The source code of our algorithm implementation and oracle landscape simulation models are available at https://github.com/HeliXonProtein/proximal-exploration |
| Open Datasets | Yes | We collect several large-scale datasets from previous experimental studies of protein landscapes and use TAPE (Rao et al., 2019) as an oracle model to simulate the landscape. |
| Dataset Splits | No | The paper describes an online-learning, batch-optimization setting in which the model is iteratively refined, rather than a static validation split for hyperparameter tuning or early stopping. |
| Hardware Specification | No | The paper does not specify the hardware used for the experiments (e.g., CPU or GPU models, memory). |
| Software Dependencies | No | The paper mentions tools such as TAPE and ESM-1b but does not give version numbers for them or for other software dependencies (e.g., Python, PyTorch, TensorFlow). |
| Experiment Setup | Yes | The exploration procedure contains 10 rounds of black-box queries. Each batch contains 100 sequences. The context window size is Lc = 21 (i.e., radius = 10). We use the Adam optimizer with a learning rate of 10⁻³ to train the fitness prediction model. The loss function is the mean squared error. We stop network training when the training loss does not decrease in 10 epochs. We consider an ensemble of three CNN models as the default configuration for these model-guided approaches. A sketch of this training configuration follows the table. |