HotProtein: A Novel Framework for Protein Thermostability Prediction and Editing

Authors: Tianlong Chen, Chengyue Gong, Daniel Jesus Diaz, Xuxi Chen, Jordan Tyler Wells, Qiang Liu, Zhangyang Wang, Andrew Ellington, Alex Dimakis, Adam Klivans

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive empirical studies demonstrate that our framework improves thermostability prediction compared to other deep learning models. Finally, we introduce a novel editing algorithm to efficiently generate positive amino acid mutations that improve thermostability. Codes are available in https://github.com/VITA-Group/HotProtein.
Researcher Affiliation | Academia | Tianlong Chen*, Chengyue Gong*, Daniel Jesus Diaz, Xuxi Chen, Jordan Tyler Wells, Qiang Liu, Zhangyang Wang, Andrew Ellington, Alex Dimakis, Adam Klivans; The University of Texas at Austin; {tianlong.chen,cygong17,danny.diaz,xxchen,jordantwells}@utexas.edu, {lqiang,atlaswang,andy.ellington,dimakis,klivans}@utexas.edu
Pseudocode | No | The paper describes algorithms and methods in textual form and through diagrams, but it does not include formal pseudocode blocks or algorithm listings.
Open Source Code | Yes | Codes are available in https://github.com/VITA-Group/HotProtein.
Open Datasets | Yes | We collect and present a large-scale protein dataset, i.e., HotProtein, with organism-level temperature annotations. [...] It consists of 182K protein sequences and 3K folded structures from 230 different species, covering a broad temperature range of 20 °C to 120 °C. [...] We set up our benchmark using FireProtDB (Stourac et al., 2021), which is a superset of published experimental records and is the most up-to-date dataset to our knowledge.
Dataset Splits | Yes | For HP-S2C2 and HP-S2C5, 10-fold evaluation is conducted, while on HP-S and HP-SC2 we run three replicates with different random seeds (see the evaluation sketch after this table).
Hardware Specification | Yes | Experiments use Tesla V100-SXM2-32GB GPUs as computing resources. Each experiment can be run with a single V100 GPU.
Software Dependencies | No | The paper mentions software such as AlphaFold2, ESM-1B, TAPE, and the AdamW optimizer, but it does not provide specific version numbers for these or other software libraries/dependencies.
Experiment Setup | Yes | Training Details. Baselines. 3D-GCN (Gligorijević et al., 2021) is trained for 20 epochs with an initial learning rate of 1 × 10⁻⁴ that decays by 0.1 at the 10th epoch. For TAPE (Rao et al., 2019), we train for 4 epochs with an initial learning rate of 1 × 10⁻⁴ and a linear decay schedule. As for ESM-1B, we follow Rives et al. (2021) and only train a linear classification head on top of the ESM-1B backbone. The head tuning consists of 4 epochs with an initial learning rate of 2 × 10⁻² and a OneCycle (Smith & Topin, 2019) decay scheduler. A training batch size of 4 is used across all experiments. [...] FST. For our FST, we choose an initial learning rate of 1 × 10⁻² for the linear classification head and an initial learning rate of 1 × 10⁻³ for training the low-rank and sparse components in ESM-1B. [...] As for the hyperparameters of rank r and the number of non-zero elements |Ω| in FST, we perform screenings over r ∈ {4, 8, 16} and |Ω| ∈ {16, 32, 64, 128}, and choose (r, |Ω|) = (4, 64) on HP-S2C2/C5 and (r, |Ω|) = (8, 64) on HP-S and HP-SC2. Meanwhile, we adopt one-step gradient ascent with a step size of 1 × 10⁻⁵ to generate worst-case feature augmentations, applied to the last two layers of ESM-1B, as suggested in Chen et al. (2021d) (see the tuning sketch after this table).
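
The two evaluation protocols quoted in the Dataset Splits row (10-fold cross-validation on HP-S2C2/HP-S2C5, three random-seed replicates on HP-S/HP-SC2) can be reproduced with standard tooling. The sketch below is a minimal illustration, assuming sequences (or precomputed features) and labels live in Python lists/arrays and that train_and_score is a user-supplied routine that trains a model and returns a test metric; these names are illustrative, not taken from the released code.

    import numpy as np
    from sklearn.model_selection import KFold

    def evaluate_10_fold(features, labels, train_and_score, seed=0):
        """10-fold cross-validation, as reported for HP-S2C2 and HP-S2C5."""
        labels = np.asarray(labels)
        kf = KFold(n_splits=10, shuffle=True, random_state=seed)
        scores = []
        for train_idx, test_idx in kf.split(features):
            # train_and_score is assumed to train on the fold's training split
            # and return a scalar metric on the held-out split.
            scores.append(train_and_score(
                [features[i] for i in train_idx], labels[train_idx],
                [features[i] for i in test_idx], labels[test_idx],
            ))
        return float(np.mean(scores)), float(np.std(scores))

    def evaluate_three_seeds(train_and_score_fn, seeds=(0, 1, 2)):
        """Three replicates with different random seeds, as reported for HP-S and HP-SC2."""
        scores = [train_and_score_fn(seed=s) for s in seeds]
        return float(np.mean(scores)), float(np.std(scores))

Reporting the mean and standard deviation over folds or seeds matches the usual convention for this kind of replicate-based evaluation.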
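The FST setup quoted in the Experiment Setup row (a frozen ESM-1B weight tuned through a rank-r low-rank update plus a sparse update with |Ω| trainable entries, with separate learning rates for the classification head and the delta components) can be illustrated with the PyTorch sketch below. This is a hedged reconstruction under those assumptions, not the authors' implementation; LowRankSparseDelta, the random choice of sparse support, and the optimizer grouping are illustrative.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class LowRankSparseDelta(nn.Module):
        """Wraps a frozen linear weight W and learns Delta = U @ V^T + S,
        where U, V give a rank-r update and S has |Omega| trainable entries
        on a fixed support (randomly chosen here for illustration)."""

        def __init__(self, weight: torch.Tensor, rank: int = 8, n_sparse: int = 64):
            super().__init__()
            out_dim, in_dim = weight.shape
            self.register_buffer("weight", weight.detach().clone())   # frozen pretrained W
            self.U = nn.Parameter(torch.zeros(out_dim, rank))          # low-rank factor
            self.V = nn.Parameter(torch.randn(in_dim, rank) * 1e-2)    # low-rank factor
            support = torch.randperm(out_dim * in_dim)[:n_sparse]      # fixed support Omega
            self.register_buffer("support", support)
            self.sparse_val = nn.Parameter(torch.zeros(n_sparse))      # trainable sparse values

        def effective_weight(self) -> torch.Tensor:
            delta = (self.U @ self.V.t()).flatten()
            delta = delta.index_add(0, self.support, self.sparse_val)  # add sparse entries
            return self.weight + delta.view_as(self.weight)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return F.linear(x, self.effective_weight())

    # Two parameter groups matching the quoted learning rates:
    # 1e-2 for the linear classification head, 1e-3 for the low-rank/sparse deltas.
    # head = nn.Linear(1280, num_classes)
    # deltas = [m for m in model.modules() if isinstance(m, LowRankSparseDelta)]
    # optimizer = torch.optim.AdamW([
    #     {"params": head.parameters(), "lr": 1e-2},
    #     {"params": (p for m in deltas for p in m.parameters()), "lr": 1e-3},
    # ])

Only the low-rank factors, the sparse values, and the head receive gradients; the pretrained weight stays fixed, which is what keeps the number of tunable parameters small for the screened settings (r, |Ω|) reported above.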