HotProtein: A Novel Framework for Protein Thermostability Prediction and Editing
Authors: Tianlong Chen, Chengyue Gong, Daniel Jesus Diaz, Xuxi Chen, Jordan Tyler Wells, Qiang Liu, Zhangyang Wang, Andrew Ellington, Alex Dimakis, Adam Klivans
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive empirical studies demonstrate that our framework improves thermostability prediction compared to other deep learning models. Finally, we introduce a novel editing algorithm to efficiently generate positive amino acid mutations that improve thermostability. Codes are available in https://github.com/VITA-Group/HotProtein. |
| Researcher Affiliation | Academia | Tianlong Chen*, Chengyue Gong*, Daniel Jesus Diaz, Xuxi Chen, Jordan Tyler Wells, Qiang Liu, Zhangyang Wang, Andrew Ellington, Alex Dimakis, Adam Klivans; The University of Texas at Austin; {tianlong.chen,cygong17,danny.diaz,xxchen,jordantwells}@utexas.edu; {lqiang,atlaswang,andy.ellington,dimakis,klivans}@utexas.edu |
| Pseudocode | No | The paper describes algorithms and methods in textual form and through diagrams, but it does not include formal pseudocode blocks or algorithm listings. |
| Open Source Code | Yes | Codes are available in https://github.com/VITA-Group/HotProtein. |
| Open Datasets | Yes | We collect and present a large-scale protein dataset, i.e., HotProtein, with organism-level temperature annotations. [...] It consists of 182K protein sequences and 3K folded structures from 230 different species, covering a broad temperature range of −20 °C to 120 °C. [...] We set up our benchmark using FireProtDB (Stourac et al., 2021), which is a superset of published experimental records and is the most up-to-date dataset to our knowledge. |
| Dataset Splits | Yes | For HP-S2C2 and HP-S2C5, 10-fold evaluation is conducted; while on HP-S and HP-SC2, we run three replicates with different random seeds. |
| Hardware Specification | Yes | Experiments use Tesla V100-SXM2-32GB GPUs as computing resources. Each experiment can be run with a single V100 GPU. |
| Software Dependencies | No | The paper mentions software like AlphaFold V2, ESM-1B, TAPE, and the AdamW optimizer, but it does not provide specific version numbers for these or other software libraries/dependencies. |
| Experiment Setup | Yes | Training Details. Baselines. 3D-GCN (Gligorijević et al., 2021) is trained for 20 epochs, with an initial learning rate of 1 × 10⁻⁴ that decays by 0.1 at the 10th epoch. For TAPE (Rao et al., 2019), we train it for 4 epochs, with an initial learning rate of 1 × 10⁻⁴ and a linear decay schedule. As for ESM-1B, we follow (Rives et al., 2021) and only train a linear classification head on top of the ESM-1B backbone. The head tuning consists of 4 epochs with an initial learning rate of 2 × 10⁻² and a OneCycle (Smith & Topin, 2019) decay scheduler. A training batch size of 4 is used across all experiments. [...] FST. For our FST, we choose an initial learning rate of 1 × 10⁻² for the linear classification head, and an initial learning rate of 1 × 10⁻³ for training the low-rank and sparse components in ESM-1B. [...] As for the hyperparameters of rank r and the number of non-zero elements |Ω| in FST, we perform screenings on r ∈ {4, 8, 16} and |Ω| ∈ {16, 32, 64, 128}, where we choose (r, |Ω|) = (4, 64) on HP-S2C2/C5 and (r, |Ω|) = (8, 64) on HP-S and HP-SC2. Meanwhile, we adopt a one-step gradient ascent with a step size of 1 × 10⁻⁵ to generate worst-case feature augmentations, and apply them to the last two layers of ESM-1B, as suggested in Chen et al. (2021d). |
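
To make the FST hyperparameters in the experiment-setup row concrete, the following is a minimal PyTorch sketch of a low-rank-plus-sparse re-parameterization of a frozen linear layer, in the spirit of the paper's factorized sparse tuning. The class and variable names (`FSTLinear`, `sparse_idx`, the fixed random sparse support, and the ESM-1B hidden size of 1280) are illustrative assumptions for this sketch, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FSTLinear(nn.Module):
    """Illustrative low-rank + sparse delta on top of a frozen linear layer.

    Effective weight: W (frozen) + U @ V (rank-r update) + S (sparse update
    with |Omega| trainable non-zero entries). Defaults mirror the
    (r, |Omega|) = (4, 64) setting screened in the paper; the fixed random
    support for S is a simplification for illustration.
    """

    def __init__(self, base: nn.Linear, rank: int = 4, num_sparse: int = 64):
        super().__init__()
        self.base = base
        for p in self.base.parameters():          # backbone weights stay frozen
            p.requires_grad_(False)

        out_f, in_f = base.weight.shape
        self.U = nn.Parameter(torch.zeros(out_f, rank))          # low-rank factor (zero init)
        self.V = nn.Parameter(torch.randn(rank, in_f) * 1e-2)    # low-rank factor
        idx = torch.randperm(out_f * in_f)[:num_sparse]          # random support Omega
        self.register_buffer("sparse_idx", idx)
        self.sparse_val = nn.Parameter(torch.zeros(num_sparse))  # trainable sparse values

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out_f, in_f = self.base.weight.shape
        S = torch.zeros(out_f * in_f, device=x.device, dtype=x.dtype)
        S[self.sparse_idx] = self.sparse_val                     # scatter sparse values
        delta = self.U @ self.V + S.view(out_f, in_f)
        return F.linear(x, self.base.weight + delta, self.base.bias)


# Usage sketch: two parameter groups mirroring the reported learning rates
# (1e-2 for the classification head, 1e-3 for the low-rank/sparse components).
head = nn.Linear(1280, 2)                         # 1280 = ESM-1B hidden size (assumed here)
fst = FSTLinear(nn.Linear(1280, 1280), rank=4, num_sparse=64)
optimizer = torch.optim.AdamW([
    {"params": head.parameters(), "lr": 1e-2},
    {"params": [fst.U, fst.V, fst.sparse_val], "lr": 1e-3},
])
```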
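
The setup also mentions a one-step gradient ascent (step size 1 × 10⁻⁵) that produces worst-case feature augmentations for the last two ESM-1B layers. The sketch below illustrates that single ascent step; the function name, signature, and the assumption that `loss_fn` maps perturbed features to a scalar training loss are simplifications for illustration, not the paper's exact procedure.

```python
import torch


def worst_case_feature_augmentation(features, loss_fn, step_size=1e-5):
    """One-step gradient ascent on intermediate features (illustrative sketch).

    Moves the features in the direction that increases the loss, then detaches
    the result so it can be fed back into the remaining layers as an augmented
    input.
    """
    features = features.detach().requires_grad_(True)
    loss = loss_fn(features)                       # scalar loss computed from these features
    (grad,) = torch.autograd.grad(loss, features)  # ascent direction
    return (features + step_size * grad).detach()
```

Per the quoted setup, such perturbations would be applied only to the features of the last two ESM-1B layers, with a single ascent step.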