Exploring evolution-aware & -free protein language models as protein function predictors
Authors: Mingyang Hu, Fajie Yuan, Kevin Yang, Fusong Ju, Jin Su, Hui Wang, Fei Yang, Qiuyang Ding
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we investigate the representation ability of three popular PLMs: ESM-1b (single sequence) [35], MSA-Transformer (multiple sequence alignment) [30] and Evoformer (structural), with a special focus on Evoformer. Specifically, we aim to answer the following key questions: (i) Does the Evoformer trained as part of AlphaFold produce representations amenable to predicting protein function? (ii) If yes, can Evoformer replace ESM-1b and MSA-Transformer? (iii) How much do these PLMs rely on evolution-related protein data? In this regard, are they complementary to each other? We compare these models in an empirical study and present new insights and conclusions. All code and datasets for reproducibility are available at https://github.com/elttaes/Revisiting-PLMs. (A representation-extraction sketch follows the table.) |
| Researcher Affiliation | Collaboration | Mingyang Hu Westlake University humingyang@westlake.edu.cn Fajie Yuan Westlake University yuanfajie@westlake.edu.cn Kevin K. Yang Microsoft Research New England yang.kevin@microsoft.com Fusong Ju Microsoft Research Asia fusongju@microsoft.com Jin Su Westlake University sujin@westlake.edu.cn Hui Wang Westlake University wanghui@westlake.edu.cn Fei Yang Zhejiang Lab yangf@zhejianglab.com Qiuyang Ding Westlake University dingqiuyang@westlake.edu.cn |
| Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks. |
| Open Source Code | Yes | All code and datasets for reproducibility are available at https://github.com/elttaes/Revisiting-PLMs. |
| Open Datasets | Yes | For both contacts and secondary structure, we use the dataset in [35] which is constructed from SCOPe [15], and use the suggested split as the training and testing sets (see Table 2). One concern is that the proteins used here may have been seen by AlphaFold during training, as they come from the Protein Data Bank (PDB) [3]. Hence, we investigate 48 additional proteins, which were collected from CAMEO (Continuous Automated Model EvaluatiOn) in the hard category from 2021-08-28 to 2022-04-30. |
| Dataset Splits | Yes | Table 2 (dataset descriptions) gives the train/test splits per task: Secondary Structure & Contact Prediction (SCOPe): 11,680 train / 3,617 test; Metal Ion Binding (PDB): 6,000 / 1,332; Antibiotic Resistance (CARD): 2,072 / 1,344; Fluorescence (TAPE): 21,446 / 27,217; Stability (TAPE): 53,614 / 12,851. |
| Hardware Specification | No | Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] See appendix A.6. |
| Software Dependencies | No | The paper mentions the AdamW optimizer but does not name software packages with version numbers needed for reproducibility (e.g., Python, PyTorch, or TensorFlow versions). |
| Experiment Setup | Yes | We adopt the standard fine-tuning strategy by fine-tuning all parameters using the AdamW optimizer with a learning rate of 1e-5. (A minimal fine-tuning sketch follows the table.) |
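
To make concrete what extracting PLM representations for function prediction involves, below is a minimal sketch using the public fair-esm package to pull per-residue ESM-1b embeddings. The example sequence and the mean-pooling choice are illustrative assumptions, not the paper's exact pipeline.

```python
import torch
import esm  # pip install fair-esm

# Load the pre-trained ESM-1b model and its tokenizer ("alphabet").
model, alphabet = esm.pretrained.esm1b_t33_650M_UR50S()
batch_converter = alphabet.get_batch_converter()
model.eval()

# Hypothetical example sequence; replace with proteins from the task datasets.
data = [("protein1", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")]
_, _, batch_tokens = batch_converter(data)

with torch.no_grad():
    results = model(batch_tokens, repr_layers=[33])

# Per-residue representations from the final (33rd) layer.
# Position 0 is the BOS token and the last position is the EOS token, so they are dropped.
token_reps = results["representations"][33]
sequence_rep = token_reps[0, 1:-1].mean(dim=0)  # mean-pooled sequence embedding
```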
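
The experiment-setup row quotes full-parameter fine-tuning with AdamW at a learning rate of 1e-5. Below is a minimal PyTorch sketch of that setup; the toy encoder, classifier head, and dummy batch are placeholders standing in for the actual pre-trained PLMs and task datasets.

```python
import torch
import torch.nn as nn
from torch.optim import AdamW

# Placeholder encoder standing in for a pre-trained PLM (ESM-1b / MSA-Transformer / Evoformer);
# the real models come from their respective repositories.
class ToyEncoder(nn.Module):
    def __init__(self, vocab_size=33, dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)

    def forward(self, tokens):
        return self.layer(self.embed(tokens))  # (batch, length, dim)

class FunctionClassifier(nn.Module):
    def __init__(self, encoder, dim=128, num_classes=2):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Linear(dim, num_classes)

    def forward(self, tokens):
        reps = self.encoder(tokens)          # per-residue representations
        return self.head(reps.mean(dim=1))   # mean-pool over residues, then classify

model = FunctionClassifier(ToyEncoder())

# All parameters are fine-tuned (no freezing), AdamW, learning rate 1e-5,
# matching the setup quoted above.
optimizer = AdamW(model.parameters(), lr=1e-5)
criterion = nn.CrossEntropyLoss()

tokens = torch.randint(0, 33, (4, 128))  # dummy batch of tokenised sequences
labels = torch.randint(0, 2, (4,))       # dummy binary labels (e.g. metal ion binding)

optimizer.zero_grad()
loss = criterion(model(tokens), labels)
loss.backward()
optimizer.step()
```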