ESM All-Atom: Multi-Scale Protein Language Model for Unified Molecular Modeling
Authors: Kangjie Zheng, Siyu Long, Tianyu Lu, Junwei Yang, Xinyu Dai, Ming Zhang, Zaiqing Nie, Wei-Ying Ma, Hao Zhou
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results indicate that ESM-AA surpasses previous methods in protein-molecule tasks, demonstrating that protein language models can be fully exploited for unified molecular modeling. We fine-tune and evaluate ESM-AA across diverse benchmarks and verify the contribution of each component through ablation experiments. |
| Researcher Affiliation | Collaboration | (1) School of Computer Science, National Key Laboratory for Multimedia Information Processing, Peking University-Anker Embodied AI Lab, Peking University, Beijing 100871, China; (2) School of Artificial Intelligence, National Key Laboratory for Novel Software Technology, Nanjing University; (3) Department of Computer Science, Tsinghua University; (4) Institute for AI Industry Research (AIR), Tsinghua University (this work was done during the internship of Kangjie, Siyu, Tianyu, and Junwei at AIR); (5) PharMolix Inc. |
| Pseudocode | No | The paper describes its methods using natural language and mathematical equations, but it does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The source codes of ESM-AA are publicly released at https://github.com/zhengkangjie/ESM-AA. |
| Open Datasets | Yes | For the protein dataset, we use the AlphaFold DB (Varadi et al., 2022) dataset, which contains 8M protein sequences and structures predicted by AlphaFold2 (Jumper et al., 2021)... For the molecule dataset, we use the dataset provided by Zhou et al. (2023), which contains 19M molecules and 209M conformations generated by ETKDG (Riniker & Landrum, 2015) and the Merck Molecular Force Field (Halgren, 1996). (A conformer-generation sketch in this spirit appears after the table.) |
| Dataset Splits | Yes | We use the standard data split provided by ProSmith in fine-tuning. Specifically, for secondary structure prediction, we use data from Klausen et al. (2019) as training and validation sets and use CB513 (Cuff & Barton, 1999), CASP12 (Moult et al., 2018), and TS115 (Yang et al., 2018) as test sets. The final training, validation, and three test sets have 8678, 2170, 513, 21, and 115 protein sequences, respectively. |
| Hardware Specification | Yes | We train ESM-AA on 16 NVIDIA A100 GPU cards for 3 days. |
| Software Dependencies | No | The paper mentions using specific optimizers like Adam and AdamW, but it does not specify version numbers for any software components, libraries, or frameworks (e.g., PyTorch, TensorFlow, CUDA). |
| Experiment Setup | Yes | We implement ESM-AA using 12 stacked Transformer layers, each with 20 attention heads. The model dimension and feedforward dimension of each Transformer layer are 480 and 1920, respectively. We use Adam (Kingma & Ba, 2014) with a polynomial learning-rate scheduler to train ESM-AA, setting the learning rate to 4e-4, the weight decay to 1e-2, and the warmup steps to 5000. The total number of training steps is 300K, and each batch contains at most 256K tokens. (See the configuration sketch after the table.) |
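For reference on the conformer-generation recipe cited in the Open Datasets row, here is a minimal sketch, assuming RDKit, of ETKDG embedding followed by MMFF refinement. The function name `generate_conformers`, the number of conformers, and the example SMILES are illustrative choices, not details taken from Zhou et al. (2023).

```python
# Minimal sketch: generate molecule conformations with ETKDG + MMFF (assumes RDKit).
# This illustrates the general recipe, not the exact pipeline of Zhou et al. (2023).
from rdkit import Chem
from rdkit.Chem import AllChem


def generate_conformers(smiles: str, n_confs: int = 11, seed: int = 42):
    mol = Chem.AddHs(Chem.MolFromSmiles(smiles))   # explicit hydrogens for 3D embedding
    params = AllChem.ETKDGv3()                     # ETKDG distance-geometry embedding
    params.randomSeed = seed
    conf_ids = AllChem.EmbedMultipleConfs(mol, numConfs=n_confs, params=params)
    for cid in conf_ids:
        AllChem.MMFFOptimizeMolecule(mol, confId=cid)  # Merck Molecular Force Field refinement
    return mol


mol = generate_conformers("CC(=O)Oc1ccccc1C(=O)O")     # hypothetical example: aspirin
print(mol.GetNumConformers())
```

The default of 11 conformers is only suggestive (209M conformations over 19M molecules averages roughly 11 per molecule); the paper does not state the exact per-molecule count.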
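The Experiment Setup row fully specifies the encoder shape and optimization hyperparameters. The sketch below, assuming PyTorch, instantiates a stand-in encoder with the reported dimensions and wires up Adam with a warmup-then-polynomial-decay schedule; the plain `nn.TransformerEncoder` and the `poly_warmup` helper are placeholders for illustration, not the authors' implementation.

```python
# Minimal sketch of the reported ESM-AA training configuration (assumes PyTorch).
import torch
import torch.nn as nn

D_MODEL, FFN_DIM, N_LAYERS, N_HEADS = 480, 1920, 12, 20
LR, WEIGHT_DECAY, WARMUP, TOTAL_STEPS = 4e-4, 1e-2, 5_000, 300_000

# Stand-in encoder matching the reported shape: 12 layers, 20 heads, 480/1920 dims.
encoder_layer = nn.TransformerEncoderLayer(
    d_model=D_MODEL, nhead=N_HEADS, dim_feedforward=FFN_DIM, batch_first=True
)
model = nn.TransformerEncoder(encoder_layer, num_layers=N_LAYERS)

# Adam with the reported learning rate and weight decay.
optimizer = torch.optim.Adam(model.parameters(), lr=LR, weight_decay=WEIGHT_DECAY)


def poly_warmup(step: int, power: float = 1.0) -> float:
    """Linear warmup to the peak LR, then polynomial decay to zero (power=1 is linear)."""
    if step < WARMUP:
        return step / max(1, WARMUP)
    progress = (step - WARMUP) / max(1, TOTAL_STEPS - WARMUP)
    return max(0.0, (1.0 - progress) ** power)


scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=poly_warmup)
# Training loop would call optimizer.step() followed by scheduler.step() once per step.
```

Note that batching is specified in tokens (up to 256K per batch), which in practice implies a dynamic, length-aware batch sampler rather than a fixed number of sequences per batch.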