Mol-AE: Auto-Encoder Based Molecular Representation Learning With 3D Cloze Test Objective

Authors: Junwei Yang, Kangjie Zheng, Siyu Long, Zaiqing Nie, Ming Zhang, Xinyu Dai, Wei-Ying Ma, Hao Zhou

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Empirical results demonstrate that MOL-AE achieves a large margin performance gain compared to the current state-of-the-art 3D molecular modeling approach."; "Extensive experimental results demonstrate that MOL-AE consistently outperforms various molecular representation learning methods across a diverse set of molecular understanding tasks."; "Extensive experimental results demonstrate that MOL-AE achieves state-of-the-art performance on a standard molecular benchmark, including various molecular classification and molecular regression tasks."
Researcher Affiliation | Collaboration | (1) School of Computer Science, National Key Laboratory for Multimedia Information Processing, Peking University-Anker Embodied AI Lab, Peking University; (2) School of Artificial Intelligence, National Key Laboratory for Novel Software Technology, Nanjing University; (3) Institute for AI Industry Research (AIR), Tsinghua University; (4) PharMolix Inc. This work was done during the internship of Junwei, Kangjie and Siyu at AIR.
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | The source codes of MOL-AE are publicly available at https://github.com/yjwtheonly/Mol AE.
Open Datasets | Yes | For pre-training, we use the large-scale molecular dataset provided by Zhou et al. (2023), which contains 19M molecules and 209M conformations generated by ETKDG (Riniker & Landrum, 2015) and the Merck Molecular Force Field (Halgren, 1996). For fine-tuning, we adopt the most widely used benchmark, MoleculeNet (Wu et al., 2018), including 9 classification datasets and 6 regression datasets; the data split is the same as Zhou et al. (2023) (cf. Appendix D for more details).
Dataset Splits | Yes | For fine-tuning, we adopt the most widely used benchmark, MoleculeNet (Wu et al., 2018), including 9 classification datasets and 6 regression datasets; the data split is the same as Zhou et al. (2023) (cf. Appendix D for more details). Table 6 (Appendix D) lists the train/valid/test molecule counts for each dataset, e.g. QM7: 5,464/685/681 (a small sketch of these split sizes appears after the table).
Hardware Specification | Yes | We train MOL-AE on a single NVIDIA A100 GPU for about 2 days.
Software Dependencies | No | The paper mentions using the Adam optimizer and the GELU activation function, but it does not provide version numbers for any key software components or libraries (e.g., Python, PyTorch, TensorFlow, scikit-learn).
Experiment Setup | Yes | We implement MOL-AE using 15 stacked Transformer layers in the encoder and 5 stacked Transformer layers in the decoder, each with 64 attention heads. The model dimension and feedforward dimension of each Transformer layer are 512 and 2048, respectively. For pre-training, we set the drop ratio to 0.15 in the drop module D. We use Adam (Kingma & Ba, 2014) and a polynomial learning rate scheduler to train MOL-AE, with learning rate 1e-4, weight decay 1e-4, and 10K warmup steps. The total number of training steps is 1M and each batch has at most 128 samples. For more pre-training hyper-parameters, please refer to Table 7. Different downstream tasks use different hyper-parameters; for detailed fine-tuning hyper-parameters, please refer to Table 8 (a hedged configuration sketch based on the numbers in this row appears below).
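
The split sizes referenced in the Dataset Splits row can be captured in a few lines. The following is a minimal sketch, not taken from the paper's code; it records only the QM7 numbers quoted above (5,464/685/681) and checks that they correspond to an approximately 80/10/10 split. Entries for the other MoleculeNet datasets would be filled in from Appendix D of the paper.

```python
# Reported split sizes (train/valid/test molecule counts); only QM7 is quoted
# in the report above, the rest would come from Table 6 in Appendix D.
splits = {"QM7": {"train": 5_464, "valid": 685, "test": 681}}

for name, s in splits.items():
    total = sum(s.values())
    fractions = {k: round(v / total, 3) for k, v in s.items()}
    print(f"{name}: total={total}, fractions={fractions}")
    # QM7: total=6830, fractions={'train': 0.8, 'valid': 0.1, 'test': 0.1}
```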
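
The Experiment Setup row fully specifies the backbone and optimizer sizes, so a configuration sketch is straightforward. The code below is a minimal illustration, not the authors' implementation: it builds 15-layer and 5-layer Transformer stacks with 64 heads, model dimension 512, feedforward dimension 2048, and GELU activations, and pairs them with Adam (lr 1e-4, weight decay 1e-4) and a polynomial schedule with 10K warmup steps over 1M total steps. Using `nn.TransformerEncoderLayer` for both stacks, the linear decay power, and the omission of MOL-AE's 3D inputs, drop module, and 3D Cloze Test objective are all assumptions of this sketch.

```python
import torch
import torch.nn as nn

# Sizes reported in the Experiment Setup row.
D_MODEL, N_HEADS, FFN_DIM = 512, 64, 2048
ENC_LAYERS, DEC_LAYERS = 15, 5
WARMUP, TOTAL_STEPS, PEAK_LR = 10_000, 1_000_000, 1e-4

def make_stack(num_layers: int) -> nn.TransformerEncoder:
    """Stack of standard Transformer layers matching the reported sizes."""
    layer = nn.TransformerEncoderLayer(
        d_model=D_MODEL, nhead=N_HEADS, dim_feedforward=FFN_DIM,
        activation="gelu", batch_first=True,
    )
    return nn.TransformerEncoder(layer, num_layers=num_layers)

encoder, decoder = make_stack(ENC_LAYERS), make_stack(DEC_LAYERS)

params = list(encoder.parameters()) + list(decoder.parameters())
optimizer = torch.optim.Adam(params, lr=PEAK_LR, weight_decay=1e-4)

def poly_schedule(step: int, power: float = 1.0) -> float:
    """Linear warmup to the peak LR, then polynomial decay toward zero."""
    if step < WARMUP:
        return step / WARMUP
    progress = (step - WARMUP) / (TOTAL_STEPS - WARMUP)
    return max(0.0, (1.0 - progress) ** power)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=poly_schedule)
# Training loop (omitted) would call optimizer.step() and scheduler.step()
# once per batch of up to 128 samples, for 1M steps in total.
```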