One Transformer Can Understand Both 2D & 3D Molecular Data

Authors: Shengjie Luo, Tianlang Chen, Yixian Xu, Shuxin Zheng, Tie-Yan Liu, Liwei Wang, Di He

ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conducted extensive experiments for Transformer-M. All empirical results show that Transformer-M can simultaneously achieve strong performance on 2D and 3D tasks, suggesting its broad applicability.
Researcher Affiliation | Collaboration | Shengjie Luo (1), Tianlang Chen (2,5), Yixian Xu (2), Shuxin Zheng (3), Tie-Yan Liu (3), Liwei Wang (1,4), Di He (1). Affiliations: (1) National Key Laboratory of General Artificial Intelligence, School of Intelligence Science and Technology, Peking University; (2) School of EECS, Peking University; (3) Microsoft Research; (4) Center for Data Science, Peking University; (5) Shanghai Artificial Intelligence Laboratory.
Pseudocode | No | No explicit pseudocode or algorithm blocks are provided in the paper.
Open Source Code | Yes | The code and models will be made publicly available at https://github.com/lsj2408/Transformer-M.
Open Datasets | Yes | We use the PCQM4Mv2 dataset in the OGB Large-Scale Challenge (OGB-LSC) (Hu et al., 2021) to train our Transformer-M, which consists of 3.4 million molecules of both 2D and 3D forms. We further fine-tune our model on the PDBBind dataset (version 2016, Wang et al. (2004; 2005b)). We use the QM9 dataset (Ramakrishnan et al., 2014) to evaluate our Transformer-M on molecular tasks in the 3D data format. (A dataset-loading sketch appears after this table.)
Dataset Splits | Yes | We use the PCQM4Mv2 dataset in the OGB Large-Scale Challenge (OGB-LSC) (Hu et al., 2021) to train our Transformer-M, which consists of 3.4 million molecules of both 2D and 3D forms. First, we show that on the validation set of the PCQM4Mv2 task, which only contains 2D molecular graphs, our Transformer-M surpasses all previous works by a large margin. Following Thölke & De Fabritiis (2021), we randomly choose 10,000 and 10,831 molecules for validation and test evaluation, respectively. The remaining molecules are used to fine-tune our Transformer-M model. (A split sketch appears after this table.)
Hardware Specification | Yes | The paper reports different hardware for different experiments: 4 NVIDIA Tesla A100 GPUs, 2 NVIDIA Tesla V100 GPUs, and 1 NVIDIA A100 GPU ("All models are trained on ..." in each setting).
Software Dependencies | No | The paper mentions PyTorch (Paszke et al., 2019) and AdamW (Kingma & Ba, 2014), but it does not specify version numbers for these or any other key software libraries, which are needed for reproducibility.
Experiment Setup | Yes | Our Transformer-M model consists of 12 layers. The dimension of hidden layers and feed-forward layers is set to 768. The number of attention heads is set to 32. The number of Gaussian Basis kernels is set to 128. We use AdamW (Kingma & Ba, 2014) as the optimizer and set its hyperparameter ϵ to 1e-8 and (β1, β2) to (0.9, 0.999). The gradient clip norm is set to 5.0. The peak learning rate is set to 2e-4. The batch size is set to 1024. The model is trained for 1.5 million steps with a 90k-step warm-up stage. After the warm-up stage, the learning rate decays linearly to zero. The dropout ratios for the input embeddings, attention matrices, and hidden representations are set to 0.0, 0.1, and 0.0, respectively. The weight decay is set to 0.0. We also employ stochastic depth (Huang et al., 2016) and set the probability to 0.2. The probability (p2D, p3D, p2D&3D) of each data instance entering the three modes mentioned in Section 4.1 is set to (0.2, 0.5, 0.3). The scaling factor σ of added noise in the 3D Position Denoising task is set to 0.2. The ratio of the supervised loss to the denoising loss is set to 1:1. (An optimizer and schedule sketch appears after this table.)
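
For the Open Datasets row, the sketch below shows one way to obtain PCQM4Mv2 through the official ogb package. This is not taken from the paper or the authors' repository; the class name, the only_smiles flag, and the split keys follow the OGB-LSC documentation and should be verified against the installed ogb version. The DFT-computed 3D geometries for the training molecules are distributed by OGB as a separate SDF file, which the authors' preprocessing presumably merges with the 2D graphs.

```python
# Minimal sketch: download PCQM4Mv2 via OGB-LSC and read its official splits.
# Not from the paper; verify class/argument names against your ogb version.
from ogb.lsc import PCQM4Mv2Dataset

dataset = PCQM4Mv2Dataset(root="dataset/", only_smiles=True)  # SMILES strings + HOMO-LUMO gap labels
split = dataset.get_idx_split()  # expected keys: 'train', 'valid', 'test-dev', 'test-challenge'

smiles, gap = dataset[split["train"][0]]
print(len(dataset), smiles, gap)
```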
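For the Dataset Splits row, the quoted QM9 protocol (10,000 validation molecules, 10,831 test molecules, the remainder for fine-tuning) can be reproduced with a random permutation. The snippet is a sketch under assumed details: the random seed, the use of NumPy, and the placeholder dataset size are illustrative, not taken from the paper.

```python
# Sketch of the QM9-style random split quoted above: 10,000 validation and
# 10,831 test molecules, the rest used for fine-tuning. Seed is illustrative.
import numpy as np

def qm9_random_split(num_molecules: int, seed: int = 0):
    rng = np.random.default_rng(seed)
    perm = rng.permutation(num_molecules)
    valid_idx = perm[:10_000]
    test_idx = perm[10_000:10_000 + 10_831]
    train_idx = perm[10_000 + 10_831:]
    return train_idx, valid_idx, test_idx

# Example with a placeholder size; use len(qm9_dataset) in practice.
train_idx, valid_idx, test_idx = qm9_random_split(num_molecules=130_000)
print(len(train_idx), len(valid_idx), len(test_idx))
```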
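For the Experiment Setup row, the quoted hyperparameters map onto a standard PyTorch optimizer and learning-rate schedule. The snippet below is a sketch, not the authors' training script: the config keys, the placeholder module, and the LambdaLR-based linear warm-up/decay are assumptions consistent with the quoted values.

```python
# Sketch of the pre-training optimization described above (not the authors'
# script). Placeholder model; Transformer-M itself lives in the linked repo.
import torch

config = {
    "num_layers": 12, "hidden_dim": 768, "ffn_dim": 768, "num_heads": 32,
    "num_gaussian_kernels": 128, "peak_lr": 2e-4, "batch_size": 1024,
    "total_steps": 1_500_000, "warmup_steps": 90_000, "grad_clip_norm": 5.0,
    "weight_decay": 0.0, "attn_dropout": 0.1, "stochastic_depth": 0.2,
    "mode_probs": (0.2, 0.5, 0.3),       # (p_2D, p_3D, p_2D&3D)
    "denoise_noise_scale": 0.2,          # sigma for 3D position denoising
    "loss_weights": (1.0, 1.0),          # supervised : denoising = 1:1
}

model = torch.nn.Linear(config["hidden_dim"], 1)  # stand-in for Transformer-M

optimizer = torch.optim.AdamW(
    model.parameters(), lr=config["peak_lr"],
    betas=(0.9, 0.999), eps=1e-8, weight_decay=config["weight_decay"],
)

def lr_lambda(step: int) -> float:
    # Linear warm-up to the peak LR over 90k steps, then linear decay to zero.
    if step < config["warmup_steps"]:
        return step / max(1, config["warmup_steps"])
    remaining = config["total_steps"] - step
    return max(0.0, remaining / (config["total_steps"] - config["warmup_steps"]))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# Inside the training loop, gradients would be clipped before optimizer.step():
# torch.nn.utils.clip_grad_norm_(model.parameters(), config["grad_clip_norm"])
```

Calling scheduler.step() after every optimizer step reproduces the per-step linear warm-up and decay described in the quote.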