Rethinking Tokenizer and Decoder in Masked Graph Modeling for Molecules
Authors: Zhiyuan Liu, Yaorui Shi, An Zhang, Enzhi Zhang, Kenji Kawaguchi, Xiang Wang, Tat-Seng Chua
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically validate that our method outperforms the existing molecule self-supervised learning methods. Our codes and checkpoints are available at https://github.com/syr-cn/SimSGT. In this section, we perform experiments to assess the roles of tokenizer and decoder in MGM for molecules. Our experiments follow the transfer learning setting in [12, 9]. We pretrain MGM models on 2 million molecules from ZINC15 [42], and evaluate the pretrained models on eight classification datasets in MoleculeNet [28]: BBBP, Tox21, ToxCast, Sider, ClinTox, MUV, HIV, and Bace. |
| Researcher Affiliation | Academia | Zhiyuan Liu, Yaorui Shi, An Zhang, Enzhi Zhang, Kenji Kawaguchi, Xiang Wang, Tat-Seng Chua (National University of Singapore; University of Science and Technology of China; Hokkaido University) |
| Pseudocode | Yes | Algorithm 1: PyTorch-style pseudocode of SimSGT. (See the illustrative MGM sketch following this table.) |
| Open Source Code | Yes | Our codes and checkpoints are available at https://github.com/syr-cn/SimSGT. |
| Open Datasets | Yes | We pretrain MGM models on 2 million molecules from ZINC15 [42], and evaluate the pretrained models on eight classification datasets in MoleculeNet [28]: BBBP, Tox21, ToxCast, Sider, ClinTox, MUV, HIV, and Bace. Following the experimental setting in [45], we pretrain SimSGT on the 50 thousand molecule samples from the GEOM dataset [46], and we report the performance of predicting the quantum chemistry properties of molecules [47]. |
| Dataset Splits | Yes | These downstream datasets are divided into train/valid/test sets by scaffold split to provide an out-of-distribution evaluation setting. We tune the hyperparameters in the fine-tuning stage using the validation performance. (See the scaffold-split sketch following this table.) |
| Hardware Specification | Yes | We perform experiments on an NVIDIA DGX A100 server. |
| Software Dependencies | No | The paper mentions 'PyTorch style pseudocode' and cites 'RDKit [30]' for extracting FGs, but it does not provide specific version numbers for these or for other software dependencies such as Python, PyTorch, CUDA, or other libraries used in the experiments. |
| Experiment Setup | Yes | Table 9b summarizes the hyper-parameters. We use different hyper-parameters given different graph encoders. The architectures of the two graph encoders are borrowed from previous works: GINE [12] and GTS [27]. We use large batch sizes of 1024 and 2048 to speed up pretraining. We do not use dropout during pretraining. During fine-tuning, we use 50% dropout in GINE layers and 30% dropout in transformer layers. Table 10b: Hyperparameters and their search spaces. Table 11: Hyperparameters for fine-tuning on the QM datasets. (See the hyperparameter sketch following this table.) |
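
Since the Pseudocode row only names Algorithm 1, the block below is a minimal, hypothetical sketch of a generic masked graph modeling (MGM) pretraining step in PyTorch style, not the authors' released code. The names `mgm_pretrain_step`, `encoder`, `decoder`, `tokenizer`, and the graph attribute layout are placeholders.

```python
import torch
import torch.nn.functional as F

def mgm_pretrain_step(graph, encoder, decoder, tokenizer, mask_ratio=0.35):
    """One masked-graph-modeling step (illustrative only, not the authors' code).

    graph.x          : [num_nodes, feat_dim] node features
    graph.edge_index : [2, num_edges] connectivity
    """
    num_nodes = graph.x.size(0)
    # Randomly choose nodes to mask.
    mask = torch.rand(num_nodes) < mask_ratio

    # Corrupt the graph by zeroing out the masked node features
    # (a learned mask embedding could be used instead).
    x_masked = graph.x.clone()
    x_masked[mask] = 0.0

    # Encode the corrupted graph, then decode predictions for every node.
    h = encoder(x_masked, graph.edge_index)      # [num_nodes, hidden_dim]
    pred = decoder(h, graph.edge_index)          # [num_nodes, vocab_size]

    # The tokenizer maps the original, uncorrupted graph to target token ids.
    with torch.no_grad():
        target = tokenizer(graph.x, graph.edge_index)  # [num_nodes] long tensor

    # Reconstruction loss is computed only on the masked positions.
    loss = F.cross_entropy(pred[mask], target[mask])
    return loss
```

The split into `encoder`, `decoder`, and `tokenizer` mirrors the three components the paper studies, but their concrete architectures here are left abstract.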
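
The scaffold split quoted in the Dataset Splits row is commonly implemented by grouping molecules on their Bemis-Murcko scaffolds. The helper below is a hedged sketch using RDKit (which the paper cites for feature extraction); it is an assumption about the procedure, not the authors' exact splitter.

```python
from collections import defaultdict
from rdkit import Chem
from rdkit.Chem.Scaffolds import MurckoScaffold

def scaffold_split(smiles_list, frac_train=0.8, frac_valid=0.1):
    """Deterministic scaffold split (illustrative; not the released splitter)."""
    # Group molecule indices by their Bemis-Murcko scaffold SMILES.
    scaffold_to_indices = defaultdict(list)
    for idx, smi in enumerate(smiles_list):
        mol = Chem.MolFromSmiles(smi)
        if mol is None:
            continue  # skip unparsable molecules
        scaffold = MurckoScaffold.MurckoScaffoldSmiles(mol=mol, includeChirality=False)
        scaffold_to_indices[scaffold].append(idx)

    # Assign whole scaffold groups (largest first) so no scaffold spans two sets,
    # which yields the out-of-distribution evaluation described in the quote.
    groups = sorted(scaffold_to_indices.values(), key=len, reverse=True)
    n = len(smiles_list)
    train, valid, test = [], [], []
    for group in groups:
        if len(train) + len(group) <= frac_train * n:
            train.extend(group)
        elif len(valid) + len(group) <= frac_valid * n:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test
```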
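
As an illustration of the Experiment Setup row, the dictionaries below collect only the hyperparameter values stated in that quote; every other field is a placeholder (`None`) and is an assumption, not taken from the paper (see Tables 9b-11 for the full settings and search spaces).

```python
# Pretraining settings mentioned in the quoted setup.
pretrain_config = {
    "encoder": "GINE",      # GTS (graph transformer) is the alternative encoder
    "batch_size": 1024,     # 2048 is also used, depending on the encoder
    "dropout": 0.0,         # no dropout during pretraining
    "learning_rate": None,  # not stated in the excerpt
}

# Fine-tuning settings mentioned in the quoted setup.
finetune_config = {
    "dropout_gine_layers": 0.5,         # 50% dropout in GINE layers
    "dropout_transformer_layers": 0.3,  # 30% dropout in transformer layers
    "learning_rate": None,              # tuned via validation performance
}
```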