Mole-BERT: Rethinking Pre-training Graph Neural Networks for Molecules

Authors: Jun Xia, Chengshuai Zhao, Bozhen Hu, Zhangyang Gao, Cheng Tan, Yue Liu, Siyuan Li, Stan Z. Li

ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | For the pre-training stage, we use 2 million molecules sampled from the ZINC15 database (Sterling & Irwin, 2015) following previous works (Hu et al., 2020). The main downstream task is molecular property prediction, where we adopt the widely used 8 binary classification datasets contained in MoleculeNet (Wu et al., 2018). Note that we use scaffold splitting (Ramsundar et al., 2019), which splits the molecules according to their structures to mimic real-world use cases. Additionally, we validate the effectiveness of Mole-BERT on a broader range of downstream tasks and datasets (see Section 5.3).
Researcher Affiliation | Collaboration | Jun Xia1, Chengshuai Zhao1,2, Bozhen Hu1, Zhangyang Gao1, Cheng Tan1, Yue Liu1, Siyuan Li1, Stan Z. Li1. 1AI Lab, Research Center for Industries of the Future, Westlake University; 2University of California, Irvine. {xiajun, Stan.ZQ.Li}@westlake.edu.cn; chengsz4@uci.edu
Pseudocode | No | No explicit pseudocode or algorithm blocks were found.
Open Source Code | Yes | We release the code at https://github.com/junxia97/Mole-BERT.
Open Datasets | Yes | For the pre-training stage, we use 2 million molecules sampled from the ZINC15 database (Sterling & Irwin, 2015) following previous works (Hu et al., 2020). The main downstream task is molecular property prediction, where we adopt the widely used 8 binary classification datasets contained in MoleculeNet (Wu et al., 2018).
Dataset Splits | Yes | The split for train/validation/test sets is 80% : 10% : 10% (a scaffold split; see the sketch after this table).
Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory) used for running the experiments were mentioned in the paper.
Software Dependencies | No | The paper mentions using RDKit but does not provide specific version numbers for RDKit or any other software dependencies required for reproduction.
Experiment Setup | Yes | We use a 5-layer Graph Isomorphism Network (GIN) with hidden dimension 300 (Xu et al., 2019) as the backbone architecture... During the pre-training stage, GNNs are pre-trained for 100 epochs with a batch size of 256 and a learning rate of 0.001. During the fine-tuning stage, we train for 100 epochs with a batch size of 32 and report the test score corresponding to the best cross-validation performance. The hyper-parameter µ is picked from {0.1, 0.3, 0.5} using the validation set. For tokenizer training, we adopt the above 5-layer GIN as the encoder and the decoder, trained for 60 epochs on the 2 million unlabeled molecules sampled from the ZINC15 database with a batch size of 256 and a learning rate of 0.001. For TMCL, we set the masking ratios to 0.15 and 0.30, respectively. The temperature parameter τ is set to 0.1. We use a batch size of 32 and a dropout rate of 0.5. (Configuration and loss sketches follow after this table.)
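
The scaffold split referenced in the Research Type and Dataset Splits rows groups molecules by their Bemis-Murcko scaffold, so that whole scaffold groups land in a single partition and the test set contains structurally novel compounds. Below is a minimal sketch of an 80% : 10% : 10% scaffold split, assuming RDKit is available; `scaffold_split` is an illustrative helper name, not code from the released repository.

```python
from collections import defaultdict
from rdkit.Chem.Scaffolds import MurckoScaffold

def scaffold_split(smiles_list, frac_train=0.8, frac_valid=0.1, frac_test=0.1):
    """Illustrative 80/10/10 scaffold split: molecules sharing a Bemis-Murcko
    scaffold are kept in the same partition to mimic real-world use cases."""
    # Group molecule indices by their Murcko scaffold SMILES.
    scaffold_to_indices = defaultdict(list)
    for idx, smiles in enumerate(smiles_list):
        scaffold = MurckoScaffold.MurckoScaffoldSmiles(smiles=smiles, includeChirality=False)
        scaffold_to_indices[scaffold].append(idx)

    # Assign whole scaffold groups to train/valid/test, largest groups first.
    groups = sorted(scaffold_to_indices.values(), key=len, reverse=True)
    n = len(smiles_list)
    train_cutoff, valid_cutoff = frac_train * n, (frac_train + frac_valid) * n
    train_idx, valid_idx, test_idx = [], [], []
    for group in groups:
        if len(train_idx) + len(group) <= train_cutoff:
            train_idx.extend(group)
        elif len(train_idx) + len(valid_idx) + len(group) <= valid_cutoff:
            valid_idx.extend(group)
        else:
            test_idx.extend(group)
    return train_idx, valid_idx, test_idx
```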
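
The Experiment Setup row specifies a 5-layer GIN backbone with hidden dimension 300, pre-trained for 100 epochs with batch size 256 and learning rate 0.001. The sketch below shows how such a backbone could be configured with PyTorch Geometric's `GINConv`; the class name, the MLP layout, and the choice of Adam are assumptions rather than the authors' released implementation, which may differ (e.g., in how atom and bond features are embedded).

```python
import torch
import torch.nn as nn
from torch_geometric.nn import GINConv

class GINBackbone(nn.Module):
    """Sketch of a 5-layer GIN encoder with hidden dimension 300 (hypothetical class)."""
    def __init__(self, in_dim, hidden_dim=300, num_layers=5):
        super().__init__()
        self.convs = nn.ModuleList()
        self.norms = nn.ModuleList()
        for i in range(num_layers):
            mlp = nn.Sequential(
                nn.Linear(in_dim if i == 0 else hidden_dim, hidden_dim),
                nn.ReLU(),
                nn.Linear(hidden_dim, hidden_dim),
            )
            self.convs.append(GINConv(mlp))
            self.norms.append(nn.BatchNorm1d(hidden_dim))

    def forward(self, x, edge_index):
        for conv, norm in zip(self.convs, self.norms):
            x = torch.relu(norm(conv(x, edge_index)))
        return x  # node embeddings; pool (e.g., mean) for graph-level prediction

# Optimization settings quoted above: pre-training for 100 epochs,
# batch size 256, learning rate 0.001 (Adam is an assumption).
model = GINBackbone(in_dim=120)  # in_dim is a placeholder atom-feature dimension
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```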
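
The TMCL settings quoted above use masking ratios of 0.15 and 0.30 and a temperature τ = 0.1. The snippet below is not the paper's triplet masked contrastive learning objective; it is a generic temperature-scaled (NT-Xent-style) contrastive loss that only illustrates where the temperature and the two differently masked views enter the computation.

```python
import torch
import torch.nn.functional as F

def temperature_contrastive_loss(z_a, z_b, temperature=0.1):
    """Generic contrastive loss between two views of the same batch of graphs.
    Matching rows of z_a and z_b are treated as positive pairs."""
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature                     # cosine similarities scaled by tau
    targets = torch.arange(z_a.size(0), device=z_a.device)   # positives lie on the diagonal
    return F.cross_entropy(logits, targets)

# Placeholder embeddings for two views masked at ratios 0.15 and 0.30,
# with the pre-training batch size of 256 and tau = 0.1 quoted above.
view_15 = torch.randn(256, 300)
view_30 = torch.randn(256, 300)
loss = temperature_contrastive_loss(view_15, view_30, temperature=0.1)
```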