Does GNN Pretraining Help Molecular Representation?

Authors: Ruoxi Sun, Hanjun Dai, Adams Wei Yu

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct thorough ablation studies on the key components of GNN pretraining, including pretraining objectives, data splitting methods, input features, pretraining dataset scales, and GNN architectures, to see how they affect the accuracy of the downstream tasks. Our first important finding is that self-supervised graph pretraining does not always have statistically significant advantages over non-pretraining methods in many settings.
Researcher Affiliation | Industry | Ruoxi Sun, Google Cloud AI Research (ruoxis@google.com); Hanjun Dai, Google Research, Brain Team (hadai@google.com); Adams Wei Yu, Google Research, Brain Team (adamsyuwei@google.com)
Pseudocode | No | No pseudocode or algorithm blocks were found in the paper.
Open Source Code | No | Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [No] We will prepare code soon.
Open Datasets | Yes | ZINC15 [25]: ZINC15 contains 2 million molecules. This dataset was preprocessed following Hu et al. [12]. SAVI [19]: The SAVI dataset contains about 1 billion molecules, which is significantly larger than ZINC15. ... Additionally, we used ChEMBL [8] as the supervised dataset.
Dataset Splits | Yes | The train/valid/test sets are split with ratio 8:1:1. For the molecule domain, a random split is not the most meaningful way to assess performance, because real-world scenarios often require generalization to out-of-distribution samples. So we consider the following ways to split the data: Scaffold Split [12, 21]: this strategy first sorts the molecules according to the scaffold (e.g. molecule structure), and then partitions the sorted list into train/valid/test splits consecutively. Balanced Scaffold Split [1, 22]: this strategy introduces randomness into the sorting and splitting stages above, so one can run on splits with different random seeds and report the average performance to lower the evaluation variance.
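The scaffold-split idea described above can be sketched in a few lines. This is a minimal sketch, not the paper's implementation: `scaffold_of` is a hypothetical stand-in for a real scaffold extractor (in practice, RDKit's Bemis-Murcko scaffold is commonly used), and the grouping/bucketing heuristic is a common simplification of the strategy in Hu et al. [12].

```python
from collections import defaultdict

def scaffold_split(molecules, scaffold_of, frac=(0.8, 0.1, 0.1)):
    """Group molecules by scaffold, then fill train/valid/test
    buckets consecutively with whole scaffold groups (8:1:1 default)."""
    groups = defaultdict(list)
    for i, mol in enumerate(molecules):
        groups[scaffold_of(mol)].append(i)
    # Larger scaffold groups first, so common scaffolds land in train
    # and rarer (more out-of-distribution) scaffolds end up in test.
    ordered = sorted(groups.values(), key=len, reverse=True)
    n = len(molecules)
    n_train, n_valid = int(frac[0] * n), int(frac[1] * n)
    train, valid, test = [], [], []
    for group in ordered:
        if len(train) + len(group) <= n_train:
            train.extend(group)
        elif len(valid) + len(group) <= n_valid:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test
```

The balanced scaffold split would additionally shuffle `ordered` with a random seed before filling the buckets, enabling averaging over seeds as the response notes.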
Hardware Specification | No | The paper's checklist answers 'Yes' to 'Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)?', but no specific hardware details (such as GPU models, CPU types, or cloud instance specs) are provided in the main text or readily found within the provided context.
Software Dependencies | No | The paper mentions RDKit [15] but does not provide specific version numbers for software dependencies.
Experiment Setup | Yes | We tune the learning rate in {1e-4, 5e-4, 1e-3, 5e-3, 1e-2, 5e-2, 1e-1} for each setup individually and select the one with the best validation performance. For GNNs we fix the hidden dimension to 300 and the number of layers to 5.
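The learning-rate sweep above is a plain grid search over seven values. A minimal sketch follows; `train_and_evaluate` is a hypothetical stand-in that trains a model at a given learning rate and returns its validation performance (higher is better), which is not part of the paper's released code.

```python
def select_learning_rate(train_and_evaluate,
                         grid=(1e-4, 5e-4, 1e-3, 5e-3, 1e-2, 5e-2, 1e-1)):
    """Try each learning rate in the grid and keep the one with the
    best validation performance, per the paper's tuning protocol."""
    best_lr, best_score = None, float("-inf")
    for lr in grid:
        score = train_and_evaluate(lr)  # e.g. validation ROC-AUC
        if score > best_score:
            best_lr, best_score = lr, score
    return best_lr, best_score
```

The sweep is run independently for each setup (objective, split, architecture), so the selected rate can differ across the paper's ablation cells.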