Does GNN Pretraining Help Molecular Representation?

Authors: Ruoxi Sun, Hanjun Dai, Adams Wei Yu

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct thorough ablation studies on the key components of GNN pretraining, including pretraining objectives, data splitting methods, input features, pretraining dataset scales, and GNN architectures, to see how they affect the accuracy of the downstream tasks. Our first important finding is that self-supervised graph pretraining does not always have statistically significant advantages over non-pretraining methods in many settings.
Researcher Affiliation | Industry | Ruoxi Sun, Google Cloud AI Research (ruoxis@google.com); Hanjun Dai, Google Research, Brain Team (hadai@google.com); Adams Wei Yu, Google Research, Brain Team (adamsyuwei@google.com)
Pseudocode | No | No pseudocode or algorithm blocks were found in the paper.
Open Source Code | No | Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [No] We will prepare code soon.
Open Datasets | Yes | ZINC15 [25]: ZINC15 contains 2 million molecules. This dataset was preprocessed following Hu et al. [12]. SAVI [19]: The SAVI dataset contains about 1 billion molecules, which is significantly larger than ZINC15. ... Additionally, we used ChEMBL [8] as the supervised dataset.
Dataset Splits | Yes | The train/valid/test sets are split with ratio 8:1:1. For the molecule domain, a random split is not the most meaningful way to assess performance, because real-world scenarios often require generalization to out-of-distribution samples. So we consider the following ways to split the data: Scaffold Split [12, 21]: this strategy first sorts the molecules according to the scaffold (e.g. molecule structure), and then partitions the sorted list into train/valid/test splits consecutively. Balanced Scaffold Split [1, 22]: this strategy introduces randomness into the sorting and splitting stages above, so one can run on splits with different random seeds and report the average performance to lower the evaluation variance.
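The scaffold-split idea described above can be sketched in a few lines. This is a minimal sketch, not the paper's implementation: `scaffold_of` is a hypothetical stand-in for a real scaffold extractor (in practice, RDKit's Bemis-Murcko scaffold is commonly used), and the grouping/bucketing heuristic is a common simplification of the strategy in Hu et al. [12].

```python
from collections import defaultdict

def scaffold_split(molecules, scaffold_of, frac=(0.8, 0.1, 0.1)):
    """Group molecules by scaffold, then fill train/valid/test
    buckets consecutively with whole scaffold groups (8:1:1 default)."""
    groups = defaultdict(list)
    for i, mol in enumerate(molecules):
        groups[scaffold_of(mol)].append(i)
    # Larger scaffold groups first, so common scaffolds land in train
    # and rarer (more out-of-distribution) scaffolds end up in test.
    ordered = sorted(groups.values(), key=len, reverse=True)
    n = len(molecules)
    n_train, n_valid = int(frac[0] * n), int(frac[1] * n)
    train, valid, test = [], [], []
    for group in ordered:
        if len(train) + len(group) <= n_train:
            train.extend(group)
        elif len(valid) + len(group) <= n_valid:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test
```

The balanced scaffold split would additionally shuffle `ordered` with a random seed before filling the buckets, enabling averaging over seeds as the response notes.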
Hardware Specification | No | The paper's checklist answers 'Yes' to 'Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)?', but no specific hardware details (such as GPU models, CPU types, or cloud instance specs) are provided in the main text or readily found within the provided context.
Software Dependencies | No | The paper mentions RDKit [15] but does not provide specific version numbers for software dependencies.
Experiment Setup | Yes | We tune the learning rate in {1e-4, 5e-4, 1e-3, 5e-3, 1e-2, 5e-2, 1e-1} for each setup individually and select the one with the best validation performance. For GNNs we fix the hidden dimension to 300 and the number of layers to 5.
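The learning-rate sweep above is a plain grid search over seven values. A minimal sketch follows; `train_and_evaluate` is a hypothetical stand-in that trains a model at a given learning rate and returns its validation performance (higher is better), which is not part of the paper's released code.

```python
def select_learning_rate(train_and_evaluate,
                         grid=(1e-4, 5e-4, 1e-3, 5e-3, 1e-2, 5e-2, 1e-1)):
    """Try each learning rate in the grid and keep the one with the
    best validation performance, per the paper's tuning protocol."""
    best_lr, best_score = None, float("-inf")
    for lr in grid:
        score = train_and_evaluate(lr)  # e.g. validation ROC-AUC
        if score > best_score:
            best_lr, best_score = lr, score
    return best_lr, best_score
```

The sweep is run independently for each setup (objective, split, architecture), so the selected rate can differ across the paper's ablation cells.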