Does GNN Pretraining Help Molecular Representation?
Authors: Ruoxi Sun, Hanjun Dai, Adams Wei Yu
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct thorough ablation studies on the key components of GNN pretraining, including pretraining objectives, data splitting methods, input features, pretraining dataset scales, and GNN architectures, to see how they affect the accuracy of the downstream tasks. Our first important finding is that self-supervised graph pretraining does not always have statistically significant advantages over non-pretraining methods in many settings. |
| Researcher Affiliation | Industry | Ruoxi Sun, Google Cloud AI Research (ruoxis@google.com); Hanjun Dai, Google Research, Brain Team (hadai@google.com); Adams Wei Yu, Google Research, Brain Team (adamsyuwei@google.com) |
| Pseudocode | No | No pseudocode or algorithm blocks were found in the paper. |
| Open Source Code | No | Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [No]. We will prepare the code soon. |
| Open Datasets | Yes | ZINC15 [25]: ZINC15 contains 2 million molecules. This dataset was preprocessed following Hu et al. [12]. SAVI [19]: The SAVI dataset contains about 1 billion molecules, which is significantly larger than ZINC15. ... Additionally, we used ChEMBL [8] as the supervised dataset. |
| Dataset Splits | Yes | The train/valid/test sets are split with ratio 8:1:1. For the molecule domain, a random split is not the most meaningful way to assess performance, because real-world scenarios often require generalization to out-of-distribution samples. So we consider the following ways to split the data: Scaffold Split [12, 21] This strategy first sorts the molecules according to the scaffold (i.e. the molecule's core structure), and then partitions the sorted list into train/valid/test splits consecutively. Balanced Scaffold Split [1, 22] This strategy introduces randomness in the sorting and splitting stages above, so one can run on splits with different random seeds and report the average performance to lower the evaluation variance. |
| Hardware Specification | No | The paper states 'Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes]' in its checklist, but no specific hardware details (like GPU models, CPU types, or cloud instance specs) are provided in the main text or readily found within the provided context. |
| Software Dependencies | No | The paper mentions 'RDKit [15]' but does not provide specific version numbers for software dependencies. |
| Experiment Setup | Yes | We tune the learning rate in {1e-4, 5e-4, 1e-3, 5e-3, 1e-2, 5e-2, 1e-1} for each setup individually and select the one with the best validation performance. For GNNs we fix the hidden dimension to 300 and the number of layers to 5. |
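The scaffold split and balanced scaffold split described in the Dataset Splits row can be sketched in plain Python. This is an illustrative sketch, not the paper's code: it assumes scaffolds (e.g. Murcko scaffold SMILES, normally computed with RDKit) are already mapped to molecule indices, and the function and parameter names are hypothetical. Keeping all molecules that share a scaffold in the same split is what makes the evaluation out-of-distribution.

```python
import random
from collections import defaultdict

def scaffold_split(scaffolds, frac_train=0.8, frac_valid=0.1, seed=None):
    """Partition molecule indices into train/valid/test by scaffold.

    scaffolds: dict mapping molecule index -> scaffold string
               (e.g. a Murcko scaffold SMILES; computing these would
               normally use RDKit, which is assumed, not shown).
    seed=None mimics the deterministic scaffold split (largest scaffold
    groups assigned first); passing a seed mimics the balanced variant,
    which shuffles scaffold groups before assignment.
    """
    # Group molecule indices by their scaffold.
    groups = defaultdict(list)
    for idx, scaf in scaffolds.items():
        groups[scaf].append(idx)

    # Deterministic: largest scaffold groups first. Balanced: shuffle.
    buckets = sorted(groups.values(), key=len, reverse=True)
    if seed is not None:
        random.Random(seed).shuffle(buckets)

    n = len(scaffolds)
    n_train, n_valid = int(frac_train * n), int(frac_valid * n)
    train, valid, test = [], [], []
    # Assign whole scaffold groups, so no scaffold straddles two splits.
    for bucket in buckets:
        if len(train) + len(bucket) <= n_train:
            train.extend(bucket)
        elif len(valid) + len(bucket) <= n_valid:
            valid.extend(bucket)
        else:
            test.extend(bucket)
    return train, valid, test
```

Because assignment happens per scaffold group, the realized split ratios only approximate 8:1:1; averaging over several seeds, as the balanced variant does, smooths out this variance.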