On the Scalability of GNNs for Molecular Graphs

Authors: Maciej Sypetkowski, Frederik Wenkel, Farimah Poursafaei, Nia Dickson, Karush Suri, Philip Fradkin, Dominique Beaini

NeurIPS 2024

Reproducibility assessment. Each entry below pairs a reproducibility variable with its result, followed by the LLM response that supports it.
Research Type: Experimental
LLM Response: "Specifically, we analyze message-passing networks, graph Transformers, and hybrid architectures on the largest public collection of 2D molecular graphs for supervised pretraining. For the first time, we observe that GNNs benefit tremendously from the increasing scale of depth, width, number of molecules and associated labels."

Researcher Affiliation: Collaboration
LLM Response: "Maciej Sypetkowski (Valence Labs, Montreal; maciej@valencelabs.com); Frederik Wenkel (Valence Labs, Montreal; Université de Montréal, Mila Quebec; frederik@valencelabs.com); Farimah Poursafaei (Valence Labs, Montreal; McGill University, Mila Quebec); Nia Dickson (NVIDIA Corporation); Karush Suri (Valence Labs, Montreal); Philip Fradkin (Valence Labs, Montreal; University of Toronto, Vector Institute); Dominique Beaini (Valence Labs, Montreal; Université de Montréal, Mila Quebec)"

Pseudocode: No
LLM Response: "The paper presents mathematical equations for the architectures, but does not include structured pseudocode or algorithm blocks."

Open Source Code: Yes
LLM Response: "For pretraining, we use datasets and code from the literature [7]. The code can be found at https://github.com/datamol-io/graphium"

Open Datasets: Yes
LLM Response: "For pretraining, we use datasets and code from the literature [7]. The code can be found at https://github.com/datamol-io/graphium, while the data can be found at https://zenodo.org/records/10797794." (A hedged download sketch for this Zenodo record follows the table.)

Dataset Splits: Yes
LLM Response: "The models are tested in 2 different settings: (1) randomly split train and test sets for pretraining and (2) finetuning/probing of pretrained models on standard benchmarks." (A minimal random-split sketch follows the table.)

Hardware Specification: Yes
LLM Response: "We used multi-GPU training (with up to 8 NVIDIA A100-SXM4-40GB GPUs) and gradient accumulation, while adjusting batch size to keep the effective batch size constant. Most models were trained on single GPUs, but our 300M and 1B parameter models used 4 and 8 GPUs, respectively." (A gradient-accumulation sketch follows the table.)

Software Dependencies: No
LLM Response: "The paper mentions using the Adam optimizer but does not specify versions for any programming languages, libraries (e.g., PyTorch, TensorFlow), or other software dependencies."

Experiment Setup: Yes
LLM Response: "All models use 2-layer MLPs to encode node and edge features, respectively, followed by the core model of 16 layers of the MPNN++, Transformer or GPS++ (except when scaling depth). ... Further, all layers use layer norm and dropout with p = 0.1. ... Our base MPNN++, Transformer and hybrid GPS++ models are trained using Adam with a base learning rate of 0.003, 0.001, and 0.001, respectively. We use 5 warm-up epochs followed by linear learning rate decay. All pretraining has been conducted with a batch size of 1024." (A warm-up/linear-decay schedule sketch follows the table.)

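The dataset entry above points to a Zenodo record. Below is a minimal, hedged sketch of how one might list and download the files in that record, assuming the standard Zenodo REST API (GET /api/records/<id> returning a "files" list). The file names and sizes are not taken from the paper, and the graphium library may provide its own download utilities.

    import requests

    RECORD_ID = "10797794"  # Zenodo record cited for the pretraining data

    # Fetch the record metadata (assumes the standard Zenodo REST API layout).
    record = requests.get(f"https://zenodo.org/api/records/{RECORD_ID}", timeout=30)
    record.raise_for_status()

    # Download every file attached to the record into the working directory.
    for entry in record.json().get("files", []):
        name = entry["key"]            # file name as stored on Zenodo
        url = entry["links"]["self"]   # direct download link
        print(f"downloading {name}")
        with requests.get(url, stream=True, timeout=60) as resp:
            resp.raise_for_status()
            with open(name, "wb") as fh:
                for chunk in resp.iter_content(chunk_size=1 << 20):
                    fh.write(chunk)
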
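For setting (1) in the dataset-splits entry, a randomly split train and test set, the sketch below shows the general idea using PyTorch's random_split. The split fraction, seed, and placeholder TensorDataset are illustrative assumptions standing in for the molecular graph dataset.

    import torch
    from torch.utils.data import TensorDataset, random_split

    # Placeholder data standing in for the molecular graph dataset.
    dataset = TensorDataset(torch.randn(1000, 64))

    n_test = int(0.1 * len(dataset))  # assumed 10% held-out test set
    train_set, test_set = random_split(
        dataset,
        [len(dataset) - n_test, n_test],
        generator=torch.Generator().manual_seed(0),  # reproducible split
    )
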
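The hardware entry mentions gradient accumulation with the batch size adjusted so the effective batch size stays constant across GPU counts. The sketch below illustrates that pattern in PyTorch; the model, micro-batch size, and loss function are placeholders, not the authors' training code.

    import torch

    EFFECTIVE_BATCH = 1024                        # effective batch size from the paper
    MICRO_BATCH = 128                             # whatever fits in GPU memory (assumption)
    ACCUM_STEPS = EFFECTIVE_BATCH // MICRO_BATCH  # micro-batches accumulated per update

    model = torch.nn.Linear(64, 1)                # stand-in for the GNN
    optimizer = torch.optim.Adam(model.parameters(), lr=3e-3)
    loss_fn = torch.nn.MSELoss()

    def train_epoch(dataloader):
        optimizer.zero_grad()
        for step, (x, y) in enumerate(dataloader):
            # Scale the loss so accumulated gradients match a full-batch average.
            loss = loss_fn(model(x), y) / ACCUM_STEPS
            loss.backward()                       # gradients accumulate in .grad
            if (step + 1) % ACCUM_STEPS == 0:
                optimizer.step()
                optimizer.zero_grad()
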
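The experiment setup reports Adam with a base learning rate of 0.003 (MPNN++) or 0.001 (Transformer and GPS++), 5 warm-up epochs, and then linear decay. A minimal sketch of that schedule with PyTorch's LambdaLR follows; the total number of epochs is an assumption for illustration, and the authors' exact schedule may differ in detail.

    import torch

    WARMUP_EPOCHS = 5    # from the paper
    TOTAL_EPOCHS = 100   # assumption for illustration; not stated in this excerpt
    BASE_LR = 3e-3       # MPNN++ base learning rate

    model = torch.nn.Linear(64, 1)  # stand-in for the GNN
    optimizer = torch.optim.Adam(model.parameters(), lr=BASE_LR)

    def lr_lambda(epoch: int) -> float:
        if epoch < WARMUP_EPOCHS:
            # Linear warm-up toward the base learning rate over the first 5 epochs.
            return (epoch + 1) / WARMUP_EPOCHS
        # Linear decay from the base learning rate toward zero afterwards.
        return max(0.0, (TOTAL_EPOCHS - epoch) / (TOTAL_EPOCHS - WARMUP_EPOCHS))

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

    for epoch in range(TOTAL_EPOCHS):
        # ... one pretraining epoch at the current learning rate ...
        scheduler.step()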