On the Scalability of GNNs for Molecular Graphs
Authors: Maciej Sypetkowski, Frederik Wenkel, Farimah Poursafaei, Nia Dickson, Karush Suri, Philip Fradkin, Dominique Beaini
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Specifically, we analyze message-passing networks, graph Transformers, and hybrid architectures on the largest public collection of 2D molecular graphs for supervised pretraining. For the first time, we observe that GNNs benefit tremendously from the increasing scale of depth, width, number of molecules and associated labels. |
| Researcher Affiliation | Collaboration | Maciej Sypetkowski (Valence Labs, Montreal; maciej@valencelabs.com); Frederik Wenkel (Valence Labs, Montreal; Université de Montréal, Mila Quebec; frederik@valencelabs.com); Farimah Poursafaei (Valence Labs, Montreal; McGill University, Mila Quebec); Nia Dickson (NVIDIA Corporation); Karush Suri (Valence Labs, Montreal); Philip Fradkin (Valence Labs, Montreal; University of Toronto, Vector Institute); Dominique Beaini (Valence Labs, Montreal; Université de Montréal, Mila Quebec) |
| Pseudocode | No | The paper presents mathematical equations for the architectures, but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | For pretraining, we use datasets and code from the literature [7]. The code can be found at https://github.com/datamol-io/graphium |
| Open Datasets | Yes | For pretraining, we use datasets and code from the literature [7]. The code can be found at https://github.com/datamol-io/graphium, while the data can be found at https://zenodo.org/records/10797794. |
| Dataset Splits | Yes | The models are tested in two different settings: (1) randomly split train and test sets for pretraining, and (2) finetuning/probing of pretrained models on standard benchmarks. |
| Hardware Specification | Yes | We used multi-GPU training (with up to 8 NVIDIA A100-SXM4-40GB GPUs) and gradient accumulation, while adjusting batch size to keep the effective batch size constant. Most models were trained on single GPUs, but our 300M and 1B parameter models used 4 and 8 GPUs, respectively. (See the gradient-accumulation sketch after the table.) |
| Software Dependencies | No | The paper mentions using the Adam optimizer but does not specify versions for any programming languages, libraries (e.g., PyTorch, TensorFlow), or other software dependencies. |
| Experiment Setup | Yes | All models use 2-layer MLPs to encode node and edge features, respectively, followed by the core model of 16 layers of the MPNN++, Transformer or GPS++ (except when scaling depth). ... Further, all layers use layer norm and dropout with p = 0.1. ... Our base MPNN++, Transformer and hybrid GPS++ models are trained using Adam with a base learning rate of 0.003, 0.001, and 0.001, respectively. We use 5 warm-up epochs followed by linear learning rate decay. All pretraining has been conducted with a batch size of 1024. (See the training-recipe sketch after the table.) |
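
The hardware row mentions adjusting the per-GPU batch size and using gradient accumulation so that the effective batch size stays at 1024 across different GPU counts. Below is a minimal PyTorch sketch of that idea, not code from the graphium repository; the GPU count, micro-batch size, and placeholder model are illustrative assumptions.

```python
# Hypothetical sketch (not from the paper's codebase): keep the effective
# batch size at 1024 when the per-GPU micro-batch must shrink, by
# accumulating gradients over several micro-batches before each step.
import torch

EFFECTIVE_BATCH_SIZE = 1024   # reported pretraining batch size
num_gpus = 4                  # e.g. the 300M-parameter model used 4 GPUs
micro_batch = 64              # assumed per-GPU batch that fits in memory

accum_steps = EFFECTIVE_BATCH_SIZE // (num_gpus * micro_batch)

model = torch.nn.Linear(16, 1)  # placeholder for the GNN
optimizer = torch.optim.Adam(model.parameters(), lr=3e-3)

def training_step(micro_batches):
    """`micro_batches` yields `accum_steps` tuples (x, y) per optimizer step."""
    optimizer.zero_grad()
    for x, y in micro_batches:
        loss = torch.nn.functional.mse_loss(model(x), y)
        # Scale so the accumulated gradient matches one large-batch update.
        (loss / accum_steps).backward()
    optimizer.step()
```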
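
The experiment-setup row reports Adam with a base learning rate, 5 warm-up epochs followed by linear decay, layer norm, and dropout with p = 0.1. The sketch below shows one way to express that recipe in PyTorch; the stand-in model, the total epoch count, and the schedule endpoint (decay to zero) are assumptions, since they are not quoted above.

```python
# Minimal sketch of the reported optimisation recipe: Adam, 5 warm-up epochs,
# then linear learning-rate decay.  The model is a stand-in, not the paper's
# MPNN++/Transformer/GPS++ architectures.
import torch

BASE_LR = 3e-3        # reported base LR for the MPNN++ models
WARMUP_EPOCHS = 5     # reported warm-up length
TOTAL_EPOCHS = 100    # assumed; the total epoch count is not quoted above

model = torch.nn.Sequential(
    torch.nn.Linear(64, 64),
    torch.nn.LayerNorm(64),   # all layers use layer norm ...
    torch.nn.Dropout(p=0.1),  # ... and dropout with p = 0.1
    torch.nn.ReLU(),
    torch.nn.Linear(64, 1),
)
optimizer = torch.optim.Adam(model.parameters(), lr=BASE_LR)

def lr_lambda(epoch: int) -> float:
    if epoch < WARMUP_EPOCHS:                 # linear warm-up to the base LR
        return (epoch + 1) / WARMUP_EPOCHS
    remaining = TOTAL_EPOCHS - WARMUP_EPOCHS  # then linear decay (assumed to zero)
    return max(0.0, 1.0 - (epoch - WARMUP_EPOCHS) / remaining)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
# Typical usage: call scheduler.step() once per epoch after the training loop.
```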