Molecular Geometry Pretraining with SE(3)-Invariant Denoising Distance Matching

Authors: Shengchao Liu, Hongyu Guo, Jian Tang

ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our comprehensive experiments confirm the effectiveness and robustness of our proposed method. Using 22 downstream geometric molecular prediction tasks, we empirically verify that our method outperforms nine pretraining baselines.
Researcher Affiliation | Collaboration | Shengchao Liu (Mila - Québec AI Institute, Université de Montréal, liusheng@mila.quebec); Hongyu Guo (National Research Council Canada, University of Ottawa, hongyu.guo@uottawa.ca); Jian Tang (Mila - Québec AI Institute, HEC Montréal, CIFAR AI Chair, jian.tang@hec.ca)
Pseudocode | Yes | Algorithm 1: GeoSSL-DDM pretraining (an illustrative sketch of the denoising distance matching step appears after the table).
Open Source Code | Yes | To ensure the reproducibility of the empirical results, we provide the implementation details (hyperparameters, dataset statistics, etc.) in Section 5 and Appendix D, and publicly share our source code through this GitHub link.
Open Datasets | Yes | The PubChemQC database is a large-scale database with around 4M molecules with 3D geometries... Following this, Molecule3D [73] takes the ground-state geometries from PubChemQC and transforms the data into a deep-learning-friendly format. For our molecular geometry pretraining, we take a subset of 1M molecules with 3D geometries from Molecule3D. QM9 [46] is a dataset of 134K molecules with up to 9 heavy atoms. MD17 [10] is a dataset on molecular dynamics simulation. Atom3D [66] is a recently published dataset.
Dataset Splits | Yes | QM9 [46] is a dataset of 134K molecules with up to 9 heavy atoms. We take 110K for training, 10K for validation, and 11K for test (see the split example after the table). For MD17, we follow the literature [31, 41, 50, 51] in using 1K for training and 1K for validation, while the test set (from 48K to 991K) is much larger. For LBA, we use split-by-sequence-identity-30: we split protein-ligand complexes such that no protein in the test dataset has more than 30% sequence identity with any protein in the training dataset. For LEP, we split the complex pairs by protein target.
Hardware Specification | Yes | We have around 20 V100 GPU cards for computation on an internal cluster. Each job can be finished within 3-24 hours (each job takes a single GPU card).
Software Dependencies | No | The paper mentions using RDKit [33] and refers to common machine learning libraries implicitly through discussions of GNNs, but it does not specify any software dependencies with version numbers.
Experiment Setup | Yes | We list all the detailed hyperparameters in this subsection. For all the methods, we use the same optimization strategy, i.e., a learning rate of 5e-4 and a cosine annealing learning rate schedule [43] (see the optimizer sketch below). The other hyperparameters for each pretraining method are listed in Table 8. For all other hyperparameters, we use the defaults, as provided in the code.
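
To give a concrete picture of the pretraining step referenced in the Pseudocode row, below is a minimal, hypothetical PyTorch sketch of the general denoising-distance-matching idea: perturb the 3D coordinates with Gaussian noise, compute pairwise interatomic distances (which are invariant to rotations and translations, i.e., SE(3)-invariant), and train a network to recover the clean distances from the noisy ones. The DistancePredictor module, the noise scale sigma, and the plain regression loss are illustrative assumptions, not the authors' Algorithm 1; consult the released code for the exact objective and encoder.

```python
import torch
import torch.nn as nn

def pairwise_distances(pos: torch.Tensor) -> torch.Tensor:
    """Pairwise Euclidean distances of a (num_atoms, 3) coordinate tensor.
    Distances are unchanged by rotations and translations (SE(3)-invariant)."""
    return torch.cdist(pos, pos)

class DistancePredictor(nn.Module):
    """Toy stand-in for the geometric encoder used in the paper."""
    def __init__(self, num_atoms: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_atoms * num_atoms, hidden),
            nn.SiLU(),
            nn.Linear(hidden, num_atoms * num_atoms),
        )

    def forward(self, noisy_dist: torch.Tensor) -> torch.Tensor:
        n = noisy_dist.shape[0]
        out = self.net(noisy_dist.reshape(1, -1))
        return out.reshape(n, n)

def ddm_pretraining_step(model, pos, sigma=0.1):
    """One denoising step: perturb coordinates, then regress the clean
    distance matrix from the noisy one (a simplification of the objective)."""
    noisy_pos = pos + sigma * torch.randn_like(pos)
    clean_dist = pairwise_distances(pos)
    noisy_dist = pairwise_distances(noisy_pos)
    pred_dist = model(noisy_dist)
    return nn.functional.mse_loss(pred_dist, clean_dist)

# Toy usage: a single "molecule" with 9 atoms at random positions.
model = DistancePredictor(num_atoms=9)
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
pos = torch.randn(9, 3)
loss = ddm_pretraining_step(model, pos)
loss.backward()
optimizer.step()
```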
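
The QM9 split in the Dataset Splits row (110K train / 10K validation / 11K test) can be reproduced in spirit with a random permutation over the roughly 134K molecules. The sketch below uses the QM9 loader from PyTorch Geometric; the random seed and the shuffling procedure are assumptions here, so the authors' released split should be preferred for exact reproduction.

```python
import torch
from torch_geometric.datasets import QM9

# Download/load QM9 (~134K molecules) via PyTorch Geometric.
dataset = QM9(root="data/QM9")

# Randomly split into 110K train / 10K validation / remainder (~11K) test.
# Seed and shuffling are illustrative, not the paper's exact procedure.
generator = torch.Generator().manual_seed(42)
perm = torch.randperm(len(dataset), generator=generator)

train_dataset = dataset[perm[:110_000]]
valid_dataset = dataset[perm[110_000:120_000]]
test_dataset = dataset[perm[120_000:]]

print(len(train_dataset), len(valid_dataset), len(test_dataset))
```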
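
To make the Experiment Setup row concrete, here is a minimal PyTorch sketch wiring the reported learning rate of 5e-4 to a cosine annealing schedule. The optimizer choice (Adam), the epoch count, and the placeholder model are assumptions not stated in this excerpt.

```python
import torch
import torch.nn as nn

# Placeholder model; the actual geometric encoder comes from the released code.
model = nn.Linear(16, 1)

num_epochs = 100  # assumed; the paper's value is in its hyperparameter table
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_epochs)

# Dummy data standing in for a real training loop.
x, y = torch.randn(8, 16), torch.randn(8, 1)
for epoch in range(num_epochs):
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    optimizer.step()
    scheduler.step()  # anneal the learning rate once per epoch
```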