Molecular Geometry Pretraining with SE(3)-Invariant Denoising Distance Matching
Authors: Shengchao Liu, Hongyu Guo, Jian Tang
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our comprehensive experiments confirm the effectiveness and robustness of our proposed method. Using 22 downstream geometric molecular prediction tasks, we empirically verify that our method outperforms nine pretraining baselines. |
| Researcher Affiliation | Collaboration | Shengchao Liu (Mila Québec AI Institute, Université de Montréal, liusheng@mila.quebec); Hongyu Guo (National Research Council Canada, University of Ottawa, hongyu.guo@uottawa.ca); Jian Tang (Mila Québec AI Institute, HEC Montréal, CIFAR AI Chair, jian.tang@hec.ca) |
| Pseudocode | Yes | Algorithm 1: GeoSSL-DDM pretraining (see the pretraining sketch after the table) |
| Open Source Code | Yes | To ensure the reproducibility of the empirical results, we provide the implementation details (hyperparameters, dataset statistics, etc.) in Section 5 and Appendix D, and publicly share our source code through this GitHub link. |
| Open Datasets | Yes | The PubChemQC database is a large-scale database of around 4M molecules with 3D geometries... Following this, Molecule3D [73] takes the ground-state geometries from PubChemQC and transforms the data into a deep-learning-friendly format. For our molecular geometry pretraining, we take a subset of 1M molecules with 3D geometries from Molecule3D. QM9 [46] is a dataset of 134K molecules with up to 9 heavy atoms. MD17 [10] is a dataset of molecular dynamics simulations. Atom3D [66] is a recently published dataset. |
| Dataset Splits | Yes | QM9 [46] is a dataset of 134K molecules with up to 9 heavy atoms. We take 110K for training, 10K for validation, and 11K for testing. For MD17, we follow the literature [31, 41, 50, 51] in using 1K for training and 1K for validation, while the test sets (from 48K to 991K) are much larger. For LBA, we use the split-by-sequence-identity-30 protocol: protein-ligand complexes are split such that no protein in the test set shares more than 30% sequence identity with any protein in the training set. For LEP, we split the complex pairs by protein target. (See the index-split sketch after the table.) |
| Hardware Specification | Yes | We use around 20 V100 GPU cards on an internal cluster. Each job takes a single GPU card and finishes within 3-24 hours. |
| Software Dependencies | No | The paper mentions using RDKit [33] and implicitly refers to common machine-learning libraries through its discussion of GNNs, but it does not specify any software dependencies with version numbers. |
| Experiment Setup | Yes | We list all the detailed hyperparameters in this subsection. For all methods, we use the same optimization strategy, i.e., a learning rate of 5e-4 with a cosine annealing schedule [43]. The hyperparameters specific to each pretraining method are listed in Table 8; for the rest, we use the defaults provided in the released code. (See the optimizer/scheduler sketch after the table.) |
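
To make the pseudocode row above concrete, here is a minimal sketch of one SE(3)-invariant denoising-distance-matching pretraining step. This is not the authors' implementation: the `ScoreModel` MLP, the Gaussian noise scale `sigma`, and the plain MSE objective are simplifying stand-ins for the paper's denoising score matching over a geometric GNN backbone. Only the core idea follows Algorithm 1: perturb coordinates with noise and match pairwise distances, which are invariant under rotations and translations (SE(3)).

```python
# Hedged sketch of an SE(3)-invariant denoising-distance-matching step.
# ScoreModel, pretrain_step, and sigma are illustrative names, not the
# paper's code; the objective is a simplified MSE stand-in.
import torch
import torch.nn as nn


class ScoreModel(nn.Module):
    """Toy per-distance MLP standing in for a geometric GNN backbone."""

    def __init__(self, hidden: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(1, hidden), nn.SiLU(), nn.Linear(hidden, 1)
        )

    def forward(self, dists: torch.Tensor) -> torch.Tensor:
        return self.mlp(dists.unsqueeze(-1)).squeeze(-1)


def pairwise_distances(pos: torch.Tensor) -> torch.Tensor:
    # pos: (num_atoms, 3) -> (num_atoms, num_atoms) distance matrix.
    # Pairwise distances are SE(3)-invariant by construction.
    return torch.cdist(pos, pos)


def pretrain_step(model: nn.Module, pos: torch.Tensor, sigma: float = 0.1):
    """Perturb coordinates, then recover the clean distance matrix."""
    noisy_pos = pos + sigma * torch.randn_like(pos)
    d_clean = pairwise_distances(pos)
    d_noisy = pairwise_distances(noisy_pos)
    pred = model(d_noisy)  # predict clean distances from noisy ones
    return ((pred - d_clean) ** 2).mean()


if __name__ == "__main__":
    torch.manual_seed(0)
    model = ScoreModel()
    optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
    pos = torch.randn(9, 3)  # a toy 9-atom conformation
    loss = pretrain_step(model, pos)
    loss.backward()
    optimizer.step()
    print(f"pretraining loss: {loss.item():.4f}")
```
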
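The dataset-splits row quotes a 110K/10K/11K partition of QM9. The sketch below shows one way such an index-based random split could be produced; the total count, the seed, and the uniform shuffle are assumptions, since the quoted text does not state how the indices are drawn.

```python
# Hedged sketch of a 110K/10K/11K-style random index split.
# n_total and seed are illustrative, not taken from the paper.
import numpy as np


def split_indices(n_total: int, n_train: int, n_valid: int, seed: int = 42):
    """Shuffle indices and carve out train/valid/test partitions."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_total)
    train = perm[:n_train]
    valid = perm[n_train:n_train + n_valid]
    test = perm[n_train + n_valid:]
    return train, valid, test


train_idx, valid_idx, test_idx = split_indices(131_000, 110_000, 10_000)
print(len(train_idx), len(valid_idx), len(test_idx))  # 110000 10000 11000
```
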
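The experiment-setup row quotes a learning rate of 5e-4 with a cosine annealing schedule [43]. The following sketch wires those two quoted settings together in PyTorch; the choice of Adam, the `T_max` horizon, and the stand-in model and loss are assumptions not stated in the quote.

```python
# Sketch of the quoted optimization strategy: lr 5e-4 plus cosine
# annealing. Optimizer choice, T_max, model, and loss are assumed.
import torch

model = torch.nn.Linear(16, 1)  # stand-in for the pretraining backbone
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

for epoch in range(100):
    optimizer.zero_grad()
    loss = model(torch.randn(8, 16)).pow(2).mean()  # dummy objective
    loss.backward()
    optimizer.step()
    scheduler.step()  # cosine-decay the learning rate each epoch
```
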