Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Molecular Geometry Pretraining with SE(3)-Invariant Denoising Distance Matching
Authors: Shengchao Liu, Hongyu Guo, Jian Tang
ICLR 2023 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our comprehensive experiments confirm the effectiveness and robustness of our proposed method. Using 22 downstream geometric molecular prediction tasks, we empirically verify that our method outperforms nine pretraining baselines. |
| Researcher Affiliation | Collaboration | Shengchao Liu (Mila – Québec AI Institute; Université de Montréal), Hongyu Guo (National Research Council Canada; University of Ottawa), Jian Tang (Mila – Québec AI Institute; HEC Montréal; CIFAR AI Chair) |
| Pseudocode | Yes | Algorithm 1: GeoSSL-DDM pretraining |
| Open Source Code | Yes | To ensure the reproducibility of the empirical results, we provide the implementation details (hyperparameters, dataset statistics, etc.) in Section 5 and Appendix D, and publicly share our source code through this GitHub link. |
| Open Datasets | Yes | The PubChemQC database is a large-scale database with around 4M molecules with 3D geometries... Following this, Molecule3D [73] takes the ground-state geometries from PubChemQC and transforms the data formats into a deep learning-friendly way. For our molecular geometry pretraining, we take a subset of 1M molecules with 3D geometries from Molecule3D. QM9 [46] is a dataset of 134K molecules with up to 9 heavy atoms. MD17 [10] is a dataset on molecular dynamics simulation. Atom3D [66] is a recently published dataset. |
| Dataset Splits | Yes | QM9 [46] is a dataset of 134K molecules with up to 9 heavy atoms. We take 110K for training, 10K for validation, and 11K for test. We follow the literature [31, 41, 50, 51] in using 1K for training and 1K for validation, while the test set (from 48K to 991K) is much larger. For LBA, we use split-by-sequence-identity-30: we split protein-ligand complexes such that no protein in the test dataset has more than 30% sequence identity with any protein in the training dataset. For LEP, we split the complex pairs by protein target. |
| Hardware Specification | Yes | We have around 20 V100 GPU cards for computation at an internal cluster. Each job can be finished within 3-24 hours (each job takes one single GPU card). |
| Software Dependencies | No | The paper mentions using RDKit [33] and refers to common machine learning libraries implicitly through discussions of GNNs, but it does not specify any software dependencies with version numbers. |
| Experiment Setup | Yes | We list all the detailed hyperparameters in this subsection. For all the methods, we use the same optimization strategy, i.e., a learning rate of 5e-4 with a cosine annealing learning rate schedule [43]. The other hyperparameters for each pretraining method are listed in Table 8. For the remaining hyperparameters, we use the defaults provided in the released code. |
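The reported dataset splits (110K/10K/11K out of QM9's 134K molecules) can be reproduced with a simple seeded random index split. The paper's own split code is in its GitHub repository; the sketch below is only an illustration of the reported sizes, with the function name and seed chosen here, not taken from the source.

```python
import random

def split_indices(n_total=134_000, n_train=110_000, n_valid=10_000,
                  n_test=11_000, seed=0):
    """Shuffle molecule indices once, then carve out disjoint
    train/valid/test subsets of the reported sizes.
    (Illustrative only; the paper's actual split script may differ.)"""
    assert n_train + n_valid + n_test <= n_total
    idx = list(range(n_total))
    random.Random(seed).shuffle(idx)  # seeded for reproducibility
    train = idx[:n_train]
    valid = idx[n_train:n_train + n_valid]
    test = idx[n_train + n_valid:n_train + n_valid + n_test]
    return train, valid, test
```

Because the shuffle is seeded, rerunning the function yields identical splits, which is the property a reproducibility check cares about.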
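The optimization strategy quoted above (learning rate 5e-4 with cosine annealing [43]) follows a closed-form schedule. A minimal sketch of that schedule, assuming annealing to zero over the full run (the paper does not state a minimum learning rate, so `lr_min=0.0` is an assumption here):

```python
import math

def cosine_annealing_lr(step, total_steps, lr_max=5e-4, lr_min=0.0):
    """Cosine-annealed learning rate (Loshchilov & Hutter style):
    starts at lr_max, decays to lr_min at total_steps.
    lr_min=0.0 is assumed, not stated in the paper."""
    return lr_min + 0.5 * (lr_max - lr_min) * (
        1 + math.cos(math.pi * step / total_steps))
```

At step 0 this returns the reported 5e-4; halfway through training it has decayed to 2.5e-4. In practice the equivalent schedule is usually obtained from `torch.optim.lr_scheduler.CosineAnnealingLR`.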