Chemical-Reaction-Aware Molecule Representation Learning
Authors: Hongwei Wang, Weijiang Li, Xiaomeng Jin, Kyunghyun Cho, Heng Ji, Jiawei Han, Martin D. Burke
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Section 3 (Experiments): Experimental results demonstrate that our method achieves state-of-the-art performance in a variety of downstream tasks, e.g., reaction product prediction, molecule property prediction, reaction classification, and graph-edit-distance prediction. |
| Researcher Affiliation | Collaboration | Hongwei Wang¹, Weijiang Li¹, Xiaomeng Jin¹, Kyunghyun Cho²,³, Heng Ji¹, Jiawei Han¹, Martin D. Burke¹ (¹University of Illinois Urbana-Champaign, ²New York University, ³Genentech) {hongweiw, wl13, xjin17, hengji, hanj, mdburke}@illinois.edu, kyunghyun.cho@nyu.edu |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks clearly labeled or formatted as such. |
| Open Source Code | Yes | The code is available at https://github.com/hwwang55/MolR. |
| Open Datasets | Yes | We use reactions from USPTO granted patents collected by Lowe (2012) as the dataset, which is further cleaned by Zheng et al. (2019a). We evaluate MolR on five datasets: BBBP, HIV, BACE, Tox21, and ClinTox, proposed by Wu et al. (2018). We randomly sample 10,000 molecule pairs from the first 1,000 molecules in the QM9 dataset (Wu et al., 2018). |
| Dataset Splits | Yes | The dataset contains 478,612 chemical reactions, and is split into training, validation, and test sets of 408,673, 29,973, and 39,966 reactions, respectively, so we refer to this dataset as USPTO-479k. All datasets are split into training, validation, and test sets in an 8:1:1 ratio. |
| Hardware Specification | Yes | The average time cost per epoch and the maximal memory cost of MolR-GCN when varying the batch size (run on an NVIDIA V100 GPU). |
| Software Dependencies | No | The implementation of GNNs is based on the Deep Graph Library (DGL). We use pysmiles to parse the SMILES strings of molecules into NetworkX graphs. We use the Adam optimizer and a Logistic Regression model implemented in scikit-learn. However, specific version numbers for these software components are not provided. |
| Experiment Setup | Yes | The number of layers for all GNNs is 2, the output dimension of all layers is 1,024, and the READOUT function is sum. The margin γ is set to 4. We train the model for 20 epochs with a batch size of 4,096, using Adam (Kingma & Ba, 2015) optimizer with a learning rate of 10^-4. |
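The Experiment Setup row mentions a margin γ = 4 used during training. As a minimal illustration of what such a margin term computes, here is a hedged, dependency-free sketch of a TransE-style margin ranking loss, where the summed reactant embedding is pulled toward the true product embedding and pushed away from a mismatched one. This is not the authors' code: the function names, the exact loss form, and the use of Euclidean distance are assumptions for illustration; only the margin value (4) comes from the paper.

```python
import math

MARGIN = 4.0  # gamma, as reported in the Experiment Setup row


def l2_distance(a, b):
    """Euclidean distance between two embedding vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def margin_loss(reactant_emb, product_emb, neg_product_emb, margin=MARGIN):
    """Assumed margin ranking loss: max(0, margin + d_pos - d_neg).

    d_pos is the distance from the (summed) reactant embedding to the true
    product embedding; d_neg is the distance to a negative (mismatched)
    product. The loss is zero once the negative is farther than the
    positive by at least `margin`.
    """
    d_pos = l2_distance(reactant_emb, product_emb)
    d_neg = l2_distance(reactant_emb, neg_product_emb)
    return max(0.0, margin + d_pos - d_neg)


# Toy usage with 3-d embeddings: a nearby true product and a distant negative.
loss = margin_loss([1.0, 0.0, 0.0], [1.1, 0.0, 0.0], [5.0, 5.0, 5.0])
```

In the actual pipeline, the embeddings would come from the 2-layer GNN encoder with 1,024-dimensional output and sum readout described above; this sketch only shows the shape of the margin objective.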