Chemical-Reaction-Aware Molecule Representation Learning

Authors: Hongwei Wang, Weijiang Li, Xiaomeng Jin, Kyunghyun Cho, Heng Ji, Jiawei Han, Martin D. Burke

ICLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | 3 EXPERIMENTS: Experimental results demonstrate that our method achieves state-of-the-art performance in a variety of downstream tasks, e.g., reaction product prediction, molecule property prediction, reaction classification, and graph-edit-distance prediction.
Researcher Affiliation | Collaboration | Hongwei Wang¹, Weijiang Li¹, Xiaomeng Jin¹, Kyunghyun Cho²,³, Heng Ji¹, Jiawei Han¹, Martin D. Burke¹; ¹University of Illinois Urbana-Champaign, ²New York University, ³Genentech. {hongweiw, wl13, xjin17, hengji, hanj, mdburke}@illinois.edu, kyunghyun.cho@nyu.edu
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks clearly labeled or formatted as such.
Open Source Code | Yes | The code is available at https://github.com/hwwang55/MolR.
Open Datasets | Yes | We use reactions from USPTO granted patents collected by Lowe (2012) as the dataset, which is further cleaned by Zheng et al. (2019a). We evaluate MolR on five datasets: BBBP, HIV, BACE, Tox21, and ClinTox, proposed by Wu et al. (2018). We randomly sample 10,000 molecule pairs from the first 1,000 molecules in the QM9 dataset (Wu et al., 2018).
Dataset Splits | Yes | The dataset contains 478,612 chemical reactions and is split into training, validation, and test sets of 408,673, 29,973, and 39,966 reactions, respectively, so we refer to this dataset as USPTO-479k. All datasets are split into training, validation, and test sets by 8:1:1. (A split sketch follows the table.)
Hardware Specification | Yes | The average time cost per epoch and the maximal memory cost of MolR-GCN when varying the batch size (run on an NVIDIA V100 GPU).
Software Dependencies | No | The implementation of GNNs is based on Deep Graph Library (DGL). We use pysmiles to parse the SMILES strings of molecules into NetworkX graphs. We use the Adam optimizer and a Logistic Regression model implemented in scikit-learn. However, specific version numbers for these software components are not provided. (A parsing sketch follows the table.)
Experiment Setup | Yes | The number of layers for all GNNs is 2, the output dimension of all layers is 1,024, and the READOUT function is sum. The margin γ is set to 4. We train the model for 20 epochs with a batch size of 4,096, using the Adam (Kingma & Ba, 2015) optimizer with a learning rate of 10^-4. (A training-setup sketch follows the table.)
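
The 8:1:1 split reported for the property prediction datasets is easy to reproduce in spirit. Below is a minimal sketch; the function name `split_811` and the seeded random shuffle are assumptions, and the authors' released code fixes the actual split indices.

```python
# Minimal sketch of an 8:1:1 random split (assumed scheme, not the authors' code).
import random

def split_811(items, seed=42):
    """Shuffle and split a collection into 80% train, 10% validation, 10% test."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = int(0.8 * len(items))
    n_val = int(0.1 * len(items))
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test

# Note: the reported USPTO-479k counts (408,673 / 29,973 / 39,966 summing to
# 478,612) are a fixed split, not 8:1:1; the 8:1:1 ratio applies to the five
# molecule property datasets.
train, val, test = split_811(range(478_612))
print(len(train), len(val), len(test))  # 382889 47861 47862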
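
The dependency chain flagged above (pysmiles for parsing, NetworkX as the intermediate graph, DGL for the GNN) can be sketched as follows. This is a hypothetical pipeline, not the authors' code: the `smiles_to_dgl` helper and the one-hot `ATOM_TYPES` vocabulary are assumptions, and exact behavior may vary across the unpinned library versions the assessment notes.

```python
# Hypothetical pipeline matching the named dependencies: pysmiles -> NetworkX -> DGL.
import dgl
import torch
from pysmiles import read_smiles

ATOM_TYPES = ['C', 'N', 'O', 'F', 'S', 'Cl', 'Br', 'P', 'I', 'B']  # assumed vocabulary

def smiles_to_dgl(smiles):
    """Parse a SMILES string into a DGL graph with one-hot atom features."""
    nx_graph = read_smiles(smiles)  # NetworkX graph; each node carries an 'element' label
    eye = torch.eye(len(ATOM_TYPES))
    for node, data in nx_graph.nodes(data=True):
        idx = ATOM_TYPES.index(data['element']) if data['element'] in ATOM_TYPES else 0
        nx_graph.nodes[node]['feat'] = eye[idx]
    # DGL converts each undirected bond into two directed edges.
    return dgl.from_networkx(nx_graph, node_attrs=['feat'])

g = smiles_to_dgl('CCO')             # ethanol: 3 heavy atoms, 2 bonds
print(g.num_nodes(), g.num_edges())  # 3 4
```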
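
Finally, a hedged sketch of the reported training setup: a 2-layer GNN (a GCN variant is shown) with 1,024-dimensional layers, sum readout, margin γ = 4, and Adam at learning rate 1e-4. The `GCNEncoder` and `margin_loss` names, the triplet-style loss form, and the negative-product sampling are assumptions consistent with the paper's reaction-as-translation framing, not its exact formulation.

```python
# Sketch of the reported configuration under the assumptions stated above.
import dgl
import torch
import torch.nn.functional as F
from dgl.nn import GraphConv

class GCNEncoder(torch.nn.Module):
    def __init__(self, in_dim, hidden_dim=1024, n_layers=2):
        super().__init__()
        dims = [in_dim] + [hidden_dim] * n_layers
        self.layers = torch.nn.ModuleList(
            GraphConv(dims[i], dims[i + 1], allow_zero_in_degree=True)
            for i in range(n_layers))

    def forward(self, g, feat):
        h = feat
        for layer in self.layers:
            h = F.relu(layer(g, h))
        g.ndata['h'] = h
        return dgl.sum_nodes(g, 'h')  # READOUT = sum, as reported

def margin_loss(h_reactants, h_products, h_neg_products, gamma=4.0):
    """Assumed triplet form: pull matched reactant/product embeddings together,
    push mismatched pairs at least gamma apart."""
    pos = (h_reactants - h_products).pow(2).sum(dim=1)
    neg = (h_reactants - h_neg_products).pow(2).sum(dim=1)
    return F.relu(pos - neg + gamma).mean()

# Reported optimizer settings: Adam, lr = 1e-4, batch size 4,096, 20 epochs.
# encoder = GCNEncoder(in_dim=10)
# optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-4)
```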