Representing Molecules as Random Walks Over Interpretable Grammars

Authors: Michael Sun, Minghao Guo, Weize Yuan, Veronika Thost, Crystal Elaine Owens, Aristotle Franklin Grosz, Sharvaa Selvan, Katelyn Zhou, Hassan Mohiuddin, Benjamin J Pedretti, Zachary P Smith, Jie Chen, Wojciech Matusik

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate clear advantages over existing methods in terms of performance, efficiency, and synthesizability of predicted molecules, and we provide detailed insights into the method's chemical interpretability. Code is available at https://github.com/shiningsunnyday/polymer_walk. ... Our experiments quantitatively answer the following questions: 1) How well does our method perform on property prediction for our setting of interest? 2) How well does our representation work for the generation of novel molecules, compared with both SOTA symbolic and deep molecular generative models?
Researcher Affiliation | Collaboration | (1) MIT CSAIL; (2) MIT Chemistry; (3) MIT-IBM Watson AI Lab, IBM Research; (4) MIT Chemical Engineering; (5) MIT; (6) Wellesley.
Pseudocode | Yes | Algorithm 1: function extract walk(D,B) ... Algorithm 2: function traverse dag(Gi, G) ... Algorithm 3: function build motif graph(V) ... Algorithm 4: function re order(childs) ... Algorithm 5: function dfs walk(cur, traj) ... Algorithm 6: function algo-diffusion ... Algorithm 7: function generate
Open Source Code | Yes | Code is available at https://github.com/shiningsunnyday/polymer_walk.
Open Datasets | Yes | Group Contribution (GC) (Wang et al., 2018; Park & Paul, 1997; Wu et al., 2021). ... The Harvard organic photovoltaic dataset (HOPV) (Lopez et al., 2016). ... Predictive Toxicology Challenge (PTC) (Helma et al., 2001).
Dataset Splits | No | The paper does not explicitly describe a separate validation split. It states: 'For each (dataset, property) pair, we perform an 80-20 train-test split over 3 random seeds and report the mean and standard deviation.'
Hardware Specification | No | The paper gives no specific hardware details such as GPU/CPU models, processor types, or memory amounts. The only hardware-related statement is 'For example, for the datasets we study, it is done under a minute when parallelized across 100 CPU cores', which does not specify the type of CPU.
Software Dependencies | No | The paper mentions the 'RDKit package (Landrum, 2016)', 'XGBoost', and 'GIN (Xu et al., 2019)' but gives no version numbers for any of these dependencies.
Experiment Setup | Yes | Table 8. Hyperparameter settings for property prediction. Number of layers: 5; Activation: ReLU; Hidden dimension: 16; Motif featurization: Morgan fingerprint; Motif feature dimension: 2048; Input feature dimension: 5 2048 + 2048 + |G|; Batch size: 1; Learning rate: 1e-3.
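
The evaluation protocol quoted in the Dataset Splits row (an 80-20 train-test split over 3 random seeds, reporting mean and standard deviation) could be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function names `split_80_20`, `evaluate_over_seeds`, and `mean_baseline_mae`, the seed values 0, 1, 2, the toy data, the mean-predictor baseline, and the choice of sample (rather than population) standard deviation are all assumptions made for the example.

```python
import random
import statistics


def split_80_20(items, seed):
    """Shuffle a copy of `items` with the given seed and split it 80/20."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = len(shuffled) * 4 // 5  # integer arithmetic avoids float rounding
    return shuffled[:cut], shuffled[cut:]


def evaluate_over_seeds(items, fit_and_score, seeds=(0, 1, 2)):
    """Call `fit_and_score(train, test)` once per seed.

    Returns (mean, sample standard deviation, list of per-seed scores).
    """
    scores = [fit_and_score(*split_80_20(items, seed)) for seed in seeds]
    return statistics.mean(scores), statistics.stdev(scores), scores


def mean_baseline_mae(train, test):
    """Placeholder model (assumption): predict the training mean, score by MAE."""
    prediction = statistics.mean(y for _, y in train)
    return statistics.mean(abs(y - prediction) for _, y in test)


# Toy (input, target) pairs standing in for one (dataset, property) pair.
data = [(i, float(i % 7)) for i in range(50)]
mean_mae, std_mae, per_seed = evaluate_over_seeds(data, mean_baseline_mae)
print(f"MAE: {mean_mae:.3f} +/- {std_mae:.3f} over {len(per_seed)} seeds")
```

Any real model would replace `mean_baseline_mae`. Because the summary reports no separate validation split, hyperparameters tuned inside `fit_and_score` would have to be chosen using the training portion only.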