Representing Molecules as Random Walks Over Interpretable Grammars
Authors: Michael Sun, Minghao Guo, Weize Yuan, Veronika Thost, Crystal Elaine Owens, Aristotle Franklin Grosz, Sharvaa Selvan, Katelyn Zhou, Hassan Mohiuddin, Benjamin J Pedretti, Zachary P Smith, Jie Chen, Wojciech Matusik
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate clear advantages over existing methods in terms of performance, efficiency, and synthesizability of predicted molecules, and we provide detailed insights into the method's chemical interpretability. Code is available at https://github.com/shiningsunnyday/polymer_walk. ... Our experiments quantitatively answer the following questions: 1) How well does our method perform on property prediction for our setting of interest? 2) How well does our representation work for the generation of novel molecules, compared with both SOTA symbolic and deep molecular generative models? |
| Researcher Affiliation | Collaboration | (1) MIT CSAIL; (2) MIT Chemistry; (3) MIT-IBM Watson AI Lab, IBM Research; (4) MIT Chemical Engineering; (5) MIT; (6) Wellesley. |
| Pseudocode | Yes | Algorithm 1: function extract_walk(D, B) ... Algorithm 2: function traverse_dag(Gi, G) ... Algorithm 3: function build_motif_graph(V) ... Algorithm 4: function re_order(childs) ... Algorithm 5: function dfs_walk(cur, traj) ... Algorithm 6: function algo-diffusion ... Algorithm 7: function generate |
| Open Source Code | Yes | Code is available at https://github.com/shiningsunnyday/polymer_walk. |
| Open Datasets | Yes | Group Contribution (GC) (Wang et al., 2018; Park & Paul, 1997; Wu et al., 2021). ... The Harvard organic photovoltaic dataset (HOPV) (Lopez et al., 2016). ... Predictive Toxicology Challenge (PTC) (Helma et al., 2001). |
| Dataset Splits | No | No explicit statement of a separate validation split. The paper states: 'For each (dataset, property) pair, we perform an 80-20 train-test split over 3 random seeds and report the mean and standard deviation.' |
| Hardware Specification | No | No specific hardware details like exact GPU/CPU models, processor types, or memory amounts are provided. The only mention related to hardware is 'For example, for the datasets we study, it is done under a minute when parallelized across 100 CPU cores', which does not specify the type of CPUs. |
| Software Dependencies | No | The paper mentions 'RDKit package (Landrum, 2016)', 'XGBoost', and 'GIN (Xu et al., 2019)' but does not provide specific version numbers for any of these software dependencies. |
| Experiment Setup | Yes | Table 8. Hyperparameter settings for property prediction. Number of layers: 5; Activation: ReLU; Hidden dimension: 16; Motif featurization: Morgan fingerprint; Motif feature dimension: 2048; Input feature dimension: 5 × 2048 + 2048 + \|G\|; Batch size: 1; Learning rate: 1e-3. |
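
The evaluation protocol quoted under Dataset Splits (an 80-20 train-test split over 3 random seeds, reporting mean and standard deviation) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the seed values `(0, 1, 2)`, the mean-absolute-error metric, and the mean-predictor baseline are all assumptions made for the example.

```python
import numpy as np


def split_indices(n_samples, seed, test_fraction=0.2):
    """Return (train_idx, test_idx) for one seeded random 80-20 split."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_samples)
    n_test = int(round(test_fraction * n_samples))
    return perm[n_test:], perm[:n_test]


def evaluate_over_seeds(X, y, fit_predict, seeds=(0, 1, 2)):
    """Run fit_predict on each seeded split; return mean and std of the error.

    fit_predict(X_train, y_train, X_test) is a hypothetical callable that
    trains a model and returns predictions for X_test.
    """
    errors = []
    for seed in seeds:
        train_idx, test_idx = split_indices(len(y), seed)
        preds = fit_predict(X[train_idx], y[train_idx], X[test_idx])
        # Mean absolute error is an assumed metric for this sketch.
        errors.append(float(np.mean(np.abs(preds - y[test_idx]))))
    return float(np.mean(errors)), float(np.std(errors))


def mean_baseline(X_train, y_train, X_test):
    """Placeholder model: predict the training-set mean for every test point."""
    return np.full(len(X_test), y_train.mean())


if __name__ == "__main__":
    # Synthetic data, used only to exercise the protocol.
    X = np.arange(100, dtype=float).reshape(-1, 1)
    y = 2.0 * X[:, 0] + 1.0
    mean_err, std_err = evaluate_over_seeds(X, y, mean_baseline)
    print(f"MAE: {mean_err:.3f} +/- {std_err:.3f}")
```

Because no validation split is described, any hyperparameter tuning against the test portion of such a split would inflate the reported numbers; a reproduction may want to carve a validation subset out of the 80% training portion.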