Data-Efficient Graph Grammar Learning for Molecular Generation
Authors: Minghao Guo, Veronika Thost, Beichen Li, Payel Das, Jie Chen, Wojciech Matusik
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our evaluation investigates the following five questions: How do SOTA models for molecule generation perform on realistic small monomer datasets? Is our approach effective in generating specific types of monomers that are synthesizable? How do the models perform on larger monomer datasets? Can our approach learn to weigh and optimize different metrics according to user needs? Can our grammar's explainability support applications, such as functional group extraction? From Section 5.1 (Experiment Setup): Data. We use three small datasets, each representing a specific class of monomers, which we curate manually from the literature: Acrylates, Chain Extenders, and Isocyanates, containing only 32, 11, and 11 samples, respectively (printed in Appendix G). From Section 5.2 (Results on small, class-specific polymer data): Table 1 shows the results on the Isocyanate data; due to lack of space the other two tables are in Appendix C.1. |
| Researcher Affiliation | Collaboration | MIT CSAIL, MIT-IBM Watson AI Lab, IBM Research |
| Pseudocode | No | The paper describes the overall pipeline and grammar construction process using figures and textual descriptions, but it does not include a formal pseudocode block or an algorithm section. |
| Open Source Code | Yes | Code is available at https://github.com/gmh14/data_efficient_grammar. |
| Open Datasets | Yes | Data. We use three small datasets, each representing a specific class of monomers, which we curate manually from the literature: Acrylates, Chain Extenders, and Isocyanates, containing only 32, 11, and 11 samples, respectively (printed in Appendix G). For comparison and for pretraining baselines, we also use a large collection of 81k monomers from St. John et al. (2019) and Jin et al. (2020) (https://github.com/wengong-jin/hgraph2graph). A minimal dataset-loading sketch appears after the table. |
| Dataset Splits | No | The paper specifies the total number of samples in the small datasets and the number of training samples used for the large polymer dataset (117 or 239 samples), but it does not provide explicit train/validation/test splits (e.g., percentages, per-split sample counts, or a splitting methodology) for any of the datasets used in the experiments. The evaluation of generated molecules compares them against the training data distribution rather than against a held-out test set from the original data. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU models, CPU types, memory) used to run the experiments. |
| Software Dependencies | No | The paper mentions using a 'pretrained graph neural network (Hu et al., 2019) as our feature extractor' and states that it uses the 'Adam optimizer', but it does not provide version numbers for any software dependencies such as Python, PyTorch, or other libraries. |
| Experiment Setup | Yes | For the potential function Fθ, we use a two-layer fully connected network with layer sizes 300 and 128. For the optimization objectives, we consider two metrics: diversity and RS. For hyperparameters, we set the MC sampling size to 5. We use the Adam optimizer to train the two-layer network with a learning rate of 0.01 and train for 20 epochs. A hedged PyTorch sketch of this setup follows the table. |
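
To make the dataset entry concrete, the following is a minimal, hypothetical loading sketch. It assumes the curated monomer datasets are stored as plain-text SMILES files with one string per line; the file name `acrylates.txt` and the helper `load_monomers` are illustrative, not part of the released code.

```python
from rdkit import Chem

def load_monomers(path):
    """Load monomer SMILES from a plain-text file, one string per line (assumed format)."""
    with open(path) as f:
        smiles = [line.strip() for line in f if line.strip()]
    # Keep only strings that RDKit can parse into valid molecules.
    mols = [Chem.MolFromSmiles(s) for s in smiles]
    return [m for m in mols if m is not None]

# The paper reports 32 Acrylates, 11 Chain Extenders, and 11 Isocyanates.
acrylates = load_monomers("acrylates.txt")
print(f"Loaded {len(acrylates)} acrylate monomers")
```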
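
Below is a hedged PyTorch sketch of the reported training setup. The two layer sizes (300 and 128), the Adam learning rate of 0.01, the 20 training epochs, and the MC sampling size of 5 come from the paper; the input dimension (300, matching the Hu et al. (2019) feature extractor), the ReLU activations, and the scalar output head are assumptions, not details confirmed by the paper.

```python
import torch
import torch.nn as nn

class PotentialFunction(nn.Module):
    """Two-layer fully connected network for F_theta (layer sizes 300 and 128, per the paper)."""
    def __init__(self, feat_dim=300):  # input dimension is an assumption
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, 300),
            nn.ReLU(),                 # activation choice is an assumption
            nn.Linear(300, 128),
            nn.ReLU(),
            nn.Linear(128, 1),         # scalar potential head is an assumption
        )

    def forward(self, x):
        return self.net(x)

model = PotentialFunction()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)  # as stated in the paper
NUM_EPOCHS = 20       # as stated in the paper
MC_SAMPLE_SIZE = 5    # MC sampling size, as stated in the paper
```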