Data-Efficient Graph Grammar Learning for Molecular Generation

Authors: Minghao Guo, Veronika Thost, Beichen Li, Payel Das, Jie Chen, Wojciech Matusik

ICLR 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | From the paper: "Our evaluation investigates the following five questions: How do SOTA models for molecule generation perform on realistic small monomer datasets? Is our approach effective in generating specific types of monomers that are synthesizable? How do the models perform on larger monomer datasets? Can our approach learn to weigh and optimize different metrics according to user needs? Can our grammar's explainability support applications, such as functional group extraction?" From Section 5.1 (Experiment Setup): "Data. We use three small datasets, each representing a specific class of monomers, which we curate manually from the literature: Acrylates, Chain Extenders, and Isocyanates, containing only 32, 11, and 11 samples, respectively (printed in Appendix G)." From Section 5.2 (Results on Small, Class-Specific Polymer Data): "Results. Table 1 shows the results on the Isocyanate data; due to lack of space, the other two tables are in Appendix C.1."
Researcher Affiliation | Collaboration | 1 MIT CSAIL, 2 MIT-IBM Watson AI Lab, 3 IBM Research
Pseudocode | No | The paper describes the overall pipeline and grammar construction process using figures and textual descriptions, but it does not include a formal pseudocode block or an algorithm section.
Open Source Code | Yes | Code is available at https://github.com/gmh14/data_efficient_grammar.
Open Datasets | Yes | "Data. We use three small datasets, each representing a specific class of monomers, which we curate manually from the literature: Acrylates, Chain Extenders, and Isocyanates, containing only 32, 11, and 11 samples, respectively (printed in Appendix G). For comparison and for pretraining baselines, we also use a large collection of 81k monomers from St. John et al. (2019) and Jin et al. (2020)" (https://github.com/wengong-jin/hgraph2graph).
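As a minimal sketch, the curated SMILES lists could be loaded with RDKit as below. The file path and the one-SMILES-per-line layout are assumptions for illustration, not details taken from the paper or the repository.

```python
# Hypothetical loader for one of the curated monomer datasets.
# Assumes a plain-text file with one SMILES string per line; the
# path below is illustrative, not taken from the paper or repo.
from rdkit import Chem

with open("datasets/isocyanates.txt") as f:
    smiles = [line.strip() for line in f if line.strip()]

mols = [Chem.MolFromSmiles(s) for s in smiles]
assert all(m is not None for m in mols), "dataset contains invalid SMILES"
print(f"Loaded {len(mols)} monomers")
```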
Dataset Splits | No | The paper specifies the total number of samples in the small datasets and the number of training samples used for the large polymer dataset (117 or 239 samples), but it does not provide explicit train/validation/test splits (e.g., percentages, per-split sample counts, or a splitting methodology) for all datasets used in the experiments. Generated molecules are evaluated against the training data distribution rather than against a held-out test set.
Hardware Specification | No | The paper does not specify the hardware (e.g., GPU models, CPU types, memory) used to run the experiments.
Software Dependencies | No | The paper mentions using a "pretrained graph neural network (Hu et al., 2019) as our feature extractor" and the Adam optimizer, but it does not provide version numbers for any software dependencies such as Python, PyTorch, or other libraries.
Experiment Setup | Yes | For the potential function Fθ, we use a two-layer fully connected network with hidden sizes 300 and 128. For the optimization objectives, we consider two metrics: diversity and RS. For hyperparameters, we set the MC sampling size to 5. We use the Adam optimizer to train the two-layer network with learning rate 0.01 and train for 20 epochs.
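A minimal PyTorch sketch of this setup follows, assuming the stated sizes are hidden-layer widths with a scalar output head and a REINFORCE-style update over the 5 MC samples; the feature dimension, random reward placeholder, and loss form are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

feat_dim = 300  # assumption: dimensionality of the pretrained GNN features

# Two-layer MLP with hidden sizes 300 and 128, as reported; the final
# 1-d head (one scalar potential per candidate) is an assumption.
f_theta = nn.Sequential(
    nn.Linear(feat_dim, 300), nn.ReLU(),
    nn.Linear(300, 128), nn.ReLU(),
    nn.Linear(128, 1),
)

optimizer = torch.optim.Adam(f_theta.parameters(), lr=0.01)

# Dummy training loop standing in for the update over 5 MC-sampled
# grammars; `reward` here is random noise, not the paper's actual
# diversity/RS metrics.
for epoch in range(20):  # the paper reports training for 20 epochs
    feats = torch.randn(5, feat_dim)      # 5 MC samples
    scores = f_theta(feats).squeeze(-1)   # scalar potential per sample
    reward = torch.randn(5)               # placeholder metric values
    loss = -(reward.detach() * scores).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```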