Learning a Generative Model for Validity in Complex Discrete Structures

Authors: David Janz, Jos van der Westhuizen, Brooks Paige, Matt J. Kusner, José Miguel Hernández-Lobato

ICLR 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We demonstrate its effectiveness as a generative model of Python 3 source code for mathematical expressions, and in improving the ability of a variational autoencoder trained on SMILES strings to decode valid molecular structures."
Researcher Affiliation | Academia | David Janz (University of Cambridge, dj343@cam.ac.uk); Jos van der Westhuizen (University of Cambridge, jv365@cam.ac.uk); Brooks Paige (Alan Turing Institute and University of Cambridge, bpaige@turing.ac.uk); Matt J. Kusner (Alan Turing Institute and University of Warwick, mkusner@turing.ac.uk); José Miguel Hernández-Lobato (Alan Turing Institute and University of Cambridge, jmh233@cam.ac.uk)
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide an unambiguous statement or a direct link indicating that the source code for the described methodology is publicly available.
Open Datasets | Yes | "The main data source considered is the ZINC data set (Irwin & Shoichet, 2005), as used in Kusner et al. (2017). We also use the USPTO 15k reaction products data (Lowe, 2014) and a set of molecule solubility information (Huuskonen, 2000) as withheld test data."
Dataset Splits | No | The paper mentions training on "ZINC (train) data" and evaluating on a "withheld test partition" and "unseen molecule data sets", but it does not specify the exact percentages, sample counts, or splitting methodology needed to reproduce the data partitioning.
Hardware Specification | No | The paper does not report the hardware (e.g., exact GPU/CPU models, memory amounts) used to run its experiments.
Software Dependencies | No | The paper mentions Python 3 and rdkit but does not provide version numbers for these or any other key software components or libraries, making full replication difficult.
Experiment Setup | Yes | "In our experiments we use K = 16. We choose γ = 0.05, which results in synthetic data that is approximately 50% valid." The CVAE model is trained for 100 epochs, as in previous work; further training improves reconstruction accuracy. Models are trained until convergence (800,000 training points, beyond the scope of Figure 2).
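The Experiment Setup row is the only place where concrete hyperparameters are reported. A minimal sketch of how those reported values could be collected into a single configuration object for a reimplementation attempt; the field names here are illustrative assumptions, not from the paper, and only the values (K = 16, γ = 0.05, 100 epochs, 800,000 training points) are taken from the quoted text:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentConfig:
    """Hypothetical container for the hyperparameters quoted above."""
    K: int = 16                        # sample count K used in the experiments
    gamma: float = 0.05                # noise level; yields ~50% valid synthetic data
    cvae_epochs: int = 100             # CVAE training epochs, following prior work
    convergence_points: int = 800_000  # training points used to train to convergence


cfg = ExperimentConfig()
print(cfg)
```

Pinning the reported values in a frozen dataclass like this makes a replication script self-documenting about which settings came from the paper and which were guessed.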