MADE: Masked Autoencoder for Distribution Estimation

Authors: Mathieu Germain, Karol Gregor, Iain Murray, Hugo Larochelle

ICML 2015 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators.
Researcher Affiliation | Collaboration | Mathieu Germain (MATHIEU.GERMAIN2@USHERBROOKE.CA), Université de Sherbrooke, Canada; Karol Gregor (KAROL.GREGOR@GMAIL.COM), Google DeepMind; Iain Murray (I.MURRAY@ED.AC.UK), University of Edinburgh, United Kingdom; Hugo Larochelle (HUGO.LAROCHELLE@USHERBROOKE.CA), Université de Sherbrooke, Canada
Pseudocode | Yes | Algorithm 1: Computation of p(x) and learning gradients for MADE with order and connectivity sampling. (A minimal sketch of the mask construction is given after the table.)
Open Source Code | Yes | The code to reproduce the experiments of this paper is available at https://github.com/mgermain/MADE/releases/tag/ICML2015.
Open Datasets | Yes | We use the binary UCI evaluation suite that was first put together in Larochelle & Murray (2011). It is a collection of 7 relatively small datasets from the University of California, Irvine machine learning repository, plus the OCR-letters dataset from the Stanford AI Lab. Table 2 gives an overview of the scale of those datasets and the way they were split. The version of MNIST we used is the one binarized by Salakhutdinov & Murray (2008).
Dataset Splits | Yes | Table 2. Number of input dimensions and numbers of examples in the train, validation, and test splits.
Name | # Inputs | Train | Valid. | Test
Adult | 123 | 5000 | 1414 | 26147
Connect4 | 126 | 16000 | 4000 | 47557
DNA | 180 | 1400 | 600 | 1186
Mushrooms | 112 | 2000 | 500 | 5624
NIPS-0-12 | 500 | 400 | 100 | 1240
OCR-letters | 128 | 32152 | 10000 | 10000
RCV1 | 150 | 40000 | 10000 | 150000
Web | 300 | 14000 | 3188 | 32561
Hardware Specification | Yes | These timings were obtained on a K20 NVIDIA GPU.
Software Dependencies | No | The paper mentions 'Theano' and cites its associated papers (Bastien et al., 2012; Bergstra et al., 2010), but it does not specify a version number for Theano or any other software dependencies used in the experiments.
Experiment Setup | Yes | All experiments were made using stochastic gradient descent (SGD) with mini-batches of size 100 and a lookahead of 30 for early stopping. The experiments were run with networks of 500 units per hidden layer, using the adadelta learning update (Zeiler, 2012) with a decay of 0.95. The other hyperparameters were varied as Table 3 indicates. We note as # of masks the number of different masks through which MADE cycles during training. (A sketch of this training schedule is given after the table.)
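
For reference, the order and connectivity sampling behind Algorithm 1 can be sketched in a few lines of NumPy. This is a minimal single-hidden-layer sketch of the autoregressive masks and the resulting negative log-likelihood of one binary input; the layer sizes, random initialization, and variable names are illustrative assumptions, not the authors' released code.

# Minimal NumPy sketch of MADE's order and connectivity sampling for one
# hidden layer, and the resulting computation of -log p(x) on binary inputs.
# Sizes and initialization are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
D, H = 8, 16                              # input dimension, hidden units

W = rng.normal(0, 0.1, size=(H, D))       # input -> hidden weights
V = rng.normal(0, 0.1, size=(D, H))       # hidden -> output weights
b, c = np.zeros(H), np.zeros(D)

def sample_masks():
    """Sample an input ordering and hidden-unit connectivities, then build
    the masks that enforce the autoregressive property."""
    m0 = rng.permutation(D) + 1                        # input degrees: a permutation of 1..D
    m1 = rng.integers(1, D, size=H)                    # hidden degrees in {1, ..., D-1}
    MW = (m1[:, None] >= m0[None, :]).astype(float)    # hidden k sees input d iff m1[k] >= m0[d]
    MV = (m0[:, None] > m1[None, :]).astype(float)     # output d sees hidden k iff m0[d] > m1[k]
    return MW, MV

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neg_log_likelihood(x, MW, MV):
    """-log p(x) under the masked autoencoder, for a binary vector x."""
    h = sigmoid((W * MW) @ x + b)          # masked input -> hidden
    x_hat = sigmoid((V * MV) @ h + c)      # masked hidden -> output: p(x_d = 1 | x_<d)
    return -np.sum(x * np.log(x_hat) + (1 - x) * np.log(1 - x_hat))

x = rng.integers(0, 2, size=D).astype(float)
MW, MV = sample_masks()
print(neg_log_likelihood(x, MW, MV))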
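
The training schedule quoted in the Experiment Setup row (minibatches of 100, adadelta with decay 0.95, cycling through a fixed set of masks, early stopping with a lookahead of 30) can be arranged roughly as below. This is an illustrative sketch, not the released code: the MADE-specific loss and gradient are replaced by toy stand-ins (grad_loss, valid_loss, and a dummy sample_masks), and the adadelta epsilon, the per-update mask cycling, and the maximum epoch cap are assumptions.

# Hedged sketch of the training schedule: adadelta (Zeiler, 2012) with decay
# 0.95, minibatches of 100, mask cycling, and early stopping with lookahead 30.
# The model pieces are toy stand-ins so the schedule itself runs end to end.
import numpy as np

rng = np.random.default_rng(0)
RHO, EPS = 0.95, 1e-6                                   # EPS is an assumed value

def adadelta_step(param, grad, Eg2, Edx2):
    """One adadelta update on a single parameter array."""
    Eg2 = RHO * Eg2 + (1 - RHO) * grad ** 2
    dx = -np.sqrt(Edx2 + EPS) / np.sqrt(Eg2 + EPS) * grad
    Edx2 = RHO * Edx2 + (1 - RHO) * dx ** 2
    return param + dx, Eg2, Edx2

# --- toy stand-ins for the MADE-specific pieces (not the authors' code) ---
def sample_masks():            return None              # placeholder mask object
def grad_loss(p, batch, mask): return 2 * (p - batch.mean(0))
def valid_loss(p, valid):      return float(np.mean(np.sum((valid - p) ** 2, axis=1)))

def train(n_masks=32, batch_size=100, lookahead=30, dim=8, max_epochs=200):
    data = rng.integers(0, 2, size=(1000, dim)).astype(float)
    valid = rng.integers(0, 2, size=(200, dim)).astype(float)
    masks = [sample_masks() for _ in range(n_masks)]    # fixed set, cycled during training
    param = rng.normal(0, 0.1, size=dim)
    Eg2, Edx2 = np.zeros(dim), np.zeros(dim)
    best, since_best, mask_idx = np.inf, 0, 0
    for epoch in range(max_epochs):
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            g = grad_loss(param, batch, masks[mask_idx])
            mask_idx = (mask_idx + 1) % n_masks         # cycle to the next mask
            param, Eg2, Edx2 = adadelta_step(param, g, Eg2, Edx2)
        val = valid_loss(param, valid)
        best, since_best = (val, 0) if val < best else (best, since_best + 1)
        if since_best >= lookahead:                     # early stopping after 30 stale epochs
            break
    return param, best

param, best = train()
print(best)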