SUMO: Unbiased Estimation of Log Marginal Probability for Latent Variable Models

Authors: Yucen Luo, Alex Beatson, Mohammad Norouzi, Jun Zhu, David Duvenaud, Ryan P. Adams, Ricky T. Q. Chen

ICLR 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show that models trained using our estimator give better test-set likelihoods than a standard importance-sampling based approach for the same average computational cost. (Abstract); We first compare the performance of SUMO when used as a replacement to IWAE with the same expected cost on density modeling tasks. (Section 5)
Researcher Affiliation | Collaboration | Yucen Luo (Tsinghua University) luoyc15@mails.tsinghua.edu.cn; Alex Beatson (Princeton University) abeatson@cs.princeton.edu; Mohammad Norouzi (Google Research) mnorouzi@google.com; Jun Zhu (Tsinghua University) dcszj@tsinghua.edu.cn; David Duvenaud (University of Toronto) duvenaud@cs.toronto.edu; Ryan P. Adams (Princeton University) rpa@princeton.edu; Ricky T. Q. Chen (University of Toronto) rtqichen@cs.toronto.edu
Pseudocode | Yes | Algorithm 1: Computing SUMO, an unbiased estimator of log p(x). (A hedged code sketch of this estimator is given after the table.)
Open Source Code | No | The paper does not provide any explicit statement about open-sourcing the code, nor does it include a link to a code repository.
Open Datasets | Yes | We make use of two benchmark datasets: dynamically binarized MNIST (LeCun et al., 1998) and binarized OMNIGLOT (Lake et al., 2015). (A short sketch of the dynamic-binarization step follows the table.)
Dataset Splits | Yes | The learning rate is reduced by factor 0.8 if the validation likelihood does not improve for 50 epochs. (Appendix A.8); We report the performance of models with early stopping if no improvements have been observed for 300 epochs on the validation set. (Appendix A.8)
Hardware Specification | No | The paper does not provide specific details about the hardware used for running experiments, such as GPU or CPU models.
Software Dependencies | No | The paper mentions optimizers like AMSGrad, RMSprop, and Adam, but does not specify version numbers for any key software libraries (e.g., Python, PyTorch, TensorFlow, CUDA).
Experiment Setup | Yes | In density modeling experiments, all the models are trained using a batch size of 100 and the AMSGrad optimizer (Reddi et al., 2018) with parameters lr = 0.001, β1 = 0.9, β2 = 0.999 and ϵ = 10⁻⁴. (Appendix A.8); We set the gradient norm to 5000 for encoder and {20, 40, 60} for decoder in SUMO. For IWAE, the gradient norm is fixed to 10 in all the experiments. (Appendix A.8) (A configuration sketch combining these settings with the validation schedule quoted under Dataset Splits appears after the table.)
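
The following is a minimal sketch of the estimator named in Algorithm 1, not the authors' code: the function `sumo`, the helper `log_weight_fn`, and the geometric distribution used for the truncation variable K are illustrative assumptions (the paper chooses its own tail distribution P(K ≥ k) to trade variance against expected compute).

```python
# Hypothetical sketch of SUMO (Algorithm 1); not the authors' released code.
import torch


def sumo(log_weight_fn, x, geom_p=0.5):
    """Return an unbiased estimate of log p(x) via Russian-roulette truncation.

    `log_weight_fn(x, k)` is assumed to return k i.i.d. log importance
    weights log p(x, z_j) - log q(z_j | x). The geometric draw for K is
    purely illustrative; any distribution with a known tail P(K >= k) works.
    """
    # Sample the truncation level K >= 1.
    K = int(torch.distributions.Geometric(probs=geom_p).sample().item()) + 1
    logw = log_weight_fn(x, K + 1)  # (K + 1,) log-weights

    # IWAE_k = logsumexp(logw[:k]) - log k, for k = 1, ..., K + 1.
    ks = torch.arange(1, K + 2, dtype=logw.dtype)
    iwae = torch.logcumsumexp(logw, dim=0) - torch.log(ks)

    # Delta_k = IWAE_{k+1} - IWAE_k, reweighted by 1 / P(K >= k) = (1 - p)^-(k - 1).
    deltas = iwae[1:] - iwae[:-1]
    inv_tail = (1.0 - geom_p) ** (-torch.arange(K, dtype=logw.dtype))
    return iwae[0] + (deltas * inv_tail).sum()


# Toy sanity check: if every log-weight already equals log p(x), all the
# Delta terms vanish and SUMO returns log p(x) exactly.
print(sumo(lambda x, k: torch.full((k,), -3.0), x=None))  # tensor(-3.)
```

Averaged over repeated calls, the output is an unbiased estimate of log p(x), which is what distinguishes SUMO from the downward-biased IWAE bound.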
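
For the "dynamically binarized MNIST" protocol quoted above, a common reading is that binary pixels are resampled from the grayscale intensities each time an image is used rather than fixed once; the snippet below is a sketch under that assumption (the paper does not spell out its preprocessing code).

```python
import torch


def dynamic_binarize(images: torch.Tensor) -> torch.Tensor:
    """Sample each pixel as Bernoulli(intensity), with `images` in [0, 1].

    Assumed reading of "dynamically binarized": a fresh binary sample is
    drawn every time a batch is formed, not a single fixed binarization.
    """
    return torch.bernoulli(images)
```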
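
The optimizer, gradient-clipping, and validation-schedule settings quoted in the Dataset Splits and Experiment Setup rows can be collected into one configuration. The paper does not name its software stack, so the PyTorch rendering below is an assumption; only the numerical values come from the quotes above, and the encoder/decoder modules are toy placeholders.

```python
import torch
import torch.nn as nn

# Toy placeholder modules; the paper's actual architectures are not reproduced here.
encoder = nn.Linear(784, 100)
decoder = nn.Linear(50, 784)

# AMSGrad with lr = 0.001, betas = (0.9, 0.999), eps = 1e-4 (Appendix A.8).
optimizer = torch.optim.Adam(
    list(encoder.parameters()) + list(decoder.parameters()),
    lr=1e-3, betas=(0.9, 0.999), eps=1e-4, amsgrad=True,
)

# "The learning rate is reduced by factor 0.8 if the validation likelihood
# does not improve for 50 epochs"; mode="max" because a likelihood is monitored.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.8, patience=50,
)


def clip_gradients(decoder_max_norm: float = 20.0) -> None:
    """Per-module gradient-norm clipping applied before each optimizer step.

    5000 for the encoder and one of {20, 40, 60} for the decoder under SUMO;
    for IWAE the paper fixes the norm to 10 for all parameters instead.
    """
    nn.utils.clip_grad_norm_(encoder.parameters(), max_norm=5000.0)
    nn.utils.clip_grad_norm_(decoder.parameters(), max_norm=decoder_max_norm)
```

Training would then use a batch size of 100, call `scheduler.step(val_log_likelihood)` after each validation pass, and stop early once the validation likelihood has not improved for 300 epochs, matching the quoted settings.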