Estimating Gradients for Discrete Random Variables by Sampling without Replacement

Authors: Wouter Kool, Herke van Hoof, Max Welling

ICLR 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Experiments with a toy problem, a categorical Variational Auto-Encoder and a structured prediction problem show that our estimator is the only estimator that is consistently among the best estimators in both high and low entropy settings."
Researcher Affiliation | Collaboration | Wouter Kool (University of Amsterdam, ORTEC) w.w.m.kool@uva.nl; Herke van Hoof (University of Amsterdam) h.c.vanhoof@uva.nl; Max Welling (University of Amsterdam, CIFAR) m.welling@uva.nl
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | "Code available at https://github.com/wouterkool/estimating-gradients-without-replacement."
Open Datasets | Yes | "The dataset is MNIST, statically binarized by thresholding at 0.5 (although we include results using the standard binarized dataset by Salakhutdinov & Murray (2008); Larochelle & Murray (2011) in Section G.2)." (A binarization sketch follows the table.)
Dataset Splits | Yes | "Figure 5 shows the -ELBO evaluated during training on the validation set."
Hardware Specification | No | The paper does not provide specific hardware details such as the GPU or CPU models used for running its experiments.
Software Dependencies | No | The paper mentions 'PyTorch' and 'Adam' but does not specify version numbers for these or other key software components.
Experiment Setup | Yes | "We optimize the ELBO using the analytic KL for 1000 epochs using the Adam (Kingma & Ba, 2015) optimizer. We use a learning rate of 10⁻³ for all estimators except Gumbel-Softmax and RELAX, which use a learning rate of 10⁻⁴ as we found they diverged with a higher learning rate." and "We did not do any hyperparameter optimization and used the exact same training details, using the Adam optimizer (Kingma & Ba, 2015) with a learning rate of 10⁻⁴ (no decay) for 100 epochs for all estimators. For the baselines, we used the same batch size of 512, but for estimators that use k = 4 samples, we used a batch size of 512 / 4 = 128 to compensate for the additional samples." (A configuration sketch follows the table.)
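
As a reading aid for the Open Datasets row, here is a minimal sketch of statically binarizing MNIST by thresholding at 0.5 in PyTorch. This is not the authors' code (their implementation is in the linked repository); the data root, loader settings, and transform composition are illustrative assumptions.

```python
# Minimal sketch (not the authors' code): statically binarize MNIST by
# thresholding pixel intensities at 0.5, as quoted in the Open Datasets row.
# The data root and use of torchvision transforms are assumptions.
import torch
from torchvision import datasets, transforms

binarize = transforms.Compose([
    transforms.ToTensor(),                            # scale pixels to [0, 1]
    transforms.Lambda(lambda x: (x > 0.5).float()),   # static threshold at 0.5
])

train_loader = torch.utils.data.DataLoader(
    datasets.MNIST(root="./data", train=True, download=True, transform=binarize),
    batch_size=512,
    shuffle=True,
)
```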
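
The Experiment Setup row quotes concrete optimizer and batch-size choices. The sketch below restates them as code under stated assumptions: `model`, `estimator_name`, and `num_samples_k` are hypothetical placeholders, and the branching between the VAE and structured-prediction settings is an interpretation of the two quoted passages, not the authors' training loop (which is in the linked repository).

```python
# Minimal sketch of the reported training configuration (not the authors' code).
# `model`, `estimator_name`, and `num_samples_k` are hypothetical placeholders.
import torch


def make_optimizer(model, estimator_name, structured_prediction=False):
    """Adam with the learning rates quoted in the Experiment Setup row."""
    if structured_prediction:
        lr = 1e-4   # structured prediction: 1e-4, no decay, 100 epochs
    elif estimator_name in {"gumbel_softmax", "relax"}:
        lr = 1e-4   # reported to diverge with a higher learning rate
    else:
        lr = 1e-3   # categorical VAE: 1e-3 for the remaining estimators
    return torch.optim.Adam(model.parameters(), lr=lr)


def batch_size_for(num_samples_k, base_batch_size=512):
    """512 for single-sample baselines; 512 / k otherwise."""
    return base_batch_size // num_samples_k
```

For example, `batch_size_for(4)` returns 128, matching the quoted compensation for estimators that draw k = 4 samples.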