Estimating Gradients for Discrete Random Variables by Sampling without Replacement
Authors: Wouter Kool, Herke van Hoof, Max Welling
ICLR 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments with a toy problem, a categorical Variational Auto-Encoder and a structured prediction problem show that our estimator is the only estimator that is consistently among the best estimators in both high and low entropy settings. |
| Researcher Affiliation | Collaboration | Wouter Kool (University of Amsterdam, ORTEC) w.w.m.kool@uva.nl; Herke van Hoof (University of Amsterdam) h.c.vanhoof@uva.nl; Max Welling (University of Amsterdam, CIFAR) m.welling@uva.nl |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code available at https://github.com/wouterkool/estimating-gradients-without-replacement. |
| Open Datasets | Yes | The dataset is MNIST, statically binarized by thresholding at 0.5 (although we include results using the standard binarized dataset by Salakhutdinov & Murray (2008); Larochelle & Murray (2011) in Section G.2). (A binarization sketch follows the table.) |
| Dataset Splits | Yes | Figure 5 shows the -ELBO evaluated during training on the validation set. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models used for running its experiments. |
| Software Dependencies | No | The paper mentions 'PyTorch' and 'Adam' but does not specify version numbers for these or other key software components. |
| Experiment Setup | Yes | We optimize the ELBO using the analytic KL for 1000 epochs using the Adam (Kingma & Ba, 2015) optimizer. We use a learning rate of 10^-3 for all estimators except Gumbel-Softmax and RELAX, which use a learning rate of 10^-4 as we found they diverged with a higher learning rate. [...] We did not do any hyperparameter optimization and used the exact same training details, using the Adam optimizer (Kingma & Ba, 2015) with a learning rate of 10^-4 (no decay) for 100 epochs for all estimators. For the baselines, we used the same batch size of 512, but for estimators that use k = 4 samples, we used a batch size of 512 / 4 = 128 to compensate for the additional samples. (A configuration sketch follows the table.) |
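To make the "Open Datasets" row concrete, here is a minimal sketch of statically binarizing MNIST by thresholding pixel values at 0.5, as the quoted passage describes. The torchvision loading path and the transform pipeline are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch: statically binarize MNIST by thresholding at 0.5.
# The torchvision loading path is an assumption for illustration;
# the paper only states that MNIST is thresholded at 0.5.
from torchvision import datasets, transforms

binarize = transforms.Compose([
    transforms.ToTensor(),                           # scale pixels to [0, 1]
    transforms.Lambda(lambda x: (x > 0.5).float()),  # static threshold at 0.5
])

train_set = datasets.MNIST(root="data", train=True, download=True, transform=binarize)
```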
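The "Experiment Setup" row translates into a small optimizer and batch-size configuration. The sketch below assumes PyTorch; `model`, `estimator`, and the estimator names are placeholders for illustration, since the paper does not publish this exact snippet.

```python
# Sketch of the quoted training configuration (assumed PyTorch setup; `model`
# and `estimator` are placeholders, not the authors' code).
import torch

model = torch.nn.Linear(784, 10)  # placeholder for the categorical VAE / prediction model
estimator = "unordered"           # placeholder name for the gradient estimator in use

# Categorical VAE experiment: lr 1e-3 for most estimators,
# 1e-4 for Gumbel-Softmax and RELAX (which diverged at 1e-3).
lr = 1e-4 if estimator in ("gumbel_softmax", "relax") else 1e-3
optimizer = torch.optim.Adam(model.parameters(), lr=lr)

# Structured prediction experiment: batch size 512 for single-sample baselines,
# 512 / k = 128 for estimators that draw k = 4 samples per datapoint.
k = 4
batch_size = 512 // k if k > 1 else 512
```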