Rao-Blackwellizing the Straight-Through Gumbel-Softmax Gradient Estimator

Authors: Max B. Paulus, Chris J. Maddison, Andreas Krause

ICLR 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we study the effectiveness of our gradient estimator in practice. In particular, we evaluate its performance with respect to the temperature τ, the number of MC samples K and the batch size B. We measure the variance reduction and improvements in MSE our estimator achieves in practice, and assess whether its lower variance gradient estimates accelerate the convergence on the objective or improve final test set performance. Our focus is on single-evaluation gradient estimation and we compare against other non-relaxing estimators (ST, FouST, ST-GS and REINFORCE with a running mean as a baseline) and relaxing estimators (GS), where permissible. Experimental details are given in Appendix D. First, we consider a toy example which allows us to explore and visualize the variance of our estimator and suggests that it is particularly effective at low temperatures. Next, we evaluate the effect of τ and K in a latent parse tree task which does not permit the use of relaxed gradient estimators. Here, our estimator facilitates training at low temperatures to improve overall performance and is effective even with few MC samples. Finally, we train variational auto-encoders with discrete latent variables (Kingma & Welling, 2013; Rezende et al., 2014). Our estimator yields improvements at small batch sizes and obtains competitive or better performance than the GS estimator at the largest arity. (A minimal code sketch of the estimator under evaluation is given below the table.)
Researcher Affiliation | Academia | Max B. Paulus (ETH Zürich; MPI for Intelligent Systems, Tübingen) max.paulus@inf.ethz.ch; Chris J. Maddison (University of Toronto; Vector Institute) cmaddis@cs.toronto.edu; Andreas Krause (ETH Zürich) krausea@ethz.ch
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not include an explicit statement about releasing its source code or a link to a code repository for the methodology described.
Open Datasets | Yes | We demonstrate the effectiveness of our estimator in unsupervised parsing on the ListOps dataset (Nangia & Bowman, 2018) and on a variational autoencoder loss (Kingma & Welling, 2013; Rezende et al., 2014). (...) Finally, we train variational auto-encoders (Kingma & Welling, 2013; Rezende et al., 2014) with discrete latent random variables on the MNIST dataset of handwritten digits (LeCun & Cortes, 2010).
Dataset Splits | Yes | Best test classification accuracy on the ListOps dataset selected on the validation set. (...) We used the fixed binarization of Salakhutdinov & Murray (2008) and the standard split into train, validation and test sets.
Hardware Specification | No | The paper mentions that drawing samples "can easily be parallelized on modern workstations (GPUs, etc.)" but does not specify any particular GPU models, CPU models, or other hardware configurations used for the experiments.
Software Dependencies | No | The paper does not provide specific version numbers for any software dependencies, libraries, or programming languages used.
Experiment Setup | Yes | We consider temperatures τ ∈ {0.01, 0.1, 1.0} and experiment with shallow and deeper trees by considering sequences of length L up to 10, 25 and 50. All models are trained with stochastic gradient descent with a batch size equal to the maximum L. (...) We experiment with different batch sizes and discrete random variables of arities in {2, 4, 8, 16} as in Maddison et al. (2017). To facilitate comparisons, we do not alter the total dimension of the latent space and train all models for 50,000 iterations using stochastic gradient descent with momentum. Hyperparameters are optimised for each estimator using random search (Bergstra & Bengio, 2012) over twenty independent runs. (A short sketch of this random-search protocol follows the estimator sketch below.)
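
For context on the method being assessed, the following is a minimal sketch of the Rao-Blackwellized straight-through Gumbel-Softmax ("Gumbel-Rao") estimator that the quoted excerpts refer to: the forward pass emits an exact one-hot sample, while the backward pass uses the Gumbel-Softmax relaxation averaged over K Gumbel noise draws sampled conditionally on that one-hot outcome. It is written in PyTorch purely for illustration; the framework, function names, and the top-down conditional-sampling construction shown here are assumptions of this write-up, not code released by the authors.

```python
import torch
import torch.nn.functional as F


def conditional_gumbel_noise(logits, D, k):
    """Sample k sets of Gumbel noise G such that argmax(logits + G) lands on
    the one-hot sample D, via the standard top-down construction."""
    # i.i.d. Exponential(1) noise, shape (k, *logits.shape)
    E = torch.empty(k, *logits.shape, dtype=logits.dtype,
                    device=logits.device).exponential_()
    # Exponential draw attached to the sampled class (per replicate)
    Ei = (D * E).sum(dim=-1, keepdim=True)
    # log of the partition function Z = sum_i exp(logits_i)
    logZ = torch.logsumexp(logits, dim=-1, keepdim=True)
    # Perturbed logit of the argmax coordinate: a Gumbel with location logZ
    top = logZ - torch.log(Ei)
    # Remaining coordinates: Gumbels truncated below that maximum
    rest = -torch.log(E / torch.exp(logits) + Ei / torch.exp(logZ))
    adjusted = D * top + (1.0 - D) * rest
    return adjusted - logits  # return only the noise


def gumbel_rao(logits, tau=0.1, k=10):
    """Forward pass: exact one-hot sample. Backward pass: gradient of the
    Gumbel-Softmax relaxation averaged over k conditional noise draws."""
    D = F.one_hot(
        torch.distributions.Categorical(logits=logits).sample(),
        num_classes=logits.shape[-1],
    ).to(logits.dtype)
    # Treat the conditional noise as a constant, so gradients flow only
    # through the explicit `logits` term, exactly as in plain ST-GS.
    noise = conditional_gumbel_noise(logits, D, k).detach()
    surrogate = torch.softmax((logits + noise) / tau, dim=-1).mean(dim=0)
    # Straight-through output: value of D, gradient of the averaged surrogate
    return D + (surrogate - surrogate.detach())
```

With k=1 the resulting gradient is distributed like the ordinary straight-through Gumbel-Softmax gradient; larger k averages more conditional relaxations per discrete sample, which is the variance reduction the excerpts describe as most useful at low temperatures τ.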
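The hyperparameter protocol quoted in the last row (random search over twenty independent runs per estimator, following Bergstra & Bengio, 2012) amounts to the loop sketched below. The search space shown here (learning rate, momentum, temperature) is an illustrative assumption; the ranges actually used are deferred to the paper's Appendix D.

```python
import random


def sample_config():
    # Illustrative search space; not the ranges used in the paper.
    return {
        "lr": 10 ** random.uniform(-4, -1),        # log-uniform learning rate
        "momentum": random.uniform(0.0, 0.99),
        "temperature": random.choice([0.01, 0.1, 1.0]),
    }


def random_search(train_and_evaluate, n_trials=20):
    """train_and_evaluate(config) -> validation metric (higher is better)."""
    best_config, best_score = None, float("-inf")
    for _ in range(n_trials):
        config = sample_config()
        score = train_and_evaluate(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score
```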