CARMS: Categorical-Antithetic-REINFORCE Multi-Sample Gradient Estimator

Authors: Alek Dimitriev, Mingyuan Zhou

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate CARMS on several benchmark datasets on a generative modeling task, as well as a structured output prediction task, and find it to outperform competing methods including a strong self-control baseline.
Researcher Affiliation | Academia | Alek Dimitriev, McCombs School of Business, The University of Texas at Austin, Austin, TX 78712, alek.dimitriev@mccombs.utexas.edu; Mingyuan Zhou, McCombs School of Business, The University of Texas at Austin, Austin, TX 78712, mingyuan.zhou@mccombs.utexas.edu
Pseudocode | Yes | Algorithm 1: Antithetic inverse CDF categorical sampling; Algorithm 2: Antithetic Gumbel categorical sampling. (A sketch of the inverse-CDF idea follows the table.)
Open Source Code | Yes | The code is publicly available; the code for all experiments is freely available at https://github.com/alekdimi/carms
Open Datasets | Yes | The task is training a categorical VAE using either a linear or nonlinear encoder/decoder pair on three different datasets: Dynamic(ally binarized) MNIST [LeCun et al., 2010], Fashion-MNIST [Xiao et al., 2017], and Omniglot [Lake et al., 2015].
Dataset Splits | No | The paper mentions training and testing on datasets but does not explicitly specify the proportions or methodology for dataset splits (e.g., train/validation/test percentages or sample counts).
Hardware Specification | Yes | The models are trained on an Intel Xeon Platinum 8280 2.7 GHz CPU, and an individual run takes approximately 16 hours on one core of the machine, with total carbon emissions estimated at 28.19 kg of CO2 [Lacoste et al., 2019].
Software Dependencies | No | The paper mentions optimizers (SGD, Adam) and activation functions (Leaky ReLU) but does not provide specific version numbers for software dependencies or libraries used in the experimental setup.
Experiment Setup | Yes | For a fair comparison, all methods use the same learning rate, optimizer, model architecture, and number of samples. The prior logits are optimized using SGD with a learning rate of 10^-2, whereas the encoder and decoder are optimized using Adam [Kingma and Ba, 2015] with a learning rate of 10^-4, following Yin et al. [2019]. The optimization is run for 10^6 steps, with a batch size of 50, from which the global dataset mean is subtracted. (A configuration sketch follows the table.)
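
The antithetic coupling behind Algorithm 1 can be illustrated with the basic inverse-CDF construction: a single uniform variate u and its reflection 1 - u are pushed through the categorical inverse CDF, producing a negatively correlated pair of samples. The sketch below shows only this two-sample case in NumPy; the paper's algorithm extends the coupling to more than two samples, and Algorithm 2 obtains a similar effect with Gumbel noise. The function name `antithetic_pair` is illustrative and is not taken from the CARMS repository.

import numpy as np

def antithetic_pair(probs, rng):
    """Draw two negatively correlated categorical samples from `probs`."""
    cdf = np.cumsum(probs)                       # categorical CDF
    u = rng.uniform()                            # shared uniform variate
    x = int(np.searchsorted(cdf, u))             # F^{-1}(u)
    x_anti = int(np.searchsorted(cdf, 1.0 - u))  # F^{-1}(1 - u), the antithetic partner
    return x, x_anti

# Example: a skewed 4-way categorical distribution.
rng = np.random.default_rng(0)
probs = np.array([0.1, 0.2, 0.3, 0.4])
print(antithetic_pair(probs, rng))

Because u and 1 - u are perfectly negatively correlated, the two samples tend to land in different categories whenever the distribution is not too concentrated, which is the source of the variance reduction in the multi-sample estimator.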
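
The quoted experiment setup can be mirrored, under assumptions, in a short PyTorch configuration: the prior logits get their own SGD optimizer at 10^-2, while the encoder and decoder parameters share an Adam optimizer at 10^-4. Only the optimizer choices, learning rates, batch size, and step count come from the quoted text; the module names, shapes, and latent dimensions below are placeholders, not the paper's architecture.

import torch

# Placeholder dimensions and modules: a linear encoder/decoder pair for
# 784-dimensional (MNIST-sized) inputs with a categorical latent space.
latent_dim, num_classes = 20, 10
encoder = torch.nn.Linear(784, latent_dim * num_classes)
decoder = torch.nn.Linear(latent_dim * num_classes, 784)
prior_logits = torch.nn.Parameter(torch.zeros(latent_dim, num_classes))

# Optimizers as quoted: SGD (lr = 1e-2) for the prior logits,
# Adam (lr = 1e-4) for the encoder and decoder.
opt_prior = torch.optim.SGD([prior_logits], lr=1e-2)
opt_model = torch.optim.Adam(
    list(encoder.parameters()) + list(decoder.parameters()), lr=1e-4)

batch_size, num_steps = 50, 10**6  # as reported in the experiment setup
# Each minibatch would have the global dataset mean subtracted before encoding.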