CARMS: Categorical-Antithetic-REINFORCE Multi-Sample Gradient Estimator
Authors: Alek Dimitriev, Mingyuan Zhou
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate CARMS on several benchmark datasets on a generative modeling task, as well as a structured output prediction task, and find it to outperform competing methods including a strong self-control baseline. |
| Researcher Affiliation | Academia | Alek Dimitriev, McCombs School of Business, The University of Texas at Austin, Austin, TX 78712, alek.dimitriev@mccombs.utexas.edu; Mingyuan Zhou, McCombs School of Business, The University of Texas at Austin, Austin, TX 78712, mingyuan.zhou@mccombs.utexas.edu |
| Pseudocode | Yes | Algorithm 1 Antithetic inverse CDF categorical sampling, Algorithm 2 Antithetic Gumbel categorical sampling |
| Open Source Code | Yes | The code is publicly available: https://github.com/alekdimi/carms. The code for all experiments is freely available at https://github.com/alekdimi/carms. |
| Open Datasets | Yes | The task is training a categorical VAE using either a linear or nonlinear encoder/decoder pair on three different datasets: Dynamic(ally binarized) MNIST [LeCun et al., 2010], Fashion MNIST [Xiao et al., 2017], and Omniglot [Lake et al., 2015]. |
| Dataset Splits | No | The paper mentions training and testing on datasets but does not explicitly specify the proportions or methodology for dataset splits (e.g., train/validation/test percentages or sample counts). |
| Hardware Specification | Yes | The models are trained on an Intel Xeon Platinum 8280 2.7GHz CPU, and an individual run takes approximately 16 hours on one core of the machine, with a total carbon emissions estimated to be 28.19 kg of CO2 [Lacoste et al., 2019]. |
| Software Dependencies | No | The paper mentions optimizers (SGD, Adam) and activation functions (Leaky ReLU) but does not provide specific version numbers for software dependencies or libraries used in the experimental setup. |
| Experiment Setup | Yes | For a fair comparison, all methods use the same learning rate, optimizer, model architecture, and number of samples. The prior logits are optimized using SGD with a learning rate of 10⁻², whereas the encoder and decoder are optimized using Adam [Kingma and Ba, 2015] with a learning rate of 10⁻⁴, following Yin et al. [2019]. The optimization is run for 10⁶ steps, with a batch size of 50, from which the global dataset mean is subtracted. |
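
The Pseudocode row above names two antithetic sampling routines from the paper: Algorithm 1 (antithetic inverse CDF categorical sampling) and Algorithm 2 (antithetic Gumbel categorical sampling). The Python sketch below illustrates only the basic inverse-CDF antithetic idea, pairing a uniform draw u with 1 − u through the categorical CDF; the function name and the clamping detail are illustrative assumptions, and this is not a reproduction of the paper's exact multi-sample algorithms.

```python
import numpy as np

def antithetic_inverse_cdf_pair(probs, rng=None):
    """Illustrative sketch: one antithetic pair of categorical samples.

    Pairs a uniform draw u with its antithetic counterpart 1 - u and pushes
    both through the categorical CDF. This shows the core antithetic idea
    behind inverse-CDF sampling, not the paper's exact Algorithm 1.
    """
    rng = np.random.default_rng() if rng is None else rng
    cdf = np.cumsum(probs)                       # categorical CDF F(k)
    last = len(cdf) - 1                          # clamp guards against float round-off
    u = rng.uniform()                            # u ~ Uniform(0, 1)
    z = min(int(np.searchsorted(cdf, u)), last)              # z = F^{-1}(u)
    z_anti = min(int(np.searchsorted(cdf, 1.0 - u)), last)   # F^{-1}(1 - u), negatively correlated
    return z, z_anti

# Example with a 4-way categorical distribution.
z, z_anti = antithetic_inverse_cdf_pair([0.1, 0.2, 0.3, 0.4])
```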
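
The Experiment Setup row reports a split optimizer configuration: SGD with learning rate 10⁻² for the prior logits and Adam with learning rate 10⁻⁴ for the encoder/decoder, run for 10⁶ steps with a batch size of 50. A minimal sketch of such a configuration follows; the framework (PyTorch), module shapes, and variable names are assumptions for illustration, not the paper's released code.

```python
import torch

# Placeholder modules standing in for the categorical VAE described in the
# Experiment Setup row; shapes and names are illustrative assumptions.
encoder = torch.nn.Linear(784, 200)
decoder = torch.nn.Linear(200, 784)
prior_logits = torch.nn.Parameter(torch.zeros(20, 10))

# Reported optimizer split: SGD (lr = 1e-2) for the prior logits,
# Adam (lr = 1e-4) for the encoder and decoder parameters.
prior_opt = torch.optim.SGD([prior_logits], lr=1e-2)
model_opt = torch.optim.Adam(
    list(encoder.parameters()) + list(decoder.parameters()), lr=1e-4
)

NUM_STEPS = 10**6   # reported number of optimization steps
BATCH_SIZE = 50     # reported batch size; the global dataset mean is subtracted
```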