ARMS: Antithetic-REINFORCE-Multi-Sample Gradient for Binary Variables

Authors: Aleksandar Dimitriev, Mingyuan Zhou

Venue: ICML 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We evaluate ARMS on several datasets for training generative models, and our experimental results show that it outperforms competing methods. Our experimental setup follows the one in Yin & Zhou (2019) and Dong et al. (2020), and all VAE experiments are built on top of the available DisARM code."
Researcher Affiliation | Academia | "McCombs School of Business, The University of Texas at Austin, Austin, Texas 78712, USA. Correspondence to: Alek Dimitriev <alekdimi@utexas.edu>, Mingyuan Zhou <mingyuan.zhou@mccombs.utexas.edu>."
Pseudocode | Yes | "Algorithm 1: Antithetic Dirichlet copula sampling. Algorithm 2: Antithetic Gaussian copula sampling." (See the sampling sketch below the table.)
Open Source Code | Yes | "The code is publicly available." (Footnote 1: https://github.com/alekdimi/arms)
Open Datasets | Yes | "The comparison is done on three different benchmark datasets: dynamically binarized MNIST, Fashion-MNIST, and Omniglot, with each dataset split into the training, validation, and test sets."
Dataset Splits | No | The paper states that each dataset is split into training, validation, and test sets, but it does not provide specific percentages, sample counts, or citations to predefined splits for these datasets; it only confirms that such splits are used.
Hardware Specification | Yes | "All the models were trained on a K40 Nvidia GPU and Intel Xeon E5-2680 processor."
Software Dependencies | No | The paper mentions 'Adam' as an optimizer and 'Leaky ReLU' activations but does not specify version numbers for any software dependencies or libraries (e.g., PyTorch or TensorFlow).
Experiment Setup | Yes | "The nonlinear network has two hidden layers of 200 units each, using Leaky ReLU (Maas et al., 2013) activations with a coefficient of 0.3. Adam (Kingma & Ba, 2015) with a learning rate of 1e-4 is used to optimize the network parameters, and SGD with learning rate 1e-2 for the prior distribution logits. The optimization is run for 10^6 steps with mini-batches of size 50. For RELAX, the scaling factor is initialized to 1, the temperature to 0.1, and the control variate is a neural network with one hidden layer of 137 units using Leaky ReLU activations. The only data preprocessing involves subtracting the global mean of the dataset from each image before it is input to the encoder." (See the configuration sketch below the table.)
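Below is a minimal NumPy sketch of the antithetic Gaussian copula sampling named in the Pseudocode row (the paper's Algorithm 2): draw jointly Gaussian variables with a common negative pairwise correlation, map them to uniforms with the standard normal CDF, and threshold against the Bernoulli probabilities. The function name sample_arms_gaussian, the rho argument, and the default correlation of -1/(K-1) are illustrative assumptions, not the released API; the authors' implementation is at https://github.com/alekdimi/arms.

```python
# Illustrative sketch only; names and defaults are assumptions, not the ARMS API.
import numpy as np
from scipy.special import ndtr  # standard normal CDF

def sample_arms_gaussian(p, num_samples, rho=None, rng=None):
    """Draw num_samples negatively correlated Bernoulli(p) vectors via a Gaussian copula.

    p           : array of shape (d,) with Bernoulli probabilities (e.g. sigmoid of logits).
    num_samples : K, the number of antithetic samples.
    rho         : pairwise correlation of the latent Gaussians; the most negative
                  exchangeable value is -1/(K - 1), used as the default here.
    """
    rng = np.random.default_rng() if rng is None else rng
    k = num_samples
    if rho is None:
        rho = -1.0 / (k - 1)
    d = p.shape[0]
    # Equicorrelated covariance: 1 on the diagonal, rho off the diagonal.
    cov = (1.0 - rho) * np.eye(k) + rho * np.ones((k, k))
    # One correlated K-vector of standard Gaussians per coordinate of p.
    z = rng.multivariate_normal(np.zeros(k), cov, size=d, check_valid="ignore")  # (d, K)
    u = ndtr(z)                                # marginally Uniform(0, 1), negatively correlated
    b = (u < p[:, None]).astype(np.float64)    # each column is one Bernoulli(p) draw
    return b.T                                 # shape (K, d)

# Example: four antithetic samples of a 3-dimensional Bernoulli vector.
samples = sample_arms_gaussian(np.array([0.2, 0.5, 0.9]), num_samples=4)
```

The paper's Algorithm 1 uses a Dirichlet copula instead to produce the negatively correlated uniforms, but the overall pattern (correlated uniforms fed through the Bernoulli inverse CDF) is the same.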
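The Experiment Setup row pins down the training configuration; the following PyTorch sketch mirrors those numbers (two 200-unit hidden layers with LeakyReLU(0.3), Adam at 1e-4 for the network weights, SGD at 1e-2 for the prior logits, 10^6 steps, batch size 50). It is only an illustration under assumed input and latent sizes of 784 and 200; the released code instead builds on the available DisARM codebase, as quoted above.

```python
# Illustrative configuration sketch; the data and latent dimensions are assumptions.
import torch
from torch import nn

obs_dim, latent_dim = 784, 200   # assumed: flattened 28x28 images, 200 binary latents

encoder = nn.Sequential(
    nn.Linear(obs_dim, 200), nn.LeakyReLU(0.3),
    nn.Linear(200, 200), nn.LeakyReLU(0.3),
    nn.Linear(200, latent_dim),              # Bernoulli logits of the latent code
)
decoder = nn.Sequential(
    nn.Linear(latent_dim, 200), nn.LeakyReLU(0.3),
    nn.Linear(200, 200), nn.LeakyReLU(0.3),
    nn.Linear(200, obs_dim),                 # Bernoulli logits of the reconstruction
)
prior_logits = nn.Parameter(torch.zeros(latent_dim))

# Adam (lr 1e-4) for the encoder/decoder weights, SGD (lr 1e-2) for the prior logits.
net_opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-4)
prior_opt = torch.optim.SGD([prior_logits], lr=1e-2)

num_steps, batch_size = 10**6, 50
# Preprocessing, per the quote: subtract the dataset's global mean image from
# each input before it is fed to the encoder.
```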