ARMS: Antithetic-REINFORCE-Multi-Sample Gradient for Binary Variables
Authors: Aleksandar Dimitriev, Mingyuan Zhou
ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate ARMS on several datasets for training generative models, and our experimental results show that it outperforms competing methods. Our experimental setup follows the one in Yin & Zhou (2019) and Dong et al. (2020), and all VAE experiments are built on top of the available DisARM code. |
| Researcher Affiliation | Academia | McCombs School of Business, The University of Texas at Austin, Austin, Texas 78712, USA. Correspondence to: Alek Dimitriev <alekdimi@utexas.edu>, Mingyuan Zhou <mingyuan.zhou@mccombs.utexas.edu>. |
| Pseudocode | Yes | Algorithm 1 Antithetic Dirichlet copula sampling. Algorithm 2 Antithetic Gaussian copula sampling. (A hedged sketch of the Gaussian-copula variant follows the table.) |
| Open Source Code | Yes | The code is publicly available1. 1https://github.com/alekdimi/arms |
| Open Datasets | Yes | The comparison is done on three different benchmark datasets: dynamically binarized MNIST, Fashion MNIST, and Omniglot, with each dataset split into the training, validation, and test sets. |
| Dataset Splits | No | The paper states that each dataset is split into training, validation, and test sets, but it does not provide specific percentages, sample counts, or citations to predefined splits for these datasets. It only confirms that such splits are used. |
| Hardware Specification | Yes | All the models were trained on a K40 Nvidia GPU and Intel Xeon E5-2680 processor. |
| Software Dependencies | No | The paper mentions 'Adam' as an optimizer and 'Leaky ReLU' activations but does not specify version numbers for any software dependencies or libraries (e.g., PyTorch, TensorFlow, etc.). |
| Experiment Setup | Yes | The nonlinear network has two hidden layers of 200 units each, using Leaky ReLU (Maas et al., 2013) activations with a coefficient of 0.3. Adam (Kingma & Ba, 2015) with a learning rate of 1e-4 is used to optimize the network parameters, and SGD with a learning rate of 1e-2 for the prior distribution logits. The optimization is run for 10^6 steps with mini-batches of size 50. For RELAX, the scaling factor is initialized to 1, the temperature to 0.1, and the control variate is a neural network with one hidden layer of 137 units using Leaky ReLU activations. The only data preprocessing involves subtracting the global mean of the dataset from each image before it is input to the encoder. (A configuration sketch encoding these stated settings follows the table.) |
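The pseudocode row above names two antithetic copula samplers. Below is a minimal NumPy/SciPy sketch of the Gaussian-copula idea (Algorithm 2): K jointly Gaussian draws with the most negative exchangeable correlation, -1/(K-1), are mapped to uniforms with the normal CDF and thresholded at sigmoid(logits), which keeps the Bernoulli marginals while making the samples negatively correlated. This is an illustration of the technique, not the authors' reference implementation (see the linked repository for that); the function name and the jitter constant are our own choices.

```python
import numpy as np
from scipy.stats import norm


def antithetic_gaussian_copula_bernoulli(logits, num_samples, rng=None):
    """Hypothetical helper: K negatively correlated Bernoulli draws per logit.

    Gaussian copula sketch: jointly Gaussian variables with pairwise
    correlation -1/(K-1) are mapped to uniforms via the normal CDF, then
    thresholded at sigmoid(logits), preserving the Bernoulli marginals.
    """
    rng = np.random.default_rng() if rng is None else rng
    K = num_samples
    rho = -1.0 / (K - 1)                                  # most negative exchangeable correlation
    cov = np.full((K, K), rho) + (1.0 - rho) * np.eye(K)  # unit variance, rho off-diagonal
    cov += 1e-9 * np.eye(K)                               # jitter: the limiting matrix is singular
    chol = np.linalg.cholesky(cov)
    z = chol @ rng.standard_normal((K, logits.shape[-1]))  # K correlated Gaussians per dimension
    u = norm.cdf(z)                                        # copula step: correlated uniforms
    p = 1.0 / (1.0 + np.exp(-logits))                      # Bernoulli probabilities
    return (u < p).astype(np.float64)                      # antithetic binary samples


# Toy usage: 4 antithetic samples of a 10-dimensional binary latent.
samples = antithetic_gaussian_copula_bernoulli(np.zeros(10), num_samples=4)
```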
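The experiment-setup row lists concrete hyperparameters but, as the software-dependencies row notes, the paper does not pin down a framework or version (the VAE experiments build on the DisARM code). The PyTorch snippet below merely transcribes the stated settings under assumptions of our own: the module name, the 784-dimensional input, and the 200-dimensional binary latent are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn

# Illustrative only: module name, input size (784), and latent size (200)
# are assumptions; the paper's own experiments build on the DisARM codebase.
class NonlinearEncoder(nn.Module):
    def __init__(self, in_dim=784, hidden=200, latent=200):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.LeakyReLU(0.3),  # Leaky ReLU coefficient 0.3
            nn.Linear(hidden, hidden), nn.LeakyReLU(0.3),  # two hidden layers of 200 units
            nn.Linear(hidden, latent),                     # logits of the binary latent
        )

    def forward(self, x, data_mean):
        # Only stated preprocessing: subtract the global dataset mean.
        return self.net(x - data_mean)


encoder = NonlinearEncoder()
prior_logits = torch.zeros(200, requires_grad=True)
net_opt = torch.optim.Adam(encoder.parameters(), lr=1e-4)  # Adam, lr 1e-4, network parameters
prior_opt = torch.optim.SGD([prior_logits], lr=1e-2)       # SGD, lr 1e-2, prior logits
# Training then runs for 10**6 steps with mini-batches of 50 images.
```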