Augment and Reduce: Stochastic Inference for Large Categorical Distributions

Authors: Francisco Ruiz, Michalis Titsias, Adji Bousso Dieng, David Blei

ICML 2018

Reproducibility Variable | Result | LLM Response (supporting evidence)
Research Type | Experimental | "On several large-scale classification problems, we show that A&R provides a tighter bound on the marginal likelihood and has better predictive performance than existing approaches. We study A&R on linear classification tasks with up to 10^4 classes. On simulated and real data, we find that it provides accurate estimates of the categorical probabilities and gives better performance than existing approaches." (Section 4, Experiments)
Researcher Affiliation | Academia | "1 University of Cambridge. 2 Columbia University. 3 Athens University of Economics and Business. Correspondence to: Francisco J. R. Ruiz <f.ruiz@eng.cam.ac.uk, f.ruiz@columbia.edu>."
Pseudocode | Yes | "Algorithm 1 Softmax A&R for classification." "Algorithm 2 General A&R for classification."
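Algorithms 1 and 2 themselves are not reproduced in this report. As a rough, self-contained illustration of the idea they build on — estimating the softmax normalizer from a random subset of classes instead of summing over all K — here is a sampled-softmax-style sketch. This is a standard estimator, not the paper's exact A&R bound, and all names below are ours:

```python
import math
import random

def log_softmax_exact(logits, y):
    # Exact log p(y | x): requires the full sum over all K classes.
    return logits[y] - math.log(sum(math.exp(f) for f in logits))

def log_softmax_subsampled(logits, y, num_neg, rng):
    # "Reduce"-style estimate: draw a subset S of negative classes and
    # scale their contribution by (K - 1) / |S|, so that the normalizer
    # estimate z_hat is unbiased.  (Sampled-softmax sketch, not the
    # paper's exact A&R bound.)
    K = len(logits)
    S = rng.sample([k for k in range(K) if k != y], num_neg)
    z_hat = math.exp(logits[y]) + (K - 1) / num_neg * sum(math.exp(logits[k]) for k in S)
    return logits[y] - math.log(z_hat)

rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(1000)]  # K = 1000 classes
exact = log_softmax_exact(logits, y=3)
approx = sum(log_softmax_subsampled(logits, 3, 50, rng) for _ in range(200)) / 200
```

Note that because E[log z_hat] ≤ log z by Jensen's inequality, plugging an unbiased normalizer estimate into the log does not yield a lower bound on log p(y); the augmentation step of A&R is what makes class subsampling compatible with a proper lower bound on the marginal likelihood.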
Open Source Code | Yes | "Code for A&R is available at https://github.com/franrruiz/augment-reduce."
Open Datasets | Yes | "We consider MNIST and Bibtex (Katakis et al., 2008; Prabhu & Varma, 2014), where we can compare against the exact softmax. We also analyze Omniglot (Lake et al., 2015), EURLex-4K (Mencia & Furnkranz, 2008; Bhatia et al., 2015), and AmazonCat-13K (McAuley & Leskovec, 2013). MNIST is available at http://yann.lecun.com/exdb/mnist. Omniglot can be found at https://github.com/brendenlake/omniglot. Bibtex, EURLex-4K, and AmazonCat-13K are available at http://manikvarma.org/downloads/XC/XMLRepository.html."
Dataset Splits | No | Table 1 lists Ntrain and Ntest for all datasets, implying train/test splits, but the paper does not report separate validation-set sizes or percentages for any dataset.
Hardware Specification | No | The paper states "We run each approach on one CPU core" for the synthetic dataset, but does not specify the CPU model, GPU models, memory, or any other hardware details used in the experiments.
Software Dependencies | No | The paper mentions using algorithms such as RMSProp and Adagrad for step-size control, but does not list software dependencies with version numbers (e.g., Python 3.x, PyTorch 1.x, TensorFlow 2.x).
Experiment Setup | Yes | "We initialize the weights and biases randomly, drawing from a Gaussian distribution with zero mean and standard deviation 0.1 (0.001 for the biases). We set the step size using the default parameters, i.e., ρ(t) = ρ0 · t^(−1/2+10^−16) / (1 + √(s(t))), where s(t) = 0.1 (g(t))^2 + 0.9 s(t−1). We set ρ0 = 0.02 and we additionally decrease ρ0 by a factor of 0.9 every 2000 iterations. We set the step size α(t) in Algorithm 1 as α(t) = (1 + t)^(−0.9), the default values suggested by Hoffman et al. (2013). For the step size α(t) in Algorithm 2, we set α(t) = 0.01 (1 + t)^(−0.9). We set the minibatch sizes |B| and |S| beforehand. The specific values for each dataset are also in Table 1."
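For concreteness, the quoted step-size schedules can be sketched in Python. This is a minimal sketch of the formulas as stated: the function names and the convention of passing the running state s(t) explicitly are ours, and t is assumed to start at 1 for ρ(t):

```python
import math

def rho_schedule(t, g, s_prev, rho0=0.02, eps=1e-16):
    """RMSProp-style schedule from the quoted setup:
    s(t) = 0.1 * g(t)^2 + 0.9 * s(t-1)
    rho(t) = rho0 * t^(-1/2 + eps) / (1 + sqrt(s(t)))
    Returns (rho, s) so the caller can carry s(t) forward."""
    s = 0.1 * g * g + 0.9 * s_prev
    rho = rho0 * t ** (-0.5 + eps) / (1.0 + math.sqrt(s))
    return rho, s

def rho0_with_decay(t, rho0=0.02, factor=0.9, every=2000):
    # rho0 is additionally decreased by a factor of 0.9 every 2000 iterations.
    return rho0 * factor ** (t // every)

def alpha_schedule(t, scale=1.0):
    # alpha(t) = scale * (1 + t)^(-0.9): scale = 1.0 for Algorithm 1,
    # scale = 0.01 for Algorithm 2 (values from the quoted setup).
    return scale * (1.0 + t) ** (-0.9)
```

The 10^−16 offset in the exponent is numerically negligible (t^(−1/2+10^−16) evaluates to t^(−1/2) in double precision); it is the standard device for satisfying the Robbins-Monro conditions on paper.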