Autoencoding Variational Inference For Topic Models
Authors: Akash Srivastava, Charles Sutton
ICLR 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We run experiments on both the 20 Newsgroups (11,000 training instances with a 2,000-word vocabulary) and RCV1 Volume 2 (800K training instances with a 10,000-word vocabulary) datasets. Our preprocessing involves tokenization, removal of some non-UTF-8 characters for 20 Newsgroups, and English stop word removal. We first compare our AVITM inference method with the standard online mean-field variational inference (Hoffman et al., 2010) and collapsed Gibbs sampling (Griffiths & Steyvers, 2004) on the LDA model. |
| Researcher Affiliation | Academia | Akash Srivastava, Informatics Forum, University of Edinburgh, 10 Crichton St, Edinburgh EH8 9AB, UK (akash.srivastava@ed.ac.uk); Charles Sutton, Informatics Forum, University of Edinburgh, 10 Crichton St, Edinburgh EH8 9AB, UK (csutton@inf.ed.ac.uk). Additional affiliation: Alan Turing Institute, British Library, 96 Euston Road, London NW1 2DB. |
| Pseudocode | Yes | Algorithm 1: LDA as a generative model. For each document w do: draw topic distribution θ ∼ Dirichlet(α); for each word at position n do: sample topic z_n ∼ Multinomial(1, θ); sample word w_n ∼ Multinomial(1, β_{z_n}); end; end. (A runnable sketch of this generative process follows the table.) |
| Open Source Code | Yes | Code available at https://github.com/akashgit/autoencoding_vi_for_topic_models |
| Open Datasets | Yes | We run experiments on both the 20 Newsgroups (11,000 training instances with a 2,000-word vocabulary) and RCV1 Volume 2 (800K training instances with a 10,000-word vocabulary) datasets. |
| Dataset Splits | No | The paper states training instances but does not provide explicit validation dataset splits (e.g., percentages or counts for a validation set). |
| Hardware Specification | No | The paper mentions training on 'a single GPU' but does not specify the model or any other specific hardware components like CPU, memory, or cloud instance types. |
| Software Dependencies | No | The paper mentions using 'scikit-learn for DMFVI and mallet (McCallum, 2002) for collapsed Gibbs' and the 'ADAM optimizer (Kingma & Ba, 2015)', but does not provide specific version numbers for these software components. |
| Experiment Setup | Yes | Specifically, we train the network with the ADAM optimizer (Kingma & Ba, 2015) using a high moment weight (β1) and learning rate (η). Training at higher rates helps avoid early peaks in the functional space. The problem is that momentum-based training coupled with a high learning rate causes the optimizer to diverge. While explicit gradient clipping helps to a certain extent, we found that batch normalization (Ioffe & Szegedy, 2015) does even better by smoothing out the functional space and hence curbing sudden divergence. Finally, we also found an increase in performance with dropout units when applied to θ to force the network to use more of its capacity. (For both parameters, the precise value was chosen by Bayesian optimization. We found that these values in the with-BN cases were close to the default settings in the ADAM optimizer.) A hedged code sketch of these training choices also follows the table. |
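
The pseudocode row above quotes Algorithm 1, the LDA generative model. As a minimal, hedged illustration, the NumPy sketch below samples one document from that process; the topic count `K`, vocabulary size `V`, document length `doc_len`, and the Dirichlet hyperparameters are illustrative placeholders, not values taken from the paper.

```python
# Minimal sketch of the LDA generative process in Algorithm 1 (NumPy only).
# K, V, doc_len, alpha, and beta are illustrative placeholders.
import numpy as np

rng = np.random.default_rng(0)

K, V, doc_len = 5, 2000, 50                      # topics, vocabulary size, words per document
alpha = np.full(K, 0.1)                          # symmetric Dirichlet prior over topic proportions
beta = rng.dirichlet(np.full(V, 0.01), size=K)   # K topic-word distributions (rows sum to 1)

def generate_document():
    """Draw one document following Algorithm 1."""
    theta = rng.dirichlet(alpha)          # θ ~ Dirichlet(α)
    words = []
    for _ in range(doc_len):
        z = rng.choice(K, p=theta)        # z_n ~ Multinomial(1, θ)
        w = rng.choice(V, p=beta[z])      # w_n ~ Multinomial(1, β_{z_n})
        words.append(w)
    return words

doc = generate_document()                 # list of word indices for one synthetic document
```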
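The experiment-setup row quotes the paper's training choices: ADAM with a high momentum weight β1 and learning rate, batch normalization to curb divergence, and dropout applied to θ. The sketch below, written in PyTorch purely for illustration, shows one way those pieces could fit together; the layer sizes, `lr=0.002`, `betas=(0.99, 0.999)`, and the dropout rate are assumptions, and the full AVITM inference network (logistic-normal reparameterization, decoder, and ELBO) is omitted.

```python
# Hedged sketch of the quoted training choices (high-momentum ADAM, batch norm,
# dropout on θ). All hyperparameter values here are assumptions, not the paper's.
import torch
import torch.nn as nn

V, H, K = 2000, 100, 50           # vocabulary size, hidden units, topics (illustrative)

encoder = nn.Sequential(
    nn.Linear(V, H), nn.Softplus(),
    nn.Linear(H, H), nn.Softplus(),
    nn.Linear(H, K),
    nn.BatchNorm1d(K),            # batch normalization, per the quoted setup
)
drop_theta = nn.Dropout(p=0.2)    # dropout applied to θ (rate is an assumption)

optimizer = torch.optim.Adam(
    encoder.parameters(),
    lr=0.002,                     # "high" learning rate (assumed value)
    betas=(0.99, 0.999),          # high moment weight β1 (assumed value)
)

bow = torch.rand(32, V)                                   # fake batch of bag-of-words vectors
theta = drop_theta(torch.softmax(encoder(bow), dim=1))    # document-topic proportions θ
```

In this sketch the batch-norm layer sits on the encoder output and dropout is applied after the softmax that produces θ, matching the quote's description of dropout "applied to θ"; where exactly these layers sit in the released implementation should be checked against the linked repository.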