One-vs-Each Approximation to Softmax for Scalable Estimation of Probabilities

Authors: Michalis K. Titsias (Athens University of Economics and Business)

NeurIPS 2016

Reproducibility Variable Result LLM Response
Research Type | Experimental | We show that the new bound has interesting theoretical properties and we demonstrate its use in classification problems. Figure 1 shows some estimated softmax probabilities, using a dataset of 200 points each taking one out of ten values... Here, we consider AMAZONCAT-13K... which is a large scale classification dataset. (A sketch of the one-vs-each bound itself is given after the table.)
Researcher Affiliation | Academia | Michalis K. Titsias, Department of Informatics, Athens University of Economics and Business, mtitsias@aueb.gr
Pseudocode | No | The paper provides mathematical derivations and explanations but does not include pseudocode or an algorithm block.
Open Source Code | No | The paper does not mention providing access to source code for the described methodology.
Open Datasets | Yes | "MNIST, 20NEWS and BIBTEX [12]; see Table 1 for details." Footnotes 2, 3 and 4 provide URLs: http://yann.lecun.com/exdb/mnist, http://qwone.com/~jason/20Newsgroups/, and http://research.microsoft.com/en-us/um/people/manik/downloads/XC/XMLRepository.html. [12] Ioannis Katakis, Grigorios Tsoumakas, and Ioannis Vlahavas. Multilabel text classification for automated tag suggestion. In Proceedings of the ECML/PKDD-08 Workshop on Discovery Challenge, 2008.
Dataset Splits | No | Table 1 provides 'Training examples' and 'Test examples' for the datasets, but it does not explicitly mention or quantify a separate validation split.
Hardware Specification | No | The paper mentions that 'full training is completed in just 26 minutes in a stand-alone PC' but does not provide specific hardware details such as CPU/GPU models, memory, or cloud instance types.
Software Dependencies | No | The paper does not specify any software dependencies with version numbers (e.g., programming languages, libraries, frameworks).
Experiment Setup | Yes | "We consider minibatches of size ten to approximate the sum ∑_n and subsets of remaining classes of size one to approximate ∑_{m≠y_n}. We used a learning rate initialized to 0.5/b (and then decrease it by a factor of 0.9 after each epoch) and performed 2 × 10^5 iterations. We applied OVE-SGD where at each stochastic gradient update we consider a single training instance (i.e. the minibatch size was one) and for that instance we randomly select five remaining classes. We used a very small learning rate having value 10^−8 and we performed five epochs across the full dataset, that is we performed in total 5 × 1186239 stochastic gradient updates. After each epoch we halve the value of the learning rate before the next epoch starts." (A sketch of such an update rule follows below.)
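
The Research Type row above quotes the paper's claim about the new bound. The one-vs-each idea is to lower-bound the softmax probability of a class by a product of pairwise sigmoids, p(y = k) ≥ ∏_{m ≠ k} σ(f_k − f_m). Below is a minimal NumPy sketch of that bound next to the exact softmax; the function names and toy logits are illustrative, not code from the paper.

```python
import numpy as np

def softmax(logits):
    # Exact softmax probabilities (numerically stabilised).
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def ove_bound(logits, k):
    # One-vs-each lower bound on the softmax probability of class k:
    # p(y = k) >= prod_{m != k} sigmoid(f_k - f_m).
    diffs = logits[k] - np.delete(logits, k)
    return np.prod(1.0 / (1.0 + np.exp(-diffs)))

logits = np.array([2.0, 0.5, -1.0, 0.3])
print(softmax(logits)[0])    # exact probability of class 0
print(ove_bound(logits, 0))  # the bound never exceeds the exact probability
```

Because every factor σ(f_k − f_m) only involves one pair of classes, the log of this bound decomposes into pairwise terms, which is what makes the subsampled stochastic training in the Experiment Setup row possible.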
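
For the Experiment Setup row, the OVE-SGD procedure it quotes (one training instance per update, a few randomly sampled remaining classes, a learning rate halved after each epoch) can be sketched for a linear model as follows. The linear parameterisation, the (K − 1)/num_neg rescaling of the subsampled sum, and all sizes and learning rates here are illustrative assumptions, not values or code from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def ove_sgd_step(W, x, y, lr, num_neg=5):
    # One stochastic ascent step on the one-vs-each objective for a single
    # example (x, y): sample num_neg of the remaining classes and increase
    # log sigmoid(w_y . x - w_m . x) for each sampled class m.
    K = W.shape[0]
    neg = rng.choice([m for m in range(K) if m != y], size=num_neg, replace=False)
    scale = (K - 1) / num_neg          # rescale the subsampled sum over classes
    f = W @ x                          # class scores f_m = w_m . x
    for m in neg:
        s = 1.0 / (1.0 + np.exp(-(f[y] - f[m])))  # sigmoid of the score gap
        g = scale * (1.0 - s)                      # gradient of log sigmoid w.r.t. the gap
        W[y] += lr * g * x
        W[m] -= lr * g * x
    return W

# Toy usage: 10 classes, 20 features, learning rate halved after each epoch.
W = np.zeros((10, 20))
lr = 1e-2
for epoch in range(5):
    for _ in range(100):
        x = rng.normal(size=20)
        y = int(rng.integers(10))
        W = ove_sgd_step(W, x, y, lr)
    lr *= 0.5
```

With the settings quoted in the table this would correspond to five sampled remaining classes, a 10^−8 learning rate, and five passes over the 1186239 AMAZONCAT-13K training instances.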