One-vs-Each Approximation to Softmax for Scalable Estimation of Probabilities
Authors: Michalis Titsias (RC AUEB)
NeurIPS 2016
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show that the new bound has interesting theoretical properties and we demonstrate its use in classification problems. Figure 1 shows some estimated softmax probabilities, using a dataset of 200 points each taking one out of ten values... Here, we consider AMAZONCAT-13K... which is a large scale classification dataset. |
| Researcher Affiliation | Academia | Michalis K. Titsias, Department of Informatics, Athens University of Economics and Business, mtitsias@aueb.gr |
| Pseudocode | No | The paper provides mathematical derivations and explanations but does not include pseudocode or an algorithm block (a hedged sketch of the bound appears after this table). |
| Open Source Code | No | The paper does not mention providing access to source code for the described methodology. |
| Open Datasets | Yes | MNIST2, 20NEWS3 and BIBTEX [12]; see Table 1 for details. (Footnotes 2, 3, 4 provide URLs: 2http://yann.lecun.com/exdb/mnist, 3http://qwone.com/~jason/20Newsgroups/, 4http://research.microsoft.com/en-us/um/people/manik/downloads/XC/XMLRepository.html). [12] Ioannis Katakis, Grigorios Tsoumakas, and Ioannis Vlahavas. Multilabel text classification for automated tag suggestion. In Proceedings of the ECML/PKDD-08 Workshop on Discovery Challenge, 2008. |
| Dataset Splits | No | Table 1 provides 'Training examples' and 'Test examples' for the datasets, but it does not explicitly mention or quantify a separate 'validation' split. |
| Hardware Specification | No | The paper mentions that 'full training is completed in just 26 minutes in a stand-alone PC' but does not provide specific hardware details such as CPU/GPU models, memory, or cloud instance types. |
| Software Dependencies | No | The paper does not specify any software dependencies with version numbers (e.g., programming languages, libraries, frameworks). |
| Experiment Setup | Yes | We consider minibatches of size ten to approximate the sum ∑_n and subsets of remaining classes of size one to approximate ∑_{m≠y_n}. We used a learning rate initialized to 0.5/b (and then decrease it by a factor of 0.9 after each epoch) and performed 2 × 10⁵ iterations. We applied OVE-SGD where at each stochastic gradient update we consider a single training instance (i.e. the minibatch size was one) and for that instance we randomly select five remaining classes. We used a very small learning rate having value 10⁻⁸ and we performed five epochs across the full dataset, that is we performed in total 5 × 1186239 stochastic gradient updates. After each epoch we halve the value of the learning rate before the next epoch starts. (A hedged code sketch of this update scheme follows the table.) |
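
Since the paper ships neither pseudocode nor source code, the following is a minimal NumPy sketch of the paper's one-vs-each (OVE) bound, which lower-bounds the softmax probability by a product of pairwise sigmoids: p(y=k) ≥ ∏_{m≠k} σ(f_k − f_m), i.e. log p(y=k) ≥ −∑_{m≠k} log(1 + exp(f_m − f_k)). The function names and the toy logits are our own, not from the paper:

```python
import numpy as np

def log_softmax(f, k):
    """Exact log softmax probability of class k given logit vector f."""
    return f[k] - np.logaddexp.reduce(f)

def ove_lower_bound(f, k):
    """One-vs-each lower bound on log p(y=k):
    log p(y=k) >= -sum_{m != k} log(1 + exp(f_m - f_k))."""
    diffs = np.delete(f, k) - f[k]          # f_m - f_k for all m != k
    return -np.logaddexp(0.0, diffs).sum()  # sum of log(1 + exp(.))

f = np.array([2.0, 0.5, -1.0, 0.3])
print(log_softmax(f, 0), ove_lower_bound(f, 0))  # bound <= exact
```

For any logit vector the bound never exceeds the exact log-softmax value, and the two coincide when there are only two classes, since σ(f_k − f_m) is then exactly the softmax probability.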
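
The AMAZONCAT-13K setup quoted above (minibatch size one, five sampled remaining classes, learning rate 10⁻⁸ halved after each epoch) suggests the following sketch of an OVE-SGD update for linear scores f_k = w_k·x. The (K−1)/5 weighting that makes the sampled sum an unbiased estimate of ∑_{m≠y_n}, and all names below, are our reconstruction rather than the authors' code:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ove_sgd(X, y, K, n_negatives=5, lr=1e-8, epochs=5):
    """OVE-SGD sketch: one training instance per update, a few sampled
    negative classes, learning rate halved after each epoch."""
    N, D = X.shape
    W = np.zeros((K, D))
    scale = (K - 1) / n_negatives        # unbiased estimate of the sum over m != y_n
    for epoch in range(epochs):
        for n in rng.permutation(N):
            x, k = X[n], y[n]
            # Sample negative classes m != k without replacement.
            negs = rng.choice(K - 1, size=n_negatives, replace=False)
            negs = negs + (negs >= k)    # shift indices to skip the true class k
            s = sigmoid(W[negs] @ x - W[k] @ x)   # sigma(f_m - f_k)
            # Ascend the OVE bound: d/df_k of -log(1+exp(f_m-f_k)) is sigma(f_m-f_k).
            W[k] += lr * scale * s.sum() * x
            W[negs] -= lr * scale * s[:, None] * x
        lr *= 0.5
    return W
```

Each update touches only the true class and the five sampled classes, which is what makes the scheme tractable on a 13K-class problem like AMAZONCAT-13K.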