An Exploration of Softmax Alternatives Belonging to the Spherical Loss Family
Authors: Alexandre de Brébisson, Pascal Vincent
ICLR 2016
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we explore several loss functions from this family as possible alternatives to the traditional log-softmax. In particular, we focus our investigation on spherical bounds of the log-softmax loss and on two spherical log-likelihood losses, namely the log-Spherical Softmax suggested by Vincent et al. (2015) and the log-Taylor Softmax that we introduce. Although these alternatives do not yield as good results as the log-softmax loss on two language modeling tasks, they surprisingly outperform it in our experiments on MNIST and CIFAR10, suggesting that they might be relevant in a broad range of applications. |
| Researcher Affiliation | Academia | Alexandre de Brébisson and Pascal Vincent, MILA, Département d'Informatique et de Recherche Opérationnelle, University of Montréal. alexandre.de.brebisson@umontreal.ca, vincentp@iro.umontreal.ca |
| Pseudocode | No | The paper provides mathematical derivations and descriptions of algorithms in prose, but it does not include any formally labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | No | The paper does not provide any statement or link indicating that the source code for the described methodology is publicly available. |
| Open Datasets | Yes | In this Section we compare the log-softmax and different spherical alternatives on several tasks: MNIST, CIFAR10/100 and a language modeling task on the Penn Treebank and the One Billion Word datasets. |
| Dataset Splits | Yes | We used early stopping on the validation dataset as our stopping criterion. |
| Hardware Specification | No | The paper describes the experimental setup and training process but does not provide specific details about the hardware used, such as GPU models, CPU types, or memory specifications. |
| Software Dependencies | No | The paper mentions techniques like 'rectifiers' and 'Nesterov momentum' but does not list any specific software dependencies or libraries with version numbers (e.g., Python, TensorFlow, PyTorch versions). |
| Experiment Setup | Yes | The networks were trained with minibatches of size 200, a Nesterov momentum (Sutskever et al. (2013)) of 0.9 and a decaying learning rate. The initial learning rate is the only hyperparameter that we tuned individually for each loss. We used early stopping on the validation dataset as our stopping criterion. |
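
The losses named in the abstract excerpt above have simple closed forms: the spherical softmax replaces exp(z_i) in the softmax numerator with z_i² (plus a small constant), and the Taylor softmax replaces it with the second-order Taylor expansion 1 + z_i + z_i²/2, which is strictly positive. The following is a minimal NumPy sketch of the resulting negative log-likelihoods, not the authors' code; the function names, the epsilon value, and the toy usage at the end are assumptions made for illustration.

```python
import numpy as np

def log_spherical_softmax_loss(z, t, eps=1e-8):
    """Negative log-likelihood under the spherical softmax:
    p_i = (z_i**2 + eps) / sum_j (z_j**2 + eps)."""
    num = z**2 + eps
    p = num / num.sum()
    return -np.log(p[t])

def log_taylor_softmax_loss(z, t):
    """Negative log-likelihood under the Taylor softmax:
    p_i proportional to 1 + z_i + z_i**2 / 2, which is always positive."""
    num = 1.0 + z + 0.5 * z**2
    p = num / num.sum()
    return -np.log(p[t])

def log_softmax_loss(z, t):
    """Standard log-softmax (cross-entropy) baseline for comparison."""
    z = z - z.max()  # subtract the max for numerical stability
    return -(z[t] - np.log(np.exp(z).sum()))

# Toy comparison on random scores for a 10-class problem.
rng = np.random.default_rng(0)
z = rng.normal(size=10)
t = 3
print(log_softmax_loss(z, t),
      log_spherical_softmax_loss(z, t),
      log_taylor_softmax_loss(z, t))
```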
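
The experiment-setup excerpt fixes the minibatch size (200), the Nesterov momentum (0.9), a decaying learning rate whose initial value was tuned per loss, and early stopping on the validation set. The paper does not name a framework, so the sketch below uses PyTorch with synthetic stand-in data; the model, the exponential decay factor, the patience, and the epoch count are illustrative assumptions, while the batch size, momentum, and validation-based early stopping follow the quoted setup.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in data (the paper uses MNIST, CIFAR and language modeling corpora).
X = torch.randn(2000, 784)
y = torch.randint(0, 10, (2000,))
train_loader = DataLoader(TensorDataset(X[:1600], y[:1600]), batch_size=200, shuffle=True)
val_loader = DataLoader(TensorDataset(X[1600:], y[1600:]), batch_size=200)

model = torch.nn.Sequential(torch.nn.Linear(784, 256), torch.nn.ReLU(), torch.nn.Linear(256, 10))
loss_fn = torch.nn.CrossEntropyLoss()  # stand-in; the paper also swaps in spherical losses here

# Nesterov momentum of 0.9 as stated; the initial learning rate is the one
# hyperparameter the authors report tuning individually for each loss.
opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, nesterov=True)
sched = torch.optim.lr_scheduler.ExponentialLR(opt, gamma=0.99)  # assumed decay schedule

best_val, bad_epochs, patience = float("inf"), 0, 5
for epoch in range(50):
    model.train()
    for xb, yb in train_loader:
        opt.zero_grad()
        loss_fn(model(xb), yb).backward()
        opt.step()
    sched.step()

    # Early stopping on the validation set, as stated in the paper.
    model.eval()
    with torch.no_grad():
        val = sum(loss_fn(model(xb), yb).item() for xb, yb in val_loader) / len(val_loader)
    if val < best_val - 1e-4:
        best_val, bad_epochs = val, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break
```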