An Exploration of Softmax Alternatives Belonging to the Spherical Loss Family
Authors: Alexandre de Brébisson, Pascal Vincent
ICLR 2016
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we explore several loss functions from this family as possible alternatives to the traditional log-softmax. In particular, we focus our investigation on spherical bounds of the log-softmax loss and on two spherical log-likelihood losses, namely the log-Spherical Softmax suggested by Vincent et al. (2015) and the log-Taylor Softmax that we introduce. Although these alternatives do not yield as good results as the log-softmax loss on two language modeling tasks, they surprisingly outperform it in our experiments on MNIST and CIFAR10, suggesting that they might be relevant in a broad range of applications. |
| Researcher Affiliation | Academia | Alexandre de Brébisson and Pascal Vincent, MILA, Département d'Informatique et de Recherche Opérationnelle, University of Montréal. alexandre.de.brebisson@umontreal.ca, vincentp@iro.umontreal.ca |
| Pseudocode | No | The paper provides mathematical derivations and descriptions of algorithms in prose, but it does not include any formally labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | No | The paper does not provide any statement or link indicating that the source code for the described methodology is publicly available. |
| Open Datasets | Yes | In this Section we compare the log-softmax and different spherical alternatives on several tasks: MNIST, CIFAR10/100 and a language modeling task on the Penn Treebank and the One Billion Word datasets. |
| Dataset Splits | Yes | We used early stopping on the validation dataset as our stopping criterion. |
| Hardware Specification | No | The paper describes the experimental setup and training process but does not provide specific details about the hardware used, such as GPU models, CPU types, or memory specifications. |
| Software Dependencies | No | The paper mentions techniques like 'rectifiers' and 'Nesterov momentum' but does not list any specific software dependencies or libraries with version numbers (e.g., Python, TensorFlow, PyTorch versions). |
| Experiment Setup | Yes | The networks were trained with minibatches of size 200, a Nesterov momentum (Sutskever et al. (2013)) of 0.9 and a decaying learning rate. The initial learning rate is the only hyperparameter that we tuned individually for each loss. We used early stopping on the validation dataset as our stopping criterion. |
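
The losses named in the abstract excerpt above have simple closed forms: the spherical softmax replaces exp(z_i) in the softmax numerator with z_i² (plus a small constant), and the Taylor softmax replaces it with the second-order Taylor expansion 1 + z_i + z_i²/2, which is strictly positive. The following is a minimal NumPy sketch of the resulting negative log-likelihoods, not the authors' code; the function names, the epsilon value, and the toy usage at the end are assumptions made for illustration.

```python
import numpy as np

def log_spherical_softmax_loss(z, t, eps=1e-8):
    """Negative log-likelihood under the spherical softmax:
    p_i = (z_i**2 + eps) / sum_j (z_j**2 + eps)."""
    num = z**2 + eps
    p = num / num.sum()
    return -np.log(p[t])

def log_taylor_softmax_loss(z, t):
    """Negative log-likelihood under the Taylor softmax:
    p_i proportional to 1 + z_i + z_i**2 / 2, which is always positive."""
    num = 1.0 + z + 0.5 * z**2
    p = num / num.sum()
    return -np.log(p[t])

def log_softmax_loss(z, t):
    """Standard log-softmax (cross-entropy) baseline for comparison."""
    z = z - z.max()  # subtract the max for numerical stability
    return -(z[t] - np.log(np.exp(z).sum()))

# Toy comparison on random scores for a 10-class problem.
rng = np.random.default_rng(0)
z = rng.normal(size=10)
t = 3
print(log_softmax_loss(z, t),
      log_spherical_softmax_loss(z, t),
      log_taylor_softmax_loss(z, t))
```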
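
The experiment-setup excerpt fixes the minibatch size (200), the Nesterov momentum (0.9), a decaying learning rate whose initial value was tuned per loss, and early stopping on the validation set. The paper does not name a framework, so the sketch below uses PyTorch with synthetic stand-in data; the model, the exponential decay factor, the patience, and the epoch count are illustrative assumptions, while the batch size, momentum, and validation-based early stopping follow the quoted setup.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in data (the paper uses MNIST, CIFAR and language modeling corpora).
X = torch.randn(2000, 784)
y = torch.randint(0, 10, (2000,))
train_loader = DataLoader(TensorDataset(X[:1600], y[:1600]), batch_size=200, shuffle=True)
val_loader = DataLoader(TensorDataset(X[1600:], y[1600:]), batch_size=200)

model = torch.nn.Sequential(torch.nn.Linear(784, 256), torch.nn.ReLU(), torch.nn.Linear(256, 10))
loss_fn = torch.nn.CrossEntropyLoss()  # stand-in; the paper also swaps in spherical losses here

# Nesterov momentum of 0.9 as stated; the initial learning rate is the one
# hyperparameter the authors report tuning individually for each loss.
opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, nesterov=True)
sched = torch.optim.lr_scheduler.ExponentialLR(opt, gamma=0.99)  # assumed decay schedule

best_val, bad_epochs, patience = float("inf"), 0, 5
for epoch in range(50):
    model.train()
    for xb, yb in train_loader:
        opt.zero_grad()
        loss_fn(model(xb), yb).backward()
        opt.step()
    sched.step()

    # Early stopping on the validation set, as stated in the paper.
    model.eval()
    with torch.no_grad():
        val = sum(loss_fn(model(xb), yb).item() for xb, yb in val_loader) / len(val_loader)
    if val < best_val - 1e-4:
        best_val, bad_epochs = val, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break
```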