Breaking the Activation Function Bottleneck through Adaptive Parameterization

Authors: Sebastian Flennerhag, Hujun Yin, John Keane, Mark Elliot

NeurIPS 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | 5 experiments. We compare the behavior of a model with adaptive feed-forward layers to standard feed-forward baselines in a controlled regression problem and on MNIST (LeCun et al., 1998). The aLSTM is tested on the Penn Treebank and WikiText-2 word modeling tasks. We use the ADAM optimizer (Kingma & Ba, 2015) unless otherwise stated. Table 1: Train and test set accuracy on MNIST. Table 2: Validation and test set perplexities on Penn Treebank. (A hedged sketch of an adaptively parameterized layer follows the table.)
Researcher Affiliation | Academia | Sebastian Flennerhag (1,2), Hujun Yin (1,2), John Keane (1), Mark Elliot (1); 1: University of Manchester, 2: The Alan Turing Institute. sflennerhag@turing.ac.uk, {hujun.yin, john.keane, mark.elliot}@manchester.ac.uk
Pseudocode | No | No pseudocode or algorithm blocks were found.
Open Source Code | Yes | Code available at https://github.com/flennerhag/alstm.
Open Datasets | Yes | The Penn Treebank corpus (PTB; Marcus et al., 1993; Mikolov et al., 2010) is a widely used benchmark for language modeling. WikiText-2 (WT2; Merity et al., 2017) is a corpus curated from Wikipedia articles with lighter processing than PTB.
Dataset Splits | Yes | We evaluate the aLSTM on word-level modeling following standard practice in training setup (e.g. Zaremba et al., 2015). Table 2: Validation and test set perplexities on Penn Treebank.
Hardware Specification | No | The paper does not provide specific hardware details (GPU/CPU models, memory, etc.) used for running its experiments.
Software Dependencies | No | The paper mentions using the ADAM optimizer but does not specify version numbers for any software dependencies such as Python, deep learning frameworks (e.g., TensorFlow, PyTorch), or other libraries.
Experiment Setup | Yes | We train all models with Stochastic Gradient Descent with a learning rate of 0.001, a batch size of 128, and train for 50,000 steps. We use the ADAM optimizer (Kingma & Ba, 2015) unless otherwise stated. For details on hyper-parameters, see the supplementary material. We fix the number of layers to 2, though more layers tend to perform better, and use a policy latent variable size of 100. Gradient clipping is required and dropout rates must be reduced by ~25%. (A hedged configuration sketch based on these values follows the table.)
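For orientation, the sketch below illustrates the general idea behind the "adaptive feed-forward layers" quoted in the Research Type row: a small adaptation policy generates per-unit parameters (here a scale and shift) that modulate the layer's activation, rather than using a fixed activation. The class name, the way the policy conditions on the input, the tanh nonlinearity, and the latent size default are assumptions for illustration only, not the authors' exact formulation; the reference implementation is at https://github.com/flennerhag/alstm.

```python
# Hypothetical sketch of an adaptively parameterized feed-forward layer.
# The structure (a policy network emitting per-unit scale and shift for the
# activation) is an assumption for illustration; the paper's layers and aLSTM
# may differ in detail.
import torch
import torch.nn as nn


class AdaptiveDense(nn.Module):
    def __init__(self, in_features, out_features, latent_size=100):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)
        # Adaptation policy: maps the input to a latent code, then to
        # per-unit (scale, shift) parameters for the activation.
        self.policy = nn.Sequential(
            nn.Linear(in_features, latent_size),
            nn.Tanh(),
            nn.Linear(latent_size, 2 * out_features),
        )

    def forward(self, x):
        pre = self.linear(x)
        scale, shift = self.policy(x).chunk(2, dim=-1)
        # The activation's parameters are generated per input, not fixed.
        return torch.tanh(scale * pre + shift)


if __name__ == "__main__":
    layer = AdaptiveDense(in_features=784, out_features=256)
    y = layer(torch.randn(128, 784))
    print(y.shape)  # torch.Size([128, 256])
```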
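The hyper-parameters quoted in the Experiment Setup row can be collected into a small configuration and training skeleton. Only the learning rate, batch size, step count, layer count, and policy latent size come from the quoted text; the model class, vocabulary and embedding sizes, clip norm, and dropout rate below are placeholders, since the paper defers those details to its supplementary material.

```python
# Training skeleton reflecting the quoted setup: ADAM, learning rate 0.001,
# batch size 128, 50,000 steps, 2 layers, policy latent size 100, gradient
# clipping, and reduced dropout. The model is a plain LSTM stand-in, not the
# aLSTM; vocabulary size, embedding size, clip norm, and dropout are placeholders.
import torch
from torch import nn, optim

config = {
    "learning_rate": 1e-3,  # quoted
    "batch_size": 128,      # quoted
    "train_steps": 50_000,  # quoted
    "num_layers": 2,        # quoted
    "latent_size": 100,     # quoted policy latent variable size
    "clip_norm": 0.25,      # placeholder: value not given in the quoted text
    "dropout": 0.4,         # placeholder: paper only says rates drop by ~25%
}

vocab_size, embed_dim = 10_000, 200  # placeholders (PTB-scale vocabulary)


class WordModel(nn.Module):
    """Stand-in 2-layer recurrent word model used only to exercise the setup."""

    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.LSTM(embed_dim, embed_dim,
                           num_layers=config["num_layers"],
                           dropout=config["dropout"])
        self.decode = nn.Linear(embed_dim, vocab_size)

    def forward(self, tokens, hidden=None):
        out, hidden = self.rnn(self.embed(tokens), hidden)
        return self.decode(out), hidden


model = WordModel()
optimizer = optim.Adam(model.parameters(), lr=config["learning_rate"])
criterion = nn.CrossEntropyLoss()


def training_step(tokens, targets, hidden=None):
    """One step: forward, cross-entropy loss, backprop, clip gradients, update."""
    optimizer.zero_grad()
    logits, hidden = model(tokens, hidden)
    loss = criterion(logits.reshape(-1, vocab_size), targets.reshape(-1))
    loss.backward()
    nn.utils.clip_grad_norm_(model.parameters(), config["clip_norm"])
    optimizer.step()
    return loss.item()
```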