Breaking the Activation Function Bottleneck through Adaptive Parameterization

Authors: Sebastian Flennerhag, Hujun Yin, John Keane, Mark Elliot

NeurIPS 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | 5 experiments. We compare the behavior of a model with adaptive feed-forward layers to standard feed-forward baselines in a controlled regression problem and on MNIST (LeCun et al., 1998). The aLSTM is tested on the Penn Treebank and WikiText-2 word modeling tasks. We use the ADAM optimizer (Kingma & Ba, 2015) unless otherwise stated. Table 1: Train and test set accuracy on MNIST. Table 2: Validation and test set perplexities on Penn Treebank. (A hedged sketch of an adaptively parameterized layer follows the table.)
Researcher Affiliation | Academia | Sebastian Flennerhag (1,2), Hujun Yin (1,2), John Keane (1), Mark Elliot (1); 1: University of Manchester, 2: The Alan Turing Institute. sflennerhag@turing.ac.uk, {hujun.yin, john.keane, mark.elliot}@manchester.ac.uk
Pseudocode | No | No pseudocode or algorithm blocks were found.
Open Source Code | Yes | Code available at https://github.com/flennerhag/alstm.
Open Datasets | Yes | The Penn Treebank corpus (PTB; Marcus et al., 1993; Mikolov et al., 2010) is a widely used benchmark for language modeling. WikiText-2 (WT2; Merity et al., 2017) is a corpus curated from Wikipedia articles with lighter processing than PTB.
Dataset Splits | Yes | We evaluate the aLSTM on word-level modeling following standard practice in training setup (e.g. Zaremba et al., 2015). Table 2: Validation and test set perplexities on Penn Treebank.
Hardware Specification | No | The paper does not provide specific hardware details (GPU/CPU models, memory, etc.) used for running its experiments.
Software Dependencies | No | The paper mentions using the ADAM optimizer but does not specify version numbers for any software dependencies such as Python, deep learning frameworks (e.g., TensorFlow, PyTorch), or other libraries.
Experiment Setup | Yes | We train all models with Stochastic Gradient Descent with a learning rate of 0.001, a batch size of 128, and train for 50,000 steps. We use the ADAM optimizer (Kingma & Ba, 2015) unless otherwise stated. For details on hyper-parameters, see the supplementary material. We fix the number of layers to 2, though more layers tend to perform better, and use a policy latent variable size of 100. Gradient clipping is required and dropout rates must be reduced by ~25%. (A hedged configuration sketch based on these values follows the table.)
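For orientation, the sketch below illustrates the general idea behind the "adaptive feed-forward layers" quoted in the Research Type row: a small adaptation policy generates per-unit parameters (here a scale and shift) that modulate the layer's activation, rather than using a fixed activation. The class name, the way the policy conditions on the input, the tanh nonlinearity, and the latent size default are assumptions for illustration only, not the authors' exact formulation; the reference implementation is at https://github.com/flennerhag/alstm.

```python
# Hypothetical sketch of an adaptively parameterized feed-forward layer.
# The structure (a policy network emitting per-unit scale and shift for the
# activation) is an assumption for illustration; the paper's layers and aLSTM
# may differ in detail.
import torch
import torch.nn as nn


class AdaptiveDense(nn.Module):
    def __init__(self, in_features, out_features, latent_size=100):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)
        # Adaptation policy: maps the input to a latent code, then to
        # per-unit (scale, shift) parameters for the activation.
        self.policy = nn.Sequential(
            nn.Linear(in_features, latent_size),
            nn.Tanh(),
            nn.Linear(latent_size, 2 * out_features),
        )

    def forward(self, x):
        pre = self.linear(x)
        scale, shift = self.policy(x).chunk(2, dim=-1)
        # The activation's parameters are generated per input, not fixed.
        return torch.tanh(scale * pre + shift)


if __name__ == "__main__":
    layer = AdaptiveDense(in_features=784, out_features=256)
    y = layer(torch.randn(128, 784))
    print(y.shape)  # torch.Size([128, 256])
```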
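The hyper-parameters quoted in the Experiment Setup row can be collected into a small configuration and training skeleton. Only the learning rate, batch size, step count, layer count, and policy latent size come from the quoted text; the model class, vocabulary and embedding sizes, clip norm, and dropout rate below are placeholders, since the paper defers those details to its supplementary material.

```python
# Training skeleton reflecting the quoted setup: ADAM, learning rate 0.001,
# batch size 128, 50,000 steps, 2 layers, policy latent size 100, gradient
# clipping, and reduced dropout. The model is a plain LSTM stand-in, not the
# aLSTM; vocabulary size, embedding size, clip norm, and dropout are placeholders.
import torch
from torch import nn, optim

config = {
    "learning_rate": 1e-3,  # quoted
    "batch_size": 128,      # quoted
    "train_steps": 50_000,  # quoted
    "num_layers": 2,        # quoted
    "latent_size": 100,     # quoted policy latent variable size
    "clip_norm": 0.25,      # placeholder: value not given in the quoted text
    "dropout": 0.4,         # placeholder: paper only says rates drop by ~25%
}

vocab_size, embed_dim = 10_000, 200  # placeholders (PTB-scale vocabulary)


class WordModel(nn.Module):
    """Stand-in 2-layer recurrent word model used only to exercise the setup."""

    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.LSTM(embed_dim, embed_dim,
                           num_layers=config["num_layers"],
                           dropout=config["dropout"])
        self.decode = nn.Linear(embed_dim, vocab_size)

    def forward(self, tokens, hidden=None):
        out, hidden = self.rnn(self.embed(tokens), hidden)
        return self.decode(out), hidden


model = WordModel()
optimizer = optim.Adam(model.parameters(), lr=config["learning_rate"])
criterion = nn.CrossEntropyLoss()


def training_step(tokens, targets, hidden=None):
    """One step: forward, cross-entropy loss, backprop, clip gradients, update."""
    optimizer.zero_grad()
    logits, hidden = model(tokens, hidden)
    loss = criterion(logits.reshape(-1, vocab_size), targets.reshape(-1))
    loss.backward()
    nn.utils.clip_grad_norm_(model.parameters(), config["clip_norm"])
    optimizer.step()
    return loss.item()
```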