Breaking the Activation Function Bottleneck through Adaptive Parameterization
Authors: Sebastian Flennerhag, Hujun Yin, John Keane, Mark Elliot
NeurIPS 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Section 5 (Experiments): We compare the behavior of a model with adaptive feed-forward layers to standard feed-forward baselines in a controlled regression problem and on MNIST (LeCun et al., 1998). The aLSTM is tested on the Penn Treebank and WikiText-2 word-level modeling tasks. We use the ADAM optimizer (Kingma & Ba, 2015) unless otherwise stated. Table 1: Train and test set accuracy on MNIST. Table 2: Validation and test set perplexities on Penn Treebank. |
| Researcher Affiliation | Academia | Sebastian Flennerhag (1, 2), Hujun Yin (1, 2), John Keane (1), Mark Elliot (1); (1) University of Manchester, (2) The Alan Turing Institute. sflennerhag@turing.ac.uk, {hujun.yin, john.keane, mark.elliot}@manchester.ac.uk |
| Pseudocode | No | No pseudocode or algorithm blocks were found. |
| Open Source Code | Yes | Code available at https://github.com/flennerhag/alstm. |
| Open Datasets | Yes | The Penn Treebank corpus (PTB; Marcus et al., 1993; Mikolov et al., 2010) is a widely used benchmark for language modeling. WikiText-2 (WT2; Merity et al., 2017) is a corpus curated from Wikipedia articles with lighter processing than PTB. |
| Dataset Splits | Yes | We evaluate the aLSTM on word-level modeling following standard practice in training setup (e.g. Zaremba et al., 2015). Table 2: Validation and test set perplexities on Penn Treebank. |
| Hardware Specification | No | The paper does not provide specific hardware details (GPU/CPU models, memory, etc.) used for running its experiments. |
| Software Dependencies | No | The paper mentions using the ADAM optimizer but does not specify version numbers for any software dependencies like Python, specific deep learning frameworks (e.g., TensorFlow, PyTorch), or other libraries. |
| Experiment Setup | Yes | We train all models with Stochastic Gradient Descent with a learning rate of 0.001, a batch size of 128, and train for 50,000 steps. We use the ADAM optimizer (Kingma & Ba, 2015) unless otherwise stated. For details on hyper-parameters, see supplementary material. We fix the number of layers to 2, though more layers tend to perform better, and use a policy latent variable size of 100. Gradient clipping is required and dropout rates must be reduced by ~25%. A hedged configuration sketch follows the table. |
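
The quoted setup maps onto a short training configuration. The sketch below is a minimal illustration, not the authors' code (their reference implementation is at https://github.com/flennerhag/alstm): it assumes a PyTorch-style stack and uses a vanilla LSTM as a stand-in for the aLSTM, and the vocabulary, embedding and hidden sizes, clip norm, and the exact reduced dropout rate are assumptions rather than values taken from the paper.

```python
# Minimal sketch of the reported training setup; NOT the authors' code.
# Reference implementation: https://github.com/flennerhag/alstm
# A vanilla LSTM stands in for the paper's aLSTM. Values marked "assumption"
# are illustrative and not taken from the paper.
import torch
import torch.nn as nn

VOCAB_SIZE = 10_000    # PTB-scale vocabulary (assumption)
EMBED_SIZE = 400       # assumption
HIDDEN_SIZE = 400      # assumption
NUM_LAYERS = 2         # "we fix the number of layers to 2"
LATENT_SIZE = 100      # aLSTM policy latent size; unused by this LSTM stand-in
DROPOUT = 0.375        # assumption: a 0.5 baseline rate reduced by ~25%
CLIP_NORM = 0.25       # assumption: the paper only states clipping is required
BATCH_SIZE = 128       # "a batch size of 128"
NUM_STEPS = 50_000     # "train for 50 000 steps"

class WordModel(nn.Module):
    """Word-level language model (plain LSTM stand-in for the aLSTM)."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, EMBED_SIZE)
        self.rnn = nn.LSTM(EMBED_SIZE, HIDDEN_SIZE, NUM_LAYERS,
                           dropout=DROPOUT, batch_first=True)
        self.head = nn.Linear(HIDDEN_SIZE, VOCAB_SIZE)

    def forward(self, tokens):
        out, _ = self.rnn(self.embed(tokens))
        return self.head(out)

model = WordModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # "learning rate of 0.001"
criterion = nn.CrossEntropyLoss()

def train_step(tokens, targets):
    """One optimization step: forward pass, cross-entropy loss, clipped backward pass."""
    optimizer.zero_grad()
    logits = model(tokens)                                   # (batch, seq, vocab)
    loss = criterion(logits.reshape(-1, VOCAB_SIZE), targets.reshape(-1))
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), CLIP_NORM)
    optimizer.step()
    return loss.item()
```

Note that the latent size of 100 refers to the aLSTM's adaptation-policy variable, which this stand-in does not implement; swapping in the aLSTM module from the linked repository would be the faithful choice.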