Noisy Activation Functions
Authors: Caglar Gulcehre, Marcin Moczulski, Misha Denil, Yoshua Bengio
ICML 2016
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We find experimentally that replacing such saturating activation functions by noisy variants helps optimization in many contexts, yielding state-of-the-art or competitive results on different datasets and tasks, especially when training seems to be the most difficult, e.g., when curriculum learning is necessary to obtain good results. |
| Researcher Affiliation | Academia | Caglar Gulcehre GULCEHRC@IRO.UMONTREAL.CA Marcin Moczulski MARCIN.MOCZULSKI@STCATZ.OX.AC.UK Misha Denil MISHA.DENIL@GMAIL.COM Yoshua Bengio BENGIOY@IRO.UMONTREAL.CA University of Montreal University of Oxford |
| Pseudocode | Yes | Algorithm 1 Noisy Activations with Half-Normal Noise for Hard-Saturating Functions |
| Open Source Code | Yes | Codes for different types of noisy activation functions can be found at https://github.com/caglar/noisy_units. |
| Open Datasets | Yes | We trained a 2 layer word-level LSTM language model on Penntreebank. We used the same model proposed by Zaremba et al. (2014). |
| Dataset Splits | No | The paper mentions "validation perplexity" and uses validation sets, but it does not specify split percentages or example counts for any dataset, so the data partitioning cannot be reproduced. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, memory amounts) used for running its experiments. |
| Software Dependencies | No | The paper mentions Theano in the acknowledgements but does not provide specific version numbers for it or any other software libraries used in the experiments. |
| Experiment Setup | Yes | We changed the default gradient clipping to 5 from 10 in order to avoid numerical stability problems. ... In order to anneal the noise, we started training with the scale hyperparameter of the standard deviation of noise with c = 30 and annealed it down to 0.5 with the schedule of c/(t+1), where t is incremented at every 200 minibatch updates. |
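The pseudocode row above refers to the paper's Algorithm 1 (noisy activations with half-normal noise for hard-saturating functions). As a rough, hedged sketch of the idea in NumPy: noise is injected only where the hard-saturating nonlinearity deviates from its linearization, and is pushed back toward the linear regime. The hyperparameter values (`alpha`, `c`, `p`) below are illustrative assumptions, not the values used in the paper's experiments.

```python
import numpy as np

def hard_tanh(x):
    # Hard-saturating tanh: linear on [-1, 1], clipped outside.
    return np.clip(x, -1.0, 1.0)

def noisy_hard_tanh(x, alpha=1.15, c=0.5, p=1.0, rng=None):
    """Sketch of a noisy activation with half-normal noise.

    alpha, c, p are illustrative hyperparameters (assumptions);
    see the paper's Algorithm 1 and released code for the real recipe.
    """
    rng = np.random.default_rng() if rng is None else rng
    u = x                       # linearization of tanh around 0
    h = hard_tanh(x)
    delta = h - u               # zero inside the linear regime
    # Noise std is nonzero only where the unit saturates (delta != 0).
    sigma = c * (1.0 / (1.0 + np.exp(-p * delta)) - 0.5) ** 2
    # Direction that pushes the noise back toward the linear regime.
    d = -np.sign(x) * np.sign(1.0 - alpha)
    xi = np.abs(rng.standard_normal(np.shape(x)))  # half-normal noise
    return alpha * h + (1.0 - alpha) * u + d * sigma * xi
```

Inside the linear regime (|x| < 1) the sketch reduces exactly to the identity, since delta = 0 makes sigma vanish and h = u = x.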
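The annealing schedule quoted in the experiment-setup row can be sketched as follows. The decay form c/(t+1) is an assumption reconstructed from the garbled extracted text, and the floor of 0.5 follows the quoted "annealed it down to 0.5"; treat both as a reading of the quote, not a verified reimplementation.

```python
def noise_scale(update, c=30.0, floor=0.5, every=200):
    """Anneal the noise-scale hyperparameter during training.

    `update` is the minibatch-update counter; t increments once per
    `every` updates (per the quote). The c/(t+1) form is an assumption.
    """
    t = update // every
    return max(c / (t + 1), floor)
```

For example, the scale starts at 30.0, halves to 15.0 after the first 200 updates, and bottoms out at the 0.5 floor late in training.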