Dropout with Expectation-linear Regularization
Authors: Xuezhe Ma, Yingkai Gao, Zhiting Hu, Yaoliang Yu, Yuntian Deng, Eduard Hovy
ICLR 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we first formulate dropout as a tractable approximation of a latent variable model... Experimentally, through three benchmark datasets we show that our regularized dropout is not only as simple and efficient as standard dropout but also consistently leads to improved performance. (An illustrative sketch of the expectation-linear regularizer follows the table.) |
| Researcher Affiliation | Academia | Xuezhe Ma, Yingkai Gao (Language Technologies Institute, Carnegie Mellon University) {xuezhem, yingkaig}@cs.cmu.edu; Zhiting Hu, Yaoliang Yu (Machine Learning Department, Carnegie Mellon University) {zhitinghu, yaoliang}@cs.cmu.edu; Yuntian Deng (School of Engineering and Applied Sciences, Harvard University) dengyuntian@gmail.com; Eduard Hovy (Language Technologies Institute, Carnegie Mellon University) hovy@cmu.edu |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide concrete access to source code for the methodology described. |
| Open Datasets | Yes | Experiments on three image classification benchmark datasets demonstrate that reducing the inference gap can indeed improve the performance consistently. ... The MNIST dataset (Le Cun et al., 1998) consists of 70,000 handwritten digit images of size 28 × 28, where 60,000 images are used for training and the rest for testing. ... The CIFAR-10 and CIFAR-100 datasets (Krizhevsky, 2009) consist of 60,000 color images of size 32 × 32... 50,000 images are used for training and the rest for testing. |
| Dataset Splits | Yes | For each dataset, we held out 10,000 random training images for validation to tune the hyper-parameters, including λ in Eq. (15). |
| Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types, or memory amounts) used for running its experiments. |
| Software Dependencies | No | The paper does not provide specific ancillary software details with version numbers. |
| Experiment Setup | Yes | For all architectures, we used dropout rate p = 0.5 for all hidden layers and p = 0.2 for the input layer. ... Neural network training in all the experiments is performed with mini-batch stochastic gradient descent (SGD) with momentum. We choose an initial learning rate of η0, and the learning rate is updated on each epoch of training as ηt = η0/(1 + ρt), where ρ is the decay rate and t is the number of epochs completed. We run each experiment for 2,000 epochs... Table 3: Hyper-parameters for all experiments. (A sketch of this learning-rate schedule also appears after the table.) |
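
The Research Type row quotes the paper's core idea: penalize the gap between the sampled-dropout output and the output obtained with the expected (deterministic) dropout mask. Below is a minimal PyTorch sketch of that idea, assuming a simple fully connected MNIST classifier with the quoted dropout rates (p = 0.2 on the input, p = 0.5 on the hidden layer). The architecture, hidden width, and single-sample Monte Carlo estimate of the gap are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DropoutMLP(nn.Module):
    """Hypothetical fully connected MNIST classifier with the quoted dropout
    rates: p = 0.2 on the input layer and p = 0.5 on the hidden layer."""

    def __init__(self, d_in=784, d_hidden=1024, n_classes=10):
        super().__init__()
        self.fc1 = nn.Linear(d_in, d_hidden)
        self.fc2 = nn.Linear(d_hidden, n_classes)
        self.drop_in = nn.Dropout(p=0.2)
        self.drop_hidden = nn.Dropout(p=0.5)

    def forward(self, x):
        # In eval mode nn.Dropout is the identity, which matches standard
        # dropout inference (the "expected mask" forward pass).
        h = F.relu(self.fc1(self.drop_in(x)))
        return self.fc2(self.drop_hidden(h))


def expectation_linear_loss(model, x, y, lam):
    """Standard dropout cross-entropy plus lam times the squared gap between
    the sampled-dropout output and the expected-mask output (one Monte Carlo
    sample per example; a simplification of the paper's estimator)."""
    model.train()                     # sample dropout masks
    logits_sampled = model(x)
    ce = F.cross_entropy(logits_sampled, y)

    model.eval()                      # expected masks, i.e. no dropout noise
    logits_expected = model(x)
    model.train()

    gap = F.mse_loss(logits_sampled, logits_expected)
    return ce + lam * gap


# Tiny usage example with random data shaped like flattened MNIST digits.
if __name__ == "__main__":
    model = DropoutMLP()
    x = torch.randn(8, 784)
    y = torch.randint(0, 10, (8,))
    loss = expectation_linear_loss(model, x, y, lam=1.0)
    loss.backward()
```

Setting lam = 0 recovers plain dropout training; the paper tunes the regularization weight (its λ) on the held-out validation split noted in the Dataset Splits row.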
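The Experiment Setup row quotes mini-batch SGD with momentum and the per-epoch decay ηt = η0/(1 + ρt). The sketch below shows one way to apply that schedule with torch.optim.SGD; the values of η0, ρ, the momentum, and the stand-in model are hypothetical, since the paper reports its actual hyper-parameters per dataset in its Table 3.

```python
import torch
import torch.nn as nn


def lr_at_epoch(eta0, rho, t):
    """Quoted schedule: eta_t = eta_0 / (1 + rho * t), with t the number of
    completed epochs."""
    return eta0 / (1.0 + rho * t)


# Hypothetical values; the paper's actual eta_0, rho, momentum, and lambda
# are reported per dataset in its Table 3.
eta0, rho, momentum, n_epochs = 0.1, 0.01, 0.9, 2000

model = nn.Linear(784, 10)  # stand-in for the dropout network sketched above
optimizer = torch.optim.SGD(model.parameters(), lr=eta0, momentum=momentum)

for epoch in range(n_epochs):
    for group in optimizer.param_groups:
        group["lr"] = lr_at_epoch(eta0, rho, epoch)  # decay once per epoch
    # ... one pass of mini-batch SGD over the training set would go here ...
```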