Adaptive Dropout with Rademacher Complexity Regularization

Authors: Ke Zhai, Huan Wang

ICLR 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on the task of image and document classification also show our method achieves better performance compared to the state-of-the-art dropout algorithms.
Researcher Affiliation | Industry | Ke Zhai, Microsoft AI & Research, Sunnyvale, CA, kezhai@microsoft.com; Huan Wang, Salesforce Research, Palo Alto, CA, joyousprince@gmail.com
Pseudocode | No | The paper describes algorithmic steps and mathematical formulations but does not include a clearly labeled pseudocode block or algorithm.
Open Source Code | No | The paper does not provide any statement about releasing source code or include a link to a code repository for the methodology described.
Open Datasets | Yes | MNIST dataset is a collection of 28×28 pixel hand-written digit images in grayscale, containing 60K for training and 10K for testing.
Dataset Splits | Yes | For all datasets, we hold out 20% of the training data as validation set for parameter tuning and model selection. (See the split sketch below the table.)
Hardware Specification | No | The paper does not specify the hardware used for running the experiments, such as particular CPU or GPU models, or cloud computing instances.
Software Dependencies | No | The paper mentions using standard machine learning concepts and models but does not provide specific version numbers for any software libraries, frameworks, or programming languages used.
Experiment Setup | Yes | We optimize categorical cross-entropy loss on predicted class labels with Rademacher regularization. ... We update the parameters using mini-batch stochastic gradient descent with Nesterov momentum of 0.95. ... For Rademacher complexity term, we perform a grid search on the regularization weight λ ∈ {0.05, 0.01, 0.005, 0.001, 1e-4, 1e-5}, and update the dropout rates after every I ∈ {1, 5, 10, 50, 100} minibatches. ... We use a learning rate of 0.01 and decay it by 0.5 after every {300, 400, 500} epochs. ... initializing the retaining rates to 0.8 for input layer and 0.5 for hidden layer yields better performance for all models. (See the configuration sketch below the table.)
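
The 80/20 holdout reported under Dataset Splits can be reproduced along the lines of the minimal sketch below. The use of torchvision's MNIST loader and the fixed random seed are assumptions; the paper does not state which framework or seed was used.

```python
# Minimal sketch of the 80/20 train/validation holdout described in the paper.
# torchvision's MNIST loader and the fixed seed are assumptions.
import torch
from torch.utils.data import random_split
from torchvision import datasets, transforms

train_full = datasets.MNIST(root="./data", train=True, download=True,
                            transform=transforms.ToTensor())   # 60K training images
test_set = datasets.MNIST(root="./data", train=False, download=True,
                          transform=transforms.ToTensor())     # 10K test images

n_val = int(0.2 * len(train_full))               # hold out 20% for validation
n_train = len(train_full) - n_val
train_set, val_set = random_split(
    train_full, [n_train, n_val],
    generator=torch.Generator().manual_seed(0))  # seed is an assumption
```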
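The following is a hedged sketch of the optimization setup quoted under Experiment Setup: categorical cross-entropy with an added Rademacher penalty, mini-batch SGD with Nesterov momentum 0.95, a 0.01 learning rate halved on a fixed epoch schedule, and retain rates initialized to 0.8 (input) and 0.5 (hidden). The network architecture, batch size, stand-in data, and the rademacher_penalty placeholder are assumptions; the paper's actual regularizer and adaptive dropout-rate update are not reproduced here.

```python
# Hedged sketch of the reported training configuration, not the authors' code.
# Only the quoted hyperparameters come from the paper; everything else
# (architecture, batch size, stand-in data, rademacher_penalty) is assumed.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def rademacher_penalty(model):
    # Placeholder for the paper's Rademacher complexity term; the actual
    # bound depends on the layer weights and the adaptive retain rates.
    return torch.tensor(0.0)

model = nn.Sequential(                      # architecture is an assumption
    nn.Dropout(p=1 - 0.8),                  # input retain rate initialized to 0.8
    nn.Linear(784, 1024), nn.ReLU(),
    nn.Dropout(p=1 - 0.5),                  # hidden retain rate initialized to 0.5
    nn.Linear(1024, 10),
)

criterion = nn.CrossEntropyLoss()           # categorical cross-entropy on class labels
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.95, nesterov=True)
# Learning rate decayed by 0.5 after every {300, 400, 500} epochs (grid-searched);
# 300 is picked here arbitrarily, and scheduler.step() would run once per epoch.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=300, gamma=0.5)

# Grids reported in the paper for the regularization weight and the
# dropout-rate update interval.
lambda_grid = [0.05, 0.01, 0.005, 0.001, 1e-4, 1e-5]
interval_grid = [1, 5, 10, 50, 100]
lam, update_every = lambda_grid[1], interval_grid[2]    # one grid point, chosen arbitrarily

# Random stand-in data so the sketch runs end to end; real runs would use the
# MNIST loaders from the split sketch above.
loader = DataLoader(TensorDataset(torch.randn(256, 1, 28, 28),
                                  torch.randint(0, 10, (256,))), batch_size=64)

for step, (x, y) in enumerate(loader):
    optimizer.zero_grad()
    logits = model(x.view(x.size(0), -1))
    loss = criterion(logits, y) + lam * rademacher_penalty(model)
    loss.backward()
    optimizer.step()
    if (step + 1) % update_every == 0:
        pass  # the adaptive method would update the dropout retain rates here
```

In a full run, the grid over λ and the update interval I would be swept and the held-out 20% validation split used for parameter tuning and model selection, as the quoted setup describes.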