Adaptive Dropout with Rademacher Complexity Regularization
Authors: Ke Zhai, Huan Wang
ICLR 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on the task of image and document classification also show our method achieves better performance compared to the state-of-the-art dropout algorithms. |
| Researcher Affiliation | Industry | Ke Zhai, Microsoft AI & Research, Sunnyvale, CA (kezhai@microsoft.com); Huan Wang, Salesforce Research, Palo Alto, CA (joyousprince@gmail.com) |
| Pseudocode | No | The paper describes algorithmic steps and mathematical formulations but does not include a clearly labeled pseudocode block or algorithm. |
| Open Source Code | No | The paper does not provide any statement about releasing source code or include a link to a code repository for the methodology described. |
| Open Datasets | Yes | MNIST dataset is a collection of 28×28 pixel hand-written digit images in grayscale, containing 60K for training and 10K for testing. |
| Dataset Splits | Yes | For all datasets, we hold out 20% of the training data as validation set for parameter tuning and model selection. |
| Hardware Specification | No | The paper does not specify the hardware used for running the experiments, such as particular CPU or GPU models, or cloud computing instances. |
| Software Dependencies | No | The paper mentions using standard machine learning concepts and models but does not provide specific version numbers for any software libraries, frameworks, or programming languages used. |
| Experiment Setup | Yes | We optimize categorical cross-entropy loss on predicted class labels with Rademacher regularization. ... We update the parameters using mini-batch stochastic gradient descent with Nesterov momentum of 0.95. ... For the Rademacher complexity term, we perform a grid search on the regularization weight λ ∈ {0.05, 0.01, 0.005, 0.001, 1e-4, 1e-5}, and update the dropout rates after every I ∈ {1, 5, 10, 50, 100} minibatches. ... We use a learning rate of 0.01 and decay it by 0.5 after every {300, 400, 500} epochs. ... initializing the retaining rates to 0.8 for input layer and 0.5 for hidden layer yields better performance for all models. |
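
The experiment-setup, dataset, and split rows above describe a fairly standard training configuration: a 20% validation hold-out from the MNIST training set, cross-entropy loss plus a Rademacher regularization term, and mini-batch SGD with Nesterov momentum 0.95, learning rate 0.01 decayed by 0.5 on a fixed epoch schedule. The following is a minimal sketch of how such a setup could be wired together, assuming PyTorch; the batch size, epoch count, network architecture, `rademacher_penalty` placeholder, and dropout-rate update hook are all assumptions for illustration, since the paper does not release code or specify them in the excerpts above.

```python
# Sketch of the reported training configuration (not the authors' code).
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, random_split
from torchvision import datasets, transforms

# MNIST: 60K training / 10K test images of 28x28 grayscale digits.
train_full = datasets.MNIST("data", train=True, download=True,
                            transform=transforms.ToTensor())
val_size = int(0.2 * len(train_full))  # hold out 20% of training data for validation
train_set, val_set = random_split(train_full, [len(train_full) - val_size, val_size])
train_loader = DataLoader(train_set, batch_size=128, shuffle=True)  # batch size assumed

# Placeholder classifier; the paper evaluates networks with adaptive dropout,
# initializing retain rates to 0.8 (input layer) and 0.5 (hidden layers).
model = nn.Sequential(nn.Flatten(), nn.Dropout(1 - 0.8),
                      nn.Linear(784, 800), nn.ReLU(),
                      nn.Dropout(1 - 0.5), nn.Linear(800, 10))

criterion = nn.CrossEntropyLoss()
# Mini-batch SGD with Nesterov momentum 0.95, learning rate 0.01.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.95, nesterov=True)
# Decay the learning rate by 0.5 every 300 epochs (the paper searches {300, 400, 500}).
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=300, gamma=0.5)

lam = 0.001           # regularization weight; paper searches {0.05, ..., 1e-5}
update_interval = 10  # update dropout rates every I minibatches, I in {1, 5, 10, 50, 100}

def rademacher_penalty(model):
    # Placeholder: the paper's bound depends on layer weights and retain rates;
    # returning zero here just keeps the sketch runnable.
    return torch.tensor(0.0)

for epoch in range(600):  # epoch budget assumed
    for step, (x, y) in enumerate(train_loader):
        loss = criterion(model(x), y) + lam * rademacher_penalty(model)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if (step + 1) % update_interval == 0:
            pass  # the adaptive dropout-rate update would go here
    scheduler.step()
```

The Rademacher complexity bound and the adaptive dropout-rate update are the paper's actual contribution and are not reproduced here; the sketch only shows the surrounding optimization loop implied by the quoted setup.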