On Convergence and Generalization of Dropout Training

Authors: Poorya Mianjy, Raman Arora

NeurIPS 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In Section 6, we present a sketch of the proofs of our main results; the detailed proofs are deferred to the Appendix. We conclude the paper by providing empirical evidence for our theoretical results in Section 6. The goal of this section is to investigate if dropout indeed compresses the model, as predicted by Theorem 4.2. We train a convolutional neural network with a dropout layer on the top hidden layer, using cross-entropy loss, on the MNIST dataset. (An illustrative PyTorch model sketch appears after the table.)
Researcher Affiliation | Academia | Poorya Mianjy, Department of Computer Science, Johns Hopkins University, mianjy@jhu.edu; Raman Arora, Department of Computer Science, Johns Hopkins University, arora@cs.jhu.edu
Pseudocode | Yes | Algorithm 1: Dropout in Two-Layer Networks.
Input: data S_T = {(x_t, y_t)}_{t=1}^T ∼ D^T; Bernoulli masks B_T = {B_t}_{t=1}^T; dropout rate 1 − q; max-norm constraint parameter c; learning rate η
1: initialize: w_{r,1} ∼ N(0, I) and a_r ∼ Unif({+1, −1}), r ∈ [m]
2: for t = 1, ..., T − 1 do
3:   forward: g(W_t; x_t, B_t) = (1/√m) a^T B_t σ(W_t x_t)
4:   backward: ∇L_t(W_t) = ∇ℓ(y_t g(W_t; x_t, B_t)) = ℓ'(y_t g(W_t; x_t, B_t)) y_t ∇g(W_t; x_t, B_t)
5:   update: W_{t+1/2} ← W_t − η ∇L_t(W_t)
6:   max-norm: W_{t+1} ← Π_c(W_{t+1/2})
7: end for
Test time: re-scale the weights as W_t ← q W_t. (A runnable NumPy sketch of this algorithm appears after the table.)
Open Source Code | No | The paper does not provide any statement or link indicating that the source code for the described methodology is publicly available.
Open Datasets | Yes | We train a convolutional neural network with a dropout layer on the top hidden layer, using cross-entropy loss, on the MNIST dataset.
Dataset Splits | No | The paper mentions using the MNIST dataset but does not explicitly describe training, validation, or test split percentages or methodology beyond stating that it tracks 'test accuracy'.
Hardware Specification | No | The paper does not specify any particular hardware (CPU or GPU models, or cloud computing instances with their specifications) used for running the experiments.
Software Dependencies | No | The paper mentions PyTorch as a machine learning framework in a footnote, but does not provide version numbers for it or for any other software dependency.
Experiment Setup | Yes | We use a constant learning rate η = 0.01 and batch-size equal to 64 for all the experiments. We train several networks where, except for the top layer widths (m ∈ {100, 500, 1K, 5K, 10K, 50K, 100K, 250K}), all other architectural parameters are fixed. We run the experiments for several values of the dropout rate, 1 − p ∈ {0.1, 0.2, 0.3, ..., 0.9}. (A sketch enumerating this grid appears after the table.)
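The Pseudocode row transcribes Algorithm 1. Below is a minimal NumPy sketch of that training loop, not the authors' implementation: it assumes ℓ is the logistic loss, σ is the ReLU, and that the max-norm projection Π_c clips each row of W to ℓ2-norm at most c; the function name dropout_sgd and the data interface are ours.

```python
# Minimal sketch of Algorithm 1 (Dropout in Two-Layer Networks) in NumPy.
# Assumptions (not specified in the quoted row): logistic loss, ReLU activation,
# and a max-norm projection that clips each row of W to l2-norm at most c.
import numpy as np

def dropout_sgd(data, q=0.8, c=10.0, eta=0.01, m=1000, seed=0):
    """data: list of (x, y) pairs with x in R^d and y in {+1, -1}."""
    rng = np.random.default_rng(seed)
    d = len(data[0][0])
    W = rng.standard_normal((m, d))        # initialize: w_{r,1} ~ N(0, I)
    a = rng.choice([+1.0, -1.0], size=m)   # a_r ~ Unif({+1, -1})

    for x, y in data:
        B = rng.binomial(1, q, size=m)     # Bernoulli mask; keep probability q, dropout rate 1 - q
        pre = W @ x                        # pre-activations W_t x_t
        act = np.maximum(pre, 0.0)         # sigma = ReLU
        g = (a * B) @ act / np.sqrt(m)     # forward: g(W_t; x_t, B_t) = (1/sqrt(m)) a^T B_t sigma(W_t x_t)

        # backward: grad L_t(W_t) = l'(y_t g) * y_t * grad g, with logistic loss l(z) = log(1 + e^{-z})
        lprime = -1.0 / (1.0 + np.exp(y * g))
        grad = np.outer(lprime * y * a * B * (pre > 0) / np.sqrt(m), x)

        W = W - eta * grad                 # update: W_{t+1/2} <- W_t - eta * grad L_t(W_t)
        norms = np.linalg.norm(W, axis=1, keepdims=True)
        W = W * np.minimum(1.0, c / np.maximum(norms, 1e-12))  # max-norm: W_{t+1} <- Pi_c(W_{t+1/2})

    return q * W, a                        # test time: re-scale the weights as W <- q W
```

The returned first-layer weights are already re-scaled by the keep probability q, matching the algorithm's test-time step.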
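The experiment quoted in the Research Type and Open Datasets rows (a convolutional network with dropout only on the top hidden layer, trained with cross-entropy on MNIST) could look roughly like the PyTorch sketch below. Only the dropout placement and the loss come from the paper; the channel counts, kernel sizes, and default width m are illustrative assumptions.

```python
# Illustrative PyTorch model: a small CNN for MNIST with dropout applied only
# to the top hidden layer, to be trained with nn.CrossEntropyLoss.
# Architectural details other than the dropout placement are assumptions.
import torch
import torch.nn as nn

class DropoutCNN(nn.Module):
    def __init__(self, m=1000, dropout_rate=0.5):  # dropout_rate = 1 - p, the probability of dropping a unit
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.top_hidden = nn.Linear(64 * 7 * 7, m)  # top hidden layer of width m
        self.dropout = nn.Dropout(p=dropout_rate)   # dropout only on the top hidden layer
        self.classifier = nn.Linear(m, 10)

    def forward(self, x):
        h = self.features(x).flatten(1)
        h = torch.relu(self.top_hidden(h))
        h = self.dropout(h)
        return self.classifier(h)                   # logits for the cross-entropy loss
```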
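The Experiment Setup row pins down the learning rate, batch size, top-layer widths, and dropout rates; the sketch below only enumerates that grid. The itertools.product enumeration and the run-dictionary format are illustrative conveniences, not the authors' tooling.

```python
# Enumerate the hyperparameter grid quoted in the Experiment Setup row.
# Only the four values below come from the paper; everything else is scaffolding.
import itertools

LEARNING_RATE = 0.01                                                  # constant eta for all runs
BATCH_SIZE = 64
WIDTHS = [100, 500, 1_000, 5_000, 10_000, 50_000, 100_000, 250_000]   # top-layer widths m
DROPOUT_RATES = [round(0.1 * k, 1) for k in range(1, 10)]             # dropout rate 1 - p in {0.1, ..., 0.9}

# One configuration per (width, dropout rate) pair; all other architectural
# parameters are held fixed across runs.
runs = [{"m": m, "dropout_rate": rate, "lr": LEARNING_RATE, "batch_size": BATCH_SIZE}
        for m, rate in itertools.product(WIDTHS, DROPOUT_RATES)]
print(len(runs))  # 8 widths x 9 dropout rates = 72 configurations
```

Each configuration would then be trained as in the model sketch above, e.g. DropoutCNN(m=run["m"], dropout_rate=run["dropout_rate"]).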