Stochastic Modified Equations and Dynamics of Dropout Algorithm

Authors: Zhongwang Zhang, Yuqing Li, Tao Luo, Zhi-Qin John Xu

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Meanwhile, we experimentally verify the SDE's ability to approximate dropout under a wider range of settings. Subsequently, we empirically delve into the intricate mechanisms by which dropout facilitates the identification of flatter minima. Our empirical findings substantiate the ubiquitous presence of the Hessian-variance alignment relation throughout the training process of dropout.
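For context on the mechanism being approximated: dropout here acts as multiplicative Bernoulli noise on hidden activations, and the step-to-step randomness of the mask is the noise source the SDE models. A minimal sketch of the standard inverted-dropout forward pass follows; it is not the authors' code, and treating p as a keep probability is an assumption for illustration.

```python
import torch

def dropout_forward(h: torch.Tensor, p_keep: float = 0.8, training: bool = True) -> torch.Tensor:
    """Inverted dropout on a hidden activation h.

    Each unit is kept with probability p_keep and rescaled by 1/p_keep so the
    output matches the deterministic forward pass in expectation. The keep/drop
    convention here is an assumption, not taken from the paper.
    """
    if not training:
        return h
    mask = torch.bernoulli(torch.full_like(h, p_keep))  # fresh Bernoulli mask each step
    return h * mask / p_keep
```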
Researcher Affiliation | Collaboration | Zhongwang Zhang (1), Yuqing Li (1,2), Tao Luo (1,2,3,4,5), Zhi-Qin John Xu (1,3,4,6); (1) School of Mathematical Sciences, Shanghai Jiao Tong University; (2) CMA-Shanghai, Shanghai Jiao Tong University; (3) Institute of Natural Sciences, MOE-LSC, Shanghai Jiao Tong University; (4) Qing Yuan Research Institute, Shanghai Jiao Tong University; (5) Shanghai Artificial Intelligence Laboratory; (6) Shanghai Seres Information Technology Company, Ltd.
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. It provides mathematical derivations and explanations of methods, but not in a code-like format.
Open Source Code | No | The paper does not provide concrete access to source code. There are no specific repository links, explicit code-release statements, or mentions of code in supplementary materials for the methodology described in the paper.
Open Datasets | Yes | We train two-layer fully connected networks on MNIST. The FNN with size 784-50-50-10 is trained on the MNIST dataset using the first 10000 examples as the training dataset. The VGG16 is trained on the CIFAR-10 dataset using the full examples as the training dataset. The ResNet-20 ... trained using full-batch GD on the CIFAR-100 classification task. Multi30k dataset.
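As a hedged illustration of the data usage quoted above, the "first 10000 examples" MNIST subset could be built as below; the torchvision calls, flattening transform, and full-batch loader are assumptions for the sketch, not details taken from the paper.

```python
import torch
from torch.utils.data import Subset, DataLoader
from torchvision import datasets, transforms

# Flatten images to 784-dimensional vectors to match the 784-50-50-10 FNN.
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Lambda(lambda x: x.view(-1)),
])

full_train = datasets.MNIST(root="./data", train=True, download=True, transform=transform)
train_subset = Subset(full_train, range(10000))  # "first 10000 examples as the training dataset"

# Single full-size batch, consistent with the full-batch GD training described in the paper.
loader = DataLoader(train_subset, batch_size=len(train_subset), shuffle=False)
```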
Dataset Splits | No | The paper does not provide specific dataset split information for validation. It mentions using parts of datasets for training (e.g., 'first 1000 images as the training set', 'first 10000 images as the training set') and discusses 'test loss' and 'test accuracy', but does not specify validation splits, percentages, or explicit cross-validation setups.
Hardware Specification | No | The paper does not provide specific hardware details used for running its experiments. It acknowledges the 'HPC of School of Mathematical Sciences and the Student Innovation Center, and the Siyuan-1 cluster supported by the Center for High Performance Computing at Shanghai Jiao Tong University', but no specific GPU models, CPUs, or memory configurations are given.
Software Dependencies | No | The paper does not provide specific ancillary software details with version numbers (e.g., library or solver names). It does not mention software names such as Python, PyTorch, or TensorFlow, or their specific versions.
Experiment Setup | Yes | For Fig. 1 and Fig. 2, we train the network using GD with the first 1000 images as the training set. We add a dropout layer behind the hidden layer. The dropout rate and learning rate are specified and unchanged in each experiment. Models are trained using full-batch GD on the CIFAR-100 classification task for 1200 epochs. The learning rate is initialized at 0.01. The warm-up step is 4000 epochs and the training step is 10000 epochs. The learning rate is 1 × 10^-3. For SGD, the batch size is 1. For dropout, the dropout layer is added after the hidden layer with p = 0.8. For parameter noise injection, we use layer noise with noise standard deviation σ = 0.001.
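A minimal sketch of the reported small-network setup (784-50-50-10 FNN, dropout after a hidden layer, full-batch GD at learning rate 1e-3) is given below. This is an illustrative reconstruction, not the authors' code: the ReLU activation, the exact placement of the dropout layer, and whether p = 0.8 denotes the keep or the drop probability are assumptions (PyTorch's nn.Dropout takes the drop probability).

```python
import torch
import torch.nn as nn

# Illustrative reconstruction of the reported setup; activation choice,
# dropout placement, and the p convention are assumptions.
p_keep = 0.8                        # paper reports p = 0.8
model = nn.Sequential(
    nn.Linear(784, 50), nn.ReLU(),
    nn.Linear(50, 50), nn.ReLU(),
    nn.Dropout(p=1 - p_keep),       # nn.Dropout expects the DROP probability
    nn.Linear(50, 10),
)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)  # one of the reported learning rates

def full_batch_gd_step(x: torch.Tensor, y: torch.Tensor) -> float:
    """One full-batch GD step on the whole training subset (x, y)."""
    model.train()                   # keeps the dropout mask active during training
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```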