Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Weight Expansion: A New Perspective on Dropout and Generalization

Authors: Gaojie Jin, Xinping Yi, Pengfei Yang, Lijun Zhang, Sven Schewe, Xiaowei Huang

TMLR 2022 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We provide a theoretical argument that dropout leads to weight expansion and extensive empirical support for the correlation between dropout and weight expansion. To support our hypothesis that weight expansion can be regarded as an indicator of the enhanced generalization capability endowed by dropout, and not just as a mere by-product, we have studied other methods that achieve weight expansion (resp. contraction), and found that they generally lead to an increased (resp. decreased) generalization ability. This suggests that dropout is an attractive regularizer, because it is a computationally cheap method for obtaining weight expansion. This insight justifies the role of dropout as a regularizer, while paving the way for identifying regularizers that promise improved generalization through weight expansion. ... We have conducted a number of experiments to test our hypotheses that (1) weight expansion helps reduce the generalization error and (2) dropout increases the weight volume. More than 900 networks are trained. First, we confirm the role of weight expansion in connecting dropout and generalization by estimating the weight volume with two independent methods (Section 5.1). Second, we extend the PAC-Bayesian theorem with awareness of weight volume, and show that the new complexity measure can significantly improve the prediction of the generalization error for dropout networks (Section 5.2). Finally, we consider other noises that can lead to either weight expansion or weight contraction, and study their respective impact on the generalization performance (Section 5.3).
Researcher Affiliation | Academia | (1) Department of Computer Science, University of Liverpool; (2) Department of Electrical Engineering and Electronics, University of Liverpool; (3) State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences. *Corresponding author. EMAIL, EMAIL, EMAIL, EMAIL, EMAIL, EMAIL
Pseudocode | Yes | Algorithm 1: Sampling Method. Algorithm 2: Gradient update with disentanglement noise.
Open Source Code | No | The paper does not contain any explicit statement about releasing source code or a link to a code repository for the methodology described.
Open Datasets | Yes | Figure 1: Visualization of weight volume and features of the last layer in a CNN on MNIST... Section 5 Experiments: All empirical results are both significant and persistent across the models and data sets we work with, and support either hypothesis (1) or hypothesis (2) or both. In our experiments, we consider VGG-like models (Simonyan & Zisserman, 2014) and AlexNet-like models (Krizhevsky et al., 2012) for CIFAR-10/100 (Krizhevsky et al., 2009), ImageNet-32 (Deng et al., 2009), and SVHN (Netzer et al., 2011). ... Appendix M.1 Experiments on text classification: To make our empirical results more robust, we also implement FastText (Bojanowski et al., 2017; Joulin et al., 2016; 2017), TextRNN (Liu et al., 2016) and TextCNN (Kim, 2014) to classify the THUCNews data set (Sun et al., 2016) (180,000 training data, 10,000 validation data, 10,000 test data) with and without dropout.
Dataset Splits | Yes | Appendix M.1 Experiments on text classification: To make our empirical results more robust, we also implement FastText (Bojanowski et al., 2017; Joulin et al., 2016; 2017), TextRNN (Liu et al., 2016) and TextCNN (Kim, 2014) to classify the THUCNews data set (Sun et al., 2016) (180,000 training data, 10,000 validation data, 10,000 test data) with and without dropout. ... Figure 1: Visualization of weight volume and features of the last layer in a CNN on MNIST... (Standard benchmarks such as MNIST, CIFAR-10/100, ImageNet-32, and SVHN come with predefined train/test splits; the paper relies on these rather than stating explicit split percentages.)
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory amounts) used for running the experiments.
Software Dependencies | No | The paper does not provide specific ancillary software details (e.g., library or solver names with version numbers) needed to replicate the experiment.
Experiment Setup | Yes | Appendix A Details of Figure 1: As Table 3 shows, we use the above two networks to train on MNIST with the ReLU activation function, learning rate 0.01, minibatch size 128, stochastic gradient descent (SGD), 150 epochs, with and without dropout. ... Appendix J Supplementary for Section 5.1: All models for CIFAR-10/100 are trained for 140 epochs using SGD with momentum 0.9, batch size 128, weight decay 0 or 5×10^-4, and an initial learning rate of 0.1 that is divided by 2 after every 20 epochs. All models for ImageNet-32 are trained for 100 epochs using SGD with momentum 0.9, batch size 1024, weight decay 0 or 5×10^-4, and an initial learning rate of 0.1 that is divided by 2 after every 20 epochs, with a learning rate of 0.0005 after the 50th epoch. ... Section 5.1 Measuring weight volume: For the sampling method (and the sharpness method), we default to the settings of ϵ = 0.05 for CIFAR-10/SVHN, and ϵ = 0.1 for CIFAR-100/ImageNet-32.
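The step-decay schedule quoted above (initial learning rate 0.1, divided by 2 after every 20 epochs) can be sketched as a plain function. This is an illustrative reconstruction, not the paper's code: the name `lr_at_epoch` and the convention that halving takes effect at the start of epochs 20, 40, ... are assumptions; the ImageNet-32 override (a fixed 0.0005 after the 50th epoch) is omitted for brevity.

```python
def lr_at_epoch(epoch: int, base_lr: float = 0.1, halve_every: int = 20) -> float:
    """Step decay: divide the base learning rate by 2 after every `halve_every` epochs.

    Illustrative sketch of the schedule described in Appendix J of the paper;
    the boundary convention (halving at the start of epoch 20, 40, ...) is assumed.
    """
    return base_lr / (2 ** (epoch // halve_every))


# Epochs 0-19 run at the base rate; epoch 20 is the first halved epoch.
print(lr_at_epoch(0))    # 0.1
print(lr_at_epoch(20))   # 0.05
print(lr_at_epoch(139))  # 0.1 / 64, the rate in the last epoch of a 140-epoch run
```

In a framework such as PyTorch, the same schedule would typically be expressed with a built-in step scheduler (e.g. a step size of 20 and a decay factor of 0.5) rather than computed by hand.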