Rethinking Bias-Variance Trade-off for Generalization of Neural Networks

Authors: Zitong Yang, Yaodong Yu, Chong You, Jacob Steinhardt, Yi Ma

ICML 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We provide a simple explanation for this by measuring the bias and variance of neural networks: while the bias is monotonically decreasing as in the classical theory, the variance is unimodal or bell-shaped: it increases then decreases with the width of the network. We vary the network architecture, loss function, and choice of dataset and confirm that variance unimodality occurs robustly for all models we considered. ... We corroborate these empirical results with a theoretical analysis of two-layer linear networks..." (The squared-loss decomposition behind these bias and variance terms is sketched after the table.)
Researcher Affiliation | Academia | "Department of Electrical Engineering and Computer Sciences, University of California, Berkeley. Department of Statistics, University of California, Berkeley."
Pseudocode | Yes | "Algorithm 1 Estimating Generalized Variance" (a hedged sketch of such an estimator follows the table.)
Open Source Code | Yes | "Our code can be found at https://github.com/yaodongyu/Rethink-BiasVariance-Tradeoff."
Open Datasets | Yes | "We trained a ResNet34 (He et al., 2016) on the CIFAR10 dataset (Krizhevsky et al., 2009). ... In addition to CIFAR10, we study bias and variance on MNIST (LeCun, 1998) and Fashion-MNIST (Xiao et al., 2017)."
Dataset Splits | No | The paper describes its training and test sets but does not explicitly mention a separate validation set or specific split percentages for validation.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, processor types, or memory amounts used for running experiments.
Software Dependencies | No | The paper mentions optimizers (SGD) and loss functions (squared error, cross-entropy) but does not specify any software library names with version numbers (e.g., TensorFlow, PyTorch, scikit-learn) required for replication.
Experiment Setup | Yes | "We trained using stochastic gradient descent (SGD) with momentum 0.9. The initial learning rate is 0.1. We applied stage-wise training (decay learning rate by a factor of 10 every 200 epochs), and used weight decay 5 x 10^-4." (An illustrative configuration sketch follows the table.)
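
For context on the Research Type quote above: the bias and variance it measures come from the decomposition of the expected squared-loss risk over the randomness of the training set T (the paper also generalizes this beyond squared loss). The textbook form, written here for reference rather than in the paper's exact notation, is:

    \mathbb{E}_{x,y}\,\mathbb{E}_{T} \| f(x;T) - y \|^2
      \;=\; \underbrace{\mathbb{E}_{x,y} \| \bar{f}(x) - y \|^2}_{\text{bias}^2 \,(+\,\text{noise})}
      \;+\; \underbrace{\mathbb{E}_{x}\,\mathbb{E}_{T} \| f(x;T) - \bar{f}(x) \|^2}_{\text{variance}},
    \qquad \bar{f}(x) \;:=\; \mathbb{E}_{T}\, f(x;T),

where T denotes the random training set; the cross term vanishes because E_T f(x;T) equals the average prediction by definition.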
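
Algorithm 1 itself is not reproduced on this page. The following is only a rough sketch, under squared loss, of the general idea of estimating bias and variance by training models on disjoint splits of the training data. The helper train_model, the choice of two splits per trial, and the handling of the unbiased correction are illustrative assumptions, not the authors' exact procedure.

    import numpy as np

    def estimate_bias_variance(train_model, X_train, y_train, X_test, y_test,
                               n_trials=3, n_splits=2, seed=0):
        # train_model(X, y) is a hypothetical helper that returns a fitted
        # predictor with a .predict(X) method (e.g. a trained network).
        rng = np.random.default_rng(seed)
        preds = []                                   # one prediction array per trained model
        for _ in range(n_trials):
            perm = rng.permutation(len(X_train))
            for split in np.array_split(perm, n_splits):
                model = train_model(X_train[split], y_train[split])
                preds.append(model.predict(X_test))  # shape (n_test, n_outputs)
        preds = np.stack(preds)                      # (n_models, n_test, n_outputs)
        mean_pred = preds.mean(axis=0)               # approximates E_T f(x; T)

        # Variance: mean squared deviation of each model's prediction from the
        # average prediction (dividing by n_models - 1 instead would give the
        # usual unbiased sample-variance correction).
        variance = ((preds - mean_pred) ** 2).sum(axis=-1).mean()
        # Squared bias: distance of the average prediction from the targets.
        bias_sq = ((mean_pred - y_test) ** 2).sum(axis=-1).mean()
        return bias_sq, variance

With n_splits=2, each trained model sees a disjoint half of the training data in every trial; averaging over several trials reduces the noise of the estimate.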
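
The Experiment Setup row pins down the optimizer but, per the Software Dependencies row, not the framework. Purely as an illustration, here is what the quoted configuration might look like in PyTorch/torchvision; the framework choice, the total epoch count, the torchvision ResNet34 variant (the paper's CIFAR-adapted architecture may differ), and the omitted training loop are all assumptions.

    import torch
    from torchvision.models import resnet34

    model = resnet34(num_classes=10)          # CIFAR-10 has 10 classes

    # Quoted setup: SGD with momentum 0.9, initial learning rate 0.1,
    # weight decay 5e-4.
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                                momentum=0.9, weight_decay=5e-4)

    # Stage-wise training: multiply the learning rate by 0.1 every 200 epochs.
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=200, gamma=0.1)

    for epoch in range(600):                  # total number of epochs is an assumption
        # ... one pass over the CIFAR-10 training loader: forward pass, loss,
        #     loss.backward(), optimizer.step(), optimizer.zero_grad() ...
        scheduler.step()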