Rethinking Bias-Variance Trade-off for Generalization of Neural Networks
Authors: Zitong Yang, Yaodong Yu, Chong You, Jacob Steinhardt, Yi Ma
ICML 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental (see the bias-variance decomposition note after the table) | We provide a simple explanation for this by measuring the bias and variance of neural networks: while the bias is monotonically decreasing as in the classical theory, the variance is unimodal or bell-shaped: it increases then decreases with the width of the network. We vary the network architecture, loss function, and choice of dataset and confirm that variance unimodality occurs robustly for all models we considered. ... We corroborate these empirical results with a theoretical analysis of two-layer linear networks... |
| Researcher Affiliation | Academia | Department of Electrical Engineering and Computer Sciences, University of California, Berkeley; Department of Statistics, University of California, Berkeley. |
| Pseudocode | Yes (see the estimation sketch after the table) | Algorithm 1 Estimating Generalized Variance |
| Open Source Code | Yes | Our code can be found at https://github.com/yaodongyu/Rethink-BiasVariance-Tradeoff. |
| Open Datasets | Yes | We trained a ResNet34 (He et al., 2016) on the CIFAR10 dataset (Krizhevsky et al., 2009). ... In addition to CIFAR10, we study bias and variance on MNIST (LeCun, 1998) and Fashion-MNIST (Xiao et al., 2017). |
| Dataset Splits | No | The paper describes its training and test sets but does not explicitly mention a separate 'validation' set or specific split percentages for validation. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, processor types, or memory amounts used for running experiments. |
| Software Dependencies | No | The paper mentions optimizers (SGD) and loss functions (squared error, cross-entropy) but does not specify any software library names with version numbers (e.g., TensorFlow, PyTorch, scikit-learn) required for replication. |
| Experiment Setup | Yes (see the training-configuration sketch after the table) | We trained using stochastic gradient descent (SGD) with momentum 0.9. The initial learning rate is 0.1. We applied stage-wise training (decay learning rate by a factor of 10 every 200 epochs), and used weight decay 5 x 10^-4. |
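
The bias and variance discussed in the Research Type row refer to the standard decomposition of the expected squared error of a learned predictor over random draws of the training set. As a note for readers (our summary, not quoted from the paper):

```latex
% Expected squared error of a predictor f(x; T) trained on a random
% training set T, evaluated at a test point (x, y):
\[
\mathbb{E}_{T}\bigl[\lVert f(x;T)-y\rVert^{2}\bigr]
= \underbrace{\lVert \mathbb{E}_{T}[f(x;T)]-y\rVert^{2}}_{\text{bias}^{2}}
+ \underbrace{\mathbb{E}_{T}\bigl[\lVert f(x;T)-\mathbb{E}_{T}[f(x;T)]\rVert^{2}\bigr]}_{\text{variance}}
\]
```

The paper's finding is that, as network width grows, the first term decreases monotonically while the second term rises and then falls.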
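
The "Algorithm 1 Estimating Generalized Variance" entry refers to the paper's split-based estimator: train models on disjoint halves of repeatedly reshuffled training data and measure how much their predictions disagree. A minimal sketch of that idea, assuming squared error, vector-valued predictions (e.g. against one-hot targets), and a user-supplied `train_model` function whose name and signature are our assumptions rather than the released code's API:

```python
import numpy as np

def estimate_bias_variance(train_model, X_train, y_train, X_test, y_test,
                           n_splits=3, seed=0):
    """Split-based bias/variance estimate for squared error (sketch).

    train_model(X, y) is assumed to return a callable that maps test inputs
    to predictions of shape (num_test, num_classes).
    """
    rng = np.random.default_rng(seed)
    n = len(X_train)
    preds = []
    for _ in range(n_splits):
        idx = rng.permutation(n)
        # Train one model on each disjoint half of the shuffled training set.
        for half in (idx[: n // 2], idx[n // 2:]):
            model = train_model(X_train[half], y_train[half])
            preds.append(model(X_test))
    preds = np.stack(preds)            # (num_models, num_test, num_classes)
    mean_pred = preds.mean(axis=0)     # pointwise average predictor
    # Unbiased sample variance across models, summed over output dimensions
    # and averaged over test points.
    variance = preds.var(axis=0, ddof=1).sum(axis=-1).mean()
    # Squared bias of the average predictor against the (one-hot) targets.
    bias_sq = ((mean_pred - y_test) ** 2).sum(axis=-1).mean()
    return bias_sq, variance
```

The paper's generalized estimator also covers losses beyond squared error (e.g. cross-entropy); the sketch above only illustrates the squared-error case.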
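
The quoted training setup maps directly onto a standard SGD configuration. A sketch in PyTorch (the paper does not name its framework, and the model below is a placeholder rather than the ResNet34 actually trained):

```python
import torch

# Placeholder model; the paper trains a ResNet34 on CIFAR10, but the exact
# model construction and framework are not specified in the quoted text.
model = torch.nn.Linear(3 * 32 * 32, 10)

# Quoted settings: SGD with momentum 0.9, initial learning rate 0.1,
# weight decay 5e-4.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=5e-4)

# Stage-wise training: decay the learning rate by a factor of 10 every
# 200 epochs (call scheduler.step() once per epoch in the training loop).
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=200, gamma=0.1)
```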