A U-turn on Double Descent: Rethinking Parameter Counting in Statistical Learning
Authors: Alicia Curth, Alan Jeffares, Mihaela van der Schaar
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we investigate whether the double descent behavior observed in recent empirical studies of such non-deep ML methods truly disagrees with the traditional notion of a U-shaped tradeoff between model complexity and prediction error. In two parts, we argue that once careful consideration is given to what is being plotted on the axes of these double descent plots, the originally counter-intuitive peaking behavior can be comprehensively explained under existing paradigms: Part 1: Revisiting existing experimental evidence. We show that in the experimental evidence for non-deep double descent using trees, boosting, and linear regressions there is implicitly more than one complexity axis along which the parameter count grows. |
| Researcher Affiliation | Academia | Alicia Curth, University of Cambridge, amc253@cam.ac.uk; Alan Jeffares, University of Cambridge, aj659@cam.ac.uk; Mihaela van der Schaar, University of Cambridge, mv472@cam.ac.uk |
| Pseudocode | Yes | The gradient boosted trees algorithm. We consider gradient boosting with learning rate $\eta$ and the squared loss as $L(\cdot, \cdot)$: 1. Initialize $f_0(x) = 0$. 2. For $p \in \{1, \dots, P^{\text{boost}}\}$: (a) for $i \in \{1, \dots, n\}$ compute $g_{i,p} = -\big[\tfrac{\partial L(y_i, f(x_i))}{\partial f(x_i)}\big]_{f = f_{p-1}}$ (11); (b) fit a regression tree to $\{(x_i, g_{i,p})\}_{i=1}^n$, giving leaves $l_{jp}$ for $j = 1, \dots, J_p$; (c) compute optimal predictions for each leaf $j \in \{1, \dots, J_p\}$: $\gamma_{jp} = \arg\min_{\gamma \in \mathbb{R}} \sum_{x_i \in l_{jp}} L(y_i, f_{p-1}(x_i) + \gamma) = \tfrac{1}{n_{l_{jp}}} \sum_{x_i \in l_{jp}} (y_i - f_{p-1}(x_i))$ (12); (d) denote by $\tilde{f}_p(x) = \sum_{j=1}^{J_p} \mathbb{1}\{x \in l_{jp}\} \gamma_{jp}$ the predictions of the tree built in this fashion; (e) set $f_p(x) = f_{p-1}(x) + \eta \tilde{f}_p(x)$. 3. Output $\hat{f}(x) = f_{P^{\text{boost}}}(x)$. (A runnable sketch of this procedure is given after the table.) |
| Open Source Code | Yes | Code is provided at https://github.com/alanjeffares/not-double-descent. |
| Open Datasets | Yes | Throughout, we closely follow [BHMM19]'s experimental setup: they use standard benchmark datasets and train all ML methods by minimizing the squared loss. Similarly to their work, we focus on results using MNIST (with $n_{\text{train}} = 10000$) in the main text. We present additional results, including other datasets, and further discussion of the experimental setup in Appendix E. Datasets: We perform our experiments in the main text on the MNIST image recognition dataset [LBBH98] with inputs of dimension 784. In this section, we also repeat these experiments on the SVHN digit recognition task [NWC+11] as well as the CIFAR-10 object classification task [KH+09]. |
| Dataset Splits | Yes | Furthermore, we randomly sample a subset of 10,000 training examples and use the full test set in our evaluation. |
| Hardware Specification | Yes | All experiments are performed on an Azure FX4mds. This machine runs on an Intel Xeon Gold 6246R (Cascade Lake) processor with 84 GiB of memory. |
| Software Dependencies | No | We use regression trees and random forests as implemented in scikit-learn [PVG+11] for our tree experiments... Gradient boosting is implemented as described in [HTF09, Ch. 10.10]... We rely on the scikit-learn [PVG+11] implementation GradientBoostingRegressor... No specific version numbers for scikit-learn or other software dependencies are provided. |
| Experiment Setup | Yes | Experimental setup. We center our study around the non-neural experiments in [BHMM19] as it is the seminal paper on double descent and provides the broadest account of non-deep learning methods that exhibit double descent. Through multiple empirical studies, [BHMM19] demonstrate that double descent arises in trees, boosting and linear regression. Below, we re-analyze these experiments and highlight that in each study, as we transition from the classical U-shaped regime into the subsequent second descent regime, something else implicitly changes in the model or training definition, fundamentally changing the class of models under consideration exactly at the transition threshold between the observed regimes, which is precisely the cause of the second descent phenomenon. In Sec. 2, we begin by investigating the tree and boosting experiments, where this is straightforward to show. In Sec. 3, we then investigate the linear regression example, where decomposing the underlying mechanisms is non-trivial. Throughout, we closely follow [BHMM19]'s experimental setup: they use standard benchmark datasets and train all ML methods by minimizing the squared loss. In multi-class settings, [BHMM19] use a one-vs-rest strategy and report squared loss summed across classes. Their use of the squared loss is supported by recent work on squared loss for classification [HB21, MNS+21], and practically implies the use of standard regression implementations of the considered ML methods. Gradient Boosting (Sec. 2.2): Gradient boosting is implemented as described in [HTF09, Ch. 10.10] and restated in Appendix C.3.3, using trees with $P^{\text{leaf}} = 10$ and a random feature subset of size $\sqrt{d}$ considered at each split, and learning rate $\gamma = 0.85$ as used in [BHMM19]. (A configuration sketch matching this setup is given after the table.) |
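
The boosting pseudocode quoted above can be made concrete in a few lines of Python. The following is a minimal sketch, assuming a 1-D NumPy target and the squared loss; the helper names (`fit_gradient_boosted_trees`, `predict_boosted`) are illustrative and not taken from the paper's repository. For the squared loss, the negative gradient in step (a) is just the current residual, and the optimal leaf value in step (c) is the leaf-wise mean residual, which is exactly what a squared-error regression tree fit to the residuals produces.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def fit_gradient_boosted_trees(X, y, n_stages, learning_rate=0.85,
                               max_leaf_nodes=10, max_features=None):
    """Squared-loss gradient boosting following the quoted pseudocode (illustrative helper)."""
    f = np.zeros(len(y))                          # step 1: f_0(x) = 0
    trees = []
    for p in range(n_stages):                     # step 2: p = 1, ..., P_boost
        residuals = y - f                         # (a) negative gradient of the squared loss
        tree = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes,
                                     max_features=max_features)
        tree.fit(X, residuals)                    # (b)-(d) leaf predictions are mean residuals
        f = f + learning_rate * tree.predict(X)   # (e) shrunken additive update
        trees.append(tree)
    return trees


def predict_boosted(trees, X, learning_rate=0.85):
    """Step 3: since f_0 = 0, the final model is the sum of shrunken tree predictions."""
    return learning_rate * sum(tree.predict(X) for tree in trees)
```

Passing `max_features="sqrt"` would additionally reproduce the random feature subset of size sqrt(d) per split mentioned in the experiment setup row.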
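For the scikit-learn-based setup described in the last row, the configuration below is a plausible reconstruction rather than the authors' exact script: it maps the reported choices (10-leaf trees, sqrt(d) features per split, learning rate 0.85, squared loss, one-vs-rest on one-hot targets) onto `GradientBoostingRegressor` wrapped in `MultiOutputRegressor`; the `n_estimators` value is a placeholder, since the number of boosting rounds is the quantity varied in the experiments.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.multioutput import MultiOutputRegressor
from sklearn.preprocessing import LabelBinarizer

# Base regressor matching the reported hyperparameters. Note: the loss name
# "squared_error" requires scikit-learn >= 1.0 (older releases use "ls");
# the report above notes that no dependency versions are pinned.
base = GradientBoostingRegressor(
    loss="squared_error",
    learning_rate=0.85,       # reported learning rate
    max_leaf_nodes=10,        # trees with P_leaf = 10 leaves
    max_features="sqrt",      # random feature subset of size sqrt(d) per split
    n_estimators=100,         # placeholder; swept as the complexity axis
)

# One-vs-rest with the squared loss: regress each one-hot class indicator
# independently and report the squared loss summed across classes.
model = MultiOutputRegressor(base)

# Hypothetical usage (X_train, y_train, X_test, y_test not defined here):
# lb = LabelBinarizer()
# model.fit(X_train, lb.fit_transform(y_train))
# preds = model.predict(X_test)
# test_loss = np.mean(np.sum((lb.transform(y_test) - preds) ** 2, axis=1))
```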