Early-stopped neural networks are consistent
Authors: Ziwei Ji, Justin Li, Matus Telgarsky
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | This work studies the behavior of shallow ReLU networks trained with the logistic loss via gradient descent on binary classification data where the underlying data distribution is general, and the (optimal) Bayes risk is not necessarily zero. In this setting, it is shown that gradient descent with early stopping achieves population risk arbitrarily close to optimal in terms of not just logistic and misclassification losses, but also in terms of calibration, meaning the sigmoid mapping of its outputs approximates the true underlying conditional distribution arbitrarily finely. (A code sketch of this setup follows the table.) |
| Researcher Affiliation | Academia | Ziwei Ji, Justin D. Li, Matus Telgarsky ({ziweiji2,jdli3,mjt}@illinois.edu), University of Illinois, Urbana-Champaign |
| Pseudocode | No | The paper describes procedures and mathematical formulations but does not contain a structured pseudocode block or an algorithm block explicitly labeled as such. |
| Open Source Code | No | The paper does not provide any specific statement about releasing source code for the described methodology, nor does it include links to a code repository. |
| Open Datasets | Yes | For this, we conducted a simple experiment. Noting that we can freeze the initial features and train linear predictors of the form f^(0)(x; V) for weights V ∈ R^{m×d} (cf. section 1.4), and that the performance converges to the infinite-width performance as m → ∞, we fixed a large width and trained two prediction tasks: an easy task of MNIST 1 vs 5 until R̂_easy ≤ n^{-1/2}, and a hard task of MNIST 3 vs 5 until R̂_hard ≤ n^{-1/2}. (A sketch of this frozen-feature procedure follows the table.) |
| Dataset Splits | No | The paper mentions using MNIST for a 'simple experiment' and discusses training and obtaining 'test error' but does not specify explicit training/validation/test dataset splits, percentages, or methodology for splitting the data. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory, or specific computer configurations) used for any experiments. |
| Software Dependencies | No | The paper does not list any specific software dependencies with version numbers. |
| Experiment Setup | Yes | Choose step size η := 4/ρ², and run gradient descent for t := 1/(8ϵ_gd) iterations, selecting the iterate W_{≤t} := arg min{ R̂(W_i) : i ≤ t, ‖W_i − W_0‖ ≤ R_gd }. (A sketch of this selection rule follows the table.) |
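
The setting quoted under Research Type can be illustrated concretely. The following is a minimal NumPy sketch, not the authors' code: a width-m shallow ReLU network is trained with the logistic loss by plain gradient descent on synthetic noisy binary data, early stopping keeps the iterate with the best held-out risk, and the sigmoid of the resulting outputs is read off as an estimate of the conditional probability P(y = 1 | x). The data-generating function `eta_true`, the width, the step size, and the held-out stopping rule are illustrative assumptions rather than the paper's choices.

```python
# Minimal sketch (not the authors' code): width-m shallow ReLU network trained
# with the logistic loss by gradient descent on noisy binary data, early-stopped
# on held-out risk; sigmoid(f) is read off as an estimate of P(y = 1 | x).
import numpy as np

rng = np.random.default_rng(0)

def eta_true(x):                                   # illustrative P(y=1|x); Bayes risk > 0
    return 1.0 / (1.0 + np.exp(-3.0 * x[:, 0]))

def sample(n, d=2):
    X = rng.normal(size=(n, d))
    y = np.where(rng.uniform(size=n) < eta_true(X), 1.0, -1.0)   # labels in {-1, +1}
    return X, y

m, d, n, step = 512, 2, 2000, 0.5                  # illustrative width, dimension, sample size, step size
X, y = sample(n)
Xval, yval = sample(n)
W = rng.normal(size=(m, d))                        # trained first layer
a = rng.choice([-1.0, 1.0], size=m)                # fixed signed output layer

def forward(W, X):
    return np.maximum(X @ W.T, 0.0) @ a / np.sqrt(m)

def logistic_risk(W, X, y):
    return np.mean(np.logaddexp(0.0, -y * forward(W, X)))   # mean log(1 + exp(-y f)), overflow-safe

best_val, best_W = np.inf, W.copy()
for _ in range(2000):
    z = X @ W.T                                    # pre-activations, shape (n, m)
    f = np.maximum(z, 0.0) @ a / np.sqrt(m)
    g = -y * np.exp(-np.logaddexp(0.0, y * f))     # = -y / (1 + exp(y f)), derivative of the loss in f
    grad = ((g[:, None] * (z > 0) * a / np.sqrt(m)).T @ X) / n
    W = W - step * grad
    val = logistic_risk(W, Xval, yval)             # early stopping: keep the best held-out iterate
    if val < best_val:
        best_val, best_W = val, W.copy()

p_hat = 1.0 / (1.0 + np.exp(-forward(best_W, Xval)))   # calibrated probability estimates
print("mean |sigmoid(f) - eta(x)| on held-out data:", np.mean(np.abs(p_hat - eta_true(Xval))))
```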
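
The frozen-feature experiment quoted under Open Datasets can be sketched similarly. Below, the first-layer weights W0 and output signs a are frozen at initialization, and only the linear parameters V of the linearized predictor f^(0)(x; V) are trained by gradient descent until the empirical logistic risk falls below n^{-1/2}, mirroring the quoted stopping criterion. The synthetic binary task, the width m, the step size, and the iteration cap are illustrative assumptions; the paper's experiment uses MNIST 1 vs 5 and 3 vs 5.

```python
# Minimal sketch (not the authors' code) of the frozen-feature experiment: W0 and
# a are frozen at initialization, and only V in the linearized predictor
#     f0(x; V) = (1/sqrt(m)) * sum_j a_j * 1[<w_j^0, x> > 0] * <v_j, x>
# is trained until the empirical logistic risk falls below n^{-1/2} (or a cap is hit).
import numpy as np

rng = np.random.default_rng(0)
n, d, m, step = 1000, 20, 2048, 0.5                # illustrative sizes and step size

X = rng.normal(size=(n, d))
y = np.sign(X @ rng.normal(size=d) + 0.1 * rng.normal(size=n))   # stand-in labels in {-1, +1}

W0 = rng.normal(size=(m, d))                       # frozen initial first-layer weights
a = rng.choice([-1.0, 1.0], size=m)                # frozen output signs
gates = (X @ W0.T > 0).astype(float) * a           # a_j * 1[<w_j^0, x_i> > 0], shape (n, m)

V = np.zeros((m, d))
target = 1.0 / np.sqrt(n)                          # stopping threshold: R_hat <= n^{-1/2}
for t in range(3000):                              # iteration cap keeps the sketch bounded
    f = ((gates @ V) * X).sum(axis=1) / np.sqrt(m) # f0(x_i; V) for every sample i
    risk = np.mean(np.logaddexp(0.0, -y * f))      # empirical logistic risk, overflow-safe
    if risk <= target:
        break
    g = -y * np.exp(-np.logaddexp(0.0, y * f))     # per-sample derivative of the logistic loss
    V = V - step * (gates.T @ (g[:, None] * X)) / (n * np.sqrt(m))

print(f"iteration {t}: empirical risk {risk:.4f} (target {target:.4f})")
```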
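
Finally, the Experiment Setup cell quotes the theoretical gradient-descent schedule and iterate-selection rule. The sketch below implements that rule under the assumption that empirical-risk and gradient oracles are available as black-box callables (`empirical_risk` and `gradient` are stand-ins for the paper's R̂ and its gradient); the toy quadratic in the usage part is purely illustrative.

```python
# Minimal sketch (not the authors' code) of the quoted iterate-selection rule:
# run gradient descent with step size eta = 4 / rho**2 for t = 1/(8*eps_gd)
# iterations and return the iterate with the smallest empirical risk among those
# staying within distance R_gd of the initialization W0.
import numpy as np

def early_stopped_gd(W0, empirical_risk, gradient, rho, eps_gd, R_gd):
    eta = 4.0 / rho ** 2                           # step size from the quoted setup
    t = int(np.ceil(1.0 / (8.0 * eps_gd)))         # iteration count from the quoted setup
    W = W0.copy()
    best_W, best_risk = W0.copy(), empirical_risk(W0)
    for _ in range(t):
        W = W - eta * gradient(W)
        if np.linalg.norm(W - W0) <= R_gd:         # only iterates inside the ball are eligible
            r = empirical_risk(W)
            if r < best_risk:
                best_risk, best_W = r, W.copy()
    return best_W

# Toy usage on a smooth quadratic surrogate, purely to exercise the rule.
rng = np.random.default_rng(0)
A = rng.normal(size=(5, 5)); A = A @ A.T / 5.0
W0 = rng.normal(size=5)
W_sel = early_stopped_gd(
    W0,
    empirical_risk=lambda W: 0.5 * float(W @ A @ W),
    gradient=lambda W: A @ W,
    rho=4.0, eps_gd=0.01, R_gd=10.0,
)
print("risk at W0:", 0.5 * float(W0 @ A @ W0), "| risk at selected iterate:", 0.5 * float(W_sel @ A @ W_sel))
```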