Early-stopped neural networks are consistent
Authors: Ziwei Ji, Justin Li, Matus Telgarsky
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | This work studies the behavior of shallow ReLU networks trained with the logistic loss via gradient descent on binary classification data where the underlying data distribution is general, and the (optimal) Bayes risk is not necessarily zero. In this setting, it is shown that gradient descent with early stopping achieves population risk arbitrarily close to optimal in terms of not just logistic and misclassification losses, but also in terms of calibration, meaning the sigmoid mapping of its outputs approximates the true underlying conditional distribution arbitrarily finely. (A code sketch of this setup follows the table.) |
| Researcher Affiliation | Academia | Ziwei Ji, Justin D. Li, Matus Telgarsky ({ziweiji2,jdli3,mjt}@illinois.edu), University of Illinois, Urbana-Champaign |
| Pseudocode | No | The paper describes procedures and mathematical formulations but does not contain a structured pseudocode block or an algorithm block explicitly labeled as such. |
| Open Source Code | No | The paper does not provide any specific statement about releasing source code for the described methodology, nor does it include links to a code repository. |
| Open Datasets | Yes | For this, we conducted a simple experiment. Noting that we can freeze the initial features and train linear predictors of the form f^(0)(x; V) for weights V ∈ R^{m×d} (cf. section 1.4), and that the performance converges to the infinite-width performance as m → ∞, we fixed a large width and trained two prediction tasks: an easy task of MNIST 1 vs 5 until R̂_easy ≤ n^{-1/2}, and a hard task of MNIST 3 vs 5 until R̂_hard ≤ n^{-1/2}. (A sketch of this frozen-feature procedure follows the table.) |
| Dataset Splits | No | The paper mentions using MNIST for a 'simple experiment' and discusses training and obtaining 'test error' but does not specify explicit training/validation/test dataset splits, percentages, or methodology for splitting the data. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory, or specific computer configurations) used for any experiments. |
| Software Dependencies | No | The paper does not list any specific software dependencies with version numbers. |
| Experiment Setup | Yes | Choose step size η := 4/ρ², and run gradient descent for t := 1/(8ϵ_gd) iterations, selecting the iterate W_{≤t} := arg min{ R̂(W_i) : i ≤ t, ‖W_i − W_0‖ ≤ R_gd }. (A sketch of this selection rule follows the table.) |
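
The setting quoted under Research Type can be illustrated concretely. The following is a minimal NumPy sketch, not the authors' code: a width-m shallow ReLU network is trained with the logistic loss by plain gradient descent on synthetic noisy binary data, early stopping keeps the iterate with the best held-out risk, and the sigmoid of the resulting outputs is read off as an estimate of the conditional probability P(y = 1 | x). The data-generating function `eta_true`, the width, the step size, and the held-out stopping rule are illustrative assumptions rather than the paper's choices.

```python
# Minimal sketch (not the authors' code): width-m shallow ReLU network trained
# with the logistic loss by gradient descent on noisy binary data, early-stopped
# on held-out risk; sigmoid(f) is read off as an estimate of P(y = 1 | x).
import numpy as np

rng = np.random.default_rng(0)

def eta_true(x):                                   # illustrative P(y=1|x); Bayes risk > 0
    return 1.0 / (1.0 + np.exp(-3.0 * x[:, 0]))

def sample(n, d=2):
    X = rng.normal(size=(n, d))
    y = np.where(rng.uniform(size=n) < eta_true(X), 1.0, -1.0)   # labels in {-1, +1}
    return X, y

m, d, n, step = 512, 2, 2000, 0.5                  # illustrative width, dimension, sample size, step size
X, y = sample(n)
Xval, yval = sample(n)
W = rng.normal(size=(m, d))                        # trained first layer
a = rng.choice([-1.0, 1.0], size=m)                # fixed signed output layer

def forward(W, X):
    return np.maximum(X @ W.T, 0.0) @ a / np.sqrt(m)

def logistic_risk(W, X, y):
    return np.mean(np.logaddexp(0.0, -y * forward(W, X)))   # mean log(1 + exp(-y f)), overflow-safe

best_val, best_W = np.inf, W.copy()
for _ in range(2000):
    z = X @ W.T                                    # pre-activations, shape (n, m)
    f = np.maximum(z, 0.0) @ a / np.sqrt(m)
    g = -y * np.exp(-np.logaddexp(0.0, y * f))     # = -y / (1 + exp(y f)), derivative of the loss in f
    grad = ((g[:, None] * (z > 0) * a / np.sqrt(m)).T @ X) / n
    W = W - step * grad
    val = logistic_risk(W, Xval, yval)             # early stopping: keep the best held-out iterate
    if val < best_val:
        best_val, best_W = val, W.copy()

p_hat = 1.0 / (1.0 + np.exp(-forward(best_W, Xval)))   # calibrated probability estimates
print("mean |sigmoid(f) - eta(x)| on held-out data:", np.mean(np.abs(p_hat - eta_true(Xval))))
```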
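
The frozen-feature experiment quoted under Open Datasets can be sketched similarly. Below, the first-layer weights W0 and output signs a are frozen at initialization, and only the linear parameters V of the linearized predictor f^(0)(x; V) are trained by gradient descent until the empirical logistic risk falls below n^{-1/2}, mirroring the quoted stopping criterion. The synthetic binary task, the width m, the step size, and the iteration cap are illustrative assumptions; the paper's experiment uses MNIST 1 vs 5 and 3 vs 5.

```python
# Minimal sketch (not the authors' code) of the frozen-feature experiment: W0 and
# a are frozen at initialization, and only V in the linearized predictor
#     f0(x; V) = (1/sqrt(m)) * sum_j a_j * 1[<w_j^0, x> > 0] * <v_j, x>
# is trained until the empirical logistic risk falls below n^{-1/2} (or a cap is hit).
import numpy as np

rng = np.random.default_rng(0)
n, d, m, step = 1000, 20, 2048, 0.5                # illustrative sizes and step size

X = rng.normal(size=(n, d))
y = np.sign(X @ rng.normal(size=d) + 0.1 * rng.normal(size=n))   # stand-in labels in {-1, +1}

W0 = rng.normal(size=(m, d))                       # frozen initial first-layer weights
a = rng.choice([-1.0, 1.0], size=m)                # frozen output signs
gates = (X @ W0.T > 0).astype(float) * a           # a_j * 1[<w_j^0, x_i> > 0], shape (n, m)

V = np.zeros((m, d))
target = 1.0 / np.sqrt(n)                          # stopping threshold: R_hat <= n^{-1/2}
for t in range(3000):                              # iteration cap keeps the sketch bounded
    f = ((gates @ V) * X).sum(axis=1) / np.sqrt(m) # f0(x_i; V) for every sample i
    risk = np.mean(np.logaddexp(0.0, -y * f))      # empirical logistic risk, overflow-safe
    if risk <= target:
        break
    g = -y * np.exp(-np.logaddexp(0.0, y * f))     # per-sample derivative of the logistic loss
    V = V - step * (gates.T @ (g[:, None] * X)) / (n * np.sqrt(m))

print(f"iteration {t}: empirical risk {risk:.4f} (target {target:.4f})")
```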
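
Finally, the Experiment Setup cell quotes the theoretical gradient-descent schedule and iterate-selection rule. The sketch below implements that rule under the assumption that empirical-risk and gradient oracles are available as black-box callables (`empirical_risk` and `gradient` are stand-ins for the paper's R̂ and its gradient); the toy quadratic in the usage part is purely illustrative.

```python
# Minimal sketch (not the authors' code) of the quoted iterate-selection rule:
# run gradient descent with step size eta = 4 / rho**2 for t = 1/(8*eps_gd)
# iterations and return the iterate with the smallest empirical risk among those
# staying within distance R_gd of the initialization W0.
import numpy as np

def early_stopped_gd(W0, empirical_risk, gradient, rho, eps_gd, R_gd):
    eta = 4.0 / rho ** 2                           # step size from the quoted setup
    t = int(np.ceil(1.0 / (8.0 * eps_gd)))         # iteration count from the quoted setup
    W = W0.copy()
    best_W, best_risk = W0.copy(), empirical_risk(W0)
    for _ in range(t):
        W = W - eta * gradient(W)
        if np.linalg.norm(W - W0) <= R_gd:         # only iterates inside the ball are eligible
            r = empirical_risk(W)
            if r < best_risk:
                best_risk, best_W = r, W.copy()
    return best_W

# Toy usage on a smooth quadratic surrogate, purely to exercise the rule.
rng = np.random.default_rng(0)
A = rng.normal(size=(5, 5)); A = A @ A.T / 5.0
W0 = rng.normal(size=5)
W_sel = early_stopped_gd(
    W0,
    empirical_risk=lambda W: 0.5 * float(W @ A @ W),
    gradient=lambda W: A @ W,
    rho=4.0, eps_gd=0.01, R_gd=10.0,
)
print("risk at W0:", 0.5 * float(W0 @ A @ W0), "| risk at selected iterate:", 0.5 * float(W_sel @ A @ W_sel))
```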