On the Universality of the Double Descent Peak in Ridgeless Regression

Authors: David Holzmüller

ICLR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We complement our theory with further experimental and analytic results. ... We provide code to reproduce all of our experimental results at https://github.com/dholzmueller/universal_double_descent
Researcher Affiliation | Academia | David Holzmüller, University of Stuttgart, Faculty of Mathematics and Physics, Institute for Stochastics and Applications, david.holzmueller@mathematik.uni-stuttgart.de
Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks.
Open Source Code | Yes | We provide code to reproduce all of our experimental results at https://github.com/dholzmueller/universal_double_descent
Open Datasets | No | The paper mentions using synthetic distributions like N(0, I_d) and U(S^{p-1}) for its theoretical analysis and experiments, but does not refer to a publicly available dataset in the conventional sense (e.g., ImageNet, CIFAR).
Dataset Splits | No | The paper does not specify training, validation, or test dataset splits. The experiments are based on theoretical distributions and Monte Carlo estimates.
Hardware Specification | No | The paper does not provide any specific details about the hardware used for running the experiments (e.g., GPU/CPU models, memory, cloud instances).
Software Dependencies | No | The paper does not list specific software dependencies with version numbers.
Experiment Setup | Yes | In order to estimate E_Noise, we proceed as follows: Recall from Section 3 that ridgeless regression is the limit of ridge regression for λ → 0. We use a small regularization of λ = 10^-12 in order to improve numerical stability. ... For performance reasons, we make the following modification of steps (2) and (5): Since we perform the computation for all n ∈ [256], we sample X ∈ R^{256×d} and then, for all n ∈ [256], perform the computation for n using the first n rows of Z. ... For our optimized feature maps in Figure 1 with p = 30, we use a neural network feature map with d_0 = d = p = 30, d_1 = d_2 = 256, d_3 = p = 30 and tanh activation function. We use NTK parameterization and zero-initialized biases... We then optimize the loss function L(θ) := E_X tr((φ_θ(X)^+)^⊤ Σ_θ φ_θ(X)^+) using AMSGrad (Reddi et al., 2018) with a learning rate that linearly decays from 10^-3 to 0 over 1000 iterations. In order to approximate L(θ) in each iteration, we approximate Σ_θ using 1000 Monte Carlo points and we draw 1024 different realizations of X (this can be considered as using batch size 1024).
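
To make the first part of the Experiment Setup row more concrete, here is a minimal sketch (not the author's released code) of estimating the noise term by approximating the ridgeless limit with a tiny ridge regularization λ = 10^-12 and reusing the first n rows of one sampled matrix for every n. The identity feature map, Gaussian inputs N(0, I_d), the number of Monte Carlo runs, and all function names are assumptions made for illustration.

import numpy as np

def noise_term(Z: np.ndarray, Sigma: np.ndarray, lam: float = 1e-12) -> float:
    """tr((Z^+)^T Sigma Z^+), with Z^+ approximated via ridge regression
    using a small lam for numerical stability (stand-in for the ridgeless limit)."""
    p = Z.shape[1]
    Z_pinv = np.linalg.solve(Z.T @ Z + lam * np.eye(p), Z.T)  # ~ pseudoinverse of Z
    return float(np.trace(Z_pinv.T @ Sigma @ Z_pinv))

def estimate_noise_curve(d: int = 30, n_max: int = 256, n_runs: int = 100,
                         seed: int = 0) -> np.ndarray:
    """Monte Carlo estimate of the noise term for every n in [n_max].
    One matrix is sampled per run; the computation for n reuses its first n rows."""
    rng = np.random.default_rng(seed)
    Sigma = np.eye(d)  # feature covariance of the identity map under N(0, I_d)
    curve = np.zeros(n_max)
    for _ in range(n_runs):
        X = rng.standard_normal((n_max, d))           # X in R^{256 x d}
        for n in range(1, n_max + 1):
            curve[n - 1] += noise_term(X[:n], Sigma)  # first n rows only
    return curve / n_runs

if __name__ == "__main__":
    curve = estimate_noise_curve()
    print(curve[25:35])  # values around n = d = 30, where the peak is expected

Under these assumptions the estimated values blow up as n approaches the feature dimension, which is the double descent peak the paper argues is universal.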
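
The second part of that row describes optimizing a neural-network feature map. Below is a hedged sketch of how such a training loop could look in PyTorch, shown here as minimization of L(θ). The NTKFeatureMap class, the choice of n, and the use of torch.linalg.pinv are illustrative assumptions rather than the paper's actual implementation (available at the repository above); torch.optim.Adam with amsgrad=True does implement AMSGrad.

import torch

class NTKFeatureMap(torch.nn.Module):
    """Hypothetical tanh network phi_theta with d_0 = 30 -> 256 -> 256 -> d_3 = 30,
    NTK parameterization (1/sqrt(fan_in) scaling at forward time), zero-initialized biases."""
    def __init__(self, dims=(30, 256, 256, 30)):
        super().__init__()
        self.weights = torch.nn.ParameterList(
            [torch.nn.Parameter(torch.randn(dims[i + 1], dims[i])) for i in range(len(dims) - 1)])
        self.biases = torch.nn.ParameterList(
            [torch.nn.Parameter(torch.zeros(dims[i + 1])) for i in range(len(dims) - 1)])

    def forward(self, x):
        for i, (W, b) in enumerate(zip(self.weights, self.biases)):
            x = x @ W.t() / W.shape[1] ** 0.5 + b
            if i < len(self.weights) - 1:
                x = torch.tanh(x)
        return x

d, n, n_mc, batch, steps = 30, 40, 1000, 1024, 1000  # n = 40 is an arbitrary illustrative choice
phi = NTKFeatureMap()
opt = torch.optim.Adam(phi.parameters(), lr=1e-3, amsgrad=True)               # AMSGrad
sched = torch.optim.lr_scheduler.LambdaLR(opt, lambda t: 1.0 - t / steps)     # linear decay to 0

for step in range(steps):
    mc = torch.randn(n_mc, d)                 # 1000 Monte Carlo points to approximate Sigma_theta
    feats = phi(mc)
    Sigma = feats.t() @ feats / n_mc
    X = torch.randn(batch, n, d)              # 1024 realizations of X ("batch size 1024")
    Z = phi(X)                                # shape (batch, n, p)
    Z_pinv = torch.linalg.pinv(Z)             # shape (batch, p, n)
    # L(theta) ~ mean over the batch of tr((phi_theta(X)^+)^T Sigma_theta phi_theta(X)^+)
    loss = torch.einsum('bpn,pq,bqn->', Z_pinv, Sigma, Z_pinv) / batch
    opt.zero_grad()
    loss.backward()
    opt.step()
    sched.step()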