On the Universality of the Double Descent Peak in Ridgeless Regression

Authors: David Holzmüller

ICLR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We complement our theory with further experimental and analytic results. ... We provide code to reproduce all of our experimental results at https://github.com/dholzmueller/universal_double_descent
Researcher Affiliation | Academia | David Holzmüller, University of Stuttgart, Faculty of Mathematics and Physics, Institute for Stochastics and Applications, david.holzmueller@mathematik.uni-stuttgart.de
Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks.
Open Source Code | Yes | We provide code to reproduce all of our experimental results at https://github.com/dholzmueller/universal_double_descent
Open Datasets | No | The paper mentions using synthetic distributions like N(0, I_d) and U(S^{p-1}) for its theoretical analysis and experiments, but does not refer to a publicly available dataset in the conventional sense (e.g., ImageNet, CIFAR).
Dataset Splits | No | The paper does not specify training, validation, or test dataset splits. The experiments are based on theoretical distributions and Monte Carlo estimates.
Hardware Specification | No | The paper does not provide any specific details about the hardware used for running the experiments (e.g., GPU/CPU models, memory, cloud instances).
Software Dependencies | No | The paper does not list specific software dependencies with version numbers.
Experiment Setup | Yes | In order to estimate E_Noise, we proceed as follows: Recall from Section 3 that ridgeless regression is the limit of ridge regression for λ → 0. We use a small regularization of λ = 10^-12 in order to improve numerical stability. ... For performance reasons, we make the following modification of steps (2) and (5): Since we perform the computation for all n ∈ [256], we sample X ∈ R^{256×d} and then, for all n ∈ [256], perform the computation for n using the first n rows of Z. ... For our optimized feature maps in Figure 1 with p = 30, we use a neural network feature map with d_0 = d = p = 30, d_1 = d_2 = 256, d_3 = p = 30 and tanh activation function. We use NTK parameterization and zero-initialized biases... We then optimize the loss function L(θ) := E_X tr((φ_θ(X)^+)^⊤ Σ_θ φ_θ(X)^+) using AMSGrad (Reddi et al., 2018) with a learning rate that linearly decays from 10^-3 to 0 over 1000 iterations. In order to approximate L(θ) in each iteration, we approximate Σ_θ using 1000 Monte Carlo points and we draw 1024 different realizations of X (this can be considered as using batch size 1024).
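
To make the first part of the Experiment Setup row more concrete, here is a minimal sketch (not the author's released code) of estimating the noise term by approximating the ridgeless limit with a tiny ridge regularization λ = 10^-12 and reusing the first n rows of one sampled matrix for every n. The identity feature map, Gaussian inputs N(0, I_d), the number of Monte Carlo runs, and all function names are assumptions made for illustration.

import numpy as np

def noise_term(Z: np.ndarray, Sigma: np.ndarray, lam: float = 1e-12) -> float:
    """tr((Z^+)^T Sigma Z^+), with Z^+ approximated via ridge regression
    using a small lam for numerical stability (stand-in for the ridgeless limit)."""
    p = Z.shape[1]
    Z_pinv = np.linalg.solve(Z.T @ Z + lam * np.eye(p), Z.T)  # ~ pseudoinverse of Z
    return float(np.trace(Z_pinv.T @ Sigma @ Z_pinv))

def estimate_noise_curve(d: int = 30, n_max: int = 256, n_runs: int = 100,
                         seed: int = 0) -> np.ndarray:
    """Monte Carlo estimate of the noise term for every n in [n_max].
    One matrix is sampled per run; the computation for n reuses its first n rows."""
    rng = np.random.default_rng(seed)
    Sigma = np.eye(d)  # feature covariance of the identity map under N(0, I_d)
    curve = np.zeros(n_max)
    for _ in range(n_runs):
        X = rng.standard_normal((n_max, d))           # X in R^{256 x d}
        for n in range(1, n_max + 1):
            curve[n - 1] += noise_term(X[:n], Sigma)  # first n rows only
    return curve / n_runs

if __name__ == "__main__":
    curve = estimate_noise_curve()
    print(curve[25:35])  # values around n = d = 30, where the peak is expected

Under these assumptions the estimated values blow up as n approaches the feature dimension, which is the double descent peak the paper argues is universal.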
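
The second part of that row describes optimizing a neural-network feature map. Below is a hedged sketch of how such a training loop could look in PyTorch, shown here as minimization of L(θ). The NTKFeatureMap class, the choice of n, and the use of torch.linalg.pinv are illustrative assumptions rather than the paper's actual implementation (available at the repository above); torch.optim.Adam with amsgrad=True does implement AMSGrad.

import torch

class NTKFeatureMap(torch.nn.Module):
    """Hypothetical tanh network phi_theta with d_0 = 30 -> 256 -> 256 -> d_3 = 30,
    NTK parameterization (1/sqrt(fan_in) scaling at forward time), zero-initialized biases."""
    def __init__(self, dims=(30, 256, 256, 30)):
        super().__init__()
        self.weights = torch.nn.ParameterList(
            [torch.nn.Parameter(torch.randn(dims[i + 1], dims[i])) for i in range(len(dims) - 1)])
        self.biases = torch.nn.ParameterList(
            [torch.nn.Parameter(torch.zeros(dims[i + 1])) for i in range(len(dims) - 1)])

    def forward(self, x):
        for i, (W, b) in enumerate(zip(self.weights, self.biases)):
            x = x @ W.t() / W.shape[1] ** 0.5 + b
            if i < len(self.weights) - 1:
                x = torch.tanh(x)
        return x

d, n, n_mc, batch, steps = 30, 40, 1000, 1024, 1000  # n = 40 is an arbitrary illustrative choice
phi = NTKFeatureMap()
opt = torch.optim.Adam(phi.parameters(), lr=1e-3, amsgrad=True)               # AMSGrad
sched = torch.optim.lr_scheduler.LambdaLR(opt, lambda t: 1.0 - t / steps)     # linear decay to 0

for step in range(steps):
    mc = torch.randn(n_mc, d)                 # 1000 Monte Carlo points to approximate Sigma_theta
    feats = phi(mc)
    Sigma = feats.t() @ feats / n_mc
    X = torch.randn(batch, n, d)              # 1024 realizations of X ("batch size 1024")
    Z = phi(X)                                # shape (batch, n, p)
    Z_pinv = torch.linalg.pinv(Z)             # shape (batch, p, n)
    # L(theta) ~ mean over the batch of tr((phi_theta(X)^+)^T Sigma_theta phi_theta(X)^+)
    loss = torch.einsum('bpn,pq,bqn->', Z_pinv, Sigma, Z_pinv) / batch
    opt.zero_grad()
    loss.backward()
    opt.step()
    sched.step()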