On the Universality of the Double Descent Peak in Ridgeless Regression
Authors: David Holzmüller
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We complement our theory with further experimental and analytic results. ... We provide code to reproduce all of our experimental results at https://github.com/dholzmueller/universal_double_descent |
| Researcher Affiliation | Academia | David Holzmüller, University of Stuttgart, Faculty of Mathematics and Physics, Institute for Stochastics and Applications, david.holzmueller@mathematik.uni-stuttgart.de |
| Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | Yes | We provide code to reproduce all of our experimental results at https://github.com/dholzmueller/universal_double_descent |
| Open Datasets | No | The paper mentions using synthetic distributions like N(0, I_d) and U(S^{p-1}) for its theoretical analysis and experiments, but does not refer to a publicly available dataset in the conventional sense (e.g., ImageNet, CIFAR). |
| Dataset Splits | No | The paper does not specify training, validation, or test dataset splits. The experiments are based on theoretical distributions and Monte Carlo estimates. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware used for running the experiments (e.g., GPU/CPU models, memory, cloud instances). |
| Software Dependencies | No | The paper does not list specific software dependencies with version numbers. |
| Experiment Setup | Yes | In order to estimate E_Noise, we proceed as follows: Recall from Section 3 that ridgeless regression is the limit of ridge regression for λ → 0. We use a small regularization of λ = 10^{-12} in order to improve numerical stability. ... For performance reasons, we make the following modification of step (2) and (5): Since we perform the computation for all n ∈ [256], we sample X ∈ R^{256×d} and then, for all n ∈ [256], perform the computation for n using the first n rows of Z. ... For our optimized feature maps in Figure 1 with p = 30, we use a neural network feature map with d_0 = d = p = 30, d_1 = d_2 = 256, d_3 = p = 30 and tanh activation function. We use NTK parameterization and zero-initialized biases... We then optimize the loss function L(θ) := E_X tr((φ_θ(X)^+)^T Σ_θ φ_θ(X)^+) using AMSGrad (Reddi et al., 2018) with a learning rate that linearly decays from 10^{-3} to 0 over 1000 iterations. In order to approximate L(θ) in each iteration, we approximate Σ_θ using 1000 Monte Carlo points and we draw 1024 different realizations of X (this can be considered as using batch size 1024). |
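
The E_Noise estimation quoted in the last row reduces to computing tr((Z^+)^T Σ Z^+) for feature matrices Z of growing sample size, with the pseudoinverse replaced by a ridge solve at λ = 10^{-12}. Below is a minimal NumPy sketch of that loop, assuming inputs drawn from N(0, I_d); the function names, the number of repetitions, and the Monte Carlo sample count for Σ are illustrative choices, not values taken from the released code.

```python
import numpy as np

def noise_error(Z, Sigma, lam=1e-12):
    """tr((Z^+)^T Sigma Z^+), with Z^+ approximated by the ridge solution
    (Z^T Z + lam * I)^{-1} Z^T for numerical stability."""
    p = Z.shape[1]
    Z_pinv = np.linalg.solve(Z.T @ Z + lam * np.eye(p), Z.T)  # shape (p, n)
    return np.trace(Z_pinv.T @ Sigma @ Z_pinv)

def estimate_noise_curve(feature_map, d, n_max=256, n_reps=100,
                         n_mc=10_000, lam=1e-12, seed=0):
    """Monte Carlo estimate of the noise error for n = 1, ..., n_max.

    feature_map maps an (m, d) input matrix to an (m, p) feature matrix;
    inputs are drawn from N(0, I_d) as in the synthetic setup.
    """
    rng = np.random.default_rng(seed)
    # Monte Carlo estimate of the feature covariance Sigma
    Phi = feature_map(rng.standard_normal((n_mc, d)))
    Sigma = Phi.T @ Phi / n_mc
    curve = np.zeros(n_max)
    for _ in range(n_reps):
        # sample one 256 x d input matrix, reuse its first n rows for every n
        Z = feature_map(rng.standard_normal((n_max, d)))
        for n in range(1, n_max + 1):
            curve[n - 1] += noise_error(Z[:n], Sigma, lam)
    return curve / n_reps

# Example: plain linear features with d = p = 30; the curve should peak near n = p.
curve = estimate_noise_curve(lambda X: X, d=30)
```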
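For the optimized feature maps, the quoted setup specifies a tanh network with widths 30-256-256-30, NTK parameterization, zero-initialized biases, and AMSGrad with a learning rate decaying linearly from 10^{-3} to 0 over 1000 iterations, with Σ_θ estimated from 1000 Monte Carlo points and 1024 realizations of X per step. The PyTorch sketch below follows that description; the per-realization sample size n and the exact NTK scaling details are assumptions where the quote leaves them open.

```python
import torch
import torch.nn as nn

class NTKFeatureMap(nn.Module):
    """tanh network in NTK parameterization: weights ~ N(0, 1), rescaled by
    1/sqrt(fan_in) in the forward pass; biases initialized to zero."""
    def __init__(self, dims=(30, 256, 256, 30)):
        super().__init__()
        self.weights = nn.ParameterList(
            [nn.Parameter(torch.randn(d_out, d_in))
             for d_in, d_out in zip(dims[:-1], dims[1:])])
        self.biases = nn.ParameterList(
            [nn.Parameter(torch.zeros(d_out)) for d_out in dims[1:]])

    def forward(self, x):
        for i, (W, b) in enumerate(zip(self.weights, self.biases)):
            x = x @ W.t() / W.shape[1] ** 0.5 + b
            if i < len(self.weights) - 1:
                x = torch.tanh(x)
        return x

def train_feature_map(model, d=30, n=30, n_iters=1000, n_mc=1000,
                      batch=1024, lam=1e-12):
    """Minimize L(theta) = E_X tr((phi(X)^+)^T Sigma_theta phi(X)^+) with
    AMSGrad and an (approximately) linearly decaying learning rate.
    The per-realization sample size n is an assumption."""
    opt = torch.optim.Adam(model.parameters(), lr=1e-3, amsgrad=True)
    sched = torch.optim.lr_scheduler.LambdaLR(opt, lambda t: 1.0 - t / n_iters)
    for _ in range(n_iters):
        opt.zero_grad()
        # Monte Carlo estimate of the feature covariance Sigma_theta
        Phi = model(torch.randn(n_mc, d))
        Sigma = Phi.t() @ Phi / n_mc
        # 1024 realizations of X, each with n rows ("batch size 1024")
        Z = model(torch.randn(batch, n, d))                    # (batch, n, p)
        p = Z.shape[-1]
        # ridge approximation of the pseudoinverse Z^+, shape (batch, p, n)
        Z_pinv = torch.linalg.solve(
            Z.transpose(1, 2) @ Z + lam * torch.eye(p), Z.transpose(1, 2))
        # tr((Z^+)^T Sigma Z^+), averaged over the batch of realizations
        loss = torch.einsum('bai,ac,bci->b', Z_pinv, Sigma, Z_pinv).mean()
        loss.backward()
        opt.step()
        sched.step()
    return model
```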