Rethinking Gauss-Newton for learning over-parameterized models

Authors: Michael Arbel, Romain Menegaux, Pierre Wolinski

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We then perform an empirical study on a synthetic regression task to investigate the implicit bias of GN's method.
Researcher Affiliation | Academia | Michael Arbel, Romain Menegaux, and Pierre Wolinski; Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK, 38000 Grenoble, France; firstname.lastname@inria.fr
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide any concrete access to source code (e.g., a specific repository link or an explicit code release statement) for the methodology described.
Open Datasets | Yes | Additional experiments using MNIST dataset [16] are provided in Appendix C.4.
Dataset Splits | No | For the synthetic data, the paper mentions 'N training points' and '10000 test samples' but does not specify a separate validation split. For the MNIST experiments, it details the construction of the 'training set' and 'testing dataset' but does not include a distinct validation set or specific split percentages across all three typical partitions.
Hardware Specification | No | The paper mentions that experiments were run 'on a GPU' and that 'This work was granted access to the HPC resources of IDRIS under the allocation 2023-AD011013762R1 made by GENCI.' However, it does not specify concrete hardware details such as exact GPU/CPU models, processor types, or memory amounts.
Software Dependencies | No | The paper does not provide specific ancillary software details, such as library or solver names with version numbers, needed to replicate the experiments.
Experiment Setup | Yes | (GN): We use the discrete GN updates in (4) with a constant step-size λ and H_w = I. Each update is obtained using Woodbury's matrix identity by writing Φ(w_k) = J_{w_k}^T z_k, with z_k the solution of a linear system (J_{w_k} J_{w_k}^T + ϵ(w_k) I) z_k = ∇L(f_{w_k}) of size N. Here, we use the damping defined in (5) with α = 1 and ensure it never falls below ϵ_0 = 10^-7 to avoid numerical instabilities. (GD): The model's parameters w are learned using gradient descent with a constant step-size λ. Initialization. We initialize the student's hidden units according to a centered Gaussian with standard deviation (std) τ_0 ranging from 10^-3 to 10^3. Finally, we initialize the weights of the last layer to be 0. Stopping criterion. For both (GD) and (GN), we perform as many iterations as needed so that the final training error is at least below 10^-5. Additionally, we stop the algorithm whenever the training error drops below 10^-7 or when a maximum number of K_GD = 10^6 iterations for (GD) or K_GN = 10^5 iterations for (GN) is reached.
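
To make the quoted (GN) update concrete, below is a minimal PyTorch sketch of one damped Gauss-Newton step computed through Woodbury's identity. It is not the authors' released code: the helpers jacobian_fn, loss_grad_fn, and damping_fn (standing in for the Jacobian J_w, the functional gradient ∇L(f_w), and the adaptive damping of eq. (5)) are hypothetical placeholders; only the H_w = I structure, the ϵ_0 = 10^-7 floor, and the size-N Woodbury system follow the setup described above.

import torch

EPS0 = 1e-7  # floor on the damping eps(w), as stated in the quoted setup

def gauss_newton_step(w, jacobian_fn, loss_grad_fn, damping_fn, lam):
    # One discrete GN update with H_w = I and constant step-size lam.
    # jacobian_fn(w)  -> J_w, shape (N, P): Jacobian of the network outputs
    #                    with respect to the parameters (hypothetical helper).
    # loss_grad_fn(w) -> grad L(f_w), shape (N,): gradient of the loss with
    #                    respect to the network outputs (hypothetical helper).
    # damping_fn(w)   -> eps(w): the adaptive damping of eq. (5) with alpha = 1
    #                    (not reproduced here; treated as a black box).
    J = jacobian_fn(w)                          # (N, P), with P >> N
    g = loss_grad_fn(w)                         # (N,)
    eps = max(damping_fn(w), EPS0)              # never let damping fall below 1e-7

    # Woodbury's identity: Phi(w) = J^T z, where z solves the N x N system
    # (J J^T + eps I) z = grad L(f_w), instead of a P x P system in parameter space.
    A = J @ J.T + eps * torch.eye(J.shape[0], dtype=J.dtype, device=J.device)
    z = torch.linalg.solve(A, g)
    return w - lam * (J.T @ z)

A training loop would then repeat this step with a constant λ, stopping once the training error drops below 10^-7 or after K_GN = 10^5 iterations, mirroring the stopping criterion quoted above; the (GD) baseline follows the same loop with the plain gradient step w - λ ∇_w L(f_w) and a budget of K_GD = 10^6 iterations.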