Grokking in Linear Estimators -- A Solvable Model that Groks without Understanding

Authors: Noam Itzhak Levi, Alon Beck, Yohai Bar-Sinai

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show both analytically and numerically that grokking can surprisingly occur in linear networks performing linear tasks in a simple teacher-student setup with Gaussian inputs. ... We provide empirical verification for our calculations, along with preliminary results indicating that some predictions also hold for deeper networks, with non-linear activations.
Researcher Affiliation | Academia | Raymond and Beverly Sackler School of Physics and Astronomy, Tel-Aviv University, Tel-Aviv 69978, Israel; École Polytechnique Fédérale de Lausanne (EPFL), Switzerland
Pseudocode | No | The paper describes methods mathematically and in prose, but does not include structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide an explicit statement about releasing code for its methodology, nor a link to a code repository.
Open Datasets | No | The paper states, 'We draw N_tr training samples from a standard Gaussian distribution x_i ~ N(0, I_{d_in×d_in})', indicating synthetic data generation rather than the use or provision of a publicly available dataset with access details.
Dataset Splits | No | The paper mentions 'N_tr training samples' and a generalization loss computed over 'an independent set', but does not specify exact train/validation/test split percentages, absolute sample counts, or refer to predefined splits with citations for reproducibility.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory amounts) used for running its experiments.
Software Dependencies | No | The paper mentions using 'PyTorch' but does not specify a version number or other key software components with their versions.
Experiment Setup | Yes | Here we use GD with η = η_0 = 0.01, d_in = 10^3, d_out = 1, ϵ = 10^-3. ... We train with full-batch gradient descent, in all instances. We depart from the default weight initialization of PyTorch, using w ~ N(0, 1/(2 d_{l-1} d_l)) for each layer... (A hedged reproduction sketch based on these settings follows the table.)
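The quoted 'Open Datasets' and 'Experiment Setup' excerpts together describe most of a minimal teacher-student reproduction. The sketch below assembles them into runnable PyTorch code; it is a sketch under assumptions, not the authors' code. The learning rate η = 0.01, d_in = 10^3, d_out = 1, full-batch gradient descent, standard-Gaussian inputs, and the w ~ N(0, 1/(2 d_{l-1} d_l)) initialization are taken from the quotes; the sample counts, teacher normalization, MSE loss, step budget, and the use of ϵ as a per-sample correctness tolerance are assumptions made only for illustration.

```python
# Hedged reproduction sketch of the quoted setup. Hyperparameters marked
# "assumed" are not specified in the excerpts above.
import torch

torch.manual_seed(0)

d_in, d_out = 1000, 1            # from the quoted setup
n_train, n_test = 500, 10_000    # assumed sample counts
eta, eps = 0.01, 1e-3            # eta from the quote; eps interpreted as a tolerance (assumption)

# Teacher: a fixed random linear map; inputs are standard Gaussian (as quoted).
teacher = torch.randn(d_in, d_out) / d_in**0.5   # assumed normalization
x_train = torch.randn(n_train, d_in)
x_test = torch.randn(n_test, d_in)
y_train, y_test = x_train @ teacher, x_test @ teacher

# Student: a single linear layer, re-initialized as w ~ N(0, 1/(2 d_{l-1} d_l))
# instead of PyTorch's default.
student = torch.nn.Linear(d_in, d_out, bias=False)
with torch.no_grad():
    student.weight.normal_(0.0, (1.0 / (2 * d_in * d_out)) ** 0.5)

opt = torch.optim.SGD(student.parameters(), lr=eta)  # full batch => plain GD
loss_fn = torch.nn.MSELoss()

for step in range(100_000):                  # assumed step budget
    opt.zero_grad()
    train_loss = loss_fn(student(x_train), y_train)
    train_loss.backward()
    opt.step()

    if step % 10_000 == 0:
        with torch.no_grad():
            preds = student(x_test)
            test_loss = loss_fn(preds, y_test)
            # "Accuracy" here thresholds the per-sample squared error at eps;
            # this reading of eps is an assumption of the sketch.
            test_acc = ((preds - y_test) ** 2 < eps).float().mean()
        print(f"{step:6d}  train {train_loss.item():.3e}  "
              f"test {test_loss.item():.3e}  acc {test_acc.item():.3f}")
```

The thresholded test accuracy is logged alongside the losses because grokking is typically diagnosed as a delayed jump in such a metric; whether this particular tolerance reproduces the paper's curves would need to be checked against the original figures.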