Grokking in Linear Estimators -- A Solvable Model that Groks without Understanding

Authors: Noam Itzhak Levi, Alon Beck, Yohai Bar-Sinai

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show both analytically and numerically that grokking can surprisingly occur in linear networks performing linear tasks in a simple teacher-student setup with Gaussian inputs. ... We provide empirical verification for our calculations, along with preliminary results indicating that some predictions also hold for deeper networks, with non-linear activations.
Researcher Affiliation | Academia | Raymond and Beverly Sackler School of Physics and Astronomy, Tel-Aviv University, Tel-Aviv 69978, Israel; École Polytechnique Fédérale de Lausanne (EPFL), Switzerland
Pseudocode | No | The paper describes methods mathematically and in prose, but does not include structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide an explicit statement about releasing code for its methodology, nor a link to a code repository.
Open Datasets | No | The paper states, 'We draw N_tr training samples from a standard Gaussian distribution x_i ~ N(0, I_{d_in×d_in})', indicating synthetic data generation rather than the use or provision of a publicly available dataset with access details.
Dataset Splits | No | The paper mentions 'N_tr training samples' and a generalization loss computed over 'an independent set', but does not specify exact train/validation/test split percentages, absolute sample counts, or refer to predefined splits with citations for reproducibility.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory amounts) used for running its experiments.
Software Dependencies | No | The paper mentions using 'PyTorch' but does not specify a version number or other key software components with their versions.
Experiment Setup | Yes | Here we use GD with η = η_0 = 0.01, d_in = 10^3, d_out = 1, ϵ = 10^-3. ... We train with full-batch gradient descent, in all instances. We depart from the default weight initialization of PyTorch, using w ~ N(0, 1/(2 d_{l-1} d_l)) for each layer... (A hedged reproduction sketch based on these settings follows the table.)
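The quoted 'Open Datasets' and 'Experiment Setup' excerpts together describe most of a minimal teacher-student reproduction. The sketch below assembles them into runnable PyTorch code; it is a sketch under assumptions, not the authors' code. The learning rate η = 0.01, d_in = 10^3, d_out = 1, full-batch gradient descent, standard-Gaussian inputs, and the w ~ N(0, 1/(2 d_{l-1} d_l)) initialization are taken from the quotes; the sample counts, teacher normalization, MSE loss, step budget, and the use of ϵ as a per-sample correctness tolerance are assumptions made only for illustration.

```python
# Hedged reproduction sketch of the quoted setup. Hyperparameters marked
# "assumed" are not specified in the excerpts above.
import torch

torch.manual_seed(0)

d_in, d_out = 1000, 1            # from the quoted setup
n_train, n_test = 500, 10_000    # assumed sample counts
eta, eps = 0.01, 1e-3            # eta from the quote; eps interpreted as a tolerance (assumption)

# Teacher: a fixed random linear map; inputs are standard Gaussian (as quoted).
teacher = torch.randn(d_in, d_out) / d_in**0.5   # assumed normalization
x_train = torch.randn(n_train, d_in)
x_test = torch.randn(n_test, d_in)
y_train, y_test = x_train @ teacher, x_test @ teacher

# Student: a single linear layer, re-initialized as w ~ N(0, 1/(2 d_{l-1} d_l))
# instead of PyTorch's default.
student = torch.nn.Linear(d_in, d_out, bias=False)
with torch.no_grad():
    student.weight.normal_(0.0, (1.0 / (2 * d_in * d_out)) ** 0.5)

opt = torch.optim.SGD(student.parameters(), lr=eta)  # full batch => plain GD
loss_fn = torch.nn.MSELoss()

for step in range(100_000):                  # assumed step budget
    opt.zero_grad()
    train_loss = loss_fn(student(x_train), y_train)
    train_loss.backward()
    opt.step()

    if step % 10_000 == 0:
        with torch.no_grad():
            preds = student(x_test)
            test_loss = loss_fn(preds, y_test)
            # "Accuracy" here thresholds the per-sample squared error at eps;
            # this reading of eps is an assumption of the sketch.
            test_acc = ((preds - y_test) ** 2 < eps).float().mean()
        print(f"{step:6d}  train {train_loss.item():.3e}  "
              f"test {test_loss.item():.3e}  acc {test_acc.item():.3f}")
```

The thresholded test accuracy is logged alongside the losses because grokking is typically diagnosed as a delayed jump in such a metric; whether this particular tolerance reproduces the paper's curves would need to be checked against the original figures.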