Grokking in Linear Estimators -- A Solvable Model that Groks without Understanding
Authors: Noam Itzhak Levi, Alon Beck, Yohai Bar-Sinai
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show both analytically and numerically that grokking can surprisingly occur in linear networks performing linear tasks in a simple teacher-student setup with Gaussian inputs. ... We provide empirical verification for our calculations, along with preliminary results indicating that some predictions also hold for deeper networks, with non-linear activations. |
| Researcher Affiliation | Academia | Raymond and Beverly Sackler School of Physics and Astronomy, Tel-Aviv University, Tel-Aviv 69978, Israel; École Polytechnique Fédérale de Lausanne (EPFL), Switzerland |
| Pseudocode | No | The paper describes methods mathematically and in prose, but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide an explicit statement about releasing code for its methodology or a link to a code repository. |
| Open Datasets | No | The paper states, 'We draw N_tr training samples from a standard Gaussian distribution x_i ~ N(0, I_{d_in × d_in})', indicating synthetic data generation rather than the use or provision of a publicly available dataset with access details. |
| Dataset Splits | No | The paper mentions 'Ntr training samples' and generalization loss computed over 'an independent set' but does not specify exact train/validation/test dataset split percentages, absolute sample counts, or refer to predefined splits with citations for reproducibility. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory amounts) used for running its experiments. |
| Software Dependencies | No | The paper mentions using 'PyTorch' but does not specify a version number or other key software components with their versions. |
| Experiment Setup | Yes | Here we use GD with η = η_0 = 0.01, d_in = 10^3, d_out = 1, ϵ = 10^-3. ... We train with full batch gradient descent, in all instances. We depart from the default weight initialization of PyTorch, using w ~ N(0, 1/(2 d_{l-1} d_l)) for each layer... |
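
The quoted setup row contains enough detail to assemble a minimal reproduction sketch. The PyTorch snippet below is an illustrative assumption, not the authors' released code: it uses the quoted values η = 0.01, d_in = 10^3, d_out = 1, full-batch GD, and the custom initialization w ~ N(0, 1/(2 d_{l-1} d_l)); the sample counts, step count, teacher normalization, and the reading of ϵ = 10^-3 as a label-noise scale are assumptions made for illustration.

```python
# Minimal sketch (under stated assumptions) of the quoted teacher-student
# setup: a linear student trained with full-batch GD on a linear task
# with standard-Gaussian inputs.
import math
import torch

torch.manual_seed(0)
d_in, d_out = 10**3, 1
N_tr, N_test = 500, 5000              # sample counts: assumptions
eta, eps, steps = 0.01, 1e-3, 10**5   # eta, eps from the quoted setup

# Gaussian inputs, x_i ~ N(0, I_{d_in x d_in}), as quoted.
# Teacher normalization by 1/sqrt(d_in) is an assumption.
teacher = torch.randn(d_in, d_out) / math.sqrt(d_in)
X_tr, X_te = torch.randn(N_tr, d_in), torch.randn(N_test, d_in)
y_tr = X_tr @ teacher + eps * torch.randn(N_tr, d_out)  # eps as label noise: assumption
y_te = X_te @ teacher

# Student: single linear layer with the quoted non-default init,
# w ~ N(0, 1/(2 d_{l-1} d_l)); here d_{l-1} = d_in, d_l = d_out.
model = torch.nn.Linear(d_in, d_out, bias=False)
with torch.no_grad():
    model.weight.normal_(0.0, math.sqrt(1.0 / (2 * d_in * d_out)))

# SGD over the full training batch with no momentum is plain gradient descent.
opt = torch.optim.SGD(model.parameters(), lr=eta)
loss_fn = torch.nn.MSELoss()
for step in range(steps):
    opt.zero_grad()
    loss = loss_fn(model(X_tr), y_tr)
    loss.backward()
    opt.step()
    if step % 10**4 == 0:
        # Generalization loss on an independent set, as the paper describes.
        with torch.no_grad():
            gen = loss_fn(model(X_te), y_te)
        print(f"step {step}: train {loss.item():.3e}, test {gen.item():.3e}")
```

Stepping the optimizer on the entire training set at once matches the quoted 'full batch gradient descent'; whether the train/test loss curves separate in the delayed-generalization pattern the paper analyzes will depend on the assumed N_tr and noise level.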