Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

On the Emergence of Linear Analogies in Word Embeddings

Authors: Daniel Korchinski, Dhruva Karkada, Yasaman Bahri, Matthieu Wyart

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We introduce a theoretical generative model in which words are defined by binary semantic attributes, and co-occurrence probabilities are derived from attribute-based interactions. This model analytically reproduces the emergence of linear analogy structure and naturally accounts for properties (i)-(iv). It can be viewed as giving fine-grained resolution into the role of each additional embedding dimension. It is robust to various forms of noise and agrees well with co-occurrence statistics measured on Wikipedia and the analogy benchmark introduced by Mikolov et al.
Researcher Affiliation | Collaboration | Daniel J. Korchinski, Department of Physics, Ecole Polytechnique Fédérale de Lausanne (EPFL), Lausanne, VD, Switzerland, EMAIL; Dhruva Karkada, Department of Physics, UC Berkeley, Berkeley, CA, USA, EMAIL; Yasaman Bahri, Google DeepMind, Mountain View, CA, USA, EMAIL; Matthieu Wyart, Johns Hopkins & EPFL, Baltimore, MD, USA & Lausanne, VD, Switzerland, EMAIL
Pseudocode | No | The paper describes mathematical models and derivations but does not contain any structured pseudocode or algorithm blocks. The methodologies are presented in prose and mathematical notation.
Open Source Code | Yes | The code used to produce the model results, Wikipedia co-occurrence statistics, and figures is available on GitHub at https://github.com/DJKorchinski/linear-analogies-word-embedding-reproduction and in the supplementary files.
Open Datasets | Yes | It is robust to various forms of noise and agrees well with co-occurrence statistics measured on Wikipedia and the analogy benchmark introduced by Mikolov et al.
Dataset Splits | No | The paper mentions using "Wikipedia text co-occurrence matrices" and the "Mikolov et al. analogy task set" for numerical validation but does not describe how these datasets were split into training, validation, or test sets.
Hardware Specification | Yes | All simulations together run in under 150 minutes on an Nvidia H100.
Software Dependencies | No | The paper does not list the software libraries or version numbers required to replicate the experiments. It mentions only a "Conda environment" in the NeurIPS Paper Checklist justification, which is not part of the main paper and is not specific enough to count as a dependency list.
Experiment Setup | Yes | In Figure 2c we show the emergence of linear analogical reasoning for a single realization of this model for d = 8, for matrix target M_ij. ... In Figure 2f, we report results on a sparsified variation of the model in d = 12 for retention probability f = 0.15 (see Appendix for a representative sparsified co-occurrence matrix).
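
The Research Type entry above describes words defined by binary semantic attributes, with linear analogy structure emerging from attribute-based interactions. The following is a minimal toy sketch of that idea, not the authors' model or code: the additive interaction M_ij = sum_k s_k a_ik a_jk, the attribute strengths, the toy size d = 4, and the helper name analogy are all assumptions made purely for illustration.

```python
import itertools

import numpy as np

d = 4  # number of binary attributes (toy size, chosen for illustration)

# Each "word" is one of the 2^d sign patterns over d attributes.
attrs = np.array(list(itertools.product([-1, 1], repeat=d)), dtype=float)

# ASSUMPTION: a simple additive interaction, M_ij = sum_k s_k * a_ik * a_jk.
# The paper's actual co-occurrence model is not reproduced here.
strengths = np.array([2.0, 1.5, 1.0, 0.5])  # hypothetical attribute strengths
M = (attrs * strengths) @ attrs.T

# Embed words with the top-d eigenvectors, scaled so that W @ W.T matches M.
eigvals, eigvecs = np.linalg.eigh(M)
top = np.argsort(eigvals)[::-1][:d]
W = eigvecs[:, top] * np.sqrt(eigvals[top])


def analogy(a: int, b: int, c: int) -> int:
    """Index of the word nearest to W[b] - W[a] + W[c], excluding a, b, c."""
    target = W[b] - W[a] + W[c]
    dists = np.linalg.norm(W - target, axis=1)
    dists[[a, b, c]] = np.inf  # never return one of the query words
    return int(np.argmin(dists))
```

In this toy setting, word 0 has every attribute set to -1, word 8 differs from it only in the first attribute, and word 4 only in the second, so analogy(0, 8, 4) returns 12, the word with both attributes flipped. The parallelogram is exact here only because the assumed target matrix has rank d; it says nothing about how the paper's noisy or sparsified variants behave.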