Language Models Represent Space and Time

Authors: Wes Gurnee, Max Tegmark

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We find evidence for the latter by analyzing the learned representations of three spatial datasets (world, US, NYC places) and three temporal datasets (historical figures, artworks, news headlines) in the Llama-2 family of models.
Researcher Affiliation | Academia | Wes Gurnee & Max Tegmark, Massachusetts Institute of Technology, {wesg, tegmark}@mit.edu
Pseudocode | No | No structured pseudocode or algorithm blocks were found in the paper.
Open Source Code | Yes | All datasets and code are available at https://github.com/wesg52/world-models.
Open Datasets | Yes | Specifically, we construct six datasets containing the names of places or events... Our world dataset is built from raw data queried from DBpedia (Lehmann et al., 2015)... Our United States dataset is constructed from DBpedia and a census data aggregator... Our New York City dataset is adapted from the NYC Open Data points of interest dataset (NYC Open Data, 2023)... Our three temporal datasets consist of (1) the names and occupations of historical figures who died between 1000 BC and 2000 AD, adapted from (Annamoradnejad & Annamoradnejad, 2022); (2) the titles and creators of songs, movies, and books from 1950 to 2020, constructed from DBpedia... and (3) New York Times news headlines from 2010-2020 from news desks that write about current events, adapted from (Bandy, 2021). All datasets and code are available at https://github.com/wesg52/world-models.
Dataset Splits | Yes | In all experiments, we tune λ using efficient leave-one-out cross validation (Hastie et al., 2009) on the probe training set. (A minimal sketch of this λ-tuning step follows the table.)
Hardware Specification | No | The paper mentions using the Llama-2 (Touvron et al., 2023) and Pythia (Biderman et al., 2023) families of models, but does not specify the underlying hardware (e.g., GPU models, CPU types, memory) used to run the models or experiments.
Software Dependencies | No | The paper does not provide version numbers for software dependencies such as programming languages, libraries, or frameworks (e.g., Python, PyTorch, TensorFlow, CUDA).
Experiment Setup | Yes | For each dataset, we run every entity name through the model, potentially prepended with a short prompt, and save the activations of the hidden state (residual stream) on the last entity token for each layer. For a set of n entities, this yields an n × d_model activation dataset for each layer. In all experiments, we tune λ using efficient leave-one-out cross validation (Hastie et al., 2009) on the probe training set. (An illustrative sketch of the activation-extraction step follows the table.)
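
The Experiment Setup row describes running each entity name through the model and caching the residual-stream activation at the last entity token for every layer. Below is a minimal sketch of that step, assuming the Hugging Face transformers API and a Llama-2 checkpoint; the model name, prompt handling, and helper function are illustrative assumptions, not code from the authors' repository.

```python
# Sketch of last-token activation extraction (assumed API: Hugging Face
# transformers). The model name and helper below are illustrative and
# are not taken from the authors' codebase.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-hf"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, output_hidden_states=True, torch_dtype=torch.float16
)
model.eval()

@torch.no_grad()
def last_token_activations(entity_names, prompt=""):
    """Return a (n_layers + 1, n_entities, d_model) tensor of
    residual-stream activations at the last entity token."""
    per_entity = []
    for name in entity_names:
        text = prompt + name  # optionally prepend a short prompt
        inputs = tokenizer(text, return_tensors="pt")
        outputs = model(**inputs)
        # outputs.hidden_states: one (1, seq_len, d_model) tensor per
        # layer (plus the embedding layer); keep the final token's vector.
        acts = torch.stack([h[0, -1, :] for h in outputs.hidden_states])
        per_entity.append(acts)
    return torch.stack(per_entity, dim=1)

acts = last_token_activations(["Paris", "Nairobi", "Wellington"])
print(acts.shape)  # (n_layers + 1, 3, d_model)
```

Stacking the per-layer activations this way yields, for each layer, the n × d_model matrix that the probes are trained on.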
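
The Dataset Splits row notes that the ridge penalty λ is tuned with efficient leave-one-out cross-validation on the probe training set. The sketch below shows one way to do this with scikit-learn's RidgeCV, which evaluates the leave-one-out error in closed form; the synthetic data is a placeholder for the activation matrix and coordinate targets, not the authors' setup.

```python
# Sketch of ridge-probe tuning with efficient leave-one-out CV
# (scikit-learn's RidgeCV with cv=None uses the closed-form LOO error).
# The random data below is a stand-in for real activations and targets.
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4096))  # stand-in for (n_entities, d_model) activations
y = rng.normal(size=(1000, 2))     # stand-in targets, e.g. (latitude, longitude)

# Hold out a test split; lambda is selected on the training set only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

probe = RidgeCV(alphas=np.logspace(-2, 6, 17))  # candidate lambda values
probe.fit(X_train, y_train)

print("selected lambda:", probe.alpha_)
print("held-out R^2:", probe.score(X_test, y_test))
```

With the default cv=None, RidgeCV scores every candidate λ via the analytic leave-one-out formula rather than refitting n times, which is the kind of efficient LOO procedure the Hastie et al. (2009) citation refers to.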