Language Models Represent Space and Time
Authors: Wes Gurnee, Max Tegmark
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We find evidence for the latter by analyzing the learned representations of three spatial datasets (world, US, NYC places) and three temporal datasets (historical figures, artworks, news headlines) in the Llama-2 family of models. |
| Researcher Affiliation | Academia | Wes Gurnee & Max Tegmark Massachusetts Institute of Technology {wesg, tegmark}@mit.edu |
| Pseudocode | No | No structured pseudocode or algorithm blocks were found in the paper. |
| Open Source Code | Yes | All datasets and code are available at https://github.com/wesg52/world-models. |
| Open Datasets | Yes | Specifically, we construct six datasets containing the names of places or events... Our world dataset is built from raw data queried from DBpedia Lehmann et al. (2015)... Our United States dataset is constructed from DBPedia and a census data aggregator... Our New York City dataset is adapted from the NYC Open Data points of interest dataset (NYC Open Data, 2023)... Our three temporal datasets consist of (1) the names and occupations of historical figures who died between 1000BC and 2000AD adapted from (Annamoradnejad & Annamoradnejad, 2022); (2) the titles and creators of songs, movies, and books from 1950 to 2020 constructed from DBpedia... and (3) New York Times news headlines from 2010-2020 from news desks that write about current events, adapted from (Bandy, 2021). All datasets and code are available at https://github.com/wesg52/world-models. |
| Dataset Splits | Yes | In all experiments, we tune λ using efficient leave-one-out cross validation (Hastie et al., 2009) on the probe training set. |
| Hardware Specification | No | The paper mentions using 'Llama-2 (Touvron et al., 2023) and Pythia Biderman et al. (2023) family of models', but does not specify the underlying hardware (e.g., GPU models, CPU types, memory) used for running these models or experiments. |
| Software Dependencies | No | The paper does not provide specific version numbers for any software dependencies like programming languages, libraries, or frameworks (e.g., Python, PyTorch, TensorFlow, CUDA). |
| Experiment Setup | Yes | For each dataset, we run every entity name through the model, potentially prepended with a short prompt, and save the activations of the hidden state (residual stream) on the last entity token for each layer. For a set of n entities, this yields an n × d_model activation dataset for each layer. In all experiments, we tune λ using efficient leave-one-out cross validation (Hastie et al., 2009) on the probe training set. |
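
The experiment setup row describes a linear ridge probe fit on an n × d_model activation matrix, with the regularization strength λ tuned by efficient leave-one-out cross validation. The sketch below illustrates that setup under stated assumptions: the activations and targets are synthetic stand-ins (the real inputs would be Llama-2 residual-stream activations and, e.g., place coordinates or event years), and `RidgeCV` with its default `cv=None` is used because it implements the efficient leave-one-out scheme from Hastie et al. (2009).

```python
# Hypothetical sketch of the probing setup, not the authors' code.
# Synthetic data stands in for real model activations and spatial/temporal targets.
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, d_model = 500, 64                 # toy sizes: n entities, hidden-state width
X = rng.normal(size=(n, d_model))    # stand-in for last-token layer activations
w_true = rng.normal(size=d_model)
y = X @ w_true + 0.1 * rng.normal(size=n)   # stand-in for e.g. longitude or year

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# RidgeCV with cv=None tunes lambda (alpha) by efficient leave-one-out CV
# on the training set, matching the procedure quoted above.
probe = RidgeCV(alphas=np.logspace(-3, 3, 13))
probe.fit(X_train, y_train)
print(f"chosen lambda: {probe.alpha_:.3g}")
print(f"held-out R^2:  {probe.score(X_test, y_test):.3f}")
```

With a strong linear signal and small noise, the held-out R² is close to 1; on real activations the probe's R² across layers is what indicates whether spatial or temporal structure is linearly decodable.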