Evaluating the World Model Implicit in a Generative Model

Authors: Keyon Vafa, Justin Chen, Ashesh Rambachan, Jon Kleinberg, Sendhil Mullainathan

NeurIPS 2024

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "We illustrate their utility in three domains: game playing, logic puzzles, and navigation. In all domains, the generative models we consider do well on existing diagnostics for assessing world models, but our evaluation metrics reveal their world models to be far less coherent than they appear." |
| Researcher Affiliation | Academia | Keyon Vafa (Harvard University), Justin Y. Chen (MIT), Ashesh Rambachan (MIT), Jon Kleinberg (Cornell University), Sendhil Mullainathan (MIT) |
| Pseudocode | Yes | "Algorithm 1: Graph Reconstruction from Sequences" (see the sketch after this table) |
| Open Source Code | Yes | "We release our benchmark dataset of taxi rides in New York City along with software implementing our evaluation metrics." https://github.com/keyonvafa/world-model-evaluation |
| Open Datasets | Yes | "We base our analysis on a dataset of taxi rides released by the NYC Taxi & Limousine Commission, containing the latitude and longitude of each ride's pickup and dropoff location in Manhattan." |
| Dataset Splits | Yes | "We randomly split data into train and test splits, ensuring no origin-destination pair is in both train and test sets. ... our validation set consists of 1,000 sequences and 54,539 tokens" (see the split sketch after this table) |
| Hardware Specification | Yes | "We train models on 8 A100 GPUs." |
| Software Dependencies | No | The paper mentions software such as the OSMnx library and a Python implementation, but it does not give version numbers for these dependencies, which would be needed for full reproducibility. |
| Experiment Setup | Yes | "We train two types of transformers [38] from scratch using next-token prediction for each dataset: an 89.3M parameter model consisting of 12 layers, 768 hidden dimensions, and 12 heads; and a 1.5B parameter model consisting of 48 layers, 1600 hidden dimensions, and 25 heads. We follow the architecture of GPT-2 for each model [29]. ... Both metrics depend on a threshold parameter ε: a prefix is only sampled or accepted if the model's assigned probability for each token is above ε. Here, we consider ε = 0.01 for all models and metrics." (see the configuration sketch after this table) |
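The paper's "Graph Reconstruction from Sequences" procedure is only named in the table above. As a minimal Python sketch of the general idea, assuming each sequence alternates observed states and emitted actions (a hypothetical schema for illustration, not the paper's exact interface):

```python
from collections import defaultdict

def reconstruct_graph(sequences):
    """Rebuild a directed, edge-labeled graph from traversal sequences.

    Assumes each sequence alternates state, action, state, ...; one
    labeled edge is added per observed (state, action, next_state).
    """
    graph = defaultdict(dict)  # graph[state][action] -> next_state
    for seq in sequences:
        states, actions = seq[0::2], seq[1::2]
        for s, a, t in zip(states, actions, states[1:]):
            graph[s][a] = t  # assumes deterministic transitions
    return dict(graph)

walks = [["A", "right", "B", "left", "C"]]
print(reconstruct_graph(walks))  # {'A': {'right': 'B'}, 'B': {'left': 'C'}}
```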
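The split criterion quoted above (no origin-destination pair shared between train and test) can be implemented by splitting at the level of pairs rather than individual rides. A minimal sketch, assuming each ride is a dict with hypothetical "origin" and "destination" keys; the test fraction and seed are placeholders, since the paper only states the disjointness requirement:

```python
import random

def split_by_od_pair(rides, test_frac=0.2, seed=0):
    """Split rides so no (origin, destination) pair spans both sets."""
    # Collect the unique origin-destination pairs and shuffle them.
    pairs = sorted({(r["origin"], r["destination"]) for r in rides})
    rng = random.Random(seed)
    rng.shuffle(pairs)
    # Hold out a fraction of the pairs, then assign rides by pair.
    test_pairs = set(pairs[: int(len(pairs) * test_frac)])
    train = [r for r in rides if (r["origin"], r["destination"]) not in test_pairs]
    test = [r for r in rides if (r["origin"], r["destination"]) in test_pairs]
    return train, test
```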
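The two model configurations and the ε threshold from the quoted setup can be made concrete with the Hugging Face transformers library. This is a sketch under stated assumptions: the vocabulary size and context length below are library defaults (the paper trains from scratch on its own task vocabularies, which is why its parameter counts differ from a default-vocabulary GPT-2), and prefix_accepted is a hypothetical helper, not the released implementation.

```python
import torch
from transformers import GPT2Config, GPT2LMHeadModel

# Depth/width/head counts from the quoted setup; vocab size and
# context length are library defaults, not the paper's values.
small_cfg = GPT2Config(n_layer=12, n_embd=768, n_head=12)   # "89.3M" model
large_cfg = GPT2Config(n_layer=48, n_embd=1600, n_head=25)  # "1.5B" model
model = GPT2LMHeadModel(small_cfg)

def prefix_accepted(model, token_ids, eps=0.01):
    """Hypothetical helper: True iff the model assigns probability
    above eps to every realized token in the prefix (the paper's
    acceptance rule, with eps = 0.01)."""
    ids = torch.tensor([token_ids])
    with torch.no_grad():
        logits = model(ids).logits[0]            # (seq_len, vocab)
    probs = torch.softmax(logits[:-1], dim=-1)   # position i predicts token i+1
    targets = ids[0, 1:].unsqueeze(1)
    token_probs = probs.gather(1, targets).squeeze(1)
    return bool((token_probs > eps).all())
```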