Evaluating the World Model Implicit in a Generative Model

Authors: Keyon Vafa, Justin Chen, Ashesh Rambachan, Jon Kleinberg, Sendhil Mullainathan

NeurIPS 2024

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "We illustrate their utility in three domains: game playing, logic puzzles, and navigation. In all domains, the generative models we consider do well on existing diagnostics for assessing world models, but our evaluation metrics reveal their world models to be far less coherent than they appear." |
| Researcher Affiliation | Academia | Keyon Vafa (Harvard University), Justin Y. Chen (MIT), Ashesh Rambachan (MIT), Jon Kleinberg (Cornell University), Sendhil Mullainathan (MIT) |
| Pseudocode | Yes | "Algorithm 1: Graph Reconstruction from Sequences" (see the sketch after this table) |
| Open Source Code | Yes | "We release our benchmark dataset of taxi rides in New York City along with software implementing our evaluation metrics." https://github.com/keyonvafa/world-model-evaluation |
| Open Datasets | Yes | "We base our analysis on a dataset of taxi rides released by the NYC Taxi & Limousine Commission, containing the latitude and longitude of each ride's pickup and dropoff location in Manhattan." |
| Dataset Splits | Yes | "We randomly split data into train and test splits, ensuring no origin-destination pair is in both train and test sets. ... our validation set consists of 1,000 sequences and 54,539 tokens" (see the split sketch after this table) |
| Hardware Specification | Yes | "We train models on 8 A100 GPUs." |
| Software Dependencies | No | The paper mentions software such as the OSMnx library and a Python implementation, but it does not give version numbers for these dependencies, which would be needed for full reproducibility. |
| Experiment Setup | Yes | "We train two types of transformers [38] from scratch using next-token prediction for each dataset: an 89.3M parameter model consisting of 12 layers, 768 hidden dimensions, and 12 heads; and a 1.5B parameter model consisting of 48 layers, 1600 hidden dimensions, and 25 heads. We follow the architecture of GPT-2 for each model [29]. ... Both metrics depend on a threshold parameter ε: a prefix is only sampled or accepted if the model's assigned probability for each token is above ε. Here, we consider ε = 0.01 for all models and metrics." (see the configuration sketch after this table) |
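The paper's "Graph Reconstruction from Sequences" procedure is only named in the table above. As a minimal Python sketch of the general idea, assuming each sequence alternates observed states and emitted actions (a hypothetical schema for illustration, not the paper's exact interface):

```python
from collections import defaultdict

def reconstruct_graph(sequences):
    """Rebuild a directed, edge-labeled graph from traversal sequences.

    Assumes each sequence alternates state, action, state, ...; one
    labeled edge is added per observed (state, action, next_state).
    """
    graph = defaultdict(dict)  # graph[state][action] -> next_state
    for seq in sequences:
        states, actions = seq[0::2], seq[1::2]
        for s, a, t in zip(states, actions, states[1:]):
            graph[s][a] = t  # assumes deterministic transitions
    return dict(graph)

walks = [["A", "right", "B", "left", "C"]]
print(reconstruct_graph(walks))  # {'A': {'right': 'B'}, 'B': {'left': 'C'}}
```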
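The split criterion quoted above (no origin-destination pair shared between train and test) can be implemented by splitting at the level of pairs rather than individual rides. A minimal sketch, assuming each ride is a dict with hypothetical "origin" and "destination" keys; the test fraction and seed are placeholders, since the paper only states the disjointness requirement:

```python
import random

def split_by_od_pair(rides, test_frac=0.2, seed=0):
    """Split rides so no (origin, destination) pair spans both sets."""
    # Collect the unique origin-destination pairs and shuffle them.
    pairs = sorted({(r["origin"], r["destination"]) for r in rides})
    rng = random.Random(seed)
    rng.shuffle(pairs)
    # Hold out a fraction of the pairs, then assign rides by pair.
    test_pairs = set(pairs[: int(len(pairs) * test_frac)])
    train = [r for r in rides if (r["origin"], r["destination"]) not in test_pairs]
    test = [r for r in rides if (r["origin"], r["destination"]) in test_pairs]
    return train, test
```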
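The two model configurations and the ε threshold from the quoted setup can be made concrete with the Hugging Face transformers library. This is a sketch under stated assumptions: the vocabulary size and context length below are library defaults (the paper trains from scratch on its own task vocabularies, which is why its parameter counts differ from a default-vocabulary GPT-2), and prefix_accepted is a hypothetical helper, not the released implementation.

```python
import torch
from transformers import GPT2Config, GPT2LMHeadModel

# Depth/width/head counts from the quoted setup; vocab size and
# context length are library defaults, not the paper's values.
small_cfg = GPT2Config(n_layer=12, n_embd=768, n_head=12)   # "89.3M" model
large_cfg = GPT2Config(n_layer=48, n_embd=1600, n_head=25)  # "1.5B" model
model = GPT2LMHeadModel(small_cfg)

def prefix_accepted(model, token_ids, eps=0.01):
    """Hypothetical helper: True iff the model assigns probability
    above eps to every realized token in the prefix (the paper's
    acceptance rule, with eps = 0.01)."""
    ids = torch.tensor([token_ids])
    with torch.no_grad():
        logits = model(ids).logits[0]            # (seq_len, vocab)
    probs = torch.softmax(logits[:-1], dim=-1)   # position i predicts token i+1
    targets = ids[0, 1:].unsqueeze(1)
    token_probs = probs.gather(1, targets).squeeze(1)
    return bool((token_probs > eps).all())
```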