Evaluating the World Model Implicit in a Generative Model
Authors: Keyon Vafa, Justin Chen, Ashesh Rambachan, Jon Kleinberg, Sendhil Mullainathan
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We illustrate their utility in three domains: game playing, logic puzzles, and navigation. In all domains, the generative models we consider do well on existing diagnostics for assessing world models, but our evaluation metrics reveal their world models to be far less coherent than they appear. |
| Researcher Affiliation | Academia | Keyon Vafa (Harvard University), Justin Y. Chen (MIT), Ashesh Rambachan (MIT), Jon Kleinberg (Cornell University), Sendhil Mullainathan (MIT) |
| Pseudocode | Yes | Algorithm 1: Graph Reconstruction from Sequences (a sketch appears below the table) |
| Open Source Code | Yes | We release our benchmark dataset of taxi rides in New York City along with software implementing our evaluation metrics (https://github.com/keyonvafa/world-model-evaluation). |
| Open Datasets | Yes | We base our analysis on a dataset of taxi rides released by the NYC Taxi & Limousine Commission, containing the latitude and longitude of each ride's pickup and dropoff location in Manhattan. |
| Dataset Splits | Yes | We randomly split data into train and test splits, ensuring no origin-destination pair is in both train and test sets. ... our validation set consists of 1,000 sequences and 54,539 tokens (a split sketch appears below the table) |
| Hardware Specification | Yes | We train models on 8 A100 GPUs. |
| Software Dependencies | No | The paper mentions software such as the OSMnx library and a Python implementation, but does not provide version numbers for these dependencies, which would be necessary for full reproducibility. |
| Experiment Setup | Yes | We train two types of transformers [38] from scratch using next-token prediction for each dataset: an 89.3M parameter model consisting of 12 layers, 768 hidden dimensions, and 12 heads; and a 1.5B parameter model consisting of 48 layers, 1600 hidden dimensions, and 25 heads. We follow the architecture of GPT-2 for each model [29]. ... Both metrics depend on a threshold parameter ε: a prefix is only sampled or accepted if the model's assigned probability for each token is above ε. Here, we consider ε = 0.01 for all models and metrics. (a config sketch appears below the table) |
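
The Pseudocode row cites the paper's Algorithm 1 (Graph Reconstruction from Sequences). The algorithm's details are not reproduced in the run output, so the following is only a minimal sketch under the assumption that each sequence is a path of states and every consecutive token pair contributes a directed edge; the paper's actual procedure may differ:

```python
from collections import defaultdict

def reconstruct_graph(sequences):
    """Build a directed graph from observed token sequences.

    Minimal sketch: each consecutive pair of tokens in a sequence
    is treated as a directed edge (an assumption, not the paper's
    exact Algorithm 1).
    """
    graph = defaultdict(set)  # node -> set of successor nodes
    for seq in sequences:
        for src, dst in zip(seq, seq[1:]):
            graph[src].add(dst)
    return dict(graph)

# Example: two short "rides" over intersection IDs
sequences = [["a", "b", "c"], ["a", "d", "c"]]
print(reconstruct_graph(sequences))
# e.g. {'a': {'b', 'd'}, 'b': {'c'}, 'd': {'c'}}
```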
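The Dataset Splits row describes a split that keeps every origin-destination pair on only one side of the train/test boundary. A minimal sketch of such a group-disjoint split, assuming a hypothetical record schema with `origin` and `destination` keys (not the paper's released format):

```python
import random

def split_by_od_pair(rides, test_frac=0.2, seed=0):
    """Split rides into train/test so that no (origin, destination)
    pair appears in both splits."""
    pairs = sorted({(r["origin"], r["destination"]) for r in rides})
    rng = random.Random(seed)
    rng.shuffle(pairs)
    n_test = int(len(pairs) * test_frac)
    test_pairs = set(pairs[:n_test])
    train = [r for r in rides
             if (r["origin"], r["destination"]) not in test_pairs]
    test = [r for r in rides
            if (r["origin"], r["destination"]) in test_pairs]
    return train, test
```

Splitting over origin-destination pairs rather than individual rides is what prevents the model from being evaluated on routes it memorized during training.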
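The Experiment Setup row fully specifies both architectures, which follow GPT-2. As a sanity check, the shapes can be reconstructed with Hugging Face `transformers`; the vocabulary size and all other hyperparameters below are assumptions (the defaults), so the parameter counts will only roughly track the reported 89.3M and 1.5B:

```python
from transformers import GPT2Config, GPT2LMHeadModel

# Reconstruction of the two reported model shapes; vocab size and
# remaining hyperparameters are assumptions, not from the paper.
small_cfg = GPT2Config(n_layer=12, n_embd=768, n_head=12)   # reported as 89.3M params
large_cfg = GPT2Config(n_layer=48, n_embd=1600, n_head=25)  # reported as 1.5B params (GPT-2 XL shape)

model = GPT2LMHeadModel(small_cfg)
print(sum(p.numel() for p in model.parameters()) / 1e6, "M parameters")
```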