Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Mind's Eye: Grounded Language Model Reasoning through Simulation
Authors: Ruibo Liu, Jason Wei, Shixiang Shane Gu, Te-Yen Wu, Soroush Vosoughi, Claire Cui, Denny Zhou, Andrew M. Dai
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on 39 tasks in a physics alignment benchmark demonstrate that Mind's Eye can improve reasoning ability by a large margin (27.9% zero-shot, and 46.0% few-shot absolute accuracy improvement on average). |
| Researcher Affiliation | Collaboration | 1Google Research, Brain Team, 2Dartmouth College |
| Pseudocode | No | The paper describes the components and their interactions in prose but does not include any formal pseudocode blocks or algorithms. |
| Open Source Code | No | For reproducibility, we run experiments mainly with publicly available LMs (e.g., GPT-3) and choose baseline methods that have open-sourced implementation. This statement refers to the use of existing open-source tools, not the release of the authors' own code. |
| Open Datasets | No | We propose a new multi-task physics alignment dataset, UTOPIA... The ground-truth answers to the questions are generated by the physics engine, which makes it easy to scale UTOPIA to larger sizes. The paper introduces a new dataset but does not provide specific access information (link, DOI, citation) to a publicly available version of the UTOPIA dataset used in their experiments. |
| Dataset Splits | No | For the convenience of benchmarking on huge LMs, we prepare 100 samples for each sub-task, resulting in a dataset with about 3,900 samples. We use this version of UTOPIA for evaluation across the paper. The paper uses a dataset for evaluation but does not specify a separate validation split or its size/percentages for model training/hyperparameter tuning. |
| Hardware Specification | Yes | The MuJoCo simulations can achieve 171 fps on one A6000 GPU... All experiments for PaLM are run on TPU-v4 Pods... Training of the JAX-based text-to-code LMs runs on TPU-v3 Pods. |
| Software Dependencies | No | The paper mentions "DeepMind's MuJoCo" as a physics engine and "JAX-based text-to-code LMs" but does not specify version numbers for these or other software libraries (e.g., Python, PyTorch, TensorFlow). |
| Experiment Setup | Yes | The learning rates we use for training 0.3B and 1.5B LMs on C4 are {3.0e-4, 1.8e-4}, which are switched to {1.8e-4, 0.5e-4} when fine-tuning on the text-code pairs. We use cosine annealing to control learning rate over time with fixed warm-up steps (3k). |
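
The learning-rate schedule quoted under Experiment Setup can be sketched as below. This is a minimal illustration, not the authors' code: the peak rates and the 3k warm-up steps come from the quoted text, while the linear warm-up shape, the decay to zero, the `total_steps` value, and the pairing of each rate with a model size by listing order are all assumptions. The function and dictionary names are hypothetical.

```python
import math

# Peak learning rates as quoted in the Experiment Setup row.
# Pairing by listing order (0.3B first, 1.5B second) is an assumption.
PRETRAIN_PEAK_LR = {"0.3B": 3.0e-4, "1.5B": 1.8e-4}   # training on C4
FINETUNE_PEAK_LR = {"0.3B": 1.8e-4, "1.5B": 0.5e-4}   # text-code pairs

WARMUP_STEPS = 3_000  # "fixed warm-up steps (3k)" from the quoted text


def lr_at_step(step, peak_lr, warmup_steps=WARMUP_STEPS, total_steps=100_000):
    """Return the learning rate at a given step.

    Assumed shape: linear warm-up from 0 to peak_lr, then cosine
    annealing from peak_lr down to 0 at total_steps. The summary does
    not state total_steps; 100_000 is a placeholder.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    # Fraction of the post-warm-up phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    peak = PRETRAIN_PEAK_LR["0.3B"]
    for step in (0, 1_500, 3_000, 51_500, 100_000):
        print(step, lr_at_step(step, peak))
```

Under these assumptions the rate rises to its peak at step 3,000 and falls to half the peak midway through the annealing phase. Reproducing the paper's runs would still require the actual total step count and warm-up shape, which the summary does not give.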