reproducibilityindex.ai

Mind's Eye: Grounded Language Model Reasoning through Simulation

Authors: Ruibo Liu, Jason Wei, Shixiang Shane Gu, Te-Yen Wu, Soroush Vosoughi, Claire Cui, Denny Zhou, Andrew M. Dai

ICLR 2023

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experiments on 39 tasks in a physics alignment benchmark demonstrate that Mind's Eye can improve reasoning ability by a large margin (27.9% zero-shot, and 46.0% few-shot absolute accuracy improvement on average).
Researcher Affiliation	Collaboration	1Google Research, Brain Team, 2Dartmouth College
Pseudocode	No	The paper describes the components and their interactions in prose but does not include any formal pseudocode blocks or algorithms.
Open Source Code	No	For reproducibility, we run experiments mainly with publicly available LMs (e.g., GPT-3) and choose baseline methods that have open-sourced implementation. This statement refers to the use of existing open-source tools, not the release of the authors' own code.
Open Datasets	No	We propose a new multi-task physics alignment dataset, UTOPIA... The ground-truth answers to the questions are generated by the physics engine, which makes it easy to scale UTOPIA to larger sizes. The paper introduces a new dataset but does not provide specific access information (link, DOI, citation) to a publicly available version of the UTOPIA dataset used in their experiments.
Dataset Splits	No	For the convenience of benchmarking on huge LMs, we prepare 100 samples for each sub-task, resulting in a dataset with about 3,900 samples. We use this version of UTOPIA for evaluation across the paper. The paper uses a dataset for evaluation but does not specify a separate validation split or its size/percentages for model training/hyperparameter tuning.
Hardware Specification	Yes	The MuJoCo simulations can achieve 171 fps on one A6000 GPU... All experiments for PaLM are run on TPU-v4 Pods... Training of the JAX-based text-to-code LMs runs on TPU-v3 Pods.
Software Dependencies	No	The paper mentions "DeepMind's MuJoCo" as a physics engine and "JAX-based text-to-code LMs" but does not specify version numbers for these or other software libraries (e.g., Python, PyTorch, TensorFlow).
Experiment Setup	Yes	The learning rates we use for training 0.3B and 1.5B LMs on C4 are {3.0e-4, 1.8e-4}, which are switched to {1.8e-4, 0.5e-4} when fine-tuning on the text-code pairs. We use cosine annealing to control learning rate over time with fixed warm-up steps (3k).
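
The learning-rate schedule quoted in the Experiment Setup row can be sketched as follows. This is a minimal illustration, not the authors' code: the quoted text gives only the peak rates, cosine annealing, and a 3k-step warm-up, so the linear warm-up shape, the total step count (`total_steps`), and the final rate (`min_lr`) are assumptions, as are the function and variable names. The pairing of rates to model sizes follows the order in which they are listed.

```python
import math

# Peak learning rates as quoted in the Experiment Setup row, paired with
# model sizes in the order listed (0.3B first, 1.5B second).
PEAK_LRS = {
    "pretrain_c4": {"0.3B": 3.0e-4, "1.5B": 1.8e-4},
    "finetune_text_code": {"0.3B": 1.8e-4, "1.5B": 0.5e-4},
}


def cosine_lr(step, peak_lr, warmup_steps=3_000, total_steps=100_000, min_lr=0.0):
    """Learning rate at `step` for warm-up followed by cosine annealing.

    warmup_steps=3_000 comes from the quoted setup. The linear warm-up shape,
    total_steps, and min_lr are assumptions made for illustration only.
    """
    if step < warmup_steps:
        # Assumed linear ramp from 0 up to the peak rate.
        return peak_lr * step / warmup_steps
    # Fraction of the annealing phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    # Cosine decay from peak_lr down to min_lr.
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    peak = PEAK_LRS["pretrain_c4"]["0.3B"]
    for step in (0, 1_500, 3_000, 50_000, 100_000):
        print(step, cosine_lr(step, peak))
```

Under these assumptions the rate rises to its peak at step 3,000 and then decays smoothly toward `min_lr`; switching to the fine-tuning phase would only change the `peak_lr` argument.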