Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Physical Reasoning and Object Planning for Household Embodied Agents

Authors: Ayush Agrawal, Raghav Prabhakar, Anirudh Goyal, Dianbo Liu

TMLR 2024 | Venue PDF | LLM Run Details

Research Type: Experimental
LLM Response: "Our evaluation of state-of-the-art language models on these datasets sheds light on three pivotal considerations... We evaluate and compare the performances of various large Language Models using the following metrics: 1. Accuracy: The fraction of questions answered correctly by the Language Model. 2. Bad Rate: The fraction of questions in which the chosen answer belonged to the 'Bad' configuration pool."

Researcher Affiliation: Collaboration
LLM Response: Ayush Agrawal (National University of Singapore); Raghav Prabhakar (IIIT-Hyderabad, India); Anirudh Goyal (DeepMind, London); Dianbo Liu (National University of Singapore)

Pseudocode: No
LLM Response: The paper describes methods and processes in paragraph form, but does not include any explicitly labeled pseudocode blocks or algorithms.

Open Source Code: Yes
LLM Response: "Our contributions include insightful human preference mappings for all three factors and four extensive QA datasets (2K, 15k, 60k, 70K questions) probing the intricacies of utility dependencies, contextual dependencies and object physical states. The datasets, along with our findings, are accessible at: https://github.com/com-phy-affordance/COAT."

Open Datasets: Yes
LLM Response: "Our contributions include insightful human preference mappings for all three factors and four extensive QA datasets (2K, 15k, 60k, 70K questions) probing the intricacies of utility dependencies, contextual dependencies and object physical states. The datasets, along with our findings, are accessible at: https://github.com/com-phy-affordance/COAT."

Dataset Splits: No
LLM Response: The paper describes various datasets created for evaluation, along with different "variations" of these datasets based on option counts and sampling techniques (e.g., "12 distinct variations... Each of the 12 variations comprises approximately 5,000 question-answer pairs"; "14 variations of this dataset... nearly 5,000 questions"). However, it does not provide standard training/validation/test splits (e.g., 80/10/10 percentages or per-split sample counts), as it primarily evaluates pre-trained models. Fine-tuning a PaLM model on a "slice of 400 examples" is mentioned, but this is not a comprehensive split of the full dataset.

Hardware Specification: No
LLM Response: The paper mentions evaluating various large language models (PaLM, GPT-3.5-Turbo, Vicuna, Llama2-13B, Mistral-7B, ChatGLM-6B, ChatGLM2-6B) and fine-tuning a PaLM model on Vertex AI. However, it does not provide specific hardware details such as GPU models, CPU types, or memory amounts used for these evaluations or fine-tuning; Vertex AI is a cloud platform, and the paper does not state which machine configuration was used.

Software Dependencies: No
LLM Response: The paper evaluates various pre-trained large language models (e.g., PaLM, GPT-3.5-Turbo, Vicuna, Llama2-13B, Mistral-7B, ChatGLM-6B, ChatGLM2-6B). While these are specific models, the paper does not list software dependencies with version numbers (e.g., programming languages, libraries such as PyTorch or TensorFlow, or API client versions) that would be needed to replicate the experimental environment.

Experiment Setup: No
LLM Response: The paper describes the evaluation methodology, including the types of questions and options presented to the models. However, for the primary experiments evaluating the various LLMs, it does not detail setup parameters such as hyperparameters (e.g., learning rate, batch size), specific prompt templates beyond general examples, temperature settings, or other system-level configurations. Fine-tuning PaLM with "40 training steps" on a "slice of 400 examples" is mentioned, but this covers only one model and not the broader experimental setup.
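The Accuracy and Bad Rate metrics quoted under Research Type can be sketched as below. This is a minimal illustration, not the paper's evaluation code: the record fields (`chosen`, `correct`, `bad_options`) are assumed names for the model's selected answer, the reference answer, and the options drawn from the "Bad" configuration pool.

```python
def accuracy(records):
    """Fraction of questions answered correctly by the model."""
    return sum(r["chosen"] == r["correct"] for r in records) / len(records)

def bad_rate(records):
    """Fraction of questions where the chosen answer came from the Bad pool."""
    return sum(r["chosen"] in r["bad_options"] for r in records) / len(records)

# Toy example with three question-answer records (illustrative data only).
records = [
    {"chosen": "mug",   "correct": "mug",  "bad_options": {"knife"}},
    {"chosen": "knife", "correct": "bowl", "bad_options": {"knife"}},
    {"chosen": "bowl",  "correct": "cup",  "bad_options": {"knife"}},
]
print(accuracy(records))  # 1 of 3 answered correctly
print(bad_rate(records))  # 1 of 3 chosen from the Bad pool
```

Note that the two metrics are independent: a wrong answer only counts toward Bad Rate if it was drawn from the "Bad" configuration pool, so Bad Rate can be lower than the error rate.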