Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents
Authors: Wenlong Huang, Pieter Abbeel, Deepak Pathak, Igor Mordatch
ICML 2022 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our evaluation in the recent VirtualHome environment shows that the resulting method substantially improves executability over the LLM baseline. The conducted human evaluation reveals a trade-off between executability and correctness but shows a promising sign towards extracting actionable knowledge from language models. |
| Researcher Affiliation | Collaboration | ¹University of California, Berkeley; ²Carnegie Mellon University; ³Google. Correspondence to: Wenlong Huang <EMAIL>. |
| Pseudocode | Yes | Pseudocode is in Appendix A.4. Algorithm 1: Generating Action Plans from Pre-Trained Language Models with Proposed Procedure |
| Open Source Code | Yes | Website: https://huangwl18.github.io/language-planner/. |
| Open Datasets | Yes | For our investigation, we use the recently proposed VirtualHome environment (Puig et al., 2018). It can simulate a large variety of realistic human activities in a household environment and supports the ability to perform them via a rich set of 47522 unique embodied actions defined with a verb-object syntax. [...] We use the Activity Programs knowledge base collected by Puig et al. (2018) for evaluation. |
| Dataset Splits | No | The paper mentions a "demonstration set" used for prompting and "held-out tasks for evaluation", but does not explicitly define a separate 'validation' split with sizes or percentages for hyperparameter tuning or model selection. |
| Hardware Specification | No | The paper does not explicitly provide details about the specific hardware used for running its experiments, such as GPU models, CPU models, or memory specifications. |
| Software Dependencies | No | The paper mentions using 'OpenAI API', 'Hugging Face Transformers (Wolf et al., 2019)', and 'Sentence Transformers (Reimers & Gurevych, 2019)' but does not provide specific version numbers for these software dependencies. |
| Experiment Setup | Yes | For all evaluated methods, we perform hyperparameter search over various sampling parameters, and for methods using a fixed prompt example, we report metrics averaged across three randomly chosen examples. [...] Appendix A.2. Hyperparameter Search |