Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Improving Knowledge Extraction from LLMs for Task Learning through Agent Analysis
Authors: James R. Kirk, Robert E. Wray, Peter Lindes, John E. Laird
AAAI 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We describe the approach and experiments that show how an agent, by retrieving and evaluating a breadth of responses from the LLM, can achieve 77–94% task completion in one-shot learning without user oversight. |
| Researcher Affiliation | Academia | James R. Kirk, Robert E. Wray, Peter Lindes, John E. Laird Center for Integrated Cognition at IQMRI Ann Arbor, MI 48105 USA EMAIL |
| Pseudocode | No | The paper contains flow diagrams (Figures 1, 2, 3) but no structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code for the ITL agent with STARS, simulator, and data analysis are available at https://github.com/Center-for-Integrated-Cognition/STARS. |
| Open Datasets | No | The paper describes custom simulated environments and tasks ('simulated office and kitchen', 'tidy kitchen', 'store groceries', 'organize office') with objects specific to these tasks, but does not provide access information (link, DOI, citation) for these experimental setups as a publicly available or open dataset. |
| Dataset Splits | No | The paper describes task completion rates and experimental conditions but does not specify any training, validation, or test dataset splits (e.g., percentages, sample counts, or citations to predefined splits) for reproducibility. |
| Hardware Specification | No | The paper mentions a 'simulated robotic environment' and the 'APRIL MAGIC simulator' but does not provide any specific hardware details (e.g., CPU, GPU models, memory) used for running the experiments or simulations. |
| Software Dependencies | No | The paper mentions that 'the LLM used is GPT-3 (for TBP, Search Tree, and Repair) and GPT-4 (for Selection)' but does not provide specific version numbers for these models or any other software dependencies. |
| Experiment Setup | Yes | For all conditions, the LLM used is GPT-3 (for TBP, Search Tree, and Repair) and GPT-4 (for Selection). In all conditions, a user provides the initial task. In the Oversight conditions, the user reviews up to 5 responses. In non-oversight conditions, the choice of the goal is based on the highest mean log probability of candidates (ST and STAR) or the Selection strategy (STS and STARS). |
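The non-oversight selection rule quoted above (for the ST and STAR conditions) picks the candidate goal with the highest mean log probability over its tokens. A minimal sketch of that rule, assuming per-token log probabilities are available for each sampled response (the candidate texts and log-probability values here are illustrative, not from the paper):

```python
def mean_logprob(token_logprobs):
    """Mean per-token log probability for one candidate response."""
    return sum(token_logprobs) / len(token_logprobs)

def select_candidate(candidates):
    """Pick the candidate whose tokens have the highest mean log probability.

    `candidates` maps candidate text -> list of per-token log probabilities
    (hypothetical structure; real LLM APIs return these alongside samples).
    """
    return max(candidates, key=lambda text: mean_logprob(candidates[text]))

# Illustrative goal candidates with made-up per-token log probabilities.
candidates = {
    "goal: all dishes are in the dishwasher": [-0.2, -0.1, -0.3, -0.2],
    "goal: the mug is in the cupboard": [-1.1, -0.9, -1.4, -1.0],
}
print(select_candidate(candidates))  # the first candidate (mean -0.2 > -1.1)
```

Averaging over tokens (rather than summing) avoids penalizing longer candidates; the STS and STARS conditions instead delegate this choice to a GPT-4 Selection strategy.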