Improving Knowledge Extraction from LLMs for Task Learning through Agent Analysis

Authors: James R. Kirk, Robert E. Wray, Peter Lindes, John E. Laird

AAAI 2024

Reproducibility

| Variable | Result | LLM Response |
|----------|--------|--------------|
| Research Type | Experimental | "We describe the approach and experiments that show how an agent, by retrieving and evaluating a breadth of responses from the LLM, can achieve 77–94% task completion in one-shot learning without user oversight." |
| Researcher Affiliation | Academia | James R. Kirk, Robert E. Wray, Peter Lindes, John E. Laird; Center for Integrated Cognition at IQMRI, Ann Arbor, MI 48105 USA; {james.kirk,robert.wray,peter.lindes,john.laird}@cic.iqmri.org |
| Pseudocode | No | The paper contains flow diagrams (Figures 1, 2, 3) but no structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code for the ITL agent with STARS, the simulator, and the data analysis is available at https://github.com/Center-for-Integrated-Cognition/STARS. |
| Open Datasets | No | The paper describes custom simulated environments and tasks ("simulated office and kitchen", "tidy kitchen", "store groceries", "organize office") with objects specific to these tasks, but provides no access information (link, DOI, or citation) for these experimental setups as a publicly available dataset. |
| Dataset Splits | No | The paper reports task completion rates and experimental conditions but does not specify any training, validation, or test splits (e.g., percentages, sample counts, or citations to predefined splits) for reproducibility. |
| Hardware Specification | No | The paper mentions a "simulated robotic environment" and the "APRIL MAGIC simulator" but gives no specific hardware details (e.g., CPU or GPU models, memory) used to run the experiments or simulations. |
| Software Dependencies | No | The paper states that "the LLM used is GPT-3 (for TBP, Search Tree, and Repair) and GPT-4 (for Selection)" but does not provide specific version numbers for these models or for any other software dependencies. |
| Experiment Setup | Yes | "For all conditions, the LLM used is GPT-3 (for TBP, Search Tree, and Repair) and GPT-4 (for Selection). In all conditions, a user provides the initial task. In the Oversight conditions, the user reviews up to 5 responses. In non-oversight conditions, the choice of the goal is based on the highest mean log probability of candidates (ST and STAR) or the Selection strategy (STS and STARS)." |
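The non-oversight goal-selection criterion quoted above ("highest mean log probability of candidates") can be sketched in a few lines. This is an illustrative reconstruction, not the paper's code: the candidate goals and per-token log-probability values below are invented for the example, and `mean_logprob`/`select_goal` are hypothetical helper names.

```python
# Sketch of selecting a candidate goal by highest mean token log probability,
# as described for the ST and STAR conditions. Values are illustrative only.

def mean_logprob(token_logprobs):
    """Average per-token log probability of one LLM candidate response."""
    return sum(token_logprobs) / len(token_logprobs)

def select_goal(candidates):
    """Return the candidate whose tokens the LLM assigned the highest
    average log probability (i.e., the response it was most confident in)."""
    return max(candidates, key=lambda c: mean_logprob(c["token_logprobs"]))

# Hypothetical candidates with per-token log probabilities.
candidates = [
    {"goal": "plate in cupboard", "token_logprobs": [-0.2, -0.9, -0.4]},
    {"goal": "plate in sink",     "token_logprobs": [-1.1, -1.5, -0.8]},
]

best = select_goal(candidates)
print(best["goal"])  # -> plate in cupboard (mean -0.5 beats mean -1.13)
```

Averaging over tokens (rather than summing) keeps the score comparable across candidates of different lengths, which is why mean log probability is a common ranking heuristic for LLM completions.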