Improving Knowledge Extraction from LLMs for Task Learning through Agent Analysis
Authors: James R. Kirk, Robert E. Wray, Peter Lindes, John E. Laird
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We describe the approach and experiments that show how an agent, by retrieving and evaluating a breadth of responses from the LLM, can achieve 77-94% task completion in one-shot learning without user oversight. |
| Researcher Affiliation | Academia | James R. Kirk, Robert E. Wray, Peter Lindes, John E. Laird Center for Integrated Cognition at IQMRI Ann Arbor, MI 48105 USA {james.kirk,robert.wray,peter.lindes,john.laird}@cic.iqmri.org |
| Pseudocode | No | The paper contains flow diagrams (Figures 1, 2, 3) but no structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code for the ITL agent with STARS, simulator, and data analysis are available at https://github.com/Center-for-Integrated-Cognition/STARS. |
| Open Datasets | No | The paper describes custom simulated environments and tasks ('simulated office and kitchen', 'tidy kitchen', 'store groceries', 'organize office') with objects specific to these tasks, but does not provide access information (link, DOI, citation) for these experimental setups as a publicly available or open dataset. |
| Dataset Splits | No | The paper describes task completion rates and experimental conditions but does not specify any training, validation, or test dataset splits (e.g., percentages, sample counts, or citations to predefined splits) for reproducibility. |
| Hardware Specification | No | The paper mentions a 'simulated robotic environment' and the 'APRIL MAGIC simulator' but does not provide any specific hardware details (e.g., CPU, GPU models, memory) used for running the experiments or simulations. |
| Software Dependencies | No | The paper mentions that 'the LLM used is GPT-3 (for TBP, Search Tree, and Repair) and GPT-4 (for Selection)' but does not provide specific version numbers for these models or any other software dependencies. |
| Experiment Setup | Yes | For all conditions, the LLM used is GPT-3 (for TBP, Search Tree, and Repair) and GPT-4 (for Selection). In all conditions, a user provides the initial task. In the Oversight conditions, the user reviews up to 5 responses. In non-oversight conditions, the choice of the goal is based on the highest mean log probability of candidates (ST and STAR) or the Selection strategy (STS and STARS). |
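The non-oversight goal-choice rule quoted above (pick the candidate with the highest mean log probability) can be sketched as follows. This is a minimal illustration, not the paper's implementation; the candidate dictionary fields and example values are assumptions for demonstration.

```python
# Sketch: selecting a goal candidate by highest mean token log probability,
# as used in the ST and STAR (non-oversight) conditions.
# Field names ("goal", "logprobs") are illustrative assumptions.

def mean_logprob(token_logprobs):
    """Average per-token log probability of one LLM response."""
    return sum(token_logprobs) / len(token_logprobs)

def select_candidate(candidates):
    """Return the candidate whose response has the highest mean log prob."""
    return max(candidates, key=lambda c: mean_logprob(c["logprobs"]))

# Example with made-up token log probabilities:
candidates = [
    {"goal": "all dishes in cupboard", "logprobs": [-0.9, -1.2, -0.7]},
    {"goal": "all dishes in sink",     "logprobs": [-0.3, -0.5, -0.4]},
]
best = select_candidate(candidates)
```

Averaging per token (rather than summing) avoids penalizing longer responses, which is why mean log probability is a common ranking heuristic for LLM outputs.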