Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Improving Knowledge Extraction from LLMs for Task Learning through Agent Analysis
Authors: James R. Kirk, Robert E. Wray, Peter Lindes, John E. Laird
AAAI 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We describe the approach and experiments that show how an agent, by retrieving and evaluating a breadth of responses from the LLM, can achieve 77–94% task completion in one-shot learning without user oversight. |
| Researcher Affiliation | Academia | James R. Kirk, Robert E. Wray, Peter Lindes, John E. Laird Center for Integrated Cognition at IQMRI Ann Arbor, MI 48105 USA EMAIL |
| Pseudocode | No | The paper contains flow diagrams (Figures 1, 2, 3) but no structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code for the ITL agent with STARS, simulator, and data analysis are available at https://github.com/Center-for-Integrated-Cognition/STARS. |
| Open Datasets | No | The paper describes custom simulated environments and tasks ('simulated office and kitchen', 'tidy kitchen', 'store groceries', 'organize office') with objects specific to these tasks, but does not provide access information (link, DOI, citation) for these experimental setups as a publicly available or open dataset. |
| Dataset Splits | No | The paper describes task completion rates and experimental conditions but does not specify any training, validation, or test dataset splits (e.g., percentages, sample counts, or citations to predefined splits) for reproducibility. |
| Hardware Specification | No | The paper mentions a 'simulated robotic environment' and the 'APRIL MAGIC simulator' but does not provide any specific hardware details (e.g., CPU, GPU models, memory) used for running the experiments or simulations. |
| Software Dependencies | No | The paper mentions that 'the LLM used is GPT-3 (for TBP, Search Tree, and Repair) and GPT-4 (for Selection)' but does not provide specific version numbers for these models or any other software dependencies. |
| Experiment Setup | Yes | For all conditions, the LLM used is GPT-3 (for TBP, Search Tree, and Repair) and GPT-4 (for Selection). In all conditions, a user provides the initial task. In the Oversight conditions, the user reviews up to 5 responses. In non-oversight conditions, the choice of the goal is based on the highest mean log probability of candidates (ST and STAR) or the Selection strategy (STS and STARS). |
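The non-oversight selection rule quoted above (for the ST and STAR conditions) picks the candidate goal with the highest mean log probability over its tokens. A minimal sketch of that rule, assuming per-token log probabilities are available for each sampled response (the candidate texts and log-probability values here are illustrative, not from the paper):

```python
def mean_logprob(token_logprobs):
    """Mean per-token log probability for one candidate response."""
    return sum(token_logprobs) / len(token_logprobs)

def select_candidate(candidates):
    """Pick the candidate whose tokens have the highest mean log probability.

    `candidates` maps candidate text -> list of per-token log probabilities
    (hypothetical structure; real LLM APIs return these alongside samples).
    """
    return max(candidates, key=lambda text: mean_logprob(candidates[text]))

# Illustrative goal candidates with made-up per-token log probabilities.
candidates = {
    "goal: all dishes are in the dishwasher": [-0.2, -0.1, -0.3, -0.2],
    "goal: the mug is in the cupboard": [-1.1, -0.9, -1.4, -1.0],
}
print(select_candidate(candidates))  # the first candidate (mean -0.2 > -1.1)
```

Averaging over tokens (rather than summing) avoids penalizing longer candidates; the STS and STARS conditions instead delegate this choice to a GPT-4 Selection strategy.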