Hypothesis, Verification, and Induction: Grounding Large Language Models with Self-Driven Skill Learning

Authors: Shaohui Peng, Xing Hu, Qi Yi, Rui Zhang, Jiaming Guo, Di Huang, Zikang Tian, Ruizhi Chen, Zidong Du, Qi Guo, Yunji Chen, Ling Li

AAAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Verified on the well-known instruction-following task set BabyAI, HYVIN achieves performance comparable to imitation learning methods that cost millions of demonstrations on the most challenging tasks, proving the effectiveness of the learned skills and showing the feasibility and efficiency of our framework. The main performance results are shown in Table 1; HYVIN is compared with the baselines separately on each level task.
Researcher Affiliation | Academia | 1 Intelligent Software Research Center, Institute of Software, CAS, Beijing, China; 2 SKL of Processors, Institute of Computing Technology, CAS, Beijing, China; 3 University of Science and Technology of China, USTC, Hefei, China; 4 Shanghai Innovation Center for Processor Technologies, SHIC, Shanghai, China; 5 University of Chinese Academy of Sciences, UCAS, Beijing, China
Pseudocode | No | The paper describes the HYVIN framework and its phases (Hypothesis, Verification, Induction, Deduction) in descriptive text and figures, but it includes no explicit pseudocode blocks or labeled algorithms (a hypothetical reconstruction of the loop is sketched after this table).
Open Source Code | No | The paper does not contain an explicit statement or a link indicating that the source code for the described methodology is publicly available.
Open Datasets | Yes | Verified in BabyAI, a grid-world platform to study language-grounded tasks... To evaluate the efficiency and effectiveness of our proposed framework that automatically discovers, learns, and applies skills, we test HYVIN on the BabyAI environment (Chevalier-Boisvert et al. 2019).
Dataset Splits | Yes | Besides, subgoals belonging to the same cluster are divided into training and verification sets to monitor the generalization ability of the trained skill and prevent overfitting (a holdout sketch follows the table).
Hardware Specification | No | The paper mentions using 'Chat GPT (GPT3.5-turbo)' and training with the 'PPO algorithm' but does not specify the hardware used (e.g., GPU models, CPU types, memory) for these computations or experiments.
Software Dependencies | No | The paper states, 'In this paper, we use Chat GPT (GPT3.5-turbo) as the large language model...'. While GPT-3.5-turbo is a specific model version, the paper does not list other key software components, libraries, or frameworks with their version numbers (e.g., Python, PyTorch, TensorFlow, or a specific RL framework).
Experiment Setup | Yes | In the implementation, we randomly sample 100 instructions from each level task and set the verification-step threshold Tverify to 3000. Considering the randomness of ChatGPT's answers, we repeat the experiment for each instruction 3 times to obtain average results. (An evaluation-loop sketch follows the table.)
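
Although the paper offers no pseudocode, the four phases it names suggest a control flow along the lines below. This is a minimal Python sketch reconstructed from the paper's prose: all callables (propose_subgoals, train_skill, cluster_subgoals, plan_with_skills) are hypothetical stand-ins rather than the authors' API; only the 3000-step verification threshold and the use of PPO come from the paper.

```python
# Hypothetical reconstruction of the HYVIN loop (Hypothesis -> Verification
# -> Induction -> Deduction). The four callables are stand-ins for
# components the paper only describes in text: an LLM subgoal proposer, a
# PPO skill trainer, a subgoal clusterer, and an LLM planner.

def hyvin(instructions, propose_subgoals, train_skill, cluster_subgoals,
          plan_with_skills, t_verify=3000):
    """Run Hypothesis -> Verification -> Induction, then return a Deduction
    step that solves new instructions with the induced skills."""
    verified = []

    # Hypothesis: the LLM decomposes each instruction into candidate subgoals.
    for instruction in instructions:
        for subgoal in propose_subgoals(instruction):
            # Verification: train a policy (PPO in the paper) and keep the
            # subgoal only if it is achieved within the step threshold.
            policy, steps_used = train_skill(subgoal, max_steps=t_verify)
            if steps_used <= t_verify:
                verified.append((subgoal, policy))

    # Induction: group verified subgoals into reusable, general skills.
    skills = cluster_subgoals(verified)

    # Deduction: compose induced skills to follow new instructions.
    def solve(new_instruction):
        return plan_with_skills(new_instruction, skills)

    return skills, solve
```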
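The per-cluster training/verification split the paper describes amounts to a simple holdout. A minimal sketch follows; the 80/20 ratio and the seed are illustrative assumptions the paper does not state.

```python
import random

def split_cluster(subgoals, train_frac=0.8, seed=0):
    """Divide one cluster's subgoals into training and verification sets.

    The ratio and seeding are assumptions for illustration; the paper only
    states that a verification set is held out to monitor generalization
    and prevent overfitting.
    """
    rng = random.Random(seed)
    shuffled = subgoals[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]  # (training set, verification set)
```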
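The reported setup (100 instructions sampled per level, each repeated 3 times to average over ChatGPT's sampling randomness) maps onto an evaluation loop like the sketch below; evaluate is a hypothetical callable that runs the framework on one instruction and returns a success score.

```python
import random
from statistics import mean

N_INSTRUCTIONS = 100  # instructions sampled per level task (from the paper)
N_REPEATS = 3         # repetitions per instruction to average LLM randomness

def evaluate_level(level_instructions, evaluate, seed=0):
    """Average success over sampled instructions; `evaluate` is hypothetical."""
    rng = random.Random(seed)
    sample = rng.sample(level_instructions, N_INSTRUCTIONS)
    per_instruction = [
        mean(evaluate(instruction) for _ in range(N_REPEATS))
        for instruction in sample
    ]
    return mean(per_instruction)
```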