Hypothesis, Verification, and Induction: Grounding Large Language Models with Self-Driven Skill Learning

Authors: Shaohui Peng, Xing Hu, Qi Yi, Rui Zhang, Jiaming Guo, Di Huang, Zikang Tian, Ruizhi Chen, Zidong Du, Qi Guo, Yunji Chen, Ling Li

AAAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Verified on the well-known instruction-following task set BabyAI, HYVIN achieves performance comparable to imitation learning methods that cost millions of demonstrations on the most challenging tasks, proving the effectiveness of the learned skills and showing the feasibility and efficiency of our framework. The main performance results are shown in Table 1; HYVIN is compared with the baselines separately on each level task.
Researcher Affiliation | Academia | 1 Intelligent Software Research Center, Institute of Software, CAS, Beijing, China; 2 SKL of Processors, Institute of Computing Technology, CAS, Beijing, China; 3 University of Science and Technology of China, USTC, Hefei, China; 4 Shanghai Innovation Center for Processor Technologies, SHIC, Shanghai, China; 5 University of Chinese Academy of Sciences, UCAS, Beijing, China
Pseudocode | No | The paper describes the HYVIN framework and its phases (Hypothesis, Verification, Induction, Deduction) in descriptive text and figures, but it includes no explicit pseudocode blocks or labeled algorithms (a hypothetical reconstruction of the loop is sketched after this table).
Open Source Code | No | The paper does not contain an explicit statement or a link indicating that the source code for the described methodology is publicly available.
Open Datasets | Yes | Verified in BabyAI, a grid-world platform to study language-grounded tasks... To evaluate the efficiency and effectiveness of our proposed framework that automatically discovers, learns, and applies skills, we test HYVIN on the BabyAI environment (Chevalier-Boisvert et al. 2019).
Dataset Splits | Yes | Besides, subgoals belonging to the same cluster are divided into training and verification sets to monitor the generalization ability of the trained skill and prevent overfitting (a holdout sketch follows the table).
Hardware Specification | No | The paper mentions using 'Chat GPT (GPT3.5-turbo)' and training with the 'PPO algorithm' but does not specify the hardware used (e.g., GPU models, CPU types, memory) for these computations or experiments.
Software Dependencies | No | The paper states, 'In this paper, we use Chat GPT (GPT3.5-turbo) as the large language model...'. While GPT-3.5-turbo is a specific model version, the paper does not list other key software components, libraries, or frameworks with their version numbers (e.g., Python, PyTorch, TensorFlow, or a specific RL framework).
Experiment Setup | Yes | In the implementation, we randomly sample 100 instructions from each level task and set the verification-step threshold Tverify to 3000. Considering the randomness of ChatGPT's answers, we repeat the experiment for each instruction 3 times to obtain average results. (An evaluation-loop sketch follows the table.)
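
Although the paper offers no pseudocode, the four phases it names suggest a control flow along the lines below. This is a minimal Python sketch reconstructed from the paper's prose: all callables (propose_subgoals, train_skill, cluster_subgoals, plan_with_skills) are hypothetical stand-ins rather than the authors' API; only the 3000-step verification threshold and the use of PPO come from the paper.

```python
# Hypothetical reconstruction of the HYVIN loop (Hypothesis -> Verification
# -> Induction -> Deduction). The four callables are stand-ins for
# components the paper only describes in text: an LLM subgoal proposer, a
# PPO skill trainer, a subgoal clusterer, and an LLM planner.

def hyvin(instructions, propose_subgoals, train_skill, cluster_subgoals,
          plan_with_skills, t_verify=3000):
    """Run Hypothesis -> Verification -> Induction, then return a Deduction
    step that solves new instructions with the induced skills."""
    verified = []

    # Hypothesis: the LLM decomposes each instruction into candidate subgoals.
    for instruction in instructions:
        for subgoal in propose_subgoals(instruction):
            # Verification: train a policy (PPO in the paper) and keep the
            # subgoal only if it is achieved within the step threshold.
            policy, steps_used = train_skill(subgoal, max_steps=t_verify)
            if steps_used <= t_verify:
                verified.append((subgoal, policy))

    # Induction: group verified subgoals into reusable, general skills.
    skills = cluster_subgoals(verified)

    # Deduction: compose induced skills to follow new instructions.
    def solve(new_instruction):
        return plan_with_skills(new_instruction, skills)

    return skills, solve
```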
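The per-cluster training/verification split the paper describes amounts to a simple holdout. A minimal sketch follows; the 80/20 ratio and the seed are illustrative assumptions the paper does not state.

```python
import random

def split_cluster(subgoals, train_frac=0.8, seed=0):
    """Divide one cluster's subgoals into training and verification sets.

    The ratio and seeding are assumptions for illustration; the paper only
    states that a verification set is held out to monitor generalization
    and prevent overfitting.
    """
    rng = random.Random(seed)
    shuffled = subgoals[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]  # (training set, verification set)
```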
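The reported setup (100 instructions sampled per level, each repeated 3 times to average over ChatGPT's sampling randomness) maps onto an evaluation loop like the sketch below; evaluate is a hypothetical callable that runs the framework on one instruction and returns a success score.

```python
import random
from statistics import mean

N_INSTRUCTIONS = 100  # instructions sampled per level task (from the paper)
N_REPEATS = 3         # repetitions per instruction to average LLM randomness

def evaluate_level(level_instructions, evaluate, seed=0):
    """Average success over sampled instructions; `evaluate` is hypothetical."""
    rng = random.Random(seed)
    sample = rng.sample(level_instructions, N_INSTRUCTIONS)
    per_instruction = [
        mean(evaluate(instruction) for _ in range(N_REPEATS))
        for instruction in sample
    ]
    return mean(per_instruction)
```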