Hypothesis, Verification, and Induction: Grounding Large Language Models with Self-Driven Skill Learning
Authors: Shaohui Peng, Xing Hu, Qi Yi, Rui Zhang, Jiaming Guo, Di Huang, Zikang Tian, Ruizhi Chen, Zidong Du, Qi Guo, Yunji Chen, Ling Li
AAAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Verified on the well-known instruction-following task set BabyAI, HYVIN achieves performance on the most challenging tasks comparable to imitation-learning methods that cost millions of demonstrations, proving the effectiveness of the learned skills and showing the feasibility and efficiency of our framework. The main performance results are shown in Table 1. We compare HYVIN with the baselines separately on each level of task. |
| Researcher Affiliation | Academia | 1 Intelligent Software Research Center, Institute of Software, CAS, Beijing, China; 2 SKL of Processors, Institute of Computing Technology, CAS, Beijing, China; 3 University of Science and Technology of China (USTC), Hefei, China; 4 Shanghai Innovation Center for Processor Technologies (SHIC), Shanghai, China; 5 University of Chinese Academy of Sciences (UCAS), Beijing, China |
| Pseudocode | No | The paper describes the HYVIN framework and its four phases (Hypothesis, Verification, Induction, Deduction) in detail using descriptive text and figures, but does not include any explicit pseudocode blocks or algorithms labeled as such. (A hedged reconstruction of the loop is sketched after this table.) |
| Open Source Code | No | The paper does not contain an explicit statement or a link indicating that the source code for the described methodology is publicly available. |
| Open Datasets | Yes | Verified on BabyAI, a grid-world platform for studying language-grounded tasks... To evaluate the efficiency and effectiveness of our proposed framework, which automatically discovers, learns, and applies skills, we test HYVIN on the BabyAI environment (Chevalier-Boisvert et al. 2019). |
| Dataset Splits | Yes | Besides, subgoals belonging to the same cluster are divided into training and verification sets to monitor the generalization ability of the trained skill and prevent overfitting. (A sketch of such a split appears after the table.) |
| Hardware Specification | No | The paper mentions using 'ChatGPT (GPT-3.5-turbo)' and training with the 'PPO algorithm' but does not provide specific details on the hardware (e.g., GPU models, CPU types, memory) used for these computations or experiments. |
| Software Dependencies | No | The paper states, 'In this paper, we use ChatGPT (GPT-3.5-turbo) as the large language model...'. While GPT-3.5-turbo is a specific version of a model/service, the paper does not list other key software components, libraries, or frameworks with their specific version numbers (e.g., Python, PyTorch, TensorFlow, or specific RL frameworks). |
| Experiment Setup | Yes | In the implementation, we randomly sample 100 instructions from each level of task and set the verification-step threshold T_verify to 3000. Considering the randomness of ChatGPT's answers, we repeat the experiment for each instruction 3 times to get the average results. (A sketch of this protocol appears after the table.) |
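
Since the paper provides no pseudocode, the following is a minimal Python sketch of the four-phase loop as we read it from the paper's prose. Every identifier here (`hyvin_loop`, `propose_subgoals`, `train_skill`, `is_verified`) is our own invention, not the authors' API; the LLM call and the PPO training are abstracted behind callables.

```python
from typing import Callable, Dict, List

def hyvin_loop(
    propose_subgoals: Callable[[str], List[str]],  # Hypothesis: LLM proposes subgoals
    train_skill: Callable[[str, int], object],     # Verification: e.g. PPO, capped at t_verify steps
    is_verified: Callable[[object], bool],         # did the policy pass the verification set?
    instructions: List[str],
    t_verify: int = 3000,                          # verification-step threshold reported in the paper
) -> Dict[str, object]:
    """Hypothetical reconstruction of HYVIN's Hypothesis-Verification-
    Induction-Deduction loop; names and structure are assumptions."""
    skills: Dict[str, object] = {}                 # Induction: library of verified skills
    for instruction in instructions:
        for subgoal in propose_subgoals(instruction):
            if subgoal in skills:                  # Deduction: reuse a learned skill
                continue
            policy = train_skill(subgoal, t_verify)
            if is_verified(policy):                # keep only skills that verify
                skills[subgoal] = policy
    return skills
```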
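
For the Dataset Splits row: a minimal sketch of dividing one cluster of subgoals into training and verification sets, as the paper describes. The 80/20 ratio and all names are our assumptions; the paper does not state the split fraction.

```python
import random
from typing import List, Tuple

def split_cluster(subgoals: List[str], train_frac: float = 0.8,
                  seed: int = 0) -> Tuple[List[str], List[str]]:
    """Split one cluster of subgoals into training and verification sets;
    the verification set monitors generalization to catch overfitting.
    The 0.8 ratio is an assumption, not from the paper."""
    rng = random.Random(seed)
    shuffled = list(subgoals)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]
```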
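
And for the Experiment Setup row: a sketch of the reported evaluation protocol (100 sampled instructions per level, 3 repetitions per instruction, averaged). `run_instruction` is a hypothetical callable returning a per-run success score; only the 100-sample and 3-repeat counts come from the paper.

```python
import random
from statistics import mean
from typing import Callable, List

def evaluate_level(instructions: List[str],
                   run_instruction: Callable[[str], float],  # hypothetical: one HYVIN run
                   n_samples: int = 100,                     # per the paper
                   n_repeats: int = 3,                       # per the paper
                   seed: int = 0) -> float:
    """Average success over 100 sampled instructions x 3 repeated runs."""
    rng = random.Random(seed)
    sampled = rng.sample(instructions, min(n_samples, len(instructions)))
    return mean(
        mean(run_instruction(instr) for _ in range(n_repeats))
        for instr in sampled
    )
```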