Language Model Self-improvement by Reinforcement Learning Contemplation
Authors: Jing-Cheng Pang, Pengyuan Wang, Kaiyuan Li, Xiong-Hui Chen, Jiacheng Xu, Zongzhang Zhang, Yang Yu
ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through testing on various challenging reasoning tasks and a text summarization task, our experiments show that RLC effectively improves language model performance without external supervision, resulting in an answering accuracy increase (31.23% → 37.09%) for Big Bench-hard reasoning tasks, and a rise in BERTScore for CNN/Daily Mail summarization tasks. Furthermore, RLC can be applied to models of different sizes, showcasing its broad applicability. |
| Researcher Affiliation | Collaboration | Jing-Cheng Pang1,2, Pengyuan Wang1,2, Kaiyuan Li1, Xiong-Hui Chen1,2, Jiacheng Xu1, Zongzhang Zhang1 & Yang Yu1,2,3; 1 National Key Laboratory for Novel Software Technology, Nanjing University, China & School of Artificial Intelligence, Nanjing University, China; 2 Polixir.ai; 3 Peng Cheng Laboratory, Shenzhen, 518055, China |
| Pseudocode | Yes | Algorithm 1 Self-Improvement by Reinforcement Learning Contemplation (RLC) |
| Open Source Code | No | The paper mentions "We implement RLC using CEP in reasoning tasks and QEP for summarization task. Unless otherwise specified, we use FLAN-T5-Large, which has 780M parameters, as our LLM in the experiments. All reported results are averaged over three random trials, except for RLAIF with one seed, and the experiments can be conducted using two GTX 3090 graphics cards with 24GB of memory. We provide specific hyper-parameters and more detailed implementation descriptions in Appendix B. All RL-based methods use the same hyper-parameters for training RL algorithms. In our experiments, we consistently employ the same prompts for all baseline models and the RLC. For each method, we utilize the CoT prompts, specifically, 'Let's think step by step.' A comprehensive list of the prompts used in our experiments can be found in Table 7 of the Appendix.", but it does not provide an explicit statement about releasing the code for the RLC method or a link to a repository. (Since no code is released, a hedged sketch of a CEP-style self-evaluation reward is given after the table.) |
| Open Datasets | Yes | Dataset for evaluation. We use the Big Bench (Srivastava et al., 2022) benchmark to conduct our experiments. Big Bench is a challenging reasoning benchmark that requires complex reasoning capabilities from language models. The tasks in Big Bench are quite diverse, including reasoning about the final results of a sequence of actions, understanding dates, and completing tasks that require simple arithmetic calculations. In our experiments, we use 12 challenging tasks from the Big Bench-Hard datasets, which cover judgments, multiple choices, and text generation tasks. |
| Dataset Splits | No | The paper mentions using Big Bench dataset but does not explicitly state train/validation/test splits, nor does it refer to predefined splits with citations for reproducibility. It discusses training iterations but not data partitioning. |
| Hardware Specification | Yes | All reported results are averaged over three random trials, except for RLAIF with one seed, and the experiments can be conducted using two GTX 3090 graphics cards with 24GB of memory. |
| Software Dependencies | No | The paper mentions "We utilize the open-sourced RL repository, trlx, to implement the reinforcement learning contemplation." and "We implement RLC using CEP in reasoning tasks and QEP for summarization task.", but it does not specify version numbers for these software components or any other libraries like Python, PyTorch, etc. |
| Experiment Setup | Yes | Implementation details. We utilize PPO to train the LLM for 6,000 gradient steps for each task, with a batch size of 12. The PPO implementation is from the trlx repository on GitHub (CarperAI, 2020). We implement RLC using CEP in reasoning tasks and QEP for summarization task. Unless otherwise specified, we use FLAN-T5-Large, which has 780M parameters, as our LLM in the experiments. All reported results are averaged over three random trials, except for RLAIF with one seed, and the experiments can be conducted using two GTX 3090 graphics cards with 24GB of memory. We provide specific hyper-parameters and more detailed implementation descriptions in Appendix B. All RL-based methods use the same hyper-parameters for training RL algorithms. In our experiments, we consistently employ the same prompts for all baseline models and the RLC. For each method, we utilize the CoT prompts, specifically, 'Let's think step by step.' A comprehensive list of the prompts used in our experiments can be found in Table 7 of the Appendix. (A hedged sketch of this PPO/trlx setup is given after the table.) |
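
Since the paper releases no code, the following is a minimal sketch, not the authors' implementation, of what a CEP-style self-evaluation reward could look like: the same FLAN-T5 model is asked whether its own generated answer is correct, and the probability it assigns to "yes" is used as the scalar reward. The prompt wording and the function name `cep_reward` are assumptions for illustration.

```python
# Minimal sketch (not from the paper) of a CEP-style self-evaluation reward:
# the model judges its own answer, and P("yes") becomes the reward signal.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

MODEL_NAME = "google/flan-t5-large"  # 780M parameters, as used in the paper
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME).eval()

YES_ID = tokenizer("yes", add_special_tokens=False).input_ids[0]
NO_ID = tokenizer("no", add_special_tokens=False).input_ids[0]

@torch.no_grad()
def cep_reward(question: str, answer: str) -> float:
    """Return P(yes) when the model is asked whether `answer` solves `question`."""
    prompt = (
        f"Question: {question}\nProposed answer: {answer}\n"
        "Is the proposed answer correct? Answer yes or no."
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    # Score only the first decoded token and compare the logits of "yes" vs. "no".
    decoder_input_ids = torch.tensor([[model.config.decoder_start_token_id]])
    logits = model(**inputs, decoder_input_ids=decoder_input_ids).logits[0, -1]
    probs = torch.softmax(logits[[YES_ID, NO_ID]], dim=-1)
    return probs[0].item()
```

Scoring only the first decoded token keeps the reward cheap to compute; a fuller reproduction would batch the evaluation prompts and use the exact evaluation prompt wording listed in Table 7 of the paper's appendix.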
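
Building on that reward, the sketch below shows how the reported PPO setup (trlx, FLAN-T5-Large, batch size 12) might be wired together. It assumes trlx's `trlx.train` entry point and its `default_ppo_config` helper behave as in CarperAI's published examples; exact config field names can vary across trlx releases, and `questions` is a hypothetical placeholder for the task inputs.

```python
# Rough sketch, not the authors' code: wiring a self-evaluation reward into PPO
# training with trlx. Check field names against the installed trlx version.
import trlx
from trlx.data.default_configs import default_ppo_config

# Hypothetical placeholder for the Big Bench-Hard task inputs.
questions = ["Today is 3 Jan 2022. What is the date one week from today?"]
train_prompts = [q + "\nLet's think step by step." for q in questions]  # CoT prompt from the paper

def reward_fn(samples, prompts, outputs, **kwargs):
    # `prompts` are the questions, `outputs` the generated answers; reuse the
    # CEP-style self-evaluation score from the sketch above as the reward.
    return [cep_reward(p, o) for p, o in zip(prompts, outputs)]

config = default_ppo_config()
config.model.model_path = "google/flan-t5-large"         # FLAN-T5-Large (780M), as in the paper
config.model.model_arch_type = "seq2seq"                 # T5 is an encoder-decoder model
config.tokenizer.tokenizer_path = "google/flan-t5-large"
config.train.batch_size = 12                             # batch size reported in the paper

trainer = trlx.train(
    reward_fn=reward_fn,
    prompts=train_prompts,
    config=config,
)
```

The 6,000 gradient steps per task reported in the paper would be controlled through the config's training schedule, which is left at its default here.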