Enabling Language Models to Implicitly Learn Self-Improvement
Authors: Ziqi Wang, Le Hou, Tianjian Lu, Yuexin Wu, Yunxuan Li, Hongkun Yu, Heng Ji
ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on two real-world datasets and one synthetic dataset show that our method significantly outperforms prompting-based methods. |
| Researcher Affiliation | Collaboration | Ziqi Wang¹, Le Hou², Tianjian Lu², Yuexin Wu², Yunxuan Li², Hongkun Yu², Heng Ji¹ (¹University of Illinois Urbana-Champaign, ²Google) |
| Pseudocode | Yes | The algorithm block of PIT can be found in Algorithm 1. ... The self-improvement inference process is summarized in Algorithm 2. |
| Open Source Code | No | The paper does not provide an explicit statement or link to its own open-source code for the methodology described. |
| Open Datasets | Yes | Anthropic/HH-RLHF. The HH-RLHF dataset (Bai et al., 2022; Ganguli et al., 2022) is released by Anthropic and is allowed for research purposes... OpenAI/Summary (Stiennon et al., 2020a), which is also allowed for research purposes... |
| Dataset Splits | Yes | OpenAI/Summary... It contains 92.9K training data and 86.1K validation data. Similarly to the Anthropic/HH-RLHF dataset, we equally divide the training dataset into three folds for supervised fine-tuning, reward model training, and reinforcement learning, respectively. (See the fold-splitting sketch below the table.) |
| Hardware Specification | Yes | We train our models on TPU v4 (Jouppi et al., 2023). |
| Software Dependencies | No | The paper mentions models like PaLM 2 (Bison) and DeBERTa-Large, but does not provide specific version numbers for software libraries, frameworks, or programming languages used in the implementation. |
| Experiment Setup | Yes | In the supervised fine-tuning stage, we fine-tune M_SFT^PIT and M_SFT^P for one epoch to avoid overfitting and set the learning rate to 3e-5. ... The learning rate is set to 3e-4 in the reward model training. The learning rate of reinforcement learning is set to 1e-5. We set the context window to 512 for inputs and 512 for outputs for M_P, and the context window to 512 for inputs, 512 for reference outputs and 512 for outputs for M_PIT. (See the configuration sketch below the table.) |
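
The Dataset Splits row above reports that each training set is divided into three equal folds, one each for supervised fine-tuning, reward model training, and reinforcement learning. A minimal sketch of such a split is shown below; the use of the Hugging Face `datasets` library and the `Anthropic/hh-rlhf` dataset identifier are assumptions made for illustration, not the authors' pipeline.

```python
# Minimal sketch, not the authors' code: split a training set into three
# equal folds for SFT, reward model training, and RL, as described in the
# Dataset Splits row. The library and dataset identifier are assumptions.
from datasets import load_dataset

train = load_dataset("Anthropic/hh-rlhf", split="train").shuffle(seed=0)

fold = len(train) // 3
sft_data = train.select(range(0, fold))              # fold 1: supervised fine-tuning
reward_data = train.select(range(fold, 2 * fold))    # fold 2: reward model training
rl_data = train.select(range(2 * fold, len(train)))  # fold 3: reinforcement learning
```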
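
The Experiment Setup row lists the reported hyperparameters for the three training stages and the context windows of M_P and M_PIT. The configuration below simply collects those reported numbers in one place; the dictionary structure and key names are assumptions chosen for readability, not the authors' configuration format.

```python
# Minimal sketch, not the authors' configuration: hyperparameters reported in
# the Experiment Setup row. Key names and nesting are assumptions.
training_config = {
    "sft": {"epochs": 1, "learning_rate": 3e-5},  # one epoch to avoid overfitting
    "reward_model": {"learning_rate": 3e-4},
    "reinforcement_learning": {"learning_rate": 1e-5},
    "context_window": {
        "M_P": {"input": 512, "output": 512},
        "M_PIT": {"input": 512, "reference_output": 512, "output": 512},
    },
}
```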