Enabling Language Models to Implicitly Learn Self-Improvement

Authors: Ziqi Wang, Le Hou, Tianjian Lu, Yuexin Wu, Yunxuan Li, Hongkun Yu, Heng Ji

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on two real-world datasets and one synthetic dataset show that our method significantly outperforms prompting-based methods.
Researcher Affiliation | Collaboration | Ziqi Wang^1, Le Hou^2, Tianjian Lu^2, Yuexin Wu^2, Yunxuan Li^2, Hongkun Yu^2, Heng Ji^1 (1: University of Illinois Urbana-Champaign; 2: Google)
Pseudocode | Yes | The algorithm block of PIT can be found in Algorithm 1. ... The self-improvement inference process is summarized in Algorithm 2.
Open Source Code | No | The paper does not provide an explicit statement or link to its own open-source code for the methodology described.
Open Datasets | Yes | Anthropic/HH-RLHF. The HH-RLHF dataset (Bai et al., 2022; Ganguli et al., 2022) is released by Anthropic and is allowed for research purposes... OpenAI/Summary (Stiennon et al., 2020a), which is also allowed for research purposes... (a dataset-loading sketch follows the table)
Dataset Splits | Yes | OpenAI/Summary... It contains 92.9K training data and 86.1K validation data. Similarly to the Anthropic/HH-RLHF dataset, we equally divide the training dataset into three folds for supervised fine-tuning, reward model training, and reinforcement learning, respectively. (a fold-splitting sketch follows the table)
Hardware Specification | Yes | We train our models on TPU v4 (Jouppi et al., 2023).
Software Dependencies | No | The paper mentions models like PaLM 2 (Bison) and DeBERTa-Large, but does not provide specific version numbers for software libraries, frameworks, or programming languages used in the implementation.
Experiment Setup | Yes | In the supervised fine-tuning stage, we fine-tune M_SFT^PIT and M_SFT^P for one epoch to avoid overfitting and set the learning rate to 3e-5. ... The learning rate is set to 3e-4 in the reward model training. The learning rate of reinforcement learning is set to 1e-5. We set the context window to 512 for inputs and 512 for outputs for M_P, and the context window to 512 for inputs, 512 for reference outputs, and 512 for outputs for M_PIT. (a hyperparameter sketch follows the table)
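
Both datasets cited in the Open Datasets row are publicly mirrored on the Hugging Face Hub. The paper does not say how the authors actually obtained the data, so the following is only a minimal sketch assuming the `datasets` library and the public hub copies "Anthropic/hh-rlhf" and "openai/summarize_from_feedback":

    # Minimal sketch: loading the two public datasets named in the paper.
    # The hub IDs and the use of the `datasets` library are assumptions;
    # the paper does not state the authors' loading procedure.
    from datasets import load_dataset

    hh_rlhf = load_dataset("Anthropic/hh-rlhf")  # splits: train / test
    summary = load_dataset("openai/summarize_from_feedback", "comparisons")  # train / validation

    print(hh_rlhf)
    print(summary)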
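
The Dataset Splits row states that each training set is divided equally into three folds for supervised fine-tuning, reward model training, and reinforcement learning. A minimal sketch of such a split, assuming a Hugging Face `Dataset`; the shuffle seed and fold order are my assumptions, since the quoted text does not specify them:

    # Minimal sketch: equally dividing a training set into three folds for
    # supervised fine-tuning (SFT), reward model (RM) training, and
    # reinforcement learning (RL), per the Dataset Splits row.
    from datasets import Dataset

    def three_fold_split(train: Dataset, seed: int = 0):
        shuffled = train.shuffle(seed=seed)  # seed is an assumption
        n = len(shuffled)
        sft = shuffled.select(range(0, n // 3))
        rm = shuffled.select(range(n // 3, 2 * n // 3))
        rl = shuffled.select(range(2 * n // 3, n))
        return sft, rm, rl

    # Usage with the OpenAI/Summary training split (92.9K examples):
    # sft_data, rm_data, rl_data = three_fold_split(summary["train"])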
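
The Experiment Setup row fixes the learning rates and context windows for each training stage. A compact restatement as a config object: the values come from the quote, while the field names and the dataclass itself are illustrative, not the authors' code.

    # Hyperparameters quoted in the Experiment Setup row. M_P is the policy
    # model and M_PIT the implicit self-improvement model, following the
    # paper's notation; this container is a sketch, not the authors' code.
    from dataclasses import dataclass

    @dataclass
    class PITTrainingConfig:
        sft_epochs: int = 1           # one epoch "to avoid overfitting"
        sft_lr: float = 3e-5          # supervised fine-tuning
        rm_lr: float = 3e-4           # reward model training
        rl_lr: float = 1e-5           # reinforcement learning
        input_window: int = 512       # inputs, both M_P and M_PIT
        output_window: int = 512      # outputs, both models
        ref_output_window: int = 512  # reference outputs, M_PIT only

    config = PITTrainingConfig()
    print(config)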