Chain of Hindsight Aligns Language Models with Feedback

Authors: Hao Liu, Carmelo Sferrazza, Pieter Abbeel

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Applying our method to large language models, we observed that Chain of Hindsight significantly surpasses previous methods in aligning language models with human preferences. We report significant improvements on summarization and dialogue benchmarks, with our approach markedly preferred in human evaluations.
Researcher Affiliation | Academia | Hao Liu (UC Berkeley, hao.liu@berkeley.edu); Carmelo Sferrazza (UC Berkeley, csferrazza@berkeley.edu); Pieter Abbeel (UC Berkeley, pabbeel@cs.berkeley.edu)
Pseudocode | Yes | Algorithm 1: Aligning language models from feedback with Chain of Hindsight (a sketch of the data-construction step follows the table).
Open Source Code | Yes | https://github.com/lhao499/chain-of-hindsight
Open Datasets | Yes | WebGPT: the WebGPT comparisons dataset (Nakano et al., 2021; https://huggingface.co/datasets/openai/webgpt_comparisons) includes a total of 19,578 comparisons, where each example comprises a question, a pair of model answers, and metadata; the answers are rated by humans with a preference score, which identifies the better of the two answers. HH: Anthropic's Helpful and Harmless (HH) dataset (Ganguli et al., 2022; Bai et al., 2022a; https://huggingface.co/datasets/Anthropic/hh-rlhf) contains human-rated dialogues; each example consists of a pair of conversations between a human and a language model, one of which is labeled as preferred by human labelers. Summarization: the summarization dataset (Stiennon et al., 2020; https://huggingface.co/datasets/openai/summarize_from_feedback) consists of human feedback on model-generated summaries, where evaluators were asked to choose the better of two candidate summaries. (A loading example follows the table.)
Dataset Splits | Yes | We evaluate the performance on the validation set. We also evaluate on the validation split of Anthropic's Helpful and Harmless (HH) dataset (Ganguli et al., 2022; Bai et al., 2022a).
Hardware Specification | Yes | We thank Google TPU Research Cloud for granting us access to TPUs.
Software Dependencies | No | The paper mentions using the Adam optimizer and base pretrained models (GPT-J 6B, OPT) but does not provide specific version numbers for any software libraries or dependencies (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup | Yes | All models are trained with the Adam optimizer (Kingma and Ba, 2014), with β1 = 0.9, β2 = 0.95, and an epsilon of 1.0e-8. The batch size for human feedback data is set to 512, while for pretraining data it is set to 2048. The value of λ is 1.5, which determines the relative strength of gradients from the human feedback dataset and the pretraining dataset. (An optimizer sketch follows the table.)
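
The Pseudocode row above cites Algorithm 1 but the table only quotes its caption. Below is a minimal Python sketch of the core data-construction step behind Chain of Hindsight, assuming a simplified two-answer template: the feedback phrases, argument names, and masking scheme are illustrative placeholders, not the authors' exact implementation (the released repository contains that).

```python
# Minimal sketch of Chain-of-Hindsight sequence construction.
# Template wording, field names, and masking are assumptions for illustration.

def build_coh_example(question, better, worse, tokenizer):
    """Chain a preferred and a dispreferred answer behind hindsight feedback
    phrases, and build a loss mask so the loss is only taken on answer tokens."""
    # Hypothetical feedback phrases; the paper samples from a set of templates.
    good_prefix = " A good answer: "
    bad_prefix = " A bad answer: "

    segments = [
        (question, False),     # no loss on the prompt
        (good_prefix, False),  # no loss on feedback tokens
        (better, True),        # loss on the preferred answer
        (bad_prefix, False),
        (worse, True),         # loss on the dispreferred answer
    ]

    input_ids, loss_mask = [], []
    for text, compute_loss in segments:
        ids = tokenizer.encode(text, add_special_tokens=False)
        input_ids.extend(ids)
        loss_mask.extend([1 if compute_loss else 0] * len(ids))
    return input_ids, loss_mask
```

The resulting sequences are then fine-tuned with a standard language-modeling objective, with the mask zeroing out the prompt and feedback tokens.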
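The three feedback datasets in the Open Datasets row are hosted on the Hugging Face Hub under the IDs quoted above, so they can be pulled with the `datasets` library. The config and split names below are assumptions about how the Hub versions are organized (for example, `hh-rlhf` ships `train`/`test` splits rather than an explicit validation split).

```python
from datasets import load_dataset

# Dataset IDs are taken from the table above; config/split names are assumed.
webgpt = load_dataset("openai/webgpt_comparisons", split="train")
hh = load_dataset("Anthropic/hh-rlhf")  # exposes "train" and "test" splits
summ = load_dataset("openai/summarize_from_feedback", "comparisons", split="validation")

print(len(webgpt), webgpt.column_names)
```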
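The Experiment Setup row fixes the Adam hyperparameters, the two batch sizes, and λ = 1.5, but not the learning rate or the exact form of the combined objective. The PyTorch sketch below shows one plausible reading in which λ weights the pretraining-data loss against the human-feedback loss; the model, data, and learning rate are placeholders, not values from the paper.

```python
import torch
import torch.nn as nn

# Toy stand-in for the language model; only the quoted hyperparameters are real.
model = nn.Linear(16, 16)

optimizer = torch.optim.Adam(
    model.parameters(),
    lr=2e-5,            # assumed; the quoted setup does not give a learning rate
    betas=(0.9, 0.95),  # β1 = 0.9, β2 = 0.95 as reported
    eps=1e-8,           # ε = 1.0e-8 as reported
)

lambda_pt = 1.5  # reported weight between feedback-data and pretraining-data gradients

def lm_loss(batch):
    """Placeholder for a token-level cross-entropy language-modeling loss."""
    return model(batch).pow(2).mean()

feedback_batch = torch.randn(512, 16)    # batch size 512 for human-feedback data
pretrain_batch = torch.randn(2048, 16)   # batch size 2048 for pretraining data

# Assumed combination: λ scales the pretraining term relative to the feedback term.
loss = lm_loss(feedback_batch) + lambda_pt * lm_loss(pretrain_batch)
loss.backward()
optimizer.step()
optimizer.zero_grad()
```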