Decoding-time Realignment of Language Models

Authors: Tianlin Liu, Shangmin Guo, Leonardo Bianco, Daniele Calandriello, Quentin Berthet, Felipe Llinares-López, Jessica Hoffmann, Lucas Dixon, Michal Valko, Mathieu Blondel

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To empirically assess the effectiveness of DeRa, we investigate its ability to (i) qualitatively trade off between the reference and aligned models and (ii) guide the search for optimal KL regularization strength. To this end, we apply DeRa to a broad range of tasks, including summarization (Stiennon et al., 2020), hallucination mitigation, and dialogue (Tunstall et al., 2023a).
Researcher Affiliation | Collaboration | 1University of Basel, 2University of Edinburgh, 3Université Paris-Saclay, 4Google DeepMind, 5Google Research
Pseudocode | Yes | Algorithm 1: Decoding-time realignment (DeRa) sampling (a hedged sampling sketch follows this table).
Open Source Code | No | The paper mentions publicly available checkpoints such as Zephyr-7b (https://huggingface.co/HuggingFaceH4/zephyr-7b-beta) and references 'The alignment handbook' (Tunstall et al., 2023b), which is a code repository, but it does not state that the code for the proposed method, DeRa, is open-sourced, nor does it provide a direct link to the implementation.
Open Datasets | Yes | For this experiment, we use a pretrained T5-small model (Raffel et al., 2020) provided in the T5x framework (Roberts et al., 2022). We perform SFT on the XSum dataset (Narayan et al., 2018)... (An illustrative SFT sketch follows this table.)
Dataset Splits | Yes | The dataset used to train the reward model contains 1888 examples, which are split into training, validation, and evaluation datasets with 723, 242, and 223 examples respectively.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, processor types, or memory used for running its experiments.
Software Dependencies | No | The paper mentions software components like 'T5x framework' and 'LoRA', but does not provide specific version numbers for these or other software dependencies.
Experiment Setup | Yes | The policy learning rate is 5e-6, and the value function learning rate is 1e-5. (An optimizer sketch follows this table.)
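
Illustration for the Pseudocode row above: the paper's Algorithm 1 specifies DeRa sampling, whose exact steps are not reproduced in this summary. The following is a minimal Python sketch under the assumption that DeRa mixes the per-token log-probabilities of a reference (SFT) model and an aligned model with a weight lambda; the checkpoint paths, the function name `dera_sample`, and the decoding loop are illustrative, not the authors' implementation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoints: substitute an SFT (reference) model and its aligned
# counterpart; both are assumed to share the same tokenizer and vocabulary.
tokenizer = AutoTokenizer.from_pretrained("path/to/sft-model")
ref_model = AutoModelForCausalLM.from_pretrained("path/to/sft-model")
aligned_model = AutoModelForCausalLM.from_pretrained("path/to/aligned-model")

@torch.no_grad()
def dera_sample(prompt: str, lam: float = 0.5, max_new_tokens: int = 64) -> str:
    """Sample a continuation, mixing reference and aligned log-probs at every token.

    lam = 0 recovers the reference model, lam = 1 the aligned model; other values
    emulate different effective KL-regularization strengths at decoding time.
    """
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        ref_logprobs = ref_model(ids).logits[:, -1, :].log_softmax(dim=-1)
        aligned_logprobs = aligned_model(ids).logits[:, -1, :].log_softmax(dim=-1)
        # A per-token geometric mixture of two policies is a linear mixture of log-probs.
        mixed = (1.0 - lam) * ref_logprobs + lam * aligned_logprobs
        next_id = torch.multinomial(mixed.softmax(dim=-1), num_samples=1)
        ids = torch.cat([ids, next_id], dim=-1)
        if next_id.item() == tokenizer.eos_token_id:
            break
    return tokenizer.decode(ids[0], skip_special_tokens=True)
```

Sweeping lam over a grid at decoding time is one plausible way to carry out the "search for optimal KL regularization strength" mentioned in the Research Type row without retraining one model per strength.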
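
Illustration for the Open Datasets row above: the quoted experiment performs SFT of a pretrained T5-small on XSum within the T5x framework. The sketch below uses the Hugging Face datasets/transformers stack instead, purely for illustration; the dataset ID "EdinburghNLP/xsum", the hyperparameters, and the output directory are assumptions rather than the paper's configuration.

```python
from datasets import load_dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer, DataCollatorForSeq2Seq,
                          Seq2SeqTrainer, Seq2SeqTrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
dataset = load_dataset("EdinburghNLP/xsum")  # document/summary pairs

def preprocess(batch):
    # Prefix the document with T5's summarization prompt and tokenize the target summary.
    inputs = tokenizer(["summarize: " + doc for doc in batch["document"]],
                       max_length=512, truncation=True)
    labels = tokenizer(text_target=batch["summary"], max_length=64, truncation=True)
    inputs["labels"] = labels["input_ids"]
    return inputs

tokenized = dataset.map(preprocess, batched=True,
                        remove_columns=dataset["train"].column_names)

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="t5-small-xsum-sft",
                                  per_device_train_batch_size=8,
                                  num_train_epochs=1,
                                  learning_rate=1e-4),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```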
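
Illustration for the Experiment Setup row above: the two reported learning rates can be expressed as separate optimizer parameter groups. A minimal PyTorch sketch with hypothetical stand-in modules for the policy and the value head:

```python
import torch

# Hypothetical modules standing in for the policy network and the value head;
# the dimensions are placeholders, not the paper's architecture.
policy = torch.nn.Linear(512, 32000)
value_head = torch.nn.Linear(512, 1)

# Separate learning rates as reported: 5e-6 for the policy, 1e-5 for the value function.
optimizer = torch.optim.Adam([
    {"params": policy.parameters(), "lr": 5e-6},
    {"params": value_head.parameters(), "lr": 1e-5},
])
```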