Decoding-time Realignment of Language Models

Authors: Tianlin Liu, Shangmin Guo, Leonardo Bianco, Daniele Calandriello, Quentin Berthet, Felipe Llinares-López, Jessica Hoffmann, Lucas Dixon, Michal Valko, Mathieu Blondel

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To empirically assess the effectiveness of DeRa, we investigate its ability to (i) qualitatively trade off between the reference and aligned models and (ii) guide the search for optimal KL regularization strength. To this end, we apply DeRa to a broad range of tasks, including summarization (Stiennon et al., 2020), hallucination mitigation, and dialogue (Tunstall et al., 2023a).
Researcher Affiliation | Collaboration | 1University of Basel, 2University of Edinburgh, 3Université Paris-Saclay, 4Google DeepMind, 5Google Research
Pseudocode | Yes | Algorithm 1: Decoding-time realignment (DeRa) sampling (a hedged sampling sketch follows this table).
Open Source Code | No | The paper mentions publicly available checkpoints such as Zephyr-7b (https://huggingface.co/HuggingFaceH4/zephyr-7b-beta) and references 'The alignment handbook' (Tunstall et al., 2023b), which is a code repository, but it does not state that the code for the proposed method, DeRa, is open-sourced, nor does it provide a direct link to the implementation.
Open Datasets | Yes | For this experiment, we use a pretrained T5-small model (Raffel et al., 2020) provided in the T5x framework (Roberts et al., 2022). We perform SFT on the XSum dataset (Narayan et al., 2018)... (An illustrative SFT sketch follows this table.)
Dataset Splits | Yes | The dataset used to train the reward model contains 1888 examples, which are split into training, validation, and evaluation datasets with 723, 242, and 223 examples respectively.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, processor types, or memory used for running its experiments.
Software Dependencies | No | The paper mentions software components like 'T5x framework' and 'LoRA', but does not provide specific version numbers for these or other software dependencies.
Experiment Setup | Yes | The policy learning rate is 5e-6, and the value function learning rate is 1e-5. (An optimizer sketch follows this table.)
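
Illustration for the Pseudocode row above: the paper's Algorithm 1 specifies DeRa sampling, whose exact steps are not reproduced in this summary. The following is a minimal Python sketch under the assumption that DeRa mixes the per-token log-probabilities of a reference (SFT) model and an aligned model with a weight lambda; the checkpoint paths, the function name `dera_sample`, and the decoding loop are illustrative, not the authors' implementation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoints: substitute an SFT (reference) model and its aligned
# counterpart; both are assumed to share the same tokenizer and vocabulary.
tokenizer = AutoTokenizer.from_pretrained("path/to/sft-model")
ref_model = AutoModelForCausalLM.from_pretrained("path/to/sft-model")
aligned_model = AutoModelForCausalLM.from_pretrained("path/to/aligned-model")

@torch.no_grad()
def dera_sample(prompt: str, lam: float = 0.5, max_new_tokens: int = 64) -> str:
    """Sample a continuation, mixing reference and aligned log-probs at every token.

    lam = 0 recovers the reference model, lam = 1 the aligned model; other values
    emulate different effective KL-regularization strengths at decoding time.
    """
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        ref_logprobs = ref_model(ids).logits[:, -1, :].log_softmax(dim=-1)
        aligned_logprobs = aligned_model(ids).logits[:, -1, :].log_softmax(dim=-1)
        # A per-token geometric mixture of two policies is a linear mixture of log-probs.
        mixed = (1.0 - lam) * ref_logprobs + lam * aligned_logprobs
        next_id = torch.multinomial(mixed.softmax(dim=-1), num_samples=1)
        ids = torch.cat([ids, next_id], dim=-1)
        if next_id.item() == tokenizer.eos_token_id:
            break
    return tokenizer.decode(ids[0], skip_special_tokens=True)
```

Sweeping lam over a grid at decoding time is one plausible way to carry out the "search for optimal KL regularization strength" mentioned in the Research Type row without retraining one model per strength.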
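
Illustration for the Open Datasets row above: the quoted experiment performs SFT of a pretrained T5-small on XSum within the T5x framework. The sketch below uses the Hugging Face datasets/transformers stack instead, purely for illustration; the dataset ID "EdinburghNLP/xsum", the hyperparameters, and the output directory are assumptions rather than the paper's configuration.

```python
from datasets import load_dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer, DataCollatorForSeq2Seq,
                          Seq2SeqTrainer, Seq2SeqTrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
dataset = load_dataset("EdinburghNLP/xsum")  # document/summary pairs

def preprocess(batch):
    # Prefix the document with T5's summarization prompt and tokenize the target summary.
    inputs = tokenizer(["summarize: " + doc for doc in batch["document"]],
                       max_length=512, truncation=True)
    labels = tokenizer(text_target=batch["summary"], max_length=64, truncation=True)
    inputs["labels"] = labels["input_ids"]
    return inputs

tokenized = dataset.map(preprocess, batched=True,
                        remove_columns=dataset["train"].column_names)

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="t5-small-xsum-sft",
                                  per_device_train_batch_size=8,
                                  num_train_epochs=1,
                                  learning_rate=1e-4),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```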
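
Illustration for the Experiment Setup row above: the two reported learning rates can be expressed as separate optimizer parameter groups. A minimal PyTorch sketch with hypothetical stand-in modules for the policy and the value head:

```python
import torch

# Hypothetical modules standing in for the policy network and the value head;
# the dimensions are placeholders, not the paper's architecture.
policy = torch.nn.Linear(512, 32000)
value_head = torch.nn.Linear(512, 1)

# Separate learning rates as reported: 5e-6 for the policy, 1e-5 for the value function.
optimizer = torch.optim.Adam([
    {"params": policy.parameters(), "lr": 5e-6},
    {"params": value_head.parameters(), "lr": 1e-5},
])
```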