Decoding-time Realignment of Language Models
Authors: Tianlin Liu, Shangmin Guo, Leonardo Bianco, Daniele Calandriello, Quentin Berthet, Felipe Llinares-López, Jessica Hoffmann, Lucas Dixon, Michal Valko, Mathieu Blondel
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To empirically assess the effectiveness of DeRa, we investigate its ability to (i) qualitatively trade off between the reference and aligned models and (ii) guide the search for optimal KL regularization strength. To this end, we apply DeRa to a broad range of tasks, including summarization (Stiennon et al., 2020), hallucination mitigation, and dialogue (Tunstall et al., 2023a). |
| Researcher Affiliation | Collaboration | ¹University of Basel, ²University of Edinburgh, ³Université Paris-Saclay, ⁴Google DeepMind, ⁵Google Research |
| Pseudocode | Yes | Algorithm 1: Decoding-time realignment (DeRa) sampling (a code sketch of this sampling loop is given below the table) |
| Open Source Code | No | The paper mentions publicly available checkpoints for models like Zephyr-7b (https://huggingface.co/HuggingFaceH4/zephyr-7b-beta) and references 'The alignment handbook' (Tunstall et al., 2023b), which is a code repository. However, it does not explicitly state that the code for the proposed method, DeRa, is open-sourced, nor does it provide a direct link to an implementation. |
| Open Datasets | Yes | For this experiment, we use a pretrained T5-small model (Raffel et al., 2020) provided in the T5x framework (Roberts et al., 2022). We perform SFT on the XSum dataset (Narayan et al., 2018)... |
| Dataset Splits | Yes | The dataset used to train the reward model contains 1888 examples, which are split into training, validation, and evaluation datasets with 723, 242, and 223 examples respectively. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, processor types, or memory used for running its experiments. |
| Software Dependencies | No | The paper mentions software components like 'T5x framework' and 'LoRA', but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | The policy learning rate is 5e-6, and the value function learning rate is 1e-5. |
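The paper presents DeRa only as pseudocode (Algorithm 1). Below is a minimal PyTorch sketch of the per-token sampling loop that algorithm describes, not the authors' implementation: at each decoding step, the log-probabilities of the reference (SFT) model and the aligned model are interpolated with a mixing weight λ, which the paper shows approximates decoding from a model aligned with KL strength β/λ. The function name `dera_sample`, its signature, and the assumption of two HuggingFace-style causal LMs that return `.logits` are illustrative choices, not from the paper.

```python
import torch

def dera_sample(ref_model, aligned_model, input_ids, lam,
                max_new_tokens=64, eos_token_id=None):
    """Sketch of DeRa-style sampling (cf. Algorithm 1 in the paper).

    At each step, sample from the distribution proportional to
    p_ref^(1 - lam) * p_aligned^lam, i.e. a per-token geometric mixture,
    implemented as interpolation of log-probabilities. Assumes batch size 1.
    """
    ids = input_ids
    for _ in range(max_new_tokens):
        with torch.no_grad():
            ref_logits = ref_model(ids).logits[:, -1, :]
            aligned_logits = aligned_model(ids).logits[:, -1, :]
        # Interpolate *log-probabilities* (not raw logits), so the result is
        # exactly the normalized geometric mixture of the two distributions.
        log_p = (1.0 - lam) * torch.log_softmax(ref_logits, dim=-1) \
              + lam * torch.log_softmax(aligned_logits, dim=-1)
        probs = torch.softmax(log_p, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        ids = torch.cat([ids, next_token], dim=-1)
        if eos_token_id is not None and next_token.item() == eos_token_id:
            break
    return ids
```

Sweeping `lam` between 0 (pure reference model) and 1 (pure aligned model) is how DeRa explores different effective regularization strengths at decoding time, without retraining a model for each KL strength.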