Rewarded soups: towards Pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards
Authors: Alexandre Ramé, Guillaume Couairon, Corentin Dancette, Jean-Baptiste Gaya, Mustafa Shukor, Laure Soulier, Matthieu Cord
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the effectiveness of our approach for text-to-text (summarization, Q&A, helpful assistant, review), text-image (image captioning, text-to-image generation, visual grounding), and control (locomotion) tasks. |
| Researcher Affiliation | Collaboration | (1) Sorbonne Université, CNRS, ISIR, Paris, France; (2) Meta AI; (3) Valeo.ai |
| Pseudocode | No | The paper describes procedures in text and figures but does not provide formal pseudocode or algorithm blocks. |
| Open Source Code | Yes | Implementations are released on GitHub, and this website provides additional qualitative results. |
| Open Datasets | Yes | We conduct our experiments on COCO [94], with an ExpansionNet v2 [96] network and a Swin Transformer [97] visual encoder, initialized from the state-of-the-art weights of [96] optimized on CIDEr. |
| Dataset Splits | Yes | For computational efficiency, we keep only a dataset $\mathcal{D}$ containing the 50% of images with the best scores, and rescale rewards $R$ linearly into $r$ so that $\min_{x' \in \mathcal{D}} r(x') = 0$ and $\frac{1}{\lvert \mathcal{D} \rvert} \sum_{x' \in \mathcal{D}} r(x') = 1$. *(see the rescaling sketch below)* |
| Hardware Specification | Yes | Hardware: NVIDIA RTX A6000, 49 GB |
| Software Dependencies | No | The paper mentions software like "trl package", "Adam optimizer", and "PPO" but does not specify their version numbers. |
| Experiment Setup | Yes | For RL training with PPO [84], we employ the trl package [85] and the setup from [86] with low-rank adapters (LoRA) [87] for efficiency. We first consider summarization [12, 17] tasks on two datasets: Reuter news [88] in Figures 1(b) and 2(a) and Reddit TL;DR [89] in Figure 2(b). *(see the PPO sketch below)* |
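
The reward rescaling quoted in the Dataset Splits row filters the dataset to the best-scoring half and then applies a linear map so that the minimum reward over the kept subset is 0 and its mean is 1. A minimal NumPy sketch of such a rescaling; the function name and the `keep_fraction` parameter are illustrative, not from the paper:

```python
import numpy as np

def rescale_rewards(raw_rewards, keep_fraction=0.5):
    """Keep the top-scoring fraction of samples, then rescale linearly
    so that min r = 0 and mean r = 1 over the kept subset."""
    raw = np.asarray(raw_rewards, dtype=np.float64)
    cutoff = np.quantile(raw, 1.0 - keep_fraction)
    kept = raw[raw >= cutoff]          # the 50% of samples with the best scores
    shifted = kept - kept.min()        # minimum reward becomes 0
    return shifted / shifted.mean()    # mean reward becomes 1 (assumes non-constant rewards)
```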
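The Experiment Setup row describes RL fine-tuning with PPO through the trl package and LoRA adapters. A minimal sketch of that kind of setup, assuming the pre-1.0 trl `PPOTrainer` API together with the peft library; the model name and hyperparameter values here are placeholders, not the paper's configuration:

```python
from transformers import AutoTokenizer
from peft import LoraConfig
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

model_name = "gpt2"  # placeholder base model

# Low-rank adapters (LoRA) keep PPO fine-tuning memory-efficient.
lora_config = LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM")
model = AutoModelForCausalLMWithValueHead.from_pretrained(
    model_name, peft_config=lora_config
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

ppo_config = PPOConfig(model_name=model_name, learning_rate=1.41e-5, batch_size=8)
ppo_trainer = PPOTrainer(ppo_config, model, ref_model=None, tokenizer=tokenizer)

# One PPO step: query_tensors and response_tensors are lists of token-id
# tensors, rewards is a list of scalar tensors from one reward model.
# stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
```

Running this loop once per reward model produces one fine-tuned network per reward, which is the input to the weight interpolation sketched next.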
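The method named in the title, rewarded soups, linearly interpolates the weights of networks that were each fine-tuned on a different reward, all starting from the same pre-trained initialization. A minimal PyTorch sketch of that interpolation, assuming floating-point parameters and state dicts with matching keys (the function name is illustrative):

```python
import torch

def rewarded_soup(state_dicts, lambdas):
    """Interpolate weights of models fine-tuned on different rewards:
    theta_soup = sum_i lambda_i * theta_i, with the lambdas on the simplex.
    Assumes every model was fine-tuned from a shared initialization."""
    assert abs(sum(lambdas) - 1.0) < 1e-6, "coefficients must sum to 1"
    soup = {}
    for name in state_dicts[0]:
        soup[name] = sum(lam * sd[name] for lam, sd in zip(lambdas, state_dicts))
    return soup

# Example: uniform soup of two reward-specific fine-tunings.
# model.load_state_dict(rewarded_soup([sd_reward_1, sd_reward_2], [0.5, 0.5]))
```

Sweeping the interpolation coefficients between 0 and 1 yields a continuum of models trading off the rewards, which is how the paper approximates the Pareto front without training a separate model per preference.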