Rewarded soups: towards Pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards

Authors: Alexandre Ramé, Guillaume Couairon, Corentin Dancette, Jean-Baptiste Gaya, Mustafa Shukor, Laure Soulier, Matthieu Cord

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate the effectiveness of our approach for text-to-text (summarization, Q&A, helpful assistant, review), text-image (image captioning, text-to-image generation, visual grounding), and control (locomotion) tasks.
Researcher Affiliation | Collaboration | (1) Sorbonne Université, CNRS, ISIR, Paris, France; (2) Meta AI; (3) Valeo.ai
Pseudocode | No | The paper describes procedures in text and figures but does not provide formal pseudocode or algorithm blocks.
Open Source Code | Yes | Implementations are released on github, and this website provides additional qualitative results.
Open Datasets | Yes | We conduct our experiments on COCO [94], with an ExpansionNet v2 [96] network and a Swin Transformer [97] visual encoder, initialized from the state-of-the-art weights of [96] optimized on CIDEr.
Dataset Splits | Yes | For computational efficiency, we keep only a dataset $D'$ containing the 50% images with the best scores, and rescale rewards $R$ linearly into $r$ so that $\min_{x' \in D'} r(x') = 0$ and $\frac{1}{|D'|} \sum_{x' \in D'} r(x') = 1$. (A rescaling sketch is given after the table.)
Hardware Specification | Yes | Hardware: NVIDIA RTX A6000, 49 GB
Software Dependencies | No | The paper mentions software like "trl package", "Adam optimizer", and "PPO" but does not specify their version numbers.
Experiment Setup | Yes | For RL training with PPO [84], we employ the trl package [85] and the setup from [86] with low-rank adapters (LoRA) [87] for efficiency. We first consider summarization [12, 17] tasks on two datasets: Reuter news [88] in Figures 1(b) and 2(a) and Reddit TL;DR [89] in Figure 2(b). (Hedged sketches of this setup and of the title's weight interpolation are given after the table.)
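
The Dataset Splits row quotes a filter-and-rescale step: keep the best-scoring 50% of the images, then rescale the rewards affinely so that the minimum maps to 0 and the mean maps to 1. Below is a minimal NumPy sketch of that step, assuming the raw rewards sit in a 1-D array; the function and variable names (`filter_and_rescale`, `keep_fraction`) are illustrative and not taken from the released code.

```python
import numpy as np

def filter_and_rescale(rewards: np.ndarray, keep_fraction: float = 0.5):
    """Keep the top `keep_fraction` of samples by reward, then rescale linearly
    so that min(r) = 0 and mean(r) = 1 (assumes the kept rewards are not all equal)."""
    # D': the best-scoring half of the dataset.
    cutoff = np.quantile(rewards, 1.0 - keep_fraction)
    keep_mask = rewards >= cutoff

    # Affine map R -> r: shift so the minimum is 0, then scale so the mean is 1.
    r = rewards[keep_mask] - rewards[keep_mask].min()
    r = r / r.mean()  # mean becomes 1; the minimum stays at 0
    return keep_mask, r
```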
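
The Experiment Setup row quotes PPO training through the trl package with LoRA adapters. The sketch below shows roughly how such a run is wired up with trl and peft; it is not the authors' released code, the backbone (`gpt2`), prompts, hyperparameters, and `toy_reward` are placeholders, and argument names follow the 2023-era trl API, which shifts between versions.

```python
# Hedged sketch: PPO fine-tuning through trl with LoRA adapters, one run per reward.
import torch
from peft import LoraConfig
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

model_name = "gpt2"  # placeholder backbone, chosen only to keep the sketch small

# Low-rank adapters for parameter-efficient RL fine-tuning, as quoted in the row.
lora_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                         bias="none", task_type="CAUSAL_LM")
model = AutoModelForCausalLMWithValueHead.from_pretrained(model_name,
                                                          peft_config=lora_config)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

ppo_config = PPOConfig(model_name=model_name, learning_rate=1.41e-5,
                       batch_size=2, mini_batch_size=1)  # illustrative values
ppo_trainer = PPOTrainer(config=ppo_config, model=model, tokenizer=tokenizer)

prompts = ["Summarize: the cat sat on the mat.",
           "Summarize: proximal policy optimization clips the policy ratio."]

def toy_reward(text: str) -> float:
    # Stand-in for one of the paper's learned reward models.
    return float(len(text.split()) < 20)

# One PPO step on a small batch; one run of this kind is launched per reward,
# and the resulting weights are what rewarded soups later interpolates.
query_tensors = [tokenizer(p, return_tensors="pt").input_ids.squeeze(0)
                 for p in prompts]
response_tensors = ppo_trainer.generate(query_tensors, return_prompt=False,
                                        max_new_tokens=32)
texts = tokenizer.batch_decode(response_tensors, skip_special_tokens=True)
rewards = [torch.tensor(toy_reward(t)) for t in texts]
ppo_trainer.step(query_tensors, response_tensors, rewards)
```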
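
The operation named in the title, interpolating weights fine-tuned on diverse rewards, is a convex combination of the checkpoints' parameters, $\hat{\theta} = \sum_i \lambda_i \theta_i$ with the $\lambda_i$ summing to 1. A minimal PyTorch sketch over saved state dicts follows; the function name and file paths are illustrative, and in the LoRA setting the same averaging can be applied to the adapter weights.

```python
import torch

def rewarded_soup(state_dicts, lambdas):
    """Convex combination of fine-tuned checkpoints: theta_hat = sum_i lambda_i * theta_i.
    Assumes all checkpoints share the same keys and floating-point parameters."""
    assert abs(sum(lambdas) - 1.0) < 1e-6, "interpolation weights should sum to 1"
    return {key: sum(lam * sd[key] for lam, sd in zip(lambdas, state_dicts))
            for key in state_dicts[0]}

# Illustrative usage with two reward-specialized runs (paths are placeholders):
# soup = rewarded_soup([torch.load("ppo_reward_1.pt"), torch.load("ppo_reward_2.pt")],
#                      lambdas=[0.5, 0.5])
# model.load_state_dict(soup)
```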