Rewarded soups: towards Pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards
Authors: Alexandre Ramé, Guillaume Couairon, Corentin Dancette, Jean-Baptiste Gaya, Mustafa Shukor, Laure Soulier, Matthieu Cord
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the effectiveness of our approach for text-to-text (summarization, Q&A, helpful assistant, review), text-image (image captioning, text-to-image generation, visual grounding), and control (locomotion) tasks. |
| Researcher Affiliation | Collaboration | (1) Sorbonne Université, CNRS, ISIR, Paris, France; (2) Meta AI; (3) Valeo.ai |
| Pseudocode | No | The paper describes procedures in text and figures but does not provide formal pseudocode or algorithm blocks. |
| Open Source Code | Yes | Implementations are released on GitHub, and this website provides additional qualitative results. |
| Open Datasets | Yes | We conduct our experiments on COCO [94], with an ExpansionNet v2 [96] network and a Swin Transformer [97] visual encoder, initialized from the state-of-the-art weights of [96] optimized on CIDEr. |
| Dataset Splits | Yes | For computational efficiency, we keep only a dataset $\mathcal{D}$ containing the 50% of images with the best scores, and rescale rewards $R$ linearly into $r$ so that $\min_{x' \in \mathcal{D}} r(x') = 0$ and $\frac{1}{\lvert \mathcal{D} \rvert} \sum_{x' \in \mathcal{D}} r(x') = 1$. *(see the rescaling sketch below)* |
| Hardware Specification | Yes | Hardware: NVIDIA RTX A6000, 49 GB |
| Software Dependencies | No | The paper mentions software like "trl package", "Adam optimizer", and "PPO" but does not specify their version numbers. |
| Experiment Setup | Yes | For RL training with PPO [84], we employ the trl package [85] and the setup from [86] with low-rank adapters (LoRA) [87] for efficiency. We first consider summarization [12, 17] tasks on two datasets: Reuter news [88] in Figures 1(b) and 2(a) and Reddit TL;DR [89] in Figure 2(b). *(see the PPO sketch below)* |
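
The reward rescaling quoted in the Dataset Splits row filters the dataset to the best-scoring half and then applies a linear map so that the minimum reward over the kept subset is 0 and its mean is 1. A minimal NumPy sketch of such a rescaling; the function name and the `keep_fraction` parameter are illustrative, not from the paper:

```python
import numpy as np

def rescale_rewards(raw_rewards, keep_fraction=0.5):
    """Keep the top-scoring fraction of samples, then rescale linearly
    so that min r = 0 and mean r = 1 over the kept subset."""
    raw = np.asarray(raw_rewards, dtype=np.float64)
    cutoff = np.quantile(raw, 1.0 - keep_fraction)
    kept = raw[raw >= cutoff]          # the 50% of samples with the best scores
    shifted = kept - kept.min()        # minimum reward becomes 0
    return shifted / shifted.mean()    # mean reward becomes 1 (assumes non-constant rewards)
```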
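The Experiment Setup row describes RL fine-tuning with PPO through the trl package and LoRA adapters. A minimal sketch of that kind of setup, assuming the pre-1.0 trl `PPOTrainer` API together with the peft library; the model name and hyperparameter values here are placeholders, not the paper's configuration:

```python
from transformers import AutoTokenizer
from peft import LoraConfig
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

model_name = "gpt2"  # placeholder base model

# Low-rank adapters (LoRA) keep PPO fine-tuning memory-efficient.
lora_config = LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM")
model = AutoModelForCausalLMWithValueHead.from_pretrained(
    model_name, peft_config=lora_config
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

ppo_config = PPOConfig(model_name=model_name, learning_rate=1.41e-5, batch_size=8)
ppo_trainer = PPOTrainer(ppo_config, model, ref_model=None, tokenizer=tokenizer)

# One PPO step: query_tensors and response_tensors are lists of token-id
# tensors, rewards is a list of scalar tensors from one reward model.
# stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
```

Running this loop once per reward model produces one fine-tuned network per reward, which is the input to the weight interpolation sketched next.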
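The method named in the title, rewarded soups, linearly interpolates the weights of networks that were each fine-tuned on a different reward, all starting from the same pre-trained initialization. A minimal PyTorch sketch of that interpolation, assuming floating-point parameters and state dicts with matching keys (the function name is illustrative):

```python
import torch

def rewarded_soup(state_dicts, lambdas):
    """Interpolate weights of models fine-tuned on different rewards:
    theta_soup = sum_i lambda_i * theta_i, with the lambdas on the simplex.
    Assumes every model was fine-tuned from a shared initialization."""
    assert abs(sum(lambdas) - 1.0) < 1e-6, "coefficients must sum to 1"
    soup = {}
    for name in state_dicts[0]:
        soup[name] = sum(lam * sd[name] for lam, sd in zip(lambdas, state_dicts))
    return soup

# Example: uniform soup of two reward-specific fine-tunings.
# model.load_state_dict(rewarded_soup([sd_reward_1, sd_reward_2], [0.5, 0.5]))
```

Sweeping the interpolation coefficients between 0 and 1 yields a continuum of models trading off the rewards, which is how the paper approximates the Pareto front without training a separate model per preference.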