Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Rewarded soups: towards Pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards
Authors: Alexandre Rame, Guillaume Couairon, Corentin Dancette, Jean-Baptiste Gaya, Mustafa Shukor, Laure Soulier, Matthieu Cord
NeurIPS 2023 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the effectiveness of our approach for text-to-text (summarization, Q&A, helpful assistant, review), text-image (image captioning, text-to-image generation, visual grounding), and control (locomotion) tasks. |
| Researcher Affiliation | Collaboration | 1Sorbonne Université, CNRS, ISIR, Paris, France 2Meta AI 3Valeo.ai |
| Pseudocode | No | The paper describes procedures in text and figures but does not provide formal pseudocode or algorithm blocks. |
| Open Source Code | Yes | Implementations are released on github, and this website provides additional qualitative results. |
| Open Datasets | Yes | We conduct our experiments on COCO [94], with an ExpansionNetv2 [96] network and a Swin Transformer [97] visual encoder, initialized from the state-of-the-art weights of [96] optimized on CIDEr. |
| Dataset Splits | Yes | For computational efficiency, we keep only a dataset D containing the 50% images with the best scores, and rescale rewards R linearly into r so that $\min_{x_0 \in D} r(x_0) = 0$ and $\frac{1}{\lvert D \rvert} \sum_{x_0 \in D} r(x_0) = 1$. |
| Hardware Specification | Yes | Hardware: NVIDIA RTX A6000, 49 GB |
| Software Dependencies | No | The paper mentions software like "trl package", "Adam optimizer", and "PPO" but does not specify their version numbers. |
| Experiment Setup | Yes | For RL training with PPO [84], we employ the trl package [85] and the setup from [86] with low-rank adapters (LoRA) [87] for efficiency. We first consider summarization [12, 17] tasks on two datasets: Reuter news [88] in Figures 1(b) and 2(a) and Reddit TL;DR [89] in Figure 2(b). |
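
The filtering and rescaling quoted in the Dataset Splits row can be illustrated with a minimal sketch. It assumes raw rewards arrive as a plain list of floats; the function name `filter_and_rescale` and the parameter `keep_fraction` are hypothetical and do not come from the paper or its released code. The sketch keeps the best-scoring half, then applies a linear map so that the minimum kept reward is 0 and the mean kept reward is 1.

```python
def filter_and_rescale(raw_rewards, keep_fraction=0.5):
    """Keep the top-scoring fraction of samples and linearly rescale their rewards.

    Hypothetical helper: returns (kept_indices, rescaled_rewards) where the
    rescaled rewards have minimum 0 and mean 1 over the kept subset.
    """
    if not raw_rewards:
        raise ValueError("raw_rewards must be non-empty")

    # Rank sample indices from best to worst raw reward.
    n_keep = max(1, int(len(raw_rewards) * keep_fraction))
    order = sorted(range(len(raw_rewards)),
                   key=lambda i: raw_rewards[i], reverse=True)
    kept = order[:n_keep]

    # Shift so the smallest kept reward becomes 0.
    lowest = min(raw_rewards[i] for i in kept)
    shifted = [raw_rewards[i] - lowest for i in kept]

    # Scale so the mean kept reward becomes 1.
    mean_shifted = sum(shifted) / len(shifted)
    if mean_shifted == 0:
        raise ValueError("all kept rewards are equal; cannot rescale to mean 1")
    return kept, [s / mean_shifted for s in shifted]


kept, r = filter_and_rescale([1.0, 5.0, 3.0, 9.0, 7.0, 2.0])
print(kept, r)
```

Because the map is a shift followed by a positive scaling, the ranking of the kept samples is unchanged; only the scale of the reward is normalized.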