Rewards-in-Context: Multi-objective Alignment of Foundation Models with Dynamic Preference Adjustment

Authors: Rui Yang, Xiaoman Pan, Feng Luo, Shuang Qiu, Han Zhong, Dong Yu, Jianshu Chen

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical evidence demonstrates the efficacy of our method in aligning both Large Language Models (LLMs) and diffusion models to accommodate diverse rewards with only around 10% GPU hours compared with multi-objective RL baseline. In this section, we aim to evaluate the performance of our RiC algorithm on two text generation tasks and one text-to-image task that involve diverse rewards. Furthermore, we will conduct ablation studies to analyze the individual contributions of the components within RiC.
Researcher Affiliation | Collaboration | 1 Tencent AI Lab, 2 The Hong Kong University of Science and Technology, 3 Peking University.
Pseudocode | No | The paper describes algorithmic steps in paragraph form, such as 'The online training stage consists of three steps...', but does not present them as structured pseudocode blocks or explicitly labeled algorithms.
Open Source Code | Yes | Code is available at https://github.com/YangRui2015/RiC
Open Datasets | Yes | The Helpful Assistant task uses the HH-RLHF dataset comprising 160k prompts and corresponding responses, annotated with human preferences. For this task, we utilize three open-sourced reward models on Huggingface (Wolf et al., 2020)... Regarding the Reddit Summary task, it consists of 14.9k posts and corresponding summaries... We utilize the Stable Diffusion v1.5 (Rombach et al., 2022) with 1B parameters as the base model and fine-tune with RiC on a random subset of LAION-5B (Schuhmann et al., 2022) with 120k images. Table 3: Anthropic/hh-rlhf (Bai et al., 2022) and openai/summarize_from_feedback (Stiennon et al., 2020). (See the dataset-loading sketch after this table.)
Dataset Splits | No | The paper refers to a 'training set' and 'test set' but does not specify validation splits or explicit split proportions for any of the datasets, beyond specific uses such as online training data augmentation.
Hardware Specification | Yes | Hardware: NVIDIA Tesla V100 32 GB
Software Dependencies | No | The paper mentions software like 'trl' and the Hugging Face diffusers library along with their respective foundational papers, but does not specify exact version numbers for these libraries or other critical software components (e.g., Python, PyTorch, CUDA versions).
Experiment Setup | Yes | Finetuning steps: 20000; Initial learning rate: 1.41e-4; Learning rate scheduler: Linear; ... Optimizer: Adam; Batch size: 8; ... LoRA r: 64; LoRA alpha: 128; LoRA dropout: 0.05; ... RL algorithm: PPO (Schulman et al., 2017); Implementation: trl (von Werra et al., 2020); KL regularization: 0.2; Epochs: 1; Learning rate: 1e-5; Lambda for GAE: 0.95; Gamma: 1; Clip range: 0.2; Number of optimization epochs per batch: 4; Target KL: 3. (A hedged configuration sketch of these hyperparameters follows below.)
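
The dataset identifiers quoted in the Open Datasets row correspond to entries on the Hugging Face Hub. The following is a minimal loading sketch, not taken from the RiC repository; in particular, the choice of the "comparisons" configuration for the summarization dataset is an assumption.

```python
# Minimal sketch (not from the authors' code) of loading the two open
# datasets named in the paper via the Hugging Face `datasets` library.
from datasets import load_dataset

# HH-RLHF: prompts and responses annotated with human preferences (Bai et al., 2022).
hh_rlhf = load_dataset("Anthropic/hh-rlhf")

# Reddit TL;DR summarization feedback (Stiennon et al., 2020).
# The "comparisons" configuration is assumed here; the dataset also ships an "axis" config.
summarize = load_dataset("openai/summarize_from_feedback", "comparisons")

print(hh_rlhf)      # DatasetDict with 'train' and 'test' splits
print(summarize)    # DatasetDict with 'train' and 'validation' splits
```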
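
For concreteness, the hyperparameters listed in the Experiment Setup row can be written as peft/trl configuration objects. This is an illustrative sketch against trl's classic PPOConfig interface (argument names differ in newer trl releases), not the authors' implementation; the mappings of "KL regularization 0.2" to init_kl_coef and "Target KL 3" to target, and the reuse of batch size 8 for PPO, are assumptions.

```python
# Illustrative sketch only: expressing the reported LoRA and PPO
# hyperparameters with peft and trl. Lines marked "assumed" are guesses
# about how the table entries map onto library arguments.
from peft import LoraConfig
from trl import PPOConfig

lora_config = LoraConfig(
    r=64,                # LoRA r
    lora_alpha=128,      # LoRA alpha
    lora_dropout=0.05,   # LoRA dropout
    task_type="CAUSAL_LM",
)

ppo_config = PPOConfig(
    learning_rate=1e-5,  # RL learning rate
    batch_size=8,        # assumed: table lists batch size 8 for fine-tuning
    ppo_epochs=4,        # number of optimization epochs per batch
    gamma=1.0,           # discount factor
    lam=0.95,            # lambda for GAE
    cliprange=0.2,       # PPO clip range
    init_kl_coef=0.2,    # assumed mapping of "KL regularization 0.2"
    target=3,            # assumed mapping of "Target KL 3" (adaptive KL target)
)
```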