Rewards-in-Context: Multi-objective Alignment of Foundation Models with Dynamic Preference Adjustment
Authors: Rui Yang, Xiaoman Pan, Feng Luo, Shuang Qiu, Han Zhong, Dong Yu, Jianshu Chen
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical evidence demonstrates the efficacy of our method in aligning both Large Language Models (LLMs) and diffusion models to accommodate diverse rewards with only around 10% GPU hours compared with multi-objective RL baseline. In this section, we aim to evaluate the performance of our RiC algorithm on two text generation tasks and one text-to-image task that involve diverse rewards. Furthermore, we will conduct ablation studies to analyze the individual contributions of the components within RiC. |
| Researcher Affiliation | Collaboration | 1Tencent AI Lab 2The Hong Kong University of Science and Technology 3Peking University. |
| Pseudocode | No | The paper describes algorithmic steps in paragraph form, such as 'The online training stage consists of three steps...', but does not present them as structured pseudocode blocks or explicitly labeled algorithms. |
| Open Source Code | Yes | Code is available at https://github.com/YangRui2015/RiC |
| Open Datasets | Yes | The Helpful Assistant task uses the HH-RLHF dataset comprising 160k prompts and corresponding responses, annotated with human preferences. For this task, we utilize three open-sourced reward models on Huggingface (Wolf et al., 2020)... Regarding the Reddit Summary task, it consists of 14.9k posts and corresponding summaries... We utilize the Stable Diffusion v1.5 (Rombach et al., 2022) with 1B parameters as the base model and fine-tune with RiC on a random subset of LAION-5B (Schuhmann et al., 2022) with 120k images. Table 3: Anthropic/hh-rlhf (Bai et al., 2022) and openai/summarize_from_feedback (Stiennon et al., 2020). (A hedged dataset-loading sketch follows the table.) |
| Dataset Splits | No | The paper refers to 'training set' and 'test set' but does not specify validation dataset splits or explicit proportions for any of the datasets used for general reproducibility beyond specific uses like online training data augmentation. |
| Hardware Specification | Yes | Hardware NVIDIA Tesla V100 32 GB |
| Software Dependencies | No | The paper mentions software like 'trl' and 'Hugging Face diffusers library' along with their respective foundational papers, but does not specify exact version numbers for these libraries or other critical software components (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | Finetuning steps 20000; Initial learning rate 1.41e-4; Learning rate scheduler Linear; ... Optimizer Adam; Batch size 8; ... LoRA r 64; LoRA alpha 128; LoRA dropout 0.05; ... RL algorithm PPO (Schulman et al., 2017); Implementation trl (von Werra et al., 2020); KL regularization 0.2; Epochs 1; Learning rate 1e-5; Lambda for GAE 0.95; Gamma 1; Cliprange 0.2; Number of optimisation epochs per batch 4; Target KL 3. (A hedged LoRA/PPO configuration sketch follows the table.) |
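
For reference, both text-generation datasets quoted in the Open Datasets row are hosted on the Hugging Face Hub. A minimal loading sketch with the `datasets` library (the library version and the choice of config are assumptions, since the paper does not pin software versions):

```python
from datasets import load_dataset

# Helpful Assistant task: Anthropic HH-RLHF (~160k prompts with chosen/rejected responses).
hh = load_dataset("Anthropic/hh-rlhf")  # splits: "train", "test"

# Reddit Summary task: openai/summarize_from_feedback (Stiennon et al., 2020).
# The "comparisons" config (an assumption about which config the authors used) holds preference pairs.
summ = load_dataset("openai/summarize_from_feedback", "comparisons")

print(hh["train"][0].keys())    # dict_keys(['chosen', 'rejected'])
print(summ["train"][0].keys())  # includes 'info', 'summaries', 'choice'
```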
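
The hyperparameters listed in the Experiment Setup row map naturally onto `peft`'s `LoraConfig` and `trl`'s `PPOConfig`. A hedged sketch assuming trl ~0.7.x argument names (the paper does not state library versions, and the mapping of "Target KL 3" onto a specific trl argument is a guess):

```python
from peft import LoraConfig
from trl import PPOConfig

# LoRA settings from the paper's fine-tuning block (r=64, alpha=128, dropout=0.05).
lora_config = LoraConfig(
    r=64,
    lora_alpha=128,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

# PPO settings reported for the trl-based PPO runs.
# Argument names follow trl ~0.7.x and may differ in other releases.
ppo_config = PPOConfig(
    learning_rate=1e-5,   # "learning rate 1e-5"
    ppo_epochs=4,         # "Number of optimisation epochs per batch 4"
    init_kl_coef=0.2,     # "KL regularization 0.2"
    lam=0.95,             # "lambda for GAE 0.95"
    gamma=1.0,            # "gamma 1"
    cliprange=0.2,        # "cliprange 0.2"
    target_kl=3.0,        # "Target KL 3" (assumed to be trl's early-stopping threshold)
)
```

These config objects would then be handed to trl's `PPOTrainer` together with the base model, tokenizer, and reward models; that wiring is omitted here.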