Decoding-Time Language Model Alignment with Multiple Objectives
Authors: Ruizhe Shi, Yifang Chen, Yushi Hu, Alisa Liu, Hannaneh Hajishirzi, Noah A. Smith, Simon S. Du
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical results demonstrate the effectiveness of the algorithm. For example, compared to a parameter-merging baseline, MOD achieves 12.8% overall reward improvement when equally optimizing towards 3 objectives. Moreover, we experiment with MOD on combining three fully-finetuned LMs of different model sizes, each aimed at different objectives such as safety, coding, and general user preference. Unlike traditional methods that require careful curation of a mixture of datasets to achieve comprehensive improvement, we can quickly experiment with preference weightings using MOD to find the best combination of models. Our best combination reduces toxicity on Toxigen to nearly 0% and achieves 7.9%-33.3% improvement across three other metrics (i.e., Codex@1, GSM-COT, BBH-COT). |
| Researcher Affiliation | Collaboration | Ruizhe Shi (1), Yifang Chen (2), Yushi Hu (2,3), Alisa Liu (2), Hannaneh Hajishirzi (2,3), Noah A. Smith (2,3), Simon S. Du (2). Affiliations: (1) IIIS, Tsinghua University; (2) University of Washington; (3) Allen Institute for AI |
| Pseudocode | Yes | Appendix C (Main Algorithm), C.1 Pipeline. Data: alphabet set Σ, prompt x0, number of beams K, maximum length L, divergence function f, preference weightings w ∈ Δ_{M-1}, and policies π_ref, π_1, π_2, ..., π_M. Result: optimal sequence of tokens. Initialize S_queue ← {(seq: bos, f-score: 0)}, S_next ← ∅, S_completed ← ∅. For d = 1 to L: for each s ∈ S_queue, if s.seq[-1] = eos or d = L, add s to S_completed and continue; otherwise, for each t ∈ Σ, set y ← cat(s.seq, t), compute v ← π_ref(y\|x0) · (f')^(-1)( Σ_{i=1}^{M} w_i · f'( π_i(y\|x0) / π_ref(y\|x0) ) ), and add (seq: y, f-score: v) to S_successors; then merge S_successors into S_next. After processing the queue, sort S_next by descending f-score, set S_queue ← top-k(S_next, K), and reset S_next ← ∅. Return the sequence with the highest f-score in S_completed. (A hedged Python sketch of this pipeline appears below the table.) |
| Open Source Code | Yes | We release the code at https://github.com/srzer. |
| Open Datasets | Yes | For Reddit Summary, we adopt the Summarize-from-Feedback dataset (https://huggingface.co/datasets/openai/summarize_from_feedback); for Helpful Assistant, we adopt the Anthropic-HH dataset (https://huggingface.co/datasets/Anthropic/hh-rlhf); for Safety Alignment, we adopt a 10k subset (https://huggingface.co/datasets/PKU-Alignment/PKU-SafeRLHF-10K); for HelpSteer, we adopt the HelpSteer dataset (https://huggingface.co/datasets/nvidia/HelpSteer). |
| Dataset Splits | Yes | For Reddit Summary and Helpful Assistant, we uniformly sample a subset of 2k prompts from the test set, following [56]; for Safety Alignment and HelpSteer, we randomly sample a subset of 200 prompts from the validation set. |
| Hardware Specification | Yes | Compute resources. Our main experiments are conducted on an NVIDIA RTX A6000. For training RLHF and MORLHF models, the number of workers is set to 3, each taking up 20,000 MB of memory and running for 18 hours; for training DPO and MODPO models, the number of workers is set to 2, each taking up 40,000 MB of memory and running for 3 hours. |
| Software Dependencies | No | Our codebase is mainly based on trl [46] (https://github.com/huggingface/trl), MODPO [62] (https://github.com/ZHZiSZZ/modpo), RiC [56] (https://github.com/YangRui2015/RiC), and Fine-Grained RLHF [52] (https://github.com/allenai/FineGrainedRLHF), and has referred to f-divergence DPO [47] (https://github.com/alecwangcq/f-divergence-dpo), PackLLM [32] (https://github.com/cmavro/PackLLM), and DPA [48] (https://github.com/Haoxiang-Wang/directional-preference-alignment). We release the code at https://github.com/srzer. While specific libraries are mentioned, no version numbers are provided. |
| Experiment Setup | Yes | Training hyper-parameters. For PPO, we follow the settings of [56] and train for 100 batches; for DPO, we follow [62] with minimal modifications, setting BATCH_SIZE = 1 and MAX_LENGTH = 256. Inference hyper-parameters. For PPO, we follow the settings of [56] with NUM_BEAMS = 1; for DPO, we follow [62] with BATCH_SIZE = 4, MAX_LENGTH = 200, and NUM_BEAMS = 1. |
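
For illustration, below is a minimal Python sketch of the beam-search pipeline quoted under "Pseudocode" above. The toy unigram policies, the `policy(prompt, seq)` interface, the tiny vocabulary, and the choice of reverse-KL divergence f(x) = x log x (so that f'(x) = log x + 1 and (f')^{-1}(y) = exp(y - 1)) are assumptions made for this sketch; they are not taken from the paper's released code.

```python
# Minimal sketch of the MOD beam-search pipeline (Appendix C of the paper),
# written against toy next-token policies rather than real LMs.
import math
from dataclasses import dataclass

BOS, EOS = "<bos>", "<eos>"

@dataclass
class Candidate:
    seq: list              # token sequence, starting with BOS
    f_score: float = 0.0   # combined score v from the pseudocode

def mod_beam_search(vocab, prompt, policies, weights, ref_policy,
                    f_prime, f_prime_inv, num_beams=4, max_len=8):
    """Beam search scoring each candidate y against the prompt x0 as
    v = pi_ref(y|x0) * (f')^{-1}( sum_i w_i * f'( pi_i(y|x0) / pi_ref(y|x0) ) )."""
    queue = [Candidate(seq=[BOS])]
    completed = []
    for depth in range(1, max_len + 1):
        successors = []
        for cand in queue:
            # Finished (or length-capped) candidates move to the completed pool.
            if cand.seq[-1] == EOS or depth == max_len:
                completed.append(cand)
                continue
            for tok in vocab:
                y = cand.seq + [tok]
                p_ref = ref_policy(prompt, y)
                ratios = [pi(prompt, y) / p_ref for pi in policies]
                v = p_ref * f_prime_inv(sum(w * f_prime(r)
                                            for w, r in zip(weights, ratios)))
                successors.append(Candidate(seq=y, f_score=v))
        # Keep only the top-K successors for the next step.
        successors.sort(key=lambda c: c.f_score, reverse=True)
        queue = successors[:num_beams]
        if not queue:
            break
    return max(completed, key=lambda c: c.f_score).seq

# Reverse-KL divergence f(x) = x*log(x): f'(x) = log(x) + 1, (f')^{-1}(y) = exp(y - 1).
f_prime = lambda x: math.log(x) + 1.0
f_prime_inv = lambda y: math.exp(y - 1.0)

def make_unigram_policy(token_probs):
    """Toy stand-in for an LM: P(y|x0) is a product of per-token probabilities."""
    def policy(prompt, seq):
        p = 1.0
        for tok in seq[1:]:  # skip BOS
            p *= token_probs.get(tok, 1e-6)
        return p
    return policy

# Hypothetical policies for two objectives plus a shared reference model.
vocab = ["good", "safe", EOS]
pi_ref = make_unigram_policy({"good": 0.4, "safe": 0.4, EOS: 0.2})
pi_help = make_unigram_policy({"good": 0.7, "safe": 0.1, EOS: 0.2})
pi_safe = make_unigram_policy({"good": 0.1, "safe": 0.7, EOS: 0.2})

print(mod_beam_search(vocab, "x0", [pi_help, pi_safe], [0.5, 0.5],
                      pi_ref, f_prime, f_prime_inv, num_beams=3, max_len=4))
```

With the reverse-KL choice of f and weights summing to 1, the score v reduces to the weighted geometric mean of the per-policy probabilities, which is the familiar KL-regularized special case; other f-divergences simply swap in different f' and (f')^{-1} functions.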