Improving Context-Aware Preference Modeling for Language Models
Authors: Silviu Pitis, Ziang Xiao, Nicolas Le Roux, Alessandro Sordoni
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, we conduct experiments to benchmark the context-specific performance of various models and investigate the potential value of context-aware preference modeling. |
| Researcher Affiliation | Collaboration | Silviu Pitis (University of Toronto, Microsoft Research), Ziang Xiao (Johns Hopkins University), Nicolas Le Roux (Microsoft Research, MILA), Alessandro Sordoni (Microsoft Research, MILA) |
| Pseudocode | No | The following template is used for llm-as-a-judge models (Llama 3 and GPT-4 Turbo). Llama-3 uses the logit_template (see Appendix D.1 for how the score is computed). GPT-4 Turbo uses the argmax_score_template_no_cot and runs inference with temperature = 0. *(A hedged sketch of logit-based scoring appears below the table.)* |
| Open Source Code | No | Unfortunately we are unable to release code, but are happy to clarify any details over email. |
| Open Datasets | Yes | We open-source high quality context-conditioned preference datasets that disentangle context-specific preference from general preference, which we use for finetuning and evaluation. The datasets can be found at https://huggingface.co/datasets/microsoft/rpr. *(A loading sketch appears below the table.)* |
| Dataset Splits | Yes | We divide the dataset into a training set of 10,167 paired samples, and a test set of 1,000 paired samples, with no overlap between train and test prompts. |
| Hardware Specification | Yes | Finetuning took approximately 8 hours on a single A100. Experiments were run on an internal cluster of GPUs with between 24GB and 48GB VRAM each. |
| Software Dependencies | No | optim = adamw_hf; lr_scheduler_type = linear; PEFT_CONFIG = LoraConfig(task_type=TaskType.SEQ_CLS, inference_mode=False, r=16, lora_alpha=32, lora_dropout=0.05, target_modules=['q_proj', 'k_proj', 'v_proj', 'dense']) |
| Experiment Setup | Yes | epochs = 1; per_device_train_batch_size = 2; gradient_accumulation_steps = 1; learning_rate = 1e-5; weight_decay = 1e-2; optim = adamw_hf; lr_scheduler_type = linear *(A runnable sketch combining this row and the previous one appears below the table.)* |
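
The Pseudocode row notes that the Llama-3 judge's score is computed from logits (Appendix D.1 of the paper). The paper's exact procedure is not reproduced in the quote, so the following is only a minimal sketch of one common logit-based scoring scheme, assuming the judge prompt ends by asking for a single-token verdict ("A" or "B") naming the preferred response; the model name is likewise an assumption, as the paper says only "Llama 3".

```python
# Hedged sketch of logit-based judge scoring -- NOT necessarily the paper's
# exact Appendix D.1 procedure. Assumes the judge prompt ends by asking for
# a single-token verdict, "A" or "B", naming the preferred response; the
# preference score is then a softmax over those two token logits.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Meta-Llama-3-8B-Instruct"  # assumption: paper says only "Llama 3"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)

def preference_score(judge_prompt: str) -> float:
    """Return P(judge prefers response A), read off the next-token logits."""
    inputs = tokenizer(judge_prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # logits for the next token
    tok_a = tokenizer.encode("A", add_special_tokens=False)[0]
    tok_b = tokenizer.encode("B", add_special_tokens=False)[0]
    pair = torch.stack([logits[tok_a], logits[tok_b]])
    return torch.softmax(pair, dim=0)[0].item()
```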
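
The released datasets live on the Hugging Face Hub, so they can presumably be pulled with the `datasets` library. A minimal loading sketch follows; the available configs and split names are not stated in the quote, so inspect the returned object or the dataset card at the URL above before relying on particular splits.

```python
# Minimal loading sketch for the released RPR datasets using standard
# `datasets` usage. Split/config names are not given in the quote above,
# so print the object to see what actually ships with the dataset.
from datasets import load_dataset

rpr = load_dataset("microsoft/rpr")
print(rpr)  # shows the splits and features actually available
```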
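
The Software Dependencies and Experiment Setup rows together describe the finetuning configuration. Below is a hedged reconstruction wiring those quoted hyperparameters into `peft` and `transformers`; the base model and `output_dir` are assumptions, not details from the paper. The placeholder base model "microsoft/phi-2" is chosen purely because its attention modules are named q_proj/k_proj/v_proj/dense, matching the quoted target_modules.

```python
# Hedged reconstruction of the quoted finetuning configuration with
# peft + transformers. Only the hyperparameters shown in the two rows
# above come from the paper; the base model and output_dir are assumptions.
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForSequenceClassification, TrainingArguments

peft_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,  # classification head, as in a reward model
    inference_mode=False,
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "dense"],
)

training_args = TrainingArguments(
    output_dir="rpr-preference-model",  # assumption: not stated in the paper
    num_train_epochs=1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=1,
    learning_rate=1e-5,
    weight_decay=1e-2,
    optim="adamw_hf",
    lr_scheduler_type="linear",
)

base_model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/phi-2",  # assumption: module names match the quoted target_modules
    num_labels=1,       # scalar output, the usual reward-model head
)
model = get_peft_model(base_model, peft_config)
model.print_trainable_parameters()
```

With `num_labels=1` the sequence-classification head emits a single scalar per input, the standard setup for Bradley-Terry-style preference modeling.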