Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026].

Improving Context-Aware Preference Modeling for Language Models

Authors: Silviu Pitis, Ziang Xiao, Nicolas Le Roux, Alessandro Sordoni

NeurIPS 2024 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Finally, we conduct experiments to benchmark the context-specific performance of various models and investigate the potential value of context-aware preference modeling.
Researcher Affiliation | Collaboration | Silviu Pitis (a, b), Ziang Xiao (c), Nicolas Le Roux (b, d), Alessandro Sordoni (b, d); (a) University of Toronto, (b) Microsoft Research, (c) Johns Hopkins University, (d) MILA
Pseudocode | No | The following template is used for llm-as-a-judge models (Llama 3 and GPT-4 Turbo). Llama-3 uses the logit_template (see Appendix D.1 for how the score is computed). GPT-4 Turbo uses the argmax_score_template_no_cot and runs inference with temperature = 0.
Open Source Code | No | Unfortunately we [are] unable to release code, but are happy to clarify any details over email.
Open Datasets | Yes | We open-source high quality context-conditioned preference datasets that disentangle context-specific preference from general preference, which we use for finetuning and evaluation. The datasets can be found at https://huggingface.co/datasets/microsoft/rpr.
Dataset Splits | No | We divide the dataset into a training set of 10,167 paired samples, and a test set of 1,000 paired samples, with no overlap between train and test prompts.
Hardware Specification | Yes | Finetuning took approximately 8 hours on a single A100. Experiments were run on an internal cluster of GPUs with between 24GB and 48GB VRAM each.
Software Dependencies | No | optim = adamw_hf; lr_scheduler_type = linear; PEFT_CONFIG = LoraConfig(task_type=TaskType.SEQ_CLS, inference_mode=False, r=16, lora_alpha=32, lora_dropout=0.05, target_modules=['q_proj', 'k_proj', 'v_proj', 'dense'])
Experiment Setup | Yes | epochs = 1; per_device_train_batch_size = 2; gradient_accumulation_steps = 1; learning_rate = 1e-5; weight_decay = 1e-2; optim = adamw_hf; lr_scheduler_type = linear
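
The settings quoted in the Software Dependencies and Experiment Setup rows can be collected into plain Python dictionaries to see what they imply for a training run. The sketch below is illustrative only and is not the authors' code, which was not released. The helper names (optimizer_steps, lora_scaling, linear_lr) are hypothetical. It assumes one training example per paired sample, a single device (the report mentions a single A100), a kept final partial batch, and a linear schedule with no warmup (warmup is not reported). The keys mirror the quoted settings; mapping them onto a particular library's argument names would be a further assumption.

```python
import math

# Values copied from the "Software Dependencies" row (LoRA adapter settings).
LORA_SETTINGS = {
    "task_type": "SEQ_CLS",
    "inference_mode": False,
    "r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "target_modules": ["q_proj", "k_proj", "v_proj", "dense"],
}

# Values copied from the "Experiment Setup" row.
TRAIN_SETTINGS = {
    "epochs": 1,
    "per_device_train_batch_size": 2,
    "gradient_accumulation_steps": 1,
    "learning_rate": 1e-5,
    "weight_decay": 1e-2,
    "optim": "adamw_hf",
    "lr_scheduler_type": "linear",
}


def optimizer_steps(num_examples, settings, num_devices=1):
    """Optimizer updates for the whole run (assumes the last partial batch is kept)."""
    effective_batch = (
        settings["per_device_train_batch_size"]
        * settings["gradient_accumulation_steps"]
        * num_devices
    )
    return settings["epochs"] * math.ceil(num_examples / effective_batch)


def lora_scaling(settings):
    """Standard LoRA scaling factor applied to the low-rank update: alpha / r."""
    return settings["lora_alpha"] / settings["r"]


def linear_lr(step, total_steps, base_lr):
    """Linear decay from base_lr to 0; assumes zero warmup steps (not reported)."""
    return base_lr * max(0.0, (total_steps - step) / total_steps)


if __name__ == "__main__":
    # 10,167 is the training-set size quoted in the "Dataset Splits" row.
    total = optimizer_steps(10_167, TRAIN_SETTINGS)
    print("optimizer steps:", total)
    print("LoRA scaling:", lora_scaling(LORA_SETTINGS))
    print("lr at midpoint:", linear_lr(total // 2, total, TRAIN_SETTINGS["learning_rate"]))
```

Under these assumptions, a batch size of 2 without gradient accumulation gives roughly five thousand optimizer updates for the single reported epoch, and the LoRA update is scaled by lora_alpha / r = 2.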