Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Do LLMs Recognize Your Preferences? Evaluating Personalized Preference Following in LLMs

Authors: Siyan Zhao, Mingyi Hong, Yang Liu, Devamanyu Hazarika, Kaixiang Lin

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental We introduce PREFEVAL, a benchmark for evaluating LLMs' ability to infer, memorize, and adhere to user preferences in a long-context conversational setting. PREFEVAL comprises 3,000 manually curated user preference and query pairs spanning 20 topics. PREFEVAL contains user personalization or preference information in both explicit and implicit forms, and evaluates LLM performance using a generation and a classification task. With PREFEVAL, we evaluated the aforementioned preference following capabilities of 10 open-source and proprietary LLMs in multi-session conversations with varying context lengths up to 100k tokens. We benchmark with various prompting, iterative feedback, and retrieval-augmented generation methods.
Researcher Affiliation Collaboration Siyan Zhao2, Mingyi Hong1,3, Yang Liu1, Devamanyu Hazarika1, Kaixiang Lin1 (1Amazon AGI, 2UCLA, 3University of Minnesota)
Pseudocode No The paper describes methods and evaluation protocols in text sections such as 2.1, 2.5, 3.1, and Appendix A.4, but does not present any structured pseudocode or algorithm blocks.
Open Source Code Yes Our code and dataset are available at https://prefeval.github.io/.
Open Datasets Yes Our code and dataset are available at https://prefeval.github.io/. ... PREFEVAL consists of 1,000 unique preference-query pairs, each with three preference forms (...), resulting in 3,000 preference-query pairs. These pairs were manually curated with the assistance of GPT-4, Claude 3 Sonnet, and Claude 3.5 Sonnet (see Appendix A.14 for detailed data construction methodology). ... We incorporate multi-session turns from the LMSYS-Chat-1M dataset (Zheng et al., 2023).
Dataset Splits Yes We fine-tuned the Mistral-7B model using SFT on 80% of the topics in PREFEVAL and evaluated it on the remaining unseen 20% topics for the generation task.
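The quoted split is by topic rather than by example: all pairs from a held-out topic are excluded from fine-tuning, so evaluation topics are fully unseen. A minimal sketch of such a topic-level split, with illustrative record fields (`topic`, `id`) that are assumptions rather than the paper's actual schema:

```python
import random

def split_by_topic(pairs, train_frac=0.8, seed=0):
    """Split records so that no topic appears in both train and test.

    `pairs` is a list of dicts with a "topic" key (hypothetical schema).
    """
    topics = sorted({p["topic"] for p in pairs})
    rng = random.Random(seed)
    rng.shuffle(topics)
    n_train = int(len(topics) * train_frac)
    train_topics = set(topics[:n_train])
    train = [p for p in pairs if p["topic"] in train_topics]
    test = [p for p in pairs if p["topic"] not in train_topics]
    return train, test

# 100 toy records spread over 20 topics, mirroring PREFEVAL's 20-topic layout.
pairs = [{"topic": f"topic_{i % 20}", "id": i} for i in range(100)]
train, test = split_by_topic(pairs)
```

Splitting on topic identity (not on individual pairs) is what makes the reported numbers a generalization measure: the fine-tuned model has never seen any preference from the evaluation topics.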
Hardware Specification No The paper evaluates various state-of-the-art LLMs, including proprietary models like Claude and GPT-4o, and open-source models like Mistral and LLaMA. While it describes fine-tuning the Mistral-7B model, specific hardware details such as GPU or CPU models used for these experiments are not mentioned in the text.
Software Dependencies Yes We extensively evaluate a variety of state-of-the-art LLMs, including Claude 3 Sonnet, Claude 3 Haiku, Mistral 7b Instruct, Mistral 8x7b Instruct, LLaMA 3 8b Instruct, and LLaMA 3 70b Instruct. We also assess more recent models Claude 3.5 Sonnet, GPT-o1-preview, and Gemini-1.5-pro in specific settings. The specific versions used are listed in Table 3, for example: 'Claude 3 Sonnet anthropic.claude-3-sonnet-20240229-v1:0', 'Mistral 7b Instruct mistral.mistral-7b-instruct-v0:2', 'LLaMA 3 8b Instruct meta.llama3-8b-instruct-v1:0'.
Experiment Setup Yes We investigate methods to explicitly help LLMs focus on the preference-following task... (4) Few-Shot Chain-of-Thought (CoT): The LLM is given several few-shot examples (in our experiments, we used 5-shot) of chain-of-thoughts... (5) Retrieval-Augmented Generation (RAG): A sentence embedding model is used to retrieve the most similar conversation exchanges to the question, which are then provided to the LLM in the prompt. ... During training, to simulate conversational preference following, we inserted 0, 5, or 10 contextual turns between the preference and query, resulting in training data of 2, 7, or 12 turns (where the preference, query, and response constitute 2 turns).
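The RAG baseline quoted above retrieves the conversation exchanges most similar to the current query and prepends them to the prompt. A minimal sketch of that retrieve step: the paper uses a sentence embedding model, and here a bag-of-words cosine similarity stands in for it, with a toy conversation history that is purely illustrative.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for a sentence embedding model: bag-of-words counts."""
    return Counter(t.strip(".,?!:") for t in text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, exchanges, k=2):
    """Return the k past exchanges most similar to the query."""
    q = embed(query)
    scored = sorted(exchanges, key=lambda e: cosine(q, embed(e)), reverse=True)
    return scored[:k]

history = [
    "User: I prefer vegetarian restaurants.",
    "User: Tell me about the weather tomorrow.",
    "User: Can you recommend a restaurant for dinner?",
]
top = retrieve("Recommend a good restaurant near me.", history, k=1)
```

The retrieved exchanges would then be inserted into the LLM prompt ahead of the query, so that a stated preference buried deep in a long conversation can still influence the response without feeding the model the entire context.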