Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.

Consistently Simulating Human Personas with Multi-Turn Reinforcement Learning

Authors: Marwa Abdulhai, Ryan Cheng, Donovan Clay, Tim Althoff, Sergey Levine, Natasha Jaques

NeurIPS 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Our experiments show that models optimized in this way reduce inconsistency by over 55%, resulting in more coherent, faithful, and trustworthy simulated users. We investigate the consistency of LLM-based simulated human agents across three interactive domains: open-ended conversation, education, and mental health. As shown in Table 3, consistency varies substantially across both models and tasks. Multi-turn RL substantially increases prompt-to-line consistency across all tasks. As shown in Figure 4, PPO consistently outperforms the baseline Llama-8B-Instruct model, SFT and KTO. Human evaluation of conversations from the fine-tuned PPO model corroborates these improvements as described in Section 5.
Researcher Affiliation Collaboration 1 UC Berkeley, 2 University of Washington, 3 Google DeepMind
Pseudocode No The paper describes the methods and procedures in narrative text and does not include any clearly labeled pseudocode blocks or algorithms.
Open Source Code Yes Our code is available at https://github.com/abdulhaim/consistent-LLMs and project page at https://sites.google.com/view/consistent-llms. We use OpenRLHF and provide implementation details sufficient to reproduce experiments in the Appendix. We have also released our code and synthetic data (anonymized).
Open Datasets Yes Inspired by the PersonaChat dataset [73], we generate natural, open-ended dialog between two LLM agents each assigned with rich, compositional personas from [34]. A random sample of 100 synthetically generated personas from prior work [34] was used to generate the conversations. Student personas were generated from gpt-4o-mini through random sampling of an education level and a variety of learning styles (detailed in 4). Patient personas were generated by a random sampling of different dimensions [4, 11, 9, 45, 46, 55] as shown in Table 5. We have also released our code and synthetic data (anonymized).
Dataset Splits No For each model-task pair, we generate a total of 800 dialogues per task at varying lengths (10, 20, 40, and 60 turns). We perform multi-turn RL fine-tuning of Llama-3.1-8B-Instruct on the full dataset of conversations (39K lines of dialogue) from three models for each task. After training, we evaluate model performance by generating new conversations based on existing user simulator backgrounds/personas between the fine-tuned Usim and the original Task Agent. The paper describes generating data and fine-tuning on a 'full dataset' and evaluating on 'new conversations', but does not provide explicit train/test/validation splits (e.g., percentages or fixed counts) for a single, static dataset.
Hardware Specification Yes Training was done with access to a cluster of 8 NVIDIA H100 GPUs as well as a cluster of 8 NVIDIA H200 GPUs.
Software Dependencies No We implement this training setup using OpenRLHF [23], extending it to support turn-level rewards and multi-turn rollout generation. We use OpenRLHF to fine-tune Meta-Llama-3-8B-Instruct using Supervised Fine-tuning (SFT), Kahneman-Tversky Optimization (KTO), and Proximal Policy Optimization (PPO). The paper mentions 'OpenRLHF' as a tool but does not provide a specific version number for it or any other software library or programming language.
Experiment Setup No We fine-tune the User Simulator with Proximal Policy Optimization (PPO) [61], with rewards derived from our consistency metrics. The training data is structured so that the model is trained to predict the next line of conversation given the input generation prompt containing a scenario, background, and the conversation history up to that point in the conversation. SFT training is performed first on the dataset, after which PPO or KTO is then used to fine-tune the model further using the consistency metrics as rewards. Policy updates alternate with rollout phases, during which full dialogues are generated and scored for consistency. The paper describes the general training methodology (PPO, KTO, SFT, sequence of operations) and some high-level details, but it does not specify concrete hyperparameters like learning rates, batch sizes, number of epochs, or detailed optimizer settings for reproducing the experiments.
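The next-line training format described in the Experiment Setup row can be illustrated with a minimal sketch. This is not the authors' code: the function name, the `Example` type, the prompt template, and the choice to emit targets only for the simulated user's lines are all assumptions made for illustration. It shows only how a single dialogue might be unrolled into (prompt, next line) pairs, where each prompt contains the scenario, the background, and the conversation history up to that point.

```python
from dataclasses import dataclass


@dataclass
class Example:
    prompt: str  # scenario + background + history up to the target line
    target: str  # the next line the model is trained to predict


def build_next_line_examples(scenario, background, dialogue, simulator_speaker):
    """Unroll one dialogue into next-line prediction examples.

    `dialogue` is a list of (speaker, text) tuples in conversation order.
    Assumption: only lines spoken by `simulator_speaker` become targets.
    The prompt template below is hypothetical, not taken from the paper.
    """
    examples = []
    for i, (speaker, line) in enumerate(dialogue):
        if speaker != simulator_speaker:
            continue
        # History includes every line strictly before the target line.
        history = "\n".join(f"{s}: {t}" for s, t in dialogue[:i])
        prompt = (
            f"Scenario: {scenario}\n"
            f"Background: {background}\n"
            f"Conversation so far:\n{history}\n"
            f"{simulator_speaker}:"
        )
        examples.append(Example(prompt=prompt, target=line))
    return examples


if __name__ == "__main__":
    demo = [
        ("Student", "Hi, I learn best with examples."),
        ("Teacher", "Great, let's start with one."),
        ("Student", "Can you show me a diagram?"),
    ]
    for ex in build_next_line_examples(
        "A tutoring session", "Visual learner, high school", demo, "Student"
    ):
        print(ex.prompt, "->", ex.target)
```

Under these assumptions, a dialogue with N simulator lines yields N examples, and later examples carry progressively longer histories. The same unrolled pairs could serve as SFT data before any reward-based fine-tuning, though the paper does not state its exact formatting.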