Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

Enhancing Personalized Multi-Turn Dialogue with Curiosity Reward

Authors: Yanming Wan, Jiaxing Wu, Marwa Abdulhai, Lior Shani, Natasha Jaques

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically evaluate CURIO on two conversational tasks: Education Dialogue [18] and Exercise Recommendation. Given the considerable challenge of applying theoretical RL concepts to practical LLM fine-tuning, we selected these tasks as well-defined and controlled benchmarks. Our experiments clearly demonstrate CURIO's superior performance in rapidly adapting to individual users.
Researcher Affiliation | Collaboration | Yanming Wan2*, Jiaxing Wu1*, Marwa Abdulhai4, Lior Shani3, Natasha Jaques1,2. 1 Google DeepMind; 2 University of Washington; 3 Google Research; 4 University of California, Berkeley
Pseudocode | No | The paper includes mathematical formulations and diagrams to illustrate the framework (e.g., Figure 2), but does not present any explicitly labeled pseudocode or algorithm blocks with structured steps.
Open Source Code | No | Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [No] Justification: The access to the code is not available yet. We plan to release it in the future.
Open Datasets | Yes | We use the Education Dialog dataset introduced by Shani et al. [18], which simulates an educational setting where an LLM agent teaches students a given topic. To enhance realism, we designed a comprehensive list of user attributes [29] encompassing multiple aspects such as lifestyle, socioeconomic status, and health conditions, etc. Consequently, to make personalized recommendations, the agent must elicit user information and preferences through multiple rounds of dialogue before choosing a strategy at the end of the conversation. (3) User Backstory Generation: We utilize the Gemini model to generate a detailed backstory for each user based on their attribute values. Each simulated user is prompted only with the backstory. Please refer to Appendix D for more details. Appendix D.3 Code for User Generation
Dataset Splits | Yes | We generate 1000 simulated users, and split them into 800 for training and 200 for evaluation. Each user is mapped to one particular ground truth strategy among 8 different exercise strategies.
Hardware Specification | No | The paper mentions models like "Gemma 2B model", "Gemma 7B model", and "Gemini 1.5 Pro", which are language models, but does not provide specific details about the hardware (e.g., GPU models, CPU types, memory) used for the experiments.
Software Dependencies | No | The paper mentions the use of specific LLMs such as "Gemma 2B model" and "Gemma 7B model" for various components (policy, environment, value, user model) and "Gemini 1.5 Pro" for data generation, but does not list any specific software libraries, frameworks, or programming languages with their version numbers.
Experiment Setup | Yes | A.3 Hyperparameters: We followed the training recipe and hyperparameters from Shani et al. [18]. On top of the original extrinsic reward, we added an intrinsic reward to each turn of the conversation as described above, with a coefficient weight αint applied to the intrinsic reward when adding it to the extrinsic reward, to balance the scale of extrinsic and intrinsic rewards. For Education Dialogue, we choose αint = 9.0 for all settings, with other hyperparameters listed in Table 4. For all settings, we select several checkpoints that have the highest intrinsic rewards before 30k steps, and then choose the one that performs best on conversation quality. For Exercise Recommendation, we choose αint = 5.0 for Diff Acc and Diff Ent, αint = 1.0 for Acc, Ent, and Diff Log Acc, and αint = 0.1 for Info Gain. The other hyperparameters are listed in Table 5. Table 4: Hyperparameters for Education Dialogue. Policy Model Learning Rate ηpolicy: 4e-7; Value Model Learning Rate ηvalue: 4e-7; Batch Size B: 16; KL Regularization Coefficient β: 0.01; GAE Coefficient λ: 0.95; Turn Discount γ: 0.95; Max Number of Turns T: 10; Extrinsic Reward Weight αext: 1.0; User Classifier Temperature τ: 5.0. Table 5: Hyperparameters for Exercise Recommendation. Policy Model Learning Rate ηpolicy: 4e-7; Value Model Learning Rate ηvalue: 4e-7; Batch Size B: 16; KL Regularization Coefficient β: 0.02; GAE Coefficient λ: 0.95; Turn Discount γ: 0.95; Max Number of Turns T: 6; Extrinsic Reward Weight αext: 3.0; User Classifier Temperature τ: 5.0.
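
The reward weights quoted in the Experiment Setup row can be illustrated with a minimal Python sketch. It assumes the per-turn training reward is the weighted sum αext · r_ext + αint · r_int, which the quoted text suggests but does not state as a formula. The names `RewardConfig`, `exercise_config`, and `combine_rewards` are hypothetical and are not taken from the paper or its code, which has not been released. Only the numeric values come from Tables 4 and 5.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class RewardConfig:
    """Reward weights and turn limit for one task (values from Tables 4 and 5)."""
    alpha_ext: float  # extrinsic reward weight
    alpha_int: float  # intrinsic reward weight
    max_turns: int    # maximum number of turns T


# Education Dialogue: alpha_int = 9.0 for all settings (Table 4).
EDUCATION_DIALOGUE = RewardConfig(alpha_ext=1.0, alpha_int=9.0, max_turns=10)

# Exercise Recommendation: alpha_int depends on the intrinsic reward variant.
# Keys reuse the variant names exactly as they appear in the quoted text.
EXERCISE_ALPHA_INT = {
    "Diff Acc": 5.0,
    "Diff Ent": 5.0,
    "Acc": 1.0,
    "Ent": 1.0,
    "Diff Log Acc": 1.0,
    "Info Gain": 0.1,
}


def exercise_config(variant: str) -> RewardConfig:
    """Build the Exercise Recommendation config for one variant (Table 5)."""
    return RewardConfig(
        alpha_ext=3.0,
        alpha_int=EXERCISE_ALPHA_INT[variant],
        max_turns=6,
    )


def combine_rewards(
    extrinsic: Sequence[float],
    intrinsic: Sequence[float],
    cfg: RewardConfig,
) -> List[float]:
    """Return per-turn rewards as a weighted sum.

    ASSUMPTION: the two rewards are combined linearly at every turn.
    """
    if len(extrinsic) != len(intrinsic):
        raise ValueError("extrinsic and intrinsic must have one value per turn")
    if len(extrinsic) > cfg.max_turns:
        raise ValueError("conversation exceeds the maximum number of turns")
    return [
        cfg.alpha_ext * r_ext + cfg.alpha_int * r_int
        for r_ext, r_int in zip(extrinsic, intrinsic)
    ]


# Two-turn example with made-up reward values, for illustration only.
print(combine_rewards([0.0, 1.0], [0.5, 0.25], EDUCATION_DIALOGUE))  # [4.5, 3.25]
```

Keeping the weights in a frozen dataclass makes the per-task differences (αext of 1.0 versus 3.0, T of 10 versus 6) explicit and prevents accidental mutation during a training run.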