Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

MTRec: Learning to Align with User Preferences via Mental Reward Models

Authors: Mengchen Zhao, Yifan Gao, Yaqing Hou, Xiangyang Li, Pengjie Gu, Zhenhua Dong, Ruiming Tang, Yi Cai

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we report the performance of MTRec in both offline and online settings, with focuses on answering the following research questions (RQs). RQ1: How does MTRec improve classification-based recommendation models? RQ2: How does MTRec improve RL-based recommendation models? RQ3: Does the learned mental reward model provide useful information? RQ4: How does MTRec perform in online A/B test? ... We conduct extensive offline and online experiments to demonstrate the improvements brought by MTRec.
Researcher Affiliation | Collaboration | (1) School of Software Engineering, South China University of Technology; (2) School of Computer Science and Technology, Dalian University of Technology; (3) Huawei Noah's Ark Lab; (4) Nanyang Technological University
Pseudocode | Yes | Algorithm 1: QR-IQL Optimization Steps. Input: interaction (expert) data D_E ... Algorithm 2: Overall Implementation of MTRec. Input: interaction (expert) data D_E
Open Source Code | No | We provide implementation details in Appendix A.3 and will release the code upon acceptance of this paper.
Open Datasets | Yes | The Amazon dataset McAuley et al. [2015] collects user review data from Amazon e-commerce platform. ... Both training and testing of the algorithms are conducted in simulated interactive recommendation environments on Virtual Taobao Shi et al. [2019].
Dataset Splits | No | The paper mentions using two subsets of the Amazon dataset, Books and Electronics, and describes how user reviews were processed chronologically. It also mentions constructing an expert dataset for Virtual Taobao with 100,000 trajectories. However, it does not provide specific details on how these datasets were split into training, validation, and test sets (e.g., percentages, sample counts, or methodology for creating the splits).
Hardware Specification | Yes | Our experiments are run on a server with 2 AMD EPYC 7542 32-Core Processor CPU and 2 NVIDIA RTX 3090 graphics.
Software Dependencies | No | The paper mentions using 'Adam Kingma and Ba [2015]' for optimization, which refers to an algorithm rather than a specific software library with a version number. No other software libraries or tools are specified with version numbers.
Experiment Setup | Yes | All the hyper-parameters of the backbone models follow their official codes. For implementation of MTRec, we select the number of quantiles N = 10 and the weight α = 0.5 in Problem 6.
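This summary does not reproduce Problem 6 or the exact QR-IQL objective. The following sketch is a hedged illustration of what "N = 10 quantiles" typically means. It assumes the quantile levels follow the common QR-DQN convention of midpoints τ_i = (2i − 1)/(2N), which gives 0.05, 0.15, ..., 0.95 for N = 10. It also assumes each quantile estimate is fitted with the standard quantile (pinball) loss. Both function names are hypothetical. The weight α = 0.5 is deliberately not modeled, because its role in Problem 6 is not described here.

```python
from typing import List


def quantile_midpoints(n: int) -> List[float]:
    """Hypothetical helper: QR-DQN-style quantile levels tau_i = (2i - 1) / (2n), i = 1..n."""
    return [(2 * i - 1) / (2 * n) for i in range(1, n + 1)]


def pinball_loss(target: float, preds: List[float], taus: List[float]) -> float:
    """Hypothetical helper: mean quantile (pinball) loss of n quantile estimates
    against one target sample, rho_tau(u) = u * (tau - 1[u < 0]) with u = target - pred."""
    total = 0.0
    for pred, tau in zip(preds, taus):
        u = target - pred  # positive when this quantile estimate is too low
        total += u * (tau - (1.0 if u < 0 else 0.0))
    return total / len(preds)


N = 10  # number of quantiles reported for MTRec; the assumption is only the midpoint convention
taus = quantile_midpoints(N)
print([round(t, 2) for t in taus])
print(pinball_loss(1.0, [0.0] * N, taus))
```

Under this convention, underestimates are penalized with weight τ and overestimates with weight 1 − τ. Minimizing the loss therefore pushes each estimate toward the τ-quantile of the target distribution.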