Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Think before Recommendation: Autonomous Reasoning-enhanced Recommender
Authors: Xiaoyu Kong, Junguang Jiang, Bin Liu, Ziru Xu, Han Zhu, Jian Xu, Bo Zheng, Jiancan Wu, Xiang Wang
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results demonstrate that RecZero and RecOne significantly outperform existing baseline methods on multiple benchmark datasets, validating the superiority of the RL paradigm in achieving autonomous reasoning-enhanced recommender systems. Our codes are available at https://github.com/AkaliKong/RecZero. We assess the effectiveness of RecZero and RecOne through extensive experiments on four benchmark datasets (e.g., Amazon-book, Amazon-music [34], Yelp [35], IMDb [36]), showcasing the RL paradigm's superiority over distillation methods (e.g., Rec-SAVER [20], EXP3RT [22], Reason4Rec [21]). |
| Researcher Affiliation | Collaboration | (1) Taobao & Tmall Group of Alibaba, China; (2) National University of Singapore; (3) Institute of Dataspace, Hefei Comprehensive National Science Center; (4) Shanghai Key Laboratory of Data Science |
| Pseudocode | Yes | Figure 5: System Prompt. A System Prompt. As illustrated in Fig 5, we guide the early outputs of the LLM through a structured process in the system prompt to achieve faster training convergence and superior model performance. Specifically, within the <analyze user> and </analyze user> tags, we directed the LLM to first list the features [like] and disliked features [dislike] of each product based on the user's historical interaction records, and then summarize the user's complete preferences using [pos] and [neg] tags. Subsequently, for the target item, we encourage the LLM to use [like] and [dislike] tags within the <analyze item> and </analyze item> tags to summarize the features that the user might like and dislike about the target item. Following this, the LLM engaged in a thoughtful analysis of the match between the user and the target item within the tags <match></match>, and finally provided the predicted user rating within the tags <rate></rate>. |
| Open Source Code | Yes | Our codes are available at https://github.com/AkaliKong/RecZero. |
| Open Datasets | Yes | We assess the effectiveness of RecZero and RecOne through extensive experiments on four benchmark datasets (e.g., Amazon-book, Amazon-music [34], Yelp [35], IMDb [36]), showcasing the RL paradigm's superiority over distillation methods (e.g., Rec-SAVER [20], EXP3RT [22], Reason4Rec [21]). |
| Dataset Splits | No | We partition the dataset and select 1000 data samples that are not included in either the training or test sets for the cold-start experiments. We requested the dataset from Reason4Rec [21] and conducted our experiments based on it. |
| Hardware Specification | Yes | For traditional baseline experiments, we utilize a single H20 GPU, while the LLM-based baselines and our RecZero and RecOne frameworks are executed on an 8-card H20 GPU setup. |
| Software Dependencies | Yes | All experiments are conducted using Python 3.9. |
| Experiment Setup | Yes | For both RecZero and RecOne, we utilize the Qwen2.5-7B-Instruct-1M model as the starting point for RL. During training, the batch size is set to 8, and the learning rate is 2e-6. Each data sample undergoes 8 rollouts during the training process. We set the sampling temperature to 1.0, the training epoch to 1, and the KL divergence to 0. |
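
Two rows of the table above can be made concrete with a short Python sketch: the hyperparameters quoted in the Experiment Setup row, and the final `<rate></rate>` tag described in the Pseudocode row. This is an illustration only and is not taken from the authors' repository. The names `RLSetup`, `rollouts_per_batch`, and `extract_rating` are hypothetical. Reading "KL divergence to 0" as a KL penalty weight of zero, and reading the batch size as 8 prompts per step, are both assumptions.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RLSetup:
    """Hyperparameters as quoted in the Experiment Setup row (field names are hypothetical)."""
    base_model: str = "Qwen2.5-7B-Instruct-1M"
    batch_size: int = 8
    learning_rate: float = 2e-6
    rollouts_per_sample: int = 8
    sampling_temperature: float = 1.0
    epochs: int = 1
    kl_coefficient: float = 0.0  # assumption: "KL divergence to 0" means the penalty weight

    def rollouts_per_batch(self) -> int:
        # Assumption: the batch size counts prompts, each sampled rollouts_per_sample times.
        return self.batch_size * self.rollouts_per_sample


# Matches an integer or decimal number inside <rate>...</rate>, allowing surrounding whitespace.
RATE_PATTERN = re.compile(r"<rate>\s*(-?\d+(?:\.\d+)?)\s*</rate>")


def extract_rating(completion: str) -> Optional[float]:
    """Return the number in the last <rate>...</rate> tag, or None if no such tag is found."""
    matches = RATE_PATTERN.findall(completion)
    return float(matches[-1]) if matches else None
```

Taking the last `<rate>` tag rather than the first is a design choice made for this sketch, on the assumption that the model's final answer follows its reasoning; the source does not say how multiple tags are handled.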