Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

R$^2$ec: Towards Large Recommender Models with Reasoning

Authors: Runyang You, Yongqi Li, Xinyu Lin, Xin Zhang, Wenjie Wang, Wenjie Li, Liqiang Nie

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on three datasets demonstrate that R$^2$ec outperforms traditional, LLM-based, and reasoning-augmented recommender baselines, while further analyses validate its competitive efficiency among conventional LLM-based recommender baselines and strong adaptability to diverse recommendation scenarios.
Researcher Affiliation | Academia | (1) The Hong Kong Polytechnic University; (2) National University of Singapore; (3) University of Science and Technology of China; (4) Harbin Institute of Technology (Shenzhen)
Pseudocode | Yes | Algorithm 1: Training Process
Open Source Code | Yes | Code and checkpoints available at https://github.com/YRYangang/RRec.
Open Datasets | Yes | we conducted experiments using real-world datasets sourced from Amazon, including CDs and Vinyl (CDs), Video Games (Games), and Musical Instruments (Instruments). Dataset statistics and preprocessing steps are detailed in Appendix B. [...] from the latest public Amazon review dataset (https://amazon-reviews-2023.github.io/index.html) spanning from May 1996 to October 2023. For each domain, we begin with the most recent year of interactions (October 2022 to October 2023) and, if the resulting number of valid items is insufficient, iteratively roll the time window backward month-by-month until 10k items are obtained. We omit the 5-core filter so as to retain the natural behaviour characteristic of recommendation scenarios. Each user's interaction history is then chronologically sorted and truncated to the latest 20 actions, yielding fixed-length sequences for all subsequent modelling and evaluation.
Dataset Splits | Yes | Finally, the dataset is split into 80% training, 10% validation, and 10% test. The resulting statistics are listed in Table 5.
Hardware Specification | Yes | Hardware: 4x NVIDIA 3090 (24GB) GPUs
Software Dependencies | Yes | Framework: PyTorch 2.5.0, Transformers 4.48.1, DeepSpeed 0.15.7
Experiment Setup | Yes | The learning rate is set to 1e-5 with a batch size of 24, and linear warm-up is applied for the first 32 steps. For trajectory generation, we set the group size G to four and use vLLM (tensor-parallelism = 1, target GPU utilisation = 95%) for efficient generation. vLLM reserves one GPU for inference and leaves three for training. Sampling uses temperature = 1.5 and top-K = 200. Policy optimization uses the clipping ratio ϵ = 0.2 and omits KL regularization. Rewards are computed with NDCG@1000 and β = 0.05. Unless stated otherwise, these settings are used throughout.
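
The preprocessing quoted in the Open Datasets and Dataset Splits rows (chronological sorting, truncation to the latest 20 actions, and an 80/10/10 split) can be illustrated with a minimal Python sketch. This is not the authors' code: the function names, the `(user_id, item_id, timestamp)` tuple layout, and the random shuffle before splitting are assumptions made for illustration, since the excerpt does not say how examples are assigned to each split.

```python
import random
from collections import defaultdict

MAX_HISTORY = 20  # "truncated to the latest 20 actions" (from the excerpt)


def build_sequences(interactions, max_len=MAX_HISTORY):
    """Group interactions by user, sort by time, keep the latest max_len items.

    interactions: iterable of (user_id, item_id, timestamp) tuples.
    The tuple layout is an assumption, not taken from the paper.
    """
    by_user = defaultdict(list)
    for user_id, item_id, timestamp in interactions:
        by_user[user_id].append((timestamp, item_id))

    sequences = {}
    for user_id, events in by_user.items():
        events.sort(key=lambda event: event[0])  # chronological order
        sequences[user_id] = [item for _, item in events[-max_len:]]
    return sequences


def split_80_10_10(examples, seed=0):
    """Split examples into 80% train, 10% validation, 10% test.

    Shuffling before the split is an assumption; the excerpt only
    states the proportions.
    """
    examples = list(examples)
    random.Random(seed).shuffle(examples)
    n_train = len(examples) * 8 // 10
    n_val = len(examples) // 10
    train = examples[:n_train]
    val = examples[n_train:n_train + n_val]
    test = examples[n_train + n_val:]
    return train, val, test
```

In this sketch, users with fewer than 20 interactions keep shorter sequences; any padding to a fixed length would be a separate step that the excerpt does not describe.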
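
Two of the settings in the Experiment Setup row lend themselves to a short worked example: the linear warm-up over the first 32 steps and the NDCG@1000 reward. The sketch below is a hedged illustration, not the authors' implementation. It assumes a single ground-truth item per example (so the ideal DCG is 1), a 1-indexed step counter, and a constant learning rate after warm-up; the excerpt confirms none of these, and the role of β = 0.05 is not modelled because the excerpt does not define it.

```python
import math

BASE_LR = 1e-5      # learning rate from the excerpt
WARMUP_STEPS = 32   # warm-up length from the excerpt
NDCG_K = 1000       # cutoff from "NDCG@1000"


def warmup_lr(step, base_lr=BASE_LR, warmup_steps=WARMUP_STEPS):
    """Learning rate at a 1-indexed training step under linear warm-up.

    Holding the rate constant after warm-up is an assumption; the
    excerpt does not describe the schedule beyond the first 32 steps.
    """
    return base_lr * min(1.0, step / warmup_steps)


def ndcg_at_k_single_target(rank, k=NDCG_K):
    """NDCG@k when exactly one item is relevant (an assumption).

    rank is the 1-indexed position of the ground-truth item in the
    ranked list. With one relevant item the ideal DCG is 1, so the
    score reduces to 1 / log2(rank + 1) inside the cutoff and 0 outside.
    """
    if rank < 1:
        raise ValueError("rank must be a positive, 1-indexed position")
    if rank > k:
        return 0.0
    return 1.0 / math.log2(rank + 1)
```

With a cutoff as large as 1000, this score stays non-zero for most ranks, so a reward of this form varies smoothly with the position of the target item instead of being zero for all but the top few positions.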