Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
R$^2$ec: Towards Large Recommender Models with Reasoning
Authors: Runyang You, Yongqi Li, Xinyu Lin, Xin Zhang, Wenjie Wang, Wenjie Li, Liqiang Nie
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on three datasets demonstrate that R$^2$ec outperforms traditional, LLM-based, and reasoning-augmented recommender baselines, while further analyses validate its competitive efficiency among conventional LLM-based recommender baselines and strong adaptability to diverse recommendation scenarios. |
| Researcher Affiliation | Academia | $^1$The Hong Kong Polytechnic University, $^2$National University of Singapore, $^3$University of Science and Technology of China, $^4$Harbin Institute of Technology (Shenzhen) |
| Pseudocode | Yes | Algorithm 1 Training Process |
| Open Source Code | Yes | Code and checkpoints available at https://github.com/YRYangang/RRec. |
| Open Datasets | Yes | we conducted experiments using real-world datasets sourced from Amazon, including CDs and Vinyl (CDs), Video Games (Games), and Musical Instruments (Instruments). Dataset statistics and preprocessing steps are detailed in Appendix B. [...] from the latest public Amazon review dataset (https://amazon-reviews-2023.github.io/index.html), spanning from May 1996 to October 2023. For each domain, we begin with the most recent year of interactions (October 2022 to October 2023) and, if the resulting number of valid items is insufficient, iteratively roll the time window backward month by month until 10k items are obtained. We omit the 5-core filter so as to retain the natural behaviour characteristic of recommendation scenarios. Each user's interaction history is then chronologically sorted and truncated to the latest 20 actions, yielding fixed-length sequences for all subsequent modelling and evaluation. |
| Dataset Splits | Yes | Finally, the dataset is split into 80% training, 10% validation, and 10% test. The resulting statistics are listed in Table 5. |
| Hardware Specification | Yes | Hardware: 4x NVIDIA 3090 (24GB) GPUs |
| Software Dependencies | Yes | Framework: PyTorch 2.5.0, Transformers 4.48.1, DeepSpeed 0.15.7 |
| Experiment Setup | Yes | The learning rate is set to 1e-5 with a batch size of 24, and linear warm-up is applied for the first 32 steps. For trajectory generation, we set the group size G to four and use vLLM (tensor parallelism = 1, target GPU utilisation = 95%) for efficient generation. vLLM reserves one GPU for inference and leaves three for training. Sampling uses temperature = 1.5 and top-K = 200. Policy optimization uses a clipping ratio ϵ = 0.2 and omits KL regularization. Rewards are computed with NDCG@1000 and β = 0.05. Unless stated otherwise, these settings are used throughout. |
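
The settings quoted in the table can be collected into a small configuration sketch. The Python below is illustrative only and is not taken from the authors' repository: the names `TrainConfig`, `warmup_lr`, and `truncate_history` are hypothetical, and the learning rate is assumed to stay constant after warm-up, since the excerpt does not say what happens after step 32. The role of β is defined in the paper rather than in this excerpt, so it is stored here as a plain value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Values quoted in the table; field names are hypothetical."""
    learning_rate: float = 1e-5
    batch_size: int = 24
    warmup_steps: int = 32
    group_size: int = 4        # G, trajectories sampled per prompt
    temperature: float = 1.5
    top_k: int = 200
    clip_ratio: float = 0.2    # epsilon in the clipped policy objective
    beta: float = 0.05         # used in the reward; see the paper for its role
    max_history: int = 20      # latest interactions kept per user


def warmup_lr(step: int, cfg: TrainConfig) -> float:
    """Linear warm-up from 0 to the base rate over cfg.warmup_steps.

    Assumption: the rate is held constant afterwards.
    """
    if step >= cfg.warmup_steps:
        return cfg.learning_rate
    return cfg.learning_rate * step / cfg.warmup_steps


def truncate_history(events, max_len):
    """Sort (timestamp, item_id) pairs by time and keep the latest max_len items."""
    ordered = sorted(events, key=lambda e: e[0])
    return [item for _, item in ordered[-max_len:]]


cfg = TrainConfig()
history = truncate_history([(t, f"item{t}") for t in range(25)], cfg.max_history)
```

With these assumptions, `warmup_lr(16, cfg)` is half the base rate, and `history` holds the 20 most recent items in chronological order.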