Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Tree of Preferences for Diversified Recommendation

Authors: Hanyang Yuan, Ning Tang, Tongya Zheng, Jiarong Xu, Xintong Hu, Renhong Huang, Shunyu Liu, Jiacong Hu, Jiawei Chen, Mingli Song

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive evaluations of both diversity and relevance show that our approach outperforms existing methods in most cases and achieves near-optimal performance in others, with reasonable inference latency. Extensive experiments on three real-world datasets show that ToP-Rec achieves advantages in both diversity and relevance in most cases, with a dominant trade-off and efficient inference latency compared to baselines. In this section, we evaluate the performance of ToP-Rec through extensive experiments. Datasets. We use the Twitter [11], Weibo [12], and Amazon [14] datasets. Evaluation metrics. To evaluate the relevance of recommendations, we follow [16] and adopt the metric Recall@k (R@k), indicating the proportion of relevant items retrieved in the top-k recommendation list. To assess diversity, we use the Category-Entropy@k (CE@k), which measures the distribution of different categories within the top-k list. We report k = 50 and 100 in this work. Baselines. We adopt nine baselines to compare with the proposed approach, categorized into three types: (1) Heuristic methods: Random, MMR [35], and DPP [36]; (2) Conventional diversity-enhancing methods: Box/LCD-UC [6] and CDM [18]; (3) LLM-based diversified recommender: LLM4Rerank-A/LLM4Rerank-AD [20] and LLMRec-MMR [24]. Detailed descriptions and configurations for all baselines are provided in Appendix A.3. Implementation details. We implement LightGCN with 2 hidden layers and a hidden size of 32, which is optimized using Adam optimizer with a learning rate of 5e-3. We also evaluate the performance of ToP-Rec on other backbones (see Appendix A.4). We employ a random negative sampling with a 1:50 ratio and use early stopping. For hyperparameters affecting diversity and relevance, we search the number of selected leaves in [4, 7] (step size 1), number of augmentations per user in [3, 9] (step size 2), and the item sampling weight λ in [0.2, 0.8] (step size 0.2). We utilize Qwen2.5-32B-Instruct [37] to complete tasks involving LLMs. To ensure fairness, we employ the same LLM for our approach and all baselines involving LLMs. Experiments are repeated 5 times to report the average performance with standard deviation. All experiments are conducted on a machine of Ubuntu 20.04 system with AMD EPYC 7763 (756GB memory) and NVIDIA RTX3090 GPU (24GB memory). All models are implemented in PyTorch version 2.5.1 with CUDA version 11.8 and Python 3.10.15. Our code is publicly available at https://github.com/xxx08796/ToP_Rec_NIPS.
Researcher Affiliation	Academia	Hanyang Yuan (1), Ning Tang (2), Tongya Zheng (3), Jiarong Xu (2), Xintong Hu (1), Renhong Huang (1), Shunyu Liu (4), Jiacong Hu (1), Jiawei Chen (1), Mingli Song (1). (1) Zhejiang University, (2) Fudan University, (3) Hangzhou City University, (4) Nanyang Technological University.
Pseudocode	No	The paper describes methods and processes (e.g., "breadth-first search algorithm is used", "Prompt ToP means the instructions for constructing ToP (see Appendix A.2)", "Illustration of prompts. We provide an illustration of the prompts used to complete the essential processes of ToP-Rec, as summarized in Figure 4.") but does not present any formal pseudocode blocks or algorithms with structured, code-like steps in the main text or appendices.
Open Source Code	Yes	Our code is publicly available at https://github.com/xxx08796/ToP_Rec_NIPS.
Open Datasets	Yes	Datasets. We use the Twitter [11], Weibo [12], and Amazon [14] datasets. Twitter [11]: This dataset is originally collected from Twitter for bot detection. Weibo [12]: This dataset is collected from Weibo, one of China's largest social media platforms. Amazon [14]: Amazon is an e-commerce dataset.
Dataset Splits	Yes	For each user, we split their interactions into train, validation, and test sets with a ratio of 0.6:0.2:0.2.
Hardware Specification	Yes	All experiments are conducted on a machine of Ubuntu 20.04 system with AMD EPYC 7763 (756GB memory) and NVIDIA RTX3090 GPU (24GB memory).
Software Dependencies	Yes	All models are implemented in PyTorch version 2.5.1 with CUDA version 11.8 and Python 3.10.15.
Experiment Setup	Yes	We implement LightGCN with 2 hidden layers and a hidden size of 32, which is optimized using Adam optimizer with a learning rate of 5e-3. We also evaluate the performance of ToP-Rec on other backbones (see Appendix A.4). We employ a random negative sampling with a 1:50 ratio and use early stopping. For hyperparameters affecting diversity and relevance, we search the number of selected leaves in [4, 7] (step size 1), number of augmentations per user in [3, 9] (step size 2), and the item sampling weight λ in [0.2, 0.8] (step size 0.2).
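
The search ranges quoted in the Experiment Setup row can be enumerated as a small grid. The sketch below is illustrative only: it assumes the ranges are inclusive at both ends and that every combination is tried, neither of which the quoted text confirms, and the dictionary keys (num_leaves, num_augmentations, item_sampling_weight) are hypothetical names, not identifiers from the authors' code.

```python
from itertools import product

# Ranges as quoted in the Experiment Setup row; inclusive endpoints are an assumption.
num_leaves = list(range(4, 8))               # [4, 7], step size 1
num_augmentations = list(range(3, 10, 2))    # [3, 9], step size 2
# Build lambda from integers so the values do not drift: 0.2, 0.4, 0.6, 0.8.
item_sampling_weights = [round(0.2 * i, 1) for i in range(1, 5)]

# Assumption: an exhaustive grid; the source only says the values are "searched".
grid = [
    {"num_leaves": n, "num_augmentations": a, "item_sampling_weight": w}
    for n, a, w in product(num_leaves, num_augmentations, item_sampling_weights)
]

if __name__ == "__main__":
    print(f"{len(grid)} candidate configurations")
```

Under these assumptions each of the three hyperparameters takes four values, so an exhaustive search would cover 4 x 4 x 4 configurations.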
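
The Dataset Splits row reports a per-user 0.6:0.2:0.2 split into train, validation, and test sets. A minimal sketch of such a split follows. It assumes interactions are shuffled at random before splitting (the quoted text does not say whether the split is random or chronological), and the function name split_user_interactions and its seed parameter are hypothetical.

```python
import random


def split_user_interactions(items, ratios=(0.6, 0.2, 0.2), seed=0):
    """Split one user's interacted items into train/validation/test lists."""
    items = list(items)
    # Assumption: a random split; the source does not specify the ordering.
    random.Random(seed).shuffle(items)
    n_train = int(len(items) * ratios[0])
    n_val = int(len(items) * ratios[1])
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]   # the remainder goes to the test set
    return train, val, test


if __name__ == "__main__":
    train, val, test = split_user_interactions(range(10))
    print(len(train), len(val), len(test))
```

Because the split is applied per user, users with very few interactions may end up with an empty validation set under this rounding scheme; how the paper handles that case is not stated.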
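
The two evaluation metrics named in the Research Type row, Recall@k and Category-Entropy@k, can be sketched as below. Recall@k follows the quoted description (the proportion of relevant items retrieved in the top-k list). The entropy version is an assumption: it uses Shannon entropy with the natural logarithm over the category counts of the top-k items and assumes one category per item; the paper's exact formula may differ. The function names are hypothetical.

```python
import math
from collections import Counter


def recall_at_k(ranked_items, relevant_items, k):
    """Fraction of a user's relevant items that appear in the top-k list."""
    relevant = set(relevant_items)
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked_items[:k] if item in relevant)
    return hits / len(relevant)


def category_entropy_at_k(ranked_items, item_category, k):
    """Shannon entropy (natural log) of category frequencies in the top-k list.

    Assumption: each item maps to exactly one category via `item_category`.
    """
    counts = Counter(item_category[item] for item in ranked_items[:k])
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts.values())


if __name__ == "__main__":
    ranking = ["a", "b", "c", "d"]
    categories = {"a": "x", "b": "x", "c": "y", "d": "y"}
    print(recall_at_k(ranking, {"a", "d", "z"}, k=2))
    print(category_entropy_at_k(ranking, categories, k=4))
```

In this formulation a top-k list drawn from a single category has entropy zero, and the value grows as the list spreads more evenly across categories.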