Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Looking into User’s Long-term Interests through the Lens of Conservative Evidential Learning

Authors: Dingrong Wang, Krishna Neupane, Ervine Zheng, Qi Yu

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Experiments on multiple real-world dynamic datasets demonstrate the state-of-the-art performance of ECQL and its capability to capture users' long-term interests. In this paper, we propose a novel evidential conservative Q-learning framework (ECQL) that learns an effective and conservative recommendation policy by integrating evidence-based uncertainty and conservative learning. We conduct extensive experiments over four real-world datasets and compare with state-of-the-art baselines to demonstrate the effectiveness of the proposed model."
Researcher Affiliation | Collaboration | "Dingrong Wang (1), Krishna Prasad Neupane (2), Ervine Zheng (3), Qi Yu (1) — (1) Rochester Institute of Technology, (2) Amazon, (3) Samsung Research"
Pseudocode | Yes | "Algorithm 1: Evidential Conservative Q-Learning"
Open Source Code | Yes | "The source code and processed datasets can be accessed here: https://github.com/ritmininglab/ECQL"
Open Datasets | Yes | "We conduct experiments on multiple real-world datasets: Movielens-1M, Movielens-100K, Netflix, and Yahoo! Music. Movielens-1M: This dataset includes 1M ratings provided by 6,040 anonymous users... Movielens-100K: This dataset contains 100,000 ratings from 943 users... Netflix (Bennett et al., 2007): This dataset has around 100 million interactions... Yahoo! Music rating (Dror et al., 2012): The dataset includes approximately 300,000 user-supplied ratings..."
Dataset Splits | Yes | "For training, given a user interaction history H_u, we continuously capture the most recent N items after the current time step into a sliding window W_t... We consider each user an episode for the RL setting and split users into 70% as training users and 30% as test users."
Hardware Specification | Yes | "We implement the experiments based on the PyTorch framework with two A100 GPUs."
Software Dependencies | No | "We implement the experiments based on the PyTorch framework with two A100 GPUs." (only the framework is named; no versions or dependency list are given)
Experiment Setup | Yes | "We set discount factor γ = 1 and set τ = 3 as a threshold to identify whether an item is positive, i.e., whether its ground-truth rating is larger than or equal to the threshold (rating_{u,i} ≥ τ). In testing, the agent may recommend items not interacted with by the user. In such cases, we assign a neutral rating of τ = 3 to those non-interacted items. For training, we conduct 5 RL epochs, each with full training of all training users (episodes). Each epoch is equipped with an annealing λ ranging from 1 to 0.1 to shift the emphasis from exploration to exploitation as knowledge of the training users accumulates. For testing, we conduct only one RL epoch containing all test users, and we use an annealing λ ranging from 0.5 to 0.1 across variable step sizes in the four datasets, which differ in minimum session length."
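The quoted setup combines a few simple mechanisms: a 70%/30% user (episode) split, a sliding window over each user's recent items, a rating threshold τ for positive items, and a linearly annealed λ. A minimal sketch of these pieces is below; all function names are hypothetical illustrations, not the authors' code (which is in the linked repository), and the linear annealing schedule is an assumption since the paper excerpt does not specify the annealing curve.

```python
import random

def split_users(user_ids, train_frac=0.7, seed=0):
    """Split users (episodes) into 70% training / 30% test users."""
    ids = list(user_ids)
    random.Random(seed).shuffle(ids)
    cut = int(len(ids) * train_frac)
    return ids[:cut], ids[cut:]

def sliding_window(history, t, n):
    """Most recent n items before time step t (the sliding window W_t)."""
    return history[max(0, t - n):t]

def is_positive(rating, tau=3):
    """An item is positive if its ground-truth rating satisfies rating >= tau."""
    return rating >= tau

def annealed_lambda(step, total_steps, start=1.0, end=0.1):
    """Anneal lambda from `start` to `end` (linear schedule assumed)."""
    frac = step / max(1, total_steps - 1)
    return start + (end - start) * frac
```

For testing, the same `annealed_lambda` would be called with `start=0.5`, matching the 0.5-to-0.1 range quoted above.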