Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Multi-Objective Intrinsic Reward Learning for Conversational Recommender Systems

Authors: Zhendong Chu, Nan Wang, Hongning Wang

NeurIPS 2023

Reproducibility Variable: Result (with supporting LLM response)

Research Type: Experimental
  "To evaluate the effectiveness of our approach, we conduct extensive experiments on three public CRS benchmarks. The results show that our algorithm significantly improves CRS performance by exploiting informative learned intrinsic rewards."

Researcher Affiliation: Collaboration
  Zhendong Chu, University of Virginia, EMAIL, Charlottesville, VA, USA; Nan Wang, Netflix Inc., EMAIL, Los Gatos, CA, USA; Hongning Wang, University of Virginia, EMAIL, Charlottesville, VA, USA

Pseudocode: Yes
  "Algorithm 1: Optimization algorithm of CRSIRL"

Open Source Code: No
  No statement about open-sourcing the code or a link to a code repository was found.

Open Datasets: Yes
  "We evaluate CRSIRL on three multi-round CRS benchmarks [Lei et al., 2020a, Deng et al., 2021]. The Last FM dataset is for music artist recommendation. Lei et al. [2020a] manually grouped the original attributes into 33 coarse-grained attributes. The Last FM* dataset is the version where attributes are not grouped. The Yelp* dataset is for local business recommendation. We summarize their statistics in Table 1."
Dataset Splits: Yes
  "All datasets are split by 7:1.5:1.5 ratio for training, validation and testing."
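The reported 7:1.5:1.5 split can be sketched as follows. Note this is an assumption-laden illustration: the paper only states the ratio, so the random shuffle, seed, and the fact that the split is applied globally (rather than per user) are choices made here for concreteness.

```python
import random

def split_dataset(items, seed=0):
    """Split items 70/15/15 into train/validation/test.

    Sketch of the 7:1.5:1.5 ratio reported in the paper; the shuffling
    strategy and seed are assumptions, not details from the paper.
    """
    rng = random.Random(seed)
    items = items[:]          # avoid mutating the caller's list
    rng.shuffle(items)
    n = len(items)
    n_train = int(n * 0.7)
    n_val = int(n * 0.15)
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test

train, val, test = split_dataset(list(range(1000)))
print(len(train), len(val), len(test))  # 700 150 150
```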
Hardware Specification: Yes
  "All experiments are run on an NVIDIA GeForce RTX 3080Ti GPU with 12 GB memory."

Software Dependencies: No
  The paper mentions an 'Adam optimizer' and a 'Transformer-based state encoder' but does not specify versions for any programming languages, libraries, or frameworks (e.g., Python, PyTorch).
Experiment Setup: Yes
  "The learning rates in the inner and outer loop are searched from {1e-5, 5e-5, 1e-4} with the Adam optimizer. The coefficient of intrinsic reward λ is searched from {0.05, 0.1, 0.5, 1.0}. The discount factor γ is set to 0.999. All experiments are run on an NVIDIA GeForce RTX 3080Ti GPU with 12 GB memory. Since RL-based baselines rely on handcrafted rewards, we follow Lei et al. [2020a] to set (1) r_rec_suc = 1 for a successful recommendation; (2) r_rec_fail = -0.1 for a failed recommendation; (3) r_ask_suc = 0.1 when the inquired attribute is confirmed by the user; (4) r_ask_fail = -0.1 when the inquired attribute is dismissed by the user; (5) r_quit = -0.3 when the user quits the conversation without a successful recommendation. We set the maximum turn T as 15 and the size K of the recommendation list as 10."
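The handcrafted per-turn reward scheme used by the RL-based baselines (following Lei et al. [2020a]) can be sketched as a simple event-to-reward lookup. The signs of the penalty terms below follow the usual convention in this line of work; treat the exact magnitudes and signs as assumptions, and the short episode as a purely illustrative example.

```python
# Hedged sketch of the handcrafted reward design attributed to
# Lei et al. [2020a]; penalty values are written as negatives per
# the common convention and should be treated as assumptions.
REWARDS = {
    "rec_success": 1.0,   # successful recommendation
    "rec_fail": -0.1,     # failed recommendation
    "ask_success": 0.1,   # inquired attribute confirmed by the user
    "ask_fail": -0.1,     # inquired attribute dismissed by the user
    "quit": -0.3,         # user quits without a successful recommendation
}

def handcrafted_reward(event: str) -> float:
    """Map a conversation event to its handcrafted reward."""
    return REWARDS[event]

# Discounted return over an illustrative 3-turn episode, using the
# paper's discount factor gamma = 0.999.
gamma = 0.999
episode = ["ask_success", "ask_fail", "rec_success"]
ret = sum(gamma ** t * handcrafted_reward(e) for t, e in enumerate(episode))
print(round(ret, 4))  # 0.9981
```

In the paper's setup, CRSIRL replaces this fixed scheme with learned intrinsic rewards; the lookup above only illustrates what the baselines optimize against.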