Multi-Objective Intrinsic Reward Learning for Conversational Recommender Systems
Authors: Zhendong Chu, Nan Wang, Hongning Wang
NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To evaluate the effectiveness of our approach, we conduct extensive experiments on three public CRS benchmarks. The results show that our algorithm significantly improves CRS performance by exploiting informative learned intrinsic rewards. |
| Researcher Affiliation | Collaboration | Zhendong Chu, University of Virginia, zc9uy@virginia.edu, Charlottesville, VA, USA; Nan Wang, Netflix Inc., nanw@netflix.com, Los Gatos, CA, USA; Hongning Wang, University of Virginia, hw5x@virginia.edu, Charlottesville, VA, USA |
| Pseudocode | Yes | Algorithm 1: Optimization algorithm of CRSIRL |
| Open Source Code | No | No statement about open-sourcing the code or a link to a code repository was found. |
| Open Datasets | Yes | We evaluate CRSIRL on three multi-round CRS benchmarks [Lei et al., 2020a, Deng et al., 2021]. The LastFM dataset is for music artist recommendation; Lei et al. [2020a] manually grouped its original attributes into 33 coarse-grained attributes. The LastFM* dataset is the version where attributes are not grouped. The Yelp* dataset is for local business recommendation. We summarize their statistics in Table 1. |
| Dataset Splits | Yes | All datasets are split by a 7:1.5:1.5 ratio for training, validation and testing. (A minimal split sketch follows the table.) |
| Hardware Specification | Yes | All experiments are run on an NVIDIA GeForce RTX 3080Ti GPU with 12 GB memory. |
| Software Dependencies | No | The paper mentions 'Adam optimizer' and 'Transformer-based state encoder' but does not specify versions for any programming languages, libraries, or frameworks (e.g., Python, PyTorch, etc.). |
| Experiment Setup | Yes | The learning rates in the inner and outer loops are searched from {1e-5, 5e-5, 1e-4} with the Adam optimizer. The coefficient of the intrinsic reward λ is searched from {0.05, 0.1, 0.5, 1.0}. The discount factor γ is set to 0.999. All experiments are run on an NVIDIA GeForce RTX 3080Ti GPU with 12 GB memory. Since RL-based baselines rely on handcrafted rewards, we follow Lei et al. [2020a] and set (1) r_rec_suc = 1 for a successful recommendation; (2) r_rec_fail = -0.1 for a failed recommendation; (3) r_ask_suc = 0.1 when the inquired attribute is confirmed by the user; (4) r_ask_fail = -0.1 when the inquired attribute is dismissed by the user; (5) r_quit = -0.3 when the user quits the conversation without a successful recommendation. We set the maximum turn T to 15 and the size K of the recommendation list to 10. (A sketch of these handcrafted rewards follows the table.) |
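
The 7:1.5:1.5 split quoted in the Dataset Splits row is a plain 70/15/15 partition of the interaction data. Below is a minimal sketch of such a split; it is not the authors' code, and the random seed, the shuffling strategy, and the `split_interactions` helper are illustrative assumptions.

```python
# Hypothetical sketch of a 7:1.5:1.5 (70/15/15) train/validation/test split.
# Not the authors' code; the seed and shuffling strategy are assumptions.
import random

def split_interactions(interactions, seed=0):
    """Shuffle interactions and cut them into 70% / 15% / 15% portions."""
    rng = random.Random(seed)
    data = list(interactions)
    rng.shuffle(data)
    n_train = int(0.70 * len(data))
    n_valid = int(0.15 * len(data))
    train = data[:n_train]
    valid = data[n_train:n_train + n_valid]
    test = data[n_train + n_valid:]   # remaining ~15%
    return train, valid, test
```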
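
The handcrafted baseline rewards listed in the Experiment Setup row amount to a small lookup table over conversation events. The sketch below only restates those values; the event labels and the `handcrafted_reward` helper are hypothetical, and the negative signs follow the failure/quit conventions of Lei et al. [2020a].

```python
# Sketch of the handcrafted per-turn rewards used by the RL-based baselines,
# as quoted above (event names are hypothetical labels, not the paper's API).
HANDCRAFTED_REWARDS = {
    "rec_suc": 1.0,    # successful recommendation
    "rec_fail": -0.1,  # failed recommendation
    "ask_suc": 0.1,    # inquired attribute confirmed by the user
    "ask_fail": -0.1,  # inquired attribute dismissed by the user
    "quit": -0.3,      # user quits without a successful recommendation
}

def handcrafted_reward(event: str) -> float:
    """Return the handcrafted reward for a conversation event."""
    return HANDCRAFTED_REWARDS[event]
```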