Fine-Tuning Large Language Model Based Explainable Recommendation with Explainable Quality Reward

Authors: Mengyuan Yang, Mengying Zhu, Yan Wang, Linxun Chen, Yilei Zhao, Xiuyuan Wang, Bing Han, Xiaolin Zheng, Jianwei Yin

AAAI 2024

Reproducibility Assessment (Variable / Result / LLM Response)

Research Type
Result: Experimental
LLM Response: Extensive experiments conducted on three real-world datasets demonstrate that our model can generate fluent, diverse, informative, and highly personalized explanations.

Researcher Affiliation
Result: Collaboration
LLM Response: Mengyuan Yang (1), Mengying Zhu (1)*, Yan Wang (2), Linxun Chen (3), Yilei Zhao (1), Xiuyuan Wang (1), Bing Han (3), Xiaolin Zheng (1), Jianwei Yin (1); (1) Zhejiang University, China; (2) School of Computing, Macquarie University, Australia; (3) MYbank, Ant Group, China.

Pseudocode
Result: No
LLM Response: The paper includes an architecture diagram (Figure 2) and describes the training steps, but it does not contain any structured pseudocode or algorithm blocks.

Open Source Code
Result: No
LLM Response: The paper links a GitHub repository from a previous study (Li, Zhang, and Chen 2021), which is used to collect part of its dataset, but it provides no link or statement indicating that the source code for its own proposed method (LLM2ER-EQR) is publicly available.

Open Datasets
Result: Yes
LLM Response: To evaluate the effectiveness of LLM2ER-EQR, we adopt three benchmark recommendation datasets, which are publicly available, contain explainable content, and vary in domain, size, and sparsity. The three datasets are from Amazon (Movie & TV) [1], Yelp (2019) [2], and TripAdvisor [3], and their corresponding recommendation explanation data are collected from the GitHub repository [4] of (Li, Zhang, and Chen 2021).
[1] http://jmcauley.ucsd.edu/data/amazon
[2] https://www.yelp.com/dataset
[3] https://www.tripadvisor.com
[4] https://github.com/lileipisces/PETER

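For readers who want to inspect this data, the sketch below loads the explanation records. It assumes the repository distributes them as a pickled list of per-review dictionaries; the file name and the field names used here are assumptions, so check the repository's README for the actual format.

```python
import pickle

# Hypothetical file name; the PETER repository documents the real one.
with open("reviews.pickle", "rb") as f:
    reviews = pickle.load(f)

# Peek at a few records. The keys "user", "item", "rating", and "text"
# are assumed, not confirmed by the paper.
for record in reviews[:3]:
    print(record.get("user"), record.get("item"),
          record.get("rating"), record.get("text"))
```
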
Dataset Splits
Result: Yes
LLM Response: Following previous studies (Li, Zhang, and Chen 2020, 2021; Wang et al. 2023), each dataset is randomly divided into training, validation, and testing sets with a ratio of 8:1:1. We repeat all experiments 5 times independently, with each iteration involving a re-division of the dataset. The mean of test performance is reported.

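This protocol is easy to restate in code. Below is a minimal sketch of one 8:1:1 re-division per run, repeated five times with the mean reported; the split function and the evaluate placeholder are illustrative, not the authors' code.

```python
import random

def split_8_1_1(records, seed):
    """Randomly divide records into train/valid/test at an 8:1:1 ratio."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_train = int(0.8 * len(shuffled))
    n_valid = int(0.1 * len(shuffled))
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_valid],
            shuffled[n_train + n_valid:])

records = list(range(1000))  # stand-in for a real dataset
scores = []
for seed in range(5):  # five independent runs, each with a fresh re-division
    train, valid, test = split_8_1_1(records, seed)
    print(len(train), len(valid), len(test))  # 800 100 100
    # scores.append(evaluate(train, valid, test))  # hypothetical evaluation
if scores:
    print(sum(scores) / len(scores))  # mean test performance
```
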
Hardware Specification
Result: No
LLM Response: The paper does not provide specific details about the hardware used to run its experiments, such as GPU models, CPU types, or memory specifications.

Software Dependencies
Result: No
LLM Response: The paper mentions using BERT (Kenton and Toutanova 2019) and a pretrained causal language model (e.g., GPT-2), but it does not specify version numbers for these or for other key software dependencies required for reproducibility.

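As an illustration of the missing information, a reproduction would at minimum need pinned versions of the deep-learning stack. The package names below are assumptions based on the models the paper names (BERT and GPT-2 are commonly used via Hugging Face Transformers); the snippet simply records whatever versions are installed.

```python
import importlib.metadata as metadata

# Hypothetical dependency list; the paper does not name its actual stack.
for pkg in ("torch", "transformers"):
    try:
        print(pkg, metadata.version(pkg))
    except metadata.PackageNotFoundError:
        print(pkg, "not installed")
```
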
Experiment Setup
Result: No
LLM Response: The paper describes the overall training process, including balancing coefficients for the different loss terms (λr, λe, λc, λd, λCCR, λHQAR) and an empirically set threshold δ for HQAR. However, it does not give concrete numerical values for these hyperparameters (e.g., learning rates, batch sizes, number of epochs, optimizer details, or the actual values of the lambdas and δ), which are crucial for reproducing the experimental setup.
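
For context, the coefficients suggest a weighted combination of objectives. The composition below is an assumption inferred from the listed symbols, not an equation taken from the paper, and the threshold δ is assumed to gate when the HQAR term is applied:

```latex
\mathcal{L}_{\mathrm{total}}
  = \lambda_r \mathcal{L}_r
  + \lambda_e \mathcal{L}_e
  + \lambda_c \mathcal{L}_c
  + \lambda_d \mathcal{L}_d
  + \lambda_{\mathrm{CCR}} \mathcal{L}_{\mathrm{CCR}}
  + \lambda_{\mathrm{HQAR}} \mathcal{L}_{\mathrm{HQAR}}
```

Without the numerical values of the lambdas and δ, this objective cannot be re-instantiated, which is the reproducibility gap the assessment flags.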