Personalizing a Dialogue System With Transfer Reinforcement Learning

Authors: Kaixiang Mo, Yu Zhang, Shuangyin Li, Jiajun Li, Qiang Yang

AAAI 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results on real-world coffee-shopping data and simulation data show that the proposed PETAL system can learn optimal policies for different users and thus effectively improve dialogue quality under the personalized setting.
Researcher Affiliation | Academia | Kaixiang Mo, Yu Zhang, Shuangyin Li, Jiajun Li, Qiang Yang; Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Hong Kong, China; {kxmo, yuzhangcse, shuangyinli, qyang}@cse.ust.hk; jiajun.li@alumni.ust.hk
Pseudocode | Yes | Algorithm 1: The PETAL Algorithm
Open Source Code | No | The paper contains no explicit statement about releasing source code for the described methodology and no link to a code repository.
Open Datasets | No | The dataset, collected between July 2015 and April 2016 from an O2O coffee-ordering service on a major instant-messaging platform in China, contains 2,185 coffee dialogues between 72 consumers and coffee makers. The popular DSTC datasets do not include personalized preferences and thus could not be used in this paper.
Dataset Splits | No | The 221 earlier dialogues in the target domain are used as the training set and the remaining 108 dialogues form the test set (a chronological-split sketch follows the table).
Hardware Specification | No | The paper does not provide any specific details about the hardware used for running the experiments.
Software Dependencies | No | The paper mentions algorithms and methods (e.g., an online stochastic gradient descent algorithm, the State-Action-Reward-State-Action (SARSA) algorithm, and the word2vec method) but does not specify any software dependencies with version numbers.
Experiment Setup | Yes | We adopt an online stochastic gradient descent algorithm (Bottou 2010) with a learning rate of 0.0001 to optimize our model. Specifically, we use the State-Action-Reward-State-Action (SARSA) algorithm. In the on-policy training with the simulation, the model has a decreasing probability η = 0.2·e^(−β/1000) of choosing a random reply from the candidate set so as to ensure sufficient exploration, where β is the number of training dialogues seen by the algorithm (a training-loop sketch follows the table).
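
The Dataset Splits row describes a purely chronological split of the 329 target-domain dialogues: the 221 earliest dialogues for training and the 108 most recent for testing. Below is a minimal sketch of such a split; the dialogue record layout and the "timestamp" field name are assumptions for illustration, not details taken from the paper.

```python
# Minimal sketch of a chronological train/test split.
# Assumption: each dialogue is a dict carrying a "timestamp" field
# (the field name and record layout are not specified in the paper).
def chronological_split(dialogues, n_train=221):
    """Return (train, test): the n_train earliest dialogues and the rest."""
    ordered = sorted(dialogues, key=lambda d: d["timestamp"])
    return ordered[:n_train], ordered[n_train:]

# For the 329 target-domain dialogues this yields 221 training dialogues
# and 108 test dialogues, matching the split reported above.
# train_set, test_set = chronological_split(target_domain_dialogues)
```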
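
The Experiment Setup row reports on-policy SARSA training, optimized with online SGD at a learning rate of 0.0001, and an exploration probability that decays as η = 0.2·e^(−β/1000) with the number of training dialogues β. The sketch below illustrates that exploration schedule and a generic SARSA temporal-difference update over a candidate reply set; the Q-function interface, the discount factor, and all names are assumptions for illustration, not the authors' PETAL implementation.

```python
import math
import random

LEARNING_RATE = 1e-4   # SGD learning rate reported in the paper
GAMMA = 0.99           # discount factor (assumed; not given in this excerpt)

def exploration_prob(beta):
    """Decaying exploration rate eta = 0.2 * exp(-beta / 1000),
    where beta is the number of training dialogues seen so far."""
    return 0.2 * math.exp(-beta / 1000.0)

def choose_reply(q, state, candidates, beta):
    """Epsilon-greedy selection from the candidate reply set."""
    if random.random() < exploration_prob(beta):
        return random.choice(candidates)               # explore
    return max(candidates, key=lambda a: q(state, a))  # exploit

def sarsa_delta(q, s, a, r, s_next, a_next):
    """On-policy SARSA temporal-difference error for one transition."""
    return r + GAMMA * q(s_next, a_next) - q(s, a)

# A parameter update would then scale the gradient of q(s, a) by
# LEARNING_RATE * sarsa_delta(...), i.e., plain online SGD.
```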