Personalizing a Dialogue System With Transfer Reinforcement Learning

Authors: Kaixiang Mo, Yu Zhang, Shuangyin Li, Jiajun Li, Qiang Yang

AAAI 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results on real-world coffee-shopping data and simulation data show that the proposed PETAL system can learn optimal policies for different users and thus effectively improve dialogue quality under the personalized setting.
Researcher Affiliation | Academia | Kaixiang Mo, Yu Zhang, Shuangyin Li, Jiajun Li, Qiang Yang; Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Hong Kong, China; {kxmo, yuzhangcse, shuangyinli, qyang}@cse.ust.hk; jiajun.li@alumni.ust.hk
Pseudocode | Yes | Algorithm 1: The PETAL Algorithm
Open Source Code | No | The paper contains no explicit statement about releasing source code for the described methodology and no link to a code repository.
Open Datasets | No | The dataset, collected between July 2015 and April 2016 from an O2O coffee-ordering service on a major instant-messaging platform in China, contains 2,185 coffee dialogues between 72 consumers and coffee makers. The popular DSTC datasets do not include personalized preferences and thus could not be used in this paper.
Dataset Splits | No | The 221 earlier dialogues in the target domain are used as the training set and the remaining 108 dialogues form the test set (a chronological-split sketch follows the table).
Hardware Specification | No | The paper does not provide any specific details about the hardware used for running the experiments.
Software Dependencies | No | The paper mentions algorithms and methods (e.g., an online stochastic gradient descent algorithm, the State-Action-Reward-State-Action (SARSA) algorithm, and the word2vec method) but does not specify any software dependencies with version numbers.
Experiment Setup | Yes | We adopt an online stochastic gradient descent algorithm (Bottou 2010) with a learning rate of 0.0001 to optimize our model. Specifically, we use the State-Action-Reward-State-Action (SARSA) algorithm. In the on-policy training with the simulation, the model has a decreasing probability η = 0.2·e^(−β/1000) of choosing a random reply from the candidate set so as to ensure sufficient exploration, where β is the number of training dialogues seen by the algorithm (a training-loop sketch follows the table).
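
The Dataset Splits row describes a purely chronological split of the 329 target-domain dialogues: the 221 earliest dialogues for training and the 108 most recent for testing. Below is a minimal sketch of such a split; the dialogue record layout and the "timestamp" field name are assumptions for illustration, not details taken from the paper.

```python
# Minimal sketch of a chronological train/test split.
# Assumption: each dialogue is a dict carrying a "timestamp" field
# (the field name and record layout are not specified in the paper).
def chronological_split(dialogues, n_train=221):
    """Return (train, test): the n_train earliest dialogues and the rest."""
    ordered = sorted(dialogues, key=lambda d: d["timestamp"])
    return ordered[:n_train], ordered[n_train:]

# For the 329 target-domain dialogues this yields 221 training dialogues
# and 108 test dialogues, matching the split reported above.
# train_set, test_set = chronological_split(target_domain_dialogues)
```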
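
The Experiment Setup row reports on-policy SARSA training, optimized with online SGD at a learning rate of 0.0001, and an exploration probability that decays as η = 0.2·e^(−β/1000) with the number of training dialogues β. The sketch below illustrates that exploration schedule and a generic SARSA temporal-difference update over a candidate reply set; the Q-function interface, the discount factor, and all names are assumptions for illustration, not the authors' PETAL implementation.

```python
import math
import random

LEARNING_RATE = 1e-4   # SGD learning rate reported in the paper
GAMMA = 0.99           # discount factor (assumed; not given in this excerpt)

def exploration_prob(beta):
    """Decaying exploration rate eta = 0.2 * exp(-beta / 1000),
    where beta is the number of training dialogues seen so far."""
    return 0.2 * math.exp(-beta / 1000.0)

def choose_reply(q, state, candidates, beta):
    """Epsilon-greedy selection from the candidate reply set."""
    if random.random() < exploration_prob(beta):
        return random.choice(candidates)               # explore
    return max(candidates, key=lambda a: q(state, a))  # exploit

def sarsa_delta(q, s, a, r, s_next, a_next):
    """On-policy SARSA temporal-difference error for one transition."""
    return r + GAMMA * q(s_next, a_next) - q(s, a)

# A parameter update would then scale the gradient of q(s, a) by
# LEARNING_RATE * sarsa_delta(...), i.e., plain online SGD.
```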