Reinforcement Learning for Turn-Taking Management in Incremental Spoken Dialogue Systems

Authors: Hatim Khouzaimi, Romain Laroche, Fabrice Lefèvre

IJCAI 2016 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this article, reinforcement learning is used to learn an optimal turn-taking strategy for vocal human-machine dialogue. The Orange Labs Majordomo dialogue system, which allows users to have conversations within a smart home, has been upgraded to an incremental version. First, a user simulator is built in order to generate a dialogue corpus, which is thereafter used to optimise the turn-taking strategy from delayed rewards with the Fitted-Q reinforcement learning algorithm. Real users test and evaluate the newly learnt strategy against a non-incremental and a handcrafted incremental strategy. The data-driven strategy is shown to significantly improve the task completion ratio and to be preferred by the users according to subjective metrics.
Researcher Affiliation | Collaboration | Hatim Khouzaimi, Orange Labs, CERI-LIA, hatim.khouzaimi@gmail.com; Romain Laroche, Orange Labs, Châtillon, France, romain.laroche@orange.com; Fabrice Lefèvre, CERI-LIA, Avignon, France, fabrice.lefevre@univ-avignon.fr
Pseudocode | No | The paper describes the Fitted-Q algorithm but does not include a clearly labeled pseudocode or algorithm block.
Open Source Code | No | The paper does not include any statement or link indicating that the source code for the described methodology is publicly available.
Open Datasets | No | "First, a user simulator is built in order to generate a dialogue corpus which thereafter is used to optimise the turn-taking strategy from delayed rewards with the Fitted-Q reinforcement learning algorithm. For more details, the reader can check [Khouzaimi et al., 2016]." The paper does not provide concrete access information (link, DOI, repository, or formal citation) for this or any other training dataset.
Dataset Splits | No | The paper describes the training process with exploration and exploitation phases, and a final evaluation phase, but it does not specify a distinct validation dataset split in the conventional sense (e.g., for hyperparameter tuning separate from training and testing).
Hardware Specification | No | The paper does not specify any hardware details such as GPU/CPU models, memory, or specific computing environments used for the experiments.
Software Dependencies | No | The paper mentions "Google's solution" for ASR and TTS, and "Kaldi [Povey et al., 2011]" as an alternative, but it does not provide specific version numbers for these or any other software libraries or dependencies used in the experiments.
Experiment Setup | Yes | This model was trained in simulation (on 3 different dialogue scenarios with WER = 0.15 and γ = 0.99), and after each 500 new episodes, Fitted-Q was run on the whole collected batch in order to update the θ_i(a) parameters (Equation 3). The Scheduler learns for 2500 episodes. During the first 500 episodes, the Scheduler randomly picks the action WAIT 90% of the time and the action SPEAK with a 10% probability (if no bias is introduced in favour of the WAIT action, the Scheduler interrupts the user too often, which hurts the exploration process); between episodes 500 and 2500, the Scheduler is greedy with a 0.9 probability and chooses randomly between the two actions the rest of the time (this time, they are picked uniformly).
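Since the paper provides no pseudocode for the Fitted-Q step, the following is a minimal sketch of batch Fitted-Q with one linear Q-function per Scheduler action (WAIT / SPEAK) and γ = 0.99, in line with the Experiment Setup row above. The feature representation, the transition-tuple format, the iteration count, and the ridge regulariser are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch (assumed details marked below), not the paper's code.
import numpy as np

ACTIONS = ["WAIT", "SPEAK"]   # the Scheduler's two turn-taking actions
GAMMA = 0.99                  # discount factor reported in the paper

def fitted_q(transitions, n_features, n_iterations=50, ridge=1e-3):
    """transitions: list of (phi_s, action, reward, phi_s_next, done) tuples (assumed format)."""
    # One weight vector theta(a) per action: Q(s, a) = theta(a) . phi(s)
    theta = {a: np.zeros(n_features) for a in ACTIONS}

    for _ in range(n_iterations):
        # Build regression targets r + gamma * max_a' Q(s', a') from the whole batch
        X = {a: [] for a in ACTIONS}
        y = {a: [] for a in ACTIONS}
        for phi_s, a, r, phi_s_next, done in transitions:
            q_next = 0.0 if done else max(theta[b] @ phi_s_next for b in ACTIONS)
            X[a].append(phi_s)
            y[a].append(r + GAMMA * q_next)

        # Refit theta(a) for each action by ridge regression on its own samples
        for a in ACTIONS:
            if not X[a]:
                continue
            Xa, ya = np.asarray(X[a]), np.asarray(y[a])
            A = Xa.T @ Xa + ridge * np.eye(n_features)
            theta[a] = np.linalg.solve(A, Xa.T @ ya)
    return theta
```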
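The training schedule quoted in the Experiment Setup row (2500 simulated episodes, a WAIT-biased random policy for the first 500 episodes, then a 0.9-greedy policy, with Fitted-Q refit on the whole batch every 500 episodes) can be sketched as follows. The simulator interface (reset/step), the feature extractor, and the reuse of the fitted_q function above are assumptions made only to keep the sketch self-contained.

```python
# Sketch of the training schedule under assumed simulator and feature APIs.
import random
import numpy as np

def act(theta, phi_s, episode):
    if episode < 500:
        # Exploration phase biased towards WAIT to avoid interrupting the user too often
        return "WAIT" if random.random() < 0.9 else "SPEAK"
    if random.random() < 0.9:
        # Greedy with respect to the current linear Q-function
        return max(ACTIONS, key=lambda a: theta[a] @ phi_s)
    return random.choice(ACTIONS)  # uniform exploration the rest of the time

def train(simulator, featurize, n_features):
    batch = []
    theta = {a: np.zeros(n_features) for a in ACTIONS}
    for episode in range(2500):
        s, done = simulator.reset(), False       # assumed simulator API
        while not done:
            phi_s = featurize(s)
            a = act(theta, phi_s, episode)
            s_next, r, done = simulator.step(a)  # assumed simulator API
            batch.append((phi_s, a, r, featurize(s_next), done))
            s = s_next
        if (episode + 1) % 500 == 0:
            theta = fitted_q(batch, n_features)  # refit on the whole collected batch
    return theta
```

Refitting on the entire collected batch every 500 episodes mirrors the paper's batch (rather than online) use of Fitted-Q as described in the Experiment Setup row.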