Automatic Curriculum Learning With Over-repetition Penalty for Dialogue Policy Learning

Authors: Yangyang Zhao, Zhenyu Wang, Zhenhua Huang (pp. 14540-14548)

AAAI 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments show that ACL-DQN improves the effectiveness and stability of dialogue tasks by a statistically significant margin. Furthermore, the framework can be further improved by equipping it with different curriculum schedules, which demonstrates its strong generalizability.
Researcher Affiliation | Academia | Yangyang Zhao, Zhenyu Wang*, Zhenhua Huang, School of Software Engineering, South China University of Technology; msyyz@mail.scut.edu.cn, wangzy@scut.edu.cn, sezhhuangscut@mail.scut.edu.cn
Pseudocode | Yes | Algorithm 1: ACL-DQN with curriculum schedule A; Algorithm 2: ACL-DQN with curriculum schedule B; Algorithm 3: ACL-DQN with curriculum schedule C. (An illustrative, non-authoritative sketch of a curriculum-scheduled training loop follows this table.)
Open Source Code | No | The paper does not contain an explicit statement about the release of source code or a link to a code repository for the methodology described.
Open Datasets | No | Our ACL-DQN was evaluated on movie-booking tasks in both simulation and human-in-the-loop settings. Raw conversational data in the movie-ticket booking task was collected via Amazon Mechanical Turk with annotations provided by domain experts. The annotated data consists of 11 dialogue acts and 29 slots. In total, the dataset contains 280 annotated dialogues, the average length of which is approximately 11 turns. The paper describes the dataset but does not provide any concrete access information (link, DOI, repository, or citation to a public dataset).
Dataset Splits | No | The paper mentions testing on a certain number of dialogues ('each run tested on 50 dialogues') but does not specify explicit training, validation, or test dataset splits (e.g., percentages or absolute counts for each split).
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running the experiments.
Software Dependencies | No | The paper mentions using MLPs and DQN but does not specify version numbers for any software dependencies or libraries (e.g., Python version, TensorFlow/PyTorch version).
Experiment Setup | Yes | For all the models, we use MLPs to parameterize the value networks Q(·) with one hidden layer of size 80 and tanh activation. ϵ-greedy is always applied for exploration. We set the discount factor γ = 0.9. The buffer size of D_T and D_S is set to 2000 and 5000, respectively. The batch size is 16, and the learning rate is 0.001. We applied gradient clipping on all the model parameters with a maximum norm of 1 to prevent gradient explosion. The target network is updated at the beginning of each training episode. The maximum length of a simulated dialogue is 40 turns. (A configuration sketch of these hyperparameters follows this table.)
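
For orientation only, below is a minimal, hypothetical sketch of a curriculum-scheduled training loop. It is not the paper's Algorithms 1-3: the learning-progress signal, the over-repetition penalty weight, and the rollout stub are all assumptions introduced here for illustration.

```python
import random
from collections import defaultdict


class CurriculumTeacher:
    """Illustrative teacher that picks the next user goal for the student agent.

    Assumption: goals are scored by a moving average of recent success (a crude
    learning-progress proxy) minus a penalty for goals sampled too often
    ("over-repetition"). The paper's Algorithms 1-3 define the real schedules.
    """

    def __init__(self, goals, penalty_weight=0.1):
        self.goals = list(goals)
        self.counts = defaultdict(int)      # how often each goal has been sampled
        self.progress = defaultdict(float)  # assumed learning-progress estimate per goal
        self.penalty_weight = penalty_weight

    def sample_goal(self):
        # Prefer goals with high estimated progress, discounted by repetition count.
        def score(goal):
            return self.progress[goal] - self.penalty_weight * self.counts[goal]
        goal = max(self.goals, key=score)
        self.counts[goal] += 1
        return goal

    def update(self, goal, success):
        # Exponential moving average of success as the progress signal (assumption).
        self.progress[goal] = 0.9 * self.progress[goal] + 0.1 * float(success)


if __name__ == "__main__":
    teacher = CurriculumTeacher(goals=["goal_easy", "goal_medium", "goal_hard"])
    for episode in range(20):
        goal = teacher.sample_goal()
        # Stand-in for a real agent-simulator rollout plus a DQN update.
        success = random.random() < 0.5
        teacher.update(goal, success)
    print(dict(teacher.counts))
```

Only the interface (sample a goal, observe success, update the teacher) is meant to carry over; the actual scheduling rule and over-repetition penalty of ACL-DQN should be taken from the paper.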
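
The hyperparameters in the Experiment Setup row translate directly into a DQN value-network configuration. The sketch below fixes the reported values; the framework (PyTorch here), the optimizer (Adam), and the state/action dimensions are assumptions, since the paper does not name them.

```python
import copy

import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM = 270, 40  # assumed dimensions; not reported in the paper

# Value network Q(·): one hidden layer of size 80 with tanh activation, as reported.
q_net = nn.Sequential(
    nn.Linear(STATE_DIM, 80),
    nn.Tanh(),
    nn.Linear(80, ACTION_DIM),
)
# Target network, synced from q_net at the beginning of each training episode.
target_net = copy.deepcopy(q_net)

# Reported training hyperparameters.
GAMMA = 0.9            # discount factor
BATCH_SIZE = 16
LEARNING_RATE = 1e-3
MAX_GRAD_NORM = 1.0    # gradient clipping with a maximum norm of 1
BUFFER_SIZE_DT = 2000  # replay buffer D_T
BUFFER_SIZE_DS = 5000  # replay buffer D_S
MAX_TURNS = 40         # maximum length of a simulated dialogue

# Optimizer choice is an assumption; the paper only reports the learning rate.
optimizer = torch.optim.Adam(q_net.parameters(), lr=LEARNING_RATE)


def dqn_update(states, actions, rewards, next_states, dones):
    """One standard DQN gradient step on a sampled minibatch (shown for clarity)."""
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        targets = rewards + GAMMA * (1.0 - dones) * next_q
    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    loss = nn.functional.mse_loss(q_values, targets)
    optimizer.zero_grad()
    loss.backward()
    nn.utils.clip_grad_norm_(q_net.parameters(), MAX_GRAD_NORM)
    optimizer.step()
    return loss.item()
```

ϵ-greedy action selection and the target-network sync (e.g. target_net.load_state_dict(q_net.state_dict()) at the start of each training episode) belong to the surrounding training loop, which is omitted here.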