A Student-Teacher Architecture for Dialog Domain Adaptation Under the Meta-Learning Setting

Authors: Kun Qian, Wei Wei, Zhou Yu (pp. 13692-13700)

AAAI 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our model on two multi-domain datasets, MultiWOZ and Google Schema-Guided Dialogue, and achieve state-of-the-art performance. Experimental results show that our model is effective in extracting domain-specific features and achieves a better domain adaptation performance.
Researcher Affiliation | Collaboration | 1) University of California, Davis; 2) Google Inc.
Pseudocode | Yes | Algorithm DAST
Open Source Code | No | We will release the code base upon acceptance.
Open Datasets | Yes | We evaluate our model on two multi-domain datasets, MultiWOZ (Budzianowski et al. 2018) and Schema-Guided Dataset (Rastogi et al. 2019).
Dataset Splits | Yes | For the adaptation, we randomly choose nine dialogs (2% of source domain) in the target domain as adaptation data and leave the rest for testing. The learning rate decays by half if no improvement is observed on validation data for 3 successive epochs and the training process would stop early when no improvement is observed on validation data for 5 successive epochs.
Hardware Specification | No | The paper does not explicitly describe the specific hardware used (e.g., GPU/CPU models, memory) to run its experiments.
Software Dependencies | No | The paper mentions software components and algorithms such as GloVe, Adam, GRU, and Transformer, but does not provide specific version numbers for any of them.
Experiment Setup | Yes | We adopt GloVe (Pennington, Socher, and Manning 2014) as the initialized value for word embeddings, with an embedding size of 50. For the student model, each GRU in the encoders and decoders contains one layer and the hidden size is set to 100. Furthermore, the GRUs of the two encoders are bi-directional. As for the teacher model, it contains 2 self-attention layers with 5 heads each. We use Adam (Kingma and Ba 2014) for optimization and set an initial learning rate of 0.005 for both the student and teacher models, as well as for the meta optimizer. The learning rate decays by half if no improvement is observed on validation data for 3 successive epochs, and training stops early when no improvement is observed on validation data for 5 successive epochs. We adopt batch normalization (Ioffe and Szegedy 2015) and use a batch size of 32.
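
The Experiment Setup row above lists every hyperparameter the paper reports. For concreteness, the following is a minimal sketch of how that configuration could be instantiated, assuming PyTorch; the module variables and the placeholder vocabulary size are illustrative and do not come from the authors' (unreleased) code.

```python
import torch
import torch.nn as nn

EMB_SIZE = 50       # GloVe-initialized word embeddings, dimension 50
HIDDEN_SIZE = 100   # GRU hidden size for the student model
VOCAB_SIZE = 10000  # placeholder; the paper does not report a vocabulary size

# Student model: one-layer bi-directional GRU encoder over GloVe embeddings.
student_embedding = nn.Embedding(VOCAB_SIZE, EMB_SIZE)  # would be loaded from GloVe vectors
student_encoder = nn.GRU(
    input_size=EMB_SIZE,
    hidden_size=HIDDEN_SIZE,
    num_layers=1,
    bidirectional=True,
    batch_first=True,
)

# Teacher model: 2 self-attention layers with 5 heads each
# (d_model = 50 matches the embedding size and is divisible by 5 heads).
teacher_layer = nn.TransformerEncoderLayer(d_model=EMB_SIZE, nhead=5, batch_first=True)
teacher_encoder = nn.TransformerEncoder(teacher_layer, num_layers=2)

# Adam with an initial learning rate of 0.005; the paper uses the same
# setting for the student, the teacher, and the meta optimizer.
params = (
    list(student_embedding.parameters())
    + list(student_encoder.parameters())
    + list(teacher_encoder.parameters())
)
optimizer = torch.optim.Adam(params, lr=0.005)
```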
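Similarly, the learning-rate schedule quoted in the Dataset Splits and Experiment Setup rows (halve the rate after a 3-epoch validation plateau, stop after 5 epochs without improvement) maps onto a standard plateau scheduler plus an early-stopping counter. A minimal, self-contained sketch under the same PyTorch assumption; the dummy parameter and the evaluate_on_validation stub are hypothetical stand-ins for the actual model and validation metric.

```python
import torch
from torch.optim.lr_scheduler import ReduceLROnPlateau


def evaluate_on_validation() -> float:
    """Hypothetical stand-in for computing the validation loss each epoch."""
    return 1.0


# A single dummy parameter stands in for the full student/teacher parameters.
dummy_param = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.Adam([dummy_param], lr=0.005)

# Halve the learning rate after 3 successive epochs without validation improvement.
scheduler = ReduceLROnPlateau(optimizer, mode="min", factor=0.5, patience=3)

best_val_loss = float("inf")
epochs_without_improvement = 0

for epoch in range(100):  # upper bound on epochs; not reported in the paper
    # ... one epoch of training on batches of size 32 would go here ...
    val_loss = evaluate_on_validation()

    scheduler.step(val_loss)  # applies the decay once the 3-epoch plateau is hit

    if val_loss < best_val_loss:
        best_val_loss = val_loss
        epochs_without_improvement = 0
    else:
        epochs_without_improvement += 1

    if epochs_without_improvement >= 5:  # early stopping after 5 flat epochs
        break
```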