A Student-Teacher Architecture for Dialog Domain Adaptation Under the Meta-Learning Setting
Authors: Kun Qian, Wei Wei, Zhou Yu
AAAI 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our model on two multi-domain datasets, MultiWOZ and Google Schema-Guided Dialogue, and achieve state-of-the-art performance. Experimental results show that our model is effective in extracting domain-specific features and achieves a better domain adaptation performance. |
| Researcher Affiliation | Collaboration | 1 University of California, Davis; 2 Google Inc. |
| Pseudocode | Yes | Algorithm DAST |
| Open Source Code | No | We will release the code base upon acceptance. |
| Open Datasets | Yes | We evaluate our model on two multi-domain datasets, MultiWOZ (Budzianowski et al. 2018) and Schema-Guided Dataset (Rastogi et al. 2019). |
| Dataset Splits | Yes | For the adaptation, we randomly choose nine dialogs (2% of source domain) in the target domain as adaptation data and leave the rest for testing. The learning rate decays by half if no improvement is observed on validation data for 3 successive epochs and the training process would stop early when no improvement is observed on validation data for 5 successive epochs. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware used (e.g., GPU/CPU models, memory) to run its experiments. |
| Software Dependencies | No | The paper mentions software components and algorithms like GloVe, Adam, GRU, and Transformer, but does not provide specific version numbers for any of them. |
| Experiment Setup | Yes | We adopt GloVe (Pennington, Socher, and Manning 2014) as the initialized value for word embeddings, with an embedding size of 50. For the student model, each GRU from encoders and decoders contains one layer and the hidden size is set as 100. Furthermore, the GRU models of two encoders are bi-directional. As for the teacher model, it contains 2 self-attention layers with 5 heads for each. We use Adam (Kingma and Ba 2014) for optimization and set an initialized learning rate as 0.005 for both student and teacher model, as well as the meta optimizer. The learning rate decays by half if no improvement is observed on validation data for 3 successive epochs and the training process would stop early when no improvement is observed on validation data for 5 successive epochs. We adopt the batch normalization (Ioffe and Szegedy 2015) and use a batch size of 32. |
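The Dataset Splits row above notes that nine target-domain dialogs (about 2% of a source domain) are held out for adaptation and the remainder is used for testing. Below is a minimal sketch of such a split; the function name `split_target_domain`, the seed handling, and the list-of-dialogs input format are assumptions for illustration, not the authors' released code.

```python
import random

def split_target_domain(dialogs, n_adapt=9, seed=0):
    """Hypothetical helper: hold out a small adaptation set
    (nine dialogs, ~2% of a source domain's size) and test on the rest."""
    rng = random.Random(seed)
    indices = list(range(len(dialogs)))
    rng.shuffle(indices)
    adapt = [dialogs[i] for i in indices[:n_adapt]]
    test = [dialogs[i] for i in indices[n_adapt:]]
    return adapt, test
```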
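The Experiment Setup row lists concrete hyperparameters (50-dimensional GloVe embeddings, one-layer bi-directional GRU encoders with hidden size 100, a 2-layer/5-head self-attention teacher, Adam at 0.005, learning-rate halving after 3 stagnant validation epochs, batch size 32). The PyTorch sketch below wires those numbers together so they are easy to check; the module names (`context_encoder`, `teacher`), the choice of PyTorch itself, and the assumption that the teacher's model dimension equals the embedding size are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn

EMB_DIM, HIDDEN_DIM, BATCH_SIZE = 50, 100, 32  # values reported in the table above

# Student side: a one-layer bi-directional GRU encoder with hidden size 100
# (whether 100 is per direction is not stated; assumed here).
context_encoder = nn.GRU(input_size=EMB_DIM, hidden_size=HIDDEN_DIM,
                         num_layers=1, bidirectional=True, batch_first=True)

# Teacher side: 2 self-attention layers with 5 heads each.
# d_model = 50 is an assumption so that it divides evenly by the 5 heads.
teacher_layer = nn.TransformerEncoderLayer(d_model=EMB_DIM, nhead=5, batch_first=True)
teacher = nn.TransformerEncoder(teacher_layer, num_layers=2)

# Adam with an initial learning rate of 0.005; the paper applies the same rate
# to the student, the teacher, and the meta optimizer.
optimizer = torch.optim.Adam(
    list(context_encoder.parameters()) + list(teacher.parameters()), lr=0.005)

# Halve the learning rate after 3 epochs without validation improvement;
# early stopping after 5 such epochs would be handled in the training loop.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=3)
```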