MDD-Eval: Self-Training on Augmented Data for Multi-Domain Dialogue Evaluation
Authors: Chen Zhang, Luis Fernando D'Haro, Thomas Friedrichs, Haizhou Li (pp. 11657-11666)
AAAI 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | MDD-Eval is extensively assessed on six dialogue evaluation benchmarks. Empirical results show that the MDD-Eval framework achieves a strong performance with an absolute improvement of 7% over the state-of-the-art ADMs in terms of mean Spearman correlation scores across all the evaluation benchmarks. (A brief illustration of the mean-Spearman aggregation appears after this table.) |
| Researcher Affiliation | Collaboration | Chen Zhang (1,2), Luis Fernando D'Haro (3), Thomas Friedrichs (2), Haizhou Li (1,4,5); 1 National University of Singapore, Singapore; 2 Robert Bosch (SEA), Singapore; 3 Universidad Politécnica de Madrid, Spain; 4 Kriston AI Lab, China; 5 The Chinese University of Hong Kong (Shenzhen), China |
| Pseudocode | No | The paper does not contain any explicitly labeled pseudocode or algorithm blocks. It describes the methodology in prose and mathematical equations. |
| Open Source Code | Yes | MDD-Data, MDD-Eval implementation, and pretrained checkpoints will be released to the public (https://github.com/e0397123/MDD-Eval). |
| Open Datasets | Yes | We make use of four publicly-available, high-quality and human-written conversation corpora to form a multi-domain synthetic dataset: DailyDialog (Li et al. 2017), ConvAI2 (Dinan et al. 2020), EmpatheticDialogues (Rashkin et al. 2019) and Topical-Chat (Gopalakrishnan et al. 2019). |
| Dataset Splits | Yes | We only use the training and validation splits of the dialogue corpora since some dialogue contexts in the evaluation benchmarks are sampled from their test sets. ... The detailed statistics of the four dialogue corpora are presented in Table 2, which lists the training and validation splits of DailyDialog, EmpatheticDialogues, ConvAI2 and Topical-Chat. |
| Hardware Specification | No | The paper does not specify any particular GPU models, CPU models, or other specific hardware specifications used for running its experiments. It only mentions the use of RoBERTa-Large as the model architecture. |
| Software Dependencies | No | The paper mentions RoBERTa-Large (Liu et al. 2019) and BERT (Devlin et al. 2019) as backbone models and frameworks like ILM (Donahue, Lee, and Liang 2020) but does not provide specific version numbers for software dependencies such as Python, PyTorch, TensorFlow, or other libraries. |
| Experiment Setup | Yes | We choose RoBERTa-Large (Liu et al. 2019) for both the teacher and the student model in MDD-Eval. ... A confidence threshold of 70% is applied to exclude pairs classified by M_teacher with low confidence. ... L_CE is the cross-entropy loss, L_KL is the KL divergence and L_MLM is the self-supervised masked language modeling (MLM) loss. ... In our experiment, for quick turn-around, we sub-sample 600K context-response pairs from MDD-Data to train the final student model. (A hedged sketch of this training objective appears after this table.) |
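The Research Type row above reports results as the mean Spearman correlation between automatic metric scores and human judgements, averaged over the evaluation benchmarks. The snippet below is a minimal illustration of that aggregation only; it is not code from the paper or its repository, and the benchmark names and score arrays are hypothetical placeholders.

```python
# Illustration of the reported aggregate: Spearman correlation between metric
# scores and human ratings per benchmark, then averaged across benchmarks.
# Benchmark names and score arrays below are hypothetical placeholders.
import numpy as np
from scipy.stats import spearmanr

benchmark_scores = {
    # benchmark_name: (metric_scores, human_ratings)
    "benchmark_a": ([0.8, 0.2, 0.5, 0.9], [4, 1, 3, 5]),
    "benchmark_b": ([0.3, 0.7, 0.6, 0.1], [2, 4, 3, 1]),
}

per_benchmark = {}
for name, (metric, human) in benchmark_scores.items():
    rho, _pvalue = spearmanr(metric, human)  # rank correlation on one benchmark
    per_benchmark[name] = rho

mean_spearman = float(np.mean(list(per_benchmark.values())))
print(per_benchmark, mean_spearman)
```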
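The Experiment Setup row quotes a student objective composed of a cross-entropy term L_CE on teacher pseudo-labels, a KL-divergence term L_KL against the teacher distribution, and a masked-language-modeling term L_MLM, together with a 70% teacher-confidence filter. Below is a minimal PyTorch sketch of how such an objective could be assembled; the tensor names, the equal weighting of the three terms, and applying the MLM loss to the whole batch are assumptions made for illustration, not the authors' released implementation (see the repository linked above for that).

```python
# Hedged sketch of a self-training objective L = L_CE + L_KL + L_MLM with a
# teacher-confidence filter, as described in the Experiment Setup row.
# Shapes, names, and the equal term weights are illustrative assumptions.
import torch
import torch.nn.functional as F

CONFIDENCE_THRESHOLD = 0.70  # discard pairs the teacher labels with low confidence


def student_loss(student_logits: torch.Tensor,   # (batch, num_classes)
                 teacher_logits: torch.Tensor,   # (batch, num_classes)
                 mlm_logits: torch.Tensor,       # (batch, seq_len, vocab_size)
                 mlm_labels: torch.Tensor):      # (batch, seq_len), -100 = unmasked
    teacher_probs = teacher_logits.softmax(dim=-1)
    confidence, pseudo_labels = teacher_probs.max(dim=-1)
    keep = confidence >= CONFIDENCE_THRESHOLD     # confidence filter
    if keep.sum() == 0:
        return student_logits.new_zeros(())       # no confident pairs in this batch

    s_logits = student_logits[keep]
    # L_CE: cross-entropy against the teacher's hard pseudo-labels.
    l_ce = F.cross_entropy(s_logits, pseudo_labels[keep])
    # L_KL: KL divergence between student and teacher label distributions.
    l_kl = F.kl_div(F.log_softmax(s_logits, dim=-1),
                    teacher_probs[keep], reduction="batchmean")
    # L_MLM: self-supervised masked-language-modeling loss over the batch.
    l_mlm = F.cross_entropy(mlm_logits.view(-1, mlm_logits.size(-1)),
                            mlm_labels.view(-1), ignore_index=-100)
    return l_ce + l_kl + l_mlm
```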