MDD-Eval: Self-Training on Augmented Data for Multi-Domain Dialogue Evaluation

Authors: Chen Zhang, Luis Fernando D'Haro, Thomas Friedrichs, Haizhou Li (pp. 11657-11666)

AAAI 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | MDD-Eval is extensively assessed on six dialogue evaluation benchmarks. Empirical results show that the MDD-Eval framework achieves a strong performance with an absolute improvement of 7% over the state-of-the-art ADMs in terms of mean Spearman correlation scores across all the evaluation benchmarks.
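The headline number above is a mean Spearman correlation between automatic metric scores and human judgments, averaged over the evaluation benchmarks. Below is a minimal sketch of how such a figure is computed with `scipy.stats.spearmanr`; the benchmark names and score arrays are placeholders, not the paper's data.

```python
# Sketch: mean Spearman correlation across evaluation benchmarks.
# Benchmark names and scores below are dummy placeholders.
from scipy.stats import spearmanr
import numpy as np

def mean_spearman(benchmarks):
    """benchmarks: dict mapping benchmark name -> (metric_scores, human_scores)."""
    rhos = []
    for name, (metric_scores, human_scores) in benchmarks.items():
        rho, _ = spearmanr(metric_scores, human_scores)
        rhos.append(rho)
        print(f"{name}: Spearman rho = {rho:.3f}")
    return float(np.mean(rhos))

# Illustrative usage with dummy scores:
benchmarks = {
    "benchmark_a": ([0.2, 0.7, 0.9, 0.4], [1, 4, 5, 2]),
    "benchmark_b": ([0.1, 0.5, 0.8, 0.6], [1, 3, 5, 4]),
}
print("Mean Spearman:", mean_spearman(benchmarks))
```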
Researcher Affiliation | Collaboration | Chen Zhang (1,2), Luis Fernando D'Haro (3), Thomas Friedrichs (2), Haizhou Li (1,4,5). 1 National University of Singapore, Singapore; 2 Robert Bosch (SEA), Singapore; 3 Universidad Politécnica de Madrid, Spain; 4 Kriston AI Lab, China; 5 The Chinese University of Hong Kong (Shenzhen), China.
Pseudocode | No | The paper does not contain any explicitly labeled pseudocode or algorithm blocks. It describes the methodology in prose and mathematical equations.
Open Source Code | Yes | MDD-Data, the MDD-Eval implementation, and pretrained checkpoints will be released to the public (https://github.com/e0397123/MDD-Eval).
Open Datasets | Yes | We make use of four publicly available, high-quality and human-written conversation corpora to form a multi-domain synthetic dataset: DailyDialog (Li et al. 2017), ConvAI2 (Dinan et al. 2020), Empathetic Dialogues (Rashkin et al. 2019) and Topical Chat (Gopalakrishnan et al. 2019).
Dataset Splits | Yes | We only use the training and validation splits of the dialogue corpora since some dialogue contexts in the evaluation benchmarks are sampled from their test sets. The detailed statistics of the training and validation splits of the four corpora (DailyDialog, Empathetic Dialogues, ConvAI2 and Topical Chat) are presented in Table 2 of the paper (see the loading sketch below).
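For illustration, here is a hedged sketch of assembling the train/validation splits of these corpora with the Hugging Face `datasets` library. The hub identifiers are assumptions on my part (the paper cites the corpora but does not say how they were obtained), and Topical Chat is distributed separately through its own GitHub repository rather than the hub.

```python
# Sketch: loading three of the four corpora via the `datasets` library,
# keeping only train/validation splits as the paper describes.
# Hub identifiers are assumed, not taken from the paper or its repo.
from datasets import load_dataset

corpora = {
    "daily_dialog": "daily_dialog",                   # Li et al. 2017 (assumed hub id)
    "empathetic_dialogues": "empathetic_dialogues",   # Rashkin et al. 2019 (assumed hub id)
    "conv_ai_2": "conv_ai_2",                         # Dinan et al. 2020 (assumed hub id)
    # Topical Chat (Gopalakrishnan et al. 2019) is released via the
    # alexa/Topical-Chat GitHub repository and is not loaded here.
}

splits = {}
for name, hub_id in corpora.items():
    ds = load_dataset(hub_id)
    # Drop test splits, mirroring the paper's note that some evaluation
    # benchmark contexts are sampled from them.
    splits[name] = {k: v for k, v in ds.items() if k in ("train", "validation")}
    print(name, {k: len(v) for k, v in splits[name].items()})
```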
Hardware Specification | No | The paper does not specify the GPU or CPU models, or any other hardware, used to run its experiments. It only mentions RoBERTa-Large as the model architecture.
Software Dependencies | No | The paper mentions RoBERTa-Large (Liu et al. 2019) and BERT (Devlin et al. 2019) as backbone models, and frameworks such as ILM (Donahue, Lee, and Liang 2020), but does not provide version numbers for software dependencies such as Python, PyTorch, TensorFlow, or other libraries.
Experiment Setup | Yes | We choose RoBERTa-Large (Liu et al. 2019) for both the teacher and the student model in MDD-Eval. ... A confidence threshold of 70% is applied to exclude pairs classified by M_teacher with low confidence. ... L_CE is the cross-entropy loss, L_KL is the KL divergence and L_MLM is the self-supervised masked language modeling (MLM) loss. ... In our experiment, for quick turn-around, we sub-sample 600K context-response pairs from MDD-Data to train the final student model.
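To make the quoted setup concrete, the following is a minimal PyTorch sketch of a student objective that combines L_CE on teacher pseudo-labels, L_KL against the teacher distribution, and an MLM term, while excluding pairs whose teacher confidence falls below 70%. The equal loss weighting, tensor shapes, and applying the confidence filter inside the loss (rather than when constructing MDD-Data, as the paper does) are simplifications of mine, not the released implementation.

```python
# Sketch of a self-training student objective with a teacher-confidence
# filter; shapes, weighting, and the in-loss filtering are assumptions.
import torch
import torch.nn.functional as F

CONF_THRESHOLD = 0.7  # 70% confidence cutoff quoted from the paper

def student_loss(student_logits, teacher_logits, mlm_logits, mlm_labels):
    """student_logits, teacher_logits: (batch, num_classes);
    mlm_logits: (batch, seq_len, vocab); mlm_labels: (batch, seq_len), -100 for unmasked tokens."""
    teacher_probs = teacher_logits.softmax(dim=-1)
    confidence, pseudo_labels = teacher_probs.max(dim=-1)
    keep = confidence >= CONF_THRESHOLD  # drop low-confidence teacher labels

    if keep.any():
        # Cross-entropy on the teacher's hard pseudo-labels.
        l_ce = F.cross_entropy(student_logits[keep], pseudo_labels[keep])
        # KL divergence from the teacher's soft distribution.
        l_kl = F.kl_div(
            F.log_softmax(student_logits[keep], dim=-1),
            teacher_probs[keep],
            reduction="batchmean",
        )
    else:
        l_ce = l_kl = student_logits.new_zeros(())

    # Self-supervised masked language modeling term.
    l_mlm = F.cross_entropy(
        mlm_logits.view(-1, mlm_logits.size(-1)),
        mlm_labels.view(-1),
        ignore_index=-100,
    )
    return l_ce + l_kl + l_mlm  # equal weighting assumed
```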