Dialogues Are Not Just Text: Modeling Cognition for Dialogue Coherence Evaluation

Authors: Xue Li, Jia Su, Yang Yang, Zipeng Gao, Xinyu Duan, Yi Guan

AAAI 2024

Reproducibility assessment: each item below gives the variable, the result, and the LLM response.
Research Type: Experimental. Experiments demonstrate the necessity of modeling human cognition for dialogue evaluation, and the proposed DCGEval shows stronger correlations with human judgments than other state-of-the-art evaluation metrics.
Researcher Affiliation: Collaboration. Xue Li (1*), Jia Su (2), Yang Yang (1), Zipeng Gao (3), Xinyu Duan (2), Yi Guan (1). Affiliations: (1) Faculty of Computing, Harbin Institute of Technology; (2) Huawei Cloud; (3) School of Computer Science and Technology, University of Science and Technology of China.
Pseudocode: No. The paper contains mathematical formulations and descriptions of its processes, but it does not include explicitly labeled pseudocode or algorithm blocks.
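For illustration, the components this report identifies elsewhere (an AMR parser, a GCN, a Transformer, an MLP; see the Open Source Code and Experiment Setup items) suggest the overall shape such a pseudocode block would take. The sketch below is our hedged reconstruction of that shape only; the function names, interfaces, and ordering are assumptions, not pseudocode from the paper.

```python
def dcgeval_score(context, response, amr_parser, gcn, transformer, mlp):
    """Hedged reconstruction of the scoring pipeline's shape, using only
    components named in this report (AMR parser, GCN, Transformer, MLP).
    All interfaces and the exact ordering are assumptions."""
    # Parse each utterance into an AMR (cognition) graph.
    graphs = [amr_parser(utt) for utt in context + [response]]
    # Encode every graph with the GCN, then fuse the graph encodings.
    encoded = [gcn(g) for g in graphs]
    fused = transformer(encoded)
    # Map the fused representation to a scalar coherence score.
    return mlp(fused)
```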
Open Source Code: No. The paper references a third-party AMR parser (https://github.com/bjascob/amrlib), but the authors do not provide any concrete access to source code for the methodology described in the paper.
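Although no author code is released, the referenced parser itself is publicly available. Below is a minimal sketch of parsing utterances into AMR graphs with amrlib, assuming a pretrained sentence-to-graph model has been installed per the amrlib documentation; the example utterances are ours, not from the paper.

```python
import amrlib

# Load a pretrained sentence-to-graph (StoG) model; amrlib distributes
# these models separately from the library itself.
stog = amrlib.load_stog_model()

# Parse dialogue utterances into PENMAN-notation AMR graph strings.
graphs = stog.parse_sents([
    "I moved to Boston last month.",
    "How do you like the city so far?",
])
for g in graphs:
    print(g)  # one AMR graph per input utterance
```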
Open Datasets: Yes. The paper uses two daily dialogue datasets, DailyDialog++ (Sai et al. 2020) and DailyDialog EVAL (Huang et al. 2020), as training data, and evaluates on ConvAI2 (Huang et al. 2020) and Empathetic Dialogues (Huang et al. 2020) as unseen datasets, both of which include substantial human scoring.
Dataset Splits: No. The paper mentions using DailyDialog++ and DailyDialog EVAL as training data and ConvAI2 and Empathetic Dialogues as unseen evaluation datasets, but it does not provide the split information (exact percentages, sample counts, or explicit predefined train/validation/test partitions) needed to reproduce the data partitioning.
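For context, the missing specification could be as small as a seeded partition. The sketch below shows one conventional way to pin such a split; the 80/10/10 ratios and the seed are illustrative assumptions, not values from the paper.

```python
import random

def split_dialogues(dialogues, train_frac=0.8, val_frac=0.1, seed=42):
    """Deterministic train/validation/test partition of a dialogue list.
    The ratios and the seed are hypothetical placeholders."""
    rng = random.Random(seed)
    indices = list(range(len(dialogues)))
    rng.shuffle(indices)
    n_train = int(train_frac * len(indices))
    n_val = int(val_frac * len(indices))
    train = [dialogues[i] for i in indices[:n_train]]
    val = [dialogues[i] for i in indices[n_train:n_train + n_val]]
    test = [dialogues[i] for i in indices[n_train + n_val:]]
    return train, val, test
```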
Hardware Specification: No. The paper describes the experimental setup and results but does not specify the hardware used to run the experiments, such as GPU or CPU models.
Software Dependencies: No. The paper mentions software components such as an AMR parser and Transformer models, but it does not pin version numbers for these or for other ancillary dependencies (e.g., Python 3.8, PyTorch 1.9).
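The missing information could have been captured with a short environment dump like the one below; the package list is a plausible guess for this kind of model, not one confirmed by the paper.

```python
import sys
import importlib.metadata as md

# Record interpreter and package versions so the environment can be
# recreated later. The package names below are assumptions.
print(f"python {sys.version.split()[0]}")
for pkg in ("torch", "transformers", "amrlib"):
    try:
        print(f"{pkg} {md.version(pkg)}")
    except md.PackageNotFoundError:
        print(f"{pkg} not installed")
```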
Experiment Setup: No. The paper describes the overall framework, its training objectives (MLR loss, KD-MSE loss), and its architectural components (GCN, Transformer, MLP), but the main text does not give concrete hyperparameter values such as learning rate, batch size, number of epochs, or optimizer settings.
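What is missing amounts to a small configuration block. The example below shows the kind of settings a reproducible report would pin down; every value is a hypothetical placeholder, not a number from the paper.

```python
# Hypothetical training configuration: every value is a placeholder
# illustrating what a reproducible experiment setup would specify.
config = {
    "optimizer": "AdamW",
    "learning_rate": 2e-5,
    "batch_size": 16,
    "num_epochs": 5,
    "loss_weights": {"mlr": 1.0, "kd_mse": 1.0},  # MLR and KD-MSE terms
    "seed": 42,
}
```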