Dialogues Are Not Just Text: Modeling Cognition for Dialogue Coherence Evaluation
Authors: Xue Li, Jia Su, Yang Yang, Zipeng Gao, Xinyu Duan, Yi Guan
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments demonstrate the necessity of modeling human cognition for dialogue evaluation, and the proposed DCGEval shows stronger correlations with human judgments than other state-of-the-art evaluation metrics. |
| Researcher Affiliation | Collaboration | Xue Li¹*, Jia Su², Yang Yang¹, Zipeng Gao³, Xinyu Duan², Yi Guan¹; ¹Faculty of Computing, Harbin Institute of Technology; ²Huawei Cloud; ³School of Computer Science and Technology, University of Science and Technology of China |
| Pseudocode | No | The paper contains mathematical formulations and descriptions of processes, but it does not include explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper references a third-party AMR parser (https://github.com/bjascob/amrlib) but does not provide access to source code released by the authors for the methodology described in the paper. (See the hedged AMR parsing sketch after this table.) |
| Open Datasets | Yes | We use two daily dialogue datasets, DailyDialog++ (Sai et al. 2020) and DailyDialog EVAL (Huang et al. 2020), as training data. To evaluate model performance, we use ConvAI2 (Huang et al. 2020) and Empathetic Dialogues (Huang et al. 2020) as unseen datasets, both of which include substantial human scoring. |
| Dataset Splits | No | The paper mentions using 'DailyDialog++' and 'DailyDialog EVAL' as training data, and 'ConvAI2' and 'Empathetic Dialogues' as unseen evaluation datasets. However, it does not provide the split information (e.g., exact percentages, sample counts, or explicit predefined training/validation/test splits within these datasets) needed to reproduce the data partitioning. |
| Hardware Specification | No | The paper describes the experimental setup and results but does not specify any particular hardware used for running the experiments, such as specific GPU or CPU models. |
| Software Dependencies | No | The paper mentions software components such as an 'AMR parser' and 'Transformer' models, but it does not provide version numbers for these or for other ancillary software dependencies (e.g., 'Python 3.8' or 'PyTorch 1.9'). |
| Experiment Setup | No | The paper describes the overall framework, training objectives (MLR loss, KD-MSE loss), and architectural components (GCN, Transformer, MLP). However, it does not explicitly provide specific hyperparameter values such as the learning rate, batch size, number of epochs, or optimizer settings in the main text. (A hedged sketch of the named training objectives follows below.) |
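Since the paper relies on amrlib for AMR parsing, a minimal usage sketch is shown below. It follows amrlib's documented sentence-to-graph API; the example utterances and any preprocessing are assumptions, not the authors' reported setup.

```python
# Minimal sketch of parsing dialogue utterances to AMR with amrlib
# (https://github.com/bjascob/amrlib), the third-party parser the paper cites.
# Assumes a sentence-to-graph model has been installed per amrlib's docs;
# nothing here reflects the authors' actual preprocessing pipeline.
import amrlib

stog = amrlib.load_stog_model()  # loads the default installed parse model
graphs = stog.parse_sents([
    "I had a terrible day at work.",
    "Oh no, what happened?",
])
for graph in graphs:
    print(graph)  # one PENMAN-serialized AMR graph per utterance
```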
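The paper names an MLR loss and a KD-MSE loss but, per the table above, does not report the associated hyperparameters. The sketch below assumes MLR is a margin-based ranking objective over (coherent, incoherent) response pairs and KD-MSE is a standard knowledge-distillation MSE; both forms, the margin, and the weighting `alpha` are assumptions rather than the authors' definitions.

```python
import torch
import torch.nn as nn

# Assumed loss forms only: "MLR" is sketched as a margin ranking objective and
# "KD-MSE" as a knowledge-distillation MSE; the paper's exact objectives and
# hyperparameters (margin, alpha) are not specified in this summary.
ranking = nn.MarginRankingLoss(margin=0.1)
mse = nn.MSELoss()

def total_loss(pos_scores: torch.Tensor,      # scores for coherent responses
               neg_scores: torch.Tensor,      # scores for incoherent responses
               student_scores: torch.Tensor,  # student model's coherence scores
               teacher_scores: torch.Tensor,  # teacher model's soft scores
               alpha: float = 1.0) -> torch.Tensor:
    # Rank coherent responses above incoherent ones by at least the margin.
    target = torch.ones_like(pos_scores)
    l_rank = ranking(pos_scores, neg_scores, target)
    # Regress the student onto the detached teacher (no teacher gradients).
    l_kd = mse(student_scores, teacher_scores.detach())
    return l_rank + alpha * l_kd
```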