Open-Domain Text Evaluation via Contrastive Distribution Methods
Authors: Sidi Lu, Hongyi Liu, Asli Celikyilmaz, Tianlu Wang, Nanyun Peng
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments on coherence evaluation for multi-turn dialogue and commonsense evaluation for controllable generation demonstrate CDM's superior correlation with human judgment compared to existing automatic evaluation metrics, highlighting the strong performance and generalizability of our approach. |
| Researcher Affiliation | Collaboration | Sidi Lu¹, Hongyi Liu², Asli Celikyilmaz³, Tianlu Wang³, Nanyun Peng¹. ¹Department of Computer Science, University of California, Los Angeles; ²Shanghai Jiao Tong University; ³Meta FAIR. |
| Pseudocode | Yes | Algorithm 1 Generative CDM |
| Open Source Code | Yes | Code: https://github.com/PlusLabNLP/CDM |
| Open Datasets | Yes | We evaluate our methods on annotated dialogues from the FED (Mehri & Eskenazi, 2020) and DSTC9 (Gunasekara et al., 2020) datasets. Topical-Chat (Gopalakrishnan et al., 2019) + Persona-Chat (Zhang et al., 2018). CommonGen (Lin et al., 2020) is a generative commonsense reasoning dataset. |
| Dataset Splits | Yes | The models are selected using the validation set of Topical-Chat + Persona-Chat. All hyperparameters for training the amateur and expert likelihood functions are determined by the best log-likelihood on the in-domain validation set from the Topical-Chat + Persona-Chat data. |
| Hardware Specification | No | The paper does not provide specific details on the hardware used, such as GPU models (e.g., NVIDIA A100), CPU models, or specific TPU versions for running experiments. |
| Software Dependencies | No | The paper mentions general software components like 'Python' and models like 'Pythia', 'LLaMA-2', 'Flan-T5', and 'GPT-4' but does not specify version numbers for any software dependencies or libraries. |
| Experiment Setup | No | The paper mentions adopting experimental settings from DEAM (Ghazarian et al., 2022) and selecting hyperparameters using a validation set, but it does not provide specific hyperparameter values (e.g., learning rate, batch size, number of epochs) or detailed training configurations within the main text. |
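The table above refers to amateur and expert likelihood functions, which is the core contrastive idea: score a candidate text by how much more likely an "expert" model finds it than an "amateur" model. The sketch below illustrates that scoring gap with toy unigram models; the probability tables, token lists, and function names are illustrative assumptions, not the paper's trained likelihood functions or Algorithm 1.

```python
import math

# Toy stand-ins for the expert and amateur likelihood models
# (illustrative unigram probabilities, not trained models).
EXPERT_PROBS = {"coherent": 0.4, "reply": 0.3, "random": 0.1, "tokens": 0.2}
AMATEUR_PROBS = {"coherent": 0.1, "reply": 0.2, "random": 0.4, "tokens": 0.3}

def log_likelihood(tokens, probs, floor=1e-6):
    """Sum of per-token log-probabilities under a unigram model."""
    return sum(math.log(probs.get(t, floor)) for t in tokens)

def contrastive_score(tokens):
    """log p_expert(x) - log p_amateur(x): higher means more expert-like."""
    return (log_likelihood(tokens, EXPERT_PROBS)
            - log_likelihood(tokens, AMATEUR_PROBS))

good = ["coherent", "reply"]   # text the expert prefers
bad = ["random", "tokens"]     # text the amateur prefers
print(contrastive_score(good) > contrastive_score(bad))  # True
```

The key design choice is that the score depends on the likelihood *ratio* between the two models, so model-agnostic biases (e.g., a preference for short or frequent strings shared by both models) partially cancel out.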