Open-Domain Text Evaluation via Contrastive Distribution Methods

Authors: Sidi Lu, Hongyi Liu, Asli Celikyilmaz, Tianlu Wang, Nanyun Peng

ICML 2024

Reproducibility Variable Result LLM Response
Research Type Experimental Our experiments on coherence evaluation for multi-turn dialogue and commonsense evaluation for controllable generation demonstrate CDM's superior correlation with human judgments over existing automatic evaluation metrics, highlighting the strong performance and generalizability of our approach.
Researcher Affiliation Collaboration Sidi Lu (1), Hongyi Liu (2), Asli Celikyilmaz (3), Tianlu Wang (3), Nanyun Peng (1); 1: Department of Computer Science, University of California, Los Angeles; 2: Shanghai Jiao Tong University; 3: Meta FAIR.
Pseudocode Yes Algorithm 1 Generative CDM
Open Source Code Yes Code: https://github.com/PlusLabNLP/CDM
Open Datasets Yes We evaluate our methods on annotated dialogues from the FED (Mehri & Eskenazi, 2020) and DSTC9 (Gunasekara et al., 2020) datasets, on Topical-Chat (Gopalakrishnan et al., 2019) + Persona-Chat (Zhang et al., 2018), and on CommonGen (Lin et al., 2020), a generative commonsense reasoning dataset.
Dataset Splits Yes The models are selected using the validation set of Topical-Persona Chat. All hyperparameters for training the amateur and expert likelihood functions are determined by the best log-likelihood on the in-domain validation set of the Topical-Persona Chat dataset.
Hardware Specification No The paper does not provide specific details on the hardware used, such as GPU models (e.g., NVIDIA A100), CPU models, or specific TPU versions for running experiments.
Software Dependencies No The paper mentions general software components like 'Python' and models like 'Pythia', 'LLaMA-2', 'Flan-T5', and 'GPT-4' but does not specify version numbers for any software dependencies or libraries.
Experiment Setup No The paper mentions adopting experimental settings from DEAM (Ghazarian et al., 2022) and selecting hyperparameters using a validation set, but it does not provide specific hyperparameter values (e.g., learning rate, batch size, number of epochs) or detailed training configurations within the main text.
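The amateur/expert contrast at the heart of CDM can be illustrated with a minimal sketch. The function names, the per-token log-likelihood inputs, the scaling factor beta, and the averaging aggregation below are all illustrative assumptions, not the paper's exact formulation; in practice the log-likelihoods would come from the trained expert and amateur language models.

```python
def contrastive_scores(logp_expert, logp_amateur, beta=1.0):
    """Per-token contrast: expert log-prob minus a scaled amateur log-prob.
    (Hypothetical helper; the paper's exact combination may differ.)"""
    return [e - beta * a for e, a in zip(logp_expert, logp_amateur)]

def sequence_score(logp_expert, logp_amateur, beta=1.0):
    """Aggregate token-level contrasts by averaging (an illustrative choice)."""
    scores = contrastive_scores(logp_expert, logp_amateur, beta)
    return sum(scores) / len(scores)

# Toy token log-likelihoods; a higher score suggests text the expert model
# prefers relative to the amateur model.
print(sequence_score([-1.0, -2.0], [-3.0, -4.0]))
```

The intuition is that text judged good (e.g., coherent dialogue) should be much more likely under the expert than under the amateur, so the gap between the two likelihoods serves as an evaluation signal.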