Open-Domain Text Evaluation via Contrastive Distribution Methods

Authors: Sidi Lu, Hongyi Liu, Asli Celikyilmaz, Tianlu Wang, Nanyun Peng

ICML 2024

Reproducibility Variable Result LLM Response
Research Type Experimental Our experiments on coherence evaluation for multi-turn dialogue and commonsense evaluation for controllable generation demonstrate CDM's superior correlation with human judgments over existing automatic evaluation metrics, highlighting the strong performance and generalizability of our approach.
Researcher Affiliation Collaboration Sidi Lu (1), Hongyi Liu (2), Asli Celikyilmaz (3), Tianlu Wang (3), Nanyun Peng (1); 1: Department of Computer Science, University of California, Los Angeles; 2: Shanghai Jiao Tong University; 3: Meta FAIR.
Pseudocode Yes Algorithm 1 Generative CDM
Open Source Code Yes Code: https://github.com/PlusLabNLP/CDM
Open Datasets Yes We evaluate our methods on annotated dialogues from the FED (Mehri & Eskenazi, 2020) and DSTC9 (Gunasekara et al., 2020) datasets, on Topical-Chat (Gopalakrishnan et al., 2019) + Persona-Chat (Zhang et al., 2018), and on CommonGen (Lin et al., 2020), a generative commonsense reasoning dataset.
Dataset Splits Yes The models are selected using the validation set of Topical-Persona Chat. All hyperparameters for training the amateur and expert likelihood functions are determined by the best log-likelihood on the in-domain validation set of the Topical-Persona Chat dataset.
Hardware Specification No The paper does not provide specific details on the hardware used, such as GPU models (e.g., NVIDIA A100), CPU models, or specific TPU versions for running experiments.
Software Dependencies No The paper mentions general software components like 'Python' and models like 'Pythia', 'LLaMA-2', 'Flan-T5', and 'GPT-4' but does not specify version numbers for any software dependencies or libraries.
Experiment Setup No The paper mentions adopting experimental settings from DEAM (Ghazarian et al., 2022) and selecting hyperparameters using a validation set, but it does not provide specific hyperparameter values (e.g., learning rate, batch size, number of epochs) or detailed training configurations within the main text.
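The amateur/expert contrast at the heart of CDM can be illustrated with a minimal sketch. The function names, the per-token log-likelihood inputs, the scaling factor beta, and the averaging aggregation below are all illustrative assumptions, not the paper's exact formulation; in practice the log-likelihoods would come from the trained expert and amateur language models.

```python
def contrastive_scores(logp_expert, logp_amateur, beta=1.0):
    """Per-token contrast: expert log-prob minus a scaled amateur log-prob.
    (Hypothetical helper; the paper's exact combination may differ.)"""
    return [e - beta * a for e, a in zip(logp_expert, logp_amateur)]

def sequence_score(logp_expert, logp_amateur, beta=1.0):
    """Aggregate token-level contrasts by averaging (an illustrative choice)."""
    scores = contrastive_scores(logp_expert, logp_amateur, beta)
    return sum(scores) / len(scores)

# Toy token log-likelihoods; a higher score suggests text the expert model
# prefers relative to the amateur model.
print(sequence_score([-1.0, -2.0], [-3.0, -4.0]))
```

The intuition is that text judged good (e.g., coherent dialogue) should be much more likely under the expert than under the amateur, so the gap between the two likelihoods serves as an evaluation signal.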