FACE: Evaluating Natural Language Generation with Fourier Analysis of Cross-Entropy
Authors: Zuhao Yang, Yingfang Yuan, Yang Xu, Shuo Zhan, Huajun Bai, Kefan Chen
NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Based on an open-ended generation task and the experimental data from previous studies, we find that FACE can effectively identify the human-model gap, scales with model size, reflects the outcomes of different sampling methods for decoding, and correlates well with other evaluation metrics and with human judgment scores. |
| Researcher Affiliation | Collaboration | 1 School of Computer Science and Engineering, Nanyang Technological University 2 School of Mathematical and Computer Sciences, Heriot-Watt University 3 Department of Computer Science, Southern University of Science and Technology 4 Genify |
| Pseudocode | No | The paper describes its methods using text and mathematical formulas but does not include any explicit pseudocode blocks or algorithms labeled as such (a hedged sketch of the core computation is given after this table). |
| Open Source Code | Yes | Implementation and experiments code are available in this public repository: https://github.com/CLCS-SUSTech/FACE. |
| Open Datasets | Yes | We consider such a text completion task in three domains: Wiki text, News, and Stories. ... For WikiText-103 [27] and RealNews [1] datasets, we cleaned them before extracting the texts corresponding to the first 35 tokens (tokenized by GPT2Tokenizer) to form our prompt sets. |
| Dataset Splits | No | The paper describes splitting human data into two folds for a sanity test but does not provide specific training/validation/test splits for the datasets used in its main evaluation experiments. |
| Hardware Specification | Yes | For the text generation task, we use the remote workstation that has two NVIDIA RTX A6000 graphics cards. ... All of the above measurements take place on an AMD Ryzen Threadripper PRO 3995WX 64-Cores CPU (frequency range [2200.00MHz, 4308.40MHz]). |
| Software Dependencies | Yes | Our experiments were performed on an Ubuntu 20.04.1 system with Python 3.9.16. The versions of key Python libraries include: Transformers 4.27.4, PyTorch-CUDA 11.6, PyTorch 1.13.1, SciPy 1.5.4. |
| Experiment Setup | Yes | In our research, we set the maximum generation length to 1024 for all models on three datasets. ... For both conditional and unconditional generation, we preset a random seed integer (32 by default). ... Furthermore, the maximum length of each text (1024 by default) as well as the batch size (which varies according to GPU capacity) for perplexity computation have to be determined before automatic evaluation (hedged sketches of these settings follow the table). |
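
The generation settings quoted above (35-token prompts, a maximum generation length of 1024, a default seed of 32) could be reproduced along the following lines. This is a minimal sketch, not the authors' script: the choice of `gpt2` as the generator and the `top_p` value are placeholders for illustration, and the authors' actual pipeline is in the linked repository (https://github.com/CLCS-SUSTech/FACE).

```python
# Illustrative sketch of prompt construction and generation settings described
# in the table above: first 35 tokens as prompt, max length 1024, seed 32.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast, set_seed

set_seed(32)  # the paper's default random seed

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def make_prompt(text: str, prompt_len: int = 35) -> str:
    """Keep only the first `prompt_len` GPT-2 tokens of a cleaned document."""
    ids = tokenizer(text, truncation=True, max_length=prompt_len)["input_ids"]
    return tokenizer.decode(ids)

def generate(prompt: str, max_length: int = 1024) -> str:
    """Sample a continuation up to the paper's maximum generation length."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model.generate(
            **inputs,
            do_sample=True,        # the paper compares several sampling-based decoders
            top_p=0.95,            # assumed value for illustration, not taken from the paper
            max_length=max_length,
            pad_token_id=tokenizer.eos_token_id,
        )
    return tokenizer.decode(out[0], skip_special_tokens=True)
```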
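Since the paper gives no pseudocode, the following is a hedged sketch of one plausible reading of the FACE idea: estimate a token-level cross-entropy series for a text with a language model, take its Fourier spectrum, and compare the spectra of human and model text. The estimator model, the spectrum length, and the Spearman-based similarity below are assumptions made for illustration, not the authors' definitive implementation (which is at https://github.com/CLCS-SUSTech/FACE).

```python
# Illustrative sketch of a Fourier-of-cross-entropy comparison between a human
# text and a model-generated text; details are assumptions, see lead-in above.
import numpy as np
import torch
import torch.nn.functional as F
from scipy.stats import spearmanr
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
estimator = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def cross_entropy_series(text: str, max_length: int = 1024) -> np.ndarray:
    """Per-token cross-entropy of `text` under the estimator model."""
    ids = tokenizer(text, return_tensors="pt",
                    truncation=True, max_length=max_length)["input_ids"]
    with torch.no_grad():
        logits = estimator(ids).logits
    # Predict token t from tokens < t; keep the per-token losses as a series.
    losses = F.cross_entropy(logits[0, :-1], ids[0, 1:], reduction="none")
    return losses.numpy()

def spectrum(series: np.ndarray, n: int = 512) -> np.ndarray:
    """Magnitude spectrum of the cross-entropy series (padded/truncated to n)."""
    return np.abs(np.fft.rfft(series, n=n))

def face_similarity(human_text: str, model_text: str) -> float:
    """Spearman correlation of the two spectra (one of several possible similarities)."""
    s_h = spectrum(cross_entropy_series(human_text))
    s_m = spectrum(cross_entropy_series(model_text))
    rho, _ = spearmanr(s_h, s_m)
    return float(rho)
```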