FACE: Evaluating Natural Language Generation with Fourier Analysis of Cross-Entropy

Authors: Zuhao Yang, Yingfang Yuan, Yang Xu, Shuo Zhan, Huajun Bai, Kefan Chen

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Based on an open-ended generation task and the experimental data from previous studies, we find that FACE can effectively identify the human-model gap, scales with model size, reflects the outcomes of different sampling methods for decoding, and correlates well with other evaluation metrics and with human judgment scores.
Researcher Affiliation | Collaboration | (1) School of Computer Science and Engineering, Nanyang Technological University; (2) School of Mathematical and Computer Sciences, Heriot-Watt University; (3) Department of Computer Science, Southern University of Science and Technology; (4) Genify
Pseudocode | No | The paper describes its methods using text and mathematical formulas but does not include any explicit pseudocode blocks or algorithms labeled as such.
Open Source Code | Yes | Implementation and experiment code are available in this public repository: https://github.com/CLCS-SUSTech/FACE.
Open Datasets | Yes | We consider such a text completion task in three domains: Wiki text, News, and Stories. ... For the WikiText-103 [27] and RealNews [1] datasets, we cleaned them before extracting the texts corresponding to the first 35 tokens (tokenized by GPT2Tokenizer) to form our prompt sets. (A sketch of this prompt construction follows the table.)
Dataset Splits | No | The paper describes splitting human data into two folds for a sanity test but does not provide specific training/validation/test splits for the datasets used in its main evaluation experiments.
Hardware Specification | Yes | For the text generation task, we use a remote workstation that has two NVIDIA RTX A6000 graphics cards. ... All of the above measurements take place on an AMD Ryzen Threadripper PRO 3995WX 64-core CPU (frequency range [2200.00 MHz, 4308.40 MHz]).
Software Dependencies | Yes | Our experiments were performed on an Ubuntu 20.04.1 system with Python 3.9.16. The versions of key Python libraries include: Transformers 4.27.4, PyTorch-CUDA 11.6, PyTorch 1.13.1, SciPy 1.5.4. (A version-check sketch follows the table.)
Experiment Setup | Yes | In our research, we set the maximum generation length to 1024 for all models on three datasets. ... For both conditional and unconditional generation, we preset a random seed integer (32 by default). ... Furthermore, the maximum length of each text (1024 by default) as well as the batch size (which varies according to GPU capacity) for perplexity computation have to be determined before automatic evaluation. (A generation-settings sketch follows the table.)
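As the title indicates, FACE evaluates generated text by applying Fourier analysis to the token-level cross-entropy series of human-written and model-generated text, computed under an estimator language model. The following is a minimal, hypothetical sketch of that idea, assuming GPT-2 as the estimator; the function names, the truncation to 1024 tokens, and the comparison of spectra via Spearman correlation over low-frequency components are illustrative assumptions rather than the authors' exact pipeline, which is available in the linked repository.

```python
# Hypothetical sketch of the FACE idea: compare the frequency spectra of
# token-level cross-entropy series from human and model text.
# Names and the spectral comparison used here are illustrative; see
# https://github.com/CLCS-SUSTech/FACE for the official implementation.
import numpy as np
import torch
from scipy.stats import spearmanr
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def cross_entropy_series(text: str) -> np.ndarray:
    """Per-token cross-entropy (negative log-likelihood) under the estimator model."""
    ids = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024).input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Position t predicts token t+1: shift logits, then take the NLL of the actual next token.
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    nll = -log_probs.gather(2, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return nll.squeeze(0).numpy()

def spectrum(series: np.ndarray) -> np.ndarray:
    """Magnitude of the real FFT of the cross-entropy series."""
    return np.abs(np.fft.rfft(series))

def face_similarity(human_text: str, model_text: str, n_freqs: int = 50) -> float:
    """Spearman correlation between the two spectra over the lowest n_freqs components
    (one illustrative way the spectra could be compared)."""
    s_h = spectrum(cross_entropy_series(human_text))[:n_freqs]
    s_m = spectrum(cross_entropy_series(model_text))[:n_freqs]
    k = min(len(s_h), len(s_m))
    return spearmanr(s_h[:k], s_m[:k]).correlation
```

How the spectra are aggregated and compared across a whole corpus, and which spectral similarity measures the paper actually reports, are details left to the official implementation.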
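The "Open Datasets" row quotes the prompt-set construction: the first 35 tokens of each cleaned document, tokenized with GPT2Tokenizer, serve as the prompt. A minimal sketch of that step is below; the cleaning itself is omitted, and the in-memory corpus, function name, and the choice to skip short documents are assumptions, not the authors' code.

```python
# Hypothetical sketch of the prompt-set construction: take the first 35
# GPT-2 tokens of each cleaned document as the prompt.
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
PROMPT_LEN = 35  # number of leading tokens quoted in the paper

def build_prompts(documents: list[str]) -> list[str]:
    prompts = []
    for doc in documents:
        ids = tokenizer(doc).input_ids[:PROMPT_LEN]
        if len(ids) == PROMPT_LEN:  # assumption: skip documents shorter than the prompt length
            prompts.append(tokenizer.decode(ids))
    return prompts

if __name__ == "__main__":
    # Toy in-memory corpus standing in for cleaned WikiText-103 / RealNews documents.
    corpus = ["Some cleaned Wikipedia article text ...", "Some cleaned news article text ..."]
    print(build_prompts(corpus))
```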
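The pinned environment in the "Software Dependencies" row (Python 3.9.16, Transformers 4.27.4, PyTorch 1.13.1, SciPy 1.5.4 on Ubuntu 20.04.1) can be verified with a short script like the one below; only the version numbers come from the paper, the checking script itself is an illustrative addition.

```python
# Illustrative environment check against the versions listed in the paper.
import sys
from importlib.metadata import version

EXPECTED = {"transformers": "4.27.4", "torch": "1.13.1", "scipy": "1.5.4"}

print("Python:", sys.version.split()[0])  # paper reports 3.9.16
for pkg, expected in EXPECTED.items():
    installed = version(pkg)
    status = "OK" if installed == expected else f"expected {expected}"
    print(f"{pkg}: {installed} ({status})")
```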
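The "Experiment Setup" row fixes a default random seed of 32 and a maximum generation length of 1024 tokens for all models. Below is a hypothetical sketch of those settings with the Transformers API; the model choice, prompt, and sampling arguments are placeholders rather than the paper's exact configuration.

```python
# Hypothetical sketch of the generation settings quoted in the paper:
# fixed random seed (32 by default) and a maximum generation length of 1024 tokens.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer, set_seed

set_seed(32)  # default seed quoted in the paper

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

prompt = "The history of natural language generation"  # placeholder prompt
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        do_sample=True,                       # sampling-based decoding; the paper compares several sampling methods
        max_length=1024,                      # maximum generation length used for all models
        pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no dedicated pad token
    )
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```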