FACE: Evaluating Natural Language Generation with Fourier Analysis of Cross-Entropy
Authors: Zuhao Yang, Yingfang Yuan, Yang Xu, Shuo Zhan, Huajun Bai, Kefan Chen
NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Based on an open-ended generation task and the experimental data from previous studies, we find that FACE can effectively identify the human-model gap, scales with model size, reflects the outcomes of different sampling methods for decoding, and correlates well with other evaluation metrics and with human judgment scores. |
| Researcher Affiliation | Collaboration | 1 School of Computer Science and Engineering, Nanyang Technological University 2 School of Mathematical and Computer Sciences, Heriot-Watt University 3 Department of Computer Science, Southern University of Science and Technology 4 Genify |
| Pseudocode | No | The paper describes its methods using text and mathematical formulas but does not include any explicit pseudocode blocks or algorithms labeled as such (a hedged sketch of the core computation is given after this table). |
| Open Source Code | Yes | Implementation and experiments code are available in this public repository: https://github.com/CLCS-SUSTech/FACE. |
| Open Datasets | Yes | We consider such a text completion task in three domains: Wiki text, News, and Stories. ... For WikiText-103 [27] and RealNews [1] datasets, we cleaned them before extracting the texts corresponding to the first 35 tokens (tokenized by GPT2Tokenizer) to form our prompt sets. |
| Dataset Splits | No | The paper describes splitting human data into two folds for a sanity test but does not provide specific training/validation/test splits for the datasets used in its main evaluation experiments. |
| Hardware Specification | Yes | For the text generation task, we use the remote workstation that has two NVIDIA RTX A6000 graphics cards. ... All of the above measurements take place on an AMD Ryzen Threadripper PRO 3995WX 64-Cores CPU (frequency range [2200.00MHz, 4308.40MHz]). |
| Software Dependencies | Yes | Our experiments were performed on an Ubuntu 20.04.1 system with Python 3.9.16. The versions of key Python libraries include: Transformers 4.27.4, PyTorch-CUDA 11.6, PyTorch 1.13.1, SciPy 1.5.4. |
| Experiment Setup | Yes | In our research, we set the maximum generation length to 1024 for all models on three datasets. ... For both conditional and unconditional generation, we preset a random seed integer (32 by default). ... Furthermore, the maximum length of each text (1024 by default) as well as the batch size (which varies according to GPU capacity) for perplexity computation have to be determined before automatic evaluation (hedged sketches of these settings follow the table). |
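
The generation settings quoted above (35-token prompts, a maximum generation length of 1024, a default seed of 32) could be reproduced along the following lines. This is a minimal sketch, not the authors' script: the choice of `gpt2` as the generator and the `top_p` value are placeholders for illustration, and the authors' actual pipeline is in the linked repository (https://github.com/CLCS-SUSTech/FACE).

```python
# Illustrative sketch of prompt construction and generation settings described
# in the table above: first 35 tokens as prompt, max length 1024, seed 32.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast, set_seed

set_seed(32)  # the paper's default random seed

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def make_prompt(text: str, prompt_len: int = 35) -> str:
    """Keep only the first `prompt_len` GPT-2 tokens of a cleaned document."""
    ids = tokenizer(text, truncation=True, max_length=prompt_len)["input_ids"]
    return tokenizer.decode(ids)

def generate(prompt: str, max_length: int = 1024) -> str:
    """Sample a continuation up to the paper's maximum generation length."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model.generate(
            **inputs,
            do_sample=True,        # the paper compares several sampling-based decoders
            top_p=0.95,            # assumed value for illustration, not taken from the paper
            max_length=max_length,
            pad_token_id=tokenizer.eos_token_id,
        )
    return tokenizer.decode(out[0], skip_special_tokens=True)
```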
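Since the paper gives no pseudocode, the following is a hedged sketch of one plausible reading of the FACE idea: estimate a token-level cross-entropy series for a text with a language model, take its Fourier spectrum, and compare the spectra of human and model text. The estimator model, the spectrum length, and the Spearman-based similarity below are assumptions made for illustration, not the authors' definitive implementation (which is at https://github.com/CLCS-SUSTech/FACE).

```python
# Illustrative sketch of a Fourier-of-cross-entropy comparison between a human
# text and a model-generated text; details are assumptions, see lead-in above.
import numpy as np
import torch
import torch.nn.functional as F
from scipy.stats import spearmanr
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
estimator = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def cross_entropy_series(text: str, max_length: int = 1024) -> np.ndarray:
    """Per-token cross-entropy of `text` under the estimator model."""
    ids = tokenizer(text, return_tensors="pt",
                    truncation=True, max_length=max_length)["input_ids"]
    with torch.no_grad():
        logits = estimator(ids).logits
    # Predict token t from tokens < t; keep the per-token losses as a series.
    losses = F.cross_entropy(logits[0, :-1], ids[0, 1:], reduction="none")
    return losses.numpy()

def spectrum(series: np.ndarray, n: int = 512) -> np.ndarray:
    """Magnitude spectrum of the cross-entropy series (padded/truncated to n)."""
    return np.abs(np.fft.rfft(series, n=n))

def face_similarity(human_text: str, model_text: str) -> float:
    """Spearman correlation of the two spectra (one of several possible similarities)."""
    s_h = spectrum(cross_entropy_series(human_text))
    s_m = spectrum(cross_entropy_series(model_text))
    rho, _ = spearmanr(s_h, s_m)
    return float(rho)
```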