Benchmarking Large Language Models on Controllable Generation under Diversified Instructions
Authors: Yihan Chen, Benfeng Xu, Quan Wang, Yi Liu, Zhendong Mao
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To address this vacancy, we propose a new benchmark CoDI-Eval to systematically and comprehensively evaluate LLMs' responses to instructions with various constraints... We provide extensive evaluations of representative LLMs (e.g., ChatGPT, Vicuna) on CoDI-Eval, revealing their limitations in following instructions with specific constraints and there is still a significant gap between open-source and commercial closed-source LLMs. |
| Researcher Affiliation | Academia | 1 University of Science and Technology of China; 2 MOE Key Laboratory of Trustworthy Distributed Computing and Service, Beijing University of Posts and Telecommunications; 3 State Key Laboratory of Communication Content Cognition, People's Daily Online, Beijing, China |
| Pseudocode | No | The paper describes procedures but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our data and code are available at https://github.com/Xt-cyh/CoDI-Eval. |
| Open Datasets | Yes | Our data and code are available at https://github.com/Xt-cyh/CoDI-Eval. For the topic task, we selected 18 specific topics from TweetTopic (Antypas et al. 2022) as the control attributes in the topic CTG tasks. We randomly selected keywords from the CommonGen dataset (Lin et al. 2020) to fill these instructions. Finally, regarding the toxicity avoidance task which is to avoid generating harmful or offensive content, we follow Contrastive Prefixes (Qian et al. 2022) by selecting 203 prompts labeled as challenge from RealToxicityPrompts (Gehman et al. 2020) with toxicity scores below 0.5. |
| Dataset Splits | No | The paper describes constructing an "evaluation instruction set" and using it for zero-shot and few-shot testing, but it does not specify explicit training, validation, and test dataset splits for model training, as the paper evaluates existing LLMs rather than training a new model. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models used for running the experiments. |
| Software Dependencies | No | The paper mentions using "GPT-3.5-turbo (0301)", "Hugging Face" models, and "Perspective API", but it does not list version numbers for the other software dependencies or libraries used in the experimental setup. |
| Experiment Setup | Yes | Our benchmark does not impose any restrictions on the decoding method of the models. However, for the sake of experimental consistency, we simply use the nucleus sampling (Holtzman et al. 2019) and set the top-p parameter to 0.9, as well as the temperature to 1.0. To reduce the generation time, we also limit the generation length (75 tokens for toxicity avoidance; 300 tokens for other tasks). |
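
The Experiment Setup row above reports the decoding configuration used across models: nucleus sampling with top-p 0.9, temperature 1.0, and a per-task cap on generation length (75 tokens for toxicity avoidance, 300 tokens otherwise). The following is a minimal sketch of those settings using the Hugging Face `transformers` generation API; the model name and the `generate` helper are illustrative, not the authors' evaluation script.

```python
# Sketch of the reported decoding settings: nucleus sampling (Holtzman et al. 2019)
# with top_p=0.9, temperature=1.0, and a task-dependent cap on new tokens.
# The model checkpoint is an illustrative choice, not prescribed by the paper.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "lmsys/vicuna-7b-v1.5"  # placeholder: any open-source LLM under evaluation

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, device_map="auto")

def generate_response(instruction: str, task: str) -> str:
    """Generate one response to a constrained instruction with the paper's sampling settings."""
    max_new_tokens = 75 if task == "toxicity_avoidance" else 300
    inputs = tokenizer(instruction, return_tensors="pt").to(model.device)
    output_ids = model.generate(
        **inputs,
        do_sample=True,          # enable sampling rather than greedy decoding
        top_p=0.9,               # nucleus sampling threshold from the paper
        temperature=1.0,
        max_new_tokens=max_new_tokens,
    )
    # Decode only the generated continuation, dropping the prompt tokens.
    return tokenizer.decode(
        output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
```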
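The Open Datasets and Software Dependencies rows also mention toxicity scores and the Perspective API, which is the standard way such scores are obtained. Below is a hedged sketch of a Perspective API toxicity query following the public `commentanalyzer` client usage; it is not the authors' exact evaluation code, and `API_KEY` is a placeholder you must supply.

```python
# Illustrative Perspective API call for scoring the toxicity of a generated continuation.
# Requires google-api-python-client and a valid Perspective API key (placeholder below).
from googleapiclient import discovery

API_KEY = "YOUR_PERSPECTIVE_API_KEY"  # placeholder, not provided by the paper

client = discovery.build(
    "commentanalyzer",
    "v1alpha1",
    developerKey=API_KEY,
    discoveryServiceUrl="https://commentanalyzer.googleapis.com/$discovery/rest?version=v1alpha1",
    static_discovery=False,
)

def toxicity_score(text: str) -> float:
    """Return the Perspective TOXICITY summary score (0.0 to 1.0) for `text`."""
    request = {
        "comment": {"text": text},
        "requestedAttributes": {"TOXICITY": {}},
    }
    response = client.comments().analyze(body=request).execute()
    return response["attributeScores"]["TOXICITY"]["summaryScore"]["value"]
```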