Can Large Language Models Understand Real-World Complex Instructions?

Authors: Qianyu He, Jie Zeng, Wenhao Huang, Lina Chen, Jin Xiao, Qianxi He, Xunzhe Zhou, Jiaqing Liang, Yanghua Xiao

AAAI 2024

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | We compare the performance of representative Chinese-oriented and English-oriented models in following complex instructions through extensive experiments. |
| Researcher Affiliation | Collaboration | (1) Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University; (2) School of Data Science, Fudan University; (3) Fudan-Aishu Cognitive Intelligence Joint Research Center, Shanghai, China |
| Pseudocode | No | The paper defines scoring functions in mathematical notation (e.g., f_word-max, f_word-min, f_keywords, S_q) but includes no explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Resources of CELLO are publicly available at https://github.com/Abbey4799/CELLO. |
| Open Datasets | Yes | We construct a complex instruction dataset from real-world scenarios, containing 523 samples encompassing nine tasks, effectively covering our specified features. Resources of CELLO are publicly available at https://github.com/Abbey4799/CELLO. |
| Dataset Splits | No | The paper describes its 523-sample benchmark but does not specify the training, validation, or test splits needed to reproduce its evaluation experiments. |
| Hardware Specification | No | The paper does not report the hardware used for its experiments (e.g., GPU/CPU models, memory, or computing environment). |
| Software Dependencies | No | The paper mentions 'jieba' for Chinese word counting and 'nltk' for English word counting but gives no version numbers for these or any other software dependencies, limiting reproducibility. |
| Experiment Setup | No | The paper evaluates pre-existing large language models with its proposed benchmark and evaluation criteria, but it does not report setup details such as hyperparameter values or configurations for the evaluated models or its own evaluation system. |
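The Software Dependencies finding above notes that the paper names jieba and nltk without version numbers. A minimal sketch of how a reproduction attempt might record whatever versions are actually installed, using only the Python standard library (`importlib.metadata`, Python 3.8+); the function name and package list are illustrative, not from the paper:

```python
# Sketch: snapshot the installed versions of the tokenization dependencies
# the paper names (jieba, nltk) so an evaluation run can be reproduced.
# Package names here are the ones cited in the assessment; anything not
# installed is recorded as None rather than raising.
from importlib import metadata


def dependency_versions(packages):
    """Map each package name to its installed version string, or None."""
    versions = {}
    for pkg in packages:
        try:
            versions[pkg] = metadata.version(pkg)
        except metadata.PackageNotFoundError:
            versions[pkg] = None  # not installed in this environment
    return versions


if __name__ == "__main__":
    print(dependency_versions(["jieba", "nltk"]))
```

Logging such a snapshot alongside benchmark scores would address the versioning gap the assessment flags, since tokenizer changes between releases can shift word counts and therefore the length-based scores.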