Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Can Large Language Models Understand Real-World Complex Instructions?
Authors: Qianyu He, Jie Zeng, Wenhao Huang, Lina Chen, Jin Xiao, Qianxi He, Xunzhe Zhou, Jiaqing Liang, Yanghua Xiao
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We compare the performance of representative Chinese-oriented and English-oriented models in following complex instructions through extensive experiments. |
| Researcher Affiliation | Collaboration | ¹Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University; ²School of Data Science, Fudan University; ³Fudan-Aishu Cognitive Intelligence Joint Research Center, Shanghai, China |
| Pseudocode | No | The paper defines its scoring functions in mathematical notation (e.g., f_word-max, f_word-min, f_keywords, S_q) but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Resources of CELLO are publicly available at https://github.com/Abbey4799/CELLO. |
| Open Datasets | Yes | We construct a complex instruction dataset from real-world scenarios, containing 523 samples encompassing nine tasks, effectively covering our specified features. Resources of CELLO are publicly available at https://github.com/Abbey4799/CELLO. |
| Dataset Splits | No | The paper describes its 523-sample benchmark but does not specify training, validation, or test splits needed to reproduce its evaluation experiments. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory, or specific computing environments) used for running its experiments. |
| Software Dependencies | No | The paper mentions using 'jieba' for Chinese word counting and 'nltk' for English word counting, but does not specify version numbers for these or any other software dependencies needed for reproducibility. |
| Experiment Setup | No | The paper evaluates pre-existing large language models using its proposed benchmark and evaluation criteria, but it does not specify experimental setup details such as hyperparameter values or training configurations for the evaluated models or its own evaluation system. |
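The Pseudocode row above notes that the paper's scoring criteria (f_word-max, f_word-min, f_keywords, S_q) are given only in mathematical notation. A minimal sketch of what such criteria might look like is shown below; the function names, signatures, and the averaging aggregation are all assumptions for illustration and are not taken from the paper, whose actual definitions may differ.

```python
# Hypothetical sketch of word-count and keyword scoring criteria in the
# spirit of the f_word-max, f_word-min, f_keywords, S_q notation.
# All definitions here are illustrative assumptions, not the paper's.

def f_word_max(response: str, max_words: int) -> float:
    """Score 1.0 if the response stays within a word-count ceiling, else 0.0."""
    return 1.0 if len(response.split()) <= max_words else 0.0

def f_word_min(response: str, min_words: int) -> float:
    """Score 1.0 if the response meets a word-count floor, else 0.0."""
    return 1.0 if len(response.split()) >= min_words else 0.0

def f_keywords(response: str, keywords: list[str]) -> float:
    """Fraction of required keywords that appear in the response."""
    if not keywords:
        return 1.0
    hits = sum(1 for k in keywords if k.lower() in response.lower())
    return hits / len(keywords)

def s_q(response: str, max_words: int, min_words: int,
        keywords: list[str]) -> float:
    """Aggregate score: a simple average of the per-criterion scores
    (the paper's actual aggregation for S_q may be weighted differently)."""
    scores = [
        f_word_max(response, max_words),
        f_word_min(response, min_words),
        f_keywords(response, keywords),
    ]
    return sum(scores) / len(scores)
```

For example, a five-word response checked against a 10-word ceiling, a 3-word floor, and two required keywords of which one is present would score (1.0 + 1.0 + 0.5) / 3 under this sketch.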