Can Large Language Models Understand Real-World Complex Instructions?
Authors: Qianyu He, Jie Zeng, Wenhao Huang, Lina Chen, Jin Xiao, Qianxi He, Xunzhe Zhou, Jiaqing Liang, Yanghua Xiao
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We compare the performance of representative Chinese-oriented and English-oriented models in following complex instructions through extensive experiments. |
| Researcher Affiliation | Collaboration | (1) Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University; (2) School of Data Science, Fudan University; (3) Fudan-Aishu Cognitive Intelligence Joint Research Center, Shanghai, China |
| Pseudocode | No | The paper defines scoring functions in mathematical notation (e.g., f_word-max, f_word-min, f_keywords, S_q) but does not include any explicitly labeled pseudocode or algorithm blocks; a minimal illustrative sketch of such scoring functions appears after this table. |
| Open Source Code | Yes | Resources of CELLO are publicly available at https://github.com/Abbey4799/CELLO. |
| Open Datasets | Yes | We construct a complex instruction dataset from real-world scenarios, containing 523 samples encompassing nine tasks, effectively covering our specified features. Resources of CELLO are publicly available at https://github.com/Abbey4799/CELLO. |
| Dataset Splits | No | The paper describes its 523-sample benchmark but does not specify training, validation, or test splits needed to reproduce its evaluation experiments. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory, or specific computing environments) used for running its experiments. |
| Software Dependencies | No | The paper mentions using 'jieba' for Chinese word counting and 'nltk' for English word counting, but it does not specify their versions or list any other software dependencies with version numbers; the sketch after this table shows how these libraries are typically applied. |
| Experiment Setup | No | The paper evaluates pre-existing large language models against its proposed benchmark and evaluation criteria, but it does not report setup details such as hyperparameter values or training configurations, either for the evaluated models or for its own evaluation system. |
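The rows on pseudocode and software dependencies above reference CELLO's rule-based scoring functions and its use of jieba and nltk for word counting. The following Python sketch is a hypothetical reconstruction for illustration only: the function names mirror the notation quoted in the table, but the threshold logic, the `count_words` helper, and the `lang` parameter are assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of CELLO-style count-limit and keyword scoring.
# The helper names and threshold logic are assumptions inferred from the
# table above, not the authors' released code.
import jieba                             # Chinese word segmentation
from nltk.tokenize import word_tokenize  # English tokenization
# Note: nltk.download("punkt") must be run once before word_tokenize works.


def count_words(text: str, lang: str = "zh") -> int:
    """Count words with jieba (Chinese) or nltk (English), as the paper reports."""
    if lang == "zh":
        return len(list(jieba.cut(text)))
    return len(word_tokenize(text))


def f_word_max(text: str, limit: int, lang: str = "zh") -> float:
    """Score 1.0 if the response stays within a maximum word count, else 0.0."""
    return 1.0 if count_words(text, lang) <= limit else 0.0


def f_word_min(text: str, limit: int, lang: str = "zh") -> float:
    """Score 1.0 if the response meets a minimum word count, else 0.0."""
    return 1.0 if count_words(text, lang) >= limit else 0.0


def f_keywords(text: str, keywords: list[str]) -> float:
    """Fraction of required keywords that appear verbatim in the response."""
    if not keywords:
        return 1.0
    return sum(kw in text for kw in keywords) / len(keywords)
```

The exact segmentation settings, matching rules, and the aggregate score S_q would need to follow the released resources at https://github.com/Abbey4799/CELLO rather than this sketch.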