Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Can Large Language Models Understand Real-World Complex Instructions?
Authors: Qianyu He, Jie Zeng, Wenhao Huang, Lina Chen, Jin Xiao, Qianxi He, Xunzhe Zhou, Jiaqing Liang, Yanghua Xiao
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We compare the performance of representative Chinese-oriented and English-oriented models in following complex instructions through extensive experiments. |
| Researcher Affiliation | Collaboration | ¹Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University; ²School of Data Science, Fudan University; ³Fudan-Aishu Cognitive Intelligence Joint Research Center, Shanghai, China |
| Pseudocode | No | The paper defines its scoring functions in mathematical notation (e.g., f_word-max, f_word-min, f_keywords, S_q) but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Resources of CELLO are publicly available at https://github.com/Abbey4799/CELLO. |
| Open Datasets | Yes | We construct a complex instruction dataset from real-world scenarios, containing 523 samples encompassing nine tasks, effectively covering our specified features. Resources of CELLO are publicly available at https://github.com/Abbey4799/CELLO. |
| Dataset Splits | No | The paper describes its 523-sample benchmark but does not specify training, validation, or test splits needed to reproduce its evaluation experiments. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory, or specific computing environments) used for running its experiments. |
| Software Dependencies | No | The paper mentions using 'jieba' for Chinese word counting and 'nltk' for English word counting, but does not specify version numbers for these or any other software dependencies needed for reproducibility. |
| Experiment Setup | No | The paper evaluates pre-existing large language models using its proposed benchmark and evaluation criteria, but it does not specify experimental setup details such as hyperparameter values or training configurations for the evaluated models or its own evaluation system. |
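The Pseudocode row above notes that the paper's scoring criteria (f_word-max, f_word-min, f_keywords, S_q) are given only in mathematical notation. A minimal sketch of what such criteria might look like is shown below; the function names, signatures, and the averaging aggregation are all assumptions for illustration and are not taken from the paper, whose actual definitions may differ.

```python
# Hypothetical sketch of word-count and keyword scoring criteria in the
# spirit of the f_word-max, f_word-min, f_keywords, S_q notation.
# All definitions here are illustrative assumptions, not the paper's.

def f_word_max(response: str, max_words: int) -> float:
    """Score 1.0 if the response stays within a word-count ceiling, else 0.0."""
    return 1.0 if len(response.split()) <= max_words else 0.0

def f_word_min(response: str, min_words: int) -> float:
    """Score 1.0 if the response meets a word-count floor, else 0.0."""
    return 1.0 if len(response.split()) >= min_words else 0.0

def f_keywords(response: str, keywords: list[str]) -> float:
    """Fraction of required keywords that appear in the response."""
    if not keywords:
        return 1.0
    hits = sum(1 for k in keywords if k.lower() in response.lower())
    return hits / len(keywords)

def s_q(response: str, max_words: int, min_words: int,
        keywords: list[str]) -> float:
    """Aggregate score: a simple average of the per-criterion scores
    (the paper's actual aggregation for S_q may be weighted differently)."""
    scores = [
        f_word_max(response, max_words),
        f_word_min(response, min_words),
        f_keywords(response, keywords),
    ]
    return sum(scores) / len(scores)
```

For example, a five-word response checked against a 10-word ceiling, a 3-word floor, and two required keywords of which one is present would score (1.0 + 1.0 + 0.5) / 3 under this sketch.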