Benchmarking Large Language Models in Retrieval-Augmented Generation
Authors: Jiawei Chen, Hongyu Lin, Xianpei Han, Le Sun
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we systematically investigate the impact of Retrieval-Augmented Generation on large language models. We analyze the performance of different large language models in 4 fundamental abilities required for RAG, including noise robustness, negative rejection, information integration, and counterfactual robustness. To this end, we establish Retrieval-Augmented Generation Benchmark (RGB), a new corpus for RAG evaluation in both English and Chinese. Then we evaluate 6 representative LLMs on RGB to diagnose the challenges of current LLMs when applying RAG. |
| Researcher Affiliation | Academia | Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences, Beijing, China; State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China; {jiawei2020,hongyu,xianpei,sunle}@iscas.ac.cn |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code&data: https://github.com/chen700564/RGB. |
| Open Datasets | Yes | To this end, we establish Retrieval-Augmented Generation Benchmark (RGB), a new corpus for RAG evaluation in both English and Chinese. Our code&data: https://github.com/chen700564/RGB. |
| Dataset Splits | No | The paper describes how its benchmark (RGB) is divided into four 'testbeds', one for each evaluated ability, but it does not provide traditional train/validation/test splits: RGB is an evaluation-only corpus for pre-trained LLMs, so no data partitioning for training is involved. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, memory amounts) used for running its experiments. |
| Software Dependencies | No | The paper mentions software such as 'ChatGPT (gpt-3.5-turbo)', 'Google’s API', and 'an open-source dense retriever' but does not specify version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | Task formats. We provide 5 external documents for each question. In our experiments on noise robustness, we evaluate scenarios with noise ratios ranging from 0 to 0.8 (with 5 documents per question, that is 0 to 4 noisy documents). Minimal sketches of this setup appear below the table. |
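To make the noise-robustness setup concrete, the following Python sketch shows one way the 5 external documents could be mixed at a given noise ratio. Only the 5-documents-per-question format and the 0 to 0.8 noise ratios come from the paper; the helper name `build_context` and the sampling logic are illustrative assumptions, not the authors' implementation (the released code at https://github.com/chen700564/RGB is authoritative).

```python
import random

def build_context(positive_docs, negative_docs, noise_ratio, k=5):
    """Assemble the k external documents for one question.

    Hypothetical helper: with k = 5 documents per question (as in the
    paper), noise_ratio = 0.4 means 2 of the 5 documents are noise.
    """
    n_noise = round(noise_ratio * k)
    docs = random.sample(negative_docs, n_noise) + \
           random.sample(positive_docs, k - n_noise)
    random.shuffle(docs)  # no fixed position for the relevant documents
    return docs

# Noise ratios evaluated in the paper: 0 to 0.8 in steps of 0.2.
for r in (0.0, 0.2, 0.4, 0.6, 0.8):
    print(f"noise ratio {r}: {round(r * 5)} of 5 documents are noise")
```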
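Because RGB evaluates pre-trained LLMs rather than training them, a reproduction mainly needs a prompt template and an answer-matching check. The sketch below assumes a generic instruction wording and a case-insensitive substring match against gold answers; both are assumptions to verify against the released code, not the paper's exact protocol.

```python
# Hypothetical prompt template; the released repository defines the
# actual instruction wording used in the paper's experiments.
PROMPT = (
    "Answer the question based on the given documents.\n\n"
    "Documents:\n{documents}\n\nQuestion: {question}\nAnswer:"
)

def build_prompt(documents: list[str], question: str) -> str:
    return PROMPT.format(documents="\n".join(documents), question=question)

def is_correct(response: str, gold_answers: list[str]) -> bool:
    # Assumed metric: a response counts as correct if any gold answer
    # appears in it as a case-insensitive substring.
    return any(ans.lower() in response.lower() for ans in gold_answers)

def accuracy(responses: list[str], gold: list[list[str]]) -> float:
    hits = sum(is_correct(r, g) for r, g in zip(responses, gold))
    return hits / len(responses)

# Example: one question whose gold answer is "Paris".
print(accuracy(["The capital of France is Paris."], [["Paris"]]))  # -> 1.0
```

A rejection-rate check for the negative-rejection ability (counting responses that decline to answer when all provided documents are noise) can be built the same way by matching a refusal phrase instead of a gold answer.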