Automated Evaluation of Retrieval-Augmented Language Models with Task-Specific Exam Generation
Authors: Gauthier Guinet, Behrooz Omidvar-Tehrani, Anoop Deoras, Laurent Callot
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate our approach on four new open-ended Question-Answering tasks based on Arxiv abstracts, Stack Exchange questions, AWS Dev Ops troubleshooting guides, and SEC filings. In addition, our experiments reveal more general insights into factors impacting RAG performance like size, retrieval mechanism, prompting and fine-tuning. |
| Researcher Affiliation | Industry | 1AWS AI Labs. Correspondence to: Gauthier Guinet <guinetgg@amazon.com>. |
| Pseudocode | Yes | Algorithm 1 Iterative Exam Improvement with IRT Model |
| Open Source Code | Yes | The source code is available at https://github.com/amazon-science/auto-rag-eval. |
| Open Datasets | Yes | We illustrate and evaluate our approach on open-ended question-answering tasks using 4 different knowledge corpora: AWS Dev Ops troubleshooting guides, Arxiv abstracts, Stack Exchange questions, and SEC Filings. We provide benchmark datasets for RAG systems evaluation, by creating four new tasks based on public datasets from diverse domains. |
| Dataset Splits | No | The paper describes the datasets used but does not explicitly provide training, validation, or test splits with specific percentages or sample counts for these datasets. |
| Hardware Specification | No | The paper mentions using 'pre-trained LLMs' and 'Llama V2-70B' for question generation, but it does not specify the hardware (e.g., GPU models, CPU types, memory) used to run the experiments or train the models. |
| Software Dependencies | No | The paper mentions using the 'NLTK word tokenizer' and 'L-BFGS-B solver' but does not provide specific version numbers for these or other key software components used in their framework implementation. It mentions 'Llama V2-70B' as a model used, but not as a general software dependency with a version. |
| Experiment Setup | Yes | To minimize the negative log-likelihood, we leverage the L-BFGS-B solver. We initialize at 0 all the values of θ_m (either for the RAG model in the classical IRT or the latent variables for the hierarchical model), at 1 the discrimination (a_i)_{i∈Q}, at 0 the difficulty (b_i)_{i∈Q} and at 0.25 the guessing (c_i)_{i∈Q}. We enforce the following constraints: 0.1 ≤ a_i ≤ 1.5, 0.01 ≤ b_i ≤ 1, 0.2 ≤ c_i ≤ 0.4 and −3 ≤ θ_k ≤ 3. |
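
For concreteness, the snippet below is a minimal, hypothetical sketch of the IRT fitting step quoted in the Experiment Setup row: a three-parameter-logistic (3PL) model whose negative log-likelihood is minimized with an L-BFGS-B solver, using the initial values and box constraints listed above. The function name `fit_3pl_irt`, the response-matrix layout, and the use of `scipy.optimize.minimize` are assumptions made for illustration; the paper's released code (linked above) is the authoritative implementation.

```python
# Hypothetical sketch of fitting a 3PL IRT model with L-BFGS-B, following the
# initialization and constraints quoted in the Experiment Setup row. Names and
# data layout are assumptions, not the paper's actual implementation.
import numpy as np
from scipy.optimize import minimize


def fit_3pl_irt(responses: np.ndarray):
    """responses: binary matrix of shape (n_models, n_questions),
    entry (m, i) is 1 if RAG model m answered question i correctly."""
    n_models, n_questions = responses.shape

    def unpack(x):
        theta = x[:n_models]                                        # ability θ_m
        a = x[n_models:n_models + n_questions]                      # discrimination a_i
        b = x[n_models + n_questions:n_models + 2 * n_questions]    # difficulty b_i
        c = x[n_models + 2 * n_questions:]                          # guessing c_i
        return theta, a, b, c

    def neg_log_likelihood(x):
        theta, a, b, c = unpack(x)
        # Standard 3PL success probability: c_i + (1 - c_i) * sigmoid(a_i (θ_m - b_i))
        logits = a[None, :] * (theta[:, None] - b[None, :])
        p = c[None, :] + (1.0 - c[None, :]) / (1.0 + np.exp(-logits))
        p = np.clip(p, 1e-6, 1.0 - 1e-6)
        return -np.sum(responses * np.log(p) + (1 - responses) * np.log(1 - p))

    # Initialization from the quoted setup: θ = 0, a = 1, b = 0, c = 0.25.
    x0 = np.concatenate([
        np.zeros(n_models),
        np.ones(n_questions),
        np.zeros(n_questions),
        np.full(n_questions, 0.25),
    ])
    # Box constraints from the quoted setup.
    bounds = (
        [(-3.0, 3.0)] * n_models        # −3 ≤ θ ≤ 3
        + [(0.1, 1.5)] * n_questions    # 0.1 ≤ a ≤ 1.5
        + [(0.01, 1.0)] * n_questions   # 0.01 ≤ b ≤ 1
        + [(0.2, 0.4)] * n_questions    # 0.2 ≤ c ≤ 0.4
    )
    result = minimize(neg_log_likelihood, x0, method="L-BFGS-B", bounds=bounds)
    return unpack(result.x)


# Example usage with a toy 0/1 accuracy matrix (5 RAG variants, 40 exam questions):
# theta, a, b, c = fit_3pl_irt(np.random.binomial(1, 0.6, size=(5, 40)))
```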