Automated Evaluation of Retrieval-Augmented Language Models with Task-Specific Exam Generation

Authors: Gauthier Guinet, Behrooz Omidvar-Tehrani, Anoop Deoras, Laurent Callot

ICML 2024

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | We demonstrate our approach on four new open-ended question-answering tasks based on Arxiv abstracts, Stack Exchange questions, AWS Dev Ops troubleshooting guides, and SEC filings. In addition, our experiments reveal more general insights into factors impacting RAG performance, such as size, retrieval mechanism, prompting, and fine-tuning. |
| Researcher Affiliation | Industry | 1AWS AI Labs. Correspondence to: Gauthier Guinet <guinetgg@amazon.com>. |
| Pseudocode | Yes | Algorithm 1: Iterative Exam Improvement with IRT Model |
| Open Source Code | Yes | The source code is available at https://github.com/amazon-science/auto-rag-eval. |
| Open Datasets | Yes | We illustrate and evaluate our approach on open-ended question-answering tasks using 4 different knowledge corpora: AWS Dev Ops troubleshooting guides, Arxiv abstracts, Stack Exchange questions, and SEC Filings. We provide benchmark datasets for RAG systems evaluation, by creating four new tasks based on public datasets from diverse domains. |
| Dataset Splits | No | The paper describes the datasets used but does not explicitly provide training, validation, or test splits with specific percentages or sample counts. |
| Hardware Specification | No | The paper mentions using 'pre-trained LLMs' and 'LLama V2-70B' for question generation, but it does not specify the hardware (e.g., GPU models, CPU types, memory) used to run the experiments or train the models. |
| Software Dependencies | No | The paper mentions using the 'NLTK word tokenizer' and the 'L-BFGS-B solver' but does not provide specific version numbers for these or other key software components used in its framework implementation. It mentions 'LLama V2-70B' as a model used, but not as a general software dependency with a version. |
| Experiment Setup | Yes | To minimize the negative log-likelihood, we leverage the L-BFGS-B solver. We initialize at 0 all the values of θm (either for the RAG model in the classical IRT or the latent variables for the hierarchical model), at 1 the discrimination (ai)i∈Q, at 0 the difficulty (bi)i∈Q, and at 0.25 the guessing (ci)i∈Q. We enforce the following constraints: 0.1 ≤ ai ≤ 1.5, 0.01 ≤ bi ≤ 1, 0.2 ≤ ci ≤ 0.4, and −3 ≤ θk ≤ 3. |
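The experiment-setup row describes fitting a three-parameter-logistic (3PL) IRT model by minimizing the negative log-likelihood with L-BFGS-B under box constraints. A minimal sketch of that procedure with SciPy follows; the response matrix, the problem sizes, and the random data are illustrative assumptions, not the authors' data or code.

```python
# Sketch (not the authors' implementation): fit a 3PL IRT model with L-BFGS-B,
# using the initial values and box constraints quoted in the table above.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n_models, n_items = 4, 20  # illustrative sizes, not from the paper
# responses[m, i] = 1 if RAG model m answered exam item i correctly.
responses = rng.integers(0, 2, size=(n_models, n_items))

def unpack(x):
    theta = x[:n_models]                               # model abilities theta_k
    a = x[n_models:n_models + n_items]                 # item discrimination a_i
    b = x[n_models + n_items:n_models + 2 * n_items]   # item difficulty b_i
    c = x[n_models + 2 * n_items:]                     # item guessing c_i
    return theta, a, b, c

def neg_log_likelihood(x):
    theta, a, b, c = unpack(x)
    # 3PL success probability: p = c + (1 - c) * sigmoid(a * (theta - b))
    z = a[None, :] * (theta[:, None] - b[None, :])
    p = c[None, :] + (1.0 - c[None, :]) / (1.0 + np.exp(-z))
    p = np.clip(p, 1e-9, 1.0 - 1e-9)  # guard the logs
    return -np.sum(responses * np.log(p) + (1 - responses) * np.log(1.0 - p))

# Initializations as stated: theta = 0, a = 1, b = 0, c = 0.25.
# (b starts below its 0.01 lower bound; L-BFGS-B clips x0 into the bounds.)
x0 = np.concatenate([np.zeros(n_models),
                     np.ones(n_items),
                     np.zeros(n_items),
                     np.full(n_items, 0.25)])
# Box constraints as stated: 0.1 <= a <= 1.5, 0.01 <= b <= 1,
# 0.2 <= c <= 0.4, -3 <= theta <= 3.
bounds = ([(-3.0, 3.0)] * n_models
          + [(0.1, 1.5)] * n_items
          + [(0.01, 1.0)] * n_items
          + [(0.2, 0.4)] * n_items)

res = minimize(neg_log_likelihood, x0, method="L-BFGS-B", bounds=bounds)
theta_hat, a_hat, b_hat, c_hat = unpack(res.x)
```

Jointly estimating abilities and item parameters this way is a joint-maximum-likelihood simplification; the bounds keep the guessing and discrimination parameters in interpretable ranges during optimization.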