Automated Evaluation of Retrieval-Augmented Language Models with Task-Specific Exam Generation
Authors: Gauthier Guinet, Behrooz Omidvar-Tehrani, Anoop Deoras, Laurent Callot
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate our approach on four new open-ended Question-Answering tasks based on Arxiv abstracts, Stack Exchange questions, AWS Dev Ops troubleshooting guides, and SEC filings. In addition, our experiments reveal more general insights into factors impacting RAG performance like size, retrieval mechanism, prompting and fine-tuning. |
| Researcher Affiliation | Industry | 1AWS AI Labs. Correspondence to: Gauthier Guinet <guinetgg@amazon.com>. |
| Pseudocode | Yes | Algorithm 1 Iterative Exam Improvement with IRT Model |
| Open Source Code | Yes | The source code is available at https://github.com/amazon-science/auto-rag-eval. |
| Open Datasets | Yes | We illustrate and evaluate our approach on open-ended question-answering tasks using 4 different knowledge corpora: AWS Dev Ops troubleshooting guides, Arxiv abstracts, Stack Exchange questions, and SEC Filings. We provide benchmark datasets for RAG systems evaluation, by creating four new tasks based on public datasets from diverse domains. |
| Dataset Splits | No | The paper describes the datasets used but does not explicitly provide training, validation, or test splits with specific percentages or sample counts for these datasets. |
| Hardware Specification | No | The paper mentions using 'pre-trained LLMs' and 'Llama V2-70B' for question generation, but it does not specify the hardware (e.g., GPU models, CPU types, memory) used to run the experiments or train the models. |
| Software Dependencies | No | The paper mentions using the 'NLTK word tokenizer' and 'L-BFGS-B solver' but does not provide specific version numbers for these or other key software components used in their framework implementation. It mentions 'Llama V2-70B' as a model used, but not as a general software dependency with a version. |
| Experiment Setup | Yes | To minimize the negative log-likelihood, we leverage the L-BFGS-B solver. We initialize at 0 all the values of θ_m (either for the RAG model in the classical IRT or the latent variables for the hierarchical model), at 1 the discrimination (a_i)_{i∈Q}, at 0 the difficulty (b_i)_{i∈Q} and at 0.25 the guessing (c_i)_{i∈Q}. We enforce the following constraints: 0.1 ≤ a_i ≤ 1.5, 0.01 ≤ b_i ≤ 1, 0.2 ≤ c_i ≤ 0.4 and −3 ≤ θ_k ≤ 3. |
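
For concreteness, the snippet below is a minimal, hypothetical sketch of the IRT fitting step quoted in the Experiment Setup row: a three-parameter-logistic (3PL) model whose negative log-likelihood is minimized with an L-BFGS-B solver, using the initial values and box constraints listed above. The function name `fit_3pl_irt`, the response-matrix layout, and the use of `scipy.optimize.minimize` are assumptions made for illustration; the paper's released code (linked above) is the authoritative implementation.

```python
# Hypothetical sketch of fitting a 3PL IRT model with L-BFGS-B, following the
# initialization and constraints quoted in the Experiment Setup row. Names and
# data layout are assumptions, not the paper's actual implementation.
import numpy as np
from scipy.optimize import minimize


def fit_3pl_irt(responses: np.ndarray):
    """responses: binary matrix of shape (n_models, n_questions),
    entry (m, i) is 1 if RAG model m answered question i correctly."""
    n_models, n_questions = responses.shape

    def unpack(x):
        theta = x[:n_models]                                        # ability θ_m
        a = x[n_models:n_models + n_questions]                      # discrimination a_i
        b = x[n_models + n_questions:n_models + 2 * n_questions]    # difficulty b_i
        c = x[n_models + 2 * n_questions:]                          # guessing c_i
        return theta, a, b, c

    def neg_log_likelihood(x):
        theta, a, b, c = unpack(x)
        # Standard 3PL success probability: c_i + (1 - c_i) * sigmoid(a_i (θ_m - b_i))
        logits = a[None, :] * (theta[:, None] - b[None, :])
        p = c[None, :] + (1.0 - c[None, :]) / (1.0 + np.exp(-logits))
        p = np.clip(p, 1e-6, 1.0 - 1e-6)
        return -np.sum(responses * np.log(p) + (1 - responses) * np.log(1 - p))

    # Initialization from the quoted setup: θ = 0, a = 1, b = 0, c = 0.25.
    x0 = np.concatenate([
        np.zeros(n_models),
        np.ones(n_questions),
        np.zeros(n_questions),
        np.full(n_questions, 0.25),
    ])
    # Box constraints from the quoted setup.
    bounds = (
        [(-3.0, 3.0)] * n_models        # −3 ≤ θ ≤ 3
        + [(0.1, 1.5)] * n_questions    # 0.1 ≤ a ≤ 1.5
        + [(0.01, 1.0)] * n_questions   # 0.01 ≤ b ≤ 1
        + [(0.2, 0.4)] * n_questions    # 0.2 ≤ c ≤ 0.4
    )
    result = minimize(neg_log_likelihood, x0, method="L-BFGS-B", bounds=bounds)
    return unpack(result.x)


# Example usage with a toy 0/1 accuracy matrix (5 RAG variants, 40 exam questions):
# theta, a, b, c = fit_3pl_irt(np.random.binomial(1, 0.6, size=(5, 40)))
```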