Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection

Authors: Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, Hannaneh Hajishirzi

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments show that SELF-RAG (7B and 13B parameters) significantly outperforms state-of-the-art LLMs and retrieval-augmented models on a diverse set of tasks. Specifically, SELF-RAG outperforms ChatGPT and retrieval-augmented Llama2-chat on open-domain QA, reasoning, and fact-verification tasks, and it shows significant gains in factuality and citation accuracy for long-form generations relative to these models.
Researcher Affiliation | Collaboration | University of Washington; Allen Institute for AI; IBM Research AI. {akari,zeqiuwu1,yizhongw,hannaneh}@cs.washington.edu, avi@us.ibm.com
Pseudocode | Yes | Algorithm 1 SELF-RAG Inference (sketched in Python after this table)
Open Source Code | Yes | Our code and trained models are available at https://selfrag.github.io/. (See the checkpoint-loading sketch after this table.)
Open Datasets | Yes | In particular, we sample instances from Open-Instruct processed data (Wang et al., 2023) and knowledge-intensive datasets (Petroni et al., 2021; Stelmakh et al., 2022; Mihaylov et al., 2018).
Dataset Splits | Yes | We conduct evaluations of our SELF-RAG and diverse baselines on a range of downstream tasks, holistically evaluating outputs with metrics designed to assess overall correctness, factuality, and fluency. Throughout these experiments, we conduct zero-shot evaluations, where we provide instructions describing tasks without few-shot demonstrations (Wei et al., 2022; Sanh et al., 2022).
Hardware Specification | Yes | We use 4 Nvidia A100 with 80GB memory to train our models. ... We run inference of our trained models using 1-2 Quadro RTX 6000 GPUs with 24GB memory.
Software Dependencies | No | The paper mentions 'DeepSpeed stage 3', 'FlashAttention (Dao et al., 2022)', and 'vllm (Kwon et al., 2023)' but does not provide specific version numbers for these software components.
Experiment Setup | Yes | All models are trained for 3 epochs with a batch size of 128, a peak learning rate of 2e-5 with 3% warmup steps, and linear decay afterward. We set the maximum token length to be 2,048 for the 7B model, and 1,524 for the 13B model due to the memory constraint. (See the training-configuration sketch after this table.)
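
The Pseudocode row refers to Algorithm 1, an adaptive retrieve-generate-critique loop driven by reflection tokens. The Python sketch below illustrates that control flow only; `lm` and `retriever` are hypothetical interfaces standing in for the trained generator (which emits the [Retrieve], [IsRel], [IsSup], and [IsUse] tokens) and the passage retriever, and the critique weights shown are illustrative, not values from the paper.

```python
# Minimal sketch of the SELF-RAG inference loop (Algorithm 1 in the paper).
# `lm` and `retriever` are hypothetical interfaces: the real system uses the
# trained generator LM, which emits reflection tokens, and an off-the-shelf
# retriever. The critique weights are illustrative, not values from the paper.

def self_rag_generate(prompt, lm, retriever, max_segments=10, k=5,
                      w_rel=1.0, w_sup=1.0, w_use=0.5):
    """Generate a response segment by segment, retrieving on demand."""
    segments = []
    for _ in range(max_segments):
        context = prompt + "".join(segments)

        # 1) The generator decides whether retrieval is needed ([Retrieve] token).
        if lm.predict_retrieve(context):
            # 2) Retrieve k passages and generate one candidate segment per passage.
            candidates = []
            for passage in retriever.search(context, k=k):
                segment = lm.generate_segment(context, passage)
                # 3) Critique the candidate: passage relevance ([IsRel]),
                #    support of the segment by the passage ([IsSup]),
                #    and overall usefulness ([IsUse]).
                s = lm.critique(context, passage, segment)
                score = w_rel * s["is_rel"] + w_sup * s["is_sup"] + w_use * s["is_use"]
                candidates.append((score, segment))
            # 4) Keep the best-scoring candidate segment.
            _, best = max(candidates, key=lambda c: c[0])
        else:
            # No retrieval needed: generate the next segment directly.
            best = lm.generate_segment(context, passage=None)

        segments.append(best)
        if lm.is_end_of_response(best):  # hypothetical stopping check
            break
    return "".join(segments)
```

The paper describes the critique-token weighting as adjustable at decoding time, so the same trained model can be tuned toward factual precision or fluency without retraining.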
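The Open Source Code and Software Dependencies rows note that inference uses vllm but give no versions. Below is a minimal checkpoint-loading sketch; the Hugging Face model ID and the instruction/response prompt template are illustrative assumptions rather than details stated in the paper, so the released artifacts at https://selfrag.github.io/ are the authoritative reference.

```python
# Minimal loading sketch for a released SELF-RAG checkpoint with vllm.
# The model ID and prompt template are assumptions for illustration;
# see https://selfrag.github.io/ for the released artifacts.
from vllm import LLM, SamplingParams

model = LLM(model="selfrag/selfrag_llama2_7b")  # assumed Hugging Face model ID
params = SamplingParams(
    temperature=0.0,            # greedy decoding
    max_tokens=256,
    skip_special_tokens=False,  # keep reflection tokens such as [Retrieve] visible
)

prompt = "### Instruction:\nWhat is the capital of France?\n\n### Response:\n"
outputs = model.generate([prompt], params)
print(outputs[0].outputs[0].text)
```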
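The Experiment Setup row lists concrete training hyperparameters. One way to express them is with Hugging Face TrainingArguments plus a DeepSpeed stage-3 config, as sketched below; the per-device/accumulation split of the 128 effective batch size, the bf16 flag, and the config-file path are assumptions, since the paper reports only the totals and the 4x A100 80GB training setup.

```python
# Sketch of the reported training hyperparameters expressed as Hugging Face
# TrainingArguments. The effective batch size of 128 is split here as
# 4 GPUs x 4 per-device x 8 accumulation steps (an assumption); the paper
# states only the totals.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="selfrag_7b",            # hypothetical output path
    num_train_epochs=3,                 # "trained for 3 epochs"
    per_device_train_batch_size=4,      # assumption: 4 GPUs x 4 x 8 = 128 effective
    gradient_accumulation_steps=8,
    learning_rate=2e-5,                 # peak learning rate
    warmup_ratio=0.03,                  # 3% warmup steps
    lr_scheduler_type="linear",         # linear decay after warmup
    bf16=True,                          # assumption: mixed precision on A100
    deepspeed="ds_stage3.json",         # hypothetical DeepSpeed stage-3 config file
)
# The maximum sequence length (2,048 for the 7B model, 1,524 for the 13B model)
# is applied in the tokenizer / data collator rather than in TrainingArguments.
```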